Mining for Meaning

In David Lodge’s 1984 novel, Small World, a character remarks that literary analysis of Shakespeare and T.S. Eliot “would just lend itself nicely to computerization….All you’d have to do would be to put the texts on to tape and you could get the computer to list every word, phrase and syntactical construction that the two writers had in common.”

This brave new world is upon us, but the larger question for Google and OCLC, among other purveyors of warehoused metadata and petabytes of information, is how to achieve meaning. One of the brilliant insights derived from Terry Winograd’s research and mentoring is that popularity in the form of inbound links matters, at least for web pages. In the case of all the world’s books turned into digitized texts, assigning meaning is a harder question without popularity, a canon, or search queries as a guide.

Until recently, text mining wasn’t possible at great scale. And as the great scanning projects continue on their bumpy road, what will ultimately come out of them, and what it will mean for users, remains a mystery.

Nascent standards

Bill Kasdorf pointed out several XML models for books in his May NISO presentation, including NISO/ISO 12083, TEI, DocBook, NLM Book DTD, and DTBook. These existing models have served publishers well, though they have been employed for particular uses and have not yet found common ground across the breadth of book types. The need for a standard has never been clearer, but it will require vision and a clear understanding of solved problems to push forward.

After the professor in Small World gains access to a server, he grows giddy with the possibilities of finding “your own special, distinctive, unique way of using the English language….the words that carry a distinctive semantic content.” While we may be delighted by the possibilities that searching books affords, there is the distinct possibility that the world of the text could be changed completely.

Another mechanism for assigning meaning to full text has been opened up by web technology and science. The Open Text Mining Interface (OTMI) is a method championed by Nature Publishing Group as a way to share the contents of its archives in XML for the express purpose of text mining while addressing intellectual property concerns. Now in its second revision, the OTMI is an elegant method of enabling sharing, though it remains to be seen whether the initiative will spread to a larger audience.
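To make the idea concrete, here is a minimal sketch of the kind of non-readable, term-frequency summary such an interface can expose in place of the prose itself. The element names and tokenizer are illustrative assumptions, not the published OTMI schema.

```python
# Minimal sketch: reduce an article to a term-frequency summary, the kind of
# non-readable representation a text-mining interface can share while keeping
# the readable prose out of circulation. Element names are illustrative, not
# the published OTMI schema.
import re
from collections import Counter
from xml.etree.ElementTree import Element, SubElement, tostring

def term_frequencies(text, min_length=3):
    """Lowercase the text, keep alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if len(token) >= min_length)

def to_xml(frequencies, top_n=25):
    """Serialize the most frequent terms as a simple XML fragment."""
    root = Element("terms")
    for term, count in frequencies.most_common(top_n):
        SubElement(root, "term", name=term, count=str(count))
    return tostring(root, encoding="unicode")

if __name__ == "__main__":
    sample = ("Text mining at scale asks what meaning can emerge from "
              "digitized texts once the full text is reduced to data.")
    print(to_xml(term_frequencies(sample)))
```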

Sense making

As the corpus lurches towards the cloud, one interesting example of semantic meaning comes from the Open Calais project, an open platform from the newly combined Thomson Reuters. When raw text is fed into the Calais web service, terms are extracted and fed into existing taxonomies: persons, countries, and categories are identified and then made available for verification.

This experimental service has proved its value for unstructured text, working on everything from the most recent weblog post to historic newspapers newly converted to text via Optical Character Recognition (OCR). Since human-created metadata and indexing services are among the most expensive things libraries and publishers create, any mechanism that extends human intelligence by using machines to create meaning is a useful way forward.

Calais shows promise for metadata enhancement, since full text can be mined for its word properties and fed into taxonomic structures. This could be the basis for search engines that understand natural language queries in the future, but could also be a mechanism for accurate and precise concept browsing.
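As a rough illustration of that round trip, the sketch below posts raw text to an entity-extraction web service and groups the typed entities it returns. The endpoint URL, header names, and response shape are assumptions for illustration, not the documented Calais API.

```python
# Sketch of a Calais-style round trip: POST raw text, receive typed entities,
# and group them for human verification. The endpoint, headers, and response
# shape are assumptions, not the documented OpenCalais API.
from collections import defaultdict
import requests  # third-party: pip install requests

ENDPOINT = "https://api.example.com/extract"  # hypothetical service URL
API_KEY = "YOUR-KEY"                          # hypothetical credential

def extract_entities(text):
    response = requests.post(
        ENDPOINT,
        headers={"x-api-key": API_KEY, "Content-Type": "text/plain"},
        data=text.encode("utf-8"),
        timeout=10,
    )
    response.raise_for_status()
    # Assume the service returns a JSON list of {"type": ..., "name": ...}.
    grouped = defaultdict(set)
    for entity in response.json():
        grouped[entity["type"]].add(entity["name"])
    return grouped

if __name__ == "__main__":
    ocr_text = "Melvil Dewey devised a classification system used across the United States."
    for entity_type, names in extract_entities(ocr_text).items():
        print(entity_type, sorted(names))  # e.g. Person, Country, Category
```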

Glimmers of understanding

One method of gaining new understanding is to examine solved problems. Melvil Dewey understood vertical integration: he helped drive innovations around 3×5 index cards and cabinets, as well as the classification system that bears his name. Some even call him the first standards bearer for libraries, though anyone familiar with how standards are made will doubt that one person could have been entirely responsible.

Another solved problem is how to make information about books and journals widely available. This has been done twice in the past century: first with the printed catalog card, distributed by the Library of Congress for the greater good, and then with the distributed catalog record, provided at great utility (and cost) by the Online Computer Library Center.

Pointers are no longer entirely sufficient, since the problem is not only how to find information but how to make sense of it once it has been found. Linking from catalog records has been a partial solution, but the era of complete books online is now entering its second decade. The third stage is upon us.

Small world: an academic romance

David Lodge; Penguin 1985


Is there a bibliographic emergency?

The Bibliographic Control Working Group held its third and final public meeting on the future of bibliographic control July 9 at the Library of Congress, focusing on “The Economics and Organization of Bibliographic Data.” The conclusions of the meetings will come in a report to be issued in November. No dramatic changes were announced at this meeting, and public comment is invited until the end of July.

The meeting included several panels, invited speakers, and an open forum (with a public webcast). Deanna Marcum, Library of Congress associate librarian for library services, framed the discussion by saying, “Worries about MARC as an international standard make it seem like we found it on a tablet.” She went on to say, “Many catalogers believe catalogs…should be a public good, but in this world, it’s not possible to ignore economic considerations.” Marcum said there is no LC budget line that provides cataloging records for other libraries, though the CIP program has been hugely successful.

Value for money
Jose-Marie Griffiths, dean of the library school at the University of North Carolina, Chapel Hill, said there are three broad areas of concern: users and uses of bibliographic data, different needs for the data, and the economics and organization of the data. “What does free cost?” she asked. “Who are the stakeholders, and how are they organizationally aligned?”

Judith Nadler, library director, University of Chicago, moderated the discussion and said the format of the meetings was based on the oral and written testimony that was used to create Section 108 of the Copyright Law. Nadler joked that “We will have authority control–if we can afford it.”

Atoms vs bits
Rick Lugg, partner, R2 Consulting, has often spoken of the need for libraries to say no before saying yes to new things. His PowerPoint-free presentation (at Marcum’s request, no speakers used it) focused on how backlogs are invisible in the digital world. “People have difficulty managing what they have,” Lugg said. “There is a sense of a long emergency, and libraries cannot afford to support what they are doing.”

Using business language, since R2 often consults for academic libraries on streamlining processes, Lugg said libraries are not taking advantage of the value chain. Competitors are now challenging libraries in the area of search, even as technical services budgets are being challenged.

In part, Lugg attributed this pressure to the basic MARC record becoming a commodity, and he estimated the cost of an original cataloged record at $150 to $200. He challenged libraries to abandon the “cult of perfection,” since “the reader isn’t going to read the wrong book.”

Another area of concern is the difficulty of maintaining three stove-piped bibliographic areas, from MARC records for books, to serials holdings for link resolvers, to an A-Z list of journals. With separate print and electronic records, the total cost of bibliographic control is unknown, particularly with a lifecycle that includes selection, access, digitization, and storage or deaccession.

There is a real question about inventory control vs. bibliographic control, Lugg said. The opportunity cost of current processes leads to questions about whether libraries are putting their effort where it yields the most benefit. With many new responsibilities coming down the pike for technical services, including special collections, rare books, finding aids, and institutional repositories, libraries are challenged to retrain catalogers to expand their roles beyond MARC into new formats like MODS, METS, and Dublin Core.

Lugg said R2 found that non-MLS catalogers were often more rule-bound than professional staff, which brings about training questions. He summarized his presentation by asking:

  1. How do we reduce our efforts and redirect our focus?
  2. How can we redirect our expertise to new metadata schemes?
  3. How can we open our systems and cultures to external support from authors, publishers, abstract and indexing (A&I) services, etc.?

The role of the consortium
Lizanne Payne, director of the WRLC, a library consortium serving DC-area universities, said that with 200 international library consortia dedicated to containing the cost of content, the economics of bibliographic data is paramount. Shared catalogs and systems date from a time when hardware and software were expensive, she said; “IT staff is the most expensive line item now.”

Payne said storage facilities require explicit placement for quick retrieval, not a relative measure like call numbers. She called for deduplication algorithms, beyond FRBR, that can identify unique and overlapping copies using more than OCLC or LCCN numbers.
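A minimal sketch of what such a match key might look like, assuming normalized title, author, and date fields rather than any particular FRBR implementation:

```python
# Sketch of a dedupe match key that goes beyond OCLC or LCCN number matching:
# normalize title, primary author, and date into one comparable key.
# Field choices and normalization rules are illustrative assumptions.
import re
import unicodedata

def normalize(value):
    """Fold case, strip accents and punctuation, collapse whitespace."""
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    value = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(value.split())

def match_key(record):
    """Build a key from normalized title, author surname, and year."""
    title = normalize(record.get("title", ""))
    author = normalize(record.get("author", ""))
    surname = author.split(" ")[0] if author else ""
    year = re.sub(r"\D", "", record.get("date", ""))[:4]
    return f"{title}|{surname}|{year}"

records = [
    {"title": "Small World: An Academic Romance", "author": "Lodge, David", "date": "1985"},
    {"title": "Small world : an academic romance.", "author": "Lodge, David.", "date": "c1985"},
]
# Matching keys flag the two records as candidate duplicates for review.
print(match_key(records[0]) == match_key(records[1]))  # True
```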

Public libraries are special
Mary Catherine Little, director of technical services, Queens Library (NY), gave a fascinating overview of her library system. With 2.2 million items circulated in 2006 in 33 languages, 45,000 visitors per day, and 75,000 titles cataloged last year, Queens is the busiest library in the United States and has 66 branches within “one mile of every resident.”

Little said the library’s ILS plans are evolving: “Heard about Sirsi/Dynix?” Little detailed the process for its multilingual and growing collection. First, staff ask whether they are the first library to touch the record. Then they investigate whether the ILS can function with the record “today, then tomorrow,” and ask whether the record can be found from an outside source. The library prefers to get records from online vendors or directly from publishers, and has 90 percent of English-language records in the catalog prior to publication.

Queens Public Library has devised a model for international providers that revolves around standing orders and monthly lists of high-demand titles, especially from high-demand Chinese publishers. Feeling the push from the library, many vendors then enter into partnerships with OCLC.

“Uncataloged collections are better than backlogs,” Little said, and many patrons discover high-demand titles by walking around, especially audio and video. “We’ve accepted the tradeoffs,” she said.

Little made a call for community tagging, word clouds, and open source and extensible catalogs, and said capturing non-Roman data formats remains a continuing challenge.

“Global underpinnings are the key to the future, and Unicode must be present,” Little said, “The Library of Congress has been behind, and the future is open source software and interoperability through cooperation.”

Special libraries harmonize
Susan Fifer Canby, National Geographic Society vice president of library and information services, said her library contains proprietary data and works to harmonize taxonomies across various content management systems (CMS), enhancing content with useful metadata to give her users a Google-like search.

Canby said this work has led to a relative consistency and accuracy, which helps users bridge print and electronic sources. Though some special libraries are still managing print collections, most are devoting serious amounts of time to digital finding aids, competitive information gathering, and future analysis for their companies to help connect the dots. The library is working to associate latitude and longitude information with content to aid with mashups.

The National Geographic library uses OCLC records for books and serials, simple MARC records for maps, and more complex records for ephemera, “though [there’s] no staff to catalog everything.” The big challenge, however, is cataloging photographs, since the ratio used to be 100 shots for every published photo, and now it’s 1,000 to 1. “Photographers have been incentivized to provide keywords and metadata,” Canby said. With the rise of embedded IPTC data, photographers are adding terms from drop-down menus, free-text fields, and conceptual terms.

The library is buying digital content, but not yet HD content, which remains too expensive because of its large file size. Selling large versions of its photos through ecommerce has brought in additional funds that help the special librarians do better, Canby said.

Special libraries face challenges in getting their organizations to implement digital document solutions, since most people use email as a filing strategy instead of metadata-based systems. Another large challenge is that most companies view special libraries as a cost center, and just sustaining services is difficult. Since the special library’s primary role isn’t cataloging, which is outsourced and often assigned to interns, the bottom line is to develop a flexible metadata strategy that includes collaborating with the Library of Congress and users to make it happen.

Vendors and records
Bob Nardini, Coutts Information Services, said book vendors are a major provider of MARC records and may employ as many catalogers as the Library of Congress does. Coutts relies on LC CIP records, and Nardini said both publishers and LC are under pressure to do more with less. He advocated doing more in the early stages of a book’s life, and gave an interesting statistic about the commodity status of a MARC record from the Library of Congress: with an annual subscription to the LC records, the effective cost is $0.06 per record.

PCC
Mechael Charbonneau, director of technical services at Indiana University Libraries, gave some history about how cataloging came under threat in 1996 because of budget crunches. In part, the Program for Cooperative Cataloging (PCC) came about to extend collaboration and to find cost savings. Charbonneau said that PCC records are considered equivalent to LC records, “trustworthy and authoritative.” With four main areas, including BIBCO for bibliographic records, NACO for name authority, SACO for subject authority, and CONSER for serial records, international participants have effectively supplemented the Library of Congress records.

PCC’s strategic goals include looking at new models for non-MARC metadata, being proactive rather than reactive, responding with flexibility, achieving close working relationships with publishers, and internationalizing authority files, an effort that has begun with LC, OCLC, and the Deutsche Bibliothek.

Charbonneau said that in her role as an academic librarian, she sees the need to optimize the allocation of staff in large research libraries and to free up catalogers to do new things, starting with user needs.

Abstract and indexing
Linda Beebe, senior director of PsycINFO, said the American Psychological Association (APA) has similar goals for its database, including the creation of unique metadata and controlled vocabularies. Beebe sees linking tools as a way to give users access to content. Though Google gives users breadth, not precision, partnerships to link to content using CrossRef’s DOI service have started to solve the appropriate copy problem. Though “some access is better than none,” she cautioned that in STM, a little knowledge is a dangerous thing.

Beebe said there is a continuing need for standards, but “how many, and can they be simplified and integrated?” With a dual audience of librarians and end-users, A&I providers feel the need to make the search learning curve gentle while preserving the need for advanced features that may require instruction.

A robust discussion ensued about the need for authority control for authors in A&I services. Emerging NISO standards and the Scopus author profile were discussed as possible solutions, and the NISO/ISO standard is being eagerly adopted by publishers as a way to pay out royalties.

Microsoft of the library world?
Karen Calhoun, OCLC VP for WorldCat and Metadata Services, listed seven economic challenges for the working group: productivity, redundancy, value, scale, budgets, demography, and collaboration. Pointing to Fred Kilgour, OCLC founder, as having led libraries into an age of “enhanced productivity in cataloging,” Calhoun said new models of acquisition are the next frontier.

With various definitions of quality from libraries and end users, libraries must broaden the scale of bibliographic control to cover more materials. Calhoun argued that “narrowing our scope is premature.” With intense budget pressure “not being surprising,” new challenges include a wave of retirements building to full strength starting in 2010.

Since libraries cannot work alone, and cost reductions are not ends in themselves, OCLC can create new opportunities for libraries. Calhoun compared the OCLC suite of services to the electric grid, and said remixable and reusable metadata is the way of the future, coming from publishers, vendors, authors, reviewers, readers, and selectors.

“WorldCat is an unexploited resource, and OCLC can help libraries by moving selected technical services to the network,” Calhoun said. Advocating adding library services to the OCLC bill “like PayPal,” Calhoun said libraries could reduce their staffing costs.

Teri Frick, technical services librarian at the Orange County Public Library (VA), questioned Calhoun, saying her library can’t afford OCLC. Calhoun admitted, “OCLC is struggling with that,” and “I don’t think we have the answers.”

Frick pointed out that her small public library has the same needs as the largest library, and said any change to LC cataloging policy would have a large effect on her operations in southwestern Virginia. “When LC cut–and I understand why–it really hurt.”

Library of Congress reorganizes
Beacher Wiggins, Library of Congress director for acquisitions and bibliographic control, read a paper that gave the LC perspective. Wiggins cited Marcum’s 2005 paper that disclosed the costs of cataloging at $44 million per year. LC has 400 cataloging staff (down from 750 in 1991), who cataloged 350,000 volumes last year.

The library reorganized acquisitions and cataloging into one administrative unit in 2004, but cataloger workflows will be merged in 2008, with retraining to take place over the next 12 to 36 months. New job descriptions will be created, and new partners for international records (excluding authority records) are being selected. After an imbroglio over redistribution of records from the Italian book dealer Casalini, Wiggins said, “For this and any future agreements, we will not agree to restrict redistribution of records we receive.”

In further questioning, Karen Coyle, library consultant, pointed out that the education and retraining effort would be large. Wiggins said LC is not giving up on pre-coordination, which has been questioned by LC union member Thomas Mann and others, but is looking at streamlining how it is done.

Judith Cannon, Library of Congress instruction specialist, said “We don’t use the products we create, and I think there’s a disconnect there. These are all interrelated subjects.”

NLM questions business model
Dianne McCutcheon, chief of technical services at the National Library of Medicine, agreed that cataloging is a public good and that managers need to come up with an efficient cost/benefit ratio. However, McCutcheon said, “No additional benefit accrues to libraries for contributing unique records–OCLC should pay libraries for each use of a unique record.”

McCutcheon spoke in favor of incorporating publisher ONIX metadata in place of or as a supplement to MARC, and “to develop the appropriate crosswalks.” With publishers working in electronic environments, libraries should use the available metadata to enhance records and build in further automation. Since medical publishers are submitting citation records directly to NLM for inclusion in Medline, the library is seeing a significant cost savings, from $10 down to $1 a record. The NLM’s Medical Text Indexer (MTI) is another useful tool, which assists catalogers in assigning subject headings, with 60 percent agreement.

NAL urges more collaboration
Christopher Cole, associate director of technical services at the National Agricultural Library (NAL), said that like the NLM, the NAL is both a library and an A&I provider. By using publisher-supplied metadata as a starting point, adding additional access points, and doing quality control, “quality has not suffered one bit.” Cole said the NAL thesaurus was recreated six or seven years ago after previously relying on FAO and CAB information, and he advocated a similar reinvention. Cole said, “Use ONIX. The publishers supply it.”

Tagging and privacy
Dan Chudnov, of the Library of Congress’s Office of Strategic Initiatives, made two points, first saying that social tagging is hard and that its value is an emergent phenomenon with no obvious rhyme or reason. Chudnov said it happens in context, and referenced Tim Spalding’s talk given at LC. “The user becomes an access point, and this is incompatible with the ALA Bill of Rights on privacy that we hold dear,” he said.

Finally, Chudnov advocated bringing in computer scientists from the wider community, perhaps in a joint meeting that also includes vendors.

Summing up
Robert Wolven, Columbia University director of library systems and bibliographic control and a working group member, summarized the meeting by saying that the purpose was to find the “cost sinks” and the “collective efficiencies,” since metadata has a long life cycle. Cautioning that there are “no free rides,” he said libraries must find ways to recoup their costs.

Marcum cited LC’s mission, which is “to make the world’s creativity and the world’s knowledge accessible to Congress and the American people,” and said the LC’s leadership role can’t be diminished. With 100 million hidden items (including photos, videos, etc), curators are called upon in 21 reading rooms to direct users to hidden treasures. But “in the era of the Web, user expectations are expanding but funding is not. Thus, things need to be done differently, and we will be measuring success as never before,” Marcum said.

ALA 2007: Top Tech Trends

At the ALA Top Tech Trends Panel, panelists including Marshall Breeding, Roy Tennant, Karen Coombs, and John Blyberg discussed RFID, open source adoption in libraries, and the importance of privacy.

Marshall Breeding, director for innovative technologies and research at Vanderbilt University Libraries (TN), started the Top Tech Trends panel by referencing his LJ Automation Marketplace article, “An Industry Redefined,” which predicted unprecedented disruption in the ILS market. Breeding said 60 percent of the libraries in one state are facing a migration because the Sirsi/Dynix product roadmap changed, but he said not all ILS companies are the same.

Breeding said open source is new to the ILS world as a product, even though it has been used as infrastructure in libraries for many years. Interest has now expanded to decision makers. The Evergreen PINES project in Georgia, with 55 of 58 counties participating, was mostly successful. With the recent decision to adopt Evergreen in British Columbia, there is movement toward open source solutions, though Breeding cautioned that adoption is still minuscule compared to the market as a whole.

Questioning whether the switch amounts to an avalanche, Breeding noted that several commercial support companies have sprung up to serve the open source ILS market, including LibLime, Equinox, and CARe Affiliates. Breeding predicted an era of new decoupled interfaces.

John Blyberg, head of technology and digital initiatives at Darien Public Library (CT), said the back end [of the ILS] needs to be shored up because it has a ripple effect on other services. Blyberg said RFID is coming, and it makes sense for use in sorting and book storage, echoing Lori Ayres’s point that libraries need to support the distribution demands of the Long Tail. He considers privacy concerns a non-starter, because RFID is essentially a barcode; the RFID information is stored in a database, which should be the focus of security concerns.

Finally, Blyberg said that vendor interoperability and a democratic approach to development are needed in the age of Innovative’s Encore and Ex Libris’ Primo, both of which can be used with different ILS systems and can decouple the public catalog from the ILS. With the eXtensible Catalog (XC) and Evergreen coming along, Blyberg said there is a need for funding and partners to further their development.

Walt Crawford of OCLC/RLG said the problem with RFID is the potential of having patron barcodes chipped, which could lead to the erosion of patron privacy. Intruders could datamine who’s reading what, which Crawford said is a serious issue.

Joan Frye Williams countered that both Blyberg and Crawford were insisting on using logic for what is essentially a political problem. Breeding agreed, saying that airport security could scan chips, and that his concern is that third-generation RFID chips may not be readable in 30 years, much less the hundreds of years we expect barcodes to be around.

Karen Coombs, head of web services at the University of Houston (TX), listed three trends:
1. The end user as content contributor, which she cautioned was an issue. What happens if YouTube goes under and people lose their memories? Coombs pointed to the National Library of Australia and its partnership with Flickr as a positive development.
2. Digital as the format of choice for users, pointing to iTunes for music and Joost for video. Coombs said there is currently no way for libraries to provide this to users, especially in public libraries. Though companies like OverDrive and Recorded Books exist to serve this need, perhaps her point was that consumer adoption has outpaced what libraries currently offer.
3. A blurred line between desktop and web applications, which Coombs demonstrated with YouTube remixer and Google Gears, which lets you read your feeds when you’re offline.

John Blyberg responded to these trends, saying that he sees academic libraries pursuing semantic web technologies, including developing ontologies. Coombs disagreed with this assessment, saying that libraries have lots of badly tagged HTML pages. Tennant agreed, saying, “If the semantic web arrives, buy yourself some ice skates, because hell will have frozen over.”

Breeding said that he longs for SOA [service-oriented architecture] “but I’m not holding my breath.” Walt Crawford said, “Roy is right: most content providers don’t provide enough detail, and they make easy things complicated and don’t tackle the hard things.” Coombs pointed out, “People are too concerned with what things look like,” but Crawford interjected, “not too concerned.”

Roy Tennant, OCLC senior program manager, listed his trends:
1. Demise of the catalog, which should push the OPAC into the back room where it belongs and elevate discovery tools like Primo and Encore, as well as OCLC WorldCat Local.
2. Software as a Service (SaaS), formerly known as ASP and hosted services, which means librarians don’t have to babysit machines, and is a great thing for lots of librarians.
3. Intense marketplace uncertainty due to the private equity buyouts of Ex Libris and SirsiDynix and the rise of Evergreen and Koha as looming options. Tennant also said he sees WorldCat Local as a disruptive influence. Aside from the ILS, the abstract and indexing (A&I) services are being disintermediated as Google and OCLC go directly to publishers to license content.
Someone asked if libraries should get rid of local catalogs, and Tennant said to do so “only when it fits local needs.”

Walt Crawford said:
1. Privacy still matters. Crawford questioned whether patrons really want libraries to turn into Amazon in an era of government data mining and inferences that could track a ten-year patron borrowing pattern.
2. The slow library movement, which argues that locality is vital to libraries, mindfulness matters, and open source software should be used where it works.
3. The role of the public library as publisher. Crawford pointed to libraries in Charlotte-Mecklenburg County (NC), the Vermont libraries Jessamyn West works with, and Wyoming as farther along this path, and said the tools are good enough that it’s becoming practical.

Blyberg said that systems need to be more open to the data that we put into them. Williams said that content must be disaggregatable and remixable, and Coombs pointed out the current difficulty of swapping out ILS modules and said ERM was a huge issue. Tennant referenced the Talis platform, and said one of Evergreen’s innovations is its use of the XMPP (Jabber) protocol, which is easier than SOAP web services, which are too heavyweight.

Marshall Breeding responded to a question asking if MARC was dead, saying, “I’m married to a cataloger,” but that we do need things in addition to MARC, which is good for books, such as Dublin Core and ONIX. Coombs pointed out that MARCXML is a mess because it’s retrofitted and doesn’t leverage the power of XML. Crawford said he likes to give Roy [Tennant] a hard time about his phrase “MARC is dead,” noting that for a dying format, the Moen panel was full at 8 a.m.

Questioners asked what happens when the one server goes down, and Blyberg responded, “What if your T-1 line goes down?” Joan Frye Williams exhorted the audience to “examine your consciences” when asking vendors how to spend their time. Coombs agreed, saying that her experience on user groups had exposed her to the crazy competing needs vendors face; “[they] are spread way too thin.” Williams said there are natural transition points, spoke darkly of a pyramid scheme, and warned that “you get the vendors you deserve.” Coombs agreed, saying, “Feature creep and managing expectations is a fiercely difficult job,” and noted that open source developers and support staff are different people.

Joan Frye Williams, information technology consultant, listed:
1. A new menu of end-user focused technologies. Williams said she worked in libraries when the typewriter was replaced by an OCLC machine, and libraries are still not using technology strategically. “Technology is not a checklist,” Williams chided, saying that the 23 Things movement of teaching new skills to library staff is insufficient.
2. Ability for libraries to assume development responsibility in concert with end-users
3. The need to make things more convenient, adopting artificial intelligence (AI) principles of self-organizing systems. Williams asked, “If computers can learn from their mistakes, why can’t we?”

Someone asked why libraries are still using the ILS. Coombs said it’s a financial issue, and Breeding responded sharply, “How can we not automate our libraries?” Walt Crawford agreed: “Are we going to return to index cards?”

When the panel was asked if library home pages would disappear, Crawford and Blyberg both said they would be surprised. Williams said the product of the [library] website is the user experience. She said Yorba Linda Public Library (CA) is enhancing its site with a live book feed that scrolls on the page, updating as books are checked in.

And another audience member asked why the panel didn’t cover toys and protocols. Crawford said outcomes matter, and Coombs agreed, saying, “I’m a toy geek but it’s the user that matters.” Many participants talked about their use of Twitter, and Coombs said portable applications on a USB drive have the potential to change public computing in libraries. Tennant recommended viewing the Photosynth demo, first shown at the TED conference.

Finally, when asked how to keep up with trends, especially for new systems librarians, Coombs said, “It depends what kind of library you’re working in. Find a network and ask questions on the code4lib [IRC] channel.”

Blyberg recommended constructing a well-rounded blogroll, with sites from the humanities, sciences, and library and information science, to become a well-rounded feed reader. Tennant recommended a (gasp) dead-tree magazine, Business 2.0. Coombs said the Gartner website has good information about technology adoption, and Williams recommended trendwatch.com.

Links to other trends:
Karen Coombs Top Technology Trends
Meredith Farkas Top Technology Trends
3 Trends and a Baby (Jeremy Frumkin)
Some Trends from the LiB (Sarah Houghton-Jan)
Sum Tech Trends for the Summer of 2007 (Eric Lease Morgan)

And other writeups and a podcast:
Rob Styles
Ellen Ward
Chad Haefele

Presenting at ALA panel on Future of Information Retrieval

The Future of Information Retrieval

Ron Miller, Director of Product Management, HW Wilson, hosts a panel of industry leaders including:
Mike Buschman, Program Manager, Windows Live Academic, Microsoft.
R. David Lankes, PhD, Director of the Information Institute of Syracuse, and Associate Professor, School of Information Studies, Syracuse University.
Marydee Ojala, Editor, ONLINE, and contributing feature and news writer to Information Today, Searcher, EContent, Computers in Libraries, among other publications.
Jay Datema, Technology Editor, Library Journal

Add to calendar:
Monday, 25 June 2007
8-10 a.m., Room 103b
Preliminary slides and audio attached.

IDPF: Google and Harvard

Libraries And Publishers
At the 2007 International Digital Publishing Forum (IDPF) in New York May 9th, publishers and vendors discussed the future of ebooks in an age increasingly dominated by large-scale digitization projects funded by the deep pockets of Google and Microsoft.

In a departure from the other panels, which discussed digital warehouses and repositories both planned and in production from Random House and HarperCollins, Peter Brantley, executive director of the Digital Library Federation, and Dale Flecker of Harvard University Library made a passionate case for libraries in an era of information as a commodity.

Brantley began by mentioning the Library Project on Flickr, and led with a slightly ominous series of slides: “Libraries buy books (for a while longer),” followed by “Libraries don’t always own what’s in the book, just the book (the ‘thing’ of the book).”

He then reiterated the classic rights that libraries protect: the Right to Borrow, Right to Browse, Right to Privacy, and Right to Learn, and warned that “some people may become disenfranchised in the digital world, when access to the network becomes cheaper than physical things.” Given the presentation that followed from Tom Turvey, director of the Google Book Search project, this made sense.

Brantley made two additional points, saying “Libraries must permanently hold the wealth of our many cultures to preserve fundamental Rights” and “Access to books must be either free or low-cost for the world’s poor.” He departed from conventional thinking on access, though, when he argued that this low-cost access didn’t need to include fiction. Traditionally, libraries began as subscription libraries for those who couldn’t afford to purchase fiction in drugstores and other commercial venues.

Finally, Brantley said that books will become communities as they are integrated, multiplied, fragmented, collaborative, and shared, and publishing itself will be reinvented. Yet his conclusion contained an air of inevitability, as he said, “Libraries and publishers can change the world, or it will be transformed anyway.”

A podcast recording of his talk is available on his site.

Google Drops A Bomb
Google presented a plan to entice publishers to buy into two upcoming models for making money from Google Book Search: a weekly rental “that resembles a library loan” and a purchase option, “much like a bookstore,” said Tom Turvey, director of Google Book Search Partnerships. The personal library would allow search across the books, expiration and rental, and copy and paste. No pricing was announced. Google has been previewing the program at events including the London Book Fair.

Turvey said Google Book Search is live in 70 countries and eight languages. Ten years ago, zero percent of consumers clicked before buying books online; now $4 billion of books are purchased online. “We think that’s a market,” Turvey said, “and we think of ourselves as the switchboard.”

Turvey, who previously worked at bn.com and ebrary, said publishers receive the majority of the revenue share as well as free marketing tools, site-brandable search inside a book with restricted buy links, and fetch-and-push statistical reporting. He said an iTunes for books was unlikely, since books don’t have one device, model, or user experience that works across all categories. Different verticals, like fiction, reference, and science, technology, and medicine (STM), require different user experiences, Turvey said.

Publishers including SparkNotes requested a way to make money from enabling a full view of their content on Google Books, as did many travel publishers. Most other books are limited to 20 percent visibility, although Turvey said there is a direct correlation between the number of pages viewed and subsequent purchases.

This program raises significant privacy questions. If Google has records that can be correlated with all the other information it stores, this is the polar opposite of what librarians have espoused about intellectual freedom and the privacy of circulation records. Additionally, the quality control questions are significant and growing, voiced by historian Robert Townsend and others.

Libraries are a large market segment for publishers, and it seems reasonable to voice concerns about this proposal at this stage, especially for those libraries that haven’t already been bought and sold. Others at the forum were skeptical. Jim Kennedy, vice president and director at the Associated Press, said, “The Google guy’s story is always the same: Send us your content and we’ll monetize it.”

Ebooks, Ejournals, And Libraries
Dale Flecker of the Harvard University Library gave a historical overview of the challenges libraries have grappled with in the era of digital information.

Instead of talking about ebooks, which he said represent only two percent of usage at Harvard, Flecker described eight challenges about ejournals, which are now “core to what libraries do” and have been in existence for 15-20 years. Library consultant October Ivins challenged this statistic about ebook usage as irrelevant, saying “Harvard isn’t typical.” She said there were 20 ebook platforms demonstrated at the 2006 Charleston Conference, though discovery is still an issue.

First, licensing is a big deal. There were several early questions: Who is a user? What can they do? Who polices behavior? What about guaranteed performance and license lapses? Flecker said that in an interesting shift, there is a move away from licenses to “shared understandings,” where content is acquired via purchase orders.

Second, archiving is a difficult issue. Harvard was founded in 1636 and has especially rich 18th-century print collections, so it has long been aware that “libraries buy for the ages.” The sticky issues come with remote and perpetual access, and what happens when a publisher ceases publishing.

Flecker didn’t mention library projects like LOCKSS or Portico in his presentation, though they exist to answer those needs. He did say that “DRM is a bad actor” and that it is technically challenging to archive digital content. Though there have been various initiatives from libraries, publishers, and third parties, he said “publishers have backed out,” and there are open questions about rights, responsibilities, and who pays for what. In the question-and-answer period that followed, Flecker said Harvard “gives lots of money” to Portico.

Third, aggregation is common. Most ejournal content is licensed in bundles, and consortia and buying clubs are common. Aggregated platforms provide useful search options and intercontent functionality.

Fourth, statistics matter, since they show utility and value for money spent. Though the COUNTER standard is well-defined and SUSHI gives a protocol for exchange of multiple stats, everyone counts differently.

Fifth, discovery is critical. Publishers have learned that making content discoverable increases use and value. At first, metadata was perceived to be intellectual property (as it still is, apparently), but then there was a grudging acceptance and finally, enthusiastic participation. It was unclear which metadata Flecker was describing, since many publisher abstracts are still regarded as intellectual property. He said Google is now a critical part of the discovery process.

Linkage was the sixth point. Linking started with citations, when publishers and aggregators realized that many footnotes referenced articles that were also online. Bilateral agreements came next, and finally the Digital Object Identifier (DOI) generalized the infrastructure and, along with OpenURL, helped solve the “appropriate copy” problem. With this solution came true interpublisher, interplatform, persistent, and actionable links, which are now growing beyond citations.
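To make the linking machinery concrete, here is a hedged sketch that turns a single citation into both a DOI resolver link and an OpenURL 1.0 key/encoded-value query. The resolver base URL and the citation values are illustrative placeholders.

```python
# Sketch: build a DOI resolver link and an OpenURL 1.0 (KEV) request for one
# citation. The resolver base and citation values are placeholders; the KEV
# keys follow the common journal-article profile.
from urllib.parse import urlencode

RESOLVER_BASE = "https://resolver.example.edu/openurl"  # hypothetical link resolver

citation = {
    "doi": "10.1000/example.2007.123",  # illustrative DOI
    "atitle": "The Economics of Bibliographic Data",
    "jtitle": "Library Journal",
    "volume": "132",
    "spage": "28",
    "date": "2007",
}

doi_link = f"https://doi.org/{citation['doi']}"

openurl_link = RESOLVER_BASE + "?" + urlencode({
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    "rft_id": f"info:doi/{citation['doi']}",
    "rft.atitle": citation["atitle"],
    "rft.jtitle": citation["jtitle"],
    "rft.volume": citation["volume"],
    "rft.spage": citation["spage"],
    "rft.date": citation["date"],
})

print(doi_link)      # resolves through the DOI system to the publisher's copy
print(openurl_link)  # resolves through the library's resolver to the appropriate copy
```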

Seventh, there are early glimpses of text mining in ejournals. Text is being used as fodder for computational analysis, not just individual reading. This has required somewhat different licenses geared for computation, and also needs a different level of technical support. Last, there are continuing requirements for scholarly citation that is unambiguous, persistent, and at a meaningful level. Article-level linking in journals has proven to be sufficient, but the equivalent for books (the page? chapter? paragraph?) has not been established in an era of reflowable text.

In the previous panel, Peter Brantley asked the presenters on digital warehouses about persistent URLs to books, and whether ISBNs would be used to construct those URLs. There was total silence, and then LibreDigital volunteered that redirects could be enabled at publisher request.

As WorldCat.org links have also switched from ISBN to OCLC number for permalinks, this seems like an interesting question to solve and discuss. Will the canonical URL for a book point to Amazon, Google, OCLC, or OpenLibrary?

Open Data: What Would Kilgour Think?

The New York Public Library has reached a settlement with iBiblio, the public’s library and digital archive at the University of North Carolina at Chapel Hill, over the harvesting of records from its Research Libraries catalog, which NYPL claims is copyrighted.

Heike Kordish, director of the NYPL Humanities Library, said a cease-and-desist letter was sent because of a 1980s incident in which an Australian harvesting effort turned around and resold NYPL records.

Simon Spero, iBiblio employee and technical assistant to the assistant vice chancellor at UNC-Chapel Hill, said NYPL requested that its library records be destroyed, and the claim was settled with no admission of wrongdoing. “I would characterize the New York Public Library as being neither public nor a library,” Spero said.

It is a curious development that while the NYPL is making private arrangements to allow Google to scan its book collection into full text, it feels free to threaten other research libraries over MARC records.

The price of open data
This follows a similar string of disagreements about open data between OCLC and the MIT SIMILE project. The Barton Engineering Library catalog records were widely made available via BitTorrent, a decentralized file-sharing protocol.

This has since been resolved by making the Barton data available again, though in RDF and MODS, not MARC, under a Creative Commons license for non-commercial use.

OCLC CEO Jay Jordan said the issues around sharing data had their genesis in concerns about the Open WorldCat project and sharing records with Microsoft, Google, and Amazon. Other concerns about private equity firms entering the library market also drove recent revisions to the data sharing policies.

OCLC quietly revised its policy about sharing records, which had not been updated since 1987 after numerous debates in the 1980s about the legality of copyrighting member records.

The new WorldCat policy, reads in part, “WorldCat® records, metadata and holdings information (“Data”) may only be used by Users (defined as individuals accessing WorldCat via OCLC partner Web interfaces) solely for the personal, non-commercial purpose of assisting such Users with locating an item in a library of the User’s choosing… No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

Looking through the most recent board minutes, it appears that concerns have been raised about “the risk to FirstSearch revenues from OpenWorldCat,” and management incentive plans have been approved.

What is good for libraries?
Another project initiated by Simon Spero, entitled Fred 2.0 after the late Fred Kilgour of OCLC, Yale, and Chapel Hill fame, recently released Library of Congress authority file and subject information, gathered by means similar to the NYPL records.

Spero said the project is “dedicated to the men and women at the Library of Congress and outside, who have worked for the past 108 years to build these authorities, often in the face of technology seemingly designed to make the task as difficult as possible.”

Since Library of Congress data, as government information, cannot by definition be copyrighted, the project was more collaborative in nature and has received acclaim for helping to point out cataloging irregularities in the records. OCLC also offers a linked authority file as a research project.

Firefox was born from open source
While there is not yet consensus on what will be built from released library data, the move can be compared to Netscape open-sourcing the Mozilla code in 1998, which eventually brought Firefox and other open source projects to light. It also shows that the financial motivations of library organizations by necessity dictate the legal mechanisms of protection.

Open source metasearch

Now there’s a new kid on the (meta)search block. LibraryFind, an open-source project funded by the State Library of Oregon, is currently live at Oregon State University. The library has just packaged up a release for anyone to download and install.

Jeremy Frumkin, Gray Chair for Innovative Library Services at OSU, said the goals were to support scholarly workflow, remove barriers between the library and Web information, and establish the digital library as a platform.

Lead developers Dan Chudnov, soon to join the Library of Congress’s Office of Strategic Initiatives, and Terry Reese, catalog librarian and developer of the popular application MarcEdit, worked from the following guiding principles: two clicks, one to find and one to get; a goal of returning results in four seconds; and known, adjustable results ranking.
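Those principles translate roughly into the sketch below, written in Python rather than the project's Ruby; the connector functions and source weights are illustrative assumptions, not LibraryFind internals.

```python
# Sketch of the guiding principles: fan a query out to several targets,
# enforce the four-second budget, and merge with known, adjustable weights.
# Connectors and weights are illustrative stand-ins, not LibraryFind code.
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

TIME_BUDGET_SECONDS = 4
SOURCE_WEIGHTS = {"catalog": 1.0, "articles": 0.8, "repository": 0.6}  # adjustable

def search_catalog(query):
    return [{"title": f"{query} (catalog hit)", "score": 0.9}]

def search_articles(query):
    return [{"title": f"{query} (article hit)", "score": 0.7}]

def search_repository(query):
    return [{"title": f"{query} (repository hit)", "score": 0.5}]

CONNECTORS = {"catalog": search_catalog,
              "articles": search_articles,
              "repository": search_repository}

def metasearch(query):
    merged = []
    with ThreadPoolExecutor(max_workers=len(CONNECTORS)) as pool:
        futures = {pool.submit(fn, query): name for name, fn in CONNECTORS.items()}
        try:
            for future in as_completed(futures, timeout=TIME_BUDGET_SECONDS):
                source = futures[future]
                for hit in future.result():
                    hit["score"] *= SOURCE_WEIGHTS[source]  # known, adjustable ranking
                    hit["source"] = source
                    merged.append(hit)
        except FuturesTimeout:
            pass  # sources that miss the time budget are simply dropped
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

print([hit["title"] for hit in metasearch("open access")])
```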

Other OSU project members included Tami Herlocker, point person for interface development, and Ryan Ordway, system administrator. Frumkin said, “The Ruby on Rails platform provided easy, quick user interface development. It gives a variety of UI possibilities, and offers new interfaces for different user groups.”

The application includes collaborations on the OpenURL module from Ross Singer, library applications developer at the Georgia Tech library, and Ed Summers, Library of Congress developer. Journal coverage can be imported from a SerialsSolutions export, and more import facilities are planned in upcoming releases.

OSU is working on a contract with OCLC to download WorldCat data, and is looking to build greater trust relationships with vendors. “The upside for vendors is they can see how their data is used when developing new services,” Frumkin said.

Future enhancements include an information dashboard and a personal digital library. Developers are also staffing a support chatroom for technical support, help, and development discussion of LibraryFind.

ALA 2006: Top Tech Trends

In yet another crowded ballroom, the men (and woman) of LITA prognosticated on the future of libraries and technology.

Walt Crawford moderated the panel and spoke in absentia for Sarah Houghton. Her trends were:

  • Returning power to content owners
  • An OCLC ILS with RedLightGreen as the front-end
  • Outreach online

Karen Schneider listed four:

  • Faceted Navigation, from Endeca and others
  • eBooks–the Sophie project from the Institute for the Future of the Book
  • The Graphic Novel–Fun Home
  • Net Neutrality

Eric Lease Morgan listed several, and issued a call for a Notre Dame Perl programmer throughout his trends:

  • VoIP, which he thought would cure email abuse
  • Web pages are now blogs and wikis, which may cause preservation issues since they are dynamically generated from a database
  • Social Networking tools
  • Open Source
  • Metasearch, which he thought may be dead given its lack of de-duplication
  • Mass Digitization, and the future services libraries can provide against it
  • Growing Discontent with Library Catalogs
  • Cataloging is moving to good enough instead of complete
  • OCLC is continuing to expand and refine itself
  • LITA’s 40th anniversary–Morgan mentioned how CBS is just celebrating its 55th anniversary of live color TV broadcasting

Tom Wilson noted two things: “Systems aren’t monolithic, and everything is an interim solution.”

Roy Tennant listed three trends:
1. Next-generation finding tools, not just library catalogs. Though the NGC4Lib mailing list is a necessary step, metasearch still needs to be done, and it’s very difficult to do. Some vendors are introducing products, like Innovative’s Encore and Ex Libris’ Primo, which attempt to solve this problem.
2. The rise of filtering and selection. Tennant said, “The good news is everyone can be a publisher. And the bad news is, everyone can be a publisher.”
3. The rise of microcommunities, like code4lib, which give rise to ubiquitous and constant communication.

Discussion after the panelists spoke raised interesting questions, including Clifford Lynch’s recommendation of Microsoft’s Stuff I’ve Seen. Marshall Breeding recommended tagging WorldCat rather than local catalogs, but Karen Schneider pointed out that the user reviews on Open WorldCat were deficient compared to Amazon’s.

When asked how to spot trends, Eric Lease Morgan responded, “Read and read and read–listservs, weblogs; Listen; Participate.” Roy Tennant said, “Look outside the library literature–Read the Wall Street Journal, Fast Company, and Business 2.0. Finally, look for patterns.”

More discussion, and better summaries:
LITA Blog » Blog Archive » Eric Lease Morgan’s Top Tech Trends for ALA 2006; Sum pontifications
LITA Blog » Blog Archive » The Annual Top 10 Trends Extravaganza
Hidden Peanuts » ALA 2006 – LITA Top Tech Trends
ALA TechSource | Tracking the Trends: LITA’s ALA Annual ’06 Session
Library Web Chic » Blog Archive » LITA Top Technology Trends

ALA 2006: Future of Search

This oversubscribed session (I sat on the floor, as did many others) featured Stephen Abram, SirsiDynix vice president and SLA president, and Joe Janes of the University of Washington debating the future of search, moderated by LJ columnist Roy Tennant.

Abram asked a pointed question, which decided the debate early, “Were libraries ever about search? Search was rarely the point…unless you wanted to become a librarian.”  In Abram’s view, the current threat to libraries comes from user communities like Facebook/MySpace, since MySpace is now the 6th largest search engine. Other threats to libraries include the Google patent on quality.

Abram said the problem of the future is winnowing, and that you cannot teach people to search. “Boolean doesn’t work,” he said. Abram felt it was a given that more intelligence needs to be built into the interface.

In more sociological musings, Abram said “Facts have a half-life of 12 years,” and social networks matter since “teens and 20s live through their social networks. The world is ahead of us, and teams are contextual. People solve problems in groups.”

Joe Janes asked, “What would happen if you made WorldCat open source? Would the fortress of metadata in Dublin, OH crumble?” When asked if libraries should participate in OpenWorldCat, Abram said, “Sure, why not? Our competitor is ignorance, not access. Libraries transform lives.”

Janes pointed out that none of the current search services (Google Answers, Yahoo Answers, and the coming Microsoft Answers) have worked well, and Tennant said, “While Google and Yahoo may have the eyeballs of users, libraries have the feet of users.”

In an interesting digression from the question at hand, Abram asked why libraries aren’t creating interesting tools like LibraryThing and LibraryELF (look for a July NetConnect feature about the ELF by Liz Burns). Janes said it comes back to privacy concerns, since this is the “looking over your shoulder decade. Hi, NSA!” With the NSA and TSA examining search, banking, and phone records, library privacy ethics are being challenged like no recent time in history.

Roy Tennant asked if libraries should incorporate better interface design, relevance ranking, spelling suggestions, and faceted browsing. Abram said it’s already happening at North Carolina State University with the Endeca catalog project. The Grokker pilot at Stanford is another notable example, and its visual contents and tiled result sets mirror how people learn. “Since the search engines are having problems putting ads in visual search, it’s good for librarians.”

Abram got the most laughter by pointing out that the thing that killed Dialog was listening to their users. As librarian requests made Dialog even more precise, “At the end of a Dialog search, you could squeeze a diamond out of your ass.” Janes said the perfect search is “no search at all, one that has the lightest cognitive load.”

Since libraries are, in Janes’ words, “a conservation organization because the human record is at stake, the worst nightmare is that nothing changes and libraries die. The finest vision is to put Google out of business.” Abram’s view was libraries must become better at advocacy and trust users to lay paths through catalog tagging and other vendor initiatives.

The question of the future of search turned into the future of libraries, and Joe Janes concluded that “Libraries are in the business of vacations we enabled, cars we helped fix, businesses we started, and helping people move.” Abram ended with a pithy slogan for libraries, the place of “Bricks, Clicks, and Tricks.”

Other commentary here:
The Shifted Librarian: 20060624 Who Controls the Future of Search?
Library Web Chic » Blog Archive » The Ultimate Debate : Who Controls the Future of Search
LITA Blog » Blog Archive » The Ultimate Debate: Who Controls the Future of Search
AASL Weblog – The Ultimate Debate: Who Controls the Future of Search?

Open World Cat

In typical OCLC style, a quiet revolution is brewing. Formerly a subscription-only database, WorldCat has begun to propagate into search engines (Google, Yahoo, and Ask in particular), and with the merger of RLG, it looks like a truly spectacular interface could be created for the union catalog.

In the meantime, it’s curious that OCLC chose to use an ISBN-based permalink structure instead of OpenURL. It does showcase FRBR, but beyond that it’s not very interoperable.
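For comparison, a sketch of the link styles in question, using an illustrative ISBN and OCLC number and a hypothetical resolver base for the OpenURL form:

```python
# Sketch of the competing link styles. The ISBN and OCLC number are
# illustrative values; the resolver base URL is a hypothetical placeholder.
from urllib.parse import urlencode

isbn = "0140244867"       # illustrative ISBN
oclc_number = "12345678"  # illustrative OCLC record number

# ISBN-based permalink style (the Open WorldCat approach described above)
isbn_permalink = f"http://www.worldcat.org/isbn/{isbn}"

# OCLC-number permalink style (record-level, avoids multiple-ISBN ambiguity)
oclc_permalink = f"http://www.worldcat.org/oclc/{oclc_number}"

# OpenURL-style request, routable through any library's link resolver
openurl = "https://resolver.example.edu/openurl?" + urlencode({
    "url_ver": "Z39.88-2004",
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:book",
    "rft.isbn": isbn,
})

print(isbn_permalink, oclc_permalink, openurl, sep="\n")
```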

The real question is, will OCLC enter the SEO (search engine optimization) business so that library results show on the first page?