Repurposing Metadata

As the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has become a central component of digital library projects, increased attention has been paid to the ways metadata can be reused. As every computer project since the beginning of time has had occasion to understand, the data available for harvesting is only as good as the data entered. Given these quality issues, there are larger questions about how to reuse the valuable metadata once it has been originally described, cataloged, annotated, and abstracted.

Squeezing metadata into a juicer
As is often the case, the standards and library community were out in front in thinking about how to make metadata accessible in a networked age. With the understanding that most of the creators of the metadata would be professionals, choices were left about repeating elements, etc., in the Dublin Core standard.

This has proved to be an interesting choice, since validators and computers tend to look unfavorably on the unique choices that may make sense only locally. Thus, as the weblog revolution started in 2000 and became used in even the largest publications by 2006, these tools could not be ignored as a mass source of metadata creation.

Reusing digital objects
In the original 2006 proposal to the Mellon Foundation, Carl Lagoze wrote that “Terms like cyberinfrastructure, e-scholarship, and e-science all describe a concept of data-driven scholarship where researchers access shared data sets for analysis, reuse, and recombination with other network-available resources. Interest in this new scholarship is not limited to the physical and life sciences. Increasingly, social scientists and humanists are recognizing the potential of networked digital scholarship. A core component of this vision is a new notion of the scholarly document or publication.

Rather than being static and text-based, this scholarly artifact flexibly combines data, text, images, and services in multiple ways regardless of their location and genre.”

After being funded, this proposal has turned into something interesting, with digital library participation augmented by representatives from Microsoft, Google, and other large companies joining the digital library community. Since Atom feeds have garnered much interest and have become an IETF recommended standard, there is community interest in bringing these worlds together. Now known as Open Archives Initiative Object Reuse and Exchange (OAI-ORE), the alpha release is drawing interesting reference implementations as well as criticism for the methods used to develop it.

Resource maps everywhere
Using existing web tools is a good example of working to extend rather than invent. As Herbert van de Sompel noted in his Fall NISO Forum presentation, “Materials from repositories must be re-usable in different contexts, and life for those materials starts in repositories, it does not end there.” And as the Los Alamos National Laboratory Library experiments have shown, the amount of reuse that’s possible when you have journal data in full-text is extraordinary.

Another potential use of OAI-ORE beyond the repositories it was meant to assist can be found in the Flickr Commons project. With pilot implementations from the Library of Congress, the Powerhouse Museum and the Brooklyn Museum, OAI-ORE could play an interesting role in aggregating user-contributed metadata for evaluation, too. Once tags have been assigned, this metadata could be collected for further curation. In this same presentation, van de Sompel showed a Flickr photoset as an example of a compound information object.

Anything but lack of talent
A great way to understand the standard is to see it in use. Michael Giarlo of the Library of Congress developed a plugin for WordPress, a popular content management system that generates Atom. His plugin generates a resource map that is valid Atom and which contains some Dublin Core elements, including title, creator, publisher, date, language, and subject. This resource map can be transformed into RDF triples via GRDDL, which again facilitate reuse by the linked data community.

This turns metadata creation on its head, since the Dublin Core elements are taken directly from what the weblog author enters as the title, the name of the weblog author, subjects that were assigned, and the date and time of the entry. One problem OAI-ORE promises to solve is how to connect disparate URLs into one unified object, which the use of Atom simplifies.
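The mapping described above can be sketched in a few lines of Python. This is only an illustration, not the output of the WordPress plugin or the OAI-ORE Atom serialization: the post fields, the example URLs, and the use of rel="related" for the parts of the compound object are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Use readable prefixes when the tree is serialized.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical weblog entry; every value here is made up for illustration.
SAMPLE_POST = {
    "url": "https://blog.example.org/2008/01/sample-entry",
    "title": "Sample entry",
    "author": "A. Blogger",
    "published": "2008-01-15T12:00:00Z",
    "tags": ["metadata", "reuse"],
    "parts": [
        "https://blog.example.org/files/figure1.png",
        "https://blog.example.org/files/data.csv",
    ],
}


def post_to_atom_entry(post):
    """Build an Atom entry whose Dublin Core elements come from the post fields."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = post["url"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = post["title"]
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = post["published"]
    author = ET.SubElement(entry, f"{{{ATOM_NS}}}author")
    ET.SubElement(author, f"{{{ATOM_NS}}}name").text = post["author"]

    # One link per URL that belongs to the unified object
    # (the rel value is a placeholder, not the one the specification defines).
    for href in post["parts"]:
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", {"rel": "related", "href": href})

    # Dublin Core elements copied straight from what the author entered.
    ET.SubElement(entry, f"{{{DC_NS}}}title").text = post["title"]
    ET.SubElement(entry, f"{{{DC_NS}}}creator").text = post["author"]
    ET.SubElement(entry, f"{{{DC_NS}}}date").text = post["published"]
    for tag in post["tags"]:
        ET.SubElement(entry, f"{{{DC_NS}}}subject").text = tag
    return entry


if __name__ == "__main__":
    print(ET.tostring(post_to_atom_entry(SAMPLE_POST), encoding="unicode"))
```

The point of the sketch is that no separate cataloging step is needed: the title, creator, date, and subjects are a by-product of writing the entry.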

As the OAI-ORE specification moves into beta, it will be interesting to see if the constraints of the wider web world will breathe new life into carefully curated metadata. I certainly hope it does.

IDPF: Google and Harvard

Libraries And Publishers
At the 2007 International Digital Publishing Forum (IDPF) in New York on May 9, publishers and vendors discussed the future of ebooks in an age increasingly dominated by large-scale digitization projects funded by the deep pockets of Google and Microsoft.

In a departure from the other panels, which discussed digital warehouses and repositories, both planned and in production, from Random House and HarperCollins, Peter Brantley, executive director of the Digital Library Federation, and Dale Flecker of Harvard University Library made a passionate case for libraries in an era of information as a commodity.

Brantley began by mentioning the Library Project on Flickr, and led with a slightly ominous series of slides: “Libraries buy books (For a while longer),” followed by “Libraries don’t always own what’s in the book, just the book (the ‘thing’ of the book).”

He then reiterated the classic rights that libraries protect: The Right to Borrow, Right to Browse, Right to Privacy, and Right to Learn, and warned that “some people may become disenfranchised in the digital world, when access to the network becomes cheaper than physical things.” Given the presentation that followed from Tom Turvey, director of the Google Book Search project, this made sense.

Brantley made two additional points, saying “Libraries must permanently hold the wealth of our many cultures to preserve fundamental Rights, and Access to books must be either free or low-cost for the world’s poor.” He departed from conventional thinking on access, though, when he argued that this low-cost access didn’t need to include fiction. Traditionally, libraries began as subscription libraries for those who couldn’t afford to purchase fiction in drugstores and other commercial venues.

Finally, Brantley said that books will become communities as they are integrated, multiplied, fragmented, collaborative, and shared, and publishing itself will be reinvented. Yet his conclusion contained an air of inevitability, as he said, “Libraries and publishers can change the world, or it will be transformed anyway.”

A podcast recording of his talk is available on his site.

Google Drops A Bomb
Google presented a plan to entice publishers to buy into two upcoming models for making money from Google Book Search, including a weekly rental “that resembles a library loan” and a purchase option, “much like a bookstore,” said Tom Turvey, director of Google Book Search Partnerships. The personal library would allow search across the books, expiration and rental, and copy and paste. No pricing was announced. Google has been previewing the program at events including the London Book Fair.

Turvey said Google Book Search is live in 70 countries and eight languages. Ten years ago, zero percent of consumers clicked before buying books online, and now $4 billion of books are purchased online. “We think that’s a market,” Turvey said, “and we think of ourselves as the switchboard.”

Turvey, who previously worked at bn.com and ebrary, said publishers receive the majority of the revenue share as well as free marketing tools, site-brandable search inside a book with restricted buy links, and fetch and push statistical reporting. He said an iTunes for Books was unlikely, since books don’t have one device, model or user experience that works across all categories. Different verticals like fiction, reference, and science, technology and medicine (STM), require a different user experience, Turvey said.

Publishers including SparkNotes requested a way to make money from enabling a full view of their content on Google Books, as did many travel publishers. Most other books are limited to 20 percent visibility, although Turvey said there is a direct correlation between the number of pages viewed and subsequent purchases.

This program raises significant privacy questions. If Google has records that can be correlated with all the other information it stores, this is the polar opposite of what librarians have espoused about intellectual freedom and the privacy of circulation records. Additionally, the quality control questions are significant and growing, voiced by historian Robert Townsend and others.

Libraries are a large market segment to publishers. It seems reasonable to voice concerns about this proposal at this stage, especially for those libraries that haven’t already been bought and sold. Others at the forum were skeptical. Jim Kennedy, vice president and director at the Associated Press, said, “The Google guy’s story is always the same: Send us your content and we’ll monetize it.”

Ebooks Ejournals And Libraries
Dale Flecker of the Harvard University Library gave a historical overview of the challenges libraries have grappled with in the era of digital information.

Instead of talking about ebooks, which he said represent only two percent of usage at Harvard, Flecker described eight challenges about ejournals, which are now “core to what libraries do” and have been in existence for 15-20 years. Library consultant October Ivins challenged this statistic about ebook usage as irrelevant, saying “Harvard isn’t typical.” She said there were 20 ebook platforms demonstrated at the 2006 Charleston Conference, though discovery is still an issue.

First, licensing is a big deal. There were several early questions: Who is a user? What can they do? Who polices behavior? What about guaranteed performance and license lapses? Flecker said that in an interesting shift, there is a move away from licenses to “shared understandings,” where content is acquired via purchase orders.

Second, archiving is a difficult issue. Harvard began in 1636, and has especially rich 18th century print collections, so it has been aware that “libraries buy for the ages.” The sticky issues come with remote and perpetual access, and what happens when a publisher ceases publishing.

Flecker didn’t mention library projects like LOCKSS or Portico in his presentation, though they do exist to answer those needs. He did say that “DRM is a bad actor” and it’s technically challenging to archive digital content. Though there have been various initiatives from libraries, publishers, and third parties, he said “Publishers have backed out,” and there are open questions about rights, responsibilities, and who pays for what. In the question and answer period that followed, Flecker said Harvard “gives lots of money” to Portico.

Third, aggregation is common. Most ejournal content is licensed in bundles, and consortia and buying clubs are common. Aggregated platforms provide useful search options and intercontent functionality.

Fourth, statistics matter, since they show utility and value for money spent. Though the COUNTER standard is well-defined and SUSHI gives a protocol for exchange of multiple stats, everyone counts differently.

Fifth, discovery is critical. Publishers have learned that making content discoverable increases use and value. At first, metadata was perceived to be intellectual property (as it still is, apparently), but then there was a grudging acceptance and finally, enthusiastic participation. It was unclear which metadata Flecker was describing, since many publisher abstracts are still regarded as intellectual property. He said Google is now a critical part of the discovery process.

Linkage was the sixth point. Linking started with citations, when publishers and aggregators realized that many footnotes contained links to articles that were also online. Bilateral agreements came next, and finally, the Digital Object Identifier (DOI) generalized the infrastructure and helped solve the “appropriate copy” problem, along with OpenURL. With this solution came true interpublisher, interplatform, persistent and actionable links, which are now growing beyond citations.
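To make the “appropriate copy” idea concrete, here is a minimal Python sketch of how a citation can be turned into an OpenURL that a library’s own link resolver then routes to the copy that library licenses. The resolver address, the DOI, and the citation values are placeholders invented for the example; the key names follow the OpenURL (Z39.88-2004) key/encoded-value format for journal articles.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical resolver base URL; each library would substitute its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"

# Citation dictionary keys (ours) mapped to OpenURL journal-format keys.
FIELD_MAP = {
    "article_title": "rft.atitle",
    "journal_title": "rft.jtitle",
    "issn": "rft.issn",
    "volume": "rft.volume",
    "start_page": "rft.spage",
    "year": "rft.date",
}


def build_openurl(citation, resolver_base=RESOLVER_BASE):
    """Return an OpenURL for a journal article citation."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    # A DOI, when known, travels along as an identifier for the referent.
    if citation.get("doi"):
        params.append(("rft_id", "info:doi/" + citation["doi"]))
    for key, openurl_key in FIELD_MAP.items():
        if citation.get(key):
            params.append((openurl_key, str(citation[key])))
    return resolver_base + "?" + urlencode(params)


# Made-up citation used only to exercise the function.
SAMPLE_CITATION = {
    "article_title": "An example article",
    "journal_title": "Journal of Examples",
    "volume": 12,
    "start_page": 34,
    "year": 2007,
    "doi": "10.1000/example.123",
}

if __name__ == "__main__":
    url = build_openurl(SAMPLE_CITATION)
    print(url)
    print(parse_qs(urlsplit(url).query)["rft.jtitle"])
```

The same citation data produces a different link at each institution simply by swapping the resolver base, which is what lets one published reference lead each reader to a copy they are entitled to use.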

Seventh, there are early glimpses of text mining in ejournals. Text is being used as fodder for computational analysis, not just individual reading. This has required somewhat different licenses geared for computation, and also needs a different level of technical support.

Last, there are continuing requirements for scholarly citation that is unambiguous, persistent, and at a meaningful level. Article-level linking in journals has proven to be sufficient, but the equivalent for books (the page? chapter? paragraph?) has not been established in an era of reflowable text.

In the previous panel, Peter Brantley asked the presenters on digital warehouses about persistent URLs to books, and if ISBNs would be used to construct those URLs. There was total silence, and then LibreDigital volunteered that redirects could be enabled at publisher request.

As WorldCat.org links have also switched from ISBN to OCLC number for permalinks, this seems like an interesting question to solve and discuss. Will the canonical URL for a book point to Amazon, Google, OCLC, or OpenLibrary?

Open source metasearch

Now there’s a new kid on the (meta)search block. LibraryFind, an open-source project funded by the State Library of Oregon, is currently live at Oregon State University. The library has just packaged up a release for anyone to download and install.

Jeremy Frumkin, Gray chair for Innovative Library Services at OSU, said the goals were to contribute to the support of scholarly workflow, to remove barriers between the library and Web information, and to establish the digital library as a platform.

Lead developers Dan Chudnov, soon to join the Library of Congress’s Office of Strategic Initiatives, and Terry Reese, catalog librarian and developer of popular application MarcEdit, worked with the following guiding principles: Two clicks–one to find, and one to get; a goal of getting results in four seconds, and known and adjustable results ranking.

Other OSU project members included Tami Herlocker, point person for interface development, and Ryan Ordway, system administrator. Frumkin said, “The Ruby on Rails platform provided easy, quick user interface development. It gives a variety of UI possibilities, and offers new interfaces for different user groups.”

The application includes collaborations on the OpenURL module from Ross Singer, library applications developer at the Georgia Tech library, and Ed Summers, Library of Congress developer. Journal coverage can be imported from a SerialsSolutions export, and more import facilities are planned in upcoming releases.

OSU is working on a contract with OCLC’s WorldCat to download data, and is looking to build greater trust relationships with vendors. “The upside for vendors is they can see how their data is used when developing new services,” Frumkin said.

Future enhancements include an information dashboard and a personal digital library. Developers are also staffing a support chatroom for technical support, help, and development discussion of LibraryFind.

Using Drupal to put Endnote online

There is still no easy way to manage a library of references on a personal or institutional site. Librarians who want to put up a list of institutional publications, or researchers who want to share references, are held back by the limitations of existing software, privacy concerns, or technical roadblocks. This problem has been mitigated by an open source CMS with a handy bibliographic data module.

The Drupal content management system is attractive to many librarians and information scientists because of its deep use of taxonomy. Daniel Chudnov uses it to power Open Source Systems for Libraries, and his personal weblog, One Big Library. Roy Tennant uses Drupal for TechEssence.info, and the Ann Arbor Public Library uses it for user registration, resource weblogs, and the overall site.

However, the state of the art in bibliographic management and collaboration is still stuck in 1990. When a writer wants to collect articles, there are a number of client applications (all owned by Thomson ISI ResearchSoft, including Endnote, ProCite, and Reference Manager, plus WriteNote) that do a nice job of saving the references and integrating with word processors to format the citations.

Endnote is the most commonly-used program, but it was not designed to share references. Modern science is all about collaboration, from grant proposals to international research. In the worst case, sharing an Endnote library on a network server can cause corruption. In the best case, shared Endnote libraries are limited to read-only if another person has it open, which limits collaboration.

A version of EndnoteWeb has been in development for most of 2006, and is promised by January of next year. Early reports of integration with Web of Science tell of limited functionality and interoperability.

In 2002, a number of former Reference Manager employees waited for their non-compete agreements with ISI to expire, then founded RefWorks, an online version of the familiar bibliographic managers. In the last two years, applications including Connotea and CiteULike have integrated bibliographic manager capabilities into their social bookmarking applications. Both allow RIS and BibTeX upload and download to systems managed at Nature Publishing Group and the University of Manchester, respectively.

At Cold Spring Harbor Laboratory, the annual reports of the institution have listed lab publications for over 100 years. These references have not been added to PubMed, which still only goes back to 1950. Thus, this unique information needed to be put into a format so that scholars could cite the early history of genetics, and the tragic misfire of eugenics research.

Many approaches were tried. One early method was programmer-centric, where the data was entered into a SQL database and a web front-end was scripted to add basic fields. While this was a promising start, it left out the rich data fields that enable bibliographic managers to capture complete citation information.

Since the library was examining digital asset management systems, Greenstone was assessed for its citation abilities. Ian Witten was able to jury-rig a solution that imported RIS information about citations, but getting them to display in a full way wasn’t simple.

As the prototyping continued, the initial database of 1800 records was exported out of the SQL database into comma separated value (CSV) format, and imported into Endnote. The archives clerk started assessing the reference types, and added new fields. For example, Institution was added so that a sort by the name could be used. A new reference type was added for non-standard reports.
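A conversion of this kind is easy to script. The sketch below is not the procedure used at Cold Spring Harbor Laboratory, which imported the CSV file into Endnote directly; it is a hedged illustration of the same idea using the RIS tagged format mentioned earlier. The column names (authors, title, year, institution) and the sample rows are assumptions, and mapping institution to the PB tag is one possible choice among several.

```python
import csv
import io

# Made-up export with hypothetical column names; authors are separated by ";".
SAMPLE_CSV = """authors,title,year,institution
"Doe, J.; Roe, R.",First sample report,1921,Example Laboratory
"Poe, E.",Second sample report,1934,Example Laboratory
"""


def csv_to_ris(csv_text, reference_type="RPRT"):
    """Convert CSV rows into RIS records (one tagged block per row)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines = ["TY  - " + reference_type]  # RPRT is the RIS type for reports
        for author in row["authors"].split(";"):
            if author.strip():
                lines.append("AU  - " + author.strip())
        lines.append("TI  - " + row["title"])
        lines.append("PY  - " + row["year"])
        if row.get("institution"):
            lines.append("PB  - " + row["institution"])
        lines.append("ER  - ")  # every RIS record ends with an ER tag
        records.append("\n".join(lines))
    return "\n\n".join(records) + "\n"


if __name__ == "__main__":
    print(csv_to_ris(SAMPLE_CSV))
```

Keeping each author on its own AU line is what preserves the detail that a flat comma-separated row tends to lose.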

In the process of adding this information, Endnote’s integration with OpenURL became useful. Using the standard bibliographic fields, it was possible to launch a search that queried the library’s subscriptions to see if a full-text version existed. And for many articles in Science magazine, a full-text scan was available.

In the short term, links to the JSTOR archive were added to Endnote. In the longer term, it would be useful to put in COinS from the web interface so that every citation could be queried via OpenURL.
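For readers unfamiliar with COinS, the convention is to embed an OpenURL context object in the title attribute of an empty HTML span with the class Z3988, so that browser tools can find the citation and send it to a resolver. A minimal Python sketch follows; the citation values are invented for the example, and a real web interface would fill them from the bibliographic database.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(article_title, journal_title, year, volume=None, start_page=None):
    """Return a COinS span describing one journal article citation."""
    params = [
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.atitle", article_title),
        ("rft.jtitle", journal_title),
        ("rft.date", str(year)),
    ]
    if volume is not None:
        params.append(("rft.volume", str(volume)))
    if start_page is not None:
        params.append(("rft.spage", str(start_page)))
    # URL-encode the key/value pairs, then HTML-escape the result so the
    # ampersands are safe inside an attribute value.
    title_value = escape(urlencode(params), quote=True)
    return '<span class="Z3988" title="{}"></span>'.format(title_value)


if __name__ == "__main__":
    # Made-up citation used only to show the shape of the output.
    print(coins_span("An example article", "Science", 1925, volume=61, start_page=100))
```

Because the span is empty, it adds nothing visible to the citation display; it only makes each record machine-readable in place.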

Cold Spring Harbor Laboratory already had a site license for Endnote, so switching to RefWorks wasn’t feasible. In addition, the local version of Connotea isn’t exactly lightweight to deploy, requiring two MySQL databases and memcached to handle the online load. Since Nature is currently funding the open-source project, questions were raised about the continuing development of the project.

The archives clerk finished the authority control work on the Endnote database, which included hand-checking the references to the print version of the annual reports. Once this was completed, a need was voiced to make these references available online.

Ron Jerome of the National Research Council Canada Institute for Chemical Process and Environmental Technology wrote a Bibliography module for Drupal which allows Endnote import in .enw or XML formats. This module is currently being extended to allow Open Archives Initiative harvesting.

This module was installed, and the 2200 Cold Spring Harbor Laboratory publications from 1890-1950 were imported into MySQL. The display is clear, and the default display is citation format. All other fields were imported, but live in the database for display on demand.

This module holds great promise for archive integration, since harvesting by OAI would allow libraries to harvest the records from web resources that aren’t specifically enabled for archives management. Endnote format is the lowest-barrier format for scientists and researchers.
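As a rough idea of what harvesting involves, an OAI-PMH harvester sends an HTTP request with a verb such as ListRecords and reads back XML containing Dublin Core records. The Python sketch below builds such a request URL and pulls identifiers and titles out of a response. The base URL and the sample response are made up for the example, and nothing here describes how the Drupal module itself will expose its records.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hand-written sample response, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Annual report (sample)</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        pairs.append((identifier, title))
    return pairs


if __name__ == "__main__":
    # https://example.org/oai is a placeholder, not a real repository.
    print(list_records_url("https://example.org/oai"))
    print(parse_records(SAMPLE_RESPONSE))
```

A production harvester would also follow resumption tokens and handle deleted records, which this sketch leaves out.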

In the future, Cold Spring Harbor Laboratory hopes to integrate these early records with the other archives collections managed by Digitool. For now, other laboratories and libraries can use Drupal and the Bibliography module for easy reference sharing.

Open World Cat

In typical OCLC style, a quiet revolution is brewing. Formerly a subscription-only database, WorldCat has begun to propagate into search engines–Google, Yahoo, and Ask in particular–and with the merger with RLG, it looks like a truly spectacular interface could be created to the union catalog.

In the meantime, it’s curious that OCLC chose to use an ISBN-based permalink structure instead of OpenURL. It does showcase FRBR, but beyond that it’s not very interoperable.

The real question is, will OCLC enter the SEO (search engine optimization) business so that library results show on the first page?

Open URL

Open URL solves the appropriate copy issue, but many other questions have sprung up for library discussion.

You can learn more in Roy Tennant and Carol Tenopir’s forthcoming July columns.

Should Google have a list of resolvers? What about Microsoft?
Is it useful for OCLC to be developing a registry?
Why is the usability so poor? Pop-up window after pop-up window…
Do users want a limit to full-text programmed for them?
Should it be as easy as writing a weblog entry to link to library subscription resources? The inventors of COinS think so.