Repurposing Metadata

As the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) has become a central component of digital library projects, increased attention has been paid to the ways metadata can be reused. As every computing project eventually learns, the data available for harvesting is only as good as the data entered. Given these quality issues, larger questions remain about how to reuse this valuable metadata once it has been described, cataloged, annotated, and abstracted.

Squeezing metadata into a juicer
As is often the case, the standards and library communities were out in front in thinking about how to make metadata accessible in a networked age. With the understanding that most metadata creators would be professionals, the Dublin Core standard left choices about repeating elements and the like to local practice.

This has proved to be an interesting choice, since validators and computers tend to look unfavorably on unique choices that may make sense only locally. And as the weblog revolution that started in 2000 reached even the largest publications by 2006, these tools could not be ignored as a mass source of metadata creation.

Reusing digital objects
In the original 2006 proposal to the Mellon Foundation, Carl Lagoze wrote: “Terms like cyberinfrastructure, e-scholarship, and e-science all describe a concept of data-driven scholarship where researchers access shared data sets for analysis, reuse, and recombination with other network-available resources. Interest in this new scholarship is not limited to the physical and life sciences. Increasingly, social scientists and humanists are recognizing the potential of networked digital scholarship. A core component of this vision is a new notion of the scholarly document or publication. Rather than being static and text-based, this scholarly artifact flexibly combines data, text, images, and services in multiple ways regardless of their location and genre.”

After being funded, this proposal has turned into something interesting, with digital library participation augmented by representatives of Microsoft, Google, and other large companies. Since Atom feeds have garnered much interest and become an IETF proposed standard (RFC 4287), there is community interest in bringing these worlds together. Now known as Open Archives Initiative Object Reuse and Exchange (OAI-ORE), the alpha release is drawing interesting reference implementations as well as criticism for the methods used to develop it.

Resource maps everywhere
Using existing web tools is a good example of working to extend rather than to invent. As Herbert van de Sompel noted in his Fall NISO Forum presentation, “Materials from repositories must be re-usable in different contexts, and life for those materials starts in repositories, it does not end there.” And as the Los Alamos National Laboratory Library experiments have shown, the amount of reuse that’s possible when you have journal data in full text is extraordinary.

Another potential use of OAI-ORE beyond the repositories it was meant to assist can be found in the Flickr Commons project. With pilot implementations from the Library of Congress, the Powerhouse Museum, and the Brooklyn Museum, OAI-ORE could play an interesting role in aggregating user-contributed metadata for evaluation, too. Once tags have been assigned, this metadata could be collected for further curation. In the same NISO Forum presentation, van de Sompel showed a Flickr photoset as an example of a compound information object.

Anything but lack of talent
A great way to understand the standard is to see it in use. Michael Giarlo of the Library of Congress developed a plugin for WordPress, a popular content management system that natively generates Atom feeds. His plugin generates a resource map that is valid Atom and contains some Dublin Core elements, including title, creator, publisher, date, language, and subject. This resource map can be transformed into RDF triples via GRDDL, which in turn facilitates reuse by the linked data community.

This turns metadata creation on its head, since the Dublin Core elements are taken directly from what the weblog author enters: the title, the author’s name, the subjects that were assigned, and the date and time of the entry. One problem OAI-ORE promises to solve is how to connect disparate URLs into one unified object, which the use of Atom simplifies.
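
For the curious, here is a rough sketch of what consuming such a resource map could look like. The feed URL is invented, and the exact element layout of the plugin’s output is an assumption, not gospel:

from urllib.request import urlopen
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_dc(feed_url):
    """Yield (field, value) pairs for Dublin Core elements in each Atom entry."""
    tree = ET.parse(urlopen(feed_url))
    for entry in tree.iter(ATOM + "entry"):
        for field in ("title", "creator", "publisher", "date", "language", "subject"):
            for el in entry.iter(DC + field):
                if el.text:
                    yield field, el.text.strip()

# The resource map URL below is hypothetical.
for field, value in harvest_dc("http://example.org/weblog/feed/resourcemap"):
    print(field + ": " + value)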

As the OAI-ORE specification moves into beta, it will be interesting to see if the constraints of the wider web world will breathe new life into carefully curated metadata. I certainly hope it does.

Presenting at ALA panel on Future of Information Retrieval

The Future of Information Retrieval

Ron Miller, Director of Product Management, HW Wilson, hosts a panel of industry leaders including:
Mike Buschman, Program Manager, Windows Live Academic, Microsoft.
R. David Lankes, PhD, Director of the Information Institute of Syracuse, and Associate Professor, School of Information Studies, Syracuse University.
Marydee Ojala, Editor, ONLINE, and contributing feature and news writer for Information Today, Searcher, EContent, and Computers in Libraries, among other publications.
Jay Datema, Technology Editor, Library Journal

Add to calendar:
Monday, 25 June 2007
8-10 a.m., Room 103b
Preliminary slides and audio attached.

Open source metasearch

Now there’s a new kid on the (meta)search block. LibraryFind, an open-source project funded by the State Library of Oregon, is currently live at Oregon State University. The library has just packaged up a release for anyone to download and install.

Jeremy Frumkin, Gray Chair for Innovative Library Services at OSU, said the goals were to support scholarly workflow, remove barriers between the library and Web information, and establish the digital library as a platform.

Lead developers Dan Chudnov, soon to join the Library of Congress’s Office of Strategic Initiatives, and Terry Reese, catalog librarian and developer of the popular application MarcEdit, worked with the following guiding principles: two clicks–one to find, one to get; results in four seconds; and known, adjustable results ranking.

Other OSU project members included Tami Herlocker, point person for interface development, and Ryan Ordway, system administrator. Frumkin said, “The Ruby on Rails platform provided easy, quick user interface development. It gives a variety of UI possibilities, and offers new interfaces for different user groups.”

The application includes collaborations on the OpenURL module from Ross Singer, library applications developer at the Georgia Tech library, and Ed Summers, Library of Congress developer. Journal coverage can be imported from a Serials Solutions export, and more import facilities are planned in upcoming releases.

OSU is working on a contract with OCLC to download WorldCat data, and is looking to build greater trust relationships with vendors. “The upside for vendors is they can see how their data is used when developing new services,” Frumkin said.

Future enhancements include an information dashboard and a personal digital library. Developers are also staffing a support chatroom for technical support, help, and development discussion of LibraryFind.

Casey Bisson named one of first winners of Mellon Award for Technology Collaboration

Casey Bisson, information architect at Plymouth State University, was presented with a $50,000 Mellon Award for Technology Collaboration by Tim Berners-Lee at the Coalition for Networked Information meeting in Washington, DC, on December 4.

His project, WP-OPAC, is seen as the first step for allowing library catalogs to integrate with WordPress, a popular open-source content management system.

The awards committee included Mitchell Baker, Mozilla; Tim Berners-Lee, W3C; Vinton Cerf, Google; Ira Fuchs, Mellon; John Gage, Sun Microsystems; Tim O’Reilly, O’Reilly Media; John Seely Brown; and Donald Waters, Mellon. Berners-Lee said, “These awards are about open source. It’s a good thing because it makes our lives easier, and the award winners used open source to solve problems.”

Library of Congress?
The revolutionary part of the announcement, however, was that Plymouth State University would use the $50,000 to purchase Library of Congress catalog records and redistribute them free under a Creative Commons Share-Alike or GNU license. OCLC has been the source of catalog records for libraries, and its license restrictions do not permit reuse or redistribution. However, catalog records have been shared via Z39.50 for several years without incident.

“Libraries’ online presence is broken. We are more than study halls in the digital age. For too long, libraries have been coming up with unique solutions for common problems,” Bisson said. “Users are looking for an online presence that serves them in the way they expect.” He said, “The intention is to bring together the free or nearly-free services available to the user.”

Free download
Bisson said Plymouth State University is committed to supporting WP-OPAC, and will be offering it as a free download from its site, likely in the form of sample records plus WordPress with WP-OPAC included. “With nearly 140,000 registered users of Amazon Web Services, it’s time to use common solutions for our unique problems,” Bisson said.

The internal data structure works with iCal for calendar information and Flickr for photos, and can be used with historical records. It allows libraries to go beyond Library of Congress subject headings, Bisson said. Microformats are key to the internal data, and the OpenSearch API is used for interoperability. Bisson is looking at adding unAPI and OAI support in the future.
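
As a rough illustration of the OpenSearch piece, the sketch below queries a hypothetical catalog endpoint and reads back Atom results; WP-OPAC’s actual URL template and result format may differ:

from urllib.request import urlopen
from urllib.parse import quote
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
OS = "{http://a9.com/-/spec/opensearch/1.1/}"

def opensearch(template, terms):
    # OpenSearch description documents publish a URL template;
    # {searchTerms} marks where the query belongs.
    url = template.replace("{searchTerms}", quote(terms))
    tree = ET.parse(urlopen(url))
    total = tree.findtext(".//" + OS + "totalResults", default="unknown")
    titles = [e.findtext(ATOM + "title", default="") for e in tree.iter(ATOM + "entry")]
    return total, titles

# The endpoint is hypothetical; a real template comes from the site's
# OpenSearch description document.
total, titles = opensearch("http://example.edu/opac/opensearch?q={searchTerms}", "shaker furniture")
print(total, "results; first few:", titles[:3])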

At this time, there is no connection to the University of Rochester Mellon-funded project which is prototyping a new extensible catalog, though both are funded by Mellon. [see LJ Baker’s Smudges, 9/1/2006]

Other winners include: Open University (Moodle), RPI (Bedework), University of British Columbia Vancouver (Open Knowledge Project), Virginia Tech (Sakai), Yale (CAS single sign-on), University of Washington (Pine and IMAP), Internet Archive (Wayback Machine), and Humboldt State University (Moodle).

LITA National Forum 2006

“Shift Happens”
Preservation, entertainment in the library, and integrating Library 2.0 into a Web 2.0 world dominated the Library and Information Technology Association (LITA) National Forum in Nashville, TN, October 26-29, 2006.
With 378 registered attendees from 43 states and several countries, including Sweden and Trinidad, attendance held steady with previous years, though the Internet Librarian conference, held in the same week, attracted over 1300 librarians.

Free wireless has still not made it into technology conferences, though laptops were clearly visible, and the LITA blog faithfully kept up with sessions for librarians who were not able to attend.

Keynotes
The forum opened with a fascinating talk from librarians at the Country Music Hall of Fame entitled “Saving America’s Treasures.” Using Bridge Media Solutions in Nashville as a technology partner, the museum has migrated unique content from the Grand Ole Opry, including the first known radio session from October 14, 1939, as well as uncovering demos on acetate and glass from Hank Williams. The migration project uses open source software and will generate MARC records that will be submitted to OCLC.

Thom Gillespie of Indiana University described his shift from being a professor in the Library and Information Science program to launching a new program in the Telecommunications department. The MIME program for art, music, and new media has propelled students into positions at LucasArts, Microsoft, and other gaming companies. Gillespie said the program has practical value: “Eye candy was good but it’s about usability.” Saying that peering in is the first step but that authoring citizen media is the future, he posed a provocative question: “What would happen if your library had a discussion of the game of the month?”

Buzz
Integration into user environments was a big topic of discussion. Peter Webster of St. Mary’s University, Halifax, Canada, spoke about how embedded toolbars are enabling libraries to be present where users search.

Annette Bailey, digital services librarian at Virginia Tech, announced that the LibX project has received funding for two years from IMLS to expand their research toolbar into Internet Explorer as well as Firefox, and will let librarians build their own test editions of toolbars online.

Presenters from the Los Alamos National Laboratory described their work with MPEG-21, a new standard from the Moving Picture Experts Group. The standard reduces some of the ambiguities of METS, and allows for unique identifiers in locally loaded content. Material from Biosis, Thomson’s Web of Science, APS, the Institute of Physics, Elsevier, and Wiley is being integrated into cataloging operations and existing local Open Archives Initiative (OAI) repositories.

Tags and Maps
The University of Rochester has received funding for an open source catalog, which they are calling the eXtensible Catalog (xC). Using an export of 3 million records from their Voyager catalog, David Lindahl and Jeff Susczynski described how their team used User Centered Design to conduct field interviews with their users, sometimes in their dorm rooms. They have prototyped four different versions of the catalog, and CUPID 4 includes integration of several APIs, including Google, Amazon, Technorati, and OCLC’s xISBN. They are actively looking for partners for the next phase, and plan to work on issues with diacritics, incremental updates, and integrating holdings records, potentially using the NCIP protocol.

Challenge
Steven Abram, of Sirsi/Dynix and incoming SLA president, delivered the closing keynote, “Web 2.0 and Library 2.0 in our Future.” Abram and Sirsi/Dynix have conducted research on 15,000 users, which highlighted the need for community, learning, and interaction. He asked the audience, “Are you working in your comfort zone or my end user’s comfort zone?” In a somewhat controversial set of statements, Abram compared open source software to being “free like kittens” and challenged librarians about the “My OPAC sucks” meme that’s been popular this year. “Do your users want an OPAC, or do they want information?”

Stating that libraries need to compete in an era when education is moving towards the distance learning model, Abram asked, “How much are we doing to serve the user when 60-80% of users are virtual?” Saying that librarians help people improve the quality of their questions, Abram said that major upcoming challenges include 50 million digitized books coming online in the next five years. “What is at risk is not the book. It’s us: librarians.”

Using Drupal to put Endnote online

There is still no easy way to manage a library of references on a personal or institutional site. Librarians who want to put up a list of institutional publications, or researchers who want to share references, are hemmed in by software limitations, privacy concerns, or technical roadblocks. This problem has been mitigated by an open source CMS with a handy bibliographic data module.

The Drupal content management system is attractive to many librarians and information scientists because of its deep use of taxonomy. Daniel Chudnov uses it to power Open Source Systems for Libraries and his personal weblog, One Big Library. Roy Tennant uses Drupal for TechEssence.info, and the Ann Arbor Public Library uses it for user registration, resource weblogs, and the overall site.

However, the state of the art in bibliographic management and collaboration is still stuck in 1990. When a writer wants to collect articles, there are a number of client applications, all owned by Thomson ISI ResearchSoft (Endnote, ProCite, and Reference Manager, plus WriteNote), that do a nice job of saving references and integrating with word processors to format citations.

Endnote is the most commonly-used program, but it was not designed to share references. Modern science is all about collaboration, from grant proposals to international research. In the worst case, sharing an Endnote library on a network server can cause corruption. In the best case, shared Endnote libraries are limited to read-only if another person has it open, which limits collaboration.

A version of Endnote Web has been in development for most of 2006, and is promised by January of next year. Early reports of integration with Web of Science tell of limited functionality and interoperability.

In 2002, a number of former Reference Manager employees waited for their non-compete agreements with ISI to expire, then founded RefWorks, an online version of the familiar bibliographic managers. In the last two years, applications including Connotea and CiteULike have added bibliographic manager capabilities to their social bookmarking services. Both allow RIS and BibTeX upload and download, with systems managed at Nature Publishing Group and the University of Manchester, respectively.

At Cold Spring Harbor Laboratory, the institution’s annual reports have listed lab publications for over 100 years. These references have not been added to PubMed, which still only goes back to 1950. Thus, this unique information needed to be put into a format so that scholars could cite the early history of genetics, and the tragic misfire of eugenics research.

Many approaches were tried. One early method was programmer-centric: the data was entered into a SQL database, and a web front-end was scripted to add basic fields. While this was a promising start, it left out the rich data fields that enable bibliographic managers to capture complete citation information.

Since the library was examining digital asset management systems, Greenstone was assessed for its citation abilities. Ian Witten was able to jury-rig a solution that imported RIS information about citations, but getting them to display fully wasn’t simple.

As the prototyping continued, the initial database of 1800 records was exported from the SQL database into comma-separated value (CSV) format and imported into Endnote. The archives clerk started assessing the reference types and added new fields. For example, an Institution field was added so records could be sorted by institution name. A new reference type was added for non-standard reports.

In the process of adding this information, Endnote’s integration with OpenURL became useful. Using the standard bibliographic fields, it was possible to launch a search that queried the library’s subscriptions to see if a full-text version existed. And for many articles in Science magazine, a full-text scan was available.

In the short-term, links to the JSTOR archive were added to Endnote. Longer-term, it would be useful to put in COinS from the web interface so that every citation could be queried via OpenURL.
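
For illustration, here is a minimal sketch of generating a COinS span with the standard Z39.88-2004 journal fields. The citation is sample data, and any real deployment would emit these spans from the site’s templates:

from urllib.parse import urlencode
import html

def coins_span(atitle, jtitle, date, volume=None, spage=None):
    # Key-encoded-value (KEV) context object per Z39.88-2004.
    kev = [("ctx_ver", "Z39.88-2004"),
           ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
           ("rft.atitle", atitle),
           ("rft.jtitle", jtitle),
           ("rft.date", date)]
    if volume:
        kev.append(("rft.volume", volume))
    if spage:
        kev.append(("rft.spage", spage))
    # OpenURL resolvers and browser extensions look for class="Z3988".
    return '<span class="Z3988" title="%s"></span>' % html.escape(urlencode(kev))

print(coins_span("Sex Limited Inheritance in Drosophila", "Science", "1910", "32", "120"))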

Cold Spring Harbor Laboratory already had a site license for Endnote, so switching to RefWorks wasn’t feasible. In addition, the local version of Connotea isn’t exactly lightweight to deploy, requiring two MySQL databases and memcached to handle the online load. Since Nature is currently funding the open-source project, questions were raised about its continuing development.

The archives clerk finished the authority control work on the Endnote database, which included hand-checking the references to the print version of the annual reports. Once this was completed, a need was voiced to make these references available online.

Ron Jerome of the National Research Council Canada Institute for Chemical Process and Environmental Technology wrote a Bibliography module for Drupal which allows Endnote import in .enw or XML formats. This module is currently being extended to allow Open Archives Initiative harvesting.

This module was installed, and the 2200 Cold Spring Harbor Laboratory publications from 1890-1950 were imported into MySQL. The display is clean, and the default view is citation format. All other fields were imported and live in the database for display on demand.
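
For the curious, here is a minimal sketch of reading Endnote’s tagged .enw export, in the spirit of what the module’s importer does. The module itself is PHP; this Python sketch and the file name are illustrative only:

def parse_enw(path):
    """Collect records from an Endnote tagged (.enw) export.

    Each line starts with a tag such as %0 (reference type), %A (author),
    %T (title), %J (journal), or %D (year); blank lines separate records.
    """
    records, rec = [], {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line.strip():          # blank line ends a record
                if rec:
                    records.append(rec)
                    rec = {}
                continue
            tag, _, value = line.partition(" ")
            rec.setdefault(tag, []).append(value.strip())
    if rec:
        records.append(rec)
    return records

for rec in parse_enw("annual_reports.enw"):  # hypothetical export file
    print(rec.get("%D", ["?"])[0], rec.get("%T", ["?"])[0])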

This module holds great promise for archive integration, since OAI harvesting would allow libraries to collect records from web resources that aren’t specifically enabled for archives management. Endnote format offers the lowest barrier to entry for scientists and researchers.

In the future, Cold Spring Harbor Laboratory hopes to integrate these early records with the other archives collections managed by DigiTool. For now, other laboratories and libraries can use Drupal and the Bibliography module for easy reference sharing.

Open Archives Initiative

Following the success of OpenURL, the Open Archives Initiative has been one of the most promising developments in the digital library world. Tools like OAIster (pronounced “oyster”), the National Science Digital Library, and the IMLS Digital Collections Registry show there has been a dramatic uptake in the number of libraries and tools that have implemented it.

This relatively lightweight protocol was designed to make sharing metadata as simple as RSS aggregation. As the number of adopters has risen, the aggregators have hit a few XML-related snags.

In short, metadata is user input. First law of programming: Never trust user input.
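
A defensive harvester makes the point concrete. This minimal sketch, aimed at a hypothetical repository URL, requests ListRecords and catches parse errors per response rather than letting one bad archive take down the whole run:

from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

def list_records(base_url, prefix="oai_dc"):
    params = {"verb": "ListRecords", "metadataPrefix": prefix}
    while True:
        raw = urlopen(base_url + "?" + urlencode(params)).read()
        try:
            root = ET.fromstring(raw)  # the "never trust user input" step
        except ET.ParseError as err:
            print("Skipping malformed response:", err)
            return
        for rec in root.iter(OAI + "record"):
            yield rec
        token = root.findtext(".//" + OAI + "resumptionToken")
        if not token:
            return
        # Per the protocol, resumed requests carry only the token.
        params = {"verb": "ListRecords", "resumptionToken": token}

for rec in list_records("http://example.org/oai"):  # hypothetical repository
    print(rec.findtext(".//" + OAI + "identifier"))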

Many papers at library conferences are designed to showcase a particular implementation that went better than expected. That’s great–it’s always good to see libraries succeeding. However, it takes much more courage to share lessons learned, so that pitfalls can be avoided.

The winning paper at JCDL 2006 was written by Carl Lagoze, one of the original architects of the OAI protocol. In the paper, “Metadata aggregation and ‘automated digital libraries’: A retrospective on the NSDL experience,” he shares his rude awakening that many OAI archives are stuck with XML that doesn’t validate, which makes aggregators like the NSDL subject to truckloads of autogenerated emails.

As Dorothea’s commentary put it:

“The winning non-student paper both amused and frustrated me. Carl Lagoze talked about the National Science Digital Library, and how it was believed that the Magic Metadata Fairy would use OAI-PMH to build a beautiful searchable garden of science, and how everyone ended up with an ugly, weed-choked, cracked-asphalt vacant lot instead.”

She goes on to say what few technologists want to say. People still matter.

“I’ll be blunt. The solution for NSDL’s problem is hiring cataloguers, or metadata librarians, or indexers/abstracters, or whatever you want to call ’em, to clean up the incoming garbage. Ideally, OAI-PMH would be a two-way protocol, so that nice cleaned-up metadata made its way back to the repository that had spewed the garbage in the first place. That, however (despite all the jaw-flapping about frameworks that went on during JCDL) does not seem to be in the offing. It should be.”

Catalogers still matter. Especially the new breed of catalogers.