Jay Datema

Open Data: What Would Kilgour Think?

The New York Public Library has reached a settlement with iBiblio, the public’s library and digital archive at the University of North Carolina at Chapel Hill, over iBiblio’s harvesting of records from the NYPL Research Libraries catalog, which NYPL claims is copyrighted.

Heike Kordish, director of the NYPL Humanities Library, said a cease and desist letter was sent because of a 1980s incident in which an Australian harvesting effort turned around and resold the NYPL records.

Simon Spero, iBiblio employee and technical assistant to the assistant vice chancellor at UNC-Chapel Hill, said NYPL requested that its library records be destroyed, and the claim was settled with no admission of wrongdoing. “I would characterize the New York Public Library as being neither public nor a library,” Spero said.

It is a curious development: while the NYPL is making arrangements under private agreements to allow Google to scan its book collection into full text, it feels free to threaten other research libraries over MARC records.

The price of open data
This follows a similar string of disagreements about open data involving OCLC and the MIT Simile project. The Barton Engineering Library catalog records were made widely available via BitTorrent, a decentralized file-sharing protocol.

This has since been resolved by making the Barton data available again, though in RDF and MODS, not MARC, under a Creative Commons license for non-commercial use.

OCLC CEO Jay Jordan said the issues around sharing data had their genesis in concerns about the Open WorldCat project and sharing records with Microsoft, Google, and Amazon. Other concerns about private equity firms entering the library market also drove recent revisions to the data sharing policies.

OCLC quietly revised its policy about sharing records. The policy had not been updated since 1987, when it followed numerous debates in the 1980s about the legality of copyrighting member records.

The new WorldCat policy reads in part, “WorldCat® records, metadata and holdings information (“Data”) may only be used by Users (defined as individuals accessing WorldCat via OCLC partner Web interfaces) solely for the personal, non-commercial purpose of assisting such Users with locating an item in a library of the User’s choosing… No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

The most recent board minutes show that concerns have been raised about “the risk to FirstSearch revenues from OpenWorldCat,” and that management incentive plans have been approved.

What is good for libraries?
Another project initiated by Simon Spero, entitled Fred 2.0 after the recently deceased Fred Kilgour of OCLC, Yale, and Chapel Hill fame, recently released Library of Congress authority file and subject information, which was gathered by means similar to those used for the NYPL records.

Spero said the project is “dedicated to the men and women at the Library of Congress and outside, who have worked for the past 108 years to build these authorities, often in the face of technology seemingly designed to make the task as difficult as possible.”

Since Library of Congress data, as free government information, by definition cannot be copyrighted, the project was more collaborative in nature and has received acclaim for its help in pointing out cataloging irregularities in the records. OCLC also offers a linked authority file as a research project.

Firefox was born from open source
While there is not yet a consensus about what will be built from released library data, the move can be compared to Netscape open-sourcing the Mozilla code in 1998, which eventually brought Firefox and other open source projects to light. It also shows that the financial motivations of library organizations by necessity dictate the legal mechanisms of protection.

Kafka

“There are two cardinal human vices, from which all the others derive their being: impatience and carelessness. Impatience got people evicted from Paradise; carelessness kept them from making their way back there. Or perhaps there is one cardinal vice: impatience. Impatience got people evicted, and impatience kept them from making their way back.”—Franz Kafka
The Zürau aphorisms of Franz Kafka

Franz Kafka; Schocken Books 2007


Life Archive

Many libraries, including the Greenwich Public Library (CT), have oral history collections of residents, famous and not so famous. But as the US population ages, people are starting to wonder if what they’re creating online will survive them.

Libraries have always kept some kind of vertical file for local residents. The DeKalb Public Library (IL) has a file on author Richard Powers, which recently proved valuable when The Echo Maker won the National Book Award.

Perhaps it’s time for libraries to run their own blog aggregators, so that the next Richard Powers’ juvenilia can be preserved for posterity. Open source aggregators exist, from Gregarius (PHP) to Planet (Python) to Plagger (Perl).
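As a rough illustration of what such an aggregator does at its core, the sketch below parses an RSS 2.0 feed and keeps each item it has not seen before. It is not taken from Gregarius, Planet, or Plagger; the sample feed, the function names, and the in-memory store are all assumptions made for the example, and a real service would fetch feeds over HTTP and write to durable storage.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real aggregator would fetch this over HTTP.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Local Author Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/posts/1</link>
      <pubDate>Mon, 05 Mar 2007 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/posts/2</link>
      <pubDate>Tue, 06 Mar 2007 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of dicts, one per <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


def archive(store, rss_text):
    """Add unseen items to store (a dict keyed by link); return how many were new."""
    added = 0
    for entry in parse_items(rss_text):
        if entry["link"] and entry["link"] not in store:
            store[entry["link"]] = entry
            added += 1
    return added


store = {}
print(archive(store, SAMPLE_RSS))  # count of new items on the first pass
print(archive(store, SAMPLE_RSS))  # a repeat pass finds nothing new
```

Keying the store on the item link is one simple way to avoid duplicate copies when the same feed is polled repeatedly.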

Dave Winer, popular for an early vision of weblogs, RSS, and podcasting, among other things, wrote in a post entitled Future Archives, “When a scholar dies, he or she leaves behind a life of work, papers, unfinished manuscripts, notebooks, pictures, recordings, and nowadays computers, disks and websites. Their family and university generally don’t know what to do with them, often the problem is given to the libraries.” Winer went on to say, “Our thought is to try to anticipate the problem, while the scholar is alive, and now that our work is largely electronic, to have it future-safe at all times, leave no work for the librarian, let the families and colleagues deal with the death of a relative and colleague at a personal level, and not as a professional problem.”

Amazon’s Simple Storage Service (S3), which offers metered storage on its servers, has been discussed as one possible solution. Other internet service providers have seen this need, and offer their own solutions. Joyent has Strongspace, which promises to give “a secure place to gather, backup and share any type of file.” Dreamhost has Files Forever, which promises to “keep uploaded files private [to use] as a permanent archive.”

Solution within reach
Jon Udell, Microsoft technology evangelist and pioneer of LibraryLookup, has been thinking along the same lines, writing, “I have ventured into this confusing landscape because I think that the issues that libraries and academic publishers are wrestling with (persistent long-term storage, permanent URLs, reliable citation indexing and analysis) are ones that will matter to many businesses and individuals. As we project our corporate, professional, and personal identities onto the web, we’ll start to see that the long-term stability of those projections is valuable and worth paying for.”

In Udell’s podcast with Dan Chudnov, librarian and technologist, they discuss possible alternatives. Chudnov went on to post a vision, drawn from a 2004 conference discussion, of what a library project dedicated to archiving weblogs would look like (see below); it has since been updated to include Atom. This service, which mirrors the journal archive service LOCKSS (Lots of Copies Keep Stuff Safe), holds promise for keeping electronic content from falling into a digital black hole.

Weblog mirroring system diagram, originally uploaded by dchud.

code4lib 2007

Working Code Wins
Responding to increasing consolidation in the ILS market, library developers demonstrated alternatives and supplements to library software at the second annual code4lib conference in Athens, GA, February 27-March 2, 2007. With 140 registered attendees from many states and several countries, including Canada and the United Kingdom, the conference was a hot destination for a previously isolated group of developers.

Network connectivity was a challenge for the Georgia Center for Continuing Education, but the hyperconnected group kept things interesting, and attendees, coordinated by Roy Tennant, artfully architected workarounds and improvements as the conference progressed.

In a nice mixture of emerging conference trends, code4lib combined the flexibility of the unconference with 20-minute prepared talks, keynotes, five-minute Lightning Talks, and breakout sessions. The form was derived from Access, the Canadian library conference.

Keynotes
The conference opened with a talk from Karen Schneider, associate director for technology and research at Florida State University. She challenged the attendees to sell open source software to directors in terms of the solutions it provides, since the larger issue in libraries is saving digital information. From the stage, Schneider also debated the importance of open source software with Ben Ostrowsky, systems librarian at the Tampa Bay Library Consortium, who responded, “Isn’t that Firefox [a popular open source browser] you’re using there?”

Erik Hatcher, author of Lucene in Action, gave a keynote about using the full-text search server Apache Solr, the open-source search engine Lucene, and the faceted browser Flare to construct a new front-end to library catalog data. The previous day, Hatcher led a free preconference for 80 librarians, including attendees from Villanova University and the University of Virginia, who brought exported MARC records.

Buzz
One of the best-received talks revolved around BibApp, an “institutional bibliography” written in Ruby on Rails by Nate Vack and Eric Larson, two librarians at the University of Wisconsin-Madison. The prototype application is available for download, but currently relies on citation data from engineering databases to construct a profile of popular journals, publishers, citation types, and who researchers are publishing with. “This is copywrong, which is sometimes what you have to do to construct digital library projects. Then you get money to license it,” Larson said.

More controversially, Luis Salazar gave a talk about using Linux to power public computing in the Howard County (MD) public library system. A former NSA systems administrator, he presented the pros and cons of supporting 300 staff and 400 public access computers using Groovix, a customized Linux distribution. Since the abundant number of computers serves the public without needing sign-up sheets, “patrons are able to sit down and do what they want.”

Salazar created a script for monitoring all the public computers, and described how he engaged in a dialog with a patron he dubbed “Hacker Jon,” who used the library computers to develop his nascent scripting skills. Bess Sadler, librarian and metadata services specialist at the University of Virginia, asked about the privacy implications of monitoring patrons. “Do you have a click-through agreement? Privacy Policy?” she asked. Salazar joked that “It’s Maryland, we’re like a communist country” and said he wouldn’t do anything in a public library that he wouldn’t expect to be monitored.

Casey Durfee presented a talk on “Endeca in 250 lines of code or less,” which showed a prototype of faceted searching at the Seattle Public Library. The new catalog front-end sits on top of a Horizon catalog, and uses Python and Solr to present results in an elegant display, from a Google-inspired single search box start to rich subject browse options.
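For readers unfamiliar with faceted searching, the sketch below shows the basic idea in plain Python: run a keyword search, count how many hits fall under each value of a field, and let the user narrow by one of those values. It is not Durfee’s code and does not call Solr, which computes facet counts on the server; the records, field names, and functions here are invented for illustration.

```python
from collections import Counter

# Hypothetical catalog records; the field names are illustrative only.
RECORDS = [
    {"title": "The Echo Maker", "format": "Book", "subject": ["Fiction"]},
    {"title": "Lucene in Action", "format": "Book",
     "subject": ["Computers", "Search engines"]},
    {"title": "Dreaming in Code", "format": "Audiobook", "subject": ["Computers"]},
]


def _values(record, field):
    """Return a field's values as a list, whether stored as a string or a list."""
    value = record.get(field, [])
    return value if isinstance(value, list) else [value]


def search(records, query, filters=None):
    """Keyword match on title, then narrow by any selected facet values."""
    filters = filters or {}
    hits = [r for r in records if query.lower() in r["title"].lower()]
    for field, value in filters.items():
        hits = [r for r in hits if value in _values(r, field)]
    return hits


def facet_counts(hits, fields):
    """Count how many hits carry each value of each facet field."""
    return {f: Counter(v for r in hits for v in _values(r, f)) for f in fields}


hits = search(RECORDS, "in")
print(facet_counts(hits, ["format", "subject"]))
print([r["title"] for r in search(RECORDS, "in", {"format": "Book"})])
```

The counts are what a catalog would display beside each browse option, so a patron can see how many results remain before clicking.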

The future
This year’s sponsors included Talis, LibLime, OCLC, Logical Choice Technologies, and Oregon State University. OSU awarded two scholarships to Nicole Engard, Jenkins Law Library (2007 LJ Mover and Shaker), and Joshua Gomez, Getty Research Institute.

Next year’s conference will be held in Portland, OR.

Taiga 2 Forum moves into Open Space

Assistant University Librarians and Assistant Directors met for the second annual Taiga Forum a day before ALA Midwinter, Seattle, to discuss the changing dynamics of academic libraries.

In a change from last year, the participants used the Open Space structure to stage an unconference, where the conversation topics were chosen by the participants.

Topics included Search, Radical Collaboration, and Google: Friend or Foe, among others. The guiding principles were, “Whoever comes is the right person, whatever happens is the only thing that could have happened, whenever it starts is the right time, and when it’s over, it’s over.” The Endangered Species conference met in an adjoining conference room.

Meg Bellinger, Yale University Associate University Librarian, said, “We came away with the sense that we don’t have all of the answers but we all share the same problems. We must spend time moving beyond the current issues towards solutions.”

The meeting was sponsored by Innovative Interfaces, Inc.

Open source metasearch

Now there’s a new kid on the (meta)search block. LibraryFind, an open-source project funded by the State Library of Oregon, is currently live at Oregon State University. The library has just packaged up a release for anyone to download and install.

Jeremy Frumkin, Gray chair for Innovative Library Services at OSU, said the goals were to contribute to the support of scholarly workflow, to remove barriers between the library and Web information, and to establish the digital library as a platform.

Lead developers Dan Chudnov, soon to join the Library of Congress’s Office of Strategic Initiatives, and Terry Reese, catalog librarian and developer of the popular application MarcEdit, worked with the following guiding principles: two clicks (one to find and one to get), a goal of getting results in four seconds, and known, adjustable results ranking.
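The four-second goal and adjustable ranking can be pictured with a small sketch. The code below is not LibraryFind’s implementation, which is written in Ruby on Rails; it is a Python illustration under assumed names, in which each search target is queried in parallel, anything that has not answered by the deadline is dropped, and per-source weights (a hypothetical tuning knob) adjust the merged ranking.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def metasearch(query, sources, weights, deadline=4.0):
    """Query every source in parallel; keep whatever returns before the deadline.

    sources: dict of name -> callable(query) returning a list of (title, score)
    weights: dict of name -> multiplier applied to that source's scores
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(sources)))
    futures = {pool.submit(fn, query): name for name, fn in sources.items()}
    done, _late = wait(list(futures), timeout=deadline)
    # Do not block on slow sources; their results are simply dropped.
    pool.shutdown(wait=False)

    merged = []
    for future in done:
        if future.exception() is not None:
            continue  # a failing source should not sink the whole search
        name = futures[future]
        for title, score in future.result():
            merged.append((score * weights.get(name, 1.0), title, name))
    merged.sort(key=lambda row: row[0], reverse=True)
    return [(title, name) for _score, title, name in merged]


# Hypothetical stand-ins for remote search targets.
def fast_catalog(query):
    return [(f"{query} (catalog)", 0.9)]


def slow_database(query):
    time.sleep(0.5)
    return [(f"{query} (database)", 1.0)]


sources = {"catalog": fast_catalog, "database": slow_database}
print(metasearch("kafka", sources, {}, deadline=0.2))  # slow source is dropped
print(metasearch("kafka", sources, {"catalog": 2.0}, deadline=3.0))
```

Making the weights an explicit argument is one way to keep ranking "known and adjustable": a library can see exactly why one source outranks another and change it.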

Other OSU project members included Tami Herlocker, point person for interface development, and Ryan Ordway, system administrator. Frumkin said, “The Ruby on Rails platform provided easy, quick user interface development. It gives a variety of UI possibilities, and offers new interfaces for different user groups.”

The application includes collaborations on the OpenURL module from Ross Singer, library applications developer at the Georgia Tech library, and Ed Summers, Library of Congress developer. Journal coverage can be imported from a Serials Solutions export, and more import facilities are planned in upcoming releases.
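For context, an OpenURL is a link that carries a citation as query parameters so that a library’s resolver can route the user to a copy. The sketch below builds one in the OpenURL 1.0 (Z39.88-2004) key/encoded-value format for a journal article. It is not LibraryFind’s module; the resolver address, the function name, and the sample citation (including its ISSN) are made up for the example.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical resolver address; substitute a library's own link resolver.
RESOLVER = "http://resolver.example.edu/openurl"


def journal_openurl(atitle, jtitle, issn, volume, spage, date, base=RESOLVER):
    """Build an OpenURL 1.0 (Z39.88-2004) KEV link for a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.issn", issn),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.date", date),
    ]
    # Skip empty values so partial citations still produce a usable link.
    return base + "?" + urlencode([(k, v) for k, v in params if v])


# Made-up citation, for illustration only.
link = journal_openurl("An Example Article", "Journal of Examples",
                       "1234-5679", "12", "34", "2007")
print(link)
print(parse_qs(urlsplit(link).query)["rft.jtitle"])
```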

OSU is working on a contract with OCLC’s WorldCat to download data, and is looking to build greater trust relationships with vendors. “The upside for vendors is they can see how their data is used when developing new services,” Frumkin said.

Future enhancements include an information dashboard and a personal digital library. Developers are also staffing a support chatroom for technical support, help, and development discussion of LibraryFind.

Dreaming in Code (review)

Salon’s Scott Rosenberg has written an elegant bird’s eye view of modern software development by observing the development of Chandler, an open source calendaring project. It was originally publicized as a way to kill the Exchange server hegemony in much the same way that Apache has dominated Microsoft’s IIS.

Yet, as the subtitle says, “two dozen programmers, three years, 4,732 bugs, and one quest for transcendent software” have not yet resulted in a product ready for general consumption.

The detours have been interesting. We witness the birth of PyLucene, as developers seek a full-text indexing solution that works with their unified repository. And perhaps CalDAV, soon to ship with OS X’s Leopard, will be the project’s legacy.

It’s a compelling vision: a type-agnostic program to manage email, calendar events, and contacts. Yet Google chose dis-integration with its calendar and Gmail. And Apple has made backend data integration possible, but has kept the individual applications separate.

As the project enters its third year, Rosenberg takes a detour into the history of software development. After surveying the hilltop, he makes a modest recommendation. Computer science programs should be more like MFA programs, which require students to study great works, share work, and revise constantly.

In this chapter, 37signals’ Getting Real methodology is held up, along with The Joel Test for software development, as a possible signpost on the road ahead. Since Ruby on Rails came from a simple task list, perhaps there is some life in Getting Real for complicated projects, too.

In fact, the scenery is often as enjoyable as the narrative. I was happy to learn that CivicSpace, a Drupal module/modification, came from Chandler’s benevolent dictator-for-life, Mitch Kapor. An excerpt from the book that delves into the history of Hungarian notation is up at Technology Review.

As the Chandler project continues to take shape, one ponders the irony that if the developers had been using a completed program that fulfilled the dream, their project might be done already. The hardest software to finish may be that which measures time. Perhaps we need the next Proust to reinvent computer science. Until then, Dreaming in Code will have to suffice.

Dreaming in code: two dozen programmers, three years, 4,732 bugs, and one quest for transcendent software

Scott Rosenberg; Crown Publishers 2007


NetConnect Winter 2007 podcast episode 2

This is the second episode of the Open Libraries podcast, and I was pleased to have the opportunity to talk to some of the authors of the Winter netConnect supplement, entitled Digitize This!

The issue covers how libraries can start to digitize their unique collections. K. Matthew Dames and Jill Hurst-Wahl wrote an article about copyright and practical considerations in getting started. They join me, along with Lotfi Belkhir, CEO of Kirtas Technologies, to discuss the important issue of digitization quality.

One of the issues that has surfaced recently is exactly what libraries are receiving from the Google Book Search project. As the project grows beyond the initial five libraries into more university and Spanish libraries, many of the implications have become more visible.

The print issue of NetConnect is bundled with the January 15th issue of Library Journal, or you can read the articles online.

Recommended Books:
Kevin
Knowledge Diplomacy

Jill
Business as Unusual

Lotfi
Free Culture
Negotiating China
The Fabric of the Cosmos

Software
SuperDuper
Google Documents
Arabic OCR

0:00 Music and Intro
1:59 Kevin Dames on his weblog Copycense
2:48 Jill Hurst-Wahl on Digitization 101
4:16 Jill and Kevin on their article
4:34 SLA Digitization Workshop
5:24 Western NY Project
6:45 Digitization Expo
7:43 Lotfi Belkhir
9:00 Books to Bytes
9:26 Cornell and Microsoft Digitization
11:00 Scanning vs Digitization
11:48 Google Scanning
15:22 Michael Keller’s OCLC presentation
16:14 Google and the Public Domain
17:52 Authors Guild sues Google
21:13 Quality Issues
24:10 MBooks
26:56 Public Library digitization
27:14 Incorporating Google Books into the catalog
28:49 CDL contract
30:22 Microsoft Book Search
31:15 Double Fold
39:20 Print on Demand and Digitization
39:25 Books@Google
43:14 History on a Postcard
45:33 iPRES conference
45:46 LOCKSS
46:45 OAIS

Evergreen now has two support options

Evergreen, the open source ILS in use by the Georgia PINES libraries, now has a couple of support options. The developers behind the system have launched Equinox Software, modestly billed as “The Future of Library Automation.”

The company consists of members of the Evergreen development team as well as the Georgia Assistant State Librarian, Julie Walker. Libraries are being offered custom development, hosting, migration, and support.

This is an interesting development, and brings to mind some automation history, from NOTIS originating out of Northwestern to the original Innovative software coming from UC Berkeley.

Tag, you’re it

Another interesting tagging project from the art world is steve.museum, billed as “the first experiment in social tagging of museum collections,” which has recently been funded by IMLS for two years.

At a 17 November New York Technical Services Librarians meeting, Susan Chun of the Metropolitan Museum of Art said steve solves the problem of “additional access points, multilingual information, and things that aren’t often included in art catalog records, like color.” Though the audience was somewhat skeptical, Chun said “steve won’t replace anything, and tags must exist alongside traditional cataloging.”

Though tags like “you will die” may have nebulous value, the Met found that 92% of tags added new information that wasn’t present in traditional sources.

Active since 2005, the tag collection is being studied by social scientists at Princeton and the University of Michigan. Researchers are asking “What produces good tags?” and examining tag types and tag clusters using deduping and stemming analysis.
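To give a sense of what deduping and stemming mean for tags, the sketch below normalizes case and spacing, strips plural endings with a deliberately crude rule, and groups tags that collapse to the same key. It is not the researchers’ method or the steve codebase; the stemmer, function names, and sample tags are assumptions for illustration, and a real study would use an established stemming algorithm.

```python
from collections import defaultdict


def normalize(tag):
    """Lowercase and collapse whitespace so trivially different tags match."""
    return " ".join(tag.lower().split())


def crude_stem(word):
    """Strip simple English plural endings. Deliberately naive, for illustration."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def cluster_tags(tags):
    """Group tags whose normalized, stemmed forms are identical."""
    clusters = defaultdict(set)
    for tag in tags:
        clean = normalize(tag)
        if not clean:
            continue  # ignore blank tags
        key = " ".join(crude_stem(word) for word in clean.split())
        clusters[key].add(clean)
    return dict(clusters)


# Hypothetical visitor-supplied tags.
sample = ["Portrait", "portraits", "  PORTRAIT ", "ladies", "lady", "blue"]
for key, members in sorted(cluster_tags(sample).items()):
    print(key, sorted(members))
```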

The project includes an open API and open source download. Installation is quite simple (requiring PHP and MySQL), but the upload of images requires a custom XML schema for description.