SPARC Innovation Fair

At the podium

From a brief talk given 8 November at the SPARC 2010 Digital Repositories Forum:
Hello, I’m Jay Datema, associate director at the Bern Dibner Library, Polytechnic Institute of NYU. I’m honored to be included in this year’s Innovation Fair at the SPARC conference. I have two minutes, so I’ll keep it short.

My poster is entitled “Full Circle Research: Occam’s Razor for Collection.” As many of you know, Occam’s Razor is a principle taken from the philosopher William of Ockham, who posited that “when several theories model the available facts adequately, the simplest theory is to be preferred.” This principle dates back to the 1300s, so it’s had some time to prove itself. Institutional repositories, on the other hand, are just a decade old.

Simply stated, my poster shows that research is a process that starts with an analysis of publications, which of course will then produce more publications. As Samuel Johnson said, “The greatest part of a writer’s time is spent in reading, in order to write; a man will turn over half a library to make one book.” What is the online equivalent? I suppose it would have to be endless surfing of bibliographies, databases, and PDFs. Research only ends when your attention span falters or a deadline awaits.

Evolution not Revolution

Swimming in salt water is wonderful; drinking it is not. Four hundred years ago, the first American settlers in Jamestown, Virginia, ran into troubles during their first five years because the fresh water they depended upon for drinking turned brackish in the summer. Suddenly, besides the plagues, angry Indians, and crop difficulties, they had to find new sources of fresh water inland. Libraries and publishers are facing a similar challenge as the hybrid world of print and online publications has changed the economic certainties that have kept both healthy.

The past five years in the information world have been full of revolutionary promise, but the new reality has not yet matched the promise of a universal library. Google Scholar promised universal access to scholarly information, yet its dynamic start in 2004 has not brought forth many evolutionary changes since its release. In fact, the addition of Library Links using OpenURL support is the last major new feature Scholar has seen. The NISO standard that enables seamless full-text access has shown its value.

For years, it’s been predicted that the Google Books project would revolutionize scholarship, and in some respects it has done so. But in seeking a balance between cornering Amazon’s market for searching inside books, respecting authors’ rights, finding the rights holders of so-called orphan works, and solving metadata and scanning quality issues, its early promise is not yet fulfilled.

IDPF: Google and Harvard

Libraries And Publishers
At the 2007 International Digital Publishing Forum (IDPF) in New York on May 9, publishers and vendors discussed the future of ebooks in an age increasingly dominated by large-scale digitization projects funded by the deep pockets of Google and Microsoft.

In a departure from the other panels, which discussed digital warehouses and repositories, both planned and in production, from Random House and HarperCollins, Peter Brantley, executive director of the Digital Library Federation, and Dale Flecker of Harvard University Library made a passionate case for libraries in an era of information as a commodity.

Brantley began by mentioning the Library Project on Flickr, and led with a slightly ominous series of slides: “Libraries buy books (For a while longer),” followed by “Libraries don’t always own what’s in the book, just the book (the ‘thing’ of the book).”

He then reiterated the classic rights that libraries protect: The Right to Borrow, Right to Browse, Right to Privacy, and Right to Learn, and warned that “some people may become disenfranchised in the digital world, when access to the network becomes cheaper than physical things.” Given the presentation that followed from Tom Turvey, director of the Google Book Search project, this made sense.

Brantley made two additional points, saying “Libraries must permanently hold the wealth of our many cultures to preserve fundamental Rights, and Access to books must be either free or low-cost for the world’s poor.” He departed from conventional thinking on access, though, when he argued that this low-cost access didn’t need to include fiction. Traditionally, libraries began as subscription libraries for those who couldn’t afford to purchase fiction in drugstores and other commercial venues.

Finally, Brantley said that books will become communities as they are integrated, multiplied, fragmented, collaborative, and shared, and publishing itself will be reinvented. Yet his conclusion contained an air of inevitability, as he said, “Libraries and publishers can change the world, or it will be transformed anyway.”

A podcast recording of his talk is available on his site.

Google Drops A Bomb
Google presented a plan to entice publishers to buy into two upcoming models for making money from Google Book Search, including a weekly rental “that resembles a library loan” and a purchase option, “much like a bookstore,” said Tom Turvey, director of Google Book Search Partnerships. The personal library would allow search across the books, expiration and rental, and copy and paste. No pricing was announced. Google has been previewing the program at events including the London Book Fair.

Turvey said Google Book Search is live in 70 countries and eight languages. Ten years ago, zero percent of consumers clicked before buying books online, and now $4 billion of books are purchased online. “We think that’s a market,” Turvey said, “and we think of ourselves as the switchboard.”

Turvey, who previously worked at bn.com and ebrary, said publishers receive the majority of the revenue share as well as free marketing tools, site-brandable search inside a book with restricted buy links, and fetch and push statistical reporting. He said an iTunes for Books was unlikely, since books don’t have one device, model or user experience that works across all categories. Different verticals like fiction, reference, and science, technology and medicine (STM), require a different user experience, Turvey said.

Publishers including SparkNotes requested a way to make money from enabling a full view of their content on Google Books, as did many travel publishers. Most other books are limited to 20 percent visibility, although Turvey said there is a direct correlation between the number of pages viewed and subsequent purchases.

This program raises significant privacy questions. If Google has records that can be correlated with all the other information it stores, this is the polar opposite of what librarians have espoused about intellectual freedom and the privacy of circulation records. Additionally, the quality control questions are significant and growing, voiced by historian Robert Townsend and others.

Libraries are a large market segment for publishers. It seems reasonable for libraries to voice concerns about this proposal at this stage, especially those that haven’t already been bought and sold. Others at the forum were skeptical. Jim Kennedy, vice president and director at the Associated Press, said, “The Google guy’s story is always the same: Send us your content and we’ll monetize it.”

Ebooks Ejournals And Libraries
Dale Flecker of the Harvard University Library gave a historical overview of the challenges libraries have grappled with in the era of digital information.

Instead of talking about ebooks, which he said represent only two percent of usage at Harvard, Flecker described eight challenges about ejournals, which are now “core to what libraries do” and have been in existence for 15-20 years. Library consultant October Ivins challenged this statistic about ebook usage as irrelevant, saying “Harvard isn’t typical.” She said there were 20 ebook platforms demonstrated at the 2006 Charleston Conference, though discovery is still an issue.

First, licensing is a big deal. There were several early questions: Who is a user? What can they do? Who polices behavior? What about guaranteed performance and license lapses? Flecker said that in an interesting shift, there is a move away from licenses to “shared understandings,” where content is acquired via purchase orders.

Second, archiving is a difficult issue. Harvard began in 1636, and has especially rich 18th century print collections, so it has been aware that “libraries buy for the ages.” The sticky issues come with remote and perpetual access, and what happens when a publisher ceases publishing.

Flecker didn’t mention library projects like LOCKSS or Portico in his presentation, though they do exist to answer those needs. He did say that “DRM is a bad actor” and it’s technically challenging to archive digital content. Though there have been various initiatives from libraries, publishers, and third parties, he said, “Publishers have backed out,” and there are open questions about rights, responsibilities, and who pays for what. In the question and answer period that followed, Flecker said Harvard “gives lots of money” to Portico.

Third, aggregation is common. Most ejournal content is licensed in bundles, and consortia and buying clubs are common. Aggregated platforms provide useful search options and intercontent functionality.

Fourth, statistics matter, since they show utility and value for money spent. Though the COUNTER standard is well-defined and SUSHI gives a protocol for exchange of multiple stats, everyone counts differently.

Fifth, discovery is critical. Publishers have learned that making content discoverable increases use and value. At first, metadata was perceived to be intellectual property (as it still is, apparently), but then there was a grudging acceptance and finally, enthusiastic participation. It was unclear which metadata Flecker was describing, since many publisher abstracts are still regarded as intellectual property. He said Google is now a critical part of the discovery process.

Linkage was the sixth point. Linking started with citations, when publishers and aggregators realized that many footnotes contained links to articles that were also online. Bilateral agreements came next, and finally, the Digital Object Identifier (DOI) generalized the infrastructure and helped solve the “appropriate copy” problem, along with OpenURL. With this solution came true interpublisher, interplatform, persistent and actionable links, which are now growing beyond citations.
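To make the “appropriate copy” idea concrete, here is a minimal sketch of how a citation might be turned into an OpenURL that a library’s own link resolver can act on. It assumes the key/encoded-value form of OpenURL 1.0 (Z39.88-2004); the resolver address, the function name, and the sample citation are all hypothetical and stand in for whatever a given library actually runs.

```python
from urllib.parse import urlencode

# Hypothetical resolver address: every library configures its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def build_openurl(citation, resolver_base=RESOLVER_BASE):
    """Build an OpenURL 1.0 (Z39.88-2004) link for a journal article.

    `citation` is a plain dict; only the fields that are present are sent.
    """
    params = [
        ("url_ver", "Z39.88-2004"),
        ("ctx_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    # A DOI, when known, travels as an identifier for the cited item.
    if citation.get("doi"):
        params.append(("rft_id", "info:doi/" + citation["doi"]))
    # Descriptive metadata lets the resolver match items that lack a DOI.
    for key in ("atitle", "jtitle", "issn", "volume", "issue", "spage", "date"):
        value = citation.get(key)
        if value:
            params.append(("rft." + key, str(value)))
    return resolver_base + "?" + urlencode(params)


# Illustrative citation only; the DOI and titles are made up.
sample = {
    "doi": "10.1000/example",
    "atitle": "A Sample Article",
    "jtitle": "Journal of Examples",
    "volume": 12,
    "spage": 34,
    "date": "2007",
}
print(build_openurl(sample))
```

The same citation produces a different link at each institution simply by swapping the resolver base, which is the point: the publisher supplies the metadata, and the reader’s library decides which copy to serve.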

Seventh, there are early glimpses of text mining in ejournals. Text is being used as fodder for computational analysis, not just individual reading. This has required somewhat different licenses geared for computation, and also needs a different level of technical support.

Last, there are continuing requirements for scholarly citation that is unambiguous, persistent, and at a meaningful level. Article-level linking in journals has proven to be sufficient, but the equivalent for books (the page? chapter? paragraph?) has not been established in an era of reflowable text.

In the previous panel, Peter Brantley asked the presenters on digital warehouses about persistent URLs to books, and if ISBNs would be used to construct those URLs. There was total silence, and then LibreDigital volunteered that redirects could be enabled at publisher request.

As WorldCat.org links have also switched from ISBN to OCLC number for permalinks, this seems like an interesting question to solve and discuss. Will the canonical URL for a book point to Amazon, Google, OCLC, or OpenLibrary?
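As a thought experiment on Brantley’s question, here is a minimal sketch of what constructing a book URL from an ISBN could involve. It normalizes either ISBN form to a validated ISBN-13 using the standard check-digit arithmetic, then appends it to a base address. The base URL and the function names are hypothetical; no publisher or vendor at the forum described such a scheme.

```python
# Hypothetical base address for illustration only.
BOOK_URL_BASE = "https://books.example.org/isbn/"

DIGITS = set("0123456789")


def isbn13_check_digit(first12):
    """Check digit for the first 12 digits of an ISBN-13 (weights 1, 3, 1, 3...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_is_valid(chars):
    """ISBN-10 rule: weights 10 down to 1, sum divisible by 11, 'X' means 10."""
    if len(chars) != 10 or not set(chars[:9]) <= DIGITS:
        return False
    if chars[9] not in DIGITS and chars[9] != "X":
        return False
    values = [int(c) for c in chars[:9]]
    values.append(10 if chars[9] == "X" else int(chars[9]))
    return sum(v * (10 - i) for i, v in enumerate(values)) % 11 == 0


def to_isbn13(isbn):
    """Return a hyphen-free, validated ISBN-13 for an ISBN-10 or ISBN-13."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) == 13 and set(chars) <= DIGITS:
        if isbn13_check_digit(chars[:12]) == chars[12]:
            return chars
    elif isbn10_is_valid(chars):
        # Prefix 978, drop the old check digit, compute the new one.
        first12 = "978" + chars[:9]
        return first12 + isbn13_check_digit(first12)
    raise ValueError("not a valid ISBN: %r" % isbn)


def book_url(isbn, base=BOOK_URL_BASE):
    """Build a URL keyed on the normalized ISBN-13."""
    return base + to_isbn13(isbn)


print(book_url("0-306-40615-2"))
```

Normalizing first means the ISBN-10 and ISBN-13 for one edition land on the same address. It does not address the harder problem raised above: an ISBN identifies an edition, not a work, so a hardback and its paperback would still get different URLs.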

NetConnect Spring 2007 podcast episode 3

In Requiem for a Nun, William Faulkner famously said, “The past is never dead. It’s not even past.” With the advent of new processes, the past can survive and be retrieved in new ways and forms. The new skills needed to preserve digital information are the same ones that librarians have always employed to serve users: selection, acquisition, and local knowledge.

The print issue of NetConnect is bundled with the April 15th issue of Library Journal, or you can read the articles online.

Jessamyn West of librarian.net says in Saving Digital History that librarians and archivists should preserve digital information, starting with weblogs. Tom Hyry advocates using extensible processing in Reassessing Backlogs to make archives more accessible to users. And newly appointed Digital Library Federation executive director Peter Brantley covers the potential of the rapidly evolving world of print on demand in A Paperback in 4 Minutes. Melissa Rethlefsen describes the new breed of search engines in Product Pipeline, including those that incorporate social search. Gail Golderman and Bruce Connolly compare databases’ pay-per-view in Pay by the Slice, and Library Web Chic Karen Coombs argues that librarians should embrace a balancing act in the debate between Privacy vs Personalization.

Jessamyn and Peter join me in a far-ranging conversation about some of the access challenges involved for readers and librarians in the world of online books, including common APIs for online books and how to broaden availability for all users.

Books
New Downtown Library
Neal Stephenson
Henry Petroski

Software
Greasemonkey User Scripts
Twitter
Yahoo Pipes
Dopplr

Outline
0:00 Music
0:10 Introduction

1:46 DLF Executive Director Peter Brantley
2:30 California Digital Library

4:13 Jessamyn West
5:08 Ask Metafilter
6:17 Saving Digital History
8:01 What Archivists Save
12:02 Culling from the Firehose of Information
12:34 API changes
14:15 Reading 2.0
15:13 Common APIs and Competitive Advantage
17:15 A Paperback in 4 Minutes
18:36 Lulu
19:06 On Demand Books
21:24 Attempts at hacking Google Book Search
22:30 Contracts change?
23:17 Unified Repository
23:57 Long Tail Benefit
24:45 Full Text Book Searching is Huge
25:08 Impact of Google
27:08 Broadband in Vermont
29:16 Questions of Access
30:45 New Downtown Library
33:21 Library Value Calculator
34:07 Hardbacks are Luxury Items
35:47 Developing World Access
37:54 Preventing the Constant Gardener scenario
40:21 Book on the Bookshelf
40:54 Small Things Considered
41:53 Diamond Age
43:10 Comment that spurred Brantley to read the book
43:40 Marketing Libraries
44:15 Pimp My Firefox
45:45 Greasemonkey User Scripts
45:53 Twitter
46:25 Yahoo Pipes
48:07 Dopplr
50:25 Software without the Letter E
50:45 DLF Spring Forum
52:00 OpenID in Libraries
53:40 Outro
54:00 Music

Listen here or subscribe to the podcast feed

Open Data: What Would Kilgour Think?

The New York Public Library has reached a settlement with iBiblio, the public’s library and digital archive at the University of North Carolina at Chapel Hill, for harvesting records from its Research Libraries catalog, which it claims is copyrighted.

Heike Kordish, director of the NYPL Humanities Library, said a cease and desist letter was sent because of a 1980s incident in which an Australian harvesting effort turned around and resold the NYPL records.

Simon Spero, iBiblio employee and technical assistant to the assistant vice chancellor at UNC-Chapel Hill, said NYPL requested that its library records be destroyed, and the claim was settled with no admission of wrongdoing. “I would characterize the New York Public Library as being neither public nor a library,” Spero said.

It is a curious development that while the NYPL is making arrangements under private agreements to allow Google to scan its book collection into full text, it feels free to threaten other research libraries over MARC records.

The price of open data
This follows a similar string of disagreements about open data with OCLC and the MIT Simile project. The Barton Engineering Library catalog records were made widely available via BitTorrent, a decentralized peer-to-peer file sharing protocol.

This has since been resolved by making the Barton data available again, though in RDF and MODS, not MARC, under a Creative Commons license for non-commercial use.

OCLC CEO Jay Jordan said the issues around sharing data had their genesis in concerns about the Open WorldCat project and sharing records with Microsoft, Google, and Amazon. Other concerns about private equity firms entering the library market also drove recent revisions to the data sharing policies.

OCLC quietly revised its policy about sharing records, which had not been updated since 1987, after numerous debates in the 1980s about the legality of copyrighting member records.

The new WorldCat policy reads in part, “WorldCat® records, metadata and holdings information (“Data”) may only be used by Users (defined as individuals accessing WorldCat via OCLC partner Web interfaces) solely for the personal, non-commercial purpose of assisting such Users with locating an item in a library of the User’s choosing… No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

Looking through the most recent board minutes, it looks like concerns have been raised about “the risk to FirstSearch revenues from OpenWorldCat,” and management incentive plans have been approved.

What is good for libraries?
Another project initiated by Simon Spero, entitled Fred 2.0 after recently deceased Fred Kilgour of OCLC, Yale, and Chapel Hill fame, recently released Library of Congress authority file and subject information, which was gathered by similar means as the NYPL records.

Spero said the project is “dedicated to the men and women at the Library of Congress and outside, who have worked for the past 108 years to build these authorities, often in the face of technology seemingly designed to make the task as difficult as possible.”

Since Library of Congress data by definition cannot be copyrighted as free government information, the project was more collaborative in nature and has received acclaim for its help in pointing out cataloging irregularities in the records. OCLC also offers a linked authority file as a research project.

Firefox was born from open source
While there is not yet consensus about what will be built as a result of releasing library data, the move can be compared to Netscape open-sourcing the Mozilla code in 1998, which eventually brought Firefox and other open source projects to light. It also shows that the financial motivations of library organizations by necessity dictate the legal mechanisms of protection.

Open source metasearch

Now there’s a new kid on the (meta)search block. LibraryFind, an open-source project funded by the State Library of Oregon, is currently live at Oregon State University. The library has just packaged up a release for anyone to download and install.

Jeremy Frumkin, Gray chair for Innovative Library Services at OSU, said the goals were to contribute to the support of scholarly workflow, remove barriers between the library and Web information, and establish the digital library as a platform.

Lead developers Dan Chudnov, soon to join the Library of Congress’s Office of Strategic Initiatives, and Terry Reese, catalog librarian and developer of popular application MarcEdit, worked with the following guiding principles: Two clicks–one to find, and one to get; a goal of getting results in four seconds, and known and adjustable results ranking.

Other OSU project members included Tami Herlocker, point person for interface development, and Ryan Ordway, system administrator. Frumkin said, “The Ruby on Rails platform provided easy, quick user interface development. It gives a variety of UI possibilities, and offers new interfaces for different user groups.”

The application includes collaborations on the OpenURL module from Ross Singer, library applications developer at the Georgia Tech library, and Ed Summers, Library of Congress developer. Journal coverage can be imported from a SerialsSolutions export, and more import facilities are planned in upcoming releases.

OSU is working on a contract with OCLC’s WorldCat to download data, and is looking to build greater trust relationships with vendors. “The upside for vendors is they can see how their data is used when developing new services,” Frumkin said.

Future enhancements include an information dashboard and a personal digital library. Developers are also staffing a support chatroom for technical support, help, and development discussion of LibraryFind.

NetConnect Winter 2007 podcast episode 2

This is the second episode of the Open Libraries podcast, and I was pleased to have the opportunity to talk to some of the authors of the Winter netConnect supplement, entitled Digitize This!

The issue covers how libraries can start to digitize their unique collections. K. Matthew Dames and Jill Hurst-Wahl wrote an article about copyright and practical considerations in getting started. They join me, along with Lotfi Belkhir, CEO of Kirtas Technologies, to discuss the important issue of digitization quality.

One of the issues that has surfaced recently is exactly what libraries are receiving from the Google Book Search project. As the project grows beyond the initial five libraries into more university and Spanish libraries, many of the implications have become more visible.

The print issue of NetConnect is bundled with the January 15th issue of Library Journal, or you can read the articles online.

Recommended Books:
Kevin
Knowledge Diplomacy

Jill
Business as Unusual

Lotfi
Free Culture
Negotiating China
The Fabric of the Cosmos

Software
SuperDuper
Google Documents
Arabic OCR

0:00 Music and Intro
1:59 Kevin Dames on his weblog Copycense
2:48 Jill Hurst-Wahl on Digitization 101
4:16 Jill and Kevin on their article
4:34 SLA Digitization Workshop
5:24 Western NY Project
6:45 Digitization Expo
7:43 Lotfi Belkhir
9:00 Books to Bytes
9:26 Cornell and Microsoft Digitization
11:00 Scanning vs Digitization
11:48 Google Scanning
15:22 Michael Keller’s OCLC presentation
16:14 Google and the Public Domain
17:52 Authors Guild sues Google
21:13 Quality Issues
24:10 MBooks
26:56 Public Library digitization
27:14 Incorporating Google Books into the catalog
28:49 CDL contract
30:22 Microsoft Book Search
31:15 Double Fold
39:20 Print on Demand and Digitization
39:25 Books@Google
43:14 History on a Postcard
45:33 iPRES conference
45:46 LOCKSS
46:45 OAIS

Casey Bisson named one of first winners of Mellon Award for Technology Collaboration

Casey Bisson, information architect at Plymouth State University, was presented with a $50,000 Mellon Award for Technology Collaboration by Tim Berners-Lee at the Coalition for Networked Information meeting in Washington, DC, on December 4.

His project, WP-OPAC, is seen as the first step for allowing library catalogs to integrate with WordPress, a popular open-source content management system.

The awards committee included Mitchell Baker, Mozilla; Tim Berners-Lee, W3C; Vinton Cerf, Google; Ira Fuchs, Mellon; John Gage, Sun Microsystems; Tim O’Reilly, O’Reilly Media; John Seely Brown, and Donald Waters, Mellon. Berners-Lee said, “These awards are about open source. It’s a good thing because it makes our lives easier, and the award winners used open source to solve problems.”

Library of Congress?
The revolutionary part of the announcement, however, was that Plymouth State University would use the $50,000 to purchase Library of Congress catalog records and redistribute them free under a Creative Commons Share-Alike license or GNU. OCLC has been the source for catalog records for libraries, and its license restrictions do not permit reuse or distribution. However, catalog records have been shared via Z39.50 for several years without incident.

“Libraries’ online presence is broken. We are more than study halls in the digital age. For too long, libraries have been coming up with unique solutions for common problems,” Bisson said. “Users are looking for an online presence that serves them in the way they expect.” He said, “The intention is to bring together the free or nearly-free services available to the user.”

Free download
Bisson said Plymouth State University is committed to supporting WP-OPAC, and will be offering it as a free download from its site, likely in the form of sample records plus WordPress with WP-OPAC included. “With nearly 140,000 registered users of Amazon Web Services, it’s time to use common solutions for our unique problems,” Bisson said.

The internal data structure works with iCal for calendar information and Flickr for photos, and can be used with historical records. It allows libraries to go beyond Library of Congress subject headings, Bisson said. Microformats are key to the internal data, and the OpenSearch API is used for interoperability. Bisson is looking at adding unAPI and OAI in the future.

At this time, there is no connection to the University of Rochester project that is prototyping a new extensible catalog, though both are funded by Mellon. [see LJ Baker’s Smudges, 9/1/2006]

Other winners include: Open University (Moodle), RPI (Bedework), University of British Columbia Vancouver (Open Knowledge Project), Virginia Tech (Sakai), Yale (CAS single sign-on), University of Washington (Pine and IMAP), Internet Archive (Wayback Machine), and Humboldt State University (Moodle).

Open Access

It’s laudable to make the entirety of human knowledge, especially scientific, available for free. But what about that free lunch?

news @ nature.com – Open-access journal hits rocky times – Financial analysis reveals dependence on philanthropy.

The Public Library of Science (PLoS), the flagship publisher for the open-access publishing movement, faces a looming financial crisis. An analysis of the company’s accounts, obtained by Nature, shows that the company falls far short of its stated goal of quickly breaking even. In an attempt to redress its finances, PLoS will next month hike the charge for publishing in its journals from US$1,500 per article to as much as $2,500.

In the beginning, libraries were excited about the open access movement because it promised to save them money from the serials budget. However, as Phil Davis pointed out last year, libraries still face the price of print subscriptions, plus membership fees, as well as having to subsidize author submission fees. From this angle, open access looks like less of a bargain than a mechanism to subsidize research and development for new publications.