The Information Bomb and Activity Streams

In 1993, Yale computer science professor David Gelernter opened a package he thought was a dissertation in progress. Instead, it was a bomb from the Unabomber, who would later write in his manifesto that “Technological society is incompatible with individual freedom and must therefore be destroyed and replaced by primitive society so that people will be free again.” Though Kaczynski’s point was lost when attached to violence, it is ironic that his target was a computer science professor who professed not to like computers, the tools of a technological society.

In one of the dissertations Gelernter supervised, Eric Thomas Freeman proposed a new direction for information management. Freeman argued that “In an attempt to do better we have reduced information management to a few simple and unifying concepts and created Lifestreams. Lifestreams is a software architecture based on simple data structure, a time-ordered stream of documents, that can be manipulated with a small number of powerful operators to locate, organize, summarize, and monitor information.” Thus, the stream was born of a desire to answer information overload.

While Freeman anticipated freedom from common desktop computing metaphors, the Web had not yet reached ubiquity when he wrote his dissertation twelve years ago. His lifestreams principles live on in the software interfaces of Twitter, Delicious, Facebook, and FriendFeed. But have you tried to find a tweet from three months ago? How about something you wrote on Facebook last year? FriendFeed discussions have no obvious URL, so there is no easy way to return to the past. This obsolescence is by design, and the stream comes and goes like an information bomb.

In The Anxiety of Obsolescence, Pomona College English professor Kathleen Fitzpatrick says that “The Internet is merely the latest of the competitors that print culture has been pitted against since the late nineteenth century. Threats to the book’s presumed dominance over the hearts and minds of Americans have arisen at every technological turn; or so the rampant public discourse of print’s obsolescence would lead one to believe.” Fitzpatrick goes on to say that her work is dedicated to demonstrating the “peaceable coexistence of literature and television, despite all the loud claims to the contrary.” This objective is a useful response to the usual kvetching about the utter uselessness of the activity stream of the day.

A Standard for Sharing

Now popularized as activity streams, the flow of information has gained appeal because it gives users a way to curate their own information. Yet there is no standard way for users or data providers to recast this information in a form that preserves privacy or archival access.

Chris Messina has advocated for social network interoperability and suggests that “with a little effort on the publishing side, activity streams could become much more valuable by being easier for web services to consume, interpret and to provide better filtering and weighting of shared activities to make it easier for people to get access to relevant information from people that they care about, as it happens.” Messina points out that activity streams “provide what all good news stories provide: the who, what, when, where and sometimes, how.”

In the digital age, activity streams could be used to record interactions with scholarly materials. Just as COUNTER and MEtrics from Scholarly Usage of Resources (MESUR) record statistics about how journal articles are viewed, an activity stream standard could provide context around browsing.

For example, Swarthmore has a fascinating collection of W.H. Auden incunabula. You can see which books he checked out, the books he placed on reserve for his students, and even his unauthorized annotations, including an exasperated response to his own work: “Oh God, what rubbish.” What seemed ephemeral becomes a fascinating exercise in tracing the thought of a poet in America at a crucial period in his development. If we had also captured what Auden was listening to, reading, and attending, what a treasure trove it would be for biographers and scholars.
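
To make the idea concrete, here is a minimal sketch of what one such entry might look like, using the actor/verb/object pattern Messina describes. The field names follow the general shape of the Activity Streams JSON drafts, and every value (the date, the verb, the catalog URL) is invented for illustration rather than drawn from an actual Swarthmore record.

```python
import json
from datetime import datetime, timezone

# A hypothetical activity stream entry for a special-collections interaction.
# Field names follow the actor/verb/object pattern of the Activity Streams
# drafts; the identifiers, verb vocabulary, and URL are illustrative only.
activity = {
    "published": datetime(1942, 3, 9, tzinfo=timezone.utc).isoformat(),
    "actor": {
        "objectType": "person",
        "displayName": "W. H. Auden",
    },
    "verb": "checkout",  # the "what" of the news story
    "object": {
        "objectType": "book",
        "displayName": "An invented circulating title",
        "url": "http://example.edu/catalog/record/12345",  # hypothetical record
    },
    "provider": {"displayName": "Swarthmore College Library"},
}

print(json.dumps(activity, indent=2))
```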

The Appeal of Activity Streams

In 2007, Dan Chudnov wrote in Social Software: You Are an Access Point, “There’s a downside to all of this talk of things ‘social.’ As soon as you become an access point, you also become a data point. Make no mistake – Facebook and MySpace wouldn’t still be around if they couldn’t make a lot of money off of each of us, so remember that while your use of these services makes it all seem better for everybody else, the sites’ owners are skimming profit right off the top of that network effect.” How, then, can users access and understand their own streams and data points?

Maciej Ceglowski, a former Mellon Foundation grant officer and Yahoo engineer, has founded Pinboard, an antisocial bookmarking service that puts user privacy ahead of monetization and sharing features. One of its appealing qualities is placing users at the center of what they choose to share, without presuming that the record is open by default; bookmarks can easily be made private.

In The Information Bomb, Paul Virilio wrote that “Digital messages and images matter less than their instantaneous delivery: the shock effect always wins out over the consideration of the informational content. Hence the indistinguishable and unpredictable character of the offensive act and the technical breakdown.” Users can manage or drown in the stream. To safeguard this information, users should push for their own data to be made available so that they can make educated choices.

With the well-founded Department of Justice inquiry into the Google Books project over monopoly pricing and privacy, libraries can now ask for book usage information. Just as position information enables HathiTrust to provide full-text searchability, usage information would give libraries a way to better serve patrons and would hand special collections a treasure trove of information.

Mining for Meaning

In David Lodge’s 1984 novel, Small World, a character remarks that literary analysis of Shakespeare and T.S. Eliot “would just lend itself nicely to computerization….All you’d have to do would be to put the texts on to tape and you could get the computer to list every word, phrase and syntactical construction that the two writers had in common.”

This brave new world is upon us, but the larger question for Google and OCLC, among other purveyors of warehoused metadata and petabytes of information, is how to derive meaning. One of the brilliant insights to emerge from Terry Winograd’s research and mentoring (he advised Google cofounder Larry Page at Stanford) is that popularity in the form of inbound links matters, at least for web pages. In the case of all the world’s books turned into digitized texts, it is harder to assign meaning without popularity, a canon, or search queries as a guide.
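
The insight that inbound links signal importance is easiest to see in a toy version of the link-analysis calculation. The sketch below is a bare-bones power-iteration PageRank over a made-up four-page link graph, not Google’s production algorithm; the damping factor and iteration count are conventional illustrative choices.

```python
# A toy PageRank: pages that attract more inbound links accumulate more score.
# The link graph and damping factor are illustrative, not Google's actual setup.
links = {
    "a": ["b", "c"],  # page "a" links out to "b" and "c"
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # power iteration until the scores settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += damping * share
    rank = new_rank

# "c" ends up highest: it has the most inbound links.
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```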

Until recently, text mining was not possible at great scale. And as the great scanning projects continue down their bumpy road, what will come out of them has yet to resolve into meaning for users.

Nascent standards

Bill Kasdorf pointed out several XML models for books in his May NISO presentation, including NISO/ISO 12083, TEI, DocBook, NLM Book DTD, and DTBook. These existing models have served publishers well, though they have been employed for particular uses and have not yet found common ground across the breadth of book types. The need for a standard has never been clearer, but it will require vision and a clear understanding of solved problems to push forward.
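
To see why common ground is elusive, compare how the same minimal one-chapter book might be marked up in two of the models Kasdorf lists. The fragments below are simplified sketches of typical TEI and DocBook structure, not complete or schema-valid documents, and they omit required metadata and namespaces.

```python
import xml.etree.ElementTree as ET

# The same one-chapter book sketched in two of the models above.
# Both fragments are simplified for illustration.
tei = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Small World</title></titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="chapter">
        <head>Prologue</head>
        <p>Sample paragraph text.</p>
      </div>
    </body>
  </text>
</TEI>
"""

docbook = """
<book>
  <title>Small World</title>
  <chapter>
    <title>Prologue</title>
    <para>Sample paragraph text.</para>
  </chapter>
</book>
"""

# Both parse as well-formed XML, but the element vocabularies barely overlap,
# which is the interoperability gap a common book standard would close.
for fragment in (tei, docbook):
    print(ET.fromstring(fragment).tag)
```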

After the professor in Small World gains access to a server, he grows giddy with the possibilities of finding “your own special, distinctive, unique way of using the English language….the words that carry a distinctive semantic content.” While we may be delighted by what searching books affords, there is a distinct possibility that the world of the text could be changed completely.

Another mechanism for assigning meaning to full text has been opened up by web technology and science. The Open Text Mining Interface (OTMI) is a method championed by Nature Publishing Group as a way to share the contents of its archives in XML for the express purpose of text mining while respecting intellectual property concerns. Now in its second revision, OTMI is an elegant method of enabling sharing, though it remains to be seen whether the initiative will spread to a larger audience.
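
The general pattern is that a publisher exposes word counts or scrambled snippets rather than readable full text, so a miner can work with term statistics without the article being redistributed. The sketch below consumes a feed of that kind; the element names are placeholder inventions for illustration, not the published OTMI schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Consuming a term-frequency feed of the kind OTMI enables: word counts are
# shared, readable prose is not. Element names here are hypothetical
# placeholders, not the actual OTMI vocabulary.
feed = """
<article id="example-doi">
  <term name="genome" count="14"/>
  <term name="sequencing" count="9"/>
  <term name="yeast" count="4"/>
</article>
"""

root = ET.fromstring(feed)
frequencies = Counter(
    {t.get("name"): int(t.get("count")) for t in root.iter("term")}
)
print(frequencies.most_common(2))  # the article's most prominent terms
```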

Sense making

As the corpus lurches towards the cloud, one interesting example of semantic extraction comes from the Open Calais project, an open platform from the reconstituted Thomson Reuters. When raw text is fed into the Calais web service, terms are extracted and mapped to existing taxonomies. Thus, persons, countries, and categories are first identified and then made available for verification.
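
The round trip looks roughly like the sketch below: raw text is posted to the service and tagged entities come back for checking against a taxonomy. The endpoint URL, header names, API key, and response shape are assumptions for illustration, not the documented OpenCalais interface, so consult the current documentation for the real details.

```python
import requests

# Sketch of the extraction round trip described above. The endpoint, headers,
# and JSON response layout are hypothetical stand-ins for the real service.
CALAIS_URL = "https://api.example.com/calais/extract"  # hypothetical endpoint
API_KEY = "your-api-key"

text = (
    "W. H. Auden taught at Swarthmore College in Pennsylvania "
    "during the Second World War."
)

response = requests.post(
    CALAIS_URL,
    headers={"x-api-key": API_KEY, "Content-Type": "text/plain"},
    data=text.encode("utf-8"),
    timeout=30,
)
response.raise_for_status()

# Assume a JSON payload of extracted entities (persons, places, categories),
# ready to be verified against existing taxonomies.
for entity in response.json().get("entities", []):
    print(entity.get("type"), entity.get("name"))
```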

This experimental service has proved its value for unstructured text, and it works on everything from the most recent weblog posting to historic newspapers newly converted to text via optical character recognition (OCR). Since human-created metadata and indexing services are among the most expensive things libraries and publishers produce, any mechanism that augments human intelligence by using machines to create meaning is a useful way forward.

Calais shows promise for metadata enhancement, since full text can be mined for its word properties and fed into taxonomic structures. This could be the basis for search engines that understand natural language queries in the future, but could also be a mechanism for accurate and precise concept browsing.

Glimmers of understanding

One method of gaining new understanding is to examine solved problems. Melvil Dewey understood vertical integration, as he helped with innovations around 3×5 index cards and card cabinets, as well as the classification system that bears his name. Some even call him the first standards-bearer for libraries, though anyone familiar with how standards are made will find it hard to credit one person with sole responsibility.

Another solved problem is how to make information about books and journals widely available. This has been done twice in the past century: first with the printed catalog card, distributed by the Library of Congress for the greater good, and then with the distributed catalog record, provided at great utility (and cost) by the Online Computer Library Center.

Pointers are no longer entirely sufficient, since the problem is not only how to find information but how to make sense of it once it has been found. Linking from catalog records has been a partial solution, but the era of complete books online is now entering its second decade. The third stage is upon us.

Small World: An Academic Romance
David Lodge; Penguin, 1985

WorldCat · LibraryThing · Google Books · BookFinder

Presenting at ALA panel on Future of Information Retrieval

The Future of Information Retrieval

Ron Miller, Director of Product Management, HW Wilson, hosts a panel of industry leaders including:
Mike Buschman, Program Manager, Windows Live Academic, Microsoft.
R. David Lankes, PhD, Director of the Information Institute of Syracuse, and Associate Professor, School of Information Studies, Syracuse University.
Marydee Ojala, Editor, ONLINE, and contributing feature and news writer for Information Today, Searcher, EContent, and Computers in Libraries, among other publications.
Jay Datema, Technology Editor, Library Journal

Add to calendar:
Monday, 25 June 2007
8-10 a.m., Room 103b
Preliminary slides and audio attached.

NetConnect Spring 2007 podcast episode 3

In Requiem for a Nun, William Faulkner famously wrote, “The past is never dead. It’s not even past.” With the advent of new processes, the past can survive and be retrieved in new ways and forms. The new skills needed to preserve digital information are the same ones that librarians have always employed to serve users: selection, acquisition, and local knowledge.

The print issue of NetConnect is bundled with the April 15th issue of Library Journal, or you can read the articles online.

Jessamyn West of librarian.net says in Saving Digital History that librarians and archivists should preserve digital information, starting with weblogs. Tom Hyry advocates using extensible processing in Reassessing Backlogs to make archives more accessible to users. And newly appointed Digital Library Federation executive director Peter Brantley covers the potential of the rapidly evolving world of print on demand in A Paperback in 4 Minutes. Melissa Rethlefsen describes the new breed of search engines in Product Pipeline, including those that incorporate social search. Gail Golderman and Bruce Connolly compare databases’ pay-per-view offerings in Pay by the Slice, and Library Web Chic Karen Coombs argues that librarians should embrace a balancing act in Privacy vs. Personalization.

Jessamyn and Peter join me in a far-ranging conversation about some of the access challenges involved for readers and librarians in the world of online books, including common APIs for online books and how to broaden availability for all users.

Books
New Downtown Library
Neal Stephenson
Henry Petroski

Software
Greasemonkey User Scripts
Twitter
Yahoo Pipes
Dopplr

Outline
0:00 Music
0:10 Introduction

1:46 DLF Executive Director Peter Brantley
2:30 California Digital Library

4:13 Jessamyn West
5:08 Ask Metafilter
6:17 Saving Digital History
8:01 What Archivists Save
12:02 Culling from the Firehose of Information
12:34 API changes
14:15 Reading 2.0
15:13 Common APIs and Competitive Advantage
17:15 A Paperback in 4 Minutes
18:36 Lulu
19:06 On Demand Books
21:24 Attempts at hacking Google Book Search
22:30 Contracts change?
23:17 Unified Repository
23:57 Long Tail Benefit
24:45 Full Text Book Searching is Huge
25:08 Impact of Google
27:08 Broadband in Vermont
29:16 Questions of Access
30:45 New Downtown Library
33:21 Library Value Calculator
34:07 Hardbacks are Luxury Items
35:47 Developing World Access
37:54 Preventing the Constant Gardener scenario
40:21 Book on the Bookshelf
40:54 Small Things Considered
41:53 Diamond Age
43:10 Comment that spurred Brantley to read the book
43:40 Marketing Libraries
44:15 Pimp My Firefox
45:45 Greasemonkey User Scripts
45:53 Twitter
46:25 Yahoo Pipes
48:07 Dopplr
50:25 Software without the Letter E
50:45 DLF Spring Forum
52:00 OpenID in Libraries
53:40 Outro
54:00 Music

Listen here or subscribe to the podcast feed

Open Data: What Would Kilgour Think?

The New York Public Library has reached a settlement with iBiblio, the public’s library and digital archive at the University of North Carolina at Chapel Hill, over the harvesting of records from NYPL’s Research Libraries catalog, which the library claims is copyrighted.

Heike Kordish, director of the NYPL Humanities Library, said a cease and desist letter was sent because of a 1980s incident in which an Australian harvesting effort turned around and resold NYPL records.

Simon Spero, iBiblio employee and technical assistant to the assistant vice chancellor at UNC-Chapel Hill, said NYPL requested that its library records be destroyed, and the claim was settled with no admission of wrongdoing. “I would characterize the New York Public Library as being neither public nor a library,” Spero said.

It is a curious development that while the NYPL is making private arrangements to allow Google to scan its book collection into full text, it feels free to threaten other research libraries over MARC records.

The price of open data
This follows a similar string of disagreements about open data between OCLC and the MIT SIMILE project, after the Barton Engineering Library catalog records were widely shared via BitTorrent, a decentralized file-sharing protocol.

This has since been resolved by making the Barton data available again, though in RDF and MODS, not MARC, under a Creative Commons license for non-commercial use.
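
In practice, “available in RDF and MODS” means the records are plain XML that any script can read, rather than records locked in the MARC communications format. The sketch below pulls a title and date out of a MODS record; the sample record is invented, and real Barton records carry far more detail.

```python
import xml.etree.ElementTree as ET

# Reading a MODS record with nothing but the standard library. The record
# below is a made-up example, not an actual Barton catalog entry.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

record = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Introduction to Structural Engineering</title>
  </titleInfo>
  <originInfo>
    <dateIssued>1987</dateIssued>
  </originInfo>
</mods>
"""

root = ET.fromstring(record)
title = root.find("mods:titleInfo/mods:title", MODS_NS).text
date = root.find("mods:originInfo/mods:dateIssued", MODS_NS).text
print(title, date)
```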

OCLC CEO Jay Jordan said the issues around sharing data had their genesis in concerns about the Open WorldCat project and sharing records with Microsoft, Google, and Amazon. Other concerns about private equity firms entering the library market also drove recent revisions to the data sharing policies.

OCLC quietly revised its policy on sharing records, which had not been updated since 1987, when it was adopted after numerous debates about the legality of copyrighting member records.

The new WorldCat policy reads in part: “WorldCat® records, metadata and holdings information (“Data”) may only be used by Users (defined as individuals accessing WorldCat via OCLC partner Web interfaces) solely for the personal, non-commercial purpose of assisting such Users with locating an item in a library of the User’s choosing… No part of any Data provided in any form by WorldCat may be used, disclosed, reproduced, transferred or transmitted in any form without the prior written consent of OCLC except as expressly permitted hereunder.”

The most recent board minutes suggest that concerns have been raised about “the risk to FirstSearch revenues from OpenWorldCat,” and that management incentive plans have been approved.

What is good for libraries?
Another project initiated by Simon Spero, named Fred 2.0 after the late Fred Kilgour of OCLC, Yale, and Chapel Hill fame, recently released Library of Congress authority file and subject information, gathered by means similar to those used for the NYPL records.

Spero said the project is “dedicated to the men and women at the Library of Congress and outside, who have worked for the past 108 years to build these authorities, often in the face of technology seemingly designed to make the task as difficult as possible.”

Since Library of Congress data, as government information, cannot by definition be copyrighted, the project was more collaborative in nature and has been praised for helping to point out cataloging irregularities in the records. OCLC also offers a linked authority file as a research project.

Firefox was born from open source
While there is not yet consensus about what will be built from freely released library data, the move can be compared to Netscape open-sourcing the Mozilla code in 1998, which eventually brought Firefox and other open source projects to light. It also shows that the financial motivations of library organizations, by necessity, dictate their legal mechanisms of protection.

NetConnect Winter 2007 podcast episode 2

This is the second episode of the Open Libraries podcast, and I was pleased to have the opportunity to talk to some of the authors of the Winter netConnect supplement, entitled Digitize This!

The issue covers how libraries can start to digitize their unique collections. K. Matthew Dames and Jill Hurst-Wahl wrote an article about copyright and practical considerations in getting started. They join me, along with Lotfi Belkhir, CEO of Kirtas Technologies, to discuss the important issue of digitization quality.

One of the issues that has surfaced recently is exactly what libraries are receiving from the Google Book Search project. As the project grows beyond the initial five libraries into more university and Spanish libraries, many of the implications have become more visible.

The print issue of NetConnect is bundled with the January 15th issue of Library Journal, or you can read the articles online.

Recommended Books:
Kevin
Knowledge Diplomacy

Jill
Business as Unusual

Lotfi
Free Culture
Negotiating China
The Fabric of the Cosmos

Software
SuperDuper
Google Documents
Arabic OCR

0:00 Music and Intro
1:59 Kevin Dames on his weblog Copycense
2:48 Jill Hurst-Wahl on Digitization 101
4:16 Jill and Kevin on their article
4:34 SLA Digitization Workshop
5:24 Western NY Project
6:45 Digitization Expo
7:43 Lotfi Belkhir
9:00 Books to Bytes
9:26 Cornell and Microsoft Digitization
11:00 Scanning vs Digitization
11:48 Google Scanning
15:22 Michael Keller’s OCLC presentation
16:14 Google and the Public Domain
17:52 Author’s Guild sues Google
21:13 Quality Issues
24:10 MBooks
26:56 Public Library digitization
27:14 Incorporating Google Books into the catalog
28:49 CDL contract
30:22 Microsoft Book Search
31:15 Double Fold
39:20 Print on Demand and Digitization
39:25 Books@Google
43:14 History on a Postcard
45:33 iPRES conference
45:46 LOCKSS
46:45 OAIS

Digipalooza begins

OverDrive’s first annual user group meeting was held in Cleveland, OH, July 26-28. Mixing audio book publishers, public librarians, and hardware manufacturers, the gathering showcased innovative uses of digital media and upcoming features from OverDrive. New additions include a wiki for users (dlrwiki.overdrive.com), improved collection development tools with preordering capabilities and RSS feeds, and multilingual support.

Although OverDrive content is not available for iPods, the company “is hopeful that Apple and Microsoft can reach an agreement that would enable support for Microsoft-based DRM-protected materials on the iPod/Mac.”

Finally, the New York Public Library announced its plans to roll out a direct download service (ebooks.nypl.org), which will enable patrons to read digital content directly on their phones and other devices.

This is a welcome development, since discovery and download is quite a process right now. It took me over 30 minutes to figure out how to get the Mobipocket version of Freakonomics onto my Treo, and it was a little disheartening to find that the old models of print (placing holds, books that expire) have been replicated. I did like the lack of overdue fines, though.

Open Content

Brewster Kahle and the Open Content Alliance are doing some interesting and credible things.

It’s especially interesting to see the open source software coming out of the project, like Dojo.

Some of the scans are quite beautiful, like this Henry James book.