Open Libraries "… are signs of life and hope: They are the cornerstone of democracy"

Evolution not Revolution

Swimming in salt water is wonderful; drinking it is not. Four hundred years ago, the first American settlers in Jamestown, Virginia, ran into troubles during their first five years because the fresh water they depended upon for drinking turned brackish in the summer. Suddenly, besides the plagues, angry Indians, and crop difficulties, they had to find new sources of fresh water inland. Libraries and publishers are facing a similar challenge as the hybrid world of print and online publications has changed the economic certainties that have kept both healthy.

The past five years in the information world have been full of revolutionary promise, but the new reality has not yet matched the promise of a universal library. Google Scholar promised universal access to scholarly information, yet its dynamic start in 2004 has brought few evolutionary changes since its release. In fact, the addition of Library Links using OpenURL support is the last major feature Scholar has seen. The NISO standard that enables seamless full-text access has shown its value.

For years, it’s been predicted that the Google Books project would revolutionize scholarship, and in some respects it has done so. But in seeking a balance between cornering Amazon’s market for searching inside books, respecting authors’ rights, finding the rights holders of so-called orphan works, and solving metadata and scanning quality issues, its early promise is not yet fulfilled.

Although one view of history is predicated on innovation bringing radical change, cultural institutions know that changes often come from slow progress over time instead of abrupt changes wrought from dramatic agreements.

Et tu, LC?
The history of putting information online shows that the process has been filled with government investment and intervention. There has been dramatic research and development from large companies like Lockheed Martin (with Dialog) and IBM (for ILS development), followed by retrenchment. Sometimes the for-profit waters are saltier than scholarship can abide, though companies are certainly needed to fund development and push scale forward.

In A History of Online Information Sources, 1963-1974, Charles P. Bourne and Trudi Bellardo Hahn assert, “The serious application of computers to document reference retrieval began in the late 1950s, with slow serial searches of small files of bibliographic records on magnetic tapes. A precursor of effective, large-scale online information retrieval (IR) systems was an experiment in searching bibliographic records on an IBM disk memory system called RAMAC (Random Access Method of Accounting and Control). …The genesis of online retrieval systems can be traced to the first half of the 1960s.” What we see as nearly comprehensive coverage of journals online is a product of over forty years of experimentation, from metadata to full-text delivery.

Early efforts from New York University to digitize their research collection with the help of the country of Abu Dhabi, the Hathi Trust, and OCLC show the multiplicity of effort required to operate at the scale required for a large research collection. The initial questions are large: What format is most useful? Is this a one-time effort, or will there be opportunities to revise the scanning? Is it necessary to scan all works, or only unique collections? What about copyright? Do lawyers need to hammer out agreements for campus-wide access vs. universal access?

One wonders why a national library isn’t rising to answer the call to make all books ever published available online. The Library of Congress has developed a promising answer in the World Digital Library, but they have not yet been vocal about their successes. Sometimes, code speaks louder than op-eds in the New York Times. And as Karen Schneider pointed out, Sergey Brin seems to be unaware that WorldCat has a number of holdings for books he claims are “no longer available.” And perhaps without Google Ad $$ powering the digitization, there is a delay while the numbers are run to account for the cost of full versus selective coverage, since historic cooperation is required to pay for the total cost.

Solutions via Standards
Journals still exist; newspapers are published every day; architectural wonders in the form of libraries are standing. But one day, you enter the library and the magazines are in the basement instead of the entrance. In this era, books have gone from closed to open stacks.

Jean-Luc Nancy provides a helpful theoretical framework on what makes a book unique in On the Commerce of Thinking: Of Books and Bookstores. In it, he says, “Libraries and bookstores are the depots, reserves, and shop windows of those coffers, whose locks must be forced before they are closed again with a new bolt and latch.” He goes on to assert that “the library or bookstore—as we know, they used to be the same thing—is nothing but the Idea of the book as exposed substance, as subject that shows and presents itself.” His point is that books are unique, since they contain their essence within themselves. Unlike letters, memoirs, treatises, pamphlets, or lampoons, a book is neither reducible to container nor content.

Since books are perhaps the last item in society that has a known function that transcends the ability to be contained within a given economic system, it’s important to think about ways of preserving access to information that we have today. In The Anarchist in the Library, Siva Vaidhyanathan writes, “Recently, individuals have used widespread, low-cost, high-quality technologies to persuade, and organize over long distances, beyond the prying ears and eyes of powerful institutions. Digitization and networking make anarchy relevant in ways it has not been before….Anarchy is not necessarily chaotic and dangerous. It is organization through disorganization—anarchist tactics generally involve uncoordinated actions toward a coordinated goal.” To me, this is what the standards community represents at its heart: the ability to leverage community thinking about seemingly intractable issues like wide-scale digitization to form consensus that brings forth a series of changes that benefit readers, librarians, publishers, and companies.

Going direct
As Andrew Carnegie famously said, “I choose free libraries as the best agencies for improving the masses of the people, for they give nothing for nothing. They only help those who help themselves. They never pauperize. They reach the aspiring and open the chief treasures of the world—those stored up in books. A taste for reading drives out lower tastes.”

One notable advance in the past fifteen years is the lowly weblog, which Scott Rosenberg chronicles to good effect in Say Everything: How Blogging Began, What It’s Becoming, and Why It Matters. Giving librarians the direct ability to express themselves to patrons is a great start, but what if this expertise was aggregated and categorized to bring order to the universal library? What if standards for sharing, citing, and exporting books made the salty water of the for-profit enterprises into the brackish water they are, and what if free access to searching and reading books online was the tonic water democracy needs?

Note: This was supposed to be the last editorial in a series of eight written for Information Standards Quarterly. Thanks to Todd Carpenter and the NISO board for funding the expansion of a newsletter into a magazine, though its viability and direction are shifting yet again. As ISQ optimizes its financial model, I hope to continue to write about these topics. In addition, I’m available for consulting and speaking via my company, Bookism LLC.

The Anarchist In The Library: How The Clash Between Freedom And Control Is Hacking The Real World And Crashing The System
Siva Vaidhyanathan; Basic Books


The Information Bomb and Activity Streams

In 1993, Yale computer science professor David Gelernter opened a package he thought was a dissertation in progress. Instead, it was a bomb from the Unabomber, Ted Kaczynski, who wrote in his manifesto that “Technological society is incompatible with individual freedom and must therefore be destroyed and replaced by primitive society so that people will be free again.” Though Kaczynski’s point was lost when attached to violence, it’s ironic that his target was a computer science professor who professed not to like computers, the tool of a technological society.

In addition, in one of the dissertations Gelernter supervised, Eric Thomas Freeman proposed a new direction for information management. Freeman argued that “In an attempt to do better we have reduced information management to a few simple and unifying concepts and created Lifestreams. Lifestreams is a software architecture based on simple data structure, a time-ordered stream of documents, that can be manipulated with a small number of powerful operators to locate, organize, summarize, and monitor information.” Thus, the stream was born of a desire to answer information overload.
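To make Freeman's description a little more concrete, here is a minimal Python sketch of a time-ordered stream with a few operators. The names (`Document`, `Stream`, `find`, `between`, `summarize`) are hypothetical illustrations and are not taken from the Lifestreams software itself.

```python
from bisect import insort
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Document:
    created: datetime                            # the only sort key is time
    title: str = field(compare=False)
    body: str = field(compare=False, default="")


class Stream:
    """A time-ordered stream of documents (illustrative only)."""

    def __init__(self):
        self._docs = []

    def add(self, doc):
        insort(self._docs, doc)                  # keep the stream sorted by time

    def find(self, term):
        """Locate: return a substream of documents that mention `term`."""
        sub = Stream()
        for doc in self._docs:
            if term.lower() in (doc.title + " " + doc.body).lower():
                sub.add(doc)
        return sub

    def between(self, start, end):
        """Organize: return the substream with start <= created < end."""
        sub = Stream()
        for doc in self._docs:
            if start <= doc.created < end:
                sub.add(doc)
        return sub

    def summarize(self):
        """Summarize: count documents per year."""
        return Counter(doc.created.year for doc in self._docs)

    def titles(self):
        return [doc.title for doc in self._docs]
```

Because every operator returns another stream, operators can be chained, which is roughly the property that makes a single time-ordered structure attractive as a replacement for folders.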

While Freeman anticipated freedom from common desktop computing metaphors, the Web had not reached ubiquity 12 years ago. His lifestreams principles live on in the software interfaces of Twitter, Delicious, Facebook, and FriendFeed. But have you tried to find a tweet from three months ago? How about something you wrote on Facebook last year? And FriendFeed discussions have no obvious URL, so there’s no easy way to return to the past. This planned obsolescence is by design, and the stream comes and goes like an information bomb.

In The Anxiety of Obsolescence, Pomona College English professor Kathleen Fitzpatrick says that “The Internet is merely the latest of the competitors that print culture has been pitted against since the late nineteenth century. Threats to the book’s presumed dominance over the hearts and minds of Americans have arisen at every technological turn—or so the rampant public discourse of print’s obsolescence would lead one to believe.” Fitzpatrick goes on to say that her work is dedicated to demonstrating the “peaceable coexistence of literature and television, despite all the loud claims to the contrary.” This objective is a useful response to the usual kvetching about the utter uselessness of the activity stream of the day.

A Standard for Sharing
Now popularized as activity streams, the flow of information has gained appeal because it gives users a way to curate their own information. Yet there is no standard way for this information to be recast by the user or data providers in a way that preserves privacy or archival access.

Chris Messina has advocated for social network interoperability, and suggests that “with a little effort on the publishing side, activity streams could become much more valuable by being easier for web services to consume, interpret and to provide better filtering and weighting of shared activities to make it easier for people to get access to relevant information from people that they care about, as it happens.” Messina points out that activity streams “provide what all good news stories provide: the who, what, when, where and sometimes, how.”

In the digital age, activity streams could be used as a way to record interactions with scholarly materials. Just as COUNTER and Metrics from Scholarly Use of Electronic Records (MESUR) record statistics about how journal articles are viewed, an activity stream standard could be used to provide context around browsing.
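As a sketch of what such a record might look like, the Python below models a single activity with the "who, what, when, where" fields Messina describes, plus a privacy flag so that the user decides what leaves the stream. The field names and the `export_public` helper are assumptions made for illustration; they do not come from any published activity stream specification.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Activity:
    actor: str                       # who
    verb: str                        # what was done
    obj: str                         # what it was done to
    published: str                   # when, as an ISO 8601 string
    location: Optional[str] = None   # where, if known
    private: bool = True             # nothing is shared unless the user opts in


def export_public(activities):
    """Serialize only the activities the user has chosen to share."""
    shared = [asdict(a) for a in activities if not a.private]
    return json.dumps(shared, indent=2)
```

Defaulting `private` to `True` is a design choice in this sketch: an archive can keep the full record while only the opted-in entries are handed to other services.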

For example, Swarthmore has a fascinating collection of W.H. Auden incunabula. You can see what books he checked out, the books he placed on reserve for his students, and even his unauthorized annotations, including his exasperated response on his own work, “Oh God, what rubbish.” What seemed ephemeral is a fascinating exercise in tracing the thought of a poet in America at a crucial period in his scholarly development. If we had captured what Auden was listening to, reading, and attending at the same time, what a treasure trove it would be for biographers and scholars.

The Appeal of Activity Streams
In 2007, Dan Chudnov wrote in Social Software: You Are an Access Point, “There’s a downside to all of this talk of things ‘social.’ As soon as you become an access point, you also become a data point. Make no mistake: Facebook and MySpace wouldn’t still be around if they couldn’t make a lot of money off of each of us, so remember that while your use of these services makes it all seem better for everybody else, the sites’ owners are skimming profit right off the top of that network effect.” How then can the user access and understand their own streams and data points?

Maciej Ceglowski, former Mellon Foundation grant officer and Yahoo engineer, has founded an antisocial bookmarking service called Pinboard, which puts user privacy ahead of monetization and sharing features. One of its appealing features is placing the user at the center of what they choose to share, without presuming that the record is open by default. In fact, bookmarks can be made private with ease.

In The Information Bomb, Paul Virilio wrote that “Digital messages and images matter less than their instantaneous delivery: the shock effect always wins out over the consideration of the informational content. Hence the indistinguishable and unpredictable character of the offensive act and the technical breakdown.” Users can manage or drown in the stream. To safeguard this information, users should push for their own data to be made available so that they can make educated choices.

With the well-founded Department of Justice inquiry into the Google Book project about monopoly pricing and privacy, libraries can now ask for book usage information. Just as position information enables the Hathi Trust to provide full-text searchability, usage information would give libraries a way to better serve patrons, and to give special collections a treasure trove of information.

The Information Bomb
Paul Virilio; Verso Books


Annotating Video

It seems that everything’s available online, except the ability to search for particular video scenes. Recently, I was searching for an actress I’d last seen in a film 15 years ago and imdb.com was no help. I eventually found Lena Olin by watching the credits, but the experience made me wonder if video standards could aid the discovery process.

In a conversation last year, Kristen Fisher Ratan of HighWire Press wondered if there was a standards-based way to jump to a particular place in a video, which YouTube currently offers through URL parameters. This is an obvious first step for citation, much as the page number is the lingua franca of academic citations and footnotes. And after a naming convention is established, the ability to retrieve passages and to optimize by searching strings is a basic requirement for all video applications.
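A time-offset citation of this kind can be sketched in a few lines of Python. The parameter name `t`, the choice of whole seconds as the unit, and the `video.example.org` address are all assumptions for illustration; each video site defines its own convention, which is exactly the gap a standard would close.

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def cite_at(url, seconds, param="t"):
    """Return `url` with a start-time query parameter set to `seconds`."""
    if seconds < 0:
        raise ValueError("offset must not be negative")
    parts = urlparse(url)
    # Drop any existing offset so the citation carries exactly one.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    query.append((param, str(int(seconds))))
    return urlunparse(parts._replace(query=urlencode(query)))


def timestamp(seconds):
    """Format an offset as h:mm:ss for a human-readable footnote."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"
```

The pairing matters: the URL form is what a machine follows, while the `h:mm:ss` form plays the role the page number plays in a printed footnote.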

Josh Bernoff, a Forrester researcher, is quite skeptical about video standards, saying, “Don’t expect universal metadata standards. Standards will develop around discrete applications, driven primarily by distributors like cable and satellite operators.” While this is likely true of the present, use of established markup languages like RDF using relevant subsets of Dublin Core extensions could enable convergence. As John Toebes, Cisco chief architect, wrote for the W3C Video on the Web workshop, “Industry support for standards alignment, adoption, and extension would positively impact the overall health of the content management and digital distribution industry.”

Existing Models
It’s useful to examine the standards that have formed around still images, since there is a mature digital heritage for comparisons. NISO’s Standard and Data Dictionary for Digital Still Images, known as MIX, is a comprehensive guide for defining the fields that are in use for managing images.

IPTC and EXIF standards for images have the secondary benefit of embedding metadata so that information is added at the point of capture in a machine-readable format. However, many images, particularly historical ones, need metadata to be added. Browsing Flickr images gives an idea of the model: camera information comes from the EXIF metadata, and IPTC can be used to capture rights information. However, tags and georeferencing are typically added after the image has been taken, which requires a different standard.

Fotonotes is one of the best annotation technologies going, and has been extended by Flickr and others to give users and developers the ability to add notes to particular sections of an image. The annotations are saved in an XML file, and are easily readable, if not exactly portable.

The problem
For precise retrieval, video requires either a text transcript or complete metadata. Jane Hunter and Renato Iannella did an excellent job of proposing a model system for news video indexing using RDF and Dublin Core extensions in their proposal, now ten years old. There has been some standardization around the use of Flash and MPEG standards for web display of video, which narrows the questions just as PDF adoption standardized journal article display.

With renewed interest in Semantic Web technologies from the Library of Congress and venture capital investors, the combination of Dublin Core extensions for video and the implementation of SMIL (pronounced smile) may be prime territory for mapping to an archival standard for video.

Support is being built into Firefox and Safari, but the exciting part of SMIL is that it can reference metadata from markup. So, if you have a video file, metadata about the object, a transcript, and various representations (archival, web, and mobile encodings of the file), SMIL can contain the markup for all of these things. Simply stated, SMIL is a text file that describes a set of media files and how they should be presented.
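To show how small such a file can be, the Python sketch below writes a SMIL document that bundles descriptive metadata, several encodings of one video, and a transcript. The element names (`smil`, `head`, `meta`, `body`, `par`, `switch`, `video`, `textstream`) and the `systemBitrate` test attribute are part of SMIL; the `build_smil` helper, the file names, and the bitrate thresholds are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_smil(title, encodings, transcript):
    """Build a small SMIL document as a string.

    encodings:  list of (src, minimum_bitrate) pairs, best quality first.
    transcript: path to a timed-text file shown alongside the video.
    """
    smil = ET.Element("smil")
    head = ET.SubElement(smil, "head")
    ET.SubElement(head, "meta", name="title", content=title)

    body = ET.SubElement(smil, "body")
    par = ET.SubElement(body, "par")        # children of <par> play in parallel
    switch = ET.SubElement(par, "switch")   # player uses the first acceptable child
    for src, bitrate in encodings:
        ET.SubElement(switch, "video", src=src, systemBitrate=str(bitrate))
    ET.SubElement(par, "textstream", src=transcript)

    return ET.tostring(smil, encoding="unicode")


doc = build_smil(
    "Oral history interview",                 # hypothetical item
    [("archival.mpg", 2000000), ("web.flv", 300000), ("mobile.3gp", 56000)],
    "transcript.rt",
)
```

Printing `doc` gives a single text file that points at the archival, web, and mobile encodings at once, so the metadata and transcript travel with every representation of the object.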

Prototypes on the horizon
Another way of obtaining metadata is through interested parties or scholars collaborating to create a shared pool of information to reference. The Open Annotation Collaboration, just now seeking grant funding, and featuring Herbert van de Sompel and Jane Hunter as investigators, seeks to establish a mechanism for client-side integration of video snippets and text as well as machine-to-machine interaction for deeper analysis and collection.

And close by is a new Firefox add-on, first described in D-Lib as NeoNote, which promises a similar option for articles and videos. One attraction it offers is the ability for scholars to capture their annotations, share them selectively, and use a WebDAV server for storage. This assumes a certain level of technical proficiency, but the distributed approach to storage has been a proven winner in libraries for many years now.

The vision
Just as the DOI revolutionized journal article URL permanence, I hope for a future where a video URL can be passed to an application and all related annotations can be retrieved, searched, and saved for further use. Then, my casual search for the actress in The Reader and The Unbearable Lightness of Being will be a starting point for retrieval instead of a journey down the rabbit hole.


Are You Paying Attention?

Not for the first time, the glut of incoming information threatens to reduce useful knowledge to a mere cloud of data. And there’s no doubt that activity streams and linked data are two of the more interesting things to aid research in this onrushing surge of information. In this screen-mediated age, the advantages of deep focus and hyper attention are mixed up like never before, since the advantage accrues to the company who can collect the most data, aggregate it, and repurpose it to willing marketers.

N. Katherine Hayles does an excellent job of distinguishing between the uses of hyper and deep attention without privileging either. Her point is simple: “Deep attention is superb for solving complex problems represented in a single medium, but it comes at the price of environmental alertness and flexibility of response. Hyper attention excels at negotiating rapidly changing environments in which multiple foci compete for attention; its disadvantage is impatience with focusing for long periods on a noninteractive object such as a Victorian novel or complicated math problem.”

Does data matter?
The MESUR project is one of the more interesting research projects going, now living on as a product from Ex Libris called bx. Under the hood, MESUR looks at the research patterns of searches, not simply the number of hits, and stores the information as triples, or subject-predicate-object information in RDF, the resource description framework. RDF triple stores can put the best of us to sleep, so one way of thinking about it is smart filters. Having semantic information available allows computers to distinguish between Apple the fruit and Apple the computer.
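For readers who would rather see triples than read about them, here is a minimal in-memory sketch in Python. It is not how MESUR or any production triple store is built; the `ex:` identifiers are made up for the example, and real RDF would use full URIs and a dedicated library.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")


class TripleStore:
    """A toy store of subject-predicate-object statements."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add(Triple(s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t.subject == s)
            and (p is None or t.predicate == p)
            and (o is None or t.object == o)
        )


store = TripleStore()
store.add("ex:Apple_Inc", "rdf:type", "ex:Company")
store.add("ex:Apple_Inc", "rdfs:label", "Apple")
store.add("ex:Apple_fruit", "rdf:type", "ex:Fruit")
store.add("ex:Apple_fruit", "rdfs:label", "Apple")


def senses(label):
    """Map each resource carrying `label` to its declared type."""
    found = {}
    for t in store.match(p="rdfs:label", o=label):
        for typed in store.match(s=t.subject, p="rdf:type"):
            found[t.subject] = typed.object
    return found
```

A plain keyword search sees one string, "Apple"; `senses("Apple")` sees two resources with different types, which is the whole of the "smart filter" idea in miniature.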

In use, semantic differentiation gives striking information gains. I picked up the novel Desperate Characters, by Paula Fox. While reading it, I remembered that I first heard it mentioned in an essay by Jonathan Franzen, who wrote the foreword to the edition I purchased. This essay was published in Harper’s, and the RDF framework in use on harpers.org gave me a way to see both articles by Franzen and articles about him. This semantic disambiguation is the obverse of the firehose of information that is monetized from advertisements.

Since MESUR is pulling information from the SFX link resolver service logs at CalTech and Los Alamos National Laboratory, there’s an immediate relevance filter applied, given the scientists who are doing research in those institutions. Using the information contained in the logs, it’s possible to see if a given IP address (belonging to a faculty member or department) goes through an involved research process or a short one. The researcher’s clickstream is captured and fed back for better analysis. Any subsequent researcher who clicks on a similar SFX link has a recommender system seeded with ten billion clickstreams. This promises researchers a smarter Works Cited, so that they can see what’s relevant in their field prior to publication. Competition just got smarter.
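The general idea of seeding recommendations from clickstreams can be sketched with simple co-occurrence counting. This is a stand-in technique chosen for illustration, not the method MESUR or bx actually uses, and the session data and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(sessions):
    """Count how often two items appear in the same research session."""
    related = defaultdict(Counter)
    for session in sessions:
        # Use each item once per session so repeat clicks are not overcounted.
        for a, b in combinations(sorted(set(session)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def recommend(related, item, n=3):
    """Return up to n items most often seen alongside `item`."""
    neighbours = related.get(item, Counter())
    return [other for other, _ in neighbours.most_common(n)]
```

Given sessions such as `[["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["a", "c"]]`, article `a` co-occurs most often with `b`, then `c`, so those would be suggested first to the next reader who lands on `a`.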

Standards-based way of description
Attention.xml, first proposed in 2004 as an open standard by Technorati technologist Tantek Çelik and journalist Steve Gillmor, promised to give priority to items that users want to see. The problem, articulated five years ago, was that feed overload is real, and the need to see new items and what friends are also reading requires a standard that allows for collaborative reading and organizing.

The standard seems to have been absorbed into Technorati, but the concept lives on in the latest beta of Apple’s browser Safari, which lists Top Sites by usage and recent history, as does Firefox’s Speed Dial. And of course, Google Reader has Top Recommendations, which tries to leverage the enormous corpus of data it collects into useful information.

Richard Powers’ novel Galatea 2.2 describes an attempt to train a neural network to recognize the Great Books, but finds socializing online to be a failing project: “The web was a neighborhood more efficiently lonely than the one it replaced. Its solitude was bigger and faster. When relentless intelligence finally completed its program, when the terminal drop box brought the last barefoot, abused child on line and everyone could at last say anything to everyone else in existence, it seemed to me we’d still have nothing to say to each other and many more ways not to say it.” Machine learning has its limits, including whether the human chooses to pay attention to the machine in a hyper or deep way.

Hunch, a web application designed by Caterina Fake, known as co-founder of Flickr, is a new example of machine learning. The site offers to “help you make decisions and gets smarter the more you use it.” After signing up, you’re given a list of preferences to answer. Some are standard marketing questions, like how many people live in your household, but others are clever or winsome. The answers are used to construct a probability model, which is used when you answer “Today, I’m making a decision about…” As the application is a work in progress, it’s not yet a replacement for a clever reference librarian, even if its model is quite similar to the classic reference interview. It turns out that machines are best at giving advice about other machines, and if the list of results incorporates something larger than the open Web, then the technology could represent a leap forward. Already, it does a brilliant job at leveraging deep attention to the hypersprawling web of information.

How to Achieve True Greatness
Privacy has long returned to norms first seen in small-town America before World War II, and our sense of self is next up on the block. This is as old as the Renaissance described in Baldesar Castiglione’s The Book of the Courtier and as new as Twitter, the new party line, which gives ambient awareness of people and events.

In this age of information overload, it seems like a non sequitur that technology could solve what it created. And yet, since the business model of the 21st century is based on data and widgets made of code, not things, there is plenty of incentive to fix the problem of attention. Remember, Google started as a way to assign importance based on who was linking to whom.

This balance is probably best handled by libraries, with their obsessive attention to user privacy and reader needs, and librarians are the frontier between the machine and the person. The open question is, will the need to curate attention be overwhelming to those doing the filtering?

Galatea 2.2
Richard Powers; Farrar, Straus, Giroux

