Open Libraries “… are signs of life and hope: They are the cornerstone of democracy”

Are You Paying Attention?

Not for the first time, the glut of incoming information threatens to reduce useful knowledge to a mere cloud of data. And there’s no doubt that activity streams and linked data are two of the more interesting aids to research in this onrushing surge of information. In this screen-mediated age, the advantages of deep focus and hyper attention are mixed up like never before, since the advantage accrues to the company that can collect the most data, aggregate it, and repurpose it for willing marketers.

N. Katherine Hayles does an excellent job of distinguishing between the uses of hyper and deep attention without privileging either. Her point is simple: “Deep attention is superb for solving complex problems represented in a single medium, but it comes at the price of environmental alertness and flexibility of response. Hyper attention excels at negotiating rapidly changing environments in which multiple foci compete for attention; its disadvantage is impatience with focusing for long periods on a noninteractive object such as a Victorian novel or complicated math problem.”

Does data matter?
The MESUR project is one of the more interesting research projects going, now living on as a product from Ex Libris called bx. Under the hood, MESUR looks at the research patterns of searches, not simply the number of hits, and stores the information as triples, or subject-predicate-object statements in RDF, the Resource Description Framework. RDF triple stores can put the best of us to sleep, so one way to think about them is as smart filters. Having semantic information available allows computers to distinguish between Apple the fruit and Apple the computer.
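To make the "smart filter" idea concrete, here is a minimal sketch in plain Python. It is not MESUR's or bx's data model: the triples, the `ex:` identifiers, and the helper names are all hypothetical, chosen only to show how a type statement lets a program tell two things apart that share the label "Apple".

```python
# Hypothetical subject-predicate-object triples (illustrative names only).
TRIPLES = [
    ("ex:Apple_Inc", "rdf:type", "ex:Company"),
    ("ex:Apple_Inc", "rdfs:label", "Apple"),
    ("ex:Apple_fruit", "rdf:type", "ex:Fruit"),
    ("ex:Apple_fruit", "rdfs:label", "Apple"),
    ("ex:Article_1", "ex:mentions", "ex:Apple_Inc"),
    ("ex:Article_2", "ex:mentions", "ex:Apple_fruit"),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def articles_about(triples, label, kind):
    """Find articles mentioning a thing with this label AND this type."""
    # Everything that carries the label, whatever it is.
    labelled = {s for s, _, _ in match(triples, p="rdfs:label", o=label)}
    # Keep only the things of the requested type: this is the filter.
    typed = {s for s in labelled if match(triples, s=s, p="rdf:type", o=kind)}
    return sorted(
        article
        for article, _, target in match(triples, p="ex:mentions")
        if target in typed
    )


print(articles_about(TRIPLES, "Apple", "ex:Company"))
print(articles_about(TRIPLES, "Apple", "ex:Fruit"))
```

A keyword search for "Apple" would return both articles; the type statement is what separates them. A production system would use an RDF library and a query language such as SPARQL instead of list comprehensions.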

In use, semantic differentiation gives striking information gains. I picked up the novel Desperate Characters, by Paula Fox. While reading it, I remembered that I first heard it mentioned in an essay by Jonathan Franzen, who wrote the foreword to the edition I purchased. This essay was published in Harper’s, and the RDF framework in use on harpers.org gave me a way to see articles both by Franzen and about him. This semantic disambiguation is the obverse of the firehose of information that is monetized from advertisements.

Since MESUR is pulling information from Caltech and Los Alamos National Laboratory’s SFX link resolver service logs, there’s an immediate relevance filter applied, given the scientists who are doing research in those institutions. Using the information contained in the logs, it’s possible to see if a given IP address (belonging to a faculty member or department) goes through an involved research process, or a short one. The researcher’s clickstream is captured, and fed back for better analysis. Any subsequent researcher who clicks on a similar SFX link has a recommender system seeded with ten billion clickstreams. This promises researchers a smarter Works Cited, so that they can see what’s relevant in their field prior to publication. Competition just got smarter.
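One simple way a recommender can be seeded from clickstreams is to count which items turn up in the same research session. The sketch below assumes that approach; it is not a description of how MESUR or bx actually works, and the session data and function names are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical sessions: each list is the set of articles one researcher
# clicked through in a single sitting (letters stand in for article IDs).
SESSIONS = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "D"],
    ["A", "C"],
]


def build_cooccurrence(sessions):
    """Count how often each pair of articles appears in the same session."""
    co = defaultdict(Counter)
    for session in sessions:
        # set() so a repeated click in one session is only counted once.
        for a, b in combinations(sorted(set(session)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def recommend(co, article, k=3):
    """Return up to k articles most often seen alongside the given one."""
    neighbours = co.get(article, Counter())
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


co = build_cooccurrence(SESSIONS)
print(recommend(co, "A"))
print(recommend(co, "B"))
```

With only four sessions the counts are trivial, but the same tally over billions of clickstreams from a narrow research community is what gives the recommendations their built-in relevance filter.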

Standards-based way of description
Attention.xml, first proposed in 2004 as an open standard by Technorati technologist Tantek Çelik and journalist Steve Gillmor, promised to give priority to items that users want to see. The problem, articulated five years ago, was that feed overload is real, and the need to see new items and what friends are also reading requires a standard that allows for collaborative reading and organizing.

The standard seems to have been absorbed into Technorati, but the concept lives on in the latest beta of Apple’s browser Safari, which lists Top Sites by usage and recent history, as does Firefox’s Speed Dial. And of course, Google Reader has Top Recommendations, which tries to leverage the enormous corpus of data it collects into useful information.

Richard Powers’ novel Galatea 2.2 describes an attempt to train a neural network to recognize the Great Books, but its narrator finds socializing online to be a failing project: “The web was a neighborhood more efficiently lonely than the one it replaced. Its solitude was bigger and faster. When relentless intelligence finally completed its program, when the terminal drop box brought the last barefoot, abused child on line and everyone could at last say anything to everyone else in existence, it seemed to me we’d still have nothing to say to each other and many more ways not to say it.” Machine learning has its limits, including whether the human chooses to pay attention to the machine in a hyper or deep way.

Hunch, a web application designed by Caterina Fake, known as co-founder of Flickr, is a new example of machine learning. The site offers to “help you make decisions and gets smarter the more you use it.” After signing up, you’re given a list of preferences to answer. Some are standard marketing questions, like how many people live in your household, but others are clever or winsome. The answers are used to construct a probability model, which is used when you answer “Today, I’m making a decision about…” As the application is a work in progress, it’s not yet a replacement for a clever reference librarian, even if its model is quite similar to the classic reference interview. It turns out that machines are best at giving advice about other machines, and if the list of results incorporates something larger than the open Web, then the technology could represent a leap forward. Already, it does a brilliant job at leveraging deep attention to the hypersprawling web of information.

How to Achieve True Greatness

Privacy has long since returned to norms first seen in small-town America before World War II, and our sense of self is next up on the block. This is as old as the Renaissance described in Baldesar Castiglione’s The Book of the Courtier and as new as Twitter, the new party line, which gives ambient awareness of people and events.

In this age of information overload, it seems like a non sequitur that technology could solve what it created. And yet, since the business model of the 21st century is based on data and widgets made of code, not things, there is plenty of incentive to fix the problem of attention. Remember, Google started as a way to assign importance based on who was linking to whom.

This balance is probably best handled by libraries, with their obsessive attention to user privacy and reader needs, and librarians are the frontier between the machine and the person. The open question is, will the need to curate attention be overwhelming to those doing the filtering?


Lock-in leads to lockdown

What goes up must come down. This simple law of gravity can be seen in baseball, and these days, the stock market.

As I attended the Web 2.0 conference in New York recently, I had occasion to ask Tim O’Reilly what he thought about libraries. “Well, OCLC’s doing some good things,” he said. I encouraged him to continue looking at library standards, as the 2006 Reading 2.0 conference pulled together a number of interesting people who have been poking at the standards that knit libraries and publishers together. 

But the phrase Web 2.0, coined by O’Reilly, was showing signs of age. Compared with the halcyon days, when every recently funded website showed rounded corners and artful form submission fades, the new companies were a shadow of their former booth size. Sharing space with the Interop conference, Web 2.0 was the bullpen to the larger playing field.

Interoperability
What helps companies to grow and expand? Some posit that the value of software is estimated by lock-in, that is, the number of users who would incur switching costs by moving to a competitor or another platform.

In the standards world, lock-in is antithetical to good functioning. Certainly proprietary products and features play a role to keep innovation happening, but cultural institutions are too important to risk balkanization of data for short-term profits.

Trusted peers
It seems to me that curation has moved to the network level, and a certain amount of democratization is now possible. The cautions about privacy and users as access points are true and useful, but librarians and publishers have a role in recommending information, and this is directly correlated to expert use of recommender systems. Web 2.0 applications like del.icio.us for bookmarks, last.fm for music, and Twitter and Facebook for social networks provide a level of personal guidance that was algorithmically impossible before data was easily collectible.

After last.fm’s 2007 purchase by CBS Music, public collective data about listening habits was deemed “too valuable” to be mashed up by programmers any longer. In the library world, there’s a unique opportunity to give users the ability to see recommendations from trusted people. Though del.icio.us does this quite well for Internet-accessible sources, there’s an opportunity extant for the scholarly publishers to standardize on a method. Elsevier’s recent Article 2.0 contest shows encouraging signs of moving towards a release of control back to the authors and institutions that originally wrote and sponsored the work.

In the end, though, companies that are forced to choose between opening up their data or paying their employees will not likely choose the long-term reward. Part of this difficulty, however, has been tied to the lack of available legal options, standards, or licenses for releasing data into the public domain. The Creative Commons project has pointed many people to defined choices if they choose to release their works into the public domain or license them for reuse.

Jonathan Rochkind of Johns Hopkins University points out that “A Creative Commons license is inappropriate for cataloging records, precisely because they are unlikely to be copyrightable. The whole legal premise of Creative Commons (and open source) licenses is that someone owns the copyright, and thus they have the right to license you to use it, and if you want a license, these are the terms. If you don’t own a copyright in the first place, there’s no way to license it under Creative Commons.”

The Open Data Commons has released a set of community norms for sharing data. This is a great step towards a standard way of separating profit concerns from the public good, and also frees companies from agonizing legal discussions about liability and best practices. 

Standard widgets
If sharing entire data sets isn’t feasible, one practice that was nearly universal in Web 2.0 companies was the use of widgets to embed data and information.

In his prescient entry, “Blogs, widgets, and user sloth,” Stu Weibel describes the difficulty he had installing a widget, a still-depressing reality today.

Netvibes, a company that provides personalized start pages, has proposed a standard for a universal widget API. The jOPAC, an “integrated web widget,” uses this suggestion to make its library catalog embeddable in several online platforms and operating systems. Since widgets are still being used for commercial ventures, there seems to be an opportunity to define a clear method of data exchange. The University of Pennsylvania’s Library Portal is a good example of where this future could lead, as its portal page is flexible and customizable.

Perhaps a widget standard would give emerging companies and established ventures a method to exchange information in a way that promotes profits, privacy, and potential.


Jhumpa Lahiri • Unaccustomed Earth

Sometimes, a short story sticks with you until you find it, with pleasure, living in a larger collection. In 1991, standing up in a Chicago bookstore, I read a short story by Tobias Wolff that I looked for until it was included in The Night in Question.

Jhumpa Lahiri’s new book of short stories, Unaccustomed Earth, contains another haunting story, “Nobody’s Business,” first published in The New Yorker in 2001. It gives a stark account of graduate student despair: first at life delayed due to years of study, then postponed because of deferred relationships left to explode into messy life. Paul, the narrator, gives an outsider’s account of Indian courtship rituals drawn into housemate drama. Desperate to prove his innocence of what he learns, he provides his housemate with telephonic evidence of how she is being betrayed.

Lahiri isn’t afraid to show life as it is. Painful, entangled with family obligations and academic aspirations, the stories show adult parents and children reaching accommodations with hidden truths and adjustments to immigrant life. Her stories show how second-generation Bengali immigrants draw pleasure from their Harvard and MIT PhDs, just as their accomplishments push them away from their families of origin. When the characters marry outside their connections, as in “Only Goodness,” they feel guilt and relief in equal measure.

The final three stories, linked through the characters Hema and Kaushik, give a tragic account of a family left to reconstitute itself after a mother’s early death rips it asunder. Though Lahiri leaves a narrative option for easy closure, the devastating ending feels, well, like life in the midst of death.


Mining for Meaning

 In David Lodge’s 1984 novel, Small World, a character remarks that literary analysis of Shakespeare and T.S. Eliot “would just lend itself nicely to computerization….All you’d have to do would be to put the texts on to tape and you could get the computer to list every word, phrase and syntactical construction that the two writers had in common.”
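The character's proposal, at least the part about listing every word two writers have in common, takes only a few lines today. The following is a rough sketch under simple assumptions: words are runs of letters and apostrophes, case is ignored, and phrases and syntactic constructions are left out entirely. The function names are illustrative, and the two sample lines stand in for full texts.

```python
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def shared_vocabulary(text_a, text_b):
    """List words found in both texts with the smaller of the two counts."""
    # Counter & Counter keeps only shared keys, with the minimum count.
    common = word_counts(text_a) & word_counts(text_b)
    # Most frequent shared words first; ties in alphabetical order.
    return sorted(common.items(), key=lambda kv: (-kv[1], kv[0]))


shakespeare = "The quality of mercy is not strained"
eliot = "This is the way the world ends, not with a bang but a whimper"

print(shared_vocabulary(shakespeare, eliot))
```

On these two lines the overlap is nothing but function words, which hints at why a bare list of shared vocabulary is a long way from meaning.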

This brave new world is upon us, but the larger question for Google and OCLC, among other purveyors of warehoused metadata and petabytes of information, is how to achieve meaning. One of the brilliant insights derived from Terry Winograd’s research and mentoring is that popularity in the form of inbound links does matter for web pages, at least. In the case of all the world’s books turned into digitized texts, it’s a harder question to assign meaning without popularity, a canon, or search queries as a guide.

Until recently, text mining wasn’t possible at great scale. And as the great scanning projects continue on their bumpy road, the mysteries of what will come out of them have yet to emerge into meaning for users.

Nascent standards
Bill Kasdorf pointed out several XML models for books in his May NISO presentation, including NISO/ISO 12083, TEI, DocBook, NLM Book DTD, and DTBook. These existing models have served publishers well, though they have been employed for particular uses and have not yet found common ground across the breadth of book types. The need for a standard has never been clearer, but it will require vision and a clear understanding of solved problems to push forward.

After the professor in Small World gains access to a server, he grows giddy with the possibilities of finding “your own special, distinctive, unique way of using the English language….the words that carry a distinctive semantic content.” While we may be delighted about the possibilities that searching books affords, there is the distinct possibility that the world of the text could be changed completely.

Another mechanism for assigning meaning to full text has been opened up by web technology and science. The Open Text Mining Interface is a method championed by Nature Publishing Group as a way to share the contents of their archives in XML for the express purpose of text mining while preserving intellectual property concerns. Now in a second revision, the OTMI is an elegant method of enabling sharing, though it remains to be seen if the initiative will spread to a larger audience.

Sense making
As the corpus lurches towards the cloud, one interesting example of semantic meaning comes in the Open Calais project, an open platform by the reconstituted Thomson Reuters. When raw text is fed into the Calais web service, terms are extracted and fed into existing taxonomies. Thus, persons, countries, and categories are first identified and then made available for verification.
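The general pattern described here, extracting terms from raw text, slotting them into categories, and holding them for verification, can be sketched with a simple lookup list. This is emphatically not the Calais web service or its API, which is not shown in this post; the taxonomy, field names, and function below are hypothetical, and a real service would rely on statistical language processing instead of exact string matching.

```python
import re

# Hypothetical, hand-built taxonomy: category -> known terms.
TAXONOMY = {
    "Person": ["Melvil Dewey", "Paula Fox"],
    "Organization": ["Library of Congress", "Thomson Reuters"],
    "Country": ["United States", "Canada"],
}


def extract_entities(text, taxonomy=TAXONOMY):
    """Find known terms in raw text and tag each with its category."""
    found = []
    for category, terms in taxonomy.items():
        for term in terms:
            # Whole-word, case-sensitive match of the exact term.
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text):
                found.append({
                    "term": term,
                    "category": category,
                    "offset": m.start(),
                    # Nothing is trusted until a person confirms it.
                    "verified": False,
                })
    # Report entities in the order they appear in the text.
    return sorted(found, key=lambda e: e["offset"])


sample = "Melvil Dewey never worked for Thomson Reuters in Canada."
for entity in extract_entities(sample):
    print(entity["offset"], entity["category"], entity["term"])
```

The `verified` flag marks the hand-off point: the machine proposes persons, organizations, and countries, and a human indexer confirms or rejects them.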

This experimental service has proved its value for unstructured text of many kinds, extracting meaning from sources that range from the most recent weblog posting to historic newspapers newly scanned into text via Optical Character Recognition (OCR). Since human-created metadata and indexing services are among the most expensive things libraries and publishers create, any mechanism to optimize human intelligence by using machines to create meaning is a useful way forward.

Calais shows promise for metadata enhancement, since full text can be mined for its word properties and fed into taxonomic structures. This could be the basis for search engines that understand natural language queries in the future, but could also be a mechanism for accurate and precise concept browsing.

Glimmers of understanding
One method of gaining new understanding is to examine solved problems. Melvil Dewey understood vertical integration, as he helped with innovations around 3×5 index cards and cabinets, as well as the classification system that bears his name. Some even say he was the first standard-bearer for libraries, though it’s hard to believe that anyone familiar with standards can imagine that one person could have actually been entirely responsible.

Another solved problem is how to make information about books and journals widely available. This has been done twice in the past century: first with the printed catalog card, distributed by the Library of Congress for the greater good, and then with the distributed catalog record, provided at great utility (and cost) by the Online Computer Library Center.

Pointers are no longer entirely sufficient, since the problem is not only how to find information but how to make sense of it once it has been found. Linking from catalog records has been a partial solution, but the era of complete books online is now entering its second decade. The third stage is upon us.

