Carnegie Mellon University

NSF and Digital Libraries

It's a truism today that "more data beats better algorithms" (Halevy, Norvig, and Pereira). Even in the 1960s we knew that we needed more content to improve information retrieval. Early attempts to build large corpora were limited by funding and jealousy, with both the Oxford Text Archive and Project Gutenberg stumbling at first.

At the 1991 SIGIR conference in Chicago, Maria Zemankova asked for a plan comparable to what the database community had done at the Asilomar meetings, or the astronomers in their Decadal plans. Mike McGill, Ed Fox, and I proposed digitizing all works needed for a first degree in any science. The plan was ambitious, and it did not deal effectively with copyright. But Y. T. Chien approved, and the idea found favor in the Clinton Administration, which changed the name "electronic library" to "digital library".

The result was the 1993 DLI-1 project. All the projects were supposed to have a collection. Six institutions were funded: CMU, UM, UIUC, UCB, Stanford, and UCSB, with topics of video, metadata, XML to HTML, environment, interoperability, and geography, respectively. A number of these projects were overrun by larger industry efforts: the publishing industry went online without specific help, and so did video. Stanford, although it punted on the collection, created Google.

The second DLI, five years later, made a stronger effort to force libraries into the program and covered a wider range of collections. DARPA remained as a funder, and NSF gained IMLS, NLM, NEH, and even the FBI as supporting agencies. The breadth was impressive: music at JHU, paleontology at Texas, patient care at Columbia, speech at MSU, and many others. Key questions were content, algorithms, and sustainability. We still have the Harvard Dataverse, the Perseus project, and several others.

However, despite the impressive breadth, we were missing depth. Researchers would note that we had a literary text but not the scholarly commentary, or that we only had samples of video. There were other efforts at comprehensive collections - for example, an effort to digitize all 18th-century newspapers. But the Million Book Project was one of the steps towards the current world of what Brewster Kahle calls "Universal Access to All Knowledge".