Lunch and Learn: John Wilkin, Jon Stroop, and Marvin Bielawski on HathiTrust.

Hathi Trust Digital Library Catalog

Image by taihung via Flickr


John Wilkin at the University of Michigan, and Jon Stroop and Marvin Bielawski at Princeton University, are helping HathiTrust to digitize and share the world’s recorded knowledge using the combined effort of fifty institutions. HathiTrust is described on its web site as “a partnership of major research institutions and libraries working to ensure that the cultural record is preserved and accessible long into the future” and its mission is “to contribute to the common good by collecting, organizing, preserving, communicating, and sharing the record of human knowledge.” Wilkin explains that HathiTrust is often associated with the Google scanning project, but that it is a misrepresentation to consider the two efforts one and the same. HathiTrust also contains works scanned for institutions by the Internet Archive and by the institutions themselves. It has its own set of values, quality standards, and goals that act as filters for data from Google, and perhaps the most important distinction is the project’s attention to attaching the most accurate metadata possible to the scanned items.

When Wilkin was asked what was in HathiTrust’s catalog that was not in Google’s scanning project catalog, he explained that HathiTrust has a much higher standard of quality for bibliographic and other metadata for scanned items, and sometimes must refuse scanned items from Google that do not meet these standards. The reason that Google is essential to the process, Wilkin noted, is the volume of scanning that they do. While library scanning efforts of the past might have done 10,000 volumes in a year, Google can easily do that much in a day. HathiTrust is also doing post-1923 public domain determination, while Google is not, according to Wilkin.

There are approximately 8 million scanned items (written works) with properly aligned metadata currently in the HathiTrust database, and Wilkin says that the number will rise to 10 million by the end of 2011, then 12 million by the end of 2012. It is, he says, “a very, very big library.” Jon Stroop noted that Princeton has sent 255,357 items to the HathiTrust catalog since October 2010. Stroop listed the following collections at Princeton as contributors: Architecture Library, Lewis Library, Marquand Library (in May or June of 2011), Firestone Library, Stokes Library, and Special Collections.

HathiTrust is a digital preservation effort, but simply having a digital record is not the point. Wilkin says that access is critical. The website offers an interface that provides a catalog search, a full-text search, and a collection builder and viewer. If you belong to one of the participating institutions, you get some special rights. As a Princeton NetID holder, for instance, you can log in and create a new collection of works to support an academic project.

Given the importance of access to the project, it is important to remove barriers for diverse groups. 74% of the items in the HathiTrust catalog are copyrighted, while 26% are in the public domain. The copyrighted items are generally inaccessible, even to those associated with the project by institution. Not everything after 1922 is in copyright, and one ongoing task in the project is to review the catalog to assess the copyright status of catalog items. While 48% of the catalog’s items are in English, 400 languages are currently represented there.

Sustainability is another key goal for the project, one that HathiTrust takes very seriously. Right now the project uses a “depositor pays” business model, in which the project is paid for by the institutions that use it for storage of items. The atomic cost unit is 1 GB of content, and the price fluctuates over time. At the time of the talk, the price per gigabyte was $3.
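To make the “depositor pays” arithmetic concrete, here is a minimal sketch in Python. It assumes a flat per-gigabyte price with no tiers or minimums, which the talk did not spell out. The function name `deposit_cost` and the 1,000 GB deposit size are hypothetical, and the billing period for the $3 figure is not stated in the source.

```python
from decimal import Decimal

# Price quoted at the time of the talk; the billing period is not stated.
PRICE_PER_GB = Decimal("3.00")


def deposit_cost(gigabytes, price_per_gb=PRICE_PER_GB):
    """Return the cost of storing `gigabytes` of content at a flat per-GB price.

    Assumes a simple linear model (hypothetical): cost = gigabytes * price.
    """
    gb = Decimal(str(gigabytes))
    if gb < 0:
        raise ValueError("gigabytes must be non-negative")
    return gb * price_per_gb


# Hypothetical example: an institution depositing 1,000 GB of scanned content.
print(deposit_cost(1000))
```

Because the per-gigabyte price changes over time, the price is passed as a parameter rather than hard-coded into the calculation.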

The costs of the project are mostly related to maintenance of the servers and datacenter. Storage is about 47% of overall costs. Staff is about 25% of cost. Tape backup and disaster recovery are about 14% of cost.
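The three shares quoted in the talk do not add up to the whole budget. The short Python sketch below simply sums the reported figures to show how much is left unitemized; the dictionary keys are labels chosen here for illustration, and the source does not say what the remainder covers.

```python
# Approximate cost shares (percent of overall cost) as reported in the talk.
reported_shares = {
    "storage": 47,
    "staff": 25,
    "tape backup and disaster recovery": 14,
}

itemized = sum(reported_shares.values())
unitemized = 100 - itemized  # share not broken out in the talk

print(f"itemized: {itemized}%, not itemized: {unitemized}%")
```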

In 2013, HathiTrust plans to implement a new sustainability model. Cost will be based on “holdings overlap”. Academic print books in the collection are already substantially duplicated in the catalog. In June of 2009, the average duplication rate between institutions was 19% of items, meaning that almost a fifth of each institution’s work was being duplicated. By sharing duplicated works that each institution owns digitally, a single digital copy could be retained, and the other copies could be deleted to save on storage and backup costs.

Details on the cost model for HathiTrust are at

Wilkin described three ways in which HathiTrust makes a difference for participants. The first, collective digital curation, drives down costs for materials, increases a cataloged item’s discoverability, improves the quality of archived works through digitizing, reduces bibliographic indeterminacy via collective research, and helps libraries make meaningful decisions about formats and quality. The second, collective print curation, is a means by which to associate all of the participating institutions’ holdings of print materials, which helps librarians perform record-keeping in a coordinated way. The third way is a series of subsidiary benefits. For instance, the HathiTrust process improves descriptions of materials and quantifies problems, such as the size of the public domain.

Wilkin, Stroop and Bielawski explained that the HathiTrust is interested in archiving and sharing the cultural record in a single searchable interface. It is a collaborative effort, which Princeton is a part of, along with 49 other institutions. Many benefits exist in the project, including the quality of metadata, the discoverability of the works, and the cross-organizational sharing of content. To learn more about HathiTrust, visit

Podcast of this talk is available here.

Slides from this talk are available here.

Speaker biographies:

John P. Wilkin is executive director of HathiTrust and associate university librarian for library information technology (LIT) for the University of Michigan. The LIT Division supports the library’s online catalog and related technologies, provides the infrastructure to both digitize and access digital library collections, supports the library’s web presence, and provides frameworks and systems to coordinate Library technology activities. Wilkin previously served as the head of the Digital Library Production Service at the University of Michigan. Among the units in the DLPS is the University of Michigan’s Humanities Text Initiative, an organization responsible for SGML document creation and online systems that Wilkin founded in 1994. He earned graduate degrees in English from the University of Virginia (1980) and Library Science from the University of Tennessee at Knoxville (1986). In 1992, he worked at the University of Virginia as the Systems Librarian for Information Services, where he shaped the Library’s plan for establishing a group of electronic centers and consulted for the University’s Institute for Advanced Technology in the Humanities (IATH) in textual issues.

Marvin Bielawski is Princeton’s Deputy University Librarian and Head of the Library Systems Office. He’s been involved in negotiating the Library’s contract with Google and the settlement amendment. He also advocated for and negotiated Princeton’s contract for membership in the HathiTrust.

Jon Stroop is the Metadata Analyst in the Library Systems Office. He is responsible for the ingest of digital content from Princeton into the HathiTrust and is a member of the Library’s Google Project Steering Committee. Jon is also a co-chair of the Library’s Metadata Committee and serves on the Library of Congress’ MODS (Metadata Object Description Schema) Editorial Committee.



