Our NHPRC-Funded Digitization Project at Six Months

Late last year, the Mudd Manuscript Library was granted an award by the National Historical Publications and Records Commission (NHPRC) to digitize our most-used Public Policy collections, serve them online, and create a report for the larger archival community about cost-efficient digitization practices. Excerpts from our six-month progress report are below.


Work so far

  1. Project planning

From the time we were awarded the grant to the present, we have produced an overall project plan and timeline, a vendor RFQ and plan of work, in-house quality control procedures for vendor-supplied images, a workplan for in-house scanning, and hardware-specific instructions for in-house scanning. All activities are either on schedule or ahead of schedule. Vendor-supplied digitization is currently eight months ahead of schedule.

  1. Finding a vendor

After distributing an RFQ and collecting bids, we decided on The Crowley Company as our vendor, based on both price and our confidence that they would be able to manage the materials and the work carefully and efficiently.

  1. Managing vendor-supplied digitization

Before materials can go out to the vendor, we first create a manifest of everything we want to send by transforming the EAD-encoded finding aid into an easily read Excel worksheet. Since we want each folder of material to have a cover sheet that gives the collection name, box number, folder number, URL, and copyright policy, we use the collection manifests to make target sheets with this information. A total of 6,943 target sheets were created, printed, and inserted at the front of each folder by student workers before materials were sent out to the vendor.
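A minimal sketch of this kind of EAD-to-manifest transformation is shown here. It is an illustration rather than the tooling used on the project: the sample EAD fragment, the component ids, and the function names are all hypothetical, the fragment omits the namespaces and numbered components (c01, c02, and so on) found in real finding aids, and it writes CSV (which Excel can open) instead of a native worksheet.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment (no namespace, unnumbered <c>).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did><unittitle>Example Papers</unittitle></did>
    <dsc>
      <c level="file" id="c001">
        <did>
          <unittitle>Correspondence, A-C</unittitle>
          <container type="box">1</container>
          <container type="folder">1</container>
        </did>
      </c>
      <c level="file" id="c002">
        <did>
          <unittitle>Correspondence, D-F</unittitle>
          <container type="box">1</container>
          <container type="folder">2</container>
        </did>
      </c>
    </dsc>
  </archdesc>
</ead>
"""

FIELDS = ["collection", "component_id", "box", "folder", "title"]


def build_manifest(ead_xml):
    """Return one manifest row (a dict) per component in the finding aid."""
    root = ET.fromstring(ead_xml)
    collection = root.findtext("./archdesc/did/unittitle", default="").strip()
    rows = []
    for comp in root.iter("c"):
        did = comp.find("did")
        if did is None:
            continue
        # Map container type ("box", "folder") to its text value.
        containers = {
            c.get("type"): (c.text or "").strip() for c in did.findall("container")
        }
        rows.append({
            "collection": collection,
            "component_id": comp.get("id", ""),
            "box": containers.get("box", ""),
            "folder": containers.get("folder", ""),
            "title": (did.findtext("unittitle") or "").strip(),
        })
    return rows


def manifest_to_csv(rows):
    """Serialize manifest rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(manifest_to_csv(build_manifest(SAMPLE_EAD)))
```

Each manifest row carries the fields a target sheet needs apart from the URL and copyright statement, which would have to come from elsewhere.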

Once materials have been imaged by the vendor, students sample ten percent of the collection to check for completeness and readability. So far, everything has passed quality control with flying colors.
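One way to draw a ten percent quality-control sample is sketched below. This is an assumption-laden illustration: the post does not say whether sampling is done per image, per folder, or per box, and the file names and the `qc_sample` helper are hypothetical.

```python
import random


def qc_sample(image_names, percent=10, seed=None):
    """Pick roughly `percent` percent of the images for manual review.

    Rounds up so that even a tiny batch gets at least one image checked.
    Integer arithmetic avoids floating-point surprises in the sample size.
    """
    if not image_names:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = max(1, (len(image_names) * percent + 99) // 100)
    return sorted(rng.sample(image_names, k))


if __name__ == "__main__":
    # Hypothetical file names for one box of vendor-supplied images.
    batch = [f"box001_image{i:04d}.tif" for i in range(250)]
    for name in qc_sample(batch, seed=2013):
        print(name)
```

Recording the seed alongside the batch would let a second reviewer regenerate exactly the same sample.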

Each month, Crowley sends us a report of how many images have been created that month, how many images have been created cumulatively, and average scanning rate per hour. This information is below:

Month        Boxes Scanned   Pages Scanned
2013 March              15           17119
2013 April              32           45761
2013 May                50           49499
2013 June               65           97896
Totals                 162          210275

  1. In-house imaging

Imaging of the John Foster Dulles papers started in June. So far, we have completed a pilot of scanning with the sheet-feed of the photocopier, and pilots of microfilm scanning and scanning with a Zeutschel face-up scanner are underway.

Project goals and deliverables

  1. Twelve series or subseries from six collections digitized

To date, five series or subseries have been completely digitized, and three others are in the process of being digitized.

  1. Approximately 416,000 images created and posted online

As of July 1, 2013, 210,275 images have been scanned by the vendor. Of this total, 39,834 images have been posted online. Our vendor is several months ahead of schedule for this project, and in-house scanning is on track. Since beginning in-house scanning in June, 1,838 pages have been scanned by student workers. In the next months, we will calculate the per-page costs for scanning on a Zeutschel face-up scanner and with a microfilm scanner. From there, we plan to image fifty feet of materials with the sheet feeder of the photocopier, 10.3 feet with the Zeutschel face-up scanner, and 33.4 feet with the microfilm scanner.

  1. Six EAD finding aids updated to include links for 17,508 components (folders)

Two finding aids (Council on Foreign Relations Records and Adlai Stevenson Papers) have been updated to include links to digitized content. Another (George F. Kennan Papers) is ready to be updated. This process is managed semi-automatically with a series of shell scripts. After quality control, hard drives of images are sent to Princeton’s digital studios, where staff verify the digital assets and copy them to permanent storage. PDF and JPEG2000 files are then derived from the master TIFFs, and the relationship between these objects is described in an automatically generated METS file. Finally, a digital archival object (<dao>) tag is added to the EAD-encoded finding aid for each component.
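The final step, adding a <dao> element to each component, might look something like the sketch below. The actual workflow uses shell scripts, so this Python is only an illustrative stand-in. The sample XML, the component ids, the example.org URL, the plain `href` attribute, and the choice to place <dao> inside <did> are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified components (no namespace, unnumbered <c>).
SAMPLE_COMPONENTS = """
<dsc>
  <c level="file" id="c001"><did><unittitle>Folder 1</unittitle></did></c>
  <c level="file" id="c002"><did><unittitle>Folder 2</unittitle></did></c>
</dsc>
"""


def add_dao_links(ead_xml, links):
    """Add a <dao> to every component whose id appears in `links`.

    `links` maps component id -> URL of the digitized object.
    Returns the updated XML text and the number of <dao> elements added.
    """
    root = ET.fromstring(ead_xml)
    added = 0
    for comp in root.iter("c"):
        url = links.get(comp.get("id"))
        if url is None:
            continue  # nothing digitized for this component yet
        did = comp.find("did")
        if did is None or did.find("dao") is not None:
            continue  # no <did> to attach to, or already linked
        dao = ET.SubElement(did, "dao")
        dao.set("href", url)
        added += 1
    return ET.tostring(root, encoding="unicode"), added


if __name__ == "__main__":
    updated, count = add_dao_links(
        SAMPLE_COMPONENTS, {"c001": "https://example.org/objects/c001"}
    )
    print(count)
    print(updated)
```

Skipping components that already carry a <dao> makes the step safe to re-run as more images pass quality control.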

  1. Digital imaging cost of less than 80 cents per page achieved

The plan of work with our vendor calls for scanning costs well below 80 cents per page. The first (and likely least expensive) of our three in-house scanning pilots puts the cost of scanning with the sheet feeder of a copier at two cents per page. We will have numbers for microfilm scanning and scanning with a face-up scanner at the time of our next report.

  1. Metrics for digital imaging of 20th century archival collections for

    1. In-house microfilm conversion

    2. Sheet feeding through a networked photocopier

    3. Vendor supplied images

The information that we have collected thus far is below. Our vendor metrics are based on the quote and plan of work with The Crowley Company. Sheet feed metrics are collected by having a student worker fill out a minimal, time-stamped form at the beginning and end of each scan, and then analyzing that information. These numbers are preliminary. Sheet-fed scans have not yet been checked for quality control — re-scans may increase the total time per page and dollars per page for this method.

                       Vendor     Sheet Feed   Microfilm   Zeutschel
Total pages:           270,600*   1838         -           -
Total feet:            530.95     1.68         -           -
Total time:            -          2:25:14      -           -
Total time (decimal):  -          2.42         -           -
Time per page:         -          0:00:04      -           -
Pages per hour:        270.75     759.33       -           -
Hours per foot:        -          1:26:26      -           -
Feet per hour:         -          0.69         -           -
Cost per page:         TBD        $0.02        -           -

*This number is an estimate, based on an assumed 1,200 pages per box. Our reports from Crowley show anywhere from 1,050 to 1,750 pages in a box.

Note: in addition to these three methods, we plan to add a fourth – scanning with a face-up scanner (in our case, a Zeutschel scanner table).
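For readers who want to check or reuse the arithmetic behind the sheet-feed figures, here is a small sketch. The inputs are the totals reported above for the sheet-feed pilot (1,838 pages, 1.68 feet, 2:25:14 of scanning time). The function names are hypothetical, and durations are truncated to whole seconds when formatted, which is an assumption about how the original figures were displayed.

```python
def scan_metrics(pages, feet, elapsed):
    """Derive throughput metrics from page count, linear feet, and H:MM:SS time."""
    hours_part, minutes_part, seconds_part = (int(p) for p in elapsed.split(":"))
    seconds = hours_part * 3600 + minutes_part * 60 + seconds_part
    hours = seconds / 3600
    return {
        "hours": hours,
        "seconds_per_page": seconds / pages,
        "pages_per_hour": pages / hours,
        "hours_per_foot": hours / feet,
        "feet_per_hour": feet / hours,
    }


def format_hms(hours):
    """Format a duration in hours as H:MM:SS, truncating to whole seconds."""
    total = int(hours * 3600)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


if __name__ == "__main__":
    m = scan_metrics(pages=1838, feet=1.68, elapsed="2:25:14")
    print(f"Total time (decimal): {m['hours']:.2f}")
    print(f"Pages per hour:       {m['pages_per_hour']:.2f}")
    print(f"Hours per foot:       {format_hms(m['hours_per_foot'])}")
    print(f"Feet per hour:        {m['feet_per_hour']:.2f}")
```

Multiplying the hours by an hourly labor rate and dividing by the page count would give a cost per page, though the post does not state the rate used.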

  1. Policies and documentation for large-scale digitization initiative created and shared with archival community

As we go forward with our project, we have been blogging not just about the content of our digitized collections, but also our methods and rationales. A blog post written in February explains how this project fits into our other digitization activities and our approach to access. In early June, we wrote about the reasons why this kind of project is so important, and how our materials will now reach researchers worldwide (and of all ages) who might otherwise never come to our reading room in Princeton, New Jersey.

A more formal report on our methods and results will be made available once more data has been gathered.

Why — and How — We Digitize

It’s February, and we’re now in the second month of our NHPRC-funded digitization project. In twenty-three more months, we’ll have completed scanning and uploading 400,000 pages of our most-viewed material to our finding aids, and anyone with an internet connection will be able to view it.

This is just the most recent effort to introduce digitization as a normal part of our practice at Mudd. As I said in my previous post, we know that it’s well and good that we have collections that document the history of US diplomacy, economics, journalism and civil rights in the twentieth and twenty-first centuries. But for the majority of potential users, who may never be able to come to Princeton, NJ, this is irrelevant. However interested they may be, they may never be able to afford to visit us. And there’s a whole other subset of potential users — let’s call them working people — who can’t come between the hours of 9:00 and 4:45, Monday through Friday. Are we really providing fair and equitable access under these conditions? Since we have the resources to digitize, it’s imperative that we develop the infrastructure and political will to do so.

We know that it’s time to get serious — and smart — about scanning.

The ball has been rolling in this direction for some time. We have three “streams” of making digital content available, and with our new finding aids site, we have an intuitive way of linking descriptions of our materials to the materials themselves.

Images of the collection in the context of the finding aid


Our first is patron-driven digitization.

The Zeutschel -- our amazing German powerhouse face-up scanner

This is our Zeutschel scanner. It does amazing work, is easy on our materials, and usually requires very little quality control.

Archives have been providing photoduplication services since the advent of the photocopier. At Mudd, we have dedicated staff who have been doing this work for decades. Recently, we’ve just slightly tweaked our processes to create scans instead of paper copies and to (in many cases) re-use the scans that we make so that they’re available to all patrons, not just the one requesting the scan.

A patron (maybe you!) finds something in our finding aids that he thinks he may be interested in, and asks for a copy.

If he’s in our reading room, he flags the pages of material he wants. If he’s remote, he identifies the folders or volumes to be scanned. The archivist tells him how much the scan will cost, and he pre-pays.

Now, the scanning. This either happens on our photocopier (the technician can press “scan” instead of “photocopy” to create a digital file instead of a paper one) or on our Zeutschel scanner. And while we feel happy and lucky to have the Zeutschel, we don’t strictly need it to fulfill our mission to digitize.

The scan is named in a way that associates it with the description of the material in the finding aid, and is then linked up and served online. We currently send the patron an email of this scan, but in the future we may just send them a link to the uploaded content.

Our second stream is targeted digitization based on users’ viewing patterns

Our friendly student receptionist, Ashley, scans materials at the front desk when she isn't welcoming patrons.


We try to keep lots of good information about what our users find interesting. We use a service called Google Analytics to learn what users are browsing online, and we keep statistics about which physical materials patrons see in the reading room.

From these sources, we create a list of most-viewed materials, and set up a system for our students to scan them in their downtime when they’re working at the front desk.

We do this because we want to make sure that we’re putting the effort into digitizing resources that patrons actually want to see — there are more than 35,000 linear feet of materials at the Mudd Library. We probably won’t ever be able to digitize absolutely everything, and it wouldn’t make sense to start from “A” and go to “Z”. So, we pay attention to trends and try to anticipate what researchers might find useful.
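As a rough illustration of how two such sources could be merged into a single most-viewed list, consider the sketch below. The series names and counts are made up for the example, the `rank_most_viewed` helper is hypothetical, and weighting an online page view the same as a reading-room request is an assumption rather than a description of Mudd's practice.

```python
from collections import Counter


def rank_most_viewed(online_views, reading_room_requests, top_n=3):
    """Combine two view tallies and return the top_n (item, total) pairs."""
    combined = Counter(online_views)
    combined.update(reading_room_requests)  # adds counts for matching keys
    return combined.most_common(top_n)


if __name__ == "__main__":
    # Hypothetical tallies, keyed by series name.
    online = {"Series 1": 120, "Series 3": 40}
    reading_room = {"Series 3": 15, "Series 5": 60}
    for series, total in rank_most_viewed(online, reading_room):
        print(series, total)
```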

Our final stream — and the one for which we currently have to rely on external support — is large-scale vendor-supplied digitization.

Our current Cold War project is a great example of this. We’ve put together a project plan, chosen materials, called for quotes, and chosen a vendor. We recently shipped our first collection to be digitized, and I’ll be posting information to the blog as we move forward.

Another good example of an externally-supported digitization activity is the scanning of microfilm from our American Civil Liberties Union Records. Our earliest records were microfilmed decades ago, and recently Professor Sam Walker supported the digitization of some of this microfilm so that it could be made available online.

No single stream (externally-supported projects, left-to-right scanning, or patron-driven digitization) would be enough to support our goal of maximizing the content available online. We hope that the three, each pursued aggressively, will help us realize our mission of providing equitable access to our materials. And we think that focusing on this Cold War project will help us reflect on and improve all of our digitization activities.

Mudd Library Awarded Grant to Provide Global Access to Records of the Cold War

by: Maureen Callahan

The historian John Lewis Gaddis, author of a 2012 Pulitzer Prize-winning biography of George Kennan, has stated that the Mudd Library holds “the most significant set of papers for the study of modern American history outside of federal hands.”

This may be true, but is often only relevant to researchers who have the resources to access them. We have worked diligently to make sure people could find information about our collections, but until now, there were only a very few ways to actually study these records – come to Princeton, New Jersey and access them in the reading room, or order photocopies of what you think you might be interested in, based on descriptions in our finding aids (we also have a few collections digitized and online, and some microfilmed collections of our records may be in your local library).

We want to change this to make it easier for everyone to access our materials. Thanks to the generosity of the National Historical Publications and Records Commission (NHPRC), a taxpayer-funded organization that supports efforts to promote documentary sources, over 400,000 pages of records from six of our most-used collections will be digitized and put online for anyone with an internet connection to access. We hope that our records will become newly accessible and indispensable to international researchers, high school and college students, and anyone else with an interest in the history of the Cold War. As Gaddis wrote in a letter of support for our grant, this kind of access “has the potential, quite literally, to globalize the possibility of doing archival research. That’s no guarantee that this will produce a greater number of great books than in the past. What it will ensure, however, is a quantum leap in the opportunities students and their teachers will have to bring the excitement of working with original documents into all classrooms.”

Collections include:

John Foster Dulles Papers

John Foster Dulles (1888-1959), the fifty-third Secretary of State of the United States for President Dwight D. Eisenhower, had a long and distinguished public career with significant impact upon the formulation of United States foreign policies. He was especially involved with efforts to establish world peace after World War I, the role of the United States in world governance, and Cold War relations between the United States and the Soviet Union. The Dulles papers document his entire public career and his influence on the formation of United States foreign policy, especially for the period when he was Secretary of State.

We plan to digitize the following:

Series 1. Selected Correspondence 1891-1960

Series 3. Diaries and Journals 1907-1938

Series 5. Speeches, Statements, Press Conferences, Etc 1913-1958

 

George Kennan Papers

George F. Kennan (1904-2005) was a diplomat and a historian, noted especially for his influence on United States policy towards the Soviet Union during the Cold War and for his scholarly expertise in the areas of Russian history and foreign policy. Kennan’s papers document his career as a scholar at the Institute for Advanced Study and his time in the Foreign Service.

We plan to digitize the following:

Subseries 1A, Permanent Correspondence 1947-2004

Subseries 4D, Major Unused Drafts 1933-1978

Subseries 4G, Unpublished Works 1938-2000

 

Council on Foreign Relations Records

The Council on Foreign Relations is a nonprofit, nonpartisan research and national membership organization dedicated to improving understanding of international affairs by promoting a range of ideas and opinions on United States foreign policy. The Council has had a significant impact in the development of twentieth century United States foreign policy. The Records of the Council on Foreign Relations document the history of the organization from its founding in 1921 through the present.

We plan to digitize the following:

Studies Department 1918-1945

 

Allen W. Dulles Papers

The Allen W. Dulles Papers contain correspondence, speeches, writings, and photographs documenting the life of this lawyer, diplomat, businessman, and spy. One of the longest-serving directors of the Central Intelligence Agency (1953-1961), he also served in a key intelligence post in Bern, Switzerland during World War II, as well as on the Warren Commission.

We plan to digitize the following:

Series 1, Correspondence 1891-1969

Series 4, Warren Commission Files 1959-1967

 

Adlai E. Stevenson Papers

The Adlai E. Stevenson Papers document the public life of Adlai Stevenson (1900-1965), governor of Illinois, Democratic presidential candidate, and United Nations ambassador. The collection contains correspondence, speeches, writings, campaign materials, subject files, United Nations materials, personal files, photographs, and audiovisual materials, illuminating Stevenson’s career in law, politics, and diplomacy, primarily from his first presidential campaign until his death in 1965.

We plan to digitize the following:

Subseries 5D, U.S. Ambassador to the United Nations 1946-1947

 

James Forrestal Papers

James V. Forrestal (1892-1949) was a Wall Street businessman who played an important role in U.S. military operations during and immediately after World War II. From 1940 to 1949 Forrestal served as, in order, assistant to President Roosevelt, Under Secretary of the Navy, Secretary of the Navy, and the first Secretary of Defense.

We plan to digitize the following:

Subseries 1A, Alphabetical Correspondence

Subseries 5A, Diaries

 

Digitization will occur over the course of two years, and materials will be added to the web as they are digitized. Please be in touch with us if you have any questions about any of our materials.

 

The Daily Princetonian is digitized and keyword searchable


The Princeton University Archives, working in conjunction with the Princeton University Library Digital Initiatives, has nearly completed a monumental project that will change the way researchers investigate University history. The student newspaper, The Daily Princetonian, has been digitized from its inception in 1876 through 2002. The site has been available in beta for almost two years, but all issues will be loaded as of June 30, 2012. At the suggestion of The Daily Princetonian alumni board, which has been among the prime backers of this project, the site is named in honor of the newspaper’s long-serving production manager Larry Dupraz. Researchers are able to perform sophisticated keyword searches that can unlock the vast richness of the daily newspaper that documents so much of the University’s history. (For the years 2002-present, users may search online via the Daily Prince site.)


“I wrote my final paper for my Freshman Writing Seminar about how the presence of veterans on Princeton’s campus following World War II affected Princeton’s academic environment and social atmosphere,” said Jennifer Klingman ’13. “My research heavily relied on The Daily Princetonian archives, and I had to spend a lot of time and energy searching for relevant articles in Firestone’s microform versions of the newspaper. It was difficult to comb through the articles, and as a result my research was limited in scope. This spring, I wrote my history department junior paper on academic and social changes taking place at Princeton during the late 1940s and 1950s. The online Daily Princetonian archives proved to be invaluable. I was able to access the archives anywhere and at any time, and use the archives’ search function to find a number of extremely useful articles. My independent work has definitely benefited from the existence of the online archives.”


Freelance journalist W. Barksdale Maynard ’88 states, “I am able to write about the social history of Princeton in an entirely new way and have restructured my research to take full advantage of this exciting new resource. For my Princeton Alumni Weekly article on the early history of automobiles at Princeton, the Dupraz Digital Archives allowed me to identify every reference to cars as early as 1901, to pinpoint who owned them and what kinds. I would never have attempted this article without The Dupraz Digital Archives.”

Maynard’s PAW colleague, Gregg Lange ’70, regularly uses the site for his column, “Rally Round the Cannon,” which examines and appraises University history. “You can piece together the story of Princeton football or Woodrow Wilson in a dozen ways. But the unique accessibility of a daily publication allows more subtle topics to arise and recede, and for cross-generational tales to emerge. Be it Ella Fitzgerald singing at a Princeton dance at age 19, then receiving an honorary degree 54 years later; or student revolts against the clubs’ Bicker selection system in 1917 and 1940 presaging its loss of monopoly in 1968, the combination of detail and long view is indispensable in understanding the ethos of the institution over time, and essentially inaccessible without the DuPraz technology and precision. And existentially, if I never see another microfiche in my life I will die a happy man.”

Maynard added, “My regular column in PAW, ‘From Princeton’s Vault,’ has benefited enormously. Recently I was able to identify the earliest references to Princetonians as ‘tigers,’ which had been guesswork previously. It turns out we were wrong by a decade.”

This has been an international project, with the newspapers sent from Princeton to Brechin Imaging in Canada, where TIFF images are generated using high-end German cameras. The files are then sent on a hard drive to Cambodia, where Digital Divide Data analyzes the structure of each page and uses an optical character recognition (OCR) program to derive machine-readable text, which allows for keyword searching. The hard drive is then shipped to Austin, Texas, where the US office of New Zealand company DL Consulting loads the data into a content-management system called Veridian, which supports searching and browsing, online reading, article extraction and printing, and other features.

Within the library, many hands have worked for this project’s success. At Mudd Library, project archivists Dan Brennan and then Adriane Hanson have overseen the day-to-day work of the project, managing the shipment of the newspapers to Brechin and supervising students during the quality control phase. University Archivist Dan Linke raised the funds from various University and alumni sources and coordinated the project.

Within the greater Library system, Cliff Wulfman, the Library’s Digital Initiatives Coordinator, took the lead in writing the Request for Proposals and then selecting and coordinating the work with DDD, as well as providing technical assistance, support and vision. The Library System Office’s Antonio Barrera designed the front end web page with Phil Menos providing server support, and Deputy University Librarian and Systems Librarian Marvin Bielawski allocated the funds to acquire the Veridian software.

The project employs the METS/ALTO markup standard, the same used by the Library of Congress’s Newspaper Digitization Project, which means that as software changes and improves, we will be able to sustain this resource for many years to come.
