Wrangling Legacy Media: Gaining Intellectual Control Over (Born) Digital Materials

Like many manuscript repositories, Princeton’s Manuscripts Division has been somewhat slow to act when it comes to managing born-digital materials. This largely has to do with the nature of our collections, most of which predate the 20th century, as well as the fact that we lacked the policies and infrastructure to deal with these materials. The number of more contemporary collections we’re acquiring, however, is rapidly increasing (particularly the papers of still-active literary and cultural figures), as are the digital materials included in these collections. Coming to terms with this reality has led our division to begin taking the necessary steps to properly manage digital materials.

As a first step in doing so, we wanted to gain as much intellectual control as we could over our extant digital media. Our endeavor happened to coincide with SAA’s 2015 Jump In 3 initiative, which we decided to participate in as it provided structure, guidance, and a timeline for us to get the ball rolling. Jump In 3 was the third iteration of the “Jump In” initiative led by the Manuscript Repositories Section meant to help repositories begin to manage their born-digital records. It invited archivists to submit a report and survey of one or more collections in their repositories and also encouraged participants to take the additional steps of prioritizing collections for further treatment and developing the technical infrastructure for dealing with readable media.

STEP 1: Survey Finding Aids (XQuery Magic)

With the assumption that we had a minimal amount of digital media in our collections, we decided to survey all ~1600 of them. We first surveyed our EAD finding aids, which we manage in SVN (using the Subversion client within the Oxygen XML editor), to locate descriptions indicative of possible digital materials. Since digital media haven’t been described consistently, if at all, we anticipated that existing descriptions would vary from finding aid to finding aid. This meant that our survey tool would need to capture descriptions of digital materials located in various EAD elements.

With the help of our colleague, Regine Heberlein (Principal Cataloger and Metadata Analyst), we wrote a simple XQuery that scanned all descriptive components of our EADs to locate text strings that matched a regular expression based on a list of words, including variant spellings, that we determined would indicate the likely presence of born-digital materials, such as “disk,” “floppy,” “CD,” “DVD,” “drive,” “digital,” “electronic,” etc.

xquery version "1.0";
(: The ead: prefix and the default element namespace both point at the EAD 2002 namespace. :)
declare namespace ead = "urn:isbn:1-931666-22-9";
declare default element namespace "urn:isbn:1-931666-22-9";
declare copy-namespaces no-preserve, inherit;
(: FunctX is imported here but not called anywhere in this query. :)
import module namespace functx = "http://www.functx.com"
  at "http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq";
declare variable $COLL as document-node()+ := collection("path/to/EAD/directory");
(: Child elements of any <c> (excluding nested <c>s) whose text matches the media pattern; the 'i' flag makes the match case-insensitive. :)
let $contains_media := $COLL//ead:c/ead:*[not(self::ead:c) and matches(string(.), '(\s|^)flopp(y|ies)(\s|$)|(\s|^)dis(k|c|kette)s?(\s|$)|(\s|^)cd(-rom)?s?(\s|$)|(\s|^)dvds?(\s|$)|(\s|^)digital(\s|$)|(\s|^)(usb|hard|flash)\sdrives?(\s|$)', 'i')]
return
<results xmlns="urn:isbn:1-931666-22-9">
{
  (: One <c> per matching component, carrying over its level, id, and non-<c> children. :)
  for $media in $contains_media/parent::ead:c
  return
  <c level="{$media/@level}" id="{$media/@id}">
  {$media/*[not(self::ead:c)]}
  </c>
}
</results>
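
To get a feel for what this pattern does and does not catch, the same alternation can be tried on a few strings outside the finding aids. The following is a minimal sketch in Python rather than XQuery. The sample strings are invented for illustration, and Python's re module only approximates XPath regular expressions, although the constructs used here behave the same way on single-line strings.

```python
import re

# The same alternation as in the XQuery, transcribed for Python's re module.
MEDIA_PATTERN = re.compile(
    r"(\s|^)flopp(y|ies)(\s|$)"
    r"|(\s|^)dis(k|c|kette)s?(\s|$)"
    r"|(\s|^)cd(-rom)?s?(\s|$)"
    r"|(\s|^)dvds?(\s|$)"
    r"|(\s|^)digital(\s|$)"
    r"|(\s|^)(usb|hard|flash)\sdrives?(\s|$)",
    re.IGNORECASE,  # plays the role of the 'i' flag passed to matches()
)


def mentions_media(text: str) -> bool:
    """Return True if any part of the text matches the media pattern."""
    return MEDIA_PATTERN.search(text) is not None


# Hypothetical description strings, not taken from any real finding aid.
SAMPLES = [
    "3 floppy disks",
    "CD-ROMs of drafts",
    "phonograph discs",
    "disks, undated",
    "discussion of poems",
    "electronic files",
]

for sample in SAMPLES:
    print(f"{sample!r}: {mentions_media(sample)}")
```

Running this shows why a manual review of the results is still needed. “phonograph discs” is flagged even though it describes analog media, while “disks, undated” slips through because the pattern expects whitespace or the end of the string, not a comma, after the word. “electronic files” is not flagged either, since the expression as written has no alternative for that term.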

The XQuery generated a list of matching EAD components in XML, which we then imported into Excel. Each row in the spreadsheet represented a component that the XQuery located, and each column, an EAD element within that component.

With assistance from student workers, we meticulously examined and revised this spreadsheet, removing any irrelevant, extraneous, or redundant information: for example, false positives that resulted from other text strings matching our regular expression, or duplicate records for the same item resulting from multi-level description. We then extracted and transferred the most relevant data, including media type, file formats, and quantities, into additional columns to give this technical data more structure and to compute estimated totals. We also added columns to track our attempts to determine from the EAD data whether an item was likely born-digital or contained files of materials that had been digitized.

Controlled list for media type and estimated capacity (not always applicable, as many descriptions were very general, e.g. “discs” and “floppies”).
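
As a rough illustration of how a controlled list like this can feed the estimated totals, here is a minimal Python sketch. The media type labels and the sample rows are hypothetical, and the capacities are common nominal manufacturer figures used as assumptions; they are not our actual controlled list.

```python
from collections import defaultdict

# Nominal capacities in megabytes. Assumed values for illustration only,
# not the division's actual controlled list.
NOMINAL_CAPACITY_MB = {
    '3.5" floppy disk': 1.44,
    '5.25" floppy disk': 1.2,
    "CD": 700,
    "DVD": 4700,
}


def estimate_totals(rows):
    """Return {media_type: (item_count, estimated_mb)} for cleaned survey rows.

    Each row is a (media_type, quantity) pair. A media type that is missing
    from the controlled list (for example a vague "discs") is still counted,
    but its estimated capacity is reported as None rather than guessed.
    """
    counts = defaultdict(int)
    for media_type, quantity in rows:
        counts[media_type] += quantity

    totals = {}
    for media_type, count in counts.items():
        capacity = NOMINAL_CAPACITY_MB.get(media_type)
        estimated = None if capacity is None else round(count * capacity, 2)
        totals[media_type] = (count, estimated)
    return totals


# Hypothetical rows, as they might look after the spreadsheet cleanup.
rows = [
    ('3.5" floppy disk', 10),
    ("CD", 2),
    ('3.5" floppy disk', 5),
    ("discs", 3),
]

for media_type, (count, estimated_mb) in estimate_totals(rows).items():
    print(media_type, count, estimated_mb)
```

Leaving the capacity empty for vague descriptions keeps the guesswork visible in the totals instead of hiding it behind a default value.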

With the understanding that we were relying on imperfect metadata from older finding aids and, more importantly, that not all digital media were even described in the finding aids, our initial survey determined that the MSS Division held approximately 232 (born) digital media totaling 394 GB. In order to store two preservation copies and one access copy of all of the files on the digital media we found (three copies of 394 GB), we determined that we’d need a total of about 1.2 TB of storage space. While this number was likely somewhat inflated, because not all media are likely to be filled to capacity, we preferred to err on the side of overestimating our storage needs, especially given the anticipated presence of additional digital materials in our holdings for which we have no description. These were the figures we reported in our survey report for Jump In.
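
The arithmetic behind the storage figure is simple enough to sketch. This is a minimal Python example; the function name and its parameters are hypothetical, and the only inputs taken from our survey are the 394 GB total and the two-plus-one copy scheme.

```python
def storage_needed_gb(media_total_gb, preservation_copies=2, access_copies=1):
    """Total storage if every file is kept in each preservation and access copy."""
    return media_total_gb * (preservation_copies + access_copies)


total_gb = storage_needed_gb(394)
# Convert to terabytes using 1 TB = 1000 GB and round to one decimal place.
print(f"{total_gb} GB, about {total_gb / 1000:.1f} TB")  # prints "1182 GB, about 1.2 TB"
```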

STEP 2: Physically Survey Collections

The next step entailed physically looking at the materials that were identified in the EAD survey to enhance the data we had captured. We had two students conduct the survey and create an item-level inventory of the materials.

For this survey, we were able to more strictly adhere to the controlled list we had drafted for media type and estimated capacity. Other data we captured included any annotations on the items that may have existed, including media labels (or markings by the manufacturer) and media markings (or creator notes that may have been added); fortunately, for some items, markings included the actual media capacity that had been used as well as file format information. We also wanted to continue to determine, if possible, which items were actually born-digital as opposed to those that contained files of digitized materials, and to note whether or not the latter were duplicates of original paper copies that we also held. The thinking behind this was that this information would assist us in determining how to prioritize these materials.
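
One way to picture a row of this item-level inventory is as a small record. The Python sketch below is illustrative only: the field names are hypothetical, and the ranking in priority() is an assumption about how such data could inform prioritization, not a description of criteria we have settled on.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaItem:
    """One physical item in the inventory (field names are hypothetical)."""
    collection: str
    media_type: str                          # value from the controlled list
    estimated_capacity: Optional[str] = None
    media_label: str = ""                    # markings by the manufacturer
    media_markings: str = ""                 # notes added by the creator
    born_digital: Optional[bool] = None      # None means "could not be determined"
    duplicates_paper: Optional[bool] = None  # digitized copy of paper we also hold?


def priority(item: MediaItem) -> int:
    """Lower numbers sort first. An assumed ranking, for illustration only."""
    if item.born_digital is True:
        return 0  # unique born-digital content
    if item.born_digital is None:
        return 1  # unknown status, needs a closer look
    if item.duplicates_paper:
        return 3  # digitized, and the paper originals are also held
    return 2      # digitized, with no known paper copy in our holdings


# Hypothetical inventory rows.
items = [
    MediaItem("Collection A", "CD", born_digital=False, duplicates_paper=True),
    MediaItem("Collection B", '3.5" floppy disk', born_digital=True),
    MediaItem("Collection C", "DVD"),
]

for item in sorted(items, key=priority):
    print(priority(item), item.collection, item.media_type)
```

Using None for undetermined values keeps "we don't know" distinct from "no", which matters when the markings on an item give no clues.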

The physical survey turned up far more media than the EAD survey had suggested: 394 items rather than 232. The estimated storage need more than tripled as well, from about 1.2 TB to about 4.2 TB.

Since conducting the survey, we’ve discovered more digital media in existing collections and have also received materials in new acquisitions. For example, the Toni Morrison Papers includes over 150 floppy disks, both 3.5” and 5.25”, as well as a number of CDs and DVDs. We also recently received the papers of Argentine poet Juan Gelman, with over 160 3.5” floppies as well as several CDs, DVDs, and flash drives.

Next Steps

We in the Manuscripts Division are very fortunate in that a solid foundation of policy and infrastructure for managing born-digital materials has already largely been developed by our colleagues who oversee the University Archives and Public Policy Papers at Mudd Library. We’re currently in the process of applying what they’ve established to our division, mirroring what they’ve developed in some respects, but also tweaking things as appropriate given the differing nature of our collections.

We’ve begun to emulate Mudd’s environment here at Firestone, developing the infrastructure necessary to properly process and preserve digital records, and have also started to revisit and draft various related policies and procedures, specifically those that address new acquisitions (i.e. our donor agreement), processing workflows, preservation, and access. Other next steps include working with curators to prioritize the extant media we identified in our survey, beginning to process media from new collections, and refining the workflows we currently have in place. (These issues will be discussed in more detail in future posts.)