Wrangling Legacy Media: Gaining Intellectual Control Over (Born) Digital Materials

Like many manuscript repositories, Princeton’s Manuscripts Division has been somewhat slow to act when it comes to managing born-digital materials. This largely has to do with the nature of our collections, most of which predate the 20th century, as well as the fact that we lacked the policies and infrastructure to deal with these materials. The number of more contemporary collections we’re acquiring, however, is rapidly increasing (particularly the papers of still-active literary and cultural figures), as are the digital materials included in these collections. Coming to terms with this reality has led our division to begin taking the necessary steps to properly manage digital materials.

As a first step in doing so, we wanted to gain as much intellectual control as we could over our extant digital media. Our endeavor happened to coincide with SAA’s 2015 Jump In 3 initiative, which we decided to participate in as it provided structure, guidance, and a timeline for us to get the ball rolling. Jump In 3 was the third iteration of the “Jump In” initiative led by the Manuscript Repositories Section meant to help repositories begin to manage their born-digital records. It invited archivists to submit a report and survey of one or more collections in their repositories and also encouraged participants to take the additional steps of prioritizing collections for further treatment and developing the technical infrastructure for dealing with readable media.

STEP 1: Survey Finding Aids (XQuery Magic)

With the assumption that we had a minimal amount of digital media in our collections, we decided to survey all ~1600 of them. We first surveyed our EAD finding aids, which we manage in SVN (using the Subversion client within the Oxygen XML editor), to locate description indicative of possible digital materials. Since digital media have not been described consistently, if at all, we anticipated that existing descriptions would vary from finding aid to finding aid. This meant that our survey tool would need to capture descriptions of digital materials located in various EAD elements.

With the help of our colleague, Regine Heberlein (Principal Cataloger and Metadata Analyst), we wrote a simple XQuery that scanned all descriptive components of our EADs to locate text strings that matched a regular expression based on a list of words, including variant spellings, that we determined would indicate the likely presence of born-digital materials, such as “disk,” “floppy,” “CD,” “DVD,” “drive,” “digital,” “electronic,” etc.

xquery version "1.0";
declare namespace ead = "urn:isbn:1-931666-22-9";
declare default element namespace "urn:isbn:1-931666-22-9";
declare copy-namespaces no-preserve, inherit;
import module namespace functx = "http://www.functx.com"
  at "http://www.xqueryfunctions.com/xq/functx-1.0-doc-2007-01.xq";
declare variable $COLL as document-node()+ := collection("path/to/EAD/directory");
(: every non-<c> child of a component whose text matches one of the media keywords :)
let $contains_media := $COLL//ead:c/ead:*[not(self::ead:c) and matches(string(.), '(\s|^)flopp(y|ies)(\s|$)|(\s|^)dis(k|c|kette)s?(\s|$)|(\s|^)cd(-rom)?s?(\s|$)|(\s|^)dvds?(\s|$)|(\s|^)digital(\s|$)|(\s|^)(usb|hard|flash)\sdrives?(\s|$)', 'i')]
return
<results xmlns="urn:isbn:1-931666-22-9">{
  (: the path step returns each matching component once, in document order :)
  for $media in $contains_media/parent::ead:c
  return
    <c level="{$media/@level}" id="{$media/@id}">{
      (: the listing breaks off here; copying the component's own description is an assumed completion :)
      $media/ead:*[not(self::ead:c)]
    }</c>
}</results>

The XQuery generated a list of matching EAD components in XML, which we then imported into Excel. Each row in the spreadsheet represented a component that the XQuery located, and each column, an EAD element within that component.
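The flattening from XML to rows and columns can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: it expects a `<results>` element holding `<c>` components in the EAD 2002 namespace, as in the query above, and the function names `components_to_rows` and `rows_to_csv`, along with the sample record, are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"  # EAD 2002 namespace, as in the query


def components_to_rows(results_xml):
    """Turn each <c> in the results into a dict: one key per child element."""
    root = ET.fromstring(results_xml)
    rows = []
    for c in root.findall(f"{{{EAD_NS}}}c"):
        row = {"id": c.get("id", ""), "level": c.get("level", "")}
        for child in c:
            name = child.tag.split("}", 1)[-1]  # drop the namespace part
            text = " ".join("".join(child.itertext()).split())
            # Repeated elements (e.g. two <odd> notes) share one cell.
            row[name] = f"{row[name]} | {text}" if name in row else text
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Write rows as CSV text, using the union of all keys as columns."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample record, for illustration only.
SAMPLE = """<results xmlns="urn:isbn:1-931666-22-9">
  <c level="file" id="C0001_c12">
    <did><unittitle>Drafts on 3.5" floppy disks</unittitle></did>
    <scopecontent><p>Includes 4 floppy disks.</p></scopecontent>
  </c>
</results>"""

if __name__ == "__main__":
    print(rows_to_csv(components_to_rows(SAMPLE)))
```

The resulting CSV opens directly in Excel, with one row per component and one column per EAD element.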


With assistance from student workers, we meticulously examined and revised this spreadsheet, removing any irrelevant, extraneous, or redundant information; for example, false positives that resulted from other text strings matching our regular expression, or duplicate records for the same item resulting from multi-level description. We then extracted and transferred the most relevant data, including media type, file formats, and quantities, into additional columns, in order to provide more structure for this technical data, as well as to compute estimated totals. We also added columns to track our attempts to determine from the EAD data whether an item was likely born-digital or contained files of materials that had been digitized.

Controlled list for media type and estimated capacity (not always applicable as many descriptions were very general, e.g. “discs” and “floppies.”)


With the understanding that we were relying on imperfect metadata from older finding aids and, more importantly, that not all digital media were even described in the finding aid, our initial survey determined that the MSS Division held approximately 232 (born) digital media totaling 394 GB. In order to store two preservation copies and one access copy of all of the files on the digital media we found, we determined that we’d need a total of about 1.2 TB of storage space. While this number was likely somewhat inflated, since not all media are likely to be filled to capacity, we preferred to err on the side of overestimating our storage needs, especially given the anticipated presence of additional digital materials in our holdings for which we have no description. These were the figures we reported in our survey report for Jump In.
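The arithmetic behind that estimate is easy to write down. A minimal sketch in Python, assuming decimal units (1 TB = 1000 GB) and that every copy is as large as the full estimated capacity of the media; the function name is hypothetical.

```python
def storage_needed_gb(media_total_gb, preservation_copies=2, access_copies=1):
    """Total space for all copies, assuming each copy is full-size."""
    return media_total_gb * (preservation_copies + access_copies)


ead_survey_gb = 394  # estimated total capacity from the finding aid survey
needed_gb = storage_needed_gb(ead_survey_gb)
print(f"{needed_gb} GB, or about {needed_gb / 1000:.1f} TB")
# prints "1182 GB, or about 1.2 TB"
```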


STEP 2: Physically Survey Collections

The next step entailed physically looking at the materials that were identified in the EAD survey to enhance the data we had captured. We had two students conduct the survey and create an item-level inventory of the materials.

For this survey, we were able to more strictly adhere to the controlled list we had drafted for media type and estimated capacity. Other data we captured included any annotations on the items that may have existed, including media labels (or markings by the manufacturer) and media markings (or creator notes that may have been added); fortunately, for some items, markings included the actual media capacity that had been used as well as file format information. We also wanted to continue to determine, if possible, which items were actually born-digital as opposed to those that contained files of digitized materials, and to note whether or not the latter were duplicates of original paper copies that we also held. The thinking behind this was that this information would assist us in determining how to prioritize these materials.
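An item-level inventory keyed to a controlled list lends itself to a simple tally. The sketch below is an illustration only: the media types and nominal capacities are assumed stand-ins rather than the list we actually drafted, and the field names `media_type` and `capacity_mb` are hypothetical.

```python
from collections import Counter

# Hypothetical controlled list: media type -> assumed nominal capacity in MB.
NOMINAL_CAPACITY_MB = {
    '3.5" floppy disk': 1.44,
    '5.25" floppy disk': 1.2,
    "CD": 700,
    "DVD": 4700,
}


def summarize(inventory):
    """Count items per media type and total their estimated capacity in MB.

    Raises ValueError for a media type that is not on the controlled list.
    """
    counts = Counter()
    total_mb = 0.0
    for item in inventory:
        media_type = item["media_type"]
        if media_type not in NOMINAL_CAPACITY_MB:
            raise ValueError(f"not on the controlled list: {media_type!r}")
        counts[media_type] += 1
        # Prefer a capacity noted on the item itself, when one was recorded.
        total_mb += item.get("capacity_mb") or NOMINAL_CAPACITY_MB[media_type]
    return counts, total_mb
```

Rejecting anything off the list is the point of the controlled vocabulary: a stray “disc” or “floppies” gets caught at data entry instead of skewing the totals later.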


The number of media found during the physical survey was much higher than what we had determined from the EAD survey: rather than 232 items, we identified 394 media. Our estimated storage need more than tripled as well, from about 1.2 TB to about 4.2 TB.



Since conducting the survey we’ve discovered more digital media in existing collections and have also received materials in new acquisitions. For example, the Toni Morrison Papers includes over 150 floppy disks, both 3.5” and 5.25,” as well as a number of CDs and DVDs. We also recently received the papers of Argentine poet Juan Gelman with over 160 3.5” floppies as well as several CDs, DVDs, and flash drives.

Next Steps

We in the Manuscripts Division are very fortunate in that a solid foundation of policy and infrastructure for managing born-digital materials has already largely been developed by our colleagues who oversee the University Archives and Public Policy Papers at Mudd Library. We’re currently in the process of trying to apply what they’ve established for our division, mirroring what they’ve developed in some respects, but also tweaking things as appropriate due to the differing nature of our collections.

We’ve begun to emulate Mudd’s environment here at Firestone, developing the infrastructure necessary to properly process and preserve digital records, and have also started to revisit and draft various related policies and procedures, specifically those that address new acquisitions (i.e. our donor agreement), processing workflows, preservation, and access. Other next steps include working with curators to prioritize the extant media we identified in our survey; beginning to process media from new collections; and refining the workflows we currently have in place. (These issues will be discussed in more detail in future posts.)

Not Changing, But Expanding: Managing Digital Archives at Firestone Library

The following post introduces an upcoming series about managing digital archives at Firestone Library. In the next few months, the processing archivists of RBSC’s Manuscripts Division will be posting about:

– updating donor and purchase agreements to reflect language inclusive of managing digital content;
– gaining intellectual control over legacy born-digital materials;
– tools and programs we will use to capture, process, and preserve materials;
– access models for making this content publicly accessible;
– storage options for long-term preservation; and
– description of born-digital and digitized content.  

We plan to post case studies on how we process various types of digital media including audio, email, documents, and images. We will also share any relevant publications, presentations, webinars, etc., that helped inform our process.

Why now?

Simply put, archival processing of digital materials should not fall solely to digital archivists. Princeton University Library currently employs one Digital Archivist whose primary responsibilities are to develop, implement, and execute workflows specific to the management of University Archives. And while the Digital Archivist and other colleagues at the Seeley G. Mudd Library have encountered a steady flow of born-digital materials in University Archives over the past several years, the Manuscripts Division has only recently begun receiving an increasing number of ‘hybrid’ collections that include analog and born-digital materials on floppy disks, CDs, DVDs, USB drives, hard drives, and other removable storage media.  We are also taking steps to digitally migrate some of our audiovisual content. Acknowledging these concurrent realities, we see that our roles as traditional processing archivists are not changing but are expanding to include the management of digitized and born-digital content; and we’re ready to assume this responsibility. Looking to the foreseeable future, we expect that traditional processing archivists will eventually become digital archivists as backlogs shift from dusty, unprocessed boxes to terabytes of unprocessed data.

Institutionally, the timing is right for us to begin tackling this growing issue, in the hope that we get a handle on our digital content before we let terabytes of unprocessed data sit on the shelf to “collect dust.” We hope that our efforts to begin managing digital materials now prevent this new form of backlog from becoming a reality.

Absolute Identifiers for Boxes and Volumes

AbID in action in the RBSC vault.


Most library users are familiar with call numbers such as 0639.739 no.6, or D522 .K48 2015q, or MICROFILM S00888. These little bits of text look peculiar standing on their own. However, together with an indication of location such as Microforms Services (FLM) or Firestone Library (F) they can guide a user directly to a desired item and often to similar items shelved in the area.

In Rare Books and Special Collections (RBSC), there’s no self-service. Instead, departmental staff members retrieve (or “page” as we call it) items requested by users. Finding those items has not always been easy. Over the years many unique locations and practices were established on an ad-hoc basis. The multiplicity of exceptional locations required the paging staff to develop a complex mental mapping that rivaled “The Knowledge” that must be mastered by London taxi drivers.

The Big Move of collections in May 2015 provided the opportunity for a fresh start. Faced with growing collections and finite space, we imagined a system that would adapt to the new vault layout, which is strictly arranged by size. By “listening” to the space, we realized that shelving more of our collections by size would result in the most efficient use of shelving and minimize staff time spent on retrieval and stack maintenance. We couldn’t do anything immediately about the legacy of hundreds of subcollections–except to shelve them in a comprehensible order, by size and then alphabetically. However, the break with the past in terms of physical location of materials prompted some rethinking that eventually led to the use of “Absolute Identifiers” or “AbIDs” for almost all new additions to the collections.

RBSC has long used call number notations that are strictly for retrieval rather than subject-related classification for browsing by users. “Sequential” call numbers, which were adopted for most books in 2003, look like “2008-0011Q.”  Collection coding came along many years before that, in forms such as “C1091.” The Cotsen Children’s Library used database numbers as call numbers, such as “92470.” As a result RBSC has vast runs of materials where one call number group has nothing to do with its neighbors in terms of subject or much of anything else. A book documenting the history of the Bull family in England sits between an Irish theatre program and the catalog of an exhibition of artists’ books in Cincinnati. In the Manuscripts Division, the Harold Laurence Ruland papers on the early German cosmographer Sebastian Munster (1489-1552) have a collection of American colonial sermons as a neighbor. So what’s different about AbID? Three things are different: The form of the call numbers, their uniform application across collections and curatorial domains, and the means by which they are created.

The form of AbIDs is simple: a size designation and a number. Something like “B-000201” provides the exact pathway to the item, reading normally. It tells the person paging an item to go to the area for size “B” and look along the shelves for the 201st item. Pretty simple. Size designations are critical in our new storage areas. In order to maximize shelving efficiency, and thus the capacity for on-site storage, everything is strictly sorted into 11 size categories. (Some apply to very small numbers of materials. So far only two sizes account for two thirds of AbIDs.) In a sense, using AbIDs is simply a way of conforming to the floor plan and the need to shelve efficiently. That’s why the designations can be called Absolute Identifiers. They are “absolute” because the text indicates unambiguously the type and location of each item in the Firestone storage compartments. In other words, all information needed to locate an object appears right in the call number, with no need for any additional data. Even if materials must be shifted within the vault in the future, AbIDs remain accurate since they are not tied to specific shelf designations.
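To make the form concrete, here is a minimal Python sketch of building and reading an AbID. The pattern (capital letters for the size, a hyphen, then a six-digit zero-padded number) is inferred from the single example “B-000201”, so treat it as an assumption; the function names are hypothetical.

```python
import re

# Assumed shape, inferred from "B-000201": size letters, hyphen, six digits.
ABID_PATTERN = re.compile(r"^([A-Z]+)-(\d{6})$")


def format_abid(size, number):
    """Build an AbID from a size designation and a running number."""
    return f"{size}-{number:06d}"


def parse_abid(abid):
    """Split an AbID into (size, number); raise ValueError if malformed."""
    match = ABID_PATTERN.match(abid)
    if match is None:
        raise ValueError(f"not a well-formed AbID: {abid!r}")
    return match.group(1), int(match.group(2))


size, number = parse_abid("B-000201")
print(f"Go to the size {size} area and look for item {number}.")
# prints "Go to the size B area and look for item 201."
```

Because the number is zero-padded to a fixed width, a plain alphabetical sort of AbIDs within one size also matches shelf order.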

AbIDs are applied across curatorial domains (with the notable exception of the Scheide Library). Manuscripts Division, Graphic Arts, Cotsen Children’s Library, Western Americana — all are included. A great step toward this practice was taken with adoption of sequential call numbers for books in 2003. To make the sequential system work, items from all curatorial units were mixed together. Since curatorial units were previously the primary determinant of shelving, the transition required some mental adjustment. The success of sequential call numbers as a means of efficient shelving and easy retrieval made the move to AbIDs easier. So, bound manuscript volumes of size “N” sit in the same shelving run as cased road maps for the Historic Maps Collection, Cotsen volumes, and others. The items are all safely stored on shelving appropriate for their size and are easy to find. Of course, designation of curatorial responsibility remains in the records and on labels, but it is no longer the first key to finding and managing items on the shelves.

Finally, AbIDs are created via a wholly new process that was developed by a small committee of technical services staff. At the heart of the process is a Microsoft Access database. The database has a simple structure, with only two primary tables. A user signs in, selects a metadata format (MARC, EAD, and “None” are the current options–“None” is for a special case), and designates a size, along with several other data elements required for EAD. The database provides the next unique number for the size a user declares and if items are not already barcoded provides smart barcodes (ones that know which physical item they go with). Rather than requiring users to scan each barcode individually, the database incorporates an algorithm that automatically assigns sequential barcodes after a user enters the first and last item number in a range. For collections described in EAD, the database exports an XML file containing AbIDs, barcodes, and other data. A set of scripts written by Principal Cataloger and Metadata Analyst, Regine Heberlein, then transforms and inserts data from the XML export file into the correct elements in the corresponding collection’s EAD file. (These scripts also generate printable PDF box and folder labels at the same time!) For books and other materials cataloged in MARC, the database uses MacroExpress scripts to update appropriate records in the Voyager cataloging client. Overall, the database complements and improves existing workflows, allowing technical services staff to swiftly generate AbIDs and related item data for use in metadata management systems.
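The two core behaviors described above, handing out the next number for a size and assigning sequential barcodes across a range, can be sketched in Python. This is not the Access database’s actual logic: the class and function names are hypothetical, the six-digit padding is inferred from the example “B-000201”, and barcodes are treated as plain integers that increase by one (real barcode schemes may carry prefixes or check digits).

```python
from collections import defaultdict


class AbidRegistry:
    """Keeps one running counter per size designation."""

    def __init__(self):
        self._last_number = defaultdict(int)

    def next_abid(self, size):
        """Return the next unused AbID for the given size."""
        self._last_number[size] += 1
        return f"{size}-{self._last_number[size]:06d}"


def assign_barcodes(first_item, last_item, first_barcode):
    """Pair each item number in an inclusive range with consecutive barcodes."""
    if last_item < first_item:
        raise ValueError("last_item must not be smaller than first_item")
    return {
        item: first_barcode + (item - first_item)
        for item in range(first_item, last_item + 1)
    }


registry = AbidRegistry()
print(registry.next_abid("B"))  # B-000001
print(registry.next_abid("B"))  # B-000002
print(registry.next_abid("N"))  # N-000001
```

Keeping a separate counter per size is what lets each size run form its own unbroken sequence on the shelf.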

Getting started in the AbID database.


Making metadata and size selections in the AbID database.


With the 2015 move we are starting afresh. Old locations and habits are no longer valid. We have a chance to re-think the nature of storage and the purposes being served by our collection management practices. Our vault space is a shared resource, and inefficient use in any area of our department’s collection affects all. With AbIDs, the simple form of the call numbers, their uniform application across curatorial domains, and the means by which they are created make for efficient shelving and retrieval, which ultimately translates into better service for our patrons.

Out With The Old [Lions] And In With The New [Tigers]


Our current and ongoing library renovation has meant a lot of shuffling of our collections from place to place. “For the move” is a phrase constantly uttered in Rare Books these days. But which move? There have been at least 13 so far since 2011! In order to manage all of these moves (great and small), our department formed a task force in 2013, although work had started well before that. This group consists of four members of Technical Services tasked with assessing the current collections situation, providing data for more informed decision-making, and mapping collections to new locations. Like all the best endeavors, each move usually starts with a survey and an Access database. For each move our basic raison d’etre is to identify what we have on the shelves (sometimes harder than it sounds), identify projects to facilitate the move, fix any problems, anticipate how (and if) the materials will fit on the new shelves, determine any new organizational methods (physical or virtual), plan the move logistics, and see that the collection materials actually get moved…unscathed.

All of these earlier moves brought to light just how many objects our department has stashed away in various corners and drawers, from hulking great historic furniture to delicate pocket watches. You don’t realize it until you have to find a new place for them all! One project that sprang from the last move was to consolidate many of our museum objects into one location and organize them by size and collection number, of which there are many. I have come across many interesting objects during this process of reorganization: new to me, but long-standing library “residents”.

One such item is a 21” wide x 10” high bronze tiger statue that had lost its identification number. The distinct patina and artistic style of this beast looked rather familiar. I could come up with at least two sets of life-sized statues on campus that have a similar look. After a rummage through old accession books (worthy of a blog post themselves), I discovered the secret of our diminutive big cat: it is a cast made by A.P. Proctor.

With further investigation, I found out that it is [probably] a cast for the tiger sculptures outside of Nassau Hall. According to an article in the Daily Princetonian from February 19, 1909, the tiger statues currently flanking the steps were preceded by lions. Alexander P. Proctor (1860-1950) was called upon to make the new sculptures, gifted by the class of 1879. Known for his meticulous portrayal of animals, Proctor has sculptures across the country—including Princeton.

So, out with the old and in with the new. Eventually the collections will settle into their new locations and the RBSC staff will settle into our new offices on C-floor later this year. All this moving gives us a chance to reconnect with the past while enjoying new beginnings.

Opening the unmarked door: Communicating about technical services

The unmarked door


Welcome to our blog!

You are reading about library technical services.  Perhaps it’s your first time; perhaps you’re a fellow practitioner; perhaps you’re somewhere in between.  If you are new to this aspect of library work we are delighted by your initial interest and hope to provide posts that hold your attention.  We also hope that readers with some background in the field will find value in our experiences and perspectives.

Writing about technical services for a broad audience immediately brings up two conditions: we are out of public view (literally in the “back room” behind an unmarked door) and we habitually use vocabulary that can be difficult to interpret for those not in the know.  Let’s check on those topics here.  In subsequent posts we will get down to business.


In technical services we are in the business of creating infrastructure.  We provide the intellectual and physical control of library resources that enable users–including other library staff–to carry out their work.  Infrastructure is just that: “infra,” or below.  It’s meant to be used rather than to call attention to itself.  The best indication of an infrastructure job well done is invisibility.  For example, just about everybody drives across bridges without bestowing a thought on the engineers and construction workers who designed, built, and maintain (we hope) the structures that get riders from one side of something to the other side.  It’s the same with us in library technical services.  If the metadata we create readily gets users to resources that they are seeking (via a great deal of systems work: another realm of invisibility), then we have succeeded.  Normally metadata creators rise to the level of conscious thought only in cases of error or inadequacy.  (Users, understandably enough, typically consider our fundamental aim of “user service” to mean “service to me at this moment” and thus they tend to judge adequacy and correctness in terms of their own goals.  Our working perspective has to encompass all current and future users, and the limits on our capacity to serve them.  However, these are topics for other posts!)  There’s nothing unusual or lamentable about our general lack of visibility.  Our products such as catalog records and finding aids are eminently visible, and as long as they serve to connect users with the department’s amazing collections we are content.

So, a number of those who peruse our posts will encounter activities or ideas that they have not previously thought of very much, if at all.  Good!  We hope to convey information about the work we do, and along with it some hint of the intellectual vigor and liveliness that keep us engaged in our obscure but consequential functions on this side of processing.

The language barrier

We talk funny.

String. Property. Field. Tag. Element. WEMI. FISO. XSLT. Entification. Expression. 506. odd. Subdivision. Ancestor. Authorities.  These are all words (or “words” in some cases) that we use in our daily discourse.  They have specific contextual meanings, and behind those meanings are concepts and models that we use to construct our intellectual workspace.  For example, to us a field isn’t an expanse of property that one might find in a subdivision, possibly demarcated by a string.  It’s a constituent part of a MARC record, which is …  Well, we’d best move on before explanations overwhelm us.  All professional specialties have their own jargon and precise terminology.  You would likely be uneasy if you heard your doctor referring to your inner workings as a collection of thingamabobs and doohickeys rather than bronchi and glomeruli and so forth.  The technical services terms listed above and others like them provide a means for us to communicate effectively among ourselves.  Fluency in their use signals full in-group status, much like a prison tattoo.  However, in this blog we are generally going to avoid emphasizing our mastery of what for most people is esoteric vocabulary.  Instead, we are going to write in plain language, except when our primary target audience is our fellow practitioners.  Even then we will make some effort to explain the jargony terms that we use.

That said, we do expect to be writing with a technical services-oriented audience in mind.  Much of what we have to say is about innovation.  We are constantly thinking about ways to improve services and take advantage of developments in technology.  Members of our unit have lots of ideas on topics such as creating “absolute identifiers” for holdings or improving holdings management in finding aids, to name just two.  We are involved with leading-edge projects such as SNAC and LD4P (more specialized terminology!).  You’ll read about them here.  Discussion of such matters is necessarily laden with jargon and acronyms.  If you’re unfamiliar with the wording or underlying concepts and want more background, just let us know.  We want to give everyone a chance to learn about us and to gain some insight into activities that affect all users.  Our goal is to communicate with everyone and to make all members of our audience feel welcome in our world.

Revealing the Folks Behind the Curtain


Welcome to the Rare Books and Special Collections Technical Services blog at Princeton University Library. The Technical Services team consists of the archivists, catalogers, and associated staff who create and manage the resource records and discovery systems that connect users to Princeton’s impressive range of rare books, manuscripts, archival collections, and other unique materials.

In addition to creating discovery tools such as catalog records and finding aids, our team is also responsible for overall collections management, including maintaining the data and metadata about our collections, both old and new; overseeing space management, such as organizing the vaults that house rare books and special collections materials; and leading special projects to enhance access and support discovery, to name a few. While our work happens behind closed doors, we play an essential role as mediators to facilitate interaction between users and library materials, preparing collections for the reading room, and increasingly, for online access.

Our aim with this blog is to share some of the highlights of our work with fellow professionals and patrons alike, in order to provide a glimpse of the behind-the-scenes activities of the Rare Books and Special Collections Department as well as to foster important conversations about access, discovery, and use of special collections both in theoretical and practical terms. In the spirit of promoting creativity in our field and sharing what we’ve learned with others, we warmly welcome you to “the other side.”