About Don Thornbury

Head, Technical Services for Special Collections; Leader, Rare Books Cataloging Team

Transformed: From the Visuals database to MARC in 4 (depending on how you count them) easy (depending on what you consider easy) steps

In technical services we emphasize our ability to query, manage, and transform data. At any given moment one or more of us will be using XSLT, XQuery, Microsoft Access, OpenRefine, Python, or some other such tool to analyze or edit data in our various systems. This post highlights one such project: moving data from the Visuals database for graphic arts to the Voyager catalog and the Blacklight discovery system.

The Visuals database was around for a long time. Its scope was ambitious: prints, drawings, photographs, paintings, sculpture, and other non-book objects in the Princeton University Library collections.  However, its chief content was records for holdings of the Graphic Arts collection in RBSC.  There was no declared descriptive standard, and field content was 100% free text.  Inevitably data problems occurred.  Only the iron discipline of Vicki Principi of RBSC brought some semblance of order to the data starting in 2004.

Visuals started out as a SPIRES database. Eventually it was migrated to SQL Server.  At migration time the inconsistent data could not be normalized as usual for relational databases, so Stan Yates of the Library’s Systems Office (now Information Technology) created a very simple and flexible three-part data structure as shown below.  Here is what one of over 750,000 Visuals database entries looked like via Microsoft Access:

idRecord    Element    Value
10          ARTIST     Cruikshank, George, 1792-1878 [etcher]

These bits and pieces of data were assembled into complete records in forms and reports for presentation. The public interface, though greatly improved in recent years by Gary Buser of Information Technology, was difficult to use for those not already familiar with the structure and data conventions of Visuals. Also, because of its unorthodox internal structure the Visuals catalog was not connected to Aeon requesting functions. As an innovation, Visuals records were added in 2014 to Primo, then our discovery system. When Blacklight replaced Primo, Visuals records made the transition. The results were easily foreseeable: users could find and request materials in a familiar local searching environment. The transformation for Primo and Blacklight employed a local XML format called “Generic” that mimicked the structure of a MARC record, since Voyager MARC was the model the discovery systems were based on. The Visuals-to-Generic stylesheet became the basis of the one that was used to transform Visuals to MARC.

The decision to retire Visuals in favor of MARC was based on the specific need to facilitate the transition from the original Princeton University Digital Library (PUDL) to Digital PUL, or DPUL. DPUL has 2 “canonical” systems as data sources: Voyager (MARC) and the finding aids XML database (EAD).  PUDL data based on Visuals constituted a large block of non-MARC, non-EAD records.  Those records would not be migrated in their current custom VRA-encoded state.  The simplest and most forward-looking solution is to transform Visuals as a whole to Voyager in MARC.  Voyager/MARC is our best system for rich item-level bibliographic descriptions such as those in Visuals.  Voyager/MARC provides an easy metadata path for future digitization projects.  At the same time the shift to Voyager eliminates the need to maintain an isolated standalone system and provides a more functional and sustainable environment for cataloging and data management in RBSC.

The first step in converting Visuals to MARC was to output the data in a workable format by extracting it to XML via the Access front end. Queries produced 10 files based on the last digit of the Visuals record numbers. (Division into multiple files was done simply to provide files of manageable size.) Each resulting XML file was over 9 MB. The structure was the very opposite of complex: a <dataroot> element wrapping tens of thousands of elements like this one, from the file of records with ID numbers ending in 0:

<data0>
<idRecord>10</idRecord>
<Element>ARTIST</Element>
<Value>Cruikshank, George, 1792-1878 [etcher]</Value>
</data0>
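The gathering of rows into "records" can also be illustrated outside XSLT. The following is a minimal Python sketch, not the stylesheet used in the project. It assumes the extract has the <dataroot>/<data0> shape shown above; the TITLE rows and record 20 in the sample are invented purely for illustration, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample shaped like the Access XML extract. Only the ARTIST row comes from
# the post; the TITLE rows and record 20 are invented for illustration.
SAMPLE = """<dataroot>
  <data0><idRecord>10</idRecord><Element>ARTIST</Element><Value>Cruikshank, George, 1792-1878 [etcher]</Value></data0>
  <data0><idRecord>10</idRecord><Element>TITLE</Element><Value>Example title</Value></data0>
  <data0><idRecord>20</idRecord><Element>TITLE</Element><Value>Another example</Value></data0>
</dataroot>"""


def group_rows(xml_text):
    """Return {idRecord: [(element, value), ...]}, keeping rows in file order."""
    records = defaultdict(list)
    for row in ET.fromstring(xml_text):  # each child of <dataroot> is one row
        rec_id = row.findtext("idRecord")
        records[rec_id].append((row.findtext("Element"), row.findtext("Value")))
    return dict(records)


records = group_rows(SAMPLE)
for rec_id, rows in records.items():
    print(rec_id, rows)
```

Each dictionary entry then plays the role of one future bibliographic record, which is the unit the transformation works on.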

The XSL stylesheet begins transformation by gathering “records” together in variables (by grouping the various Visuals elements that have idRecord in common), and then processes each resulting variable as a unit to produce a MARC XML record. For instance, the Cruikshank “Value” in the database illustration above (a single line of text) turns into the following MARC field 100 with 3 subfields, as part of bibliographic record number 10. The field even gets added punctuation per MARC conventions. If there are additional ARTIST elements they turn into 700 fields.

<marc:controlfield tag="001">10</marc:controlfield>

<marc:datafield ind1="1" ind2=" " tag="100">
<marc:subfield code="a">Cruikshank, George,</marc:subfield>
<marc:subfield code="d">1792-1878,</marc:subfield>
<marc:subfield code="e">etcher.</marc:subfield>
</marc:datafield>
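To make the parsing step concrete, here is a rough Python sketch of how a single ARTIST string might be split into subfields a, d, and e with the added punctuation. It is an illustration under stated assumptions, not the logic of the actual stylesheet: it assumes the pattern "Name, dates [role]" seen in the Cruikshank example, and the function name is hypothetical. Values that depart from that pattern would need other handling.

```python
import re


def artist_to_subfields(value):
    """Split a Visuals ARTIST string into MARC-style (code, text) subfields."""
    role = dates = None

    # Trailing bracketed function term, e.g. " [etcher]".
    m = re.search(r"\s*\[([^\]]+)\]\s*$", value)
    if m:
        role, value = m.group(1), value[:m.start()]

    # Trailing life dates, e.g. ", 1792-1878" or an open-ended ", 1950-".
    m = re.search(r",\s*(\d{4}-(?:\d{4})?)\s*$", value)
    if m:
        dates, value = m.group(1), value[:m.start()]

    parts = [("a", value.strip())]
    if dates:
        parts.append(("d", dates))
    if role:
        parts.append(("e", role))

    # Punctuation as in the example: a comma closes every subfield except
    # the last, which takes a period (unless it ends in an open-date hyphen).
    punctuated = []
    for i, (code, text) in enumerate(parts):
        if i < len(parts) - 1:
            text += ","
        elif not text.endswith("-"):
            text += "."
        punctuated.append((code, text))
    return punctuated


print(artist_to_subfields("Cruikshank, George, 1792-1878 [etcher]"))
# [('a', 'Cruikshank, George,'), ('d', '1792-1878,'), ('e', 'etcher.')]
```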

Of necessity, given the structure and content of Visuals, MARC leader and 008 values are chiefly arbitrary, with the exception of 008/07-10 (Date 1), which in many cases could be parsed out of the Visuals DATE field. Other infelicities exist when Visuals data varied from major patterns that could be coded in XSL, though valuable pointers from Nikitas Tampakis of Information Technology and Joyce Bell of Cataloging and Metadata Services brought the encoding much closer to standard MARC. Joyce’s thoroughgoing review prompted many stylesheet changes to make the records similar to current ones created according to RDA. A great number of the exceptions are being dealt with in bulk during post-processing now that the records have been loaded into Voyager.
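As a small illustration of what "parsing out" a Date 1 might involve, the Python sketch below pulls the first plausible four-digit year from a free-text date. The sample strings, the year range, and the function name are assumptions made for the example. The actual DATE conventions in Visuals and the rules in the stylesheet are not reproduced here.

```python
import re


def date1_from_visuals(date_text):
    """Return the first four-digit year (1000-2099) in a free-text date,
    or None when no year can be found and the value must be set another way."""
    m = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", date_text)
    return m.group(1) if m else None


# Invented sample values, in the spirit of free-text date fields.
for sample in ["1835", "ca. 1835", "[1835?]", "1835-1840", "undated"]:
    print(repr(sample), "->", date1_from_visuals(sample))
```

Returning None instead of a filler value leaves the choice of fallback coding to whoever assembles the 008 field.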

Transformation of the Visuals XML output to MARC XML was a matter of minutes. After validation against the MARC XML format, the files were ready for conversion to MARC 21 via Terry Reese’s MarcEdit conversion utility. This remarkable tool took only a few seconds to produce records that could easily be handed off for loading to Voyager.

In MARC 21 the field looks like this, as one of many fields in the record with 001 “10” and 003 “Visuals”:

100  1 ‡a Cruikshank, George, ‡d 1792-1878, ‡e etcher.

As part of a MARC record it can be validated, indexed, displayed, and communicated in this form.

The MARC files were then bulk-loaded into Voyager by Kambiz Eslami of Information Technology and became available to users of Blacklight. Anyone can see Visuals record #10 at https://pulsearch.princeton.edu/catalog/10634093, identified by its new Voyager bibliographic record ID and with the original Visuals record ID now in an 035 field.  If we so choose, the Voyager records can be exported to OCLC for inclusion in WorldCat.

Finished! (Except for the post-processing clean-up.)

Postscript

The migration from Visuals to MARC is attended by a degree of irony. MARC is on the way out, according to many observers. What will take its place? BIBFRAME, or some other data format based on RDF (Resource Description Framework). What is the key structural feature of RDF? It’s “triples”: a combination of subject and predicate and object, with the predicate being the term that describes the connection between the subject and the object. Seen any triples lately? Why, yes: Visuals!!! Let’s apply RDF terms from Dublin Core (a simpler alternative to BIBFRAME) to our three-part Visuals statement, with a little help from VIAF, the Virtual International Authority File.

This Visuals “triple”:

<idRecord>10</idRecord>
<Element>ARTIST</Element>
<Value>Cruikshank, George, 1792-1878 [etcher]</Value>

turns into this RDF triple in Dublin Core:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/">
<rdf:Description rdf:about="https://pulsearch.princeton.edu/catalog/visuals10">
<dcterms:creator rdf:resource="https://viaf.org/viaf/69915118/"/>
</rdf:Description>
</rdf:RDF>

Or, as a human would read it: The work described in Princeton Visuals record #10 has creator Cruikshank, George, 1792-1878.  “What’s a creator?” you might ask.  The namespace prefix “dcterms” tells you that you can find out at the address indicated.  (Cruikshank is labelled as an “etcher” in Visuals and in MARC, but Dublin Core does not go into specific function terms like that.  Cruikshank is in the MARC 100 field in this record.  The closest Dublin Core term representing the 100 “Main Entry” field is “creator” and that will serve to get users to the resource description.)  In our Dublin Core statement, everything’s a URI, and life is sweet.
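The same mapping can be sketched in a few lines of Python, which makes the subject, predicate, and object roles explicit. This is only an illustration: the two lookup tables are hypothetical stand-ins (a real workflow would need a full element-to-property mapping and a reconciliation step against VIAF), and the output is written as an N-Triples line rather than RDF/XML.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical lookup tables for illustration only.
ELEMENT_TO_PREDICATE = {"ARTIST": DCTERMS + "creator"}
VIAF_LOOKUP = {"Cruikshank, George, 1792-1878": "https://viaf.org/viaf/69915118/"}


def to_ntriple(id_record, element, value):
    """Turn one Visuals three-part statement into an N-Triples line."""
    subject = f"https://pulsearch.princeton.edu/catalog/visuals{id_record}"
    predicate = ELEMENT_TO_PREDICATE[element]
    heading = value.split(" [")[0]  # drop the bracketed function term
    obj = VIAF_LOOKUP[heading]
    return f"<{subject}> <{predicate}> <{obj}> ."


print(to_ntriple("10", "ARTIST", "Cruikshank, George, 1792-1878 [etcher]"))
```

The idRecord supplies the subject, the Element supplies the predicate, and the Value (once matched to an authority) supplies the object.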

So, just in this brief note the same functional text has been shown encoded in 5 different ways: Visuals database format, XML extract from the database, MARC XML, MARC 21, and RDF. Such multiple-identity situations are familiar to us. Many of the technical services staff are adept at understanding, manipulating, and when necessary actually inventing data encoding schemes and at moving data from one to another (and adapting the data as required to the new encoding environment). These skills have grown to be just as significant in our work as creating the data (metadata) in the first place. As the examples show, the significant content lives on no matter what form of coding is wrapped around it, and whether it is represented by text or by a URI. MARC is by no means the end of the road for Visuals or anything else. Conversion of MARC fields to RDF triples would mean a round trip for Visuals data. Was Visuals futuristic, structurally speaking? It’s a question for the historians. For now, we are going to move ahead one step at a time.

Absolute Identifiers for Boxes and Volumes

AbID in action in the RBSC vault.


Most library users are familiar with call numbers such as 0639.739 no.6, or D522 .K48 2015q, or MICROFILM S00888. These little bits of text look peculiar standing on their own. However, together with an indication of location such as Microforms Services (FLM) or Firestone Library (F) they can guide a user directly to a desired item and often to similar items shelved in the area.

In Rare Books and Special Collections (RBSC), there’s no self-service. Instead, departmental staff members retrieve (or “page” as we call it) items requested by users. Finding those items has not always been easy. Over the years many unique locations and practices were established on an ad hoc basis. The multiplicity of exceptional locations required the paging staff to develop a complex mental mapping that rivaled “The Knowledge” that must be mastered by London taxi drivers.

The Big Move of collections in May 2015 provided the opportunity for a fresh start. Faced with growing collections and finite space, we imagined a system that would adapt to the new vault layout, which is strictly arranged by size. By “listening” to the space, we realized that shelving more of our collections by size would result in the most efficient use of shelving and minimize staff time spent on retrieval and stack maintenance. We couldn’t do anything immediately about the legacy of hundreds of subcollections, except to shelve them in a comprehensible order, by size and then alphabetically. However, the break with the past in terms of physical location of materials prompted some rethinking that eventually led to the use of “Absolute Identifiers” or “AbIDs” for almost all new additions to the collections.

RBSC has long used call number notations that are strictly for retrieval rather than subject-related classification for browsing by users. “Sequential” call numbers, which were adopted for most books in 2003, look like “2008-0011Q.”  Collection coding came along many years before that, in forms such as “C1091.” The Cotsen Children’s Library used database numbers as call numbers, such as “92470.” As a result RBSC has vast runs of materials where one call number group has nothing to do with its neighbors in terms of subject or much of anything else. A book documenting the history of the Bull family in England sits between an Irish theatre program and the catalog of an exhibition of artists’ books in Cincinnati. In the Manuscripts Division, the Harold Laurence Ruland papers on the early German cosmographer Sebastian Munster (1489-1552) have a collection of American colonial sermons as a neighbor. So what’s different about AbID? Three things are different: The form of the call numbers, their uniform application across collections and curatorial domains, and the means by which they are created.

The form of AbIDs is simple: a size designation and a number. Something like “B-000201” provides the exact pathway to the item, reading normally. It tells the person paging an item to go to the area for size “B” and look along the shelves for the 201st item. Pretty simple. Size designations are critical in our new storage areas. In order to maximize shelving efficiency, and thus the capacity for on-site storage, everything is strictly sorted into 11 size categories. (Some apply to very small numbers of materials. So far only two sizes account for two thirds of AbIDs.) In a sense, using AbIDs is simply a way of conforming to the floor plan and the need to shelve efficiently. That’s why the designations can be called Absolute Identifiers. They are “absolute” because the text indicates unambiguously the type and location of each item in the Firestone storage compartments. In other words, all information needed to locate an object appears right in the call number, with no need for any additional data. Even if materials must be shifted within the vault in the future, AbIDs remain accurate since they are not tied to specific shelf designations.
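The form is regular enough to read mechanically. The Python sketch below parses and formats identifiers shaped like “B-000201”. It assumes, based only on that one example, that a size code is one or more capital letters followed by a hyphen and a six-digit zero-padded number. The names `AbID`, `parse_abid`, and `format_abid` are hypothetical.

```python
import re
from typing import NamedTuple


class AbID(NamedTuple):
    size: str    # size designation, e.g. "B"
    number: int  # position in the run for that size, e.g. 201


# Assumed pattern: capital letter(s), hyphen, six digits.
ABID_RE = re.compile(r"^([A-Z]+)-(\d{6})$")


def parse_abid(text):
    """Parse 'B-000201' into AbID(size='B', number=201)."""
    m = ABID_RE.match(text.strip())
    if not m:
        raise ValueError(f"not an AbID: {text!r}")
    return AbID(m.group(1), int(m.group(2)))


def format_abid(size, number):
    """Format a size and number back into the zero-padded call number."""
    return f"{size}-{number:06d}"


print(parse_abid("B-000201"))   # AbID(size='B', number=201)
print(format_abid("B", 201))    # B-000201
```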

AbIDs are applied across curatorial domains (with the notable exception of the Scheide Library). Manuscripts Division, Graphic Arts, Cotsen Children’s Library, Western Americana — all are included. A great step toward this practice was taken with adoption of sequential call numbers for books in 2003. To make the sequential system work, items from all curatorial units were mixed together. Since curatorial units were previously the primary determinant of shelving, the transition required some mental adjustment. The success of sequential call numbers as a means of efficient shelving and easy retrieval made the move to AbIDs easier. So, bound manuscript volumes of size “N” sit in the same shelving run as cased road maps for the Historic Maps Collection, Cotsen volumes, and others. The items are all safely stored on shelving appropriate for their size and are easy to find. Of course, designation of curatorial responsibility remains in the records and on labels, but it is no longer the first key to finding and managing items on the shelves.

Finally, AbIDs are created via a wholly new process that was developed by a small committee of technical services staff. At the heart of the process is a Microsoft Access database. The database has a simple structure, with only two primary tables. A user signs in, selects a metadata format (MARC, EAD, and “None” are the current options–“None” is for a special case), and designates a size, along with several other data elements required for EAD. The database provides the next unique number for the size a user declares and if items are not already barcoded provides smart barcodes (ones that know which physical item they go with). Rather than requiring users to scan each barcode individually, the database incorporates an algorithm that automatically assigns sequential barcodes after a user enters the first and last item number in a range. For collections described in EAD, the database exports an XML file containing AbIDs, barcodes, and other data. A set of scripts written by Principal Cataloger and Metadata Analyst, Regine Heberlein, then transforms and inserts data from the XML export file into the correct elements in the corresponding collection’s EAD file. (These scripts also generate printable PDF box and folder labels at the same time!) For books and other materials cataloged in MARC, the database uses MacroExpress scripts to update appropriate records in the Voyager cataloging client. Overall, the database complements and improves existing workflows, allowing technical services staff to swiftly generate AbIDs and related item data for use in metadata management systems.
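Two pieces of that process, handing out the next number for a size and filling in barcodes across a range, can be sketched in Python. This is a simplified illustration, not the Access database's own logic: counters live in memory rather than in tables, barcodes are treated as plain consecutive integers (real barcodes may carry prefixes or check digits), and all names are hypothetical.

```python
from collections import defaultdict


class AbIDAllocator:
    """Hand out the next unused number for each size designation."""

    def __init__(self):
        self._last = defaultdict(int)  # size -> last number issued

    def next_id(self, size):
        self._last[size] += 1
        return f"{size}-{self._last[size]:06d}"


def assign_barcodes(first_item, last_item, first_barcode):
    """Map each item number in the range to a consecutive barcode,
    so the user enters only the two ends of the range."""
    if last_item < first_item:
        raise ValueError("last item number precedes first")
    return {
        item: first_barcode + (item - first_item)
        for item in range(first_item, last_item + 1)
    }


alloc = AbIDAllocator()
print(alloc.next_id("B"), alloc.next_id("B"), alloc.next_id("N"))
print(assign_barcodes(1, 3, 1000))
```

Keeping a separate counter per size is what lets each size run stay gap-free regardless of which curatorial unit an item belongs to.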

Getting started in the AbID database.


Making metadata and size selections in the AbID database.


With the 2015 move we are starting afresh. Old locations and habits are no longer valid. We have a chance to re-think the nature of storage and the purposes being served by our collection management practices. Our vault space is a shared resource, and inefficient use in any area of our department’s collection affects all. With AbIDs, the simple form of the call numbers, their uniform application across curatorial domains, and the means by which they are created make for efficient shelving and retrieval, which ultimately translates into better service for our patrons.

Opening the unmarked door: Communicating about technical services

The unmarked door


Welcome to our blog!

You are reading about library technical services.  Perhaps it’s your first time; perhaps you’re a fellow practitioner; perhaps you’re somewhere in between.  If you are new to this aspect of library work we are delighted by your initial interest and hope to provide posts that hold your attention.  We also hope that readers with some background in the field will find value in our experiences and perspectives.

Writing about technical services for a broad audience immediately brings up two conditions: we are out of public view (literally in the “back room” behind an unmarked door) and we habitually use vocabulary that can be difficult to interpret for those not in the know.  Let’s check on those topics here.  In subsequent posts we will get down to business.

Invisibility

In technical services we are in the business of creating infrastructure. We provide the intellectual and physical control of library resources that enables users, including other library staff, to carry out their work. Infrastructure is just that: “infra,” or below. It’s meant to be used rather than to call attention to itself. The best indication of an infrastructure job well done is invisibility. For example, just about everybody drives across bridges without bestowing a thought on the engineers and construction workers who designed, built, and maintain (we hope) the structures that get riders from one side of something to the other side. It’s the same with us in library technical services. If the metadata we create readily gets users to resources that they are seeking (via a great deal of systems work: another realm of invisibility), then we have succeeded. Normally metadata creators rise to the level of conscious thought only in cases of error or inadequacy. (Users, understandably enough, typically consider our fundamental aim of “user service” to mean “service to me at this moment” and thus they tend to judge adequacy and correctness in terms of their own goals. Our working perspective has to encompass all current and future users, and the limits on our capacity to serve them. However, these are topics for other posts!) There’s nothing unusual or lamentable about our general lack of visibility. Our products such as catalog records and finding aids are eminently visible, and as long as they serve to connect users with the department’s amazing collections we are content.

So, a number of those who peruse our posts will encounter activities or ideas that they have not previously thought of very much, if at all.  Good!  We hope to convey information about the work we do, and along with it some hint of the intellectual vigor and liveliness that keep us engaged in our obscure but consequential functions on this side of processing.

The language barrier

We talk funny.

String. Property. Field. Tag. Element. WEMI. FISO. XSLT. Entification. Expression. 506. odd. Subdivision. Ancestor. Authorities.  These are all words (or “words” in some cases) that we use in our daily discourse.  They have specific contextual meanings, and behind those meanings are concepts and models that we use to construct our intellectual workspace.  For example, to us a field isn’t an expanse of property that one might find in a subdivision, possibly demarcated by a string.  It’s a constituent part of a MARC record, which is …  Well, we’d best move on before explanations overwhelm us.  All professional specialties have their own jargon and precise terminology.  You would likely be uneasy if you heard your doctor referring to your inner workings as a collection of thingamabobs and doohickeys rather than bronchi and glomeruli and so forth.  The technical services terms listed above and others like them provide a means for us to communicate effectively among ourselves.  Fluency in their use signals full in-group status, much like a prison tattoo.  However, in this blog we are generally going to avoid emphasizing our mastery of what for most people is esoteric vocabulary.  Instead, we are going to write in plain language, except when our primary target audience is our fellow practitioners.  Even then we will make some effort to explain the jargony terms that we use.

That said, we do expect to be writing with a technical services-oriented audience in mind.  Much of what we have to say is about innovation.  We are constantly thinking about ways to improve services and take advantage of developments in technology.  Members of our unit have lots of ideas on topics such as creating “absolute identifiers” for holdings or improving holdings management in finding aids, to name just two.  We are involved with leading-edge projects such as SNAC and LD4P (more specialized terminology!).  You’ll read about them here.  Discussion of such matters is necessarily laden with jargon and acronyms.  If you’re unfamiliar with the wording or underlying concepts and want more background, just let us know.  We want to give everyone a chance to learn about us and to gain some insight into activities that affect all users.  Our goal is to communicate with everyone and to make all members of our audience feel welcome in our world.