XSLT | This Side of Metadata

In technical services we emphasize our ability to query, manage, and transform data. At any given moment one or more of us will be using XSLT, XQuery, Microsoft Access, OpenRefine, Python, or some other such tool to analyze or edit data in our various systems. This post highlights one such project: moving data from the Visuals database for graphic arts to the Voyager catalog and the Blacklight discovery system.

The Visuals database was around for a long time. Its scope was ambitious: prints, drawings, photographs, paintings, sculpture, and other non-book objects in the Princeton University Library collections. However, its chief content was records for holdings of the Graphic Arts collection in RBSC. There was no declared descriptive standard, and field content was 100% free text. Inevitably data problems occurred. Only the iron discipline of Vicki Principi of RBSC brought some semblance of order to the data starting in 2004.

Visuals started out as a SPIRES database. Eventually it was migrated to SQL Server. At migration time the inconsistent data could not be normalized as usual for relational databases, so Stan Yates of the Library’s Systems Office (now Information Technology) created a very simple and flexible three-part data structure as shown below. Here is what one of over 750,000 Visuals database entries looked like via Microsoft Access:

idRecord	Element	Value
10	ARTIST	Cruikshank, George, 1792-1878 [etcher]

These bits and pieces of data were assembled into complete records in forms and reports for presentation. The public interface, though greatly improved in recent years by Gary Buser of Information Technology, was difficult to use for those not already familiar with the structure and data conventions of Visuals. Also, because of its unorthodox internal structure, the Visuals catalog was not connected to Aeon requesting functions. In 2014, as an innovation, Visuals records were added to Primo, then our discovery system. When Blacklight replaced Primo, Visuals records made the transition. The results were easily foreseeable: users could find and request materials in a familiar local searching environment. The transformation for Primo and Blacklight employed a local XML format called “Generic” that mimicked the structure of a MARC record, since Voyager MARC was the model the discovery systems were based on. The Visuals-to-Generic stylesheet became the basis of the one that was used to transform Visuals to MARC.

The decision to retire Visuals in favor of MARC was based on the specific need to facilitate the transition from the original Princeton University Digital Library (PUDL) to Digital PUL, or DPUL. DPUL has 2 “canonical” systems as data sources: Voyager (MARC) and the finding aids XML database (EAD). PUDL data based on Visuals constituted a large block of non-MARC, non-EAD records. Those records would not be migrated in their current custom VRA-encoded state. The simplest and most forward-looking solution is to transform Visuals as a whole to Voyager in MARC. Voyager/MARC is our best system for rich item-level bibliographic descriptions such as those in Visuals. Voyager/MARC provides an easy metadata path for future digitization projects. At the same time the shift to Voyager eliminates the need to maintain an isolated standalone system and provides a more functional and sustainable environment for cataloging and data management in RBSC.

The first step in converting Visuals to MARC was to output the data in a workable format by exporting it to XML via the Access front end. Queries produced 10 files based on the last digit of the Visuals record numbers. (Division into multiple files was done simply to provide files of manageable size.) Each resulting XML file was over 9 MB. The structure was the very opposite of complex: a <dataroot> element wrapping tens of thousands of elements like this one, from the file of records with ID numbers ending in 0:

<data0>
<idRecord>10</idRecord>
<Element>ARTIST</Element>
<Value>Cruikshank, George, 1792-1878 [etcher]</Value>
</data0>
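For readers who want to poke at an extract in this shape, here is a minimal Python sketch (not the XSLT we actually used) that reads the rows back into three-part tuples. The function names `read_rows` and `file_digit` are hypothetical, and because the post only shows the `<data0>` wrapper, the code matches rows by position rather than by element name.

```python
import xml.etree.ElementTree as ET

# One row copied from the extract above, wrapped in its <dataroot>.
SAMPLE = """<dataroot>
<data0>
<idRecord>10</idRecord>
<Element>ARTIST</Element>
<Value>Cruikshank, George, 1792-1878 [etcher]</Value>
</data0>
</dataroot>"""


def read_rows(xml_text):
    """Return a list of (idRecord, Element, Value) tuples from an extract."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root:  # every child of <dataroot> is one three-part row
        rows.append((
            row.findtext("idRecord", default="").strip(),
            row.findtext("Element", default="").strip(),
            row.findtext("Value", default="").strip(),
        ))
    return rows


def file_digit(id_record):
    """Last digit of the record number, which decided the output file."""
    return id_record[-1]


if __name__ == "__main__":
    for id_record, element, value in read_rows(SAMPLE):
        print(file_digit(id_record), id_record, element, value)
```

Nothing in a row marks where one record ends and the next begins; the only link between rows is the shared idRecord value.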

The XSL stylesheet begins transformation by gathering “records” together in variables (by grouping the various Visuals elements that have idRecord in common), and then processes each resulting variable as a unit to produce a MARC XML record. For instance, the Cruikshank “Value” in the database illustration above (a single line of text) turns into the following MARC field 100 with 3 subfields, as part of bibliographic record number 10. The field even gets added punctuation per MARC conventions. If there are additional ARTIST elements, they turn into 700 fields.

<marc:controlfield tag="001">10</marc:controlfield>

…

<marc:datafield ind1="1" ind2=" " tag="100">
<marc:subfield code="a">Cruikshank, George,</marc:subfield>
<marc:subfield code="d">1792-1878,</marc:subfield>
<marc:subfield code="e">etcher.</marc:subfield>
</marc:datafield>

…
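The grouping and the ARTIST handling can be sketched in Python as follows. This is an illustration of the idea, not a port of the stylesheet: the regular expression assumes values shaped like "Name, dates [role]", the punctuation rule (comma between subfields, period at the end) is a simplification, and `artist_field` and `records_to_marc` are hypothetical names. Only the MARC XML namespace URI is taken from the standard.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)

# Assumed value pattern: "Name[, dates][ [role]]"; dates and role are optional.
ARTIST_RE = re.compile(
    r"^(?P<name>.+?)"
    r"(?:,\s*(?P<dates>\d{4}\??-(?:\d{4}\??)?))?"
    r"(?:\s*\[(?P<role>[^\]]+)\])?\s*$"
)


def artist_field(value, tag):
    """Build a 100 or 700 datafield from one ARTIST value."""
    match = ARTIST_RE.match(value)
    if match is None:
        raise ValueError(f"unparseable ARTIST value: {value!r}")
    parts = [("a", match.group("name").strip())]
    if match.group("dates"):
        parts.append(("d", match.group("dates")))
    if match.group("role"):
        parts.append(("e", match.group("role").strip()))
    field = ET.Element(
        f"{{{MARC_NS}}}datafield", {"tag": tag, "ind1": "1", "ind2": " "}
    )
    for position, (code, text) in enumerate(parts):
        is_last = position == len(parts) - 1
        subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        # Simplified punctuation: comma between subfields, period at the end.
        subfield.text = text + ("." if is_last else ",")
    return field


def records_to_marc(rows):
    """Group (idRecord, Element, Value) rows and emit one record per id."""
    grouped = defaultdict(list)
    for id_record, element, value in rows:
        grouped[id_record].append((element, value))
    records = []
    for id_record, fields in grouped.items():
        record = ET.Element(f"{{{MARC_NS}}}record")
        control = ET.SubElement(
            record, f"{{{MARC_NS}}}controlfield", {"tag": "001"}
        )
        control.text = id_record
        artists = [value for element, value in fields if element == "ARTIST"]
        for position, value in enumerate(artists):
            # First ARTIST becomes the 100; any others become 700s.
            record.append(artist_field(value, "100" if position == 0 else "700"))
        records.append(record)
    return records
```

Calling `records_to_marc` on the Cruikshank row yields a record whose 100 field has the same three subfields as the example above. Values that stray from the assumed pattern end up whole in subfield a, which is one reason post-processing was needed.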

Of necessity, given the structure and content of Visuals, MARC leader and 008 values are chiefly arbitrary, with the exception of 008/07-10 (Date 1), which in many cases could be parsed out of the Visuals DATE field. Other infelicities exist when Visuals data varied from major patterns that could be coded in XSL, though valuable pointers from Nikitas Tampakis of Information Technology and Joyce Bell of Cataloging and Metadata Services brought the encoding much closer to standard MARC. Joyce’s thoroughgoing review prompted many stylesheet changes to make the records similar to current ones created according to RDA. A great number of the exceptions are being dealt with in bulk during post-processing now that the records have been loaded into Voyager.
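To give a sense of what "parsed out of the DATE field" can mean, here is a small Python sketch. The rule it applies (take the first standalone four-digit number, otherwise fall back to "uuuu" for an unknown date) is an assumption made for illustration; the stylesheet's actual rules are not reproduced here, and `date1` and `set_date1` are hypothetical names.

```python
import re

# A four-digit run that is not part of a longer number.
YEAR_RE = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def date1(date_value):
    """Return four characters for 008/07-10 from a free-text DATE value."""
    match = YEAR_RE.search(date_value)
    # "uuuu" marks the date as unknown (an assumed fallback, not the
    # stylesheet's documented behavior).
    return match.group(1) if match else "uuuu"


def set_date1(field_008, date_value):
    """Overwrite character positions 07-10 of a 40-character 008 string."""
    if len(field_008) != 40:
        raise ValueError("an 008 for bibliographic records has 40 characters")
    return field_008[:7] + date1(date_value) + field_008[11:]
```

Free-text dates that contain no four-digit year at all are exactly the sort of exception that falls through to post-processing.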

Transformation of the Visuals XML output to MARC XML was a matter of minutes. After validation against the MARC XML format, the files were ready for conversion to MARC 21 via Terry Reese’s MarcEdit conversion utility. This remarkable tool took only a few seconds to produce records that could easily be handed off for loading to Voyager.

In MARC 21 the field looks like this, as one of many fields in the record with 001 “10” and 003 “Visuals”:

100 1 ‡a Cruikshank, George, ‡d 1792-1878, ‡e etcher.

As part of a MARC record it can be validated, indexed, displayed, and communicated in this form.

The MARC files were then bulk-loaded into Voyager by Kambiz Eslami of Information Technology and became available to users of Blacklight. Anyone can see Visuals record #10 at https://pulsearch.princeton.edu/catalog/10634093, identified by its new Voyager bibliographic record ID and with the original Visuals record ID now in an 035 field. If we so choose, the Voyager records can be exported to OCLC for inclusion in WorldCat.

Finished! (Except for the post-processing clean-up.)

Postscript

The migration from Visuals to MARC is attended by a degree of irony. MARC is on the way out, according to many observers. What will take its place? BIBFRAME, or some other data format based on RDF (Resource Description Framework). What is the key structural feature of RDF? It’s “triples”: a combination of subject and predicate and object, with the predicate being the term that describes the connection between the subject and the object. Seen any triples lately? Why, yes: Visuals!!! Let’s apply RDF terms from Dublin Core (a simpler alternative to BIBFRAME) to our three-part Visuals statement, with a little help from VIAF, the Virtual International Authority File.

This Visuals “triple”:

<idRecord>10</idRecord>
<Element>ARTIST</Element>
<Value>Cruikshank, George, 1792-1878 [etcher]</Value>

turns into this RDF triple in Dublin Core:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/">
<rdf:Description rdf:about="https://pulsearch.princeton.edu/catalog/visuals10">
<dcterms:creator rdf:resource="https://viaf.org/viaf/69915118/"/>
</rdf:Description>
</rdf:RDF>

Or, as a human would read it: The work described in Princeton Visuals record #10 has creator Cruikshank, George, 1792-1878. “What’s a creator?” you might ask. The namespace prefix “dcterms” tells you that you can find out at the address indicated. (Cruikshank is labelled as an “etcher” in Visuals and in MARC, but Dublin Core does not go into specific function terms like that. Cruikshank is in the MARC 100 field in this record. The closest Dublin Core term representing the 100 “Main Entry” field is “creator” and that will serve to get users to the resource description.) In our Dublin Core statement, everything’s a URI, and life is sweet.
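The same mapping can be sketched in Python. The two lookup tables are hypothetical stand-ins: one maps a Visuals element name to a Dublin Core term, the other maps a name heading to its VIAF URI, and each holds only the single entry used in this post. A real conversion would need a full mapping and an actual authority lookup.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dcterms", DCTERMS)

# Hypothetical lookup tables, holding only the example from this post.
PREDICATES = {"ARTIST": "creator"}
VIAF = {"Cruikshank, George, 1792-1878": "https://viaf.org/viaf/69915118/"}


def to_triple(id_record, element, value):
    """Turn one Visuals row into a (subject, predicate, object) of URIs."""
    subject = f"https://pulsearch.princeton.edu/catalog/visuals{id_record}"
    predicate = DCTERMS + PREDICATES[element]
    heading = value.split("[")[0].strip()  # drop the "[etcher]" role term
    return subject, predicate, VIAF[heading]


def to_rdfxml(triple):
    """Serialize one triple as RDF/XML."""
    subject, predicate, obj = triple
    namespace, local_name = predicate.rsplit("/", 1)
    root = ET.Element(f"{{{RDF}}}RDF")
    description = ET.SubElement(
        root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject}
    )
    ET.SubElement(
        description,
        f"{{{namespace}/}}{local_name}",
        {f"{{{RDF}}}resource": obj},
    )
    return ET.tostring(root, encoding="unicode")
```

Passing the Cruikshank row through `to_triple` and then `to_rdfxml` produces a statement equivalent to the one above, though the serializer may order attributes and namespace declarations differently.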

So, just in this brief note the same functional text has been shown encoded in 5 different ways: Visuals database format, XML extract from the database, MARC XML, MARC21, and RDF. Such multiple-identity situations are familiar to us. Many of the technical services staff are adept at understanding, manipulating, and when necessary actually inventing data encoding schemes and at moving data from one to another (and adapting the data as required to the new encoding environment). These skills have grown to be just as significant in our work as creating the data (metadata) in the first place. As the examples show, the significant content lives on no matter what form of coding is wrapped around it, and whether it is represented by text or by a URI. MARC is by no means the end of the road for Visuals or anything else. Conversion of MARC fields to RDF triples would mean a round trip for Visuals data. Was Visuals futuristic, structurally speaking? It’s a question for the historians. For now, we are going to move ahead one step at a time.

Princeton University Library Special Collections Technical Services

Transformed: From the Visuals database to MARC in 4 (depending on how you count them) easy (depending on what you consider easy) steps
