Digital Processing Workflows & Improvisation: A Foray into Bash Scripting

Adventures in the command line.

Over the past year, processing archivists in the Manuscripts Division have begun to integrate digital processing into our regular work. So far, each group of digital materials we’ve put through our workflow has presented at least one unique challenge that required us to supplement the standard steps in order to meet our twin goals of preservation and access.

Improvisation, however, is no stranger to “traditional” archival processing in a paper-based setting. Professional standards and local guidelines lend structure to actions taken during processing. Still, there is always a degree of improvisation involved due to the fact that no two collections are alike (Hence, the archivist’s favorite answer to any question: “It depends.”) In the inevitable cases where existing documentation stops short of addressing a particular situation we encounter in the day-to-day, we use our professional judgement to get where we need to go a different way. We improvise.

Good guidelines have room for improvisation built-in. By documenting the context and reasoning behind each step, they empower users to deviate from the particulars while achieving the same general result. We’re lucky to have inherited this kind of thoughtful documentation from our colleagues at Mudd Library (notably, Digital Archivist Jarrett Drake). Archivists across the department who are processing digital materials have recently begun writing informal narrative reflections on particular problems we’ve encountered while working with digital materials and the improvisations and workarounds we’ve discovered to solve them. By linking these reflections to our digital processing guidelines, we hope to aid future troubleshooting, repurpose what we’ve learned, and share knowledge horizontally with our peers.

One example (1) is a recent issue I encountered while working with a group of text files extracted from several 3.5” floppy disks and a zip disk from the papers of an American poet. After acquiring files from the disks, our workflow involves using DROID (Digital Record Object Identification), an open source file format identification tool developed by the U.K. National Archives, to identify file formats and report any file extension mismatches. In this case, the report listed a whopping 4,791 mismatches, nearly all of which lacked file extensions entirely.

While Mac and Linux operating systems rely on internal file metadata (MIME types) rather than file extensions to determine which program is needed to open a file, Windows operating systems (and humans) rely on file extensions. Reconciling file extension mismatches is important both because future digital preservation efforts may require moving files across operating systems and because file extensions provide important metadata that can help future users identify the programs they need to access files.

Small quantities of files can be renamed pretty painlessly in the file browser, as can larger numbers of files requiring uniform changes within a relatively flat directory structure by using the rename or mv command in the terminal. In my case, the collection creator managed her files in a complex directory structure, and single directories often contained files of various types, making these manual solutions prohibitively time- and labor-intensive.

If there’s one thing I’ve learned from working with metadata, it’s that you should never do a highly repetitive task if, in the time it would take you to do the task manually, you could learn how to automate it. (2) In addition to completing the task at hand, you’ll develop a new skill you can reuse for future projects. While sitting down and taking a comprehensive online class in various programming languages and other technologies is certainly useful for archivists venturing into the digital, it’s often hard to find the time to do so and difficult to retain so much technical information ingested all at once. Problems encountered in day-to-day work provide an excellent occasion for quickly learning new skills in digestible chunks in a way that adds immediate value to work tasks and leads to better information retention in the long run. This philosophy is how I decided to solve my mass file renaming problem with bash scripting. I was somewhat familiar with the command line in Linux, and I figured that if I knew the command to rename one file, with a little finagling, I should be able to write a script to rename all 4,791 based on data from DROID.

To create my input file, I turned the file extension mismatch report from DROID into a simple CSV file containing one field with the full path to each file missing an extension and a second field with the matching file extension to be appended. To do so, I looked up the correct file extensions in the PRONOM technical registry using the identification number supplied by the DROID report. I then inserted the extensions into my input file using the Find/Replace dialog in Google Sheets, deleted columns with extraneous information, and saved as a CSV file.

The script I wrote ended up looking like this: (3)

My bash script.

In a nutshell, a bash script is a series of commands that tells the computer what to do. The first line, which appears at the beginning of every bash script, tells the computer which shell is needed to interpret the script. Lines 3-9 first clear the terminal window of any clutter (clear) and then prompt the user to type the file path to the input file and the location where a log file should be stored into the terminal (echo); after the user types in each file path, the script reads it (read) and turns the user’s response into variables ($input and $log). These variables are used in the last line of the script in a process called redirection. The < directs the data from the input file into the body of the script, and the > directs the output into a log file.

Terminal window containing prompts initiated by the echo command, along with user input.

The real mover and shaker in this script is a handy little construct called a while loop (while, do, & done are all part of its syntax). Basically, it says to repeat the same command over and over again until it runs out of data to process. In this case, it runs until it gets to the last line of my input file. IFS stands for internal field separator; by using this, I’m telling the computer that the internal field separator in my input is a comma, allowing the script to parse my CSV file correctly. The read command within the while loop reads the CSV file, placing the data from field 1 (the full path to each file to be renamed) into the variable $f1 and the data from field 2 into the variable $f2 (the file extension to be appended). The mv command is responsible for the actual renaming; the syntax is: mv old_name.txt new_name.txt. In this case, I’m inserting my variables from the CSV file. Quotes are used to prevent any problems with filenames or paths that have spaces in them. -v is an option that means “verbose” and prompts the terminal to spit out everything that it’s doing as it runs. In this case, I’m redirecting this output into the log file so that I can save it along with other administrative metadata documenting my interventions as an archivist.

In the end, what would have taken me or a student worker countless hours of tedious clicking now takes approximately four seconds in the command line. I’m glad I spent my time learning bash.


Notes

(1) For an even better example, see Latin American Collections Processing Archivist Elvia Arroyo-Ramirez’s “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files” (here).

(2) Maureen Callahan’s “Computational Thinking and Archives” blog post certainly helped inspire this way of thinking.

(3) Here’s where I learned how to make my script executable and how to execute it from the command line.