Digital Archives Workstation Update: KryoFlux, FRED, and BitCurator Walk into a Bar…

The Manuscripts Division processing team’s new digital archives workstation.

Over the past year and a half, the Manuscripts Division processing team has made two significant additions to our digital archives workstation. The first, mentioned briefly in our July 2016 post, was a KryoFlux forensic floppy controller, which allows archivists to create disk images from a variety of obsolete floppy disk formats. The second, more recent addition was a forensic computer workstation called the Forensic Recovery of Evidence Device (FRED), which arrived this May (1). We now use the FRED in a native BitCurator environment as our primary workstation, along with the KryoFlux and a growing collection of external drives. While this streamlined setup has vastly improved our born-digital processing workflow, we would be lying if we didn’t admit that getting to this point has involved back-and-forth discussions with IT colleagues, FRED troubleshooting, and a fair share of headaches along the way. This post will describe how we got these new tools, FRED and KryoFlux, to work together in BitCurator.

Before we had the FRED, we operated the KryoFlux from the Windows partition of our digital processing laptop (which is partitioned to dual boot BitCurator/Ubuntu Linux and Windows 7). To get to this point, however, we had to jump over some hurdles, including confusion over the orientation of drives on our data cable, which differed from the diagram in the official KryoFlux manual; finicky USB ports on our laptop; and the fact that the laptop didn't seem to remember that it had the KryoFlux drivers installed between uses (2). While this setup meant some extra finagling with each use, it nonetheless allowed us to image floppies we couldn't with our previous tools.

In addition to hardware components such as the controller board, floppy drive, and associated cables, the KryoFlux "package" also includes a piece of software called DiskTool Console (DTC), which can be run directly in the terminal as a command-line tool or through a more human-friendly graphical user interface (GUI). The KryoFlux software is compatible with Windows, Mac, and Linux. However, we initially went with a Windows install after hearing a few horror stories about failed attempts to use the KryoFlux with Linux. Though operational, this set-up quickly became unsustainable due to the laptop's tendency to crash when we switched over from disk imaging in Windows to complete later processing steps in BitCurator. Whenever this happened, we had to completely reinstall the BitCurator partition and start from scratch, sometimes losing our working files in the process. On top of this, our hard drive space was quickly dwindling. To sidestep this mess, we needed to install the KryoFlux on the FRED. Since we planned to have the FRED running the BitCurator environment as its only operating system to avoid any future partitioning issues, this meant we would have to attempt the dreaded Linux install.

Our feelings about Linux before the Archivist’s Guide to KryoFlux. Source: http://gph.is/1c55ovc

Luckily, the arrival of our FRED in May 2017 coincided with the advent of the Archivist's Guide to KryoFlux. Although the KryoFlux is gaining popularity with archivists, it was originally marketed towards tech-savvy computer enthusiasts and gamers with a predilection for vintage video games. The documentation that came with it was, to put it nicely, lacking. That's where an awesome group of archivists, spearheaded by Dorothy Waugh (Emory), Shira Peltzman (UCLA), Alice Prael (Yale), Jennifer Allen (UT Austin), and Matthew Farrell (Duke), stepped in. They compiled the first draft (3) of the Archivist's Guide to KryoFlux, a collaborative, user-friendly manual intended to address the need for clearer documentation written by archivists for archivists. Thanks to the confidence inspired by this guide, our dark days of Linux-fearing were over. We did encounter some additional hiccups on our way to a successful Linux install on the FRED, but nothing we couldn't handle with the tips and tricks found in the guide. The following are some words of wisdom we would offer to other archivists who want to use the KryoFlux in conjunction with the FRED and/or in a natively installed BitCurator environment.

First, when installing the KryoFlux on a Linux machine, there are a few extra steps you need to take to ensure that the software will run smoothly. These include installing dependencies (libusb and a Java runtime) and creating a udev rule that will prevent future permissions issues. If the previous sentence is meaningless to you, that's okay: the Archivist's Guide to KryoFlux explains exactly how to do both of these steps here.
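On an Ubuntu-based system like BitCurator, those two steps boil down to something like the commands below. This is just a sketch (the package names and udev file name are assumptions), and the actual rule text should be copied from the KryoFlux README.linux or the Archivist's Guide:

    # Install the dependencies (package names assume an Ubuntu/BitCurator system).
    sudo apt-get install libusb-1.0-0 default-jre

    # Create a udev rule so the KryoFlux board can be used without root privileges.
    # Paste in the rule text from the KryoFlux README.linux or the Archivist's Guide, then reload.
    sudo nano /etc/udev/rules.d/80-kryoflux.rules
    sudo udevadm control --reload-rules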

A second problem we ran into was that, even though we had Java installed, our computer wasn't invoking Java correctly when we launched the KryoFlux GUI; the GUI would appear to open, but important functionality would be missing (the settings window, for example, was completely blank). A tip for bypassing this problem can be found several paragraphs into the README.linux file that comes with the KryoFlux software download; these instructions indicate that the command java -jar kryoflux_ui.jar makes Java available when running the GUI. To avoid having to run this command in the terminal every single time we use the GUI, we dropped the command into a very short bash script. We keep this script on the FRED's desktop and click on it to start up the GUI in place of a desktop icon. There are likely other solutions to this problem out there, but this was the first one that worked consistently for us.
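For anyone who wants to replicate this workaround, the launcher amounts to something like the following (the install path is an assumption, so point it at whichever directory holds kryoflux_ui.jar on your machine, and make the file executable with chmod +x so it can be clicked):

    #!/bin/bash
    # Launch the KryoFlux GUI by invoking Java explicitly, per the README.linux tip.
    cd ~/kryoflux/dtc    # replace with the directory containing kryoflux_ui.jar
    java -jar kryoflux_ui.jar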

Annotated section of the README.linux file from the KryoFlux software for Linux (which you can download from this page).

One particularity of the FRED to keep in mind when working with the KryoFlux, or any external floppy controller or drive, is the FRED's internal Tableau write blocker (UltraBay). Since the KryoFlux employs hardware write-blocking (after you remove a certain jumper block (4)), the FRED's internal hardware write blocker is unnecessary and will create problems when interfacing with external floppy drives. To bypass the FRED's Tableau write blocker, make sure to plug the KryoFlux USB data cable into one of the USB ports along the very top of the FRED or those on the back, not the port in the UltraBay.

Plug the KryoFlux data cable into the USB ports that are not connected to the internal write blocker in the FRED’s UltraBay. Like so.

Technical woes aside, the best part about our new FRED/KryoFlux/BitCurator set-up is that it allows us to access data from floppy disks that were previously inaccessible due to damage and obscure formatting. Just this summer, our inaugural Manuscripts Division Archival Fellow, Kat Antonelli, used this workstation to successfully image additional disks from the Toni Morrison Papers. Kat was also able to use Dr. Gough Lui's excellent six-part blog series on the KryoFlux to interpret some of the visualizations that the KryoFlux GUI provides. From these visualizations, she was able to glean that several of the disks that even the KryoFlux couldn't image were most likely blank. While the Archivist's Guide to KryoFlux provides a great way to get started with installation, disk imaging, and basic interpretation of the KryoFlux GUI's various graphs, navigating these visualizations once you get beyond the basics remains somewhat murky. As archivists continue to gain experience working with the KryoFlux, it will be interesting to see how much of this visual information proves useful for archival workflows (and of course, we'll document what we learn as we go!).

What does it all mean? (The left panel shows us how successful our disk image was. The right panel contains more detailed information about the pattern of data on the disk.)

(1) The processing team drafted a successful proposal to purchase the FRED based on the results of a survey we conducted asking 20 peer institutions about their digital archives workstations. We plan to publish the results of this survey in a future post!

(2) You can read more about our troubleshooting process for these issues in the “Tale of Woe” we contributed to the Archivist’s Guide to KryoFlux. More on this resource later in this post.

(3) The guide is still in draft form and open for comments until November 1, 2017. The creators encourage feedback from other practitioners!

(4) See page 3 of the official KryoFlux manual for instructions on enabling the write blocker on the KryoFlux. (3.5” floppies can also be write-blocked mechanically.)

Princeton goes to RAC!

Portrait of John D. Rockefeller Jr. (1874-1960)

On Wednesday, January 18th, the Manuscripts Division team went on a field trip to the Rockefeller Archive Center (RAC), located in Sleepy Hollow, NY. The team, which consists of Kelly Bolding, Faith Charlton, Allison Hughes, Chloe Pfendler, and myself, was graciously invited by RAC's Assistant Digital Archivist Bonnie Gordon to meet with our RAC counterparts and discuss born-digital processing, specifically knowledge sharing, peer support, and horizontal leadership. The team also received a tour of the Center from the Director of Archives, Bob Clark.

The "seed" of this exchange was planted when Faith reached out to Bonnie about RAC's digital processing workstation specifications (hardware, software, peripheral tools, etc.), part of a larger endeavor we have taken on to inform the building of our own workstation (more on this project to come!). In the exchange, Bonnie asked about DABDAC (Description and Access to Born Digital Archival Collections), a peer working group here in our department that I had previously mentioned in a presentation I gave at last year's PASIG meeting at MoMA in New York. Side note: both Bonnie and her colleague Hillel Arnold wrote about their PASIG experience on RAC's Digital Team blog, Bits & Bytes, an excellent resource for keeping abreast of digital preservation news and RAC's innovative projects. In our communication, it became clear that the processing teams at Princeton and RAC were engaged in similar efforts to collaboratively build knowledge and skills around born-digital processing. In fact, representatives of both teams are set to give presentations at the upcoming 2017 code4lib conference (*cough*, *cough*) that revolve around how to build competence (and confidence) across an entire team, regardless of whether the word "digital" appears in an archivist's job title.

Considering that both teams are figuring out ways to get everyone up to speed with digital processing, we decided to meet, learn from each other, and talk strategy out loud. Here is what I learned from the Rockefeller Archive Center:

RAC’s institutional history:

I've always wondered what else could be found in the Rockefeller Archive Center other than the Rockefeller family archives. It turns out RAC is also home to the archives of many philanthropic and service organizations, like the Ford Foundation and the Commonwealth Fund, as well as organizations founded by Rockefeller family members, like the Rockefeller Foundation. The Center operates in an actual house, originally built for Martha Baird Rockefeller, John D. Rockefeller Jr.'s second wife, which makes for a very interesting setup for an archival repository. There were many, many bathrooms in the house, which left me wondering whether each member of RAC's archival staff of 25 has their own personal bathroom.

RAC’s digital processing workstation:

RAC’s digital processing workstation

RAC currently uses a FRED (Forensic Recovery of Evidence Device) in conjunction with a KryoFlux and a number of floppy disk controllers, like the FC5025, to image 3.5″ and 5.25″ floppies, and FTK Imager to image optical disks and hard drives. The fact that the KryoFlux and floppy disk controllers can be connected to the FRED, and that the FRED is able to access the contents of floppies through these devices, is perhaps the most important thing the Princeton team learned from RAC, since we had been working under the assumption that the FRED's internal Tableau write blocker would prevent the FRED from accessing them. The Center also uses a MacBook to image Mac-formatted materials, something our team is thinking of adding to our workstation in the near future, since we have, and anticipate acquiring, legacy Mac-formatted materials in our collections.

Bonnie also mentioned that they have FTK installed on their FRED workstation, though it requires a lot of manual labor, and that RAC is considering getting a second processing workstation with BitCurator installed. Neither the Manuscripts Division nor our colleagues in the University Archives at Mudd Library use FTK or FTK Imager, and so far we've both been satisfied with the suite of tools BitCurator provides. That said, because donor-imposed access restrictions are much more prevalent in manuscript collections than in university archives materials, FTK might be particularly useful for zeroing in on particular files and folders with varying access restrictions.

Manuscript Division’s processing archives team with Bonnie Gordon, Assistant Digital Archivist for the Rockefeller Archive Center.

One thing Bonnie said that I am beginning to understand, and think critically about, is the need for archivists to retool or hack the tools currently available in order to make them work for our needs. Many of these tools, FTK and FRED for example, are not built with archivists in mind as primary customers, but rather forensic investigators who use them to analyze evidence in criminal investigations. These forensic tools can require significant time investments to make them responsive to archivists' needs, which makes hacking, or improvisation, necessary for folks who are doing, or want to do, archival work for cultural institutions. In our own limited but growing experience preparing digital archives for long-term preservation, we've come across some challenges in configuring each discrete piece of equipment with the necessary operating system, hardware specifications, etc. It is like taking on the task of solving a giant jigsaw puzzle:

  • when you first start, you definitely don’t know which piece goes where;
  • you may even question if you have all the pieces you need;
  • or if you know how to put a damn jigsaw puzzle together;
  • you get frustrated and want to give up with the puzzle only half completed;
  • then you realize that, if you assemble a bigger team, each of you can take a different side of the puzzle and go from there.

And truly, that is at the core of what our processing team here is trying to do: empowering all of our processing staff with the skills and expertise to share the labor of processing born-digital archives, so that more diverse sets of skills and experiences can influence conversations about workflows and configurations. If we put more heads together, we invite more creative ways of working through roadblocks; and if we have more bodies, we can tackle the growing digital backlog more efficiently.

The Princeton team had an excellent field trip out to the Rockefeller Archive Center. Stay tuned as we return the favor to the RAC team and host them at our digs here in early spring!

Digital Processing Workflows & Improvisation: A Foray into Bash Scripting

Adventures in the command line.

Over the past year, processing archivists in the Manuscripts Division have begun to integrate digital processing into our regular work. So far, each group of digital materials we’ve put through our workflow has presented at least one unique challenge that required us to supplement the standard steps in order to meet our twin goals of preservation and access.

Improvisation, however, is no stranger to "traditional" archival processing in a paper-based setting. Professional standards and local guidelines lend structure to actions taken during processing. Still, there is always a degree of improvisation involved because no two collections are alike (hence the archivist's favorite answer to any question: "It depends"). In the inevitable cases where existing documentation stops short of addressing a particular situation we encounter in the day-to-day, we use our professional judgment to get where we need to go a different way. We improvise.

Good guidelines have room for improvisation built-in. By documenting the context and reasoning behind each step, they empower users to deviate from the particulars while achieving the same general result. We’re lucky to have inherited this kind of thoughtful documentation from our colleagues at Mudd Library (notably, Digital Archivist Jarrett Drake). Archivists across the department who are processing digital materials have recently begun writing informal narrative reflections on particular problems we’ve encountered while working with digital materials and the improvisations and workarounds we’ve discovered to solve them. By linking these reflections to our digital processing guidelines, we hope to aid future troubleshooting, repurpose what we’ve learned, and share knowledge horizontally with our peers.

One example (1) is a recent issue I encountered while working with a group of text files extracted from several 3.5" floppy disks and a Zip disk from the papers of an American poet. After acquiring files from the disks, our workflow involves using DROID (Digital Record Object Identification), an open-source file format identification tool developed by the U.K. National Archives, to identify file formats and report any file extension mismatches. In this case, the report listed a whopping 4,791 mismatches, nearly all of them files that lacked extensions entirely.
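For those who prefer the terminal, DROID also ships with a command-line interface; a session along the lines below will profile a folder recursively and export the results, including any extension mismatches, to a CSV file. The paths and profile names are hypothetical, and the flags shown are the ones documented for the DROID 6 command line, so check them against your version before running:

    # Profile the extracted files recursively, then export the results to CSV.
    # (Paths and file names below are hypothetical examples.)
    ./droid.sh -R -a /path/to/extracted_files -p poet_disks.droid
    ./droid.sh -p poet_disks.droid -e droid_report.csv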

While Mac and Linux operating systems rely on internal file metadata (MIME types) rather than file extensions to determine which program is needed to open a file, Windows operating systems (and humans) rely on file extensions. Reconciling file extension mismatches is important both because future digital preservation efforts may require moving files across operating systems and because file extensions provide important metadata that can help future users identify the programs they need to access files.

Small quantities of files can be renamed pretty painlessly in the file browser; larger numbers of files requiring uniform changes within a relatively flat directory structure can be handled with the rename or mv commands in the terminal. In my case, however, the collection creator managed her files in a complex directory structure, and single directories often contained files of various types, making these manual solutions prohibitively time- and labor-intensive.
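To illustrate the simpler case: when every file in a single folder needs the same extension appended, a one-line loop in the terminal takes care of it (the .txt extension here is just an example):

    # Append ".txt" to every regular file in the current directory (skipping subdirectories).
    for f in *; do [ -f "$f" ] && mv -v "$f" "$f.txt"; done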

If there’s one thing I’ve learned from working with metadata, it’s that you should never do a highly repetitive task if, in the time it would take you to do the task manually, you could learn how to automate it. (2) In addition to completing the task at hand, you’ll develop a new skill you can reuse for future projects. While sitting down and taking a comprehensive online class in various programming languages and other technologies is certainly useful for archivists venturing into the digital, it’s often hard to find the time to do so and difficult to retain so much technical information ingested all at once. Problems encountered in day-to-day work provide an excellent occasion for quickly learning new skills in digestible chunks in a way that adds immediate value to work tasks and leads to better information retention in the long run. This philosophy is how I decided to solve my mass file renaming problem with bash scripting. I was somewhat familiar with the command line in Linux, and I figured that if I knew the command to rename one file, with a little finagling, I should be able to write a script to rename all 4,791 based on data from DROID.

To create my input file, I turned the file extension mismatch report from DROID into a simple CSV file containing one field with the full path to each file missing an extension and a second field with the matching file extension to be appended. To do so, I looked up the correct file extensions in the PRONOM technical registry using the identification number supplied by the DROID report. I then inserted the extensions into my input file using the Find/Replace dialog in Google Sheets, deleted columns with extraneous information, and saved as a CSV file.
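The resulting input file is nothing fancy: two comma-separated fields per line, the full path first and the extension to append second. The rows below are invented examples (in the sketch at the end of this post, the extension field is assumed to include the leading dot):

    /home/archivist/poet_papers/disk01/draft_poem_01,.txt
    /home/archivist/poet_papers/disk02/letter_to_editor,.doc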

The script I wrote ended up looking like this: (3)

My bash script.

In a nutshell, a bash script is a series of commands that tells the computer what to do. The first line, the "shebang" that appears at the beginning of every bash script, tells the computer which shell is needed to interpret the script. Lines 3-9 first clear the terminal window of any clutter (clear) and then prompt the user to type into the terminal the file path to the input file and the location where a log file should be stored (echo); after the user types in each file path, the script reads it (read) and turns the user's response into variables ($input and $log). These variables are used in the last line of the script in a process called redirection. The < directs the data from the input file into the body of the script, and the > directs the output into a log file.

Terminal window containing prompts initiated by the echo command, along with user input.

The real mover and shaker in this script is a handy little construct called a while loop (while, do, & done are all part of its syntax). Basically, it says to repeat the same command over and over again until it runs out of data to process. In this case, it runs until it gets to the last line of my input file. IFS stands for internal field separator; by using this, I’m telling the computer that the internal field separator in my input is a comma, allowing the script to parse my CSV file correctly. The read command within the while loop reads the CSV file, placing the data from field 1 (the full path to each file to be renamed) into the variable $f1 and the data from field 2 into the variable $f2 (the file extension to be appended). The mv command is responsible for the actual renaming; the syntax is: mv old_name.txt new_name.txt. In this case, I’m inserting my variables from the CSV file. Quotes are used to prevent any problems with filenames or paths that have spaces in them. -v is an option that means “verbose” and prompts the terminal to spit out everything that it’s doing as it runs. In this case, I’m redirecting this output into the log file so that I can save it along with other administrative metadata documenting my interventions as an archivist.
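For readers who can't make out the script in the screenshot above, here is a simplified sketch of the same structure (prompt wording and spacing are approximate rather than an exact copy, and the extension field in the CSV is assumed to include the leading dot):

    #!/bin/bash

    # Clear the terminal and ask for the input CSV and the log file location.
    clear
    echo "Enter the path to the input CSV file:"
    read input
    echo "Enter the path where the log file should be saved:"
    read log

    # Read the CSV line by line, splitting on commas: field 1 is the full path
    # to the file to be renamed, field 2 is the extension to append.
    while IFS=, read -r f1 f2
    do
        mv -v "$f1" "$f1$f2"
    done < "$input" > "$log"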

In the end, what would have taken me or a student worker countless hours of tedious clicking now takes approximately four seconds in the command line. I’m glad I spent my time learning bash.


Notes

(1) For an even better example, see Latin American Collections Processing Archivist Elvia Arroyo-Ramirez’s “Invisible Defaults and Perceived Limitations: Processing the Juan Gelman Files” (here).

(2) Maureen Callahan’s “Computational Thinking and Archives” blog post certainly helped inspire this way of thinking.

(3) Here’s where I learned how to make my script executable and how to execute it from the command line.

Tooling Up: Building a Digital Processing Workstation

Learning about jumper settings on our 5.25″ floppy disk workstation.

Since completing a comprehensive survey of born-digital holdings within the Manuscripts Division in 2015, the archival processing team at Firestone Library has been steadily gathering the equipment necessary to safely access and preserve digital data stored on obsolete computer media. In addition to the nearly 400 digital media uncovered by our recent survey, the Manuscripts Division continues to acquire digital materials at an increasing pace, most recently within the papers of Toni Morrison, Juan Gelman, and Alicia Ostriker.

We've leaned heavily over the past year on the infrastructure and expertise of our colleagues at the Seeley G. Mudd Manuscript Library to get our feet wet with digital processing, including help with extracting data from over 150 floppy disks in the Toni Morrison Papers. This year, we've taken the deep dive into assembling a digital processing workstation of our own. As born-digital archival processing becomes a core part of "regular" archival processing, the tools available to archivists must expand to reflect the materials we encounter on a day-to-day basis in manuscript collections; and we, as archivists, have to learn how to use them.

Manuscripts Division Digital Processing Toolkit (Because everything looks better in a cool toolkit).

3.5″ and 5.25″ floppy disks are a common occurrence within personal papers dating from the mid-1970s through the mid-2000s. Disks often arrive on our desks labeled only with a few obscure markings (if we're lucky), but their contents remain inaccessible without equipment to read them. Since contemporary computers no longer contain floppy disk drives or controllers, we had to get creative. Based on research and recommendations from Jarrett Drake, Digital Archivist at Mudd Library, we assembled a toolkit of drives, controller boards, and connectors that have enabled us to read both 3.5″ and 5.25″ floppy disks on our dedicated digital processing laptop, which dual boots Windows 7 and BitCurator (a digital forensics environment running on the Ubuntu distribution of Linux).

3.5" Floppy Drive with USB Connector

3.5″ Floppy drive with USB connector

Fortunately, external drives for 3.5” floppy disks are still readily available online for around $15 from Amazon, eBay, or Newegg. We purchased one that connects directly to the USB port on our laptop, which Latin American Collections Processing Archivist Elvia Arroyo-Ramirez and our student assistant Ann-Elise Siden ’17 recently used to read and transfer data from 164 floppy disks in the Juan Gelman Papers (which will be the subject of an upcoming post).

5.25″ floppy disks, which preceded the 3.5″ model, present a somewhat hairier challenge, since new drives are no longer commercially available. Based on positive results with a similar set-up at Mudd Library, we purchased an FC5025 USB 5.25″ floppy controller from Device Side Data to use in conjunction with an internal TEAC FD-55GFR 5.25″ floppy disk drive we bought from a used electronics dealer on Amazon. The Device Side Data floppy controller came with a 34-pin dual-row cable to connect the controller board to the drive and a USB cable to connect to our laptop. After hooking everything up, we also realized we would need a Molex AC/DC power adapter to power the 5.25″ drive from a wall outlet, which we were also able to secure online at Newegg. All in all, our 5.25″ floppy disk workstation cost us about $130. Compare that to the price of archival boxes, folders, and bond paper, and it's actually pretty reasonable.

5.25" Floppy Drive

5.25″ Floppy drive (Purchased used from Amazon dealer)

5.25" Floppy Drive Controller from Device Side Data

5.25″ Floppy drive controller from Device Side Data

While these set-ups have been largely successful so far, there have been a handful of problem 3.5″ floppy disks our drive couldn't read, likely due to prior damage to the disk or obscure formatting. After doing some additional research into methods adopted by peer institutions, we decided to try out the KryoFlux, a forensic floppy controller that conducts low-level reads of disks by sampling "flux transitions" and allows for better troubleshooting and handling of multiple encoding formats. While an institutional KryoFlux license is a significantly costlier option than the others we've discussed in this post, funds from our purchase will support future development of the tool, and it will be available for use by University Archives and Public Policy Papers staff as well as those in the Manuscripts Division.

Very recently, we received our KryoFlux by mail from Germany. Upon opening and inspecting the package, among the hardware kit, controller boards, disk drive, and cables, we were delighted to find a gift: several packages of Goldbären (i.e. adorable German gummy bears). Our next steps will be installing the KryoFlux software on our laptop, connecting the hardware, and testing out the system on our backlog of problematic floppy disks, the results of which we will document in a future post. In the meantime, we are interpreting the arrival of these candies as a fortuitous omen of future success, and at the very least, a delicious one.

A gift from our friends at KryoFlux.

A fortuitous omen of future successes in disk imaging.