Below are the slides and script I used for the talk I gave with Danielle Robichaud at Access 2017 in Saskatoon. Note that the script has been edited from my messy notes to be readable, so it’s not 100% verbatim.

Our slides can be found in UWaterloo’s institutional repository:

Slide 1 (Sara)

Thank you, welcome, etc.

Slide 2 (Sara)

So today we’re going to be talking about migrating archival data, and since this is (generally speaking) a non-archival crowd I’m going to start with a primer on archival data and then a quick history of archival description technology. And then I’m going to turn it over to Danielle who’s going to present a case study of migrating archival data at the University of Waterloo.

I should start by saying that, well, we do script a lot of data transformations. I do a lot of data migration work, and I’m certainly not manually transforming data to match a specific standard. I rely on the developers that I work with to do a ton of scripting. But today we wanted to focus on the factors that make that work difficult and look at a situation where saying “Why can’t you just script it?” isn’t helpful.

Slide 3 (Sara)

So, let’s jump right in. Archives aren’t libraries and our data isn’t like library data! Sorry.

Slide 4 (Sara)

All of our descriptions are original work. Every time we describe archival material, it’s the first time it’s ever been described. So we spend a lot of time researching our collections, learning about them, writing about them, and getting deeply, emotionally involved with them. This is why archivists, like all cataloguers, are protective of our data.

Slide 5 (Sara)

Each archival record, along with being original all the time, is part of, or represents, an organic, inter-related, hierarchical conglomeration of material. One record could describe one photograph, or it could describe a folder full of photographs, or it could describe a collection of hundreds of folders containing thousands of photographs. There isn’t a one-to-one relationship between the object and the record.

Slide 6 (Sara)

Generally speaking, we don’t describe the single photograph, because we don’t have the mandate to do so. We describe some level of the hierarchy above that, relying on researchers to drill down and find the individual items. For many archivists, describing individual objects only happens when we receive special funding from the donor, or we decide that item is important enough, or we have a specific need to write those descriptions, as Dr Christen emphasized this morning.

Slide 7 (Sara)

So we’ve got complex data; I would also think of it as fragile data – not easy to replicate, maybe only stored in a database somewhere. Surely, archivists should be concerned with standardization. And this is true – but adoption of national or international standards has been slow and uneven, especially when those standards conflict with systems that are already in place. In libraries, cataloguing standards have been shared by institutions for many decades, but archives never had that, so we developed systems internal to our institutions, sometimes even internal to a single collection, to describe our material. Over the past, say, 40 years, progress has been made, but the adoption of technological solutions to support standardized, machine-readable data has been even slower.

Slide 8 (Sara)

So this is a very brief history of archival description and archival description technology. This is in no way comprehensive, and it’s from a very Canadian perspective.

  • 1990: In 1990, RAD, the Canadian Rules for Archival Description, was first published; it’s become the de facto standard for archival description in Canada today.
  • Mid-90s: By the mid-90s, some institutions had started to adopt database solutions to take the place of paper records; this was in no way universal, though, and many institutions still used catalogue cards and paper finding aids (and some still do today).
  • 1995: In 1995, a hyperlinked version of RAD was released, which helped archivists navigate the 200+ page standard.
  • Late 90s-00s: By the late 90s and early 2000s, specialized archival management software started to become available.

Slide 9 (Sara)

  • 1998: In 1998, Encoded Archival Description brought XML to the archives, giving us the ability to harvest and share our data.
  • 2001: In 2001, the International Council on Archives published a report recommending a standardized, open source tool for encoding archival finding aids, building on the availability of EAD.
  • 2008 (July): After 7 long years, but as an indirect result of that OSARIS report, ICA-AtoM 1.0-beta was released at the ICA Congress in Kuala Lumpur.
  • 2008 (November): And by November of that year it had implemented support for RAD.

So it took 18 years for our Canadian standard to have a system where it could be represented online. And in the years since 2008, AtoM has become the de facto system for online archival description in Canada.

Slide 10 (Sara)

Finally, this quote is perhaps a bit unfair, because certainly there are tech-inclined, progressive, forward-thinking archivists among us. But I like it, because I think that there are parallels to what we’re talking about today. In 1970, Jay Atherton wrote that “Just to mention the words ‘computer’ or ‘automation’ in some circles is to invite cold suspicious stares of hostility, making one feel as though he had said something dirty.” Thinking about our complicated, messy, homegrown data, and the snail’s pace progress that we’ve made in developing technological solutions to make archivists’ work easier has perhaps made us especially wary of trying to adopt a “tech will fix it” mentality.

Slide 11 (Danielle)

Time for the file migration case study!

Slide 12 (Danielle)

  • Photo: Drawer of Kitchener-Waterloo Record photographic negatives collection
  • The collection consists of approximately 2 million negatives and is one of our most heavily used collections, which made it an obvious choice for our Islandora (Waterloo Digital Library) pilot, but the decision presented a series of challenges tied to how the negatives were described.
  • The descriptive information is limited to what was provided by KWR staff, which is title, shoot date and whether the photos are colour or black and white. The information is useful to have but since it was added by the photographer for use by office staff, it doesn’t align well with the types of questions researchers want answered.

Slide 13 (Danielle)

Here’s an exercise to illustrate the disconnect between envelope titles and expectations.

What type of images would you expect to see in an envelope from 1953 titled “St. Agatha Orphanage”?

Some good guesses might be: children, staff, interior or exterior shots, landscape photos, etc.

Slide 14 (Danielle)

Did anyone guess a man in a butcher’s apron sharpening a knife?

Challenges to figure out as part of migration:

  1. How to efficiently create item-level descriptions?
    • Copying file-level records for re-use at the item level doesn’t work because they often don’t reflect the contents of the images, which falls short of our migration goals: improving accessibility and discoverability
  2. How can we improve the descriptions to facilitate keyword searching and the identification of people or events?
    • UW 60th Anniversary, retirements, repurposing of historical buildings as part of KW tech boom = constant requests for photos of XYZ
  3. And most importantly for me, how can I answer these questions without spending the rest of my career answering them?
    • Individually describing each image is unsustainable and incomplete – lots of opportunity for “unidentified people”

Slide 15 (Danielle)

My solution: go to the source. The newspaper explains that community members would volunteer at a butchering bee every fall to make sure the orphanage had a meat supply during the winter months.

  • Newspaper headlines and photo captions answer key how and why questions; they also confirm when the photos ran in the paper

Transcribing photo captions into descriptive records serves as a substitute for original descriptive work.

  • Hand transcription was selected over OCR because of the quality of the microfilm and the post-scan clean-up and formatting work that OCR output would require
  • Energy redirected to image scanning, alt-text and intentional showcasing of underrepresented stories
  • Improved records impact multiple migration, centralization and modernization projects

Slide 16 (Danielle)

  • Photo captions provide text for more precise keyword searching
  • Source of the info clearly identified as coming from the paper

Where has scripting helped?

  • Generating standalone XML files from spreadsheets
  • Batch updates
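To give a sense of the first of those two tasks, here is a minimal sketch of turning each row of a transcription spreadsheet into a standalone XML descriptive record. The column names (identifier, title, shoot_date, caption) and the sample row are hypothetical illustrations, not the actual Waterloo spreadsheet schema or output format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A stand-in for a transcription spreadsheet exported as CSV.
# Column names and values here are invented for illustration.
SAMPLE_CSV = (
    "identifier,title,shoot_date,caption\n"
    "kwr-1953-0042,St. Agatha Orphanage,1953-11-02,"
    "Volunteers sharpen knives at the annual butchering bee.\n"
)

def row_to_xml(row):
    """Build one <record> element per spreadsheet row."""
    record = ET.Element("record")
    for field, value in row.items():
        child = ET.SubElement(record, field)
        child.text = value
    return ET.tostring(record, encoding="unicode")

for row in csv.DictReader(io.StringIO(SAMPLE_CSV)):
    xml_text = row_to_xml(row)
    # In practice each record would be written to its own file, e.g.:
    # with open(f"{row['identifier']}.xml", "w") as f:
    #     f.write(xml_text)
    print(xml_text)
```

The same loop-over-rows structure works for batch updates: read the existing records back in, change the fields in question, and re-serialize.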

What can you do to support colleagues working with special collections?

  • Assume and respect their expertise
  • Offer help based on what problems they bring forward
  • Nothing more discouraging than someone asking if you’ve heard about OCR

Small print

Danielle and I are very grateful to the conference organizers for accepting our talk and for bringing more archivists to Access!

“No, we can’t just script it” and other refrains from (tired) archival data migrators
