“People have had a lot of trouble getting stuff out of RecordPoint.”
This sentence was a little worrying to hear. It was 2015, and our archive was contemplating digital preservation for the first time. We didn’t really know what it was, or how it worked. Neither did anyone else: the idea of having a “digital preservation system” was met with blank stares around the office. “Is it like a database? Why not use one of our CMSs instead? Why do we need this?”
And so it was that I realised I was in over my head and needed outside help. I looked up state records offices to find out what they were doing, and discovered that “Digital Preservation Officer” is an actual job title. I contacted one of these Digital Preservation Officers to get on the right path.
The Digital Preservation Officer’s knowledge in that first conversation was invaluable, and helped us get over those early hurdles. She explained the basics: why digital preservation is important for an archive, how to get started, what the jargon means, and how to convince non-archivists that yes, it is necessary. And – the importance of figuring out what you want to preserve.
“We will need to preserve digital donations,” I listed, “and digitisations of our physical inventory. Plus, I manage our digital records management system, RecordPoint – if we’re serious about our permanent records we will need to preserve those as well.” (The international digital records management system standard, ISO 16175 Part 2, says that “long-term preservation of digital records… should be addressed separately within a dedicated framework for digital preservation or ‘digital archiving’”.)
It was at this point that the Digital Preservation Officer replied with the quote that began this article.
I don’t think she was quite right – getting digital objects and metadata out of RecordPoint was fairly easy. The challenge, it turned out, would be getting the exported digital objects into our digital preservation system, Archivematica.
In the image shown below, the folders on the left represent the top level of a RecordPoint export of two digital objects. The folders on the right are what Archivematica expects in a transfer package.
In the example above, there are three folders for ‘binaries’ (digital objects) and two folders for ‘records’ (metadata). Immediately something doesn’t make sense – why are there three binary folders for two objects?
The reason is that the export includes not only the final version of the digital object but also all previous drafts. In my example there is only a single draft, but if a digital object had 100 drafts, they would all be included here. This is great for compliance, but not so great for digital preservation, where careful appraisal is necessary. The priority when doing an ‘extract, transform, load’ (ETL) from RecordPoint to Archivematica would be to ensure that the final version of each binary, and only the final version, made it across to the ‘objects’ folder on the right.
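The "keep only the final version" rule can be sketched in a few lines. This is a Python illustration, not RPEAT's actual R code, and it assumes that each exported binary can be tied to a record identifier and a version number. The `ExportedBinary` fields, the identifiers and the file paths below are all hypothetical, since the real RecordPoint export layout is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportedBinary:
    record_id: str  # hypothetical: identifier shared by every version of a record
    version: int    # hypothetical: a higher number means a later draft
    path: str       # hypothetical: location of the file inside the export


def select_final_versions(binaries):
    """Return a dict mapping each record_id to its highest-version binary."""
    latest = {}
    for binary in binaries:
        current = latest.get(binary.record_id)
        if current is None or binary.version > current.version:
            latest[binary.record_id] = binary
    return latest


# Three binaries for two records, mirroring the example export:
# one record has a draft plus a final version, the other has a single version.
export = [
    ExportedBinary("REC-001", 1, "binaries/a1/report_draft.docx"),
    ExportedBinary("REC-001", 2, "binaries/a2/report.docx"),
    ExportedBinary("REC-002", 1, "binaries/b1/minutes.pdf"),
]

finals = select_final_versions(export)
for record_id, binary in sorted(finals.items()):
    print(record_id, binary.path)
```

Only the two final versions survive the selection. The drafts stay in the export and are never copied into the ‘objects’ folder.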
An Archivematica transfer package should not consist only of the digital objects themselves, of course – you are not truly preserving digital objects unless you also preserve their descriptive metadata. This is why the ‘metadata’ folder on the right exists: you can optionally create a single CSV file, ‘metadata.csv’, which contains one row of metadata for each digital object in the submission. Archivematica uses this CSV file as part of its metadata preservation process.
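To make the shape of that file concrete, here is a minimal Python sketch that builds the text of a ‘metadata.csv’. It assumes the layout described in the Archivematica documentation as I understand it: a first column called `filename` holding paths relative to the transfer (such as `objects/report.docx`), followed by Dublin Core columns such as `dc.title`. Check the documentation for your Archivematica version before relying on this. The helper name `build_metadata_csv` and the sample titles and dates are invented for illustration.

```python
import csv
import io


def build_metadata_csv(rows, fields):
    """Render rows as CSV text: a 'filename' column followed by the chosen fields."""
    columns = ["filename"] + list(fields)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing values become empty cells rather than raising an error.
        writer.writerow({column: row.get(column, "") for column in columns})
    return buffer.getvalue()


# Illustrative rows only: one row per digital object in the transfer.
rows = [
    {"filename": "objects/report.docx", "dc.title": "Annual report", "dc.date": "2014-06-30"},
    {"filename": "objects/minutes.pdf", "dc.title": "Board minutes", "dc.date": "2013-11-12"},
]

csv_text = build_metadata_csv(rows, ["dc.title", "dc.date"])
print(csv_text)
```

In a real transfer the resulting text would be saved as ‘metadata.csv’ inside the ‘metadata’ folder, alongside the ‘objects’ folder it describes.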
In contrast, RecordPoint creates a separate metadata file for every digital object it exports. If you wanted to pull metadata across into the metadata CSV file for the Archivematica submission, you would need to go through every single metadata XML file in the export and copy and paste each individual metadata element. Based on a test, sorting the final record from the drafts and preparing its metadata for Archivematica might take two to four minutes per record. Assuming we have 70,000 records requiring preservation, transforming these records manually would take roughly 2,300 to 4,700 hours. Although technically possible, this is too much work to be practical, and tedious, detail-oriented work of this kind would carry a high likelihood of errors.
Fortunately, I knew the R programming language. R is used by statisticians to solve data transformation problems – and this was a data transformation problem! I created an application using a tool called R Shiny, providing a graphical interface that sits on the Archivematica server. I creatively called it RecordPoint Export to Archivematica Transfer (RPEAT). After running a RecordPoint export, you select the export to be transformed from a drop-down list in RPEAT and select the metadata to be included from a checklist. RPEAT then copies the final version of each digital object from the export into an ‘objects’ folder and trawls through each XML file to extract the required metadata. Finally, RPEAT creates a CSV file that contains all of the required metadata, and moves it into the ‘metadata’ folder. Everything is then ready for transfer into Archivematica.
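The XML-trawling step might look something like the following. This is a hedged Python sketch rather than RPEAT's actual R code, and the XML structure is invented: the `<record>` and `<field name="...">` elements, the field names, and the `FIELD_TO_COLUMN` mapping are all assumptions standing in for whatever the real RecordPoint export contains. The `selected_fields` argument plays the role of the checklist.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata file for one exported record (not the real RecordPoint schema).
SAMPLE_XML = """<record>
  <field name="Title">Annual report</field>
  <field name="Author">J. Citizen</field>
  <field name="DateCreated">2014-06-30</field>
</record>"""

# Hypothetical mapping from export field names to metadata.csv column headers.
FIELD_TO_COLUMN = {
    "Title": "dc.title",
    "Author": "dc.creator",
    "DateCreated": "dc.date",
}


def extract_metadata(xml_text, selected_fields):
    """Return {csv_column: value} for the fields ticked on the checklist."""
    root = ET.fromstring(xml_text)
    found = {
        element.get("name"): (element.text or "").strip()
        for element in root.iter("field")
    }
    # Fields absent from the XML produce an empty string, not an error.
    return {FIELD_TO_COLUMN[name]: found.get(name, "") for name in selected_fields}


row = extract_metadata(SAMPLE_XML, ["Title", "DateCreated"])
print(row)
```

Running this once per metadata file yields one dictionary per record, which is the shape a CSV writer needs to produce one row per digital object.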
Pushing 212 records exported from RecordPoint through RPEAT, selecting the correct metadata from the checklist, and doing some quick human quality assurance took 7 minutes. Scaled up, transforming all 70,000 records this way would take fewer than 39 hours. At roughly two seconds per record, against two to four minutes per record by hand, RPEAT reduces the time taken to prepare records for Archivematica by about 98 to 99%.
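The scale-up is simple proportional arithmetic. The short Python check below assumes that the time per record stays constant as the batch grows, which is an assumption and not something the 212-record test can confirm. The function name `scaled_hours` is made up for this example.

```python
def scaled_hours(sample_records, sample_minutes, total_records):
    """Extrapolate linearly from a timed sample to a full collection, in hours."""
    minutes_per_record = sample_minutes / sample_records
    return minutes_per_record * total_records / 60


# Figures from the test run: 212 records in 7 minutes, 70,000 records in total.
seconds_per_record = 7 * 60 / 212
rpeat_hours = scaled_hours(212, 7, 70_000)

print(f"{seconds_per_record:.1f} seconds per record")     # 2.0 seconds per record
print(f"{rpeat_hours:.1f} hours for 70,000 records")      # 38.5 hours for 70,000 records
```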
The advice that the Digital Preservation Officer provided all those years ago was invaluable, and her warning about “getting stuff out of RecordPoint” was particularly pertinent – but I wish to expand on her point. The challenge is not unique to RecordPoint – the challenge is ETL in general. At a meeting of Australia and New Zealand’s digital preservation community of practice, Australasia Preserves, in early 2019, other archivists shared their struggles with ETL from records management systems into their digital archives. The ability to do ETL is an important addition to the growing suite of technical skills valuable to us digital preservation practitioners.
International Organisation for Standardisation. (2011). Information and documentation — Principles and functional requirements for records in electronic office environments — Part 2: Guidelines and functional requirements for digital records management systems (ISO 16175-2). Retrieved from https://www.saiglobal.com/.