Wednesday, October 10, 2012

Scaling! Or, why I need a CS degree to do this stuff

When last we spoke, I had just accessioned some materials from the EAD Roundtable which I thought were going to be good test cases for using these tools. Shortly after that, however, I was presented with an even juicier opportunity: the event files of the UWM Chancellor. These files contain event overview forms, some correspondence, some presentations, and other material relevant to the chancellor's activities during the course of a given academic year. The files are likely to be relatively high-use and high-value, which shot them to the front of my testing queue. They were also, by and large, actual document files, which meant not much weeding was necessary to pick the researcher-usable wheat from the supporting-files chaff.

Oh, yes, one more thing about these files: there are 13,000 of them. 10 GB of Chancellory goodness.

I hear the Digital Archivists laughing at my paltry numbers even now. "Ohhh, TEN GIGS," they say. "That's almost the size of the manifest for this huge research data set I'm archiving." Yeah, well, we can't all have integrated TDR software tools, OK guys? Some of us have to hold together our e-records processes with chewing gum and baling wire. 10 GB is far and away the largest accession that the UWM Archives has taken in thus far, so for me it's a challenge, all right? (I am sure that if and when I get this data curation initiative off the ground I will look back fondly on the days when I worried about accessions that were merely 10 GB.)

Anyway. First thing I noticed is that NONE of the tools I've been using scale on their own. From the reading I've done, the reason for this is the same reason that these tools are platform neutral: they run in Java, which means they do their work (extracting file metadata, building checksums, whatever) inside the fixed allotment of memory set aside for the Java virtual machine. This many files means that the allotted memory runs out fast and the process stops (if the program doesn't crash altogether). OK, fine. So I just have to do the analysis in chunks instead. This is annoying, but not undoable. DROID (which I'm going to discuss soon, I promise!) runs fine when I chunk it out by year, although it does give me the following in my SIP metadata folder:
Spot the irony: ".droid" is not listed in PRONOM as a recognized file format. This is what passes for humor in the Archives world, folks. 
Note the "comprehensive breakdown" files-- these are various outputs of DROID's reporting function, which makes the issue of multiple profiles significantly less problematic because I can open all of these profiles and have them vomit their output into a single chart for comparison. So far, so good.
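For the script-inclined, the chunking itself does not have to be done by hand. What follows is only a rough sketch, not anything DROID provides: it assumes the accession is laid out with one top-level folder per year, and the function name and the `max_files` limit are made up for illustration.

```python
import os
from collections import defaultdict


def chunk_by_subfolder(root, max_files=1000):
    """Group files under root by top-level subfolder, splitting big groups.

    Returns a list of (subfolder_name, [file paths]) tuples, each holding
    at most max_files paths. Files sitting directly in root are grouped
    under ".". The layout (one folder per year) is an assumption.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        # The first component of the relative path is the top-level folder.
        top = os.path.relpath(dirpath, root).split(os.sep)[0]
        for name in filenames:
            groups[top].append(os.path.join(dirpath, name))

    chunks = []
    for top in sorted(groups):
        files = sorted(groups[top])
        # Slice each group into pieces no larger than max_files.
        for start in range(0, len(files), max_files):
            chunks.append((top, files[start:start + max_files]))
    return chunks
```

Each chunk could then be copied or pointed at as a separate profile. The right value for `max_files` is a matter of trial and error, since it depends on how much memory the tool has to work with.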


The New Zealand Metadata Harvester, on the other hand... less far, less good. As I expected, it couldn't extract the whole accession in one go, so I attempted to chunk it down further. In most cases it handled a single year fine; in other cases, I got this result instead:

Specific!

OK. No problem. Presumably by breaking it down even further, I can get the number of files in each chunk small enough for the extraction to go through. This works for the first few years, though I have to remove the October folder from each of them to make it work, which leads me to believe that the NZMH is tripping over a parse error. This theory is quickly demolished when I get to 2009, at which point *none* of the folders will extract properly. In a fit of optimism, I hit the logs to attempt to determine the file or folder the program is tripping over.
"Your tears are delicious."--New Zealand Metadata Harvester
I have now run into one of the biggest problems to plague Human-Computer Interaction since the UNIVAC days, namely that front-line coders are, by and large, AWFUL at writing for laymen. I need to know from the logs:
  1. What happened
  2. What file it happened to
  3. How I, the average user, can fix it so it stops happening
What I have been given instead:
  1. What happened
  2. WHEN it happened (Seriously? I need to know to the second when my process crapped out?)
  3. The specific script violation(s) invoked
  4. How someone who is competent with coding can fix it so it stops happening (I think. I honestly don't know, since I don't read code)
The whole point of writing a GUI is so that people like me who aren't comfortable around command lines can use the program and solve problems when they arise. So, of course, when a problem DOES arise... I'm sent back to information that I need facility with the command line to use. Thanks, guys. Honestly, the lack of intelligible documentation for so many of these tools is the single greatest barrier to entry to e-records processing, and until it gets better too many archivists are just going to throw up their hands and give up on this process altogether. 

In the meantime, *I'm* not giving up, though I am leaving huge gaps in the metadata for some of these chunks (as well as seriously considering going back to school for Comp Sci if I am going to continue down the e-recs road... though that would also require time and money that I don't have). I suppose I could go folder by folder to determine which specifically is causing the problem, but that sort of defeats the purpose of automating the process with metadata extraction tools, doesn't it? Oh well. Something is better than nothing, I suppose, though it will be necessary to indicate somehow that the metadata is incomplete (for now I am using the "comment" tag in the NZMH output XML). I haven't even tried the Duke Data Accessioner on a collection this big yet, so that's going to be my next step, though I also want to figure out if I can run it without having to create copies of the files-- one set of this stuff is enough! (Until I create a working copy set for processing, anyway).
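The folder-by-folder hunt does not have to mean testing every folder one at a time. Here is a minimal sketch of the halving approach, under some loud assumptions: `extract_ok` is a hypothetical stand-in for "run the extraction on this batch and report whether it finished" (with a GUI tool that step would be a person clicking), and the failure is assumed to come from particular folders rather than from the sheer size of the batch.

```python
def find_failing(folders, extract_ok):
    """Return the folders that cause extract_ok to fail.

    folders    -- a list of folder names or paths
    extract_ok -- hypothetical callable: takes a list of folders and
                  returns True if extraction on that batch succeeds

    Works by testing the batch, and when it fails, splitting it in half
    and testing each half in turn until single folders are isolated.
    """
    if not folders or extract_ok(folders):
        return []  # nothing here is causing trouble
    if len(folders) == 1:
        return list(folders)  # narrowed down to one culprit
    mid = len(folders) // 2
    return (find_failing(folders[:mid], extract_ok)
            + find_failing(folders[mid:], extract_ok))
```

With twelve monthly folders and one bad apple, this takes a handful of runs rather than twelve. If the real problem is memory rather than a specific file, every large batch will fail and the halving will not isolate anything useful.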

I AM pleased to note that the NARA File Analyzer seems to work fine, if sluggishly, for generating a manifest of files with checksums, and was able to get the entire accession in one go. Score one for the Federal Government! As noted in my post about ingest, however, the output is not very pretty, though if we have to use it we have to use it. I think when my fieldworker puts together her procedures list for e-records processing a separate section for large accessions is going to be necessary anyway, so I might as well think about which tools might work better and how use of them is going to change.
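For comparison, a bare-bones manifest of files with checksums can be sketched in a few lines of Python. This is not how the NARA File Analyzer works and does not reproduce its output; the column names and function names are made up for illustration. It reads each file a block at a time, so memory use stays flat no matter how large the accession gets.

```python
import csv
import hashlib
import os


def file_checksum(path, algorithm="sha256", block_size=1024 * 1024):
    """Compute a file's checksum by reading it in fixed-size blocks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def write_manifest(root, manifest_path, algorithm="sha256"):
    """Write a CSV of relative path, size, and checksum for files under root.

    manifest_path should sit outside root, or the manifest would end up
    listing itself. Returns the number of files recorded.
    """
    count = 0
    with open(manifest_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["relative_path", "size_bytes", algorithm])
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # walk subfolders in a predictable order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                writer.writerow([
                    os.path.relpath(full, root),
                    os.path.getsize(full),
                    file_checksum(full, algorithm),
                ])
                count += 1
    return count
```

The resulting CSV opens in any spreadsheet program, which at least makes the output easier on the eyes.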

Next up: heading backwards and talking about DROID, JHOVE, and NZMH in more detail as part two of metadata generation for ingest. Wheee!