Oh, yes, one more thing about these files: there are 13,000 of them. 10 GB of Chancellery goodness.
I hear the Digital Archivists laughing at my paltry numbers even now. "Ohhh, TEN GIGS," they say. "That's almost the size of the manifest for this huge research data set I'm archiving." Yeah, well, we can't all have integrated TDR software tools, OK guys? Some of us have to hold together their E-records processes with chewing gum and baling wire. 10 GB is far and away the largest accession that the UWM Archives has taken in thus far, so for me it's a challenge, all right? (I am sure that if and when I get this data curation initiative off the ground I will look back fondly on the days when I worried about accessions that were merely 10 GB.)
Anyway. First thing I noticed is that NONE of the tools I've been using scale on their own. From the reading I've done, the reason for this is the same reason these tools are platform neutral: they run in Java, which means they extract file metadata/build checksums/whatever inside the Java Virtual Machine's fixed-size memory heap. With this many files, that heap fills up fast and the process stops (if the program doesn't crash altogether). OK, fine. So I just have to do the analysis in chunks instead. This is annoying, but not undoable. DROID (which I'm going to discuss soon, I promise!) runs fine when I chunk it out by year, although it does give me the following in my SIP metadata folder:
|Spot the irony: ".droid" is not listed in PRONOM as a recognized file format. This is what passes for humor in the Archives world, folks.|
|"Your tears are delicious."--New Zealand Metadata Harvester|
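For the curious, the chunk-by-year workaround can be sketched in a few lines. This is a hypothetical illustration, not my actual script: the year-folder layout and the DROID command-line invocation in the comment are assumptions (though `-Xmx` is the real JVM flag for raising the heap ceiling, which is worth trying before resorting to chunking at all).

```python
# Hypothetical sketch: group an accession's files by their top-level folder
# (here, a year) so each metadata-extraction run stays small enough for the
# JVM's default heap. The folder layout is an assumption for illustration.
from collections import defaultdict
from pathlib import PurePosixPath

def chunk_by_top_folder(paths):
    """Map each file path to its first path component (e.g. the year folder)."""
    chunks = defaultdict(list)
    for p in paths:
        chunks[PurePosixPath(p).parts[0]].append(p)
    return dict(chunks)

files = ["2004/budget.xls", "2004/memo.doc", "2005/minutes.doc"]
for year, members in sorted(chunk_by_top_folder(files).items()):
    # Each chunk then gets its own tool run, something like:
    #   java -Xmx1024m -jar droid-command-line.jar -R -a <year folder> ...
    # (exact DROID flags vary by version -- check its own docs)
    print(year, len(members))  # prints: 2004 2, then 2005 1
```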
I have now run into one of the biggest problems to plague Human-Computer Interaction since the UNIVAC days, namely that front-line coders are, by and large, AWFUL at writing for laymen. What I need to know from the logs:
- What happened
- What file it happened to
- How I, the average user, can fix it so it stops happening
What the logs actually give me:
- What happened
- WHEN it happened (Seriously? I need to know to the second when my process crapped out?)
- The specific script violation(s) invoked
- How someone who is competent with coding can fix it so it stops happening (I think. I honestly don't know, since I don't read code)
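In fairness, you can sometimes claw the one useful line back out of a Java error log yourself. Here's a hedged sketch: it assumes the logs look like a typical JVM stack trace ("SomeException: message" followed by indented "at ..." frames), which the real tools' logs may or may not match.

```python
# Hypothetical sketch: skim a Java-style stack trace and return the one
# human-readable line (the exception and its message), skipping the frames.
# Assumes conventional JVM log formatting; real tool logs may differ.
def summarize_trace(log_text):
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("at ") or line.startswith("..."):
            continue  # skip blank lines and stack frames
        if "Exception" in line or "Error" in line:
            return line  # the one line a non-coder actually needs
    return None

trace = """java.lang.OutOfMemoryError: Java heap space
\tat java.util.Arrays.copyOf(Arrays.java:3332)
\tat java.lang.StringBuilder.append(StringBuilder.java:136)"""
print(summarize_trace(trace))  # -> java.lang.OutOfMemoryError: Java heap space
```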
In the meantime, *I'm* not giving up, though I am leaving huge gaps in the metadata for some of these chunks (as well as seriously considering going back to school for Comp Sci if I am going to continue down the e-recs road... though that would also require time and money that I don't have). I suppose I could go folder by folder to determine which specifically is causing the problem, but that sort of defeats the purpose of automating the process with metadata extraction tools, doesn't it? Oh well. Something is better than nothing, I suppose, though it will be necessary to indicate somehow that the metadata is incomplete (for now I am using the "comment" tag in the NZMH output XML). I haven't even tried the Duke Data Accessioner on a collection this big yet, so that's going to be my next step, though I also want to figure out if I can run it without having to create copies of the files-- one set of this stuff is enough! (Until I create a working copy set for processing, anyway).
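The "comment" tag trick above is easy to script if you'd rather not hand-edit thousands of XML files. A minimal sketch follows; note the element names here are invented for illustration and do NOT reflect the actual NZMH output schema.

```python
# Hypothetical sketch: append a <comment> element flagging incomplete
# metadata in a harvester's output XML. Element names are made up;
# the real NZMH schema will differ.
import xml.etree.ElementTree as ET

def flag_incomplete(xml_text, note):
    root = ET.fromstring(xml_text)
    comment = ET.SubElement(root, "comment")
    comment.text = note
    return ET.tostring(root, encoding="unicode")

record = "<record><file>budget.xls</file></record>"
print(flag_incomplete(record, "Metadata incomplete: extractor failed on this chunk"))
```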
I AM pleased to note that the NARA File Analyzer seems to work fine, if sluggishly, for generating a manifest of files with checksums, and it was able to get through the entire accession in one go. Score one for the Federal Government! As noted in my post about ingest, however, the output is not very pretty, though if we have to use it we have to use it. I think when my fieldworker puts together her procedures list for e-records processing, a separate section for large accessions is going to be necessary anyway, so we might as well think about which tools might work better and how using them will change the process.
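For anyone wondering what a bare-bones checksum manifest even is, here's a sketch of the general idea (NOT File Analyzer's actual code or output format): walk a directory, hash each file, and emit one "checksum  path" line per file. MD5 is assumed here for brevity; swap in SHA-256 if your policy calls for it.

```python
# Hypothetical sketch of a minimal checksum manifest: one
# "md5-digest  relative-path" line per file under a root folder.
# Algorithm and line format are assumptions, not File Analyzer's own.
import hashlib
from pathlib import Path

def manifest(root):
    lines = []
    for f in sorted(Path(root).rglob("*")):
        if f.is_file():
            digest = hashlib.md5(f.read_bytes()).hexdigest()
            lines.append(f"{digest}  {f.relative_to(root)}")
    return "\n".join(lines)
```

Fine for a sanity check on a small accession; for 13,000 files you would want to hash in streamed blocks rather than reading whole files into memory.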
Next up: heading backwards and talking about DROID, JHOVE, and NZMH in more detail as part two of metadata generation for ingest. Wheee!