Monday, September 24, 2012

A (not-so-)Brief Aside on Metadata and Its Uses


So, you may have noticed the gap in updates for this so-called “series”. This is largely because I have been actually doing the work instead of blogging about it (oh sure, “Productivity”, a likely story. But in this case it’s true!). Besides meaning that I will have a better grasp of what these tools can do when I DO get around to writing about them, I am also benefitting from having a fieldworker from SOIS working on this project with me. Her perspective is valuable precisely because she hasn’t been immersing herself in documentation—I am trying to see a) how well I can teach other users to use these tools for our processes; and b) how easy these tools are to use, full stop. For the most part, the answers have been “fair to middling” on both counts.

Part of the problem I’m running into is that separating out the steps like I am in this series is, to a certain extent, artificial. With e-records, more so than with paper records, ingest is accessioning is appraisal is processing is access. By which I mean that the state of the system or tool in the next step down the road determines what can be done with the tool or tools you’re using in this step. This is most obvious in metadata harvesting, hence the title of this post.

These tools that I’m using? Collect a LOT of metadata. Which on the one hand is good, because it’s usually better to have more information about a file than less. On the other hand, the surfeit of metadata led my fieldworker (and, by extension, me) to ask the obvious question, i.e. “what are we going to do with all of this?” I doubt very much that we want to devote a huge amount of time and effort entering full descriptions for every file into Archivists’ Toolkit, especially for those collections (such as SAA) where I know we won’t be providing online access to all or even most of the files (we don’t need to create DAOs for webpage stylesheets, for example).

I’ve been giving this a lot of thought and comparing the various outputs of the tools that I’ve been testing, and I have reached a preliminary conclusion: I need different metadata outputs to do different things for me. (This is where my digital archivist readers say “Duh!” Leave me alone, I am learning this stuff on the fly and never took a metadata course in library school.) I’m just not going to use every single bit of metadata that is being extracted by the various tools (especially since a lot of it is provided redundantly by a number of them), but I am going to be using a lot of it in various capacities. To wit, I’ve identified four basic categories:


  1. Collection Summary metadata. How much data in how many files I have, what types of files are represented, the overall date range of the digital collection, etc. This is the stuff that goes in the general section of the finding aid for extent statement, technical requirements, general scope and content, etc. 
  2. Collection Preservation/Manifest metadata. Something human-browsable that is going to indicate paths to files, dates last modified of individual files, any subject metadata, file versioning etc. This is the stuff that will be provided as a spreadsheet or database for patrons if we provide near-line access in that manner. It will also be “packaged” with the data to serve as the preservation description information (including format requirements, fixity measures, etc.)
  3. Digital Object Manifest metadata. A specialized manifest for those files we’re going to provide access to through the finding aid (actually a subset of Collection Manifest metadata). These are the file names, paths, document titles (as applicable), relevant date (created or last modified), and basic technical requirements (probably file type only). This stuff will be imported as DAOs into Archivists’ Toolkit or a modified version of our EAD tagging spreadsheet and exported into our finding aids’ contents lists.
  4. Individual Object Preservation and Access metadata. This one I figured out from all the programs that spit out XML for individual files rather than one big XML file for the whole collection. A lot of the metadata in this category may actually be covered in the other three, but by creating individual metadata files for each file in the collection, it creates a more complete preservation package and allows for expedited ingest of items into a repository by associating the items with relevant metadata.
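None of the tools I’m testing require any scripting, but to make category 1 concrete: collection-summary metadata is basically an aggregate over the file system. Here’s a rough Python sketch of my own (an illustration of the idea, not the output of any of these tools) that walks a collection directory and pulls out the extent, file-type, and date-range figures that would feed the general section of a finding aid:

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

def collection_summary(root):
    """Gather category-1 summary metadata for a collection directory:
    total size, file count, file-type tally, and overall date range."""
    total_bytes = 0
    file_count = 0
    types = Counter()
    earliest = latest = None
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        total_bytes += stat.st_size
        file_count += 1
        # Tally by extension; files with no extension get their own bucket.
        types[path.suffix.lower() or "(no extension)"] += 1
        mtime = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        earliest = mtime if earliest is None or mtime < earliest else earliest
        latest = mtime if latest is None or mtime > latest else latest
    return {
        "file_count": file_count,
        "total_bytes": total_bytes,
        "file_types": dict(types),
        "date_range": (earliest, latest),
    }
```

Caveat: last-modified dates from the file system are only as trustworthy as the chain of custody before ingest, which is one reason the real tools capture this at accessioning time.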


It’s worth noting that we DID talk about these distinctions in the SAA Arrangement and Description of Electronic Records workshop that I attended in May. But we definitely didn’t talk about them in these terms, which is why it has taken me the better part of four months to figure out what to do with this stuff—we were basically given the tools, told about what was required for a SIP/AIP, and left to make our own connections. I may just be slow on the uptake, but the relationships between the tools being used and the appropriate use of the metadata being collected in arrangement and description are not necessarily obvious, in my opinion. I’ve made this post (in hopefully-clear English) to clarify to myself which kind of metadata is which and to help my equally-or-less-tech-savvy readers (all 3 of you) make the distinction as well.

For the rest of this series (and there WILL be a rest of it!) I am going to try to identify which kind of metadata is being created by which tool (and be ye not fooled, most of the tools I’m looking at ARE for metadata extraction, so there’ll be a lot more discussion of this). Going back to the ones I’ve already covered, the tools at point of ingest are largely in category 2: they create a manifest, with or without checksums, which I can import into a spreadsheet or database to produce a browsable list of the files in the collection, but without distinction between documents and metafiles at this point. We’ll get to categories 1 and 3 during accessioning, using DROID and JHOVE to get that other stuff (and for right now, I AM using JHOVE 1.7 instead of JHOVE2. Why I am doing that will be revealed in the next post).
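For a sense of what that category-2 manifest boils down to, here’s a minimal Python sketch (again, my own illustration—each ingest tool has its own output format): one row per file, with relative path, size, last-modified date, and a checksum for fixity. I’ve used MD5 here purely as an example of a fixity measure; the column names are my own invention.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(root, out_csv):
    """Write a category-2 style manifest: one CSV row per file with its
    relative path, size in bytes, last-modified date, and MD5 checksum."""
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "bytes", "last_modified", "md5"])
        for path in sorted(Path(root).rglob("*")):
            if not path.is_file():
                continue
            stat = path.stat()
            md5 = hashlib.md5(path.read_bytes()).hexdigest()
            writer.writerow([
                str(path.relative_to(root)),
                stat.st_size,
                datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
                md5,
            ])
```

The resulting CSV opens directly in a spreadsheet, which is exactly the human-browsable near-line access copy category 2 is meant to cover.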

Oh, and lest I forget—because of redundant information and XML structure, almost none of the metadata I am harvesting with these tools is usable out of the box. If I am lucky, I am able to pick and choose fields from the various tools and paste them into a master spreadsheet that collects all of the information I need. If I am slightly less lucky, the import from XML to spreadsheet is screwy because of the nested structure and I need to figure out a way to flatten it to the point where I can import it into the master (doing folder-by-folder analysis may be the way to go on this). If I am UNLUCKY, I will need to convert the metadata from its existing form into an interoperable standard, such as MODS, METS, or PREMIS (this is an especially fun game because I am not really familiar with any of the three—see the bit about not taking a metadata course in library school). I may be saved on the latter by our lack of a TDR for access to born-digital objects—but then, that poses an entirely new problem for our potential digital collection. Stay tuned.
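To show what that “flattening” might look like in practice, here’s a toy Python helper (hypothetical—no tool I’m testing works this way out of the box) that walks a nested XML element and turns it into one flat row of dotted-path keys, the kind of thing a spreadsheet import can actually digest:

```python
import xml.etree.ElementTree as ET

def flatten(elem, prefix=""):
    """Flatten a nested XML element into a {dotted.path: text} dict,
    suitable for one spreadsheet row per file.
    NOTE: repeated sibling tags overwrite one another in this toy
    version; real tool output may need indexed keys (e.g. md5.1, md5.2)."""
    row = {}
    tag = f"{prefix}.{elem.tag}" if prefix else elem.tag
    text = (elem.text or "").strip()
    if text:
        row[tag] = text
    for child in elem:
        row.update(flatten(child, tag))
    return row
```

Run against a per-file XML record, this yields keys like `file.name` and `file.fits.size` (element names are invented here), which can then be matched up column-by-column across tools in the master spreadsheet.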