Opening Data Is Not Like Opening a Door

In the fall of 2012 AVPreserve received a data dump of the 2.4 million records that had been generated as part of the American Archive Content Inventory Project (AACIP) managed by WGBH Media Library & Archives. There’s a reason it’s referred to as a dump: parsing, mapping, and making that data usable or accessible is complex and messy, no matter how clean or well packaged it is at the point of transfer. Now why were we so lucky to get this dump? AVPreserve had been contracted by the Corporation for Public Broadcasting (CPB) to manage the inventory metadata during the first digitization phase of the American Archive of Public Broadcasting (AAPB) project, a task that centered primarily on the development of the Archival Management System (AMS). In short, the AMS was to be a web-based tool accessible by all AAPB stakeholders (stations, CPB, digitization vendor) that would:

– Send alerts to stations regarding their timelines for packing and shipping materials for digitization;

– Identify and track materials that had been selected for digitization;

– Provide an access point for viewing or listening to digitized content for contributing stations;

– Provide an access point for searching inventory records, editing or adding metadata, and performing cleanup or normalization of records;

– Provide reporting features to allow AAPB staff to track the progress of digitization according to region, number of hours, number of assets, radio versus television, format types, and more.

You could say the purpose of the AMS was to take all those thousands of records created across the country and make the data do what the digitization project needed it to do, which was a lot.

Having all of those item-level station records was fantastic (and kudos to the WGBH team and the stations for getting that massive task done), but if that data is not searchable and usable, there is little point to having it. Before we get too deep into the AMS here, I’d like to step back and take a look at how we got to this point.

The AAPB has been years in the making (and the planning, and the contracting, and the planning). Our involvement in the project dates back to at least 2010, when Senior Consultant Kara Van Malssen (just prior to joining AVPreserve) worked on the research and writing of a comprehensive plan for how the AAPB could be built. That plan took into consideration what metadata to capture, how it could be captured, what the infrastructure could look like, and what specifications should be used for digitization.

Around this same time, AVPreserve was a contributing consultant on the development of PBCore 2.0. A metadata schema developed specifically for use by public broadcasting, PBCore is one of the few audiovisual-specific schemas around that incorporates both descriptive and technical metadata to a significant degree, but development and revisions on it had ended many years earlier at version 1.2. As there were plans to use PBCore as the schema of the AAPB, there was an immediate need to fix long-standing issues with the structure and update it to better fit the new realities of media production and distribution. A group was put together to make quick fixes that could be released as version 1.3 (which could then also be used for inventory gathering) and to work longer term on more substantial revisions to be released as version 2.0.

With a revamped PBCore in place, WGBH was able to build templates for the stations to use in the CIP. As participants may recall, stations had the option of conducting their own inventories in-house, or they could apply to CPB to have a third party come onsite and perform the inventory. AVPreserve was also active in this phase, generating the inventory for WXXI in Rochester, NY, and then, separately from the CIP, for the former NJN (New Jersey Network). NJN, the only public broadcasting station in New Jersey, was shut down by the State in 2011 (ostensibly for budgetary reasons) just before the inventories were going to kick off. AVPreserve was hired separately to perform an inventory of the 120,000-item collection left in the former studios in Trenton and, thanks to the AAPB staff, that inventory was then included in the AAPB database and digitization funding was made available.

Back to the AMS, however. From the get-go, we had proposed developing the AMS as an open source application using an agile development process. We used the Scrum form of the Agile project management methodology, which is applicable to many areas but most widely adopted in software development. It takes an iterative approach to projects, the goal being to produce working, usable software at regular intervals, as opposed to the “waterfall” approach of doing all of the design, then all of the back end, then all of the front end, then all of the quality control, and then the release. In Agile Scrum speak the intervals are called sprints. A sprint, typically 2 to 4 weeks long, essentially consists of: a planning meeting to create a list of tasks or features, prioritize them by how critical they are, and identify how many can reasonably be completed in the sprint; performing the agreed-upon work; and then demoing the completed features for review, followed by live implementation once they are tested and approved. Then the process starts all over again to produce the next set of features.

We were fortunate to have proposed an agile approach, because as it turned out the digitization would begin only a few months after our development began and then run in parallel with our work. There were also many other moving parts and lots of unknowns, and we needed working, reliable software almost immediately. Under the agile process we were able to be flexible, prioritizing and developing immediate needs (basic framework, station alerts, shipping and tracking functions, asset prioritization) and saving for a later date those functions that would not be critical until after digitization had begun in earnest (digitized asset playback, reporting, metadata cleanup).

We were also fortunate that AAPB Project Manager Stephanie Sapienza was a willing collaborator in the role of the Product Owner. Within the Agile Scrum methodology, the Product Owner makes the call on which functions are a priority and whether the functioning code at the end of the sprint meets their needs, and also decides when a previously defined function is no longer needed or an unanticipated function has to be added. One of the benefits of agile is this flexibility to rethink or reprioritize a project as it grows, without resources being wasted on unneeded functions or on heading in the wrong direction.

And that flexibility was very important during our 18 months building the AMS and managing the AAPB inventory data, because dealing with such a large data set and an unprecedented, ambitious project like the AAPB brought all kinds of unforeseen issues and prompted new ideas as the project grew. It can be hard enough to keep good internal controls on data entry, making sure the right fields are used for the right data and that terms are spelled or used consistently. WGBH had to plan and manage that type of thing for over 120 stations nationwide. They did an excellent job under the circumstances, but there was no true oversight during the inventory process. Data could only be reviewed after it was submitted, and then it was up to a Metadata Manager to enforce consistency. This type of normalization and cleanup (make sure all the dates are written the same way and didn’t get corrupted in Excel; make sure the value in the Format field is actually a format and is spelled in the approved way; make sure all the required fields are completed; etc.) generally takes several passes. Problems and solutions vary in complexity, there may not be resources to fix all of them, and certain functions in the database system may require the data to be presented in a way that was not anticipated beforehand. Both WGBH and AVPreserve spent a considerable amount of time performing cleanup and normalization in order to turn the data into a consolidated set of usable information.
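To make the kind of review pass described above concrete, here is a minimal Python sketch of a per-record check. It is not the AMS code: the field names, the approved format list, and the function name are all hypothetical, chosen only for illustration.

```python
import re

# Hypothetical field names and vocabulary, for illustration only.
REQUIRED_FIELDS = ("station", "title", "format")
APPROVED_FORMATS = {"Betacam SP", "U-matic", "VHS"}
# Checks the YYYY-MM-DD shape only; it does not verify the calendar date.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def find_problems(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field, "").strip():
            problems.append(f"missing required field: {field}")
    fmt = record.get("format", "").strip()
    if fmt and fmt not in APPROVED_FORMATS:
        problems.append(f"unapproved format value: {fmt}")
    date = record.get("date", "").strip()
    if date and not ISO_DATE.fullmatch(date):
        problems.append(f"date not in YYYY-MM-DD form: {date}")
    return problems
```

A pass like this only reports problems; deciding how to fix each one is a separate step, which is part of why cleanup tends to take several rounds.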

One example of a common problem area here is dates. In official PBCore the accepted format for dates is the ISO 8601 standard, which at its simplest is expressed as YYYY-MM-DD. If the field does not match that pattern, the record is invalid. However, as many people who work with audiovisual materials know, tapes frequently do not list a date, or the date may be incomplete (“Spring 1998”, “7/15”, “April 10th”, etc.). In these cases, if not directed otherwise, inventory takers transcribed the date exactly as written, or used notation like “7/15/????”, or wrote some version of “Unknown”. Now in the case of something like formats, it’s fairly easy to normalize data that may include something like BetacamSP, Betacam SP, BetaSP, BetaCamSP, etc., because patterns in the text are identifiable and limited in variation. But in the case of dates, where there is a mixture of letters, numbers, and other characters, along with variation in separating punctuation, the order of the date parts, and so on, it can be quite a mess.
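The contrast between formats and dates can be sketched in Python. This is an illustrative sketch under stated assumptions, not the algorithms our developers wrote: the mapping table, the list of accepted date patterns, and both function names are hypothetical. Month names are parsed with the default English locale.

```python
import re
from datetime import datetime

# Hypothetical lookup: comparison key (lowercased, separators removed)
# mapped to the approved spelling.
FORMAT_KEYS = {"betacamsp": "Betacam SP", "betasp": "Betacam SP"}

# Hypothetical list of complete-date notations worth converting automatically.
FULL_DATE_PATTERNS = ("%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y")


def normalize_format(value):
    """Map spelling variants to an approved term, or None if unrecognized."""
    key = re.sub(r"[\s\-_]+", "", value).lower()
    return FORMAT_KEYS.get(key)


def normalize_date(value):
    """Return an ISO 8601 date for complete dates, else None for review."""
    text = value.strip()
    for pattern in FULL_DATE_PATTERNS:
        try:
            return datetime.strptime(text, pattern).date().isoformat()
        except ValueError:
            continue  # try the next notation
    return None
```

With these assumptions, all four Betacam SP spellings above collapse to one term with a single rule, while values such as “Spring 1998” or “7/15/????” come back as None: they carry real but partial information, so a person (or a further, more specific rule) has to decide what to do with them.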

On top of this, prior to upload the dates could have been corrupted in Excel if the cells were not formatted as Text. When a cell in Excel is formatted as Date, Excel stores a date (or anything that looks like a date) as a serial number counting the days from a fixed starting date. The date is visually presented as 7/15/1998, but underneath Excel actually stores it as 34529, which you can see if you change the formatting of the cell from Date to Text. (That value reflects the 1904 date system used by older Mac versions of Excel; under the 1900 date system the same date is stored as 35991.) When these fields get moved between systems the data can sometimes flip to text and end up encoded as that number string in the new database. In these instances the data itself becomes unreliable for analysis because it is inconsistent. Our developers had to write several algorithms to run through the date fields and normalize things piece by piece.
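As a rough illustration of undoing this kind of corruption, the Python sketch below converts an Excel serial number back to an ISO 8601 date. The function names and the five-digit heuristic are assumptions made for this example, not the AMS implementation. Note that the conversion depends on which Excel date system produced the number: 34529 is 15 July 1998 only under the 1904 system.

```python
from datetime import date, timedelta

# Day zero for Excel's two date systems. The 1900 value is set to
# 1899-12-30 to absorb Excel's nonexistent 29 February 1900, so it is
# only reliable for serials of 61 (1 March 1900) and later.
EPOCHS = {"1900": date(1899, 12, 30), "1904": date(1904, 1, 1)}


def excel_serial_to_iso(serial, date_system="1900"):
    """Convert an Excel day serial to a YYYY-MM-DD string."""
    return (EPOCHS[date_system] + timedelta(days=int(serial))).isoformat()


def looks_like_serial(value):
    """Heuristic (an assumption): a bare 5-digit number in a date field."""
    text = value.strip()
    return text.isdigit() and len(text) == 5


print(excel_serial_to_iso(34529, "1904"))  # 1998-07-15
print(excel_serial_to_iso(35991, "1900"))  # 1998-07-15
print(excel_serial_to_iso(34529, "1900"))  # 1994-07-14
```

The last line shows the risk: read with the wrong date system, the same number lands about four years off, so a cleanup pass needs to know (or infer) where each spreadsheet came from before converting.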

It should be underscored that the AMS was not developed as a long-term records database or access portal. It was built for the very specific purpose of aggregating records from multiple organizations, cleaning up and normalizing the data to a central standard, and managing the digitization of materials from multiple, geographically distant areas. Nothing like this had really been built before, and though the AAPB is a unique project, a tool like the AMS is becoming a real need. In the past year AVPreserve developed a version of the AMS for VIAA (Flemish Institute for Archiving – http://viaa.be/), which is the central manager for the large-scale digitization of Flemish audiovisual materials held by universities, broadcasters, museums, and libraries in Belgium. Their version of the AMS is now being expanded to manage the digitization workflow for newspaper collections across the country.

We’re seeing an increasing number of efforts like this, especially in Europe, where archives have stronger ties to government, and at US universities beginning to follow the Indiana University Media Preservation Initiative model, in which a central manager oversees digitization efforts for audiovisual materials held across all departments. The efforts of the AAPB and their agreement to let us develop the AMS as open source have now contributed a valuable preservation management tool to archives and collections across the globe. The source code for the AMS is now available on GitHub at https://github.com/avpreserve/AMS. We’re looking forward to seeing the AAPB continue to grow and have been proud contributors to this important project.

This blog post was contributed by Josh Ranger, Senior Consultant at AVPreserve.
