I’ve previously written about developing and automating management of our workflows for the NewsHour project (click for link), and WGBH’s processes for ingesting and preserving the NewsHour digitizations (click for link). Now that the project is moving along, and over one thousand episodes of the NewsHour are already on the AAPB (with recently added transcript search functionality!!), I thought I would share more information about our access workflows and how we make NewsHour recordings available.
In this post I will describe our “Asset Review” and “Online Workflow” phases. The “Asset Review” phase is where we determine what work we will need to do to a recording to make it available online, and the “Online Workflow” phase is where we extract metadata from a transcript, add the metadata to our repository, and make the recording available online.
The goals and realities of the NewsHour project necessitate an item level content review of each recording. The reasons for this are distinct and compounding. The scale of the collection (nearly 10,000 assets) meant that the inventories from which we derived our metadata were generated only from legacy databases and tape labels, which are sometimes wrong. At no point were we able to confirm that the content on any tape was complete and correct prior to digitization. In fact, some of the tapes are unplayable until they have been prepared for digitization. Additionally, there is third-party content that needs to be redacted from some episodes of the NewsHour before they can be made available. A major complication is that the transcripts only match 7pm Eastern broadcasts, and sometimes 9pm or 11pm updates would be recorded and broadcast if breaking news occurred. The tapes are not always marked with broadcast times, and sometimes do not contain the expected content, or even an episode of the NewsHour!
These complications would be fine if we were only preserving the collection, but our project goal is to make each recording and corresponding transcript or closed caption file broadly accessible. To accomplish that goal each record must have good metadata, and to have that we must review and describe each record! Luckily, some of the description, redaction, and our workflow tracking is automatable.
Access and Description Workflow Overview
As I’ve mentioned before, we coordinate and document all our NewsHour work in a large Google Sheet we call the “NewsHour Workflow workbook” (click here for link). The chart below explains how a GUID moves through sheets of the NewsHour workbook throughout our access and description work.
After a digitized recording has been delivered to WGBH and preserved, it is automatically placed in the queue on the “Asset Review” sheet of our workbook. During the Asset Review, the reviewer answers thirteen different questions about the GUID. Using these responses, the Google Sheet automatically places the assets into the appropriate workflow trackers in our workbook. For instance, if a recording doesn’t have a transcript, it is placed in the “No Transcript tracker”, which has extra workflow steps for generating a description and subject metadata. A GUID can have multiple issues that place it into multiple trackers simultaneously. For instance, a tape that is not an episode will also not have a transcript, and will be placed on both the “Not an Episode tracker” and the “No Transcript tracker”. The Asset Review is critical because the answers determine the work we must perform and ensure that each record will be correctly presented to the public when work on it is completed.
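To make the routing idea concrete, here is a minimal Python sketch of how review answers could place a GUID on more than one tracker at once. The real workbook does this with Google Sheets formulas; the answer keys (`is_episode`, `has_transcript`) and the rule table are hypothetical stand-ins, and only the two tracker names come from the workbook.

```python
# Hypothetical answer keys and rules; the actual workbook uses spreadsheet
# formulas and its own column layout.
TRACKER_RULES = {
    "Not an Episode tracker": lambda answers: answers["is_episode"] == "No",
    "No Transcript tracker": lambda answers: answers["has_transcript"] == "No",
}


def trackers_for(answers):
    """Return every tracker whose rule matches the review answers."""
    return [name for name, rule in TRACKER_RULES.items() if rule(answers)]


# A tape that is not an episode also has no transcript, so it lands on both.
stray_tape = {"is_episode": "No", "has_transcript": "No"}
# A clean episode with a transcript lands on no problem tracker at all.
clean_episode = {"is_episode": "Yes", "has_transcript": "Yes"}

print(trackers_for(stray_tape))
print(trackers_for(clean_episode))
```

Because each rule is evaluated independently, a single review can open several trackers, which mirrors the way one GUID can carry multiple issues.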
A GUID’s status in the various trackers is reflected in the “Master GUID Status sheet”, and is automatically updated when different criteria in the trackers are met and documented. When a GUID’s workflow tasks have been completely resolved in all the trackers, it appears as “Ready to go online” on the “Master GUID Status sheet.” The GUID is then automatically placed into the “AAPB Online Status tracker”, which presents the metadata necessary to put the GUID online and indicates if tasks have been completed in the “Online Workflow tracker”. When all tasks are completed, the GUID will be online and our work on the GUID is finished.
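The status roll-up can be pictured with a short Python sketch. It assumes the Asset Review is already complete and that each tracker reports a simple done/not-done flag. The “Ready to go online” label comes from the workbook, while the function name and the “In progress” label are invented for illustration.

```python
def master_status(tracker_states):
    """Summarize a GUID's status from its per-tracker completion flags.

    tracker_states maps a tracker name to True once that tracker's tasks
    are resolved. An empty mapping means the review raised no issues.
    """
    if all(tracker_states.values()):
        return "Ready to go online"
    # "In progress" is a hypothetical label, not one used in the workbook.
    open_trackers = sorted(
        name for name, done in tracker_states.items() if not done
    )
    return "In progress: " + ", ".join(open_trackers)


print(master_status({}))
print(master_status({"No Transcript tracker": False,
                     "Not an Episode tracker": True}))
```

A GUID that was never placed on a problem tracker has nothing left to resolve, so it is ready as soon as its review is done.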
In this post I am focusing on a workflow that follows digitizations which don’t have problems. This means the GUIDs are episodes, contain no technical errors, and have transcripts that match (green arrows in the chart). In future blog posts I’ll elaborate on our workflows for recordings that go into the other trackers (red arrows).
Each row of the “Asset Review sheet” represents one asset, or GUID. Columns A-G (green cell color) on the sheet are filled with descriptive and administrative metadata describing each item. This metadata is auto-populated from other sheets in the workbook. Columns H-W (yellow cell color) are the reviewer’s working area, with questions to answer about each item reviewed. As mentioned earlier, the answers to the questions determine the actions that need to be taken before the recording is ready to go online, and place the GUID into the appropriate workflow trackers.
The answers to some questions on the sheet impact the need to answer others, and cells auto-populate with “N/A” when one answer precludes another. Almost all the answers require controlled values, and the cells will not accept input besides those values. If any of the cells are left blank (besides questions #14 and #15) the review will not register as completed on the “Master GUID Status Sheet”. I have automated and applied value control to as much of the data entry in the workbook as possible, because doing so helps mitigate human error. The controlled values also facilitate workbook automation, because we’ve programmed different actions to trigger when specific expected text strings appear in cells. For instance, the answer to “Is there a transcript for this video?” must be “Yes” or “No”, and those are the only inputs the cell will accept. A “No” answer places the GUID on the “No Transcript tracker”, and a “Yes” does not.
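As a rough illustration of these three behaviors (controlled values, “N/A” auto-population, and blank cells blocking completion), here is a small Python sketch. The workbook itself relies on Google Sheets data validation and formulas; every field name and rule table below is a hypothetical stand-in.

```python
# Hypothetical field names; the workbook enforces these rules with
# spreadsheet data validation rather than code.
CONTROLLED = {"has_transcript": {"Yes", "No"}}
# When a question receives the given answer, these follow-ups become "N/A".
PRECLUDES = {
    ("has_transcript", "No"): ["transcript_filename", "transcript_matches"],
}
# Stand-ins for the two questions that may be left blank.
OPTIONAL = {"promotional_exhibit", "notes_to_pm"}


def set_answer(row, question, value):
    """Record an answer, rejecting values outside the controlled list."""
    allowed = CONTROLLED.get(question)
    if allowed is not None and value not in allowed:
        raise ValueError(f"{question!r} accepts only {sorted(allowed)}")
    row[question] = value
    for dependent in PRECLUDES.get((question, value), []):
        row[dependent] = "N/A"
    return row


def review_registers(row):
    """A review counts as complete only when no required cell is blank."""
    return all(v != "" for q, v in row.items() if q not in OPTIONAL)


row = {"has_transcript": "", "transcript_filename": "",
       "transcript_matches": "", "notes_to_pm": ""}
set_answer(row, "has_transcript", "No")
print(row["transcript_filename"], review_registers(row))
```

Rejecting free text at entry time is what makes the downstream automation dependable, since every trigger can test for an exact string.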
To review an item, staff open the GUID on an access hard drive. We have multiple access drives, which contain copies of all the proxy files for the delivered NewsHour digitizations. Reviewers are expected to watch between one and a half and three minutes of the beginning, middle, and end of a recording, and to check for errors while fast-forwarding through everything not watched. The questions reviewers answer are:
- Is this video a nightly broadcast episode?
- If an episode, is the recording complete?
- If incomplete, describe the incompleteness.
- Is the date we have recorded in the metadata correct?
- If not, what is the corrected date?
- Has the date been updated in our metadata repository, the Archival Management System?
- Are the audio and video as expected, based on the digitization vendor’s transfer notes?
- If not, what is wrong with the audio or video?
- Is there a transcript for this video?
- If yes, what is the transcript’s filename?
- Does the video content completely match the transcript?
- If no, in what ways and where doesn’t the transcript match?
- Does the closed caption file match completely (if one exists)?
- Should this video be part of a promotional exhibit?
- Any notes to project manager?
- Date the review is completed.
- Initials of the reviewer.
Our internal documentation has specific guidelines on how to answer each of these questions, but I will spare you those details! If you’re conducting quality control and description of media at your institution, these questions are probably familiar to you. After a bit of practice reviewers become adept at locating transcripts, reviewing content, and answering the questions. Each asset takes about ten minutes to review if the transcript matches, the content is the expected recording, and the digitization is error free. If any of those criteria are not true, the review will take longer. The review is laborious, but an essential step to make the records available.
A large majority of recordings are immediately ready to go online following the asset review. These ready GUIDs are automatically placed into the “AAPB Online Status tracker,” where we track the workflow to generate metadata from the transcript and upload that and the recording to the AAPB.
About once a month I use the “AAPB Online Status tracker” to generate a list of GUIDs and corresponding transcripts and closed caption files that are ready to go online. To do this, all I have to do is filter for GUIDs in the “AAPB Online Status tracker” that have the workflow status “Incomplete” and copy the relevant data for those GUIDs out of the tracker and into a text file. I import this list into a FileMaker tool we call “NH-DAVE” that our Systems Analyst constructed for the project.
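I do this step by hand with a spreadsheet filter and copy and paste, but the logic amounts to the following Python sketch. The column names, the tab-separated layout, and the example filenames are assumptions made for illustration; only the “Incomplete” status value and the sample GUID come from the project.

```python
import csv
import io


def export_ready(rows, out):
    """Write GUID, transcript, and caption file for each Incomplete row.

    Returns the number of rows written. Column names are hypothetical.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for row in rows:
        if row["workflow_status"] == "Incomplete":
            writer.writerow([row["guid"], row["transcript"], row["caption"]])
            count += 1
    return count


# Illustrative tracker rows; the filenames and second GUID are made up.
tracker_rows = [
    {"guid": "cpb-aacip-507-0r9m32nr3f", "workflow_status": "Incomplete",
     "transcript": "example_transcript.txt", "caption": "example_captions.srt"},
    {"guid": "cpb-aacip-507-example0000", "workflow_status": "Complete",
     "transcript": "other_transcript.txt", "caption": "other_captions.srt"},
]

buffer = io.StringIO()  # stands in for the text file handed to the import
written = export_ready(tracker_rows, buffer)
print(written)
print(buffer.getvalue(), end="")
```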
“NH-DAVE” is a relational database containing all of the metadata that was originally encoded within the NewsHour transcripts. The episode transcripts provided by NewsHour contained the names of individuals appearing and subject terms for that episode in marked up values. Their subject terms were much more specific than ours, so we mapped them to the broader AAPB controlled vocabulary we use to facilitate search and discovery on our website. When I ingest a list of GUIDs and transcripts to “NH-DAVE” and click a few buttons, it uses an AppleScript to match metadata from the transcript to the corresponding NewsHour metadata records in our Archival Management System and generate SQL statements. We use the statements to insert the contributor and subject metadata from the transcripts into the GUIDs’ AAPB metadata records in the Archival Management System.
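The mapping and statement generation happen inside FileMaker and AppleScript, so the following Python sketch is only an analogy for the idea. The term mapping, the vocabulary terms themselves, and the `subjects` table are all invented for illustration, and an in-memory SQLite database stands in for the Archival Management System.

```python
import sqlite3

# Hypothetical mapping from specific NewsHour terms to broader terms.
TERM_MAP = {
    "Senate hearings": "Politics and Government",
    "Hurricanes": "Weather",
}


def subject_statements(guid, newshour_terms):
    """Yield one parameterized INSERT per distinct mapped subject term."""
    seen = set()
    for term in newshour_terms:
        broad = TERM_MAP.get(term)
        if broad is None or broad in seen:
            continue  # skip unmapped terms and repeats
        seen.add(broad)
        yield ("INSERT INTO subjects (guid, term) VALUES (?, ?)",
               (guid, broad))


conn = sqlite3.connect(":memory:")  # stand-in for the real repository
conn.execute("CREATE TABLE subjects (guid TEXT, term TEXT)")

terms = ["Hurricanes", "Senate hearings", "Hurricanes", "Unmapped term"]
for sql, params in subject_statements("cpb-aacip-507-0r9m32nr3f", terms):
    conn.execute(sql, params)

stored = conn.execute("SELECT term FROM subjects ORDER BY term").fetchall()
print(stored)
```

Keeping the values as parameters instead of pasting them into the SQL text avoids quoting problems with names that contain apostrophes, which are common in contributor metadata.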
Once the transcript metadata has been ingested we use both a Bash and a Ruby script to upload the proxy recordings to our streaming service, Sony Ci, and the transcripts and closed caption SRT files to our web platform, Amazon. We run a Bash script to generate another set of SQL statements to add the Sony Ci URLs and some preservation metadata (generated during the digital preservation phase) to our Archival Management System. We then export the GUIDs’ Archival Management System records into PBCore XML and ingest the XML into the AAPB’s website. As each step of this process is completed, we document it in the “Online Workflow tracker,” which will eventually register that work on the GUID is completed. When the PBCore ingest is completed and documented on the “Online Workflow tracker,” the recording and transcript are immediately accessible online and the record displays as complete on the “Master GUID Status sheet”!
We consider a record that has an accurate full text transcript, contributor names, and subject terms to be sufficiently described for discovery functions on the AAPB. The transcript and terms will be fully indexed to facilitate searching and browsing. When a transcript matches, our descriptive process for NewsHour is fully automated. This is because we’re able to utilize the NewsHour’s legacy data. Without that data, the descriptive work required for this collection would be tremendous.
A large majority of NewsHour records follow the workflow I’ve described in this post in their journey to the AAPB. If, unlike those covered here, a record is not an episode, does not have a matching transcript, needs to be redacted, or has technical errors, then it requires more work than I have outlined. Look forward to blog posts about those records in the future! Click here to see a NewsHour record that went through this workflow. If you’re interested in our workflow, I encourage you to open the workbook and use “Find” to follow this GUID (“cpb-aacip-507-0r9m32nr3f”) through the various trackers. Click here to see all NewsHour records that have been put online!