Beyond the Screen: Machine Learning and Metadata Creation at GBH Archives


The following was submitted by Fall 2024 Metadata and Training Data Intern, Madison Courtney.

My time interning at GBH challenged my preconceived notions surrounding machine learning, cataloguing and the way we think about humans and machines working together in the future.

During the first week of my internship, a game plan emerged for the next 12 weeks. As the first Training Data and Metadata intern, I felt slightly intimidated before the plan was laid out (I had every potential to make myself both the first and the last intern of this kind); once it was, however, my objective was clear:

  1. Familiarize myself with both the end user and the American Archive of Public Broadcasting. Observe and learn, then comment and test.
  2. Evaluate and experience the efficacy of different uses of machine learning in cataloguing and metadata creation.
  3. Create a training data set to contribute to CLAMS (Computational Linguistics Applications for Multimedia Services).

There was much I had to be introduced to: industry-specific terms that don’t come up when creating metadata for physical or A/V materials produced outside of public broadcasting. I familiarized myself with the AAPB by understanding it as a collaborative preservation effort, as well as the many hours of A/V materials that effort stewards. I began by reading the history of the American Archive and exploring it, coming up with various search queries to find a diverse selection of items. Reading the history of the AAPB as a preservation project helped me understand the inception of the CLAMS project in ways I didn’t expect.

During the application and interview process I, of course, spent time on the website, but reading about the inception of the digital repository while closely examining specific items’ metadata and PBCore instantiations brought me closer to the scale of the work being done. The mass digitization of public broadcasts from across the United States sounded hefty before I got a closer look, but truly understanding the undertaking made the feat seem even more herculean. This moved me beyond an academic understanding of how media has become increasingly accessible, cresting toward today’s mass media production, and of how that unprecedented amount of content requires creative solutions like machine learning.

This understanding positions programs like CLAMS and Whisper, an automatic speech recognition system created by OpenAI, as non-negotiable tools in the cataloguer’s toolbox.

For those unfamiliar, the CLAMS project develops machine-learning tools to generate metadata from archived footage. The output can be represented as a collection of screengrabs presented in a visual interface called a “visaid,” or visual index. The application searches for “scenes with text” (SWT) and pulls them out, classifying them based on datasets created from AAPB holdings. Possible labels include bars, slates, audience warnings, main title cards, chyrons that identify a person, subtitles, and miscellaneous text, among others. Any of these SWTs could contain critical metadata for cataloguing.
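To make this concrete, here is a small, hypothetical sketch of how detections like these might be grouped into a visaid-style index. The labels, timestamps, and data shapes are illustrative only; they are not the actual CLAMS output format.

```python
# Hypothetical sketch: grouping scenes-with-text (SWT) detections by label
# so a cataloguer can skim one section per label instead of watching the
# whole item. Labels and timestamps are invented for illustration.
from collections import defaultdict

# Example detections as (timestamp in seconds, predicted label) pairs
detections = [
    (2.0, "bars"),
    (14.5, "slate"),
    (31.0, "main title"),
    (95.2, "chyron"),
    (1780.4, "credits"),
]

def build_visaid(detections):
    """Group detected scenes-with-text by label, producing a simple index."""
    index = defaultdict(list)
    for timestamp, label in detections:
        index[label].append(timestamp)
    return dict(index)

for label, timestamps in build_visaid(detections).items():
    print(f"{label}: {timestamps}")
```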

The combining of AI with the work of cataloguing was demystified over the course of my internship as I tested the efficacy of these systems. When I moved on from research and discovery in the AAPB to light cataloguing, I compared cataloguing with information gathered from the CLAMS visaid to cataloguing without it, filling out the names of on-screen contributors, off-screen collaborators, and temporal information. The application allows that information to be gathered by quickly viewing the item’s visaid instead of watching or listening to the entirety of the item. Similarly, Whisper predicts text from patterns in the audio waveform, producing a transcript of the program.
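For reference, producing a first-pass transcript with the open-source whisper package looks roughly like the sketch below; the model size and file name are placeholders rather than the exact setup used at GBH.

```python
# Minimal sketch using the open-source openai-whisper package
# (pip install openai-whisper). Model size and file path are illustrative.
import whisper

model = whisper.load_model("base")              # small model for a quick pass
result = model.transcribe("program_audio.wav")  # path to the item's audio
print(result["text"])                           # raw transcript, still needs human review
```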

I quickly realized both applications had their pros and cons. Whisper often transcribed speech incorrectly when speakers mimicked digital-born forms of communication out loud. For example, if an on-screen contributor said “re:” aloud, in the same sense it carries in an email subject line, Whisper had a hard time representing this in the transcript. Different punctuation creates different inflection in screen readers, and end users of the archive might represent this particular affectation of spoken English in different ways, which could in turn affect search results. This raises the question of when punctuation that usually delineates written language should be employed to represent words spoken aloud.

Cataloguing from the visaid had its complications as well. It was hard to ascertain from the visual index alone what type of contributor someone was, at least beyond a certain level of specificity. Whether someone was the host or a guest was clear from the chyron, or from how often I saw the person within my specific set of footage to catalog. What type of guest they were was harder to determine, because whether or not they are being interviewed is usually conveyed by the tenor of the conversation rather than by anything on screen.

I lead with these two complications only because the net positive is clear. Without either application, I would have spent three times as long watching each program, producing my own faulty metadata or a typo-filled transcript that would still have to be edited. The question becomes: what does an ideal audio transcript or catalog contain, and does the application speed the process of crafting that ideal?

In the final stretch of my internship, I spent time meticulously creating a dataset that could support a future extension of CLAMS’s capabilities. The current application recognizes patterns in on-screen text and provides the cataloguer with what it believes are the most useful SWTs. The training data I created was for a version of the application that would push beyond recognizing the type of textual information visually to taking the information from the SWT and processing it into structured data. For this, we used SWTs identified as slates, transcribed the text directly, and then contextualized the information by recording the metadata the transcribed text provides.
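To illustrate the kind of mapping this training data is meant to teach, here is a hypothetical sketch of the step from transcribed slate text to structured fields. The example slate, field names, and hand-written rules are invented for illustration; they are not the project’s actual annotation schema, and a trained model would learn this mapping from paired examples rather than from rules like these.

```python
# Hypothetical sketch: turning text transcribed from a slate SWT into
# labeled metadata fields. Slate text, field names, and rules are invented.

slate_text = """EXAMPLE SERIES
Episode 101
"Example Episode Title"
Rec: 01/01/1990
Prod: A. Producer"""

def slate_to_metadata(text):
    """Map each transcribed slate line to a metadata field with simple rules."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    record = {"series": lines[0], "episode": None, "title": None,
              "date": None, "producer": None}
    for line in lines[1:]:
        if line.lower().startswith("episode"):
            record["episode"] = line.split()[-1]
        elif line.startswith('"'):
            record["title"] = line.strip('"')
        elif line.lower().startswith("rec:"):
            record["date"] = line.split(":", 1)[1].strip()
        elif line.lower().startswith("prod:"):
            record["producer"] = line.split(":", 1)[1].strip()
    return record

print(slate_to_metadata(slate_text))
```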

Machine learning-derived applications such as this one operate much like an extra member of the team: a team member you have to take care to train, but who, once you do, provides a pair of unsalaried hands whose work still needs quality control. Its presence allows for a level of detail that would be impossible to achieve without the grunt work done on the front end. Throughout this virtual internship, I came to think of the CLAMS application as my fellow intern, slowly learning each part of the dataset the same way I progressed through each level of the cataloguing process at GBH’s Media Library and Archives department. So when I was finally creating the training data myself, I could appreciate the gaps in knowledge I was filling and keenly feel both the distance and the artifice of its intelligence. Moving forward, I hope to continue working on projects like this, taking my entry-level experience in training data creation to other libraries where similar item-level recognition could be utilized.
