Outside the Black Box: AI Shedding Light for AV Cataloging

By Owen King, Caroline Mango, Raananah Sarid-Segal, and Miranda Villesvik for the Description Section of the Society of American Archivists’ blog Descriptive Notes.

Facing the challenge of item-level records

A shelf of analog video tapes may contain several seasons of a historic television program. The program title and date on each label might be enough to justify adding an item to a collection, but those data do not constitute a rich catalog record, let alone a record that would maximize discoverability.

At the American Archive of Public Broadcasting (AAPB)—a collaboration between the Library of Congress and GBH—we face this challenge at scale. With the support of a Mellon Foundation grant to digitize 150,000 tapes over four years, we have had a huge influx of valuable content. As of early 2026, we steward more than 271,000 digitized items. With only 2.0 FTE devoted to cataloging, however, there is no way a traditional descriptive workflow can keep up with providing meaningful information about this material.

In response, we have developed a workflow that utilizes AI—not to replace the archivist, but to empower them. We think that responsible use of AI tools requires avoiding the cognitive black box of purely algorithmic metadata generation, and instead using AI to surface data to which professional human archivists can apply their judgment.

AI tools for analyzing audio items

It is not uncommon for the AAPB to receive an inventory of new items with little to no metadata, as contributing organizations often lack the time and personnel to fully catalog their collections. Although we are limited in the time we can devote to cataloging, we still want to make sure materials are discoverable. We have put a lot of effort into developing tools for assisting video cataloging (to be discussed below), but what about cataloging a collection consisting only of audio?

The AAPB’s WIPR collection consists of radio broadcasts from Corporación de Puerto Rico para la Difusión Pública (Puerto Rico Public Broadcasting Corporation), including programs from the 1950s to the 1990s highlighting life and culture in Puerto Rico. The collection included 8,000 records total, all in Spanish. Unfortunately, about 2,000 items lacked metadata aside from local identification numbers. Furthermore, during our post-digitization quality control process, we discovered that over 4,000 of the records were associated with incorrect metadata: titles listed in WIPR’s inventory did not match the digitized content, consequently raising questions about the accuracy of any related dates and descriptions.

Like many AV archives, we now use automatic speech recognition models, like Whisper, to create transcripts. We already planned to create transcripts for the entire WIPR collection, and so we decided to prioritize records with no metadata, which we had labeled as “unknown.” 

Whisper transcripts help populate our search indexes, and in this case they also gave us a document to scan when drafting minimal item-level descriptions. The WIPR collection also allowed us to test the tool with Spanish-language materials. Even though it produced transcripts with hallucinations (especially for audio of instrumental music) and misspellings, it handled the language well enough, giving us enough information to identify series titles, episode titles when possible, and subjects. It also allowed us to check for the presence of copyrighted content, such as poetry or musical performances, as well as sensitive information about individuals, factors that determine which items we can make available on the AAPB’s website.
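The hallucinations mentioned above tend to show up as the same phrase repeated over long stretches of instrumental music. A cheap heuristic for surfacing those spans for human review is to flag consecutive duplicate segments in a Whisper-style transcript. This is an illustrative sketch, not the AAPB's actual tooling; it assumes only Whisper's documented segment shape (`{"start", "end", "text"}`).

```python
# Sketch: flag likely hallucinated runs in a Whisper-style transcript.
# Whisper often repeats one phrase over instrumental music, so consecutive
# duplicate segments are a cheap signal that a span needs human review.

def flag_repeated_segments(segments, min_repeats=3):
    """Return (start, end, text) spans where the same text repeats consecutively."""
    flagged = []
    run_start, run_len = 0, 1
    for i in range(1, len(segments) + 1):
        same = (
            i < len(segments)
            and segments[i]["text"].strip() == segments[run_start]["text"].strip()
        )
        if same:
            run_len += 1
        else:
            if run_len >= min_repeats:
                flagged.append((
                    segments[run_start]["start"],
                    segments[i - 1]["end"],
                    segments[run_start]["text"].strip(),
                ))
            run_start, run_len = i, 1
    return flagged
```

A cataloger can then jump straight to the flagged timestamps instead of reading the whole transcript.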

We gather that creating and using transcripts along these lines is increasingly common among AV archives. We have been fortunate to benefit from communities, such as the AI4LAM Speech-to-Text Working Group and Code4Lib, that eagerly share knowledge about the relevant technologies, protocols, and workflows.

From audio to video

Audio only gets you so far if the aim is accurate catalog records. Just listening to a recording—whether a human or a computer is doing it—cannot tell you the proper spelling of a proper name. This problem is amplified for broad collections with items covering a diverse range of persons, places, and things, which is exactly what we find across the history of public broadcasting. 

We have long been aware that crucial metadata about broadcast television items can be found in what we refer to as “the scenes with text,” such as the slate, the chyrons (lower-third text), and the credits sequences. However, the process of manually scrubbing through video files to find these few seconds of text is time-consuming and error-prone. Our answer to this problem was the creation of visual indexes (or “visaids”), which give our catalogers a way to see an overview of the video content and read the main “scenes with text” without needing to play the full file or even scrub through it.

Unfortunately, using AI to find and classify all the scenes with text is not straightforward. Off-the-shelf computer vision systems are inadequate, since the particular kinds of scenes we are trying to classify, like slates, chyrons, and credits, are not represented in the corpora of labeled images usually used for training and evaluating these systems. To accomplish our goals, we would need a system trained on data relevant to our collections.

So, we embarked on a data creation project. We wrote scripts to extract still images from videos and developed a custom annotation interface, our “Keystroke Labeler,” for labeling images at maximum speed. Through the work of our staff and several interns, we have curated and labeled a set of more than 100,000 images in 18 categories for training data, with another 8,600 held out as a representative evaluation set.
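A still-extraction script along these lines can be as simple as a wrapper around ffmpeg's `fps` filter. The helper below is a hypothetical sketch (the AAPB's actual scripts are not published in this post); it only builds the command, which a caller would run with `subprocess`.

```python
# Sketch: build an ffmpeg command that samples one still every N seconds,
# writing numbered JPEGs for later labeling. Hypothetical helper, not the
# AAPB's production script.

def ffmpeg_still_command(video_path, out_dir, every_seconds=2.0):
    """Return an ffmpeg argv that writes numbered JPEG stills to out_dir."""
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", f"fps=1/{every_seconds:g}",  # one frame per N seconds
        "-qscale:v", "2",                   # high-quality JPEG output
        f"{out_dir}/frame_%06d.jpg",
    ]

# Usage: subprocess.run(ffmpeg_still_command("episode.mp4", "stills"), check=True)
```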

Our partners at the Lab for Linguistics and Computation at Brandeis University have developed the CLAMS platform (Computational Linguistics Applications for Multimedia Services), which provides tooling for the automated analysis of audiovisual media, along with a common data structure called MMIF (Multi-Media Interchange Format). The Brandeis team used our training and evaluation data to fine-tune a custom image classifier and crafted procedures for “stitching” appropriately categorized images back together into scenes with text. They packaged the computer vision model and the stitcher code as the Scenes-with-Text Detection (SWT detection) CLAMS app. Finally, the GBH team wrote a system that takes MMIF output and produces HTML-based visualizations. The full AV computing pipeline yields visual indexes or “visaids.”
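The core of the stitching idea can be sketched in a few lines: per-frame labels from the image classifier are merged into labeled time spans, and short, spurious runs are dropped. This is an illustrative reimplementation under assumed inputs, not the real stitcher, which ships as part of the CLAMS SWT detection app.

```python
# Sketch: "stitch" per-frame classifier labels into scenes with text.
# frames is a list of (timestamp_seconds, label); the real CLAMS stitcher
# works over MMIF annotations and is more sophisticated than this.

def stitch_scenes(frames, keep=frozenset({"slate", "chyron", "credits"}),
                  min_frames=2):
    """Return (label, start, end) spans for runs of kept labels."""
    scenes, current = [], None  # current = [label, start, end, frame_count]
    for ts, label in frames:
        if current and label == current[0]:
            current[2], current[3] = ts, current[3] + 1
        else:
            if current and current[0] in keep and current[3] >= min_frames:
                scenes.append((current[0], current[1], current[2]))
            current = [label, ts, ts, 1]
    if current and current[0] in keep and current[3] >= min_frames:
        scenes.append((current[0], current[1], current[2]))
    return scenes
```

Requiring `min_frames` consecutive hits filters out one-frame misclassifications before a span ever reaches the visaid.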

Screenshot of an example of a visaid for an episode of New Mexico In Focus. The interactive visaid is available online.

In the visaid you can see one deliberate choice we made about how to use AI in our description process. Notice the “unlabeled sample” checkbox, which allows the cataloger or other user to see a random sample of the frames that the SWT detection app did not label. This gets at a crucial difference between collaborating with a human and collaborating with a machine. A human translator or interpreter can respond to criticism and negotiate the philosophical differences that inevitably arise in the fuzzier areas of human knowledge practices; in general, a computer cannot. So, we make sure that visaids capture a number of still images outside the parameters of what we are looking for.

Although we may not always make use of these extra still frames, they allow us to check the work of the machine. By displaying a sample of uncategorized frames, the visaids allow us to notice any significant patterns of false negative classifications by our computer vision model, even without a full viewing of the video. That way we continue to assess the system, both to continue improving it and to inform our judgments about how we integrate AI into our descriptive workflows.
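The sampling behind that checkbox can be sketched simply: take the frames the classifier left unlabeled and draw a reproducible random sample for display. This is an assumed implementation for illustration; frame identifiers and parameters here are hypothetical.

```python
# Sketch: pick a reproducible random sample of unlabeled frames for a visaid,
# so catalogers can spot patterns of false negatives. Illustrative only.
import random

def sample_unlabeled(all_frames, labeled_frames, k=12, seed=0):
    """Return up to k frame ids that received no label, in stable random order."""
    labeled = set(labeled_frames)
    unlabeled = [f for f in all_frames if f not in labeled]
    rng = random.Random(seed)  # fixed seed keeps the visaid reproducible
    return rng.sample(unlabeled, min(k, len(unlabeled)))
```

Seeding the generator means the same video always shows the same sample, so a pattern a cataloger notices can be found again later.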

Since mid-2024, we have been making visaids for every new video item added to the AAPB. As of early 2026, we have created over 34,000 visaids. These are ready in time for cataloging, and, in most cases, available prior to the digitization quality control process. This year, with the help of an updated SWT detection app based on a re-trained image classifier, we are hoping to start creating visaids for the video items already in the collection.

Human-led, AI-assisted video cataloging workflow

In outline, our digital ingest and cataloging workflow goes like this:

  • Contribution: Partner organizations and the AAPB agree on an inventory of media for digitization and access. The inventory usually includes minimal metadata, such as title and date.
  • Digitization: Video tapes are sent to a vendor for digitization.
  • Digital ingest: Access copies are delivered to GBH via Sony Ci.
  • SWT+visaid processing: Visaids are created for each video item.
  • Quality checking: Digital files are examined and light cataloging is performed.
  • Catalog enhancement: Over time, catalog records are enhanced by catalogers consulting visaids (and selectively watching video) to find names, affiliations, and dates.
  • Push records to public site: The newly enhanced catalog records populate search indexes and are viewable on item-level pages of the AAPB website.

This workflow benefits from advances in AI and the automation those advances enable. However, it runs counter to the Taylorist trends that dominated 20th-century industrial production—trends that tend toward de-emphasis of human expertise and de-skilling of human labor. Rather than de-skilling the work or removing human judgment, we use AI to remove the mechanical friction of locating information.

Furthermore, our use of AI avoids contributing to the new glut of unreliable data that is the increasingly pervasive output of generative AI systems. We minimize the number of decisions the AI makes that bypass human judgment, ensuring that the cognitive assistance provided by AI is not a black box and that our archive’s catalog, uneven as it still is, does not become a depot for AI slop.

Ongoing work: AI for extracting text data from video

Although visaids provide an intuitive visual overview of videos, a major pain point remains. Once a cataloger has used the visaid to identify the values of key metadata fields, such as dates and person names, it is still necessary to manually transcribe that text into the catalog. So, the obvious next step is to further reduce the mechanical friction within the process by automatically extracting the text, allowing the cataloger to transfer it into a cataloging interface with a few clicks.

However, the automated extraction of text information from television footage is challenging. OCR on a paper document deals with black text on a plain white background; extracting text overlaid on a complex, colorful background in a degraded analog video signal is much harder.

In research conducted last year in partnership with our Brandeis colleagues, we validated an approach for using vision-language models to extract text from slates and chyrons. (Unfortunately, rolling credits present more difficult challenges!) We are currently experimenting with a new interface we have developed to allow catalogers to expeditiously add extracted text to item-level catalog records.
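Before extracted text reaches the cataloging interface, the model's reply has to be treated as untrusted input. A minimal sketch of that validation step is below; the JSON field names are hypothetical, not the AAPB's schema, and a malformed reply simply yields nothing for the cataloger to click.

```python
# Sketch: validate a vision-language model's JSON reply before offering its
# fields to a cataloger. Field names are hypothetical, not the AAPB schema.
import json

EXPECTED_FIELDS = {"series_title", "episode_title", "airdate", "names"}

def parse_slate_response(raw):
    """Parse a model's JSON reply; keep only expected, non-empty fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}  # malformed reply: leave the record for human review
    return {
        k: v for k, v in data.items()
        if k in EXPECTED_FIELDS and v not in (None, "", [])
    }
```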

As with our uses of transcripts and visaids, this new cataloging assistance workflow is intended to surface information and put it in catalogers’ hands at the moments when it can be readily used to produce accurate records. Our hopes are that, with careful use of these tools, we can augment and amplify the labor of human archivists, and that we will find similar paths forward as new technological possibilities emerge.


Owen King, Caroline Mango, Raananah Sarid-Segal, and Miranda Villesvik work at GBH Archives as members of the cataloging and ingest team for the American Archive of Public Broadcasting.
