We are thrilled to announce that the Institute of Museum and Library Services has awarded WGBH, on behalf of the American Archive of Public Broadcasting, a National Leadership Grant for a project titled “Improving Access to Time-Based Media through Crowdsourcing and Machine Learning.”
Together, WGBH and Pop Up Archive plan to address the challenges faced by many libraries and archives trying to provide better access to their media collections through online discoverability. This 30-month project will combine technological and social approaches for metadata creation by leveraging scalable computation and engaging the public to improve access through crowdsourcing games for time-based media. The project will support several related areas of research and testing, including: speech-to-text and audio analysis tools to transcribe and analyze almost 40,000 hours of digital audio from the American Archive of Public Broadcasting; develop open source web-based tools to improve transcripts and descriptive data by engaging the public in a crowdsourced, participatory cataloging project; and create and distribute data sets to provide a public database of audiovisual metadata for use by other projects.
Our research questions are: How can crowdsourced improvements to machine-generated transcripts and tags increase the quality of descriptive metadata and enhance search engine discoverability for audiovisual content? How can a range of web-based games create news points of access and engage the public engagement with time-based media through crowdsource tools? What qualitative attributes of audiovisual public media content (such as speaker identities, emotion, and tone) can be successfully identified with spectral analysis tools, and how can feeding crowdsourced improvements back into audio analysis tools improve their future output and create training data that can be publicly disseminated to help describe other audiovisual collections at scale?
This project will use content from the AAPB to answer our questions. The project will fund 1) audio analysis tools – development and use of speech-to-text and audio analysis tools to create transcripts and qualitative waveform analysis for almost 40,000 hours of AAPB digital files (and participating stations can definitely receive copies of their own transcripts!); 2) metadata games – development of open-source web-based tools to improve transcripts and descriptive data by engaging the public in a crowd sourced, participatory cataloging project; 3) evaluating access – a measurement of improved access to media files from crowd sourced data; 4) sharing tools – open-source code release for tools developed over the course of the grant, and 5) teaching data set– the publication of initial and improved data sets to ‘teach’ tools and provide a public database of audiovisual metadata (audio fingerprint) for use by other projects working to create access to audiovisual material.
The 2014 National Digital Stewardship Agenda includes, “Engage and encourage relationships between private/commercial and heritage organizations to collaborate on the development of standards and workflows that will ensure long-term access to our recorded and moving image heritage.” These partnerships are critical in order to move the needle of audiovisual access issues of national significance. The AAPB and Pop Up Archive are eager to continue building such a relationship so that the innovations in technology, workflows, and data analysis advanced by the private sector are fully and sustainably leveraged for U.S. public media and cultural heritage organizations.