Assessment of Whisper AI as a Tagalog Language Transcription Tool

This guest post is from Tommy Lim, Philippines Interviews Digitization Project Intern at GBH and the American Archive of Public Broadcasting.


This past summer, I had the opportunity to test Whisper AI, a tool that takes audio files and returns transcriptions of human speech in a variety of languages. These transcriptions may be useful for providing closed captioning for the audiovisual material made available on AAPB’s website. Primarily, I tested its ability to comprehend Tagalog, a dominant language in the Philippines, and English. While I correctly assumed that the AI’s English model would leave it better prepared to provide English transcriptions, assessing its Tagalog and “Taglish” (Tagalog-English) transcriptions allowed me to identify the specific ways in which Whisper AI continues to struggle with producing Tagalog transcriptions.

An interesting collection of errors pertains to Whisper’s tendency to return English translations instead of Tagalog transcriptions. I argue that Whisper does this in addition to erroneously identifying Tagalog phrases as English ones and transcribing them as such. Whisper AI’s tendency to misidentify Tagalog as English becomes salient when one sounds out the anomalous English phrase and realizes its phonetic similarity to the Tagalog phrase actually uttered. Lastly, there is also the issue of loan words. Tagalog will orthographically assimilate words from other languages to make them consistent with Philippine spelling conventions and phonotactic constraints (e.g., “election” becomes “eleksyon,” “ventana” becomes “bintana,” etc.). All audio files featuring Tagalog also featured English. This linguistic multiplicity may create a context in which Whisper makes more transcription errors than it would if an audio file featured only Tagalog.

There is also the issue of Whisper AI returning transcriptions in other languages, although rarely. These transcriptions contained Thai characters, Korean Hangul, and characters from what I believe to be a variety of Chinese. Whisper AI also provides Tagalog transcriptions for other Philippine languages. The Philippines is home to multiple indigenous languages, and multiple audio files featuring Tagalog also featured other Filipino languages, including but not limited to Visayan and Bicolano. As a speaker of a Manilan dialect of Tagalog, I cannot tell whether the AI offers an accurate transcription of these non-Tagalog utterances.

The Tagalog transcriptions generated by Whisper AI were at times accurate, but required revision most of the time. The errors that were easiest to fix, and perhaps simplest in nature, were unnecessary insertions of speech sounds (typically consonants and glides), misspellings, the splitting of a single word into several, and the concatenation of separate words into one. A more complex error was that Whisper AI would mistake one Tagalog word for another and transcribe the wrong word in its place. I noticed this would happen when the mistaken word had been used frequently earlier in the interview. For instance, the word “kapatid” (sibling) might be used in place of “kasama” (together) if the speaker had been talking about their siblings in answer to a previous question from the interviewer. Typically, there were phonetic similarities between the mistaken word and the correct word, as with kapatid/kasama, which share the initial consonant /k/.

Occasionally, Whisper AI would also misidentify where words begin and end. For instance, the Tagalog words “kamag anak,” also hyphenated as “kamag-anak,” together mean family or relative, and were occasionally transcribed as “kama ganak.” This error could have occurred for two reasons, both rooted in the fact that kamag-anak relies on the presence of both words to derive its meaning. First, the speaker may utter the words in quick succession because they are semantically connected, which makes it harder for Whisper to locate the boundary between them. Second, “kama” is an actual Tagalog word meaning “bed,” and “ganak,” although not a word, is still phonotactically permissible in Tagalog. The permissibility of both the real first word and the second non-word may have influenced the error. It may be important to note that this error was inconsistent: the speaker uttered “kamag anak” multiple times, and Whisper AI transcribed it correctly some of the time.

A variation on this problem was missing consonants, specifically where the final consonant of one word and the initial consonant of the next were identical. For instance, Whisper AI would transcribe “hanggang ngayon,” meaning “until now,” as “hanggang ayon.” As in the previous example, this mistake may have been made because Whisper AI identified “ngayon” as “ayon,” an actual Tagalog word used to express approval or affirmation. The shared /ŋ/ (spelled “ng”) between the adjacent words was then attributed only to the coda of the first word.
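To make the word-boundary error concrete, here is a minimal sketch of how a reviewer might flag it automatically when a corrected reference phrase is available. It is not part of the workflow described in this post: the function names are hypothetical, and the check assumes that a boundary error keeps every letter and only moves the break between words.

```python
import re


def tokens(phrase: str) -> list[str]:
    """Split a phrase into lowercase words, treating hyphens like spaces."""
    return [t for t in re.split(r"[\s\-]+", phrase.casefold()) if t]


def is_boundary_shift(expected: str, transcribed: str) -> bool:
    """True when both phrases contain the same letters in the same order
    but are divided into words differently."""
    exp, got = tokens(expected), tokens(transcribed)
    return exp != got and "".join(exp) == "".join(got)


# The boundary moved, but no letters changed.
print(is_boundary_shift("kamag-anak", "kama ganak"))
# Hyphen versus space is only a spelling variant, not a boundary error.
print(is_boundary_shift("kamag-anak", "kamag anak"))
```

A dropped consonant, as in the “hanggang ayon” case, changes the letter sequence itself, so this check would not flag it; that kind of error would need a separate comparison.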

Among other errors, Whisper AI would also return the same transcription repeatedly for extended periods of time, sometimes minutes. This error occurred in multiple languages. One interview had a transcription indicating that the speaker said “I was a farmer” repeatedly for nearly 30 seconds. In another instance, the transcription restated “Pag-dangir” for 90 seconds. Notably, the interviewee was speaking a Philippine language I was unfamiliar with, so I do not know whether this word is spelled correctly (or whether it is a real word at all). An interesting distinction between these two errors was that the first showed eight consecutive identical captions, each stating the erroneous phrase once, while the second showed three erroneous captions, two of which stated the phrase over twenty times in a comma-separated list.
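Both forms of this repetition error could, in principle, be flagged before manual review. The following is a rough sketch only, not something used in this project. It assumes captions are available as (start, end, text) tuples with times in seconds, and the function names and thresholds are hypothetical.

```python
from itertools import groupby


def repeated_runs(captions, min_run=3):
    """Find stretches of consecutive captions with identical text.

    Returns (run_start, run_end, text, count) for each run of at least
    `min_run` captions.
    """
    runs = []
    for key, group in groupby(captions, key=lambda c: c[2].strip().casefold()):
        group = list(group)
        if key and len(group) >= min_run:
            runs.append((group[0][0], group[-1][1], group[0][2].strip(), len(group)))
    return runs


def internal_repetition(text, min_repeats=5):
    """Return the longest streak of identical comma-separated items in one
    caption, or 0 if the streak is shorter than `min_repeats`."""
    parts = [p.strip().casefold() for p in text.split(",") if p.strip()]
    if not parts:
        return 0
    longest = max(len(list(g)) for _, g in groupby(parts))
    return longest if longest >= min_repeats else 0


captions = [
    (0.0, 4.0, "I was a farmer."),
    (4.0, 8.0, "I was a farmer."),
    (8.0, 12.0, "I was a farmer."),
    (12.0, 15.0, "Then we moved."),
]
print(repeated_runs(captions))
print(internal_repetition("Pag-dangir, " * 20))
```

The first function covers the case of many identical captions in a row; the second covers a single caption that repeats one phrase in a comma-separated list. A flagged span still needs a human listener, since a speaker may genuinely repeat a phrase.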

For a more investigative endeavor, my co-intern Rayya Chek, senior systems analyst Kevin Carter, and I thought it would be useful to see whether Whisper AI could be directed to identify more than one language per audio file, and whether, for audio files containing Tagalog, there were meaningful differences between transcriptions run with a specified language and transcriptions run with no specified language. For the first inquiry, we found that Whisper AI cannot be directed to use more than one language per transcription. This may be a useful area for improvement, especially for a country like the Philippines, where many people are fluent in multiple languages and code-switching is a linguistic norm. Whisper AI would prove more useful if it could draw on multiple languages and know when to use one over another when a speaker code-switches. For the interviews with Tagalog speakers, I directed Whisper to produce two transcriptions of each interview: one with no indicated language, and one with Tagalog as the indicated language. When comparing the two, I found that the runs where Tagalog was explicitly indicated returned more Tagalog transcriptions. Runs with no indicated language more frequently translated Tagalog utterances into English ones. For interviews where the speaker more frequently code-switched between Tagalog and other Filipino languages, Whisper AI exhibited a less pronounced tendency to offer English translations. I assume this is due to Whisper AI lacking support for other Filipino languages.
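For readers who want to try a similar comparison, the sketch below shows one possible way to set it up. This post does not say which interface to Whisper was used, so the code assumes the open-source `openai-whisper` Python package, whose `transcribe` method accepts `language` and `task` options and returns a dictionary containing a `segments` list. The helper names are hypothetical.

```python
from typing import Any


def transcribe_both(model: Any, audio_path: str) -> dict[str, dict]:
    """Transcribe one file twice: once letting Whisper detect the language,
    once with Tagalog ("tl") specified."""
    auto = model.transcribe(audio_path, task="transcribe")
    forced = model.transcribe(audio_path, task="transcribe", language="tl")
    return {"auto": auto, "tagalog": forced}


def overlapping_text(segments: list[dict], start: float, end: float) -> str:
    """Join the text of every segment that overlaps the given time window."""
    return " ".join(
        s["text"].strip() for s in segments if s["start"] < end and s["end"] > start
    )


def differing_spans(auto_segments: list[dict], forced_segments: list[dict]) -> list[tuple]:
    """List the time spans where the two runs produced different text.

    Each item is (start, end, auto_detected_text, tagalog_text).
    """
    diffs = []
    for seg in forced_segments:
        auto_text = overlapping_text(auto_segments, seg["start"], seg["end"])
        forced_text = seg["text"].strip()
        if auto_text.casefold() != forced_text.casefold():
            diffs.append((seg["start"], seg["end"], auto_text, forced_text))
    return diffs
```

With the package installed, `model` could be obtained with `whisper.load_model("base")`, and the result of `differing_spans(runs["auto"]["segments"], runs["tagalog"]["segments"])` would point a reviewer to the spans where the auto-detected run may have produced English in place of Tagalog. The two runs will not always segment the audio identically, so the time-overlap matching here is only approximate.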

Overall, my time with Whisper AI proved to be one of the more challenging yet rewarding aspects of my internship with GBH. My opinion on Whisper AI is that it still has a long way to go, even if it is only used to produce transcription drafts. I noticed that it took me much more time to revise Tagalog transcripts than purely English ones, even though the Tagalog interviews were shorter. The most interesting aspect of working with Whisper AI was speculating about why it made its errors and how they relate to phonetics, phonology, and morphology.

Tommy Lim is a graduate student at the University of Texas at Austin pursuing graduate degrees in information studies and women’s and gender studies. They obtained undergraduate degrees in linguistics and religious studies, and this background carries into their graduate research, which focuses on how ritualized language and religio-spirituality appear in genderqueer folks’ tarot practices to archive and facilitate identity formation. Tommy’s archival interests led them to their internship at GBH, where they looked forward to observing the imbrications of preservation, cultural identity, and coloniality.
