Craptions are – get ready for this – captions that are crappy. They are inaccurate, fail to differentiate between speakers, are not grammatically correct, or don’t contain punctuation. In short, they are crap – which is a bummer.
Captions are essential for those who are Deaf or hard of hearing – in the US, that’s about 15% of the population – and they are a critical tool for those learning a new language, making sense of some muddled speech, or trying to understand what’s being spoken over the sound of crunching on kettle chips.
When captions fail, viewers are left to piece together meaning or stop watching altogether. Craptions are not only annoying to us TV snackers, detrimental to those who rely on captions, and harmful to the 50% of Americans watching with subtitles most of the time; they’re also supposed to be illegal.
The Federal Communications Commission is responsible for making sure the airwaves are accessible for everybody. Since 1996, it has required that all TV shows have “accurate, synchronous, complete, and properly placed” captions available, meaning they match the spoken words and also display background noises accurately. Additionally, broadcasters in the United States have to comply with the 1990 Americans with Disabilities Act. “The ADA says that all businesses who serve the public or ‘places of public accommodation,’ must be equally available to everyone,” concisely summarize our friends at 99% Invisible, whose “Craptions” podcast episode inspired this blog post.
While online media streaming and communication services are not technically “places,” they do function as social places. For example, when a family is gathered to watch a show, leaving someone without a way to follow the content, such as captions, also excludes them from that social interaction; similarly, when people cannot engage with virtual meetings, they are being removed from a place of gathering.
Software makers have responded to this online meeting need with closed captioning services:
- Zoom Room meeting hosts can enable live transcription for a meeting or assign participants to manually type captions.
- Otter.ai takes transcription a step further, providing automated, real-time meeting notes of Zoom meetings and other audio or video files, while allowing action items, comments, and highlights to be directly applied to the produced notes.
- Similarly, MeetGeek uses AI-integrated technology to transcribe in real time and summarize meeting notes with key insights.
Many of these caption-creating tools – in meeting rooms and in media – use speech recognition technology, leveraging artificial intelligence to scale the captioning process. In place of a human note taker or captioner, AI programs can transcribe faster, cheaper, and sometimes in real time. The output of Automatic Speech Recognition (ASR) therefore has a significant impact on the products of meetings and movies, for everyone.
ASR is also the technology at work when we ask Siri for the weather, and it has many other use cases. To make the magic happen, a computer receives audio input, processes it by breaking down various components of speech, and then transcribes the speech into text. An ASR system has a few key components:
- Lexical design. An ASR system is equipped with a language-specific lexicon containing the phonemes and allophones of the language (these are the sounds we put together to make words). The lexicon pairs these fundamental sounds, the form in which speech will be received, with a written vocabulary, the text that will be produced. The system makes use of the lexicon information through the acoustic modeling process.
- Acoustic model. Acoustic modeling allows the ASR system to separate audio into small time frames and estimate the probability of each phoneme being spoken in a frame. This makes it possible for the system to recognize phrases spoken by different people and in different accents. Deep-learning algorithms are trained on a variety of audio recordings and transcripts to better understand the relationships among audio frames, phonemes, and the intended words.
- Language model. Natural language processing (NLP) helps the system understand the context and intent of what’s being said and use that insight to compose likely word sequences.
These components work together, intaking audio and making predictions, to recycle our meaning back to us.
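To make the hand-off between the three components a little more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the per-frame phoneme probabilities in `FRAMES` stand in for what an acoustic model might output, `LEXICON` is a two-entry pretend lexicon, and `BIGRAM` is a tiny pretend language model. Real ASR systems use trained neural networks and far more careful decoding, so treat this as a cartoon of the idea rather than how any particular product works.

```python
from itertools import product

# Hypothetical acoustic-model output: one dict of phoneme probabilities
# per small audio frame. "_" stands for a pause between words.
FRAMES = [
    {"DH": 0.8, "T": 0.2},
    {"DH": 0.7, "T": 0.3},
    {"EH": 0.9, "IH": 0.1},
    {"R": 0.6, "L": 0.4},
    {"_": 0.9, "R": 0.1},
    {"IH": 0.7, "EH": 0.3},
    {"Z": 0.8, "S": 0.2},
]

# Hypothetical lexicon: phoneme sequences mapped to the written words
# they could represent. Homophones share a single entry.
LEXICON = {
    ("DH", "EH", "R"): ["there", "their"],
    ("IH", "Z"): ["is"],
}

# Hypothetical language model: probability of a word given the previous
# word. "<s>" marks the start of the sentence.
BIGRAM = {
    ("<s>", "there"): 0.6,
    ("<s>", "their"): 0.4,
    ("there", "is"): 0.5,
    ("their", "is"): 0.01,
}


def decode_phonemes(frames):
    """Pick the likeliest phoneme per frame, merge repeats, split on pauses."""
    words, current, previous = [], [], None
    for frame in frames:
        best = max(frame, key=frame.get)
        if best == "_":
            if current:
                words.append(tuple(current))
            current = []
        elif best != previous:
            # Merging repeats is a simplification: it would also collapse
            # a sound that is genuinely spoken twice in a row.
            current.append(best)
        previous = best
    if current:
        words.append(tuple(current))
    return words


def best_sentence(phoneme_words):
    """Use the lexicon for candidate words and the bigram model to choose."""
    candidates = [LEXICON.get(p, ["<unk>"]) for p in phoneme_words]

    def score(sentence):
        total, prev = 1.0, "<s>"
        for word in sentence:
            total *= BIGRAM.get((prev, word), 1e-6)  # small floor for unseen pairs
            prev = word
        return total

    return max(product(*candidates), key=score)


def transcribe(frames):
    return " ".join(best_sentence(decode_phonemes(frames)))


print(transcribe(FRAMES))
```

In this toy setup, the sounds alone cannot tell “there” from “their”; it is the pretend language model that tips the choice. When a system gets that step wrong, the result is exactly the kind of homophone slip that shows up in craptions.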
ASR offers possibilities for computer communication that seemed unimaginable decades ago – even five years ago. Although it’s evolving rapidly to make sense of our speech, it’s far from perfect, which is why we have labels like “craptions” for some of ASR’s outputs. What will continue to aid captions and those who need them are human notetakers and transcribers – they can help to clarify meaning, emphasize importance, and cut down on the crap in captions.
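For the curious, one common way to put a number on how “far from perfect” a caption is goes by the name word error rate (WER): the number of word substitutions, insertions, and deletions needed to turn the caption into a trusted human transcript, divided by the length of that transcript. The sketch below is a minimal version under some loud assumptions: the function name `word_error_rate` is ours, words are split on whitespace only, and punctuation and speaker labels (two of the things craptions get wrong) are not scored at all.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing from caption
            insertion = current[j - 1] + 1  # extra word in caption
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Human transcript vs. a made-up automatic caption.
print(word_error_rate("pass the kettle chips", "pass the cattle ships"))
```

Here two of the four words are wrong, so the score is 0.5. A lower score is better, and a human transcriber is what supplies the reference to compare against.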