When OpenAI announced it was open-sourcing Whisper, the company described the automatic speech recognition (ASR) system as one "that approaches human level robustness and accuracy on English speech recognition." Advertised like that, it is no wonder that Whisper has become one of the most popular ASR solutions on the market; for example, the whisper-large-v3 model, released in November 2023, now boasts over 4 million downloads on Hugging Face. Whisper now serves as the engine behind many AI applications, from medical transcription tools to voice assistants.
However, OpenAI also seems fully aware of Whisper's limitations. In the Whisper model card, OpenAI recommends against using the models for subjective classification or deploying them in "high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes." The company also recognizes that, because of the architecture they are built on, the models tend to hallucinate. Evidently, this has not stopped developers from building applications such as tools for medical and legal transcription.
The Whisper models transcribe partly by having a component that attempts to predict the next token based on the text already transcribed, which leaves room for the model to produce plausible-sounding text that was never spoken. The question that went unanswered, and the one independent researchers are now trying to answer, is just how bad Whisper's hallucination problem is. It turns out it is quite bad, according to a recent Associated Press report. For instance, a researcher from the University of Michigan using the model to transcribe public meetings found hallucinations in 8 out of 10 examined transcriptions produced by Whisper without fine-tuning.
Other reports include hallucinations in about half of 100 hours of transcribed audio, hallucinations in nearly every one of 26,000 reviewed transcripts, and over 187 hallucinations out of 13,000 clear audio transcriptions. Most of these numbers look worryingly high for a technology that also powers accessibility solutions like closed captioning for the deaf and hard-of-hearing. Regardless of whether OpenAI has done enough to warn against using Whisper in sensitive domains, the findings raise the question of whether a system with numbers like these should be marketed as offering "near-human accuracy".
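To put the two precisely stated figures side by side, here is a minimal Python sketch that converts them to percentages. The dictionary keys and the `rate_percent` helper are illustrative names, not part of any report, and the counts are taken as quoted above, so the results are approximate. Note that the two figures measure different things: the first is the share of transcripts containing a hallucination, the second is the number of hallucinations relative to the number of transcriptions.

```python
# Figures as quoted in the article; treat them as approximate.
reports = {
    "public meetings (transcripts with hallucinations)": (8, 10),
    "clear audio transcriptions (hallucinations found)": (187, 13_000),
}


def rate_percent(count, total):
    """Return count / total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Compute one percentage per reported figure.
rates = {name: rate_percent(c, t) for name, (c, t) in reports.items()}

for name, pct in rates.items():
    print(f"{name}: {pct:.2f}%")
```

The other two reports ("about half of 100 hours" and "nearly every one of 26,000 transcripts") are not stated precisely enough to compute a rate, so they are left out of the sketch.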