Imagine calling the Social Security Administration and asking, “Where is my July payment?” only to have a chatbot respond, “Canceling all future payments.” Your check has just fallen victim to “hallucination,” a phenomenon in which an automatic speech-recognition system outputs text that bears little or no relation to the input.
Hallucination is one of the many issues that plague so-called generative artificial-intelligence systems such as OpenAI’s ChatGPT, xAI’s Grok, Anthropic’s Claude and Meta’s Llama. These pitfalls stem from design flaws in the architecture of these systems. Yet these are the same types of generative AI tools that the Trump administration and its Department of Government Efficiency (DOGE) want to use to, in one official’s words, replace “the human workforce with machines.”
This proposition is terrifying. There is no “one weird trick” that removes experts and creates miracle machines capable of doing everything humans can do but better. The prospect of replacing federal workers who handle critical tasks—ones that could result in life-and-death scenarios for hundreds of millions of people—with automated systems that can’t even perform basic speech-to-text transcription without making up large swaths of text is catastrophic. If these automated systems can’t even reliably parrot back the exact information that is given to them, then their outputs will be riddled with errors, leading to inappropriate or even dangerous actions. Automated systems cannot be trusted to make decisions the way federal workers—actual people—can.
Historically, hallucination hasn’t been a major issue in speech recognition. Earlier systems might make transcription errors in specific phrases or misspell words, but they didn’t output large chunks of fluent and grammatically correct text that weren’t uttered in the corresponding audio input. Analysts have shown, however, that recent speech-recognition systems such as OpenAI’s Whisper can produce entirely fabricated transcriptions. Whisper is a model that has been integrated into some versions of ChatGPT, OpenAI’s famous chatbot.
Researchers at four universities analyzed snippets of audio transcribed by Whisper and found completely fabricated sentences. In some cases, the transcripts showed the AI had invented the races of the people being spoken about, and in others it even attributed murder to them. One recording of someone saying, “He, the boy, was going to, I’m not sure exactly, take the umbrella,” was transcribed with additions, including “He took a big piece of a cross, a teeny, small piece.... I’m sure he didn’t have a terror knife so he killed a number of people.” In another example, “two other girls and one lady” was given as “two other girls and one lady, um, which were Black.”
In the age of unbridled AI hype, with entrepreneur Elon Musk claiming to have built a “maximally truth-seeking AI,” how did we come to have less reliable speech-recognition systems than we had before? The answer is that although researchers are working to improve such systems by using their contextual knowledge to create models uniquely appropriate for specific tasks, companies such as OpenAI and xAI claim to be building something akin to “one model for everything” that can perform many tasks, including, according to OpenAI, “tackling complex problems in science, coding, math, and similar fields.” These companies use model architectures they believe can work for many different tasks and train their models on vast amounts of noisy, uncurated data instead of using system architectures, training methods and evaluation datasets that best fit the specific task at hand. A tool that supposedly does everything won’t be able to do anything well.
The current dominant method of building tools like ChatGPT or Grok, which are advertised as “one model for everything” systems, uses some variation of large language models (LLMs), which are trained to predict the most likely sequences of words. Whisper’s decoder has two jobs at once: mapping the input speech to text and predicting which “token” comes next in the output text. A token is a basic unit of text, such as a word, number, punctuation mark or word segment, used to analyze textual data. These two disparate jobs, speech transcription and next-token prediction, combined with the large, messy datasets used to train the system, make hallucinations more likely.
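To make next-token prediction concrete, here is a minimal toy sketch in Python. It is not Whisper’s actual code; the two-word context, the tiny vocabulary and the probabilities are invented for illustration. It shows how a decoder that always extends its context with the most probable continuation produces fluent-looking text, whether or not that text matches what was actually said.

```python
# Toy greedy next-token prediction (invented probability table, not a real model).
# The decoder keeps appending the most likely continuation of its recent context,
# which is why the output stays fluent even when it drifts from the input.
NEXT_TOKEN_PROBS = {
    ("he", "took"): {"the": 0.6, "a": 0.4},
    ("took", "the"): {"umbrella": 0.7, "knife": 0.3},
    ("the", "umbrella"): {".": 1.0},
}

def greedy_decode(prompt: str, max_tokens: int = 10) -> str:
    tokens = prompt.split()
    for _ in range(max_tokens):
        context = tuple(tokens[-2:])            # last two tokens as context
        candidates = NEXT_TOKEN_PROBS.get(context)
        if not candidates:                      # no learned continuation: stop
            break
        # Greedy decoding: pick the single most probable next token.
        tokens.append(max(candidates, key=candidates.get))
    return " ".join(tokens)

print(greedy_decode("he took"))  # -> "he took the umbrella ."
```

A real system such as Whisper conditions each prediction on an encoding of the audio and on far more preceding text than a two-word context, but the decoding loop has the same basic shape, and nothing in it guarantees that the words it generates were ever spoken.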
Like many of OpenAI’s projects, Whisper’s development was influenced by a particular outlook, summarized by the company’s former chief scientist: “if you have a big dataset and you train a very big neural network,” it will work better. Arguably, Whisper doesn’t work better. Because its decoder is tasked with both transcription and token prediction without having been trained with precise alignment between audio and text, the model may prioritize the generation of fluent text over accurate transcription of the input. And unlike misspellings or other minor mistakes, coherent text doesn’t give the reader clues that the transcriptions might be inaccurate, potentially leading users to rely on the AI’s output in high-stakes scenarios without finding its failures—until it’s too late.
OpenAI researchers have claimed that Whisper approaches human “accuracy and robustness,” a statement that is demonstrably false. Most humans don’t transcribe speech by making up large swaths of text that never existed in the speech they heard. In the past, people working on automatic speech recognition trained their systems with carefully curated data consisting of speech-text pairs in which the text accurately represented the speech. In contrast, OpenAI’s attempt to use a “general” model architecture rather than one tailored for speech transcription—sidestepping the time and resources it takes to curate data and adequately compensate data workers and creators—results in a dangerously unreliable speech-recognition system.
If the current one-model-for-everything paradigm has failed at the kind of English-language speech transcription most English speakers can perform perfectly without further education, how will we fare if DOGE succeeds in replacing expert federal workers with generative AI systems? Unlike the generative AI systems that federal workers have been told to use to perform tasks ranging from creating talking points to writing code, automatic speech-recognition tools are constrained to the much better-defined setting of transcribing speech.
We cannot afford to replace the critical tasks of federal workers with models that completely make stuff up. There is no substitute for their expertise when handling sensitive information and working in life-critical sectors ranging from health care to immigration. We need to promptly challenge, including in courts if appropriate, DOGE’s push to replace “the human workforce with machines” before this action brings immense harm to Americans.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.