Researchers say an AI-powered transcription tool used in hospitals invents things no one ever said

By James on Oct 26, 2024

SAN FRANCISCO – Tech giant OpenAI has touted its AI-powered transcription tool Whisper as having human-level robustness and accuracy.

But Whisper has a major drawback: It has a tendency to make up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Those experts said some of the made-up text — known in the industry as hallucinations — could include racist commentary, violent rhetoric and even imagined medical treatments.

Experts said such fabrications are problematic because Whisper is used in a range of industries around the world to translate and transcribe interviews, generate text in popular consumer technologies and create subtitles for videos.

More worrying, they said, is a rush by medical centers to use Whisper-based tools to transcribe patients’ consultations with doctors, despite OpenAI’s warnings that the tool should not be used in “high-risk domains.”

The full extent of the problem is difficult to know, but researchers and engineers said they often encountered Whisper’s hallucinations in their work. For example, a University of Michigan researcher conducting a study of public meetings said he found hallucinations in 8 out of 10 audio transcripts he inspected before he began trying to improve the model.

A machine learning engineer said he initially discovered hallucinations in about half of the more than 100 hours of Whisper transcripts he analyzed. A third developer said he found hallucinations in almost every one of the 26,000 transcriptions he made with Whisper.

The problems persist even with well-recorded, short audio clips. A recent study by computer scientists revealed 187 hallucinations in the more than 13,000 clear audio samples they examined.

That trend would lead to tens of thousands of incorrect transcriptions across millions of recordings, researchers said.

Such errors can have “very serious consequences,” especially in hospitals, said Alondra Nelson, who led the White House Office of Science and Technology Policy for the Biden administration until last year.

“Nobody wants a misdiagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey. “There should be a higher bar.”

Whisper is also used to create captions for the deaf and hard of hearing – a population at particular risk for incorrect transcriptions. That’s because deaf and hard of hearing people have no way to identify fabrications “hidden among all these other texts,” said Christian Vogler, who is deaf and directs Gallaudet University’s Technology Access Program.

The prevalence of such hallucinations has led experts, advocates, and former OpenAI employees to call on the federal government to consider AI regulations. They said OpenAI should at least address the bug.

“This seems solvable if the company is willing to prioritize it,” said William Saunders, a San Francisco-based research engineer who left OpenAI in February over concerns about the company’s direction. “It’s problematic when you put this out there and people have too much confidence in what it can do and integrate it into all these other systems.”

An OpenAI spokesperson said the company is continuously investigating how to reduce hallucinations and appreciated the researchers’ findings, adding that OpenAI incorporates feedback into model updates.

While most developers assume that transcription tools misspell words or make other mistakes, engineers and researchers said they had never seen an AI-powered transcription tool hallucinate as much as Whisper.

The tool is integrated into some versions of OpenAI’s flagship chatbot ChatGPT, and is a built-in offering in Oracle and Microsoft’s cloud computing platforms, which serve thousands of companies around the world. It is also used to transcribe and translate text in multiple languages.

In the past month alone, a recent version of Whisper was downloaded more than 4.2 million times from the open-source AI platform HuggingFace. Sanchit Gandhi, a machine learning engineer there, said Whisper is the most popular open-source speech recognition model and is built into everything from call centers to voice assistants.

Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short excerpts obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They found that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.

In one example they discovered, a speaker said, “He, the boy, would, I’m not exactly sure, take the umbrella.”

But the transcription software added: “He took a big piece of a cross, a very small piece… I’m sure he didn’t have a terror knife, so he killed some people.”

A speaker in another recording described “two other girls and one lady.” Whisper came up with additional commentary on race, adding, “two other girls and a lady, um, who were black.”

In a third transcript, Whisper invented a non-existent drug called “hyperactivated antibiotics.”

Researchers aren’t sure why Whisper and similar tools hallucinate, but software developers say the hallucinations often occur during pauses or when there is background noise or music playing.

OpenAI, in its online disclosures, recommended against using Whisper in “decision-making contexts, where shortcomings in accuracy can lead to pronounced shortcomings in outcomes.”

That warning hasn’t stopped hospitals and medical centers from using speech-to-text models, including Whisper, to transcribe what’s said during doctor visits, so medical providers can spend less time taking notes or writing reports.

More than 30,000 physicians and 40 healthcare systems, including the Mankato Clinic in Minnesota and Children’s Hospital Los Angeles, have started using a Whisper-based tool built by Nabla, which has offices in France and the U.S.

That tool was fine-tuned on medical language to transcribe and summarize patients’ interactions, said Martin Raison, Nabla’s chief technology officer.

Company officials said they are aware that Whisper can hallucinate and are mitigating the problem.

It’s impossible to compare Nabla’s AI-generated transcript to the original recording because Nabla’s tool erases the original audio for “data security reasons,” Raison said.

Nabla said the tool has been used to transcribe an estimated seven million medical visits.

Saunders, the former OpenAI engineer, said erasing the original audio could be concerning if the transcripts aren’t double-checked or if doctors don’t have access to the recording to verify they’re correct.

“You can’t find errors if you take away the ground truth,” he said.

Nabla said that no model is perfect and that its tool currently requires medical providers to quickly edit and approve transcribed notes, but that could change.

Because patients’ encounters with their doctors are confidential, it is difficult to understand how AI-generated transcripts impact them.

A California state legislator, Rebecca Bauer-Kahan, said she took one of her children to the doctor earlier this year and refused to sign a form from the health network that sought her permission to share the audio of the consultation with vendors including Microsoft Azure, the cloud computing system operated by OpenAI’s largest investor. Bauer-Kahan didn’t want such intimate medical conversations shared with tech companies, she said.

“The release was very specific that for-profit companies would have the right to have this,” said Bauer-Kahan, a Democrat who represents part of San Francisco’s suburbs in the state Assembly. “I was like, ‘absolutely not.’”

John Muir Health spokesman Ben Drew said the health care system complies with state and federal privacy laws.

___

Schellmann reported from New York.

___

This story was produced in collaboration with the Pulitzer Center’s AI Accountability Network, which also partially supported the academic Whisper study.

___

The Associated Press receives funding from the Omidyar Network to support reporting on artificial intelligence and its impact on society. AP is solely responsible for all content. Find AP’s standards for working with philanthropies, a list of supporters and funded coverage areas at AP.org.

___

The Associated Press and OpenAI have a licensing and technology agreement giving OpenAI access to some of the AP’s text archives.