LLMs aren’t ready to automate clinical coding, says Mount Sinai research

A new study from Mount Sinai suggests that using generative artificial intelligence to aid in coding automation has some significant limitations.

WHY IT MATTERS

For the research, Mount Sinai’s Icahn School of Medicine evaluated the potential application of large language models in healthcare to automate the assignment of medical codes from clinical text for reimbursement and research purposes.

The study compared LLMs from OpenAI, Google, and Meta to assess whether they could effectively match appropriate medical codes with their corresponding official text descriptions.

To assess and benchmark the performance of GPT-3.5, GPT-4, Gemini Pro and Llama2-70b, researchers extracted more than 27,000 unique diagnosis and procedure codes from 12 months of routine care in the Mount Sinai Health System, excluding patient data.
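To make the evaluation concrete, the sketch below shows one way an exact-match coding benchmark of this kind could be scored. It is a minimal illustration, not the study’s actual protocol; query_model is a hypothetical stand-in for whatever LLM API is under test, and the sample entry is illustrative only.

```python
# Minimal sketch of an exact-match medical-coding benchmark (hypothetical).
# `query_model` is a placeholder for any LLM call (e.g., GPT-4, Gemini Pro,
# Llama2-70b); it is not part of the Mount Sinai study's code.

def query_model(prompt: str) -> str:
    """Placeholder for an LLM API call to the model under test."""
    raise NotImplementedError("Wire this up to the model under test.")

def exact_match_rate(codes: dict[str, str]) -> float:
    """Score how often the model returns the exact expected code.

    `codes` maps an official text description to its expected code,
    e.g., an ICD-10-CM description -> "E11.9".
    """
    hits = 0
    for description, expected_code in codes.items():
        prompt = f"Return only the billing code for: {description}"
        predicted = query_model(prompt).strip()
        hits += (predicted == expected_code)  # count exact matches only
    return hits / len(codes)

# Example usage (hypothetical entry):
# sample = {"Type 2 diabetes mellitus without complications": "E11.9"}
# print(exact_match_rate(sample))
```

Under an exact-match criterion like this, a near-miss code counts as a full error, which is one reason accuracy figures for code generation can look so low.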

“Previous studies indicate that newer large language models struggle with numerical tasks,” said Dr. Eyal Klang, director of Icahn Mount Sinai’s Data-Driven and Digital Medicine Generative AI Research Program and senior co-author of the study, in an announcement last week.

“However, the degree of accuracy in assigning medical codes from clinical texts had not been thoroughly investigated in different models.”

When assessing whether the four available models could effectively match medical codes via qualitative and quantitative methods, the researchers determined that all LLMs scored less than 50% accuracy in generating unique diagnosis and procedure codes.

Although GPT-4 performed best in the study, with the highest exact-match rates for ICD-9-CM (45.9%), ICD-10-CM (33.9%) and CPT codes (49.8%), its errors remained “unacceptably large.”

The researchers said GPT-4 produced the most incorrectly generated codes, while GPT-3.5 showed the greatest tendency toward vagueness, identifying general rather than precise codes.

The research results, published last week in the New England Journal of Medicine AI, led the researchers to warn that the performance of LLMs in medical coding could be even worse in real-world settings.

“LLMs are not suitable for use in medical coding tasks without additional research,” the researchers said in the report.

“Although AI has great potential, it must be approached with caution and continued development to ensure its reliability and efficacy in healthcare,” Dr. Ali Soroush, assistant professor of D3M and medicine, warned in a statement.

Mount Sinai noted that the researchers will seek to develop customized LLM tools for accurate medical data extraction and billing code assignment.

THE BIG TREND

Despite the Mount Sinai study’s findings, others see value in AI-based coding, saying AI systems can help physician groups avoid missing revenue opportunities and improve their documentation compliance.

“As annual coding requirements are implemented, an AI-based system will integrate and implement these changes in real time,” Dr. Bruce Cohen, a surgeon and former CEO at OrthoCarolina in Charlotte, North Carolina, told Healthcare IT News.

AI-based systems don’t eliminate coders’ jobs, he added: “It increases the oversight and accuracy of every payout based on evaluation and management coding.”

ON THE RECORD

“Our findings underscore the critical need for rigorous evaluation and refinement before deploying AI technologies in sensitive operational areas such as medical coding,” Soroush said in a statement about the Mount Sinai study.

“This study sheds light on the current opportunities and challenges of AI in healthcare, highlighting the need for careful consideration and additional refinement prior to widespread adoption,” said Dr. Girish Nadkarni, director of the Charles Bronfman Institute for Personalized Medicine and system head of D3M.

Andrea Fox is editor-in-chief of Healthcare IT News.
Email: afox@himss.org

Healthcare IT News is a HIMSS Media publication.