John Halamka on the risks and benefits of clinical LLMs

ORLANDO – At HIMSS24 on Tuesday, Dr. John Halamka, president of the Mayo Clinic Platform, offered a candid discussion of the substantial potential benefits – and the very real potential for harm – of both predictive and generative artificial intelligence used in clinical settings.

AI in healthcare has a credibility problem, he said – especially because the models so often lack transparency and accountability.

“Do you have any idea what training data went into the algorithm, predictive or generative, that you are using now?” Halamka asked. “Is the result of that predictive algorithm consistent and reliable? Has it been tested in a clinical trial?”

The goal, he said, is to come up with some strategies so that “the AI future we all want is as safe as we all need it to be.”

Of course, it starts with good data. And that is easier said than done.

“All algorithms are trained on data,” said Halamka. “And the data we use needs to be managed and normalized. We need to understand who collected it and for what purpose – and that part is actually quite difficult.”

For example, “I don’t know if any of you have actually studied the data integrity of your electronic health record systems, your databases and your institutions, but you will find that things like social determinants of health are poorly collected and poorly represented,” he explained. “They are sparse data, and it is possible that they do not really reflect reality. So if you use social determinants of health for any of these algorithms, chances are you’re going to get a very biased result.”
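To make that point concrete, the following minimal sketch (not from Halamka’s talk) shows how an organization might audit the sparsity of social-determinants-of-health fields in an EHR extract. The file name, column names and threshold are hypothetical.

```python
import pandas as pd

# Hypothetical EHR extract; the file and column names are illustrative, not a real schema.
ehr = pd.read_csv("ehr_extract.csv")

sdoh_fields = ["housing_status", "food_insecurity", "transportation_access", "education_level"]

# Fraction of patients missing a value in each SDOH field.
missingness = ehr[sdoh_fields].isna().mean().sort_values(ascending=False)
print(missingness)

# Flag fields too sparse to train on without risking a biased model.
TOO_SPARSE = 0.40  # arbitrary illustrative threshold
for field, frac in missingness.items():
    if frac > TOO_SPARSE:
        print(f"{field}: {frac:.0%} missing – treat as unreliable for modeling")
```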

More questions need to be answered: “Who is presenting that data to you? Your providers? Your patients? Does it come from telemetry? Does it come from automated systems that extract metadata from images?”

Once these questions have been answered satisfactorily, and you have ensured that the data has been collected extensively enough to develop the desired algorithm, all that remains is to identify potential biases and mitigate them. Easy enough, right?

“What are the multimodal data elements in the dataset you have? Patient registration data alone is probably not enough to create an AI model. Do you have things like text – the notes, the history and physical (exam), the operative note, the diagnostic information? Do you have images? Do you have telemetry? Do you have genomics? Digital pathology? That’s going to give you a sense of data depth – multiple different types of data, which will likely be used more and more as we develop algorithms that look beyond just structured and unstructured data.”

Then it’s time to think about data breadth. “How many patients do you have? I’ve spoken to several colleagues internationally who said, ‘We have a registry of 5,000 patients, and we’re going to develop AI on that registry.’ Well, 5,000 is probably not broad enough to give you a very resilient model.”

And what about “heterogeneity or dispersion?” Halamka asked. “Mayo has 11.2 million patients in Arizona, Florida, Minnesota and internationally. But does that provide representative data for France, or for a Scandinavian population?”
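As a rough illustration of those three checks – depth, breadth and dispersion – the sketch below profiles a hypothetical patient-level table. The modality flags, region column and file name are assumptions for illustration, not Mayo’s actual schema or criteria.

```python
import pandas as pd

# Hypothetical table: one row per patient, with boolean flags for each data
# modality and the region the record comes from. Illustrative only.
patients = pd.read_csv("patients.csv")

modality_flags = ["has_notes", "has_imaging", "has_telemetry", "has_genomics", "has_digital_path"]

# Depth: how many modalities are available per patient, on average.
depth = patients[modality_flags].sum(axis=1).mean()

# Breadth: how many patients are available at all.
breadth = len(patients)

# Dispersion: how the cohort is spread across regions.
dispersion = patients["region"].value_counts(normalize=True)

print(f"Average modalities per patient: {depth:.1f}")
print(f"Cohort size: {breadth}")
print("Regional mix:")
print(dispersion)

# A 5,000-patient, single-region registry would show up here as low breadth
# and a regional mix dominated by one entry.
```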

“Any data set from any institution will likely lack the distribution to create algorithms that can be applied globally,” Halamka said.

In fact, you could probably argue that no one can create an unbiased algorithm in one region that will work seamlessly in another.

What that means, he said, is that “you need a global network of federated participants who will help build models, test models and tune them locally if we want to deliver the AI outcome we want on a global basis.”

On that front, one of the biggest challenges is that “not every country in the world has fully digitized data,” said Halamka, who was recently in Davos, Switzerland for the World Economic Forum.

“Why haven’t we created a great AI model in Switzerland?” he asked. “Well, Switzerland has extremely good chocolate – and extremely bad electronic health records. And about 90% of Switzerland’s data is on paper.”

But even with good digitized data – and even after accounting for the depth, breadth and distribution of that data – there are other questions to consider. For example, what data should be included in the model?

“If you want a fair, appropriate, valid, effective and safe algorithm, should you use ethnicity as an input to your AI model? The answer is to be very careful about that, because it may well shape the model in a way you don’t want,” Halamka said.

“If there were some biological reason to have ethnicity as a data element, okay, maybe it would be useful. But if it’s really not related to a disease state or an outcome that you’re predicting, you’ll find – and I’m sure you’ve all read the literature on overtreatment, undertreatment and overdiagnosis – these kinds of problems. So you have to be very careful about how you decide to build the model and what data to include in it.”
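One way to pressure-test that decision is to compare a model trained with and without the sensitive field. The sketch below is an illustration of that comparison, assuming a hypothetical tabular cohort with an ethnicity column and a binary outcome label; it is not Mayo’s or CHAI’s methodology.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical cohort with a binary outcome, an ethnicity column, and otherwise
# clean numeric/categorical features (no missing values assumed).
df = pd.read_csv("cohort.csv")
y = df["outcome"]

def auc_for(features: pd.DataFrame) -> float:
    """Train a simple classifier and report AUC on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, y, test_size=0.3, random_state=0, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

with_ethnicity = pd.get_dummies(df.drop(columns=["outcome"]), columns=["ethnicity"])
without_ethnicity = pd.get_dummies(df.drop(columns=["outcome", "ethnicity"]))

print("AUC with ethnicity:   ", auc_for(with_ethnicity))
print("AUC without ethnicity:", auc_for(without_ethnicity))
# If performance is essentially unchanged without the field, ethnicity is adding
# little predictive signal while adding bias risk – the case Halamka warns about.
```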

Even more steps: “Once you have the model, you need to test it on data that is not part of the development set – which could be a segregated data set in your organization, or perhaps data from another organization in your region or around the world. And the question I’d like to ask all of you is: What do you measure? How do you evaluate a model to make sure it’s fair? What does it mean to be fair?”
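One concrete answer to “what do you measure” is to compare error rates across patient groups on that held-out data. The sketch below computes per-group true- and false-positive rates – an equalized-odds-style check – on a hypothetical file of predictions; the column names are assumptions.

```python
import pandas as pd

# Hypothetical held-out test set with model predictions already attached.
test = pd.read_csv("heldout_with_predictions.csv")  # columns: group, label, pred

def rates(frame: pd.DataFrame) -> pd.Series:
    """True- and false-positive rates for one patient group."""
    tp = ((frame.pred == 1) & (frame.label == 1)).sum()
    fp = ((frame.pred == 1) & (frame.label == 0)).sum()
    fn = ((frame.pred == 0) & (frame.label == 1)).sum()
    tn = ((frame.pred == 0) & (frame.label == 0)).sum()
    return pd.Series({
        "TPR": tp / (tp + fn) if tp + fn else float("nan"),
        "FPR": fp / (fp + tn) if fp + tn else float("nan"),
        "n": len(frame),
    })

# Large gaps in TPR or FPR between groups are one measurable definition of "unfair."
print(test.groupby("group").apply(rates))
```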

Halamka has been working for some time with the Coalition for Health AI, which was founded with the idea that “if we’re going to define what it means to be fair, effective or safe, we need to do it as a community.”

CHAI started with just six organizations. Today it has 1,500 members from around the world, including all major technology organizations, academic medical centers, regional healthcare payers, the pharmaceutical industry and government.

“You now have a public-private organization that is able to work as a community to define what it means to be fair, how to measure it, and what a testing and evaluation framework looks like – so that we can create data cards (what data went into the system and the model?) and model cards (how do they perform?).”

The fact is that any algorithm will have some kind of inherent bias, Halamka said.

That’s why “Mayo has an assurance lab, and we test commercial algorithms and proprietary algorithms,” he said. “And what you do is identify the biases and then mitigate them. Those can be mitigated by restricting the algorithm to certain types of data, or by just understanding that the algorithm cannot be completely fair to all patients – you just have to be extremely careful about where and how you use it.

“For example, Mayo has a beautiful cardiology algorithm that predicts cardiac mortality, and it has incredible positive predictive value for patients with a low body mass index – and really not good performance for patients with a high body mass index. So is it ethical to use that algorithm? Well, yes – on people whose body mass index is low. You just have to understand those biases and use it appropriately.”
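In code, that kind of assurance-lab check boils down to stratifying a metric such as positive predictive value by subgroup. The sketch below does so for hypothetical model output with a BMI column; the file, column names and BMI bands are illustrative, not Mayo’s actual tooling.

```python
import pandas as pd

# Hypothetical predictions from a cardiac-mortality model, with each patient's BMI.
results = pd.read_csv("cardiac_model_output.csv")  # columns: bmi, label, pred

# Stratify into BMI bands and compute positive predictive value in each.
results["bmi_band"] = pd.cut(results["bmi"], bins=[0, 25, 30, 100],
                             labels=["low/normal", "overweight", "obese"])

def ppv(frame: pd.DataFrame) -> float:
    """Of the patients the model flagged, how many actually had the outcome?"""
    flagged = frame[frame.pred == 1]
    return (flagged.label == 1).mean() if len(flagged) else float("nan")

print(results.groupby("bmi_band", observed=True).apply(ppv))
# A large PPV gap between bands is exactly the kind of bias an assurance lab
# documents, so the model is deployed only where it is known to perform well.
```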

Halamka noted that the Coalition for Health AI has created a comprehensive set of metrics, artifacts and processes – available at CoalitionforHealthAI.org. “They’re all free. They’re international. They’re downloadable.”

In the coming months, CHAI will “focus its attention on many generative AI topics,” he said, “because generative AI evaluation is more difficult.”

With predictive models, “I can understand what data came in, what data came out, and how it performs compared to the ground truth. Did you have the diagnosis or not? Was the recommendation used or useful?”

With generative AI, the technology itself may be perfectly well developed – but depending on the question you ask it, the answer can be accurate, or it can kill the patient.

Halamka gave a real example.

“We took one New England Journal of Medicine CPC case and gave it to a commercial generative AI product. The case stated: The patient is a 59-year-old with crushing substernal chest pain, shortness of breath – and radiation to the left leg.

“Now, for the doctors in the room, you know that radiation to the left leg is kind of strange. But remember: our generative AI systems are trained to look at language. And yes, they’ve seen that ‘radiation’ term in chest pain cases a thousand times.

“So you ask ChatGPT or Anthropic or whatever you use: What is the diagnosis? The diagnosis came back: ‘This patient is having a myocardial infarction. Anticoagulation needs to be started immediately.’

“But then ask a different question: ‘Which diagnosis should I not miss?'”

To that question, the AI responded, “Oh, don’t miss the dissecting aortic aneurysm – and of course the pain in the left leg,” Halamka said. “In this case it was a dissecting aortic aneurysm, where anticoagulation would have killed the patient immediately.

“So there you go. If you have a product that, depending on the question you ask, either gives you a wonderful piece of guidance or kills the patient – that’s not what I would call a very reliable product. You have to be extremely careful.”
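The anecdote suggests a simple probe any organization can run before deployment: give the same vignette to a model under differently framed questions and compare the answers. A minimal sketch follows, assuming a hypothetical ask_llm helper that stands in for whichever chat-model API you actually use.

```python
# A prompt-sensitivity probe in the spirit of Halamka's anecdote: feed the same
# clinical vignette to a model under differently framed questions and compare
# the answers. `ask_llm` is a hypothetical placeholder, not a real API.

CASE = (
    "The patient is a 59-year-old with crushing substernal chest pain, "
    "shortness of breath, and radiation to the left leg."
)

FRAMINGS = [
    "What is the most likely diagnosis?",
    "Which diagnosis must I not miss?",
    "What treatment would be dangerous if the leading alternative diagnosis is wrong?",
]

def ask_llm(prompt: str) -> str:
    """Placeholder: call your model of choice here (ChatGPT, Claude, a local model...)."""
    raise NotImplementedError

for question in FRAMINGS:
    answer = ask_llm(f"{CASE}\n\n{question}")
    print(f"Q: {question}\nA: {answer}\n")
# If the answers diverge on a safety-critical point (MI vs. aortic dissection),
# the product is not reliable enough for unsupervised clinical use.
```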

At the Mayo Clinic, “we took a lot of risks,” he said. “We figured out how to de-identify data and keep it safe, how to generate models, how to build an international coalition of organizations, how to do the validation, and how to do the implementation.”

Of course, not every healthcare system is as advanced and well-equipped as Mayo.

“But I hope that as you all are on your AI journey – predictive and generative – you can take some of the lessons that we’ve learned, take some of the artifacts that are freely available from the Coalition for Health AI, and be able to build a virtuous lifecycle in your own organization so that we get the benefits of all this AI that we need, without harming the patient,” he said.
