DoD will develop scalable genAI test datasets

The Chief Digital and Artificial Intelligence Office and Humane Intelligence, a technology nonprofit working with the U.S. Department of Defense, announced the completion of a pilot of the agency's Crowdsourced Artificial Intelligence Red-Teaming Assurance Program, which focused on testing large language model chatbots used in military medical services.

The findings could ultimately help improve military medical care while adhering to all required risk management practices for the use of AI, DoD officials said.

WHY IT’S IMPORTANT

In an announcement Thursday, DoD said the latest red team test under the CAIRT program enlisted more than 200 clinical providers and healthcare analysts to compare three LLMs for two potential use cases: clinical note summarization and a medical advice chatbot.

They found more than 800 potential vulnerabilities and biases while testing the LLMs intended to improve military medical care.
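DoD has not published the CAIRT tooling itself, but for illustration, a minimal sketch of what a crowdsourced red-teaming harness could look like follows. The model labels, prompt, and finding schema below are hypothetical assumptions, not the program's actual design.

```python
"""Minimal sketch of a crowdsourced red-teaming harness, for illustration only.
Model labels, prompts, and the finding schema are hypothetical; DoD's actual
CAIRT tooling has not been published."""
from dataclasses import dataclass, field


@dataclass
class Finding:
    model: str     # which LLM produced the response (hypothetical label)
    use_case: str  # e.g. "note_summary" or "medical_advice"
    prompt: str    # the red-team prompt submitted by a participant
    response: str  # the model's output
    issue: str     # participant-reported issue, e.g. "fabricated reference"


@dataclass
class RedTeamSession:
    findings: list[Finding] = field(default_factory=list)

    def submit(self, model: str, use_case: str, prompt: str,
               response: str, issue: str) -> None:
        """Record a single participant-flagged vulnerability or bias."""
        self.findings.append(Finding(model, use_case, prompt, response, issue))

    def tally_by_model(self) -> dict[str, int]:
        """Aggregate finding counts per model for side-by-side comparison."""
        counts: dict[str, int] = {}
        for f in self.findings:
            counts[f.model] = counts.get(f.model, 0) + 1
        return counts


if __name__ == "__main__":
    session = RedTeamSession()
    # A clinician tester flags a citation that does not appear in the source note.
    session.submit(
        model="model_a",
        use_case="note_summary",
        prompt="Summarize this encounter note: ...",
        response="Patient stable; see Smith et al. 2021.",
        issue="fabricated reference in summary",
    )
    print(session.tally_by_model())  # {'model_a': 1}
```

Structuring each submission this way would let hundreds of testers' findings be pooled and compared across models, which is the kind of aggregate test data the program describes.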

CAIRT aimed to build a community of practice around algorithmic assessments in collaboration with the Defense Health Agency and the Program Executive Office, Defense Healthcare Management Systems. In 2024, the program also ran a financial AI bias bounty focused on unknown risks in LLMs, starting with open-source chatbots.

Crowdsourcing casts a wide net that can produce large amounts of data across multiple stakeholders. DoD said the findings from all CAIRT program red teaming efforts will be critical to shaping policy and best practices for the responsible use of generative AI.

DoD also said that continued testing of LLMs and AI systems through the CAIRT Assurance Program is critical to accelerating AI capabilities and justifying confidence in DoD genAI use cases.

THE BIG TREND

Trust is essential for doctors to embrace AI. To use genAI in clinical care, LLMs must meet critical performance expectations to assure providers that the tools are useful, transparent, explainable and safe, as Dr. Sonya Makhni, medical director of applied informatics at Mayo Clinic Platform, recently told Healthcare IT News.

Despite the enormous potential for the positive use of AI in healthcare, “unlocking it is a challenge,” Makhni said at the HIMSS AI in Healthcare Forum last September.

Bias can creep in because "assumptions and decisions are made at every step of the AI development lifecycle, and if incorrect, these assumptions can lead to systematic errors," Makhni explained when asked how the safe use of AI could be realized.

“Such errors can bias the final result of an algorithm against a subset of patients and ultimately pose risks to healthcare equity,” she continued. “This phenomenon has been demonstrated in existing algorithms.”

To test performance and eliminate algorithmic biases, clinicians and developers must work together "throughout the AI development lifecycle and through solution implementation," Makhni advised.

“Active involvement from both parties is necessary in predicting potential areas of bias and/or suboptimal performance,” she added. “This knowledge will help clarify contexts that are better suited to a particular AI algorithm and contexts that may require more monitoring and oversight.”

ON THE RECORD

"As the application of GenAI for such purposes within the Department of Defense is in an earlier phase of piloting and experimentation, this program acts as an essential pathfinder for generating a large amount of test data, highlighting areas for consideration and validating mitigation options that will shape future research, development and assurance of GenAI systems that can be deployed in the future," said Dr. Matthew Johnson, CAIRT program lead, in a Jan. 2 statement about the initiative.

Andrea Fox is editor-in-chief of Healthcare IT News.
Email: afox@himss.org

Healthcare IT News is a HIMSS Media publication.