By Deborah Borfitz
October 16, 2019 | At the 2019 Next Generation Dx Summit in August, a staff fellow with the U.S. Food and Drug Administration (FDA) shared the agency's approach to evaluating medical devices enabled by "adaptive" artificial intelligence (AI) and machine learning (ML)—meaning they continuously learn as they're exposed to more data. He also described an intriguing approach to reusing test data to assess the performance of AI-powered devices.
Alexej Gossmann, PhD, sits in the Division of Imaging, Diagnostics and Software Reliability, Office of Science and Engineering Laboratories, of the FDA's Center for Devices and Radiological Health. His talk highlighted performance assessment, unintended confounding variables and provider oversight issues associated with medical devices powered by adaptive machine learning algorithms.
AI and ML are currently employed primarily in imaging, using algorithms that are "locked" prior to deployment with all substantive updates or modifications handled as a new submission, Gossmann says. But algorithms are under development in multiple other areas, and the potential applications include devices designed to tailor treatment, triage patients or predict their treatment outcomes, and assist with population-wide health research and drug development.
Even in the 1990s, the FDA was seeing a small number of medical devices using AI and ML, he says. A post-2014 "deep learning explosion," together with advances in GPU computing and machine learning architectures outside the medical domain, set the stage for many new de novo devices in medical imaging over the last two years.
The challenges in bringing AI/ML-enabled medical devices to market are many, including obtaining high-quality training data that are well curated, have reliable ground truth annotations and are representative of target patient populations, says Gossmann. This is in addition to more general challenges such as ensuring clinical tasks and objectives are relevant and will improve patient outcomes.
Test data used to validate the model also need to be appropriate, he continues. Retrospective data may be OK for a device that analyzes and processes images, for example, but not necessarily for one that makes a patient diagnosis. Additionally, performance metrics should be interpretable and indicative of a clinical outcome.
There can be any number of unintended confounding variables, including patient characteristics and the device operator, Gossmann says. Issues of explainability, transparency, bias, privacy, ethics, trust, and tradeoffs between accuracy and fairness also need to be addressed.
Users and patients may need to be educated about the AI tool, Gossmann adds. Then there is the question of whether the device will disrupt clinical workflow or people will passively rely on it too much. And how will ongoing monitoring of the device be handled if it is perpetually changing?
Risks and Rewards
The advantages of adaptive AI/ML algorithms, Gossmann says, include an ever-enlarging dataset for algorithm training, the ability to adapt to different operational environments and individual patients, and the capacity to accept new types of inputs, all resulting in improved outputs. The potential risks are unintended and undetected degradation of performance over time, creating a mismatch between true and reported performance, and incompatibility of results with other software.
A white paper recently issued by the FDA proposes a regulatory framework for AI/ML-based software as a medical device (SaMD) that would avoid unnecessary premarket submissions, Gossmann says. The agency is recommending an option that would allow manufacturers to submit a plan for software modifications during initial premarket review.
The proposed framework consists of SaMD pre-specifications covering anticipated changes to performance, inputs or intended use; an algorithm change protocol detailing the data and procedures to be followed so that the modification achieves its goals and the device remains safe and effective; and good machine learning practices, which would include the use of relevant clinical data and appropriate separation between training, tuning and test datasets.
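The separation between training, tuning and test datasets that good machine learning practice calls for can be illustrated in a few lines. The sketch below is a minimal, hypothetical example, not an FDA-specified procedure; splitting on patient IDs (rather than on individual images or records) is one common way to keep all data from a single patient in one partition and avoid leakage across sets.

```python
import numpy as np

def split_patients(patient_ids, frac_train=0.6, frac_tune=0.2, seed=0):
    """Randomly partition unique patient IDs into train/tune/test sets.

    Splitting at the patient level keeps every record from one patient
    in a single partition, preserving the separation between training,
    tuning and test data.
    """
    ids = np.unique(patient_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n_train = int(frac_train * len(ids))
    n_tune = int(frac_tune * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_tune], ids[n_train + n_tune:]
```

A device sponsor would then train only on records whose patient IDs fall in the first partition, tune hyperparameters on the second, and report performance on the third.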
Test Data Reuse
Each algorithm performance evaluation must be "independent from all data the algorithm has ever seen before," says Gossmann. That's problematic in medicine, where high-quality datasets with reference truth are scarce and collecting new ones may be impractical or even unethical, making it tempting to reuse a previously used test dataset.
As Gossmann's own research recently found, repeated interrogation of test datasets will eventually result in overfitting the algorithm to the test data. But the previously described Thresholdout method can be successfully used to hide test data from an adaptive machine learning algorithm—which he describes as "randomization in a tricky way"—to get strong statistical guarantees.
The method allows test data reuse without compromising algorithm performance evaluations, Gossmann says. The gap between reported and true AUC (area under the curve) is also much narrower when Thresholdout rather than an unrestricted method is used, he notes; with unrestricted reuse, "the classifier learns the effect of local noise in the test data."
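The core of Thresholdout (introduced by Dwork and colleagues in the adaptive data analysis literature) can be sketched briefly. The idea behind the "randomization in a tricky way" is that each query about holdout performance is answered through a noisy comparison with the training estimate, limiting how much information leaks about the test set. The parameter values below are illustrative assumptions, not figures from Gossmann's talk:

```python
import numpy as np

def thresholdout(train_score, holdout_score, threshold=0.04, sigma=0.01, rng=None):
    """Answer one query about holdout performance, Thresholdout-style.

    If the training and holdout estimates agree to within the threshold
    (perturbed by Laplace noise), report the training estimate, revealing
    nothing new about the holdout set. Only when they disagree is a
    noisy version of the holdout estimate released.
    """
    rng = np.random.default_rng() if rng is None else rng
    if abs(train_score - holdout_score) > threshold + rng.laplace(0, sigma):
        return holdout_score + rng.laplace(0, sigma)
    return train_score
```

In practice an analyst would route every evaluation of a candidate model through this mechanism rather than reading the test-set score directly, which is what yields the statistical guarantees against overfitting to the test data.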
Gossmann says he is now working on a methodology that could be useful in assessing whether a new test—e.g., a single human opinion or diagnostic test—meets a pre-specified performance goal or is superior in sensitivity and specificity to test datasets with ground truth annotation. The research goal is a model-free approach that relies only on basic, reasonable assumptions about the dependency between the outcomes of the two tests.