Latest News

MILO Helping Artificial Intelligence Reach Its Diagnostic Potential

By Deborah Borfitz 

August 20, 2025 | WASHINGTON, D.C.—Efforts to couple patient-level data with machine learning (ML) are going to accelerate “100-plus-fold” over the next decade, dramatically improving evidence-based decision-making and demonstrably improving patient care. In the diagnostics field, supervised learning that leverages classification represents the bulk of this work currently and may even suggest the algorithms best suited to the task at hand, according to Hooman Rashidi, M.D., professor and associate dean of AI in medicine at the University of Pittsburgh Medical Center. 

Rashidi was speaking about the on-premises, auto-ML framework of MILO (Machine Intelligence Learning Optimizer) during his opening keynote yesterday at the Next Generation Dx Summit about guidelines for using artificial intelligence (AI) and ML in point-of-care (POC) testing. He is co-developer of the MILO platform, designed for automating the process of building and deploying machine learning models, which has been successfully validated and licensed by the University of California to several industry partners.  

“By default, we tell people not to make any assumptions about what algorithms [e.g., neural network, logistic regression, or K-nearest neighbors] are the best,” he says. “[MILO] is almost like a gladiator stadium where you let them fight for your data or your study and then you figure out which pipeline is the best.”  

Users need only follow a simple, four-step process of uploading their data, picking their target, evaluating it, and building their models to see how those algorithms are performing, says Rashidi. “If you can attach a file into an email, you can work with this.” 
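To make the “gladiator stadium” idea concrete, a minimal sketch of the kind of head-to-head comparison MILO automates might look like the following, using scikit-learn. The file name, target column, and choice of three candidate algorithms are illustrative assumptions, not MILO’s actual internals or API.

```python
# Hypothetical head-to-head comparison of candidate algorithms, in the
# spirit of MILO's "gladiator stadium." Not MILO's actual code.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("study_data.csv")                 # hypothetical uploaded file
X, y = df.drop(columns=["target"]), df["target"]   # "target" column is assumed

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(),
    "neural_network": MLPClassifier(max_iter=1000),
}

# Let the algorithms "fight" on the same cross-validation folds
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```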

In just over 30 minutes, Rashidi walked the audience through the paces and showed how MILO works “under the hood” in real studies and practice. He also showcased generative AI as a significant solution for educating and training users of diagnostic devices powered by AI and ML, whether or not they are used at the point of care. 

Not ‘One Big Thing’

One of the biggest challenges when talking about AI with colleagues, administrators, and regulators has been the misperception that it is “one big thing,” says Rashidi, although people are now starting to understand that AI represents distinct entities. His talk focused on “narrow artificial intelligence,” where, as he puts it, “we teach the machine ... the machine doesn’t think for itself.” This stands in contrast to the artificial general intelligence associated with popular tools like ChatGPT and the hypothetical artificial superintelligence that surpasses human intelligence in all respects.   

Rashidi homed in on non-generative AI and tasks around “big classes” such as positive versus negative sepsis and different grades of cancer. In addition to predictive models, examples in this space include decision support tools using rule-based approaches as well as more elaborate models for resource automation, he says.  

Major differences exist between AI models for POC and non-POC devices, Rashidi shares. Notably, the setting of use for POC tests is more variable with more non-specialist operators, possibly patients, reading the results. The potential for error with POC devices also requires that the algorithms be “robust and interpretable for a much larger audience ... [than] within a very controlled laboratory setting.”  

The regulatory pathway to market ranges from a low-risk CLIA (Clinical Laboratory Improvement Amendments) waiver to a medium-risk 510(k) or a higher-risk De Novo or Premarket Approval from the U.S. Food and Drug Administration (FDA). But the end game in all cases is an AI tool whose benefits outweigh the risks it produces, he continues. 

Many people are trying to apply best practices and implementation guidelines around data collection, usability and human factors, cybersecurity and privacy, and post-market surveillance that monitors performance at the level of the impacted end user, says Rashidi. Among the most recent of these are the FDA-recommended “predetermined change control plans” (PCCPs), which establish in advance how an AI model will be retrained in response to data drift. 
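As a rough illustration of what a PCCP-style drift check might codify, a hedged sketch follows; the Kolmogorov–Smirnov test, the 0.05 threshold, and the simulated values are illustrative assumptions rather than FDA guidance or anything described at the talk.

```python
# Illustrative drift monitor of the kind a predetermined change control
# plan (PCCP) might specify: compare the live distribution of a monitored
# input against its training baseline and flag retraining on divergence.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_values = rng.normal(loc=1.0, scale=0.20, size=5000)  # training baseline
live_values = rng.normal(loc=1.3, scale=0.25, size=500)       # post-deployment stream

statistic, p_value = ks_2samp(training_values, live_values)
if p_value < 0.05:                                # illustrative threshold
    print(f"Drift detected (KS p = {p_value:.4f}); trigger PCCP retraining workflow")
else:
    print("No significant drift; continue monitoring")
```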

AI tools are available to ensure embedded algorithms in diagnostic devices adhere to some of these guidelines better and quicker than ever before, he says. His presentation described two such tools available on MILO (free to use for educational purposes), one for data pre-processing and the other for training and educating users. 

Data Preprocessing

For purposes of cleaning and preparing data for analysis, the goals are to ensure integrity, completeness, value, and reproducibility of the data. The best way to do that currently is in a non-automated way with a team of people, which can easily take weeks, Rashidi says.  

With MILO’s data preprocessing tool, what he terms a “virtual machine learning statistics software engineering team,” the job can be done in a matter of minutes. Rashidi’s example was a synthetic breast cancer dataset with numerous missing values, some columns containing text that needed converting to numeric values, and multicollinearity issues. 

The clean-up process with the MILO app keeps a human in the loop through a simple seven-step process, he explains, which in this demonstration began with uploading 697 cases across 11 columns. The tool indicates whether the missing values are coming from the cancer or non-cancer class and knows how to convert the textual tags for cancer to numerical values (e.g., negatives becoming 0s and positives 1s). 

Moreover, in his example, the tool objectively revealed how values for nuclei were disproportionately contributing to data missingness, sparing a human from painstakingly going through an Excel spreadsheet to make this discovery before removing the column of data. “The machine is helping you, but you are still driving it,” says Rashidi. “You don’t have to accept its recommendations.” 
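A minimal pandas sketch of the steps just described, reporting missingness by class, recoding text labels to 0s and 1s, and dropping the feature driving most of the missingness, might look like this; the file and column names are hypothetical stand-ins, not MILO’s interface.

```python
# Hedged sketch of the preprocessing steps described above; the file and
# the "diagnosis" column name are assumptions.
import pandas as pd

df = pd.read_csv("breast_cancer_synthetic.csv")   # hypothetical 697 x 11 file

# Which class are the missing values coming from?
print(df.isna().groupby(df["diagnosis"]).sum())

# Convert textual tags to numeric values (negatives -> 0, positives -> 1)
df["diagnosis"] = df["diagnosis"].map({"negative": 0, "positive": 1})

# If one feature drives most of the missingness, a human in the loop can
# accept (or override) the suggestion to remove that column
missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share)
df = df.drop(columns=[missing_share.index[0]])
```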

Importantly, the data preprocessing app recognizes columns with a text value such as risk and can automatically do that coding correctly, he continues. Many people just assign the low-, intermediate-, and high-risk categories a numeric 1, 2, or 3, which is “a big no-no.”   
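To illustrate the “big no-no,” here is a hedged sketch contrasting naive integer coding with one-hot encoding, one common alternative; the “risk” column is assumed from the example above, and MILO may handle this differently.

```python
# Integer codes impose an equal-spacing assumption (high = 3x low) that
# many models take literally; one-hot indicator columns avoid it.
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "intermediate", "low"]})

# Naive approach (the "big no-no")
naive = df["risk"].map({"low": 1, "intermediate": 2, "high": 3})

# One common alternative: one indicator column per category
encoded = pd.get_dummies(df["risk"], prefix="risk")
print(encoded)
```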

It is also common practice for people to start off with a single dataset for both model training and validation, Rashidi adds, and the MILO tool can prevent this deviation from machine learning best practices. “The process will not allow the synthetic data to be in the ‘generalization test set,’” he says, referring to a separate portion of the dataset used to evaluate how well a trained model performs on unseen data.  
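A sketch of that guardrail, under the assumption that “synthetic data” refers to oversampled rows such as those SMOTE generates: split off the generalization test set first, then create synthetic rows from the training partition only. MILO’s internals may differ.

```python
# Hedged sketch: the generalization test set is carved out before any
# synthetic rows exist, so none can leak into it. SMOTE (imbalanced-learn)
# is one common way such rows are generated.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Synthetic minority-class rows are built from the training partition only;
# X_test / y_test stay untouched for the final, unseen-data evaluation
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```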

With the MILO app, users can additionally run a multicollinearity assessment to eliminate features that are effectively repetitive, such as hemoglobin and hematocrit with a 99% correlation, he says. Similarly, the shape uniformity of two features, reflected in their R-value, could allow one of them to be removed for a cleaner dataset. 
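A minimal sketch of that correlation-based pruning, with hemoglobin and hematocrit simulated as near-duplicates; the 0.95 cutoff is an illustrative assumption.

```python
# Drop one of any pair of features that track each other almost perfectly
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hgb = rng.normal(14, 1.5, 300)
df = pd.DataFrame({
    "hemoglobin": hgb,
    "hematocrit": hgb * 3 + rng.normal(0, 0.3, 300),  # near-duplicate signal
    "wbc": rng.normal(7, 2, 300),
})

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Dropping:", to_drop)                 # expected: ['hematocrit']
df_clean = df.drop(columns=to_drop)
```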

Once that single file has been turned into two separate files, notes Rashidi, the data is ready for ML modeling. The tool is fully transparent about what has been done where and provides a full audit file “in case you want to see what was imputed, what was removed, [and] what was added.”  

‘Wheel of Pipelines’ 

Using MILO’s patented auto-ML framework, building a machine learning model on the cleaned-up data is “super-fast,” says Rashidi. Here, his example was its use in building a sepsis predictor. Two datasets of various lab and clinical findings from a 204-patient burn sepsis study were uploaded into the tool. 

The initial training and validation datasets could be mapped to the sepsis-positive and -negative cases by doing basic “data science 101 with univariate analysis, clinically and mathematically,” he says. It was then on to the final steps of building, training, and validating the ML models. 

This creates a “wheel of pipelines” with hundreds of individual spokes representing optimized ML models all running at the same time over the course of a few hours, explains Rashidi. In the end, MILO generates a few hundred thousand models and by default highlights those balancing accuracy with the highest sensitivity, since “a lot of people are looking for screening tools.” But users can pick whatever metric fits their needs, which might instead be the highest specificity, F1 score (for classification tasks), or Brier score (for probabilistic forecasts). The granular-level details might let users know, for example, that the best model used only half of the features brought in, he says. They can download the entire Excel table for that less-noisy model to see if it still makes sense clinically, or they can retrain the model themselves. 
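As a hedged sketch of that metric-driven selection, two stand-in models below are compared on sensitivity, F1, and Brier score using scikit-learn; MILO’s actual search space is of course far larger, and the dataset here is synthetic.

```python
# Two candidate models stand in for MILO's hundreds of thousands; the point
# is choosing a winner by the metric that fits the use case.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, f1_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("forest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    prob = model.predict_proba(X_te)[:, 1]   # probability of the positive class
    print(name,
          "sensitivity:", round(recall_score(y_te, pred), 3),
          "F1:", round(f1_score(y_te, pred), 3),
          "Brier:", round(brier_score_loss(y_te, prob), 3))
```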

Testing of the model could begin right away within an information management system or point-of-care device or platform, says Rashidi. In a clinical trial context, batch testing could also be done in parallel with the interpretation of patient test results. “These can be auto-filled for you if it’s uploaded as a model within your framework and then you basically now make the prediction.” 

Building on Success 

MILO has been successfully used in an acute kidney injury (AKI) study on the burn sepsis population, Rashidi shares. AKI has traditionally been diagnosed based on criteria in the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines, which rely on changes in creatinine levels and urine output that can take days to identify. Moreover, sensitivities were found to be “really horrible,” which has led to a pivot to greater reliance on the neutrophil gelatinase-associated lipocalin (NGAL) biomarker in Europe and the U.S.   

The big question then became whether NGAL’s diagnostic accuracy, which was in the 80s, could be improved with machine learning. Once it was incorporated with the B-type natriuretic peptide (BNP) biomarker, creatinine, and urine output, “sensitivities and accuracies shot up the roof, into the 90s,” he says, and in follow-up studies Rashidi and his colleagues were also able to show similar improvement in burn-trauma cases. 

Not only did ML drastically improve diagnostic precision for AKI, but since NGAL is part of the process, serial measures of creatinine and urine outputs are no longer required. The combination gets to answers in a fraction of the time, in line with POC approaches as shown in a subsequent, multicenter follow-up study. 

Rashidi is particularly enthusiastic about the education and training capabilities of MILO for users of AI-enabled diagnostic devices. But it could find just as much utility in helping people prepare for inspections in terms of knowing where the laboratory is out of compliance and the type of questions regulators may be asking. 

Custom Local Option 

Revolutionary, text-based large language models (LLMs), most notably general chatbots like ChatGPT and Gemini, have “major limitations,” he adds. “If you brought in your procedures that were never accessible to the internet, ... how would they be able to give you good, accurate results if your stuff was never available ... for them to learn from?” 

“API-plus” machine learning is being promoted as a cost-effective way to adopt and integrate AI capabilities into various applications, but “once people get hooked ... API costs will shoot through the roof,” he warns. Rashidi is therefore a fan of a hybrid approach in which vendor partners are sought for activities with an ROI. For everything else, open-source or “home-brewed” versions will do just fine. 

The big worries beyond the direct cost are cybersecurity and data usage on the internet, says Rashidi, limitations he believes can be overcome by keeping those customized LLMs local. Two practical approaches are “fine-tuning” models and using a technique called “retrieval-augmented generation” (RAG), or a combination of the two. 
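A minimal retrieval-augmented generation sketch follows, with TF-IDF standing in for a proper embedding model and a hypothetical `local_llm_generate` call representing the on-prem model; none of this reflects the actual implementation of Rashidi’s framework.

```python
# Hedged RAG sketch: retrieve the most relevant local document and ground
# the prompt in it before calling an on-prem LLM (the call is hypothetical).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Paper 1: blast-count criteria for accelerated-phase disease ...",
    "Paper 2: bone marrow biopsy interpretation guidelines ...",
]  # stand-ins for locally ingested papers

question = "Is this chronic phase or accelerated phase?"

vectorizer = TfidfVectorizer().fit(documents + [question])
doc_vecs = vectorizer.transform(documents)
q_vec = vectorizer.transform([question])

best = cosine_similarity(q_vec, doc_vecs).argmax()  # most relevant source
prompt = f"Answer using only this source:\n{documents[best]}\n\nQ: {question}"
# answer = local_llm_generate(prompt)   # hypothetical on-prem model call
```

Because the answer is grounded in a named local document, the model can also report which source it drew from, which is what builds end-user trust in the bone marrow example above.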

His solution is something termed “PIH-GPT-Plus,” a custom local LLM framework. “Just like MILO, it is fully on-prem, so there is nothing that is going to the cloud.” 

The framework was used to ingest 24 papers related to bone marrow biopsies for diagnosing leukemia and lymphoma. As part of the exercise, Rashidi identified his negative and positive controls by asking the model about something it initially knew nothing about but was subsequently trained to know: the best soccer players in the world. 

The model was then asked a clinical question, specifically whether someone had chronic-phase or accelerated-phase myeloid leukemia based on criteria found in the 24 papers, along with which of the documents the information was pulled from. This increases trust in the model among end users and creates a memory in the model regarding progression from one disease phase to the next based on the sign (a greater number of myeloblasts) indicative of more aggressive disease. 

The Fundamentals 

All told, the benefits of MILO include reduced hallucinations and thus greater accuracy of the platform, says Rashidi. That also makes it more cost-effective and scalable institutionally while offering better data security. 

Regardless of which AI platform is used, he adds, keep in mind that many end users are not going to be familiar with major performance standards for AI. These include terms such as BLEU (bilingual evaluation understudy) and ROUGE (recall-oriented understudy for gisting evaluation). “It is not as straightforward as a classification model that’s following a confusion-matrix-based performance measure.” 
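For intuition, a toy sketch of the n-gram-overlap idea behind BLEU follows (ROUGE is its recall-oriented cousin); this hand-rolled unigram precision is for illustration only, not a faithful BLEU implementation, and the example sentences are invented.

```python
# Toy unigram precision: what fraction of the candidate's words appear in
# the reference, with repeat counts clipped (the core idea behind BLEU)
from collections import Counter

reference = "the biopsy shows accelerated phase disease".split()
candidate = "biopsy shows accelerated disease".split()

ref_counts = Counter(reference)
cand_counts = Counter(candidate)
overlap = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
print(f"Unigram precision: {overlap / len(candidate):.2f}")   # 4/4 = 1.00
```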

Understanding AI requires a fundamental knowledge of its application in healthcare, says Rashidi. To that end, he and his colleagues recently published a seven-part AI review series featuring contributions from more than 40 global experts working at the intersection of AI and medicine (Modern Pathology, DOI: 10.1016/j.modpat.2024.100673).
