August 14, 2025 | With the advent of large language models (LLMs), expert human abstractors are meeting their match when it comes to accurately identifying cancer progression events from electronic health records (EHRs). That’s one of the big surprises emerging from a study presented by Aaron Cohen, M.D., a practicing oncologist and head of oncology research at Flatiron Health, at a recent artificial intelligence and machine learning conference of the American Association for Cancer Research (AACR).
Across 14 cancer types, with inherent differences in how progression gets documented and plays out for patients, a Claude-based LLM produced real-world progression-free survival estimates nearly identical to those of human abstractors, he reports. “We were able to take 10 years of accumulated knowledge and use it to effectively prompt an LLM to do the same things as humans.”
The know-how was picked up from human abstractors trained by Flatiron to look through patient charts to find instances of cancer progression and their associated dates. They’re guided by policies and procedures the company has created and iterated on over time to ensure a consistent, high-quality approach to data abstraction, as well as to address real-world “edge cases” that fall outside expected data patterns, minimizing the likelihood of errors.
“Endpoints like progression are the main data points that help us figure out how patients are doing ... [and] decide whether a drug is approved or not, so it is critical to get them right,” says Cohen. The documentation and subsequent validation of those clinical endpoints require different approaches when moving from the controlled environment of a clinical trial to the dynamic reality of real-world clinical practice.
Cohen says he never dreamed he’d be working at a health tech company back in 2018, when he was finishing up his fellowship training at the University of Pennsylvania. “I was doing research on decision-making and trying to understand, particularly at the end of life, how patients weigh risks and benefits in making the decision to get treated, often when we know it is not going to be curative.”
It was often assumed that living as long as possible is what’s most important to patients, says Cohen, but from what he was observing that wasn’t the primary driver. That led to his fascination with how to leverage large amounts of data to enable better clinical decision-making based on what’s important to patients.
Whether cancer has progressed, and what can be done about it, are among “the most important, distressing, and anxiety-provoking” topics of conversation patients have with their doctor in the clinic, he says. “[Progression] is such a charged word, and it’s easy to forget what that actually means for patients when we’re talking about ... looking for mentions of progression in a chart, pulling it out, and providing information and data about it.”
The reason clinicians document the term at all is “primarily for billing [purposes] and to justify their decisions,” he points out. “It is not with the knowledge and intention that we are going to be trying to learn or do research on those data, or how those patients are doing.”
That reality is reflected in how things show up in the chart, Cohen continues, which “may be very messy, may contradict themselves from document to document ... [or] may not be there at all.” Detecting progression is therefore “much more complicated” than simply looking up a date in an EHR.
It has only been over the last three years that LLMs have been applied in real time for these sorts of tasks, he adds. But for over a decade now Flatiron has been thinking about how to leverage the data that are being documented in the EHR to better understand how patients in the real world are doing, in the hope that this would lead to improvements in care and research for patients with cancer.
Initially, this was done via manual abstraction, says Cohen. A considerable amount of time, effort, and thought went into how to train a human to methodically go through a chart and pull out complicated and nuanced cancer-related information from a system initially designed with billing in mind.
Flatiron has been sharing details about its EHR abstraction process for identifying real-world cancer progression in peer-reviewed journals since 2019, initially using human experts where scalability was one of the main limitations (JCO Clinical Cancer Informatics, DOI: 10.1200/CCI.19.00013). “We could do complicated human abstraction well, but we were limited in how many cancer types and how many patients we could do that for,” says Cohen.
Due to the time-intensive nature of human abstraction, Flatiron began exploring machine learning capabilities that were then available—notably, natural language processing (NLP) and deep learning models—and made some headway, he says. But when it came to progression, the company repeatedly ran into performance-based issues that limited progress.
One vexing problem was getting the date right. “There are so many mentions of dates in the chart and there is so much context around what those dates mean, referencing things in the past, in the future, not even talking about it, and it was really tripping up our [NLP] models,” says Cohen. Although trained on human-labeled data, they struggled to “read between the lines” as human abstractors would routinely do.
As LLMs became mainstream, Flatiron began looking into how to use the powerful tools to extract data, he continues. The issue here is that they weren’t designed to extract clinical details, but to predict the next word in a sequence. “It takes a lot of thought about how to prompt a large language model to not predict what you’re looking for but to actually find it in the chart if it’s documented and pull it out.”
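To make that concrete, the sketch below shows what such an extraction-oriented prompt might look like. It is a minimal illustration, not Flatiron’s actual pipeline: the call_llm helper, the instruction wording, and the JSON output schema are all assumptions.

```python
import json

# Hypothetical extraction prompt: the instructions steer the model toward
# finding documented progression events rather than predicting likely ones.
EXTRACTION_PROMPT = """You are abstracting cancer progression events from clinician notes.
Report ONLY progression events that a treating clinician explicitly documents or confirms.
Do not infer progression from pathology or radiology reports alone.
If no clinician-confirmed progression is documented, return an empty list.

Return JSON: {{"events": [{{"date": "YYYY-MM-DD", "evidence": "<verbatim quote from the note>"}}]}}

Clinician notes:
{notes}
"""

def extract_progression(notes_text: str, call_llm) -> list[dict]:
    """Run the extraction prompt through an LLM client supplied by the caller.

    `call_llm` is a placeholder for whatever chat-completion function is in use;
    it takes a prompt string and returns the model's text response, which is
    expected here to be the JSON object requested above.
    """
    response = call_llm(EXTRACTION_PROMPT.format(notes=notes_text))
    return json.loads(response).get("events", [])
```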
Progression still occurs too frequently across cancer types, which have “clear biologic differences in how aggressive certain cancers are and the demographics of patients who get [them],” says Cohen, getting back to the importance of the pan-cancer study. Those differences in turn influence “how frequently progression happens, the amount of time between progression events, and how effective the treatments are at preventing progression.”
In the latest research presented at the AACR conference, Cohen made a point to highlight not only that the LLM worked as well as expert humans across multiple cancer types but also that for a few cancer types (endometrial cancer, melanoma, and hepatocellular carcinoma) it didn’t work as well as desired. Via error analysis, a cross-functional team of clinical and technical experts was able to “figure out why that was,” he says. Bottom line: the model was having trouble differentiating between the advanced diagnosis workup and early progression events.
Since Flatiron is training LLMs to approach problems the same way humans are asked to do, says Cohen, “in general the cases where a human struggles, the large language model tends to struggle, too, or vice versa.” The cancers where researchers found more room for improvement tended to be those where “the cancer typically spreads locally ... [and] it was unclear in the chart if this was the patient developing advanced disease for the first time ... or the patient was diagnosed with advanced disease and imaging [tests] were showing progression.”
As a result of the error analysis exercise, both LLMs and expert human abstractors now know how to provide better context for the problems they’re trying to solve, he says. “You’re not going to get it 100% right the first time; it’s an iterative process and so we take a very comprehensive approach to evaluating quality.” The company has summed up this approach in its recently published preprint, Validation of Accuracy for LLM/ML-Extracted Information and Data (VALID) Framework.
Part of that process is understanding when you have an opportunity for continued model refinement versus when you’re approaching gold-standard quality and can shift focus elsewhere because there is little room for improvement, says Cohen, citing the “excellent performance” nearly across the board of its latest progression LLM.
In addition to the study focused on how to comprehensively evaluate the quality of LLM-extracted information, Flatiron’s Director of Research Sciences Melissa Estevez presented a second study at the AACR conference on how to benchmark the performance of an LLM against expert human abstractors using the VALID framework. The expert humans, in this case, are individuals with oncology experience (former or current nurses or tumor registrars) who have gone through specific training and who follow guidelines written by Flatiron clinicians and iterated on over the past decade.
Since model performance has been anchored to expert humans, it was important to create a separately collected reference dataset of human-abstracted answers against which the LLM and human approaches could be fairly compared, he emphasizes. “We won’t know how well a model is doing if we simply evaluate it on a human who has done those same tasks and assume the human was always right. We won’t know the cases where the model was right, and the human was wrong.”
VALID stipulates the use of such an independent reference dataset so the two approaches can be more directly compared and contextualized to reveal true differences in performance between a model and a human. It would be understandably hard for most people to know what to make of an approach with 80% sensitivity or precision, versus a yardstick such as being within 5 percentage points of an expert human, says Cohen. “That’s just something more tangible and relatable and accessible for people to kind of wrap their minds around and builds trust.”
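As an illustration of that yardstick, both the model and a routine human abstraction can be scored against the independently collected reference answers, with the model’s numbers reported as a gap, in percentage points, from the human benchmark. The sketch below uses toy per-patient event dates; nothing in it is drawn from the VALID framework’s actual specification.

```python
def precision_recall(predicted: set, reference: set) -> tuple[float, float]:
    """Precision and recall of predicted progression events vs. the reference set."""
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall

# Toy example: progression dates identified by the LLM, by a routine human
# abstractor, and in the independently abstracted reference dataset.
reference  = {"2023-01-10", "2023-08-02", "2024-02-15"}
llm_pred   = {"2023-01-10", "2023-08-02"}
human_pred = {"2023-01-10", "2023-08-02", "2023-11-01"}

llm_p, llm_r = precision_recall(llm_pred, reference)
hum_p, hum_r = precision_recall(human_pred, reference)

# Express model performance relative to the expert-human benchmark,
# e.g. "within 5 percentage points of the human abstractor."
recall_gap_pp = (hum_r - llm_r) * 100
print(f"LLM recall {llm_r:.0%} vs human {hum_r:.0%} (gap: {recall_gap_pp:.1f} points)")
```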
Importantly, Flatiron requires progression in real-world charts to be clinician-confirmed, he adds. “A clinician seeing and caring for the patient needs to be the one who says the patient progressed or comments on an imaging study that suggests that the patient progressed instead of maybe just relying on a pathology report or an imaging test read by a radiologist that has never seen that patient before.”
This is where the primary focus has been when guiding human abstractors and, accordingly, LLMs are instructed to focus on the clinical notes being written about patients, says Cohen. With large language models it’s about finding the “sweet spot,” he adds. “If you don’t show it enough [clinical documents], you’re not going to get what you need, but if you show it too much it gets overwhelmed and starts hallucinating and gives you the wrong answer [performance degradation]. You must know from the get-go where you are most likely to find the information that you are looking for and build out from there.”
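One way to picture that “build out from there” strategy is to rank the chart’s documents by how likely they are to contain clinician-confirmed progression, clinic notes first, and add them to the prompt until a context budget is reached. The document types, priority order, and token budget in this sketch are illustrative assumptions, not Flatiron’s configuration.

```python
# Illustrative priority order: clinician-authored notes first, ancillary reports later.
DOC_PRIORITY = {"oncology_clinic_note": 0, "progress_note": 1,
                "imaging_report": 2, "pathology_report": 3}

def select_documents(documents: list[dict], token_budget: int = 50_000) -> list[dict]:
    """Pick chart documents for the prompt, highest-priority and most recent first,
    stopping at a fixed budget so the context does not grow large enough to
    degrade the model's performance."""
    # Most recent first, then a stable sort keeps priority tiers grouped.
    by_date = sorted(documents, key=lambda d: d["date"], reverse=True)
    ranked = sorted(by_date, key=lambda d: DOC_PRIORITY.get(d["type"], 99))

    selected, used = [], 0
    for doc in ranked:
        tokens = len(doc["text"]) // 4  # rough token estimate
        if used + tokens > token_budget:
            break
        selected.append(doc)
        used += tokens
    return selected
```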
The VALID accuracy assessment framework draws on the experience of Flatiron Health in developing and extensively validating curation pipelines based on machine learning and LLMs, and builds on past approaches for evaluating the data they extract. But the latest study is the first time the approach was used to evaluate a model for “an endpoint as important as progression,” Cohen reports.
One of the central pillars of VALID is benchmarking against humans across all clinical details and variables that are being abstracted internally within Flatiron. “If you don’t know the quality of the data that you’re evaluating the model on, then you’re going to be inherently limited in how well you understand the performance of the model,” he reiterates.
“As good as humans are at abstracting things, especially the way we train them, we know especially as the tasks become more complicated that they’re not perfect and they make mistakes as well,” Cohen continues. As has already been suggested, LLMs are going to be able to outperform humans on certain tasks and cases.
But they still need some work. In the second study, which was a bias evaluation of LLM-extracted breast cancer and metastatic diagnoses and dates, the model’s accuracy in identifying cases varied slightly by race/ethnicity and age, with higher recall for Black patients and higher precision for Latinx patients compared to White patients, but lower recall for patients aged 75 and older compared to the larger, younger cohort of patients.
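Checking for that kind of bias boils down to computing the same accuracy metrics separately within each demographic subgroup and comparing them. A minimal sketch, assuming true-positive, false-positive, and false-negative counts have already been tallied per patient against the reference data (the field names are hypothetical):

```python
from collections import defaultdict

def subgroup_metrics(patients: list[dict], group_key: str) -> dict[str, dict[str, float]]:
    """Recall and precision of LLM-extracted diagnoses, stratified by a demographic field.

    Each patient record is assumed to carry counts of true positives, false
    positives, and false negatives from comparison against the reference data.
    """
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for p in patients:
        g = totals[p[group_key]]
        g["tp"] += p["tp"]; g["fp"] += p["fp"]; g["fn"] += p["fn"]

    metrics = {}
    for group, c in totals.items():
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        metrics[group] = {"recall": recall, "precision": precision}
    return metrics

# e.g. subgroup_metrics(cohort, "race_ethnicity") or subgroup_metrics(cohort, "age_group")
```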
“The most important finding of that study ... is that you need to check [for bias] because once you check you can figure out what to do about it,” says Cohen, perhaps by changing the prompts. “In some cases, you might not be able to do much about it because you’re limited by the sample sizes or just the data that you have, but you can at least take into account the use cases or how you are going to use that data for decision-making.”
It is well recognized both that documentation in the EHR can be subject to physician biases and that disparities in access to care and outcomes exist for patients with breast cancer, he notes. The same bias evaluation would be needed regardless of the cancer type and variables because “if you don’t even think to check ... then you are blind and you won’t know how those results are going to apply to all the patients that may be impacted.”
With some older cancer patients, one confounding issue may be that they leave their local oncologist to spend the winter months in Florida, for example, where they see a different, out-of-network doctor, Cohen says. Regardless of whether the same clinical decisions are made in both places, the documentation for these patients will be less complete, and that gap can be addressed with new LLM prompts only if information about the seasonal move exists in the medical record.
Flatiron seeks to keep its quality and validation approach to LLM-extracted data in lockstep with the ever-changing capabilities of the models so it can quickly and effectively “understand how they work, explain that clearly, identify hallucinations, and know what to do about them,” says Cohen. It’s a continuing, ever-more-impactful process in terms of “helping patients and realizing the vision and promise of personalized medicine.”
To understand if a treatment will or won’t work for individual patients, progression data are needed to train machine learning models, Cohen says. All the high-quality data that Flatiron is curating, developing, and validating are meant to serve as the “building blocks for predictive modeling” by the company to improve treatment decision-making and help ensure patients get enrolled in trials at the right time—in part by reminding clinicians to order the right tests and alerting them when data in the chart are missing.
This will additionally provide insights into the physician decision-making process itself, which is a huge area of research interest, he points out. Until LLMs came along, that wasn’t the kind of information that could be gleaned from EHRs on a broad scale.