By Allison Proffitt
May 7, 2026 | There’s a seductive promise baked into the rise of large language models: that a single, powerful AI trained on the breadth of human knowledge can do virtually anything. Need to draft a contract? Summarize a research paper? Predict whether a drug molecule will cross the blood-brain barrier? Just ask GPT.
But ask the executives building AI tools inside two highly specialized fields — radiology and early-stage drug discovery — and they’ll tell you that promise breaks down the moment you need answers that actually matter.
Radiology-Specific AI Preferred by Radiologists
Rad AI, a San Francisco-based company that has spent eight years building AI tools specifically for radiology, recently put its flagship Impressions generator head-to-head against a generic large language model and had radiologists and oncologists judge the results. The findings were published last month in npj Digital Medicine (DOI: 10.1038/s41746-026-02586-6).
The study, co-authored by Andrew Del Gaizo, Rad AI’s Chief Medical Information Officer and a radiologist at Moffitt Cancer Center, pitted Rad AI’s domain-specific model against GPT-4 across 200 oncologic CT scans, some of the most complex studies in radiology. The comparison focused on impressions, the clinically critical summary at the end of a radiology report that tells physicians and patients what a finding means and what to do next.
For every case, the original radiology report was submitted to both systems, and impressions were generated by Rad AI’s Impressions tool and by GPT-4. The study compared these two AI-generated impressions to the one written by the original radiologist more than a year earlier. The three versions were then reviewed by the original radiologist, by an external radiologist, and by an oncologist of the kind who would have received the report.
“The original radiologist really preferred either their own impressions or the customized AI-generated impressions that were trained on their historical data,” Del Gaizo said. “There was a steep drop-off then to the generic AI.”
For the commercial Rad AI Impressions tool, which has been on the market since the company launched, the model is trained on each radiologist’s own historical reports. It learns their vocabulary, their hedging patterns, even their preferred phrasings, such as “concerning for” versus “suspicious for.” These minor distinctions determine whether the radiologist will accept the output or spend time editing it. If the draft doesn’t sound like them, it doesn’t save time. If it doesn’t save time, they won’t use it.
A second cohort of radiologists — reviewers with no prior connection to the original cases, simulating a common workflow where a radiologist reading a follow-up scan reviews the prior report — showed an even more interesting pattern: they actually preferred the custom AI’s impressions over those written by the original human radiologists. “That was pretty eye-opening,” Del Gaizo said. Again, “the generic AI was a steep drop-off from both.”
The generic model didn’t fail outright. Assessing physicians rated the likelihood of patient harm as low across both AI systems, comparable to human error rates. And oncologists rated the generic model’s impressions as slightly clearer, perhaps because they were less concise and included additional context useful to non-radiologists. “Downstream clinicians may value explanatory context and narrative clarity more than brevity alone,” the authors theorize in the paper.
But for radiologists, the tool’s primary users, the customized AI delivered results more closely aligned with what they themselves would write.
The Blood-Brain Barrier Problem
On the other side of the globe, in Melbourne, Australia, a startup called Qubigen is making a similar argument.
Qubigen, founded in 2024 through the merger of two AI companies with roots in federated learning and quantum chemistry, is focused on federated AI, AI drug design, and virtual screening.
One of the thorniest problems in that space is predicting blood-brain barrier (BBB) penetrance: whether a given drug molecule will cross from the bloodstream into the brain. BBB penetrance is one of the most important ADMET properties to predict for CNS drug development, but it is highly sensitive to experimental context and assay definition, which makes it a useful benchmark for comparing generalized and tailored AI models.
Qubigen compared its Federated AI Drug Design platform (FedAIDD) against a strong generalist AI model on the task of predicting BBB permeability, and has published the case study as a blog post on its site.
“There’re tools out there,” said Jonathan Hall, Qubigen’s founder and CEO, of off-the-shelf BBB penetrance predictors. “You can get the leading open-source AIs, you can just download and run them, and they give you some predictions. But we found that those fail very badly when you look at real data with real projects.”
In setting up the comparison, the Qubigen team first built a strong generalist model for FedAIDD to contend with. They developed this reference algorithm on the BBB_Martins dataset (distributed via MoleculeNet / TDC as the BBBP dataset), giving them a benchmark dataset and a state-of-the-art generalized reference model for BBB prediction.
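Qubigen has not published the internals of its reference model, but the public benchmark itself is straightforward to reproduce. Below is a minimal sketch of what a generalist baseline on BBB_Martins might look like, assuming PyTDC, RDKit, and scikit-learn are installed; the fingerprint and model choices are illustrative assumptions, not Qubigen’s method.

```python
# Minimal generalist baseline on the public BBB_Martins / BBBP benchmark.
# Illustrative only; Qubigen's actual reference model is not public.
# Requires: pip install PyTDC rdkit scikit-learn
import numpy as np
from tdc.single_pred import ADME
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def featurize(df):
    """Morgan fingerprints (radius 2, 2048 bits); skips unparseable SMILES."""
    X, y = [], []
    for smi, label in zip(df["Drug"], df["Y"]):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        X.append(np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)))
        y.append(label)
    return np.vstack(X), np.array(y)

# Scaffold split: held-out molecules are structurally distinct from training data
split = ADME(name="BBB_Martins").get_split(method="scaffold")
X_train, y_train = featurize(split["train"])
X_test, y_test = featurize(split["test"])

clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_train, y_train)
print("Benchmark ROC-AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

A model like this can post a respectable score on the held-out benchmark split. The argument Qubigen is making is about what happens when the test distribution shifts to proprietary project data.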
But the problem is still input, Hall said. Generic AI models trained on publicly available pharmaceutical data are drawing from a deeply biased dataset: approved compounds. The molecules that make it through clinical trials and into the literature are, by definition, not representative of the messy, inconclusive, often contradictory data that actually live inside a drug discovery organization at the early stage. Negative results don’t get published; failed molecules don’t get patented. The training data is skewed toward success in a field where most early-stage work ends in failure.
“They don’t predict negative results really well because people don’t publish negative results,” Hall said. “So there’s a huge amount of scope for tailored AI that works on specific project data.”
Qubigen’s FedAIDD architecture uses federated learning to incorporate client-specific datasets into training, “without moving or exposing proprietary data, enabling models that adapt to the chemistry and experimental outcomes observed within a client’s research programs,” the company explained in the case study.
To demonstrate this, Qubigen curated datasets designed to reflect the properties of real early-discovery project data, drawing on BDB3 (Brain Drug Database v3) and two distinct ChEMBL datasets, and modeled how the four datasets could be queried in a federated manner. The team then built federated models with two, three, and four nodes, weighting each node toward either client-specific or general data.
They found that incorporating client-specific data through federated AI substantially improved predictive performance on client-relevant data, while maintaining strong performance on the public benchmark. The generic model, in contrast, “absolutely collapses on the specific project data,” Hall said.
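Qubigen’s node-weighting scheme is not public, but the general mechanism, weighted federated averaging, is well documented. The sketch below shows how aggregation across nodes can tilt a shared model toward client chemistry without pooling the raw data; the logistic-regression learner, node count, and weights are illustrative assumptions.

```python
# Minimal sketch of weighted federated averaging (FedAvg-style).
# Node data, weights, and the simple learner are illustrative; Qubigen's
# actual FedAIDD internals are not public.
import numpy as np

def local_update(params, X, y, lr=0.01, epochs=5):
    """One node's local training step (logistic regression via gradient
    descent). Raw (X, y) never leave the node; only parameters are returned."""
    w = params.copy()
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (preds - y) / len(y)
    return w

def federated_round(global_params, nodes, node_weights):
    """Aggregate locally trained parameters, weighted so the shared model
    can be tilted toward client-specific chemistry."""
    weights = np.array(node_weights, dtype=float)
    weights /= weights.sum()
    local_params = [local_update(global_params, X, y) for X, y in nodes]
    return sum(wt * p for wt, p in zip(weights, local_params))

# Hypothetical 3-node setup: one public-benchmark node, two client nodes.
rng = np.random.default_rng(0)
nodes = [(rng.normal(size=(200, 16)), rng.integers(0, 2, 200).astype(float))
         for _ in range(3)]
params = np.zeros(16)
for _ in range(20):  # 20 communication rounds
    params = federated_round(params, nodes, node_weights=[1.0, 2.0, 2.0])
```

The weighting Qubigen describes, toward either client-specific or general data, corresponds to the node_weights knob here: heavier client weights adapt the shared model to a program’s chemistry, while the public-data node anchors performance on the general benchmark.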
Why Generic AI Flatters to Deceive
In both cases, the companies argue that generic AI’s errors are easy to miss unless you’re deep in the domain. Generic models perform impressively on the benchmarks they were trained to do well on. They can look excellent in controlled evaluations. But real-world clinical or discovery data looks different from curated benchmarks: it’s noisier, more idiosyncratic, shaped by the practices and preferences of the people and institutions that generated it.
In radiology, that gap manifests in the subtleties of an individual physician’s voice. In drug discovery, it manifests in the distribution of molecular data — the assay types used, the thresholds applied, the specific targets being pursued. Both are problems for which a generic model, by design, cannot fully account.
“By believing whatever it comes out with, it’s going to be problematic,” Hall said of generic models applied to early drug discovery. “It will only ever be so good. But by being able to update an AI to have specific data, that’s where you see the power of these abilities.”
Del Gaizo frames the issue in terms of what happens at the point of use. A radiologist reviewing a generic AI’s draft impression and finding it too verbose, too different from their natural phrasing, will spend time editing, eliminating the efficiency gains that justified adopting the tool in the first place. “There’s always a hesitation to being first,” he acknowledged of the delay in adopting any new modality. But Rad AI has seen adoption stick as the custom model, over time, learned to sound like each radiologist.
The Personalization Paradox — and Its Privacy Complications
Both companies have also grappled with an awkward corollary: if tailored AI requires training on specific, proprietary data, how do you actually get that data?
For Rad AI, the answer has been to train each radiologist’s profile on their own historical reports before the product goes live. “It should sound like them on the very first time they use it,” Del Gaizo said, “which is almost surreal and spooky that it works — but the radiologist is always blown away.” The model then continues learning from use, weighted toward more recent dictations to account for the way physicians’ styles evolve over their careers. Individual feedback — flagging a phrase they dislike, preferring one formulation over another — updates their personal profile directly.
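Rad AI has not published how that recency weighting is implemented; one common approach is an exponential decay over report age. A minimal sketch, with a purely hypothetical one-year half-life:

```python
# Sketch of recency-weighted training data: one plausible way to bias a
# per-radiologist model toward how they dictate now rather than years ago.
# Rad AI has not published its scheme; the half-life is an illustrative
# assumption, and the sample reports are invented.
from datetime import datetime, timezone

HALF_LIFE_DAYS = 365.0  # hypothetical: a report's weight halves every year

def recency_weight(report_date: datetime, now: datetime) -> float:
    """Exponential decay: recent dictations count more in fine-tuning."""
    age_days = (now - report_date).days
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

now = datetime(2026, 5, 1, tzinfo=timezone.utc)
reports = [
    (datetime(2026, 4, 20, tzinfo=timezone.utc), "No acute findings. ..."),
    (datetime(2023, 1, 15, tzinfo=timezone.utc), "Findings concerning for ..."),
]
# Each weight would then scale that example's contribution to the
# fine-tuning loss for the radiologist's personal profile.
weighted = [(recency_weight(d, now), text) for d, text in reports]
```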
For Qubigen, the challenge is more fraught. Drug discovery data is among the most commercially sensitive intellectual property a pharmaceutical company holds. Organizations don’t share it. They often can’t share it, for legal and competitive reasons, even when doing so would benefit the science.
This is where Hall sees the case for federated learning as not just technically interesting but commercially necessary. Federated learning allows AI models to train across distributed datasets without any single party seeing the underlying data. A federated model can be fine-tuned on a client’s project data without that data ever leaving their environment. The AI improves; the data stays private.
“Federated AI is a wonderful tool for being able to leverage data without moving or seeing it,” Hall said. But he also argues that the federated model’s appeal goes beyond what is “aspirational and utopian”: there is a business advantage to federated AI because its models simply deliver better results than generic AI.
More Than a Better Model — A Different Business
The argument these companies are making isn’t just technical. It has implications for how AI products in high-stakes domains need to be built and sold.
The standard playbook for AI deployment — license a powerful model, fine-tune it on some representative data, deploy it broadly — may be well-suited for general productivity tools. But in environments where performance variance directly affects patient outcomes or billion-dollar research decisions, “generally good” isn’t sufficient. The question isn’t whether the model can generate a plausible impression or a reasonable BBB prediction. The question is whether it’s accurate enough, specific enough, and aligned enough with the user’s own standards to be trusted.
Rad AI’s study found that even oncologists — the downstream stakeholders reading radiology reports, not the radiologists generating them — preferred the custom AI model over both the generic model and the human-authored impressions. Their metric of preference wasn’t familiarity; it was clinical utility. The custom model’s impressions were more actionable.
Del Gaizo sees this as an opening for a future Rad AI has already begun exploring: impressions tailored not just to the radiologist who dictates them, but to the specific downstream reader. The impression surfaced to an ER physician should emphasize urgency and immediate next steps. The version going to an oncologist can assume a different base of knowledge. The version going directly to a patient should shed medical jargon entirely. The technology, he argues, now exists to generate each of those versions. The regulatory and legal frameworks are still catching up.
Qubigen’s Hall sees a similar trajectory in pharma, where the same underlying AI engine could be rapidly deployed as distinct, client-owned models for different organizations — a “factory” approach that can push out customized AI for one biotech’s rare disease program and another’s oncology pipeline without the outputs ever being conflated. Each client gets an AI trained on their data, reflecting their project, owned by them.
The Limits of the Argument
Neither case is a final verdict on generic AI. The Rad AI study assessed impressions in isolation, without the real-world human-in-the-loop editing that the clinical product actually involves; in practice, the performance gap between custom and generic may narrow as radiologists catch and correct the generic model’s miscalibrations. Del Gaizo acknowledged that the study design, while rigorous, represented a best-case simulation for both models.
Qubigen, for its part, is still working to prove its case at scale. Hall is candid about the fact that federated learning, as a category, has a track record of demonstrating technical feasibility without translating cleanly into commercial adoption — largely because the incentive structures for data sharing remain misaligned even when the privacy problem is solved.
What the two cases do provide is a challenge to the assumption that general capability implies universal applicability. In radiology and in drug discovery, the precision requirements are simply too high, and the domain knowledge too specific, for a model trained broadly to outperform one trained specifically.
“The tailored AI generally exceeds the generalist AI,” Hall said. “But you can’t really scale that effectively without Federated AI.”
For industries where the cost of a wrong answer is measured in patient outcomes or failed drug programs, that distinction may prove to be the one that matters most.