By Deborah Borfitz
June 9, 2021 | Artificial intelligence (AI) algorithms were recently found to perform on par with pathologists when it came to quantifying tumor cellularity, an increasingly common measure of patient response to neoadjuvant therapy for breast cancer. That was one of the key findings of the Breast Pathology Quantitative Biomarkers (BreastPathQ) Challenge organized by the U.S. Food and Drug Administration (FDA), the International Society for Optics and Photonics (SPIE), the American Association of Physicists in Medicine (AAPM), and the U.S. National Cancer Institute (NCI).
The competition took place in 2019 and the results were recently published in SPIE’s Journal of Medical Imaging (DOI: 10.1117/1.JMI.8.3.034501). Thirty-nine competitors from 12 countries completed the challenge, collectively developing, validating, and testing 100 algorithms, most of them combinations of convolutional neural network (CNN) architectures designed to optimize predictive performance, according to Nicholas Petrick, deputy director for the Division of Imaging, Diagnostics and Software Reliability in the FDA’s Center for Devices and Radiological Health, who led the effort.
The assigned task was to come up with an automated method for analyzing digital microscopy images of breast tissue and rank them based on their tumor cell content to provide a reliable tumor cellularity score, says Petrick. The image patches had all been previously evaluated by two pathologists as a reference standard.
Both the task and the method of assessing performance made the challenge unique, Petrick says. The grading system for comparing and ranking algorithms was the average “prediction probability concordance” (PK) with the pathologists’ scores. This rank-based metric was selected because it is not affected by calibration differences among the algorithms.
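The article does not spell out the exact PK formula, but the idea behind a rank-based concordance measure can be sketched in a few lines of Python. The function name and toy scores below are hypothetical, for illustration only: each pair of cases whose reference scores differ is checked for whether the predictions order them the same way, with prediction ties counting half.

```python
from itertools import combinations

def concordance_pk(reference, prediction):
    """Pairwise concordance in the spirit of the PK metric.

    Considers every pair of cases whose reference scores differ and
    checks whether the predictions order them the same way; prediction
    ties count half.  Returns a value in [0, 1], where 0.5 is chance.
    """
    concordant = discordant = tied = 0
    for (r1, p1), (r2, p2) in combinations(zip(reference, prediction), 2):
        if r1 == r2:
            continue  # pairs with equal reference scores are ignored
        if (p1 - p2) * (r1 - r2) > 0:
            concordant += 1
        elif p1 == p2:
            tied += 1
        else:
            discordant += 1
    total = concordant + discordant + tied
    return (concordant + 0.5 * tied) / total

# The ranking is unaffected by calibration: scores on a 0-1 scale and
# the same scores rescaled to 0-100 give identical PK values.
ref = [0.10, 0.25, 0.40, 0.80]
pred_unit = [0.05, 0.30, 0.35, 0.90]
pred_percent = [5, 30, 35, 90]
assert concordance_pk(ref, pred_unit) == concordance_pk(ref, pred_percent)
```

Because only pairwise orderings matter, any monotone rescaling of an algorithm's output leaves this score unchanged, which is exactly why the organizers favored it for ranking.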
To assess how calibration differences between the algorithms and references skewed algorithm performance rankings, the post-challenge analysis employed a secondary correlation measure known as the “intraclass correlation coefficient” (ICC), he adds. “Generally, the ones that did well for the PK ranking did well for the ICC… increasing confidence that those algorithms were performing well in general [and not as an artifact of the measurement scale].”
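To see why a calibration-sensitive secondary measure is informative, here is a sketch of a standard two-way, absolute-agreement intraclass correlation coefficient, ICC(2,1), computed with NumPy. The toy data are hypothetical, not the challenge's actual analysis: a constant offset between an algorithm and the reference leaves a rank-based metric unchanged but lowers the ICC.

```python
import numpy as np

def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement ICC(2,1).

    `scores` is an (n_subjects, n_raters) array.  Unlike a rank-based
    metric, ICC is sensitive to calibration: a constant offset between
    raters lowers it even when the ordering is identical.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

reference = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
aligned = np.column_stack([reference, reference])        # perfect agreement
shifted = np.column_stack([reference, reference + 0.2])  # same ranking, offset
assert icc_2_1(aligned) > icc_2_1(shifted)  # the offset lowers ICC
```

An algorithm that ranks cases correctly but on a shifted scale would therefore score high on PK and lower on ICC, so agreement between the two rankings increases confidence that performance is genuine rather than an artifact of the measurement scale.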
Regardless of the algorithm calibration (e.g., 0 to 100 or 0 to 1), PK scoring simply assessed how well an algorithm ordered the cases relative to the reference standard, he says. “A lot of times after the fact you can recalibrate your algorithm scores to meet the expectations of the clinical task.”
It was unsurprising that challenge participants opted to use deep CNNs, which have been the dominant approach to AI since 2012, says Petrick. The methodology allowed developers to run their algorithms on the full image patch rather than feeding human-engineered features into classifiers. The algorithms were trained directly on the input pixel data and learned the appropriate convolution kernel weights (a means of blending image data to obtain another, typically smaller image) to solve the clinical task.
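The kernel-blending idea can be illustrated with a minimal 2D convolution in NumPy. The kernel here is a hand-picked box blur purely for illustration; in a CNN the kernel weights are what the network learns from the training data.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide a small kernel over an image ('valid' mode, no padding).

    Each output pixel is a weighted blend of a neighbourhood of input
    pixels, so the output is slightly smaller than the input.
    """
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

patch = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "image patch"
box_blur = np.full((3, 3), 1 / 9)                 # illustrative fixed kernel
blurred = convolve2d_valid(patch, box_blur)
assert blurred.shape == (4, 4)  # output is smaller than the 6x6 input
```

A trained CNN stacks many such learned kernels, interleaved with nonlinearities and downsampling, so the network gradually condenses raw pixels into task-relevant features.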
Most competitors started with standard pre-trained CNN architectures that were not built for medical imaging, he says. These include Inception, ResNet, and DenseNet architectures trained for image classification and computer vision tasks on the more than 14 million everyday pictures (e.g., dogs and rabbits) housed in the ImageNet database.
The BreastPathQ Challenge used image patches from 63 scans taken of 33 patients, together representing thousands of “regions of interest,” says Petrick. These are the same image sections pathologists would manually view to estimate the proportion of the tumor bed occupied by tumor cells, a process that varies widely from physician to physician. “It’s not like you can’t count every cell… but that takes a long time and is difficult for human readers.”
In addition to offering repeatability, algorithms used in the competition did not introduce any large errors, he notes. They had roughly the same performance as clinicians (only the best performing algorithm slightly surpassed the scores of the pathologists) and could not do a lot better than the clinician readers since the clinicians were providing the reference standards for the study.
It is up to BreastPathQ Challenge participants whether to move on to the next phase of validation and possibly pursue development of their algorithms into a clinical product, says Petrick.
One observation that might be further explored is that the algorithms tended to perform well on easier patches of images but struggled on the difficult patches—those for which AI would be especially beneficial to pathologists, Petrick says. It could be that the algorithms would do better with more data. It is also possible their performance was lower because of higher variability on those cases by the pathologists.
The bigger question is how AI can best be deployed clinically, says Petrick. If AI is just repeating what clinicians are doing, the gains would be more in terms of efficiency than better patient management.
Petrick says AI holds much potential for improving diagnostic practices by adding consistency to quantitative measurements and aiding in disease subtyping, on top of AI applications already widely deployed in the clinic helping radiologists identify suspicious areas on digital images. “It’s not a question of whether AI will be implemented in medical practice but how and which applications are most appropriate.”
The BreastPathQ Challenge was a one-time event with a modest prize (registration to a SPIE conference). Still, it was the biggest ever for SPIE in terms of number of submissions, says Petrick, who chaired one of the group’s computer-aided diagnosis meetings. The imaging data used in the challenge is housed in the NCI’s Cancer Imaging Archive and publicly available for download.
SPIE was lead sponsor for two previous competitions, the 2015 LUNGx Challenge and the 2017 PROSTATEx Challenge, focused on diagnostic assessment using imaging datasets, he says. They are all associated with the SPIE Medical Imaging Conference. This challenge is distinct from many other challenges in that the goal was estimating a tumor cellularity value instead of solving a binary problem, such as the presence or absence of cancer.
Detection and diagnostic challenges are relatively common, Petrick says. Some with large monetary incentives to participate also focus on bringing teams together to further advance the development of clinical AI tools.
Enhanced performance using an ensemble of AI algorithms was one of the key findings of the 2016-2017 Digital Mammography DREAM Challenge, organized by IBM and Sage Bionetworks, which focused on risk stratification of screening mammograms to improve breast cancer detection, Petrick says. It is now well appreciated that ensembles can help limit some of the variability in training and testing, making algorithms more generalizable, and some interesting combinations were deployed in the BreastPathQ Challenge.
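The variance-reduction effect of ensembling can be sketched with synthetic data; the "models" below are hypothetical stand-ins (each an unbiased but noisy predictor of a made-up cellularity score), not any challenge entrant. Averaging their outputs cancels much of the independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
true_cellularity = np.linspace(0.0, 1.0, 200)  # hypothetical ground truth

# Five hypothetical models: each unbiased but noisy in its own way.
models = [true_cellularity + rng.normal(0, 0.1, true_cellularity.size)
          for _ in range(5)]

# Simple ensemble: average the model outputs, case by case.
ensemble = np.mean(models, axis=0)

def rmse(pred):
    return float(np.sqrt(np.mean((pred - true_cellularity) ** 2)))

# Averaging independent errors shrinks them, roughly by sqrt(n_models).
assert rmse(ensemble) < min(rmse(m) for m in models)
```

Real ensembles combine models whose errors are only partially independent, so the gain is smaller than this idealized case, but the principle of washing out model-specific noise is the same.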
One of the advantages of the BreastPathQ Challenge and its predecessors is that it pulls in a lot of people who otherwise would not have access to clinical data, including students studying AI at universities without an affiliated medical center, says Petrick. Moreover, clinicians have already referenced the data and scored all the cases.
Another plus, especially of the larger challenges, is that they are a good indicator of the current state of AI for specific tasks and whether the use cases are worth pursuing further, Petrick continues. Contestants are on an equal footing because they all run their algorithms on the same dataset, allowing an apples-to-apples comparison of performance.
Challenge organizers did all the groundwork so participants could focus on the task at hand, he notes. And as the competition was underway, contestants had an online forum where they could post questions, find answers, and otherwise head off potential issues in real time.
The BreastPathQ Challenge attracted a diversity of participants, including people with an interest or background in medical imaging or AI who in some cases partnered with physicians, he says. Others were likely students participating as part of a class project or simply trying their hand at implementing deep neural networks.
Unlike the world of radiology, AI tools are still scarce in pathology labs because digital image processing is more the exception than the rule, Petrick says. For the most part, doctors are still looking at slides under a microscope, but digital pathology is taking root. “Down the road pathology will catch up.”