Changing Course: Rethinking How AI Can Interpret X-Rays
Changing Course: Rethinking How AI Can Interpret X-Rays
Can high-resolution images offer better accuracy in AI support for decision making than the standard low-resolution images used in most deep learning models today? That’s a question that researchers from SURFsara B.V., Intel’s AI group, and Dell EMC’s AI group recently tackled. The answer, it turns out, is yes.
This debate around image resolution highlights the occasions when hosting AI algorithms on CPUs can give better results and can help us understand the details that are critical for fine-tuning of all machine learning work, on all machines.
Pneumonia, emphysema, and other lung afflictions
Chest x-ray exams are one of the most frequent and cost-effective medical imaging examinations available — far cheaper and more accessible than chest CT imaging (Computerized Axial Tomography, commonly called CAT scans or CT scans). However, diagnosis via chest x-rays is generally regarded as more challenging and difficult than diagnosis via chest CT imaging.
It is significant that researchers have developed a technique to quickly train models for identifying pneumonia, emphysema, and other lung afflictions through the analysis of chest x-rays, because early and accurate diagnosis of emphysema and pneumonia can save lives. Emphysema, estimated to affect 16 million Americans, is life threatening and early detection is critical to halt its progression. According to the World Health Organization (WHO), 64 million people worldwide have emphysema or some form of pulmonary disease, and WHO predicts this will grow to be the third leading cause of death worldwide by 2030. WHO estimates that pneumonia is the single largest cause of death in children worldwide — killing 808,694 children worldwide in 2017. In the U.S., the American Thoracic Society estimates that one million adults seek hospital care annually due to pneumonia, and the CDC reported 49,157 deaths from pneumonia.
Typical Chest x-ray for analysis. Source: SURFsara
Why would we even doubt that larger images would give better results?
The research was done on a dataset (ChestRay-14) released by the NIH Clinical Center. This dataset is comprised of 112,120 x-ray images (1024x1024) with pathology labels from 30,805 unique patients. Fourteen pathologies (Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule Mass, and Hernia) are labelled — in some cases patients will exhibit more than one pathology, and some patients exhibit none at all.
It might seem obvious that examining the highest resolution x-ray images available would be the best option for a human or a computer program. Yet, prior work has generally downsampled the 1024x1024 images to 224x224 because they proved a better fit for the input layers of traditional ImageNet-pretrained models. Networks have been previously shown to be susceptible to noise caused by downsampling — as humans would be as well (we would call them blurry photos).
Since GPU-based deep-learning solutions cannot readily support high-resolution images due to their tight memory constraints, common network designs for image processing have been built to handle lower resolution images only. While this works reasonably well for sorting cats from dogs, simply downsampling the inputs sacrifices too much of the rich visual information present in medical data. The accuracy of AI algorithms trained on low resolution images has been far less than what is needed to play a significant role in assisting human experts.
While it is easy to assume that higher resolution images should be better, it is important to prove that in the real world. Hence, the work of the passionate researchers from SURFsara and their colleagues at Dell EMC and Intel.
Results “beyond the state-of-the-art” can help human experts
The goal of their research was not to try to replace radiologists, but rather to develop aids for the analysis of chest x-rays. Assisting in decision making may be particularly useful when medical doctors do not reach a consensus. Like all deep-learning work, their goal was to show that they could obtain a high level of accuracy with their models.
Using CPUs to escape the memory constraints of GPU-based systems, the researchers found techniques that led to much higher accuracy than was previously possible. Scale-out, large-batch training proved to be an effective way to speed up neural network training on CPU-based Intel Xeon platforms. Their experiments led to improving classification (the mean AUROC improved from 0.819 to 0.826) without significantly changing the training time. The upscaled version of ResNet-50, called ResNet-59, utilized the full 1024x1024 images to improve classification even further (mean AUROC from 0.826 to 0.838). Their scale-out work, when training a large AmoebaNet model (168 million parameters), managed to further boost classification (to obtain a mean AUROC of 0.842) outperforming prior work on all 14 different pathologies.
The researchers strongly emphasized that increasing model sizes and image sizes is not always beneficial for a variety of reasons, including the dreaded “overfitting.” Overfitting occurs when an AI algorithm pays attention to details too precisely, and fails to generalize. A classic example of overfitting is learning to notice a stop sign in a photo, but only if buildings are also present in the same photo. In this case, they found a way to make larger image sizes beneficial without overfitting.
Like humans, computers benefit from prior learnings
While training can start from scratch, the researchers emphasize that transfer learning and fine tuning are intelligent ways to improve training times. This is to say, a machine can learn to read x-rays faster if it has some prior experience looking at images. In their work, they first performed a full training on the ImageNet dataset followed by transfer learning on the target ChestXray-14 dataset. In both cases, scale-out (using a lot of CPUs) proved effective in improving training times.
Future work will compare the accuracy and feature transferability of networks trained by ImageNet vs. ChestXray-14 datasets – will a computer be able to learn better having studied a diverse world of images or only chest x-rays?
Neural networks can use a swift kick in the butt too
As alluded to previously, deep learning is infamously prone to overfitting, and every researcher seeks to avoid that issue. Understanding what the research team adjusted (“tuned”), in their approaches and algorithms, could be a model approach for where the state-of-the-art is headed in interpreting x-rays with AI.
A related problem is one of training time, and how to reduce it without causing overfitting. This brings us to fine tuning.
One of the most important hyperparameters when performing fine-tuning is called the learning rate. If the learning rate is too slow, then the computer seems “dense” (refuses to learn from repeated exposure to examples). If the learning rate is too high, the computer seems prone to assume everyone is victim to the disease it just learned about.
Using a lot of compute power to solve a problem more quickly involves a technique researchers call “scaling out” their algorithms to use multiple processors at the same time (“in parallel”). A popular method of using computers in parallel is referred to as “ensembles.” We can think of ensembles as scenarios – a kind of “what if” quiz we can have with a super-fast computer. By asking the computer lots of questions at once, we can see all its answers to various scenarios quickly and pick from the answers which seem most important. This is actually a key secret to modern weather predictions. Modern weather predictions run many ensembles to consider possible realities (effectively hypotheses). While considering x-ray interpretation, these researchers choose to adopt an ensemble learning techniques to train multiple learners to construct a set of scenarios/hypotheses in parallel.
After all the results from the numerous scenarios are collected, they are combined in a process that researchers call a “reduction operation.” The researchers’ choice to use ensembles builds upon work they had previously published in a paper detailing their use of ensembles in scale-out work with CPUs, including techniques to collapse the ensembles to help manage training time. A semi-cyclic learning-rate had also previously shown promising results in achieving good training times without overfitting. I had to scratch my head, and ask the researchers a bunch of questions, to comprehend why forcing the computer to learn in spurts made any sense.
Cyclical learning rates serve to give a “kick in the butt”
to the neural net occasionally. Source: SURFsara
They told me that research into adaptive learning rates is quite extensive —and a big motivation is helping neural networks learn even after getting stuck. Personal experience tells me that this works for humans – think of a caffeine kick that can help you think a bit differently. It turns out, neural networks can use a swift kick too. A cycling learning rate, also researched for some time, has the effect of desaturating neurons and thereby allowing networks to learn even after finding a locally optimal solution when a better global solution may exist. This is the AI world’s simulated annealing. Researchers have found this is useful for generalization purposes, which is another way of saying it helps avoid overfitting. The caveat is that if a “kick in the butt” moves us too far away from the best solution we’ve seen so far (called a “local minima”), beneficial information might be lost. Therefore, the researchers diverged from common practice and dampened their cyclical function as shown above (the cycles reduce in size over time). This had the effect of reducing the amount of information lost from the cyclical “kicks” as the training progressed. The “wiser” the network became, the less they allowed it to throw away prior learning.
The computers they used
Their new techniques proved highly effective on a Zenith cluster in the Dell EMC HPC & AI Innovation Lab. Each Zenith node consists of dual Intel Xeon Scalable Gold 6148/F processors, 192GB of memory, and an M.2 boot drive to house the operating system that does not provide user-accessible local storage. Nodes are interconnected by a 100Gbps Intel Omni-path fabric, and shared storage is provided by a combination of NFS (for HOME directories) and Lustre filesystems.
Doing their work on an Intel Xeon Gold processor-based cluster also gave them the opportunity to scale their workloads to utilize numerous CPU cores to do the work faster. Their “scale-out” work was particular effective in delivering high-performance in addition to the high accuracy that was their primary mission.
The researchers are studying ways to improve their system’s ability to assist doctors. They are eager to utilize new datasets emerging from the medical community that offer additional high-resolution labelled data on which to train and test.
Damian Podareanu, HPC & AI Consultant at SURFsara, explained, “The general perception might be that the world is digital, and everything is readily available to be analyzed by computers, however that is not true. Institutions are making strides toward that goal, but we are by no means at that stage today. Our work with CPUs helps take the data we do have, and do more with it. It also highlights a future of enormous possibilities using CPUs as additional high-quality datasets become available over the next few years.”
Knowing they will have limited amounts of real data in the near future, they are working to employ novel techniques to auto-generate training examples for rarer cases. They will work closely with radiology specialists to get guidance and feedback on such auto-gen work to help make sure it mimics real x-ray images in the hope that auto-generated x-ray images can be shown to improve interpretation of real x-ray images.
Examination of x-ray images using today’s AI remains limited because the entire system is designed with absolutely no knowledge of human anatomy. Expert humans have a non-trivial advantage in this respect. The AI techniques are based on the images they are provided with, and the machine has no other knowledge about what it is “seeing.” This is similar to image-based AI driving algorithms that do not consider the physics of what it sees — children, bicycles, pets, bouncing balls, and other visible moving objects. For now, this lack of anatomical knowledge will limit AI x-ray interpretation to only assist—and not replace—doctors.
- The researchers; Valeriu Codreanu and Damian Podareanu of SURFsara B.V., along with Vikram Saletore of the AI group at Intel Corp.; have a detailed paper on their techniques, and the work on scaling-out on CPUs including use of collapsed ensembles. These same researchers, joined by Lucas Wilson and Srinivas Varadharajan of the AI group at Dell EMC, also summarized their most recent work on a poster at ISC19, including their latest achievements in classification accuracy. If you want to replicate their work, the researchers will gladly share their models, hyperparameters, and control methodology to enable you to reproduce their neural network training.
- You can read about the latest CPUs for such work, in the 2nd Generation Intel® Xeon® Scalable Processor Ushers in a New Era of Performance for HPC Platforms technical brief explaining how systems built with the latest Intel processors offer performance for a broad range of HPC workloads, and a compelling solution for integrating HPC, data analytics and AI in a single system.