Professor and Department Head Rensselaer Polytechnic Institute, United States
Introduction:: Clinicians often use medical imaging data to diagnose diseases such as cancer[1]. In recent years, machine learning (ML) models, or classifiers, have been used increasingly often to improve the quality of these diagnoses[1]. While ML classifiers have achieved state-of-the-art performance, little is known about their robustness, i.e., the reliability of the classifier when the data is noisy[2]. Current attempts to characterize robustness consider only a single condition, e.g., when the testing set contains noise (perturbations) and the training set is clean (unperturbed)[2]. These methods do not address the case in which both the training and testing sets simultaneously contain noise, such as noise arising from differences in clinical or testing procedures. This study examines how different amounts of noisy data in both the training and testing sets, including noise types commonly found in medical images, affect the robustness of a medical imaging classifier.
Materials and Methods:: Classifier robustness to five different perturbations was evaluated: Gaussian noise, defocus blur, contrast, rotation, and tilt. Robustness was assessed by training a classifier f on clean training data and a classifier f_p^n (with the same architecture) on training data in which p% of the samples were perturbed by perturbation n (e.g., blur). These classifiers were first tested on the clean testing data to obtain baseline error rates. Their performance on noisy testing data was then evaluated by perturbing the testing set in increments of 20% of the data. For a classifier to be deemed robust, it should maintain a low error rate when the testing data is both clean and somewhat noisy. This approach was demonstrated on 10 unique classifiers (neural network architectures) and evaluated using the publicly available PneumoniaMNIST dataset[3]. This dataset contains chest X-ray images of patients diagnosed with and without pneumonia.
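The following is a minimal sketch of the train/test perturbation protocol described above; it is not the authors' code. It assumes images are numpy arrays scaled to [0, 1], uses a simple scikit-learn MLP as a stand-in for the 10 neural-network architectures, and shows only the Gaussian-noise perturbation. The helper names (perturb_gaussian, perturb_fraction, evaluate_protocol) are illustrative, not from the study.

```python
# Sketch of the robustness protocol: train f on clean data and f_p^n on p%-perturbed
# data, then test both on test sets perturbed in 20% increments.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def perturb_gaussian(images, sigma=0.1):
    """Add zero-mean Gaussian noise and clip back to the valid intensity range."""
    return np.clip(images + rng.normal(0.0, sigma, images.shape), 0.0, 1.0)

def perturb_fraction(images, p, perturb):
    """Perturb a randomly chosen p% of the images, leaving the rest clean."""
    out = images.copy()
    idx = rng.choice(len(images), size=int(len(images) * p / 100), replace=False)
    out[idx] = perturb(out[idx])
    return out

def error_rate(clf, X, y):
    return 1.0 - clf.score(X.reshape(len(X), -1), y)

def evaluate_protocol(X_tr, y_tr, X_te, y_te, p=20, perturb=perturb_gaussian):
    # f: trained on clean data; f_p^n: same architecture, p% of training data perturbed.
    f = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    f_pn = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
    f.fit(X_tr.reshape(len(X_tr), -1), y_tr)
    f_pn.fit(perturb_fraction(X_tr, p, perturb).reshape(len(X_tr), -1), y_tr)

    # Test on increasingly perturbed copies of the test set (0%, 20%, ..., 100%).
    results = {}
    for q in range(0, 101, 20):
        X_q = perturb_fraction(X_te, q, perturb)
        results[q] = (error_rate(f, X_q, y_te), error_rate(f_pn, X_q, y_te))
    return results

if __name__ == "__main__":
    # Tiny synthetic stand-in for PneumoniaMNIST-sized (28x28) grayscale images.
    X = rng.random((400, 28, 28))
    y = (X.mean(axis=(1, 2)) > 0.5).astype(int)
    print(evaluate_protocol(X[:300], y[:300], X[300:], y[300:]))
```

In this hypothetical setup, the other perturbations (defocus blur, contrast, rotation, tilt) would plug in as alternative perturb functions, and the loop over q reproduces the 20% testing-set increments described above.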
Results, Conclusions, and Discussions:: The average error rate of all 10 classifiers trained on clean (unperturbed) training data and tested on clean testing data is 15.3%. As can be seen in Fig. 1a, this error rate does not change significantly as the fraction of perturbed training data increases, until all of the training data is perturbed. Additionally, classifiers trained with 20% (p = 20) perturbed training data perform significantly better on perturbed testing data than the clean classifier f for certain perturbations, such as blur. Fig. 1b shows the difference between f_20^blur and f on noisy testing data. As such, f_20^blur is likely more robust to blur than f, as it performs similarly on clean testing data and significantly better on noisy testing data.
Overall, this study evaluated the robustness of medical imaging classifiers. To this end, portions of the training and testing sets were perturbed, and classifiers trained with this noisy data were evaluated. For the dataset under study, it can be concluded that for some perturbations, training the classifier on partially perturbed data can improve its robustness.
Acknowledgements:: The authors would like to acknowledge the National Institute on Aging (T32GM067545) for funding this opportunity.
References:: [1] Giger M., JACR, 2018, Vol. 15, pp. 512-520. [2] Hendrycks D., ICLR, 2019. [3] Yang J., Sci. Data, 2023, Vol. 10, p. 41.