
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-05062024-125402


Thesis type
Master's degree thesis
Author
PIPARO, FRANCO ITALO
URN
etd-05062024-125402
Title
EXPLAINABLE ARTIFICIAL INTELLIGENCE IN MEDICAL IMAGES CLASSIFICATION BY PROTOTYPICAL PART NETWORK
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
INGEGNERIA BIOMEDICA
Supervisors
Supervisor Prof. Vozzi, Giovanni
Supervisor Prof. Positano, Vincenzo
Supervisor Dr. De Santi, Lisa Anita
Keywords
  • explainable artificial intelligence
  • pneumonia
  • protopnet
  • prototypes
Graduation session start date
31/05/2024
Availability
Not available for consultation
Release date
31/05/2094
Abstract
In the medical imaging field, Convolutional Neural Networks (CNNs) are promising tools for diagnostic support due to their high performance. However, their opaque nature has slowed their adoption in healthcare. Explainable Artificial Intelligence (XAI) techniques aim to overcome this lack of transparency by enhancing interpretability without sacrificing performance. Reliable and standardized methods for evaluating the quality of the explanations provided by these techniques, however, are still needed.
Part-prototype models are a type of intrinsically explainable model operating on images. These models integrate classification and explanation, both based on the similarity between parts of a test image and prototypical parts (prototypes) of a specific class. The most representative implementation of this approach is the Prototypical Part Network (ProtoPNet), which leverages CNNs to learn prototypical parts of each class. In this work, we implement ProtoPNet to classify Normal/Pneumonia patients from chest X-ray images of a publicly available dataset, assessing the consistency and correctness of the model's explanations. We also investigate the model's generalization capability on unseen data.
ProtoPNet is composed of three modules: a CNN used as a feature extractor, a prototype layer, and a fully connected layer that acts as the decision layer. Using a specific loss function, the first two modules are trained to learn a meaningful latent space, in which the prototypes of different classes are clustered, in terms of L2 distance, around two separate centroids. During testing, the convolutional module compresses the input image into the latent space. The prototype layer then computes activation maps of similarities between the convolutional output and the prototypes. These maps are max-pooled into similarity scores, which indicate how strongly a prototype is present in the input image. The scores are finally weighted by the fully connected layer, producing class logits; the predicted class is the one with the highest logit. Additionally, the activation maps are upsampled to visualize both prototypes and prototypical activations as image patches, making the model explainable both globally and locally.
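A minimal, illustrative PyTorch sketch of such a forward pass is given below; the backbone (ResNet-18), the number of prototypes, and the latent dimension are placeholder assumptions, not the settings used in this work.

```python
# Minimal sketch of a ProtoPNet-style forward pass (illustrative only; backbone,
# prototype count, and latent dimension are assumptions, not the thesis settings).
import torch
import torch.nn as nn
from torchvision import models

class ProtoPNetSketch(nn.Module):
    def __init__(self, n_prototypes=20, n_classes=2, proto_dim=128):
        super().__init__()
        # 1) CNN feature extractor (ResNet-18 without its pooling/classification head)
        backbone = models.resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.add_on = nn.Conv2d(512, proto_dim, kernel_size=1)
        # 2) Prototype layer: learnable prototypical parts, one vector per prototype
        self.prototypes = nn.Parameter(torch.rand(n_prototypes, proto_dim))
        # 3) Fully connected decision layer: similarity scores -> class logits
        self.last_layer = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, x):
        z = torch.sigmoid(self.add_on(self.features(x)))           # (B, D, H, W) latent map
        z_flat = z.flatten(2).transpose(1, 2)                      # (B, H*W, D) latent patches
        # Squared L2 distance between every latent patch and every prototype
        d = ((z_flat.unsqueeze(2) - self.prototypes[None, None]) ** 2).sum(-1)  # (B, H*W, P)
        sim_maps = torch.log((d + 1) / (d + 1e-4))                 # similarity activation maps
        sim_scores = sim_maps.max(dim=1).values                    # max-pooling -> (B, P)
        logits = self.last_layer(sim_scores)                       # weighted scores -> logits
        return logits, sim_scores, sim_maps

model = ProtoPNetSketch()
logits, scores, maps = model(torch.rand(1, 3, 224, 224))
predicted_class = logits.argmax(dim=1)    # class with the highest logit
```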
The dataset chosen for this work contains 5856 frontal X-ray images of pediatric patients from a Chinese hospital. The dataset's composition is 73% Pneumonia and 27% Normal. Pneumonia images include both bacterial and viral pneumonia: the former is usually reflected in the image as localized opacities, while the latter manifests as a more diffuse interstitial pattern in both lungs. To preserve the dataset's composition and avoid data leakage, we perform a custom split of the dataset into a training set (80%) and a test set (20%).
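A minimal sketch of such a split follows, under the assumptions that data leakage refers to images of the same patient ending up in both sets and that a patient identifier is available for every image; the splitting function actually used in this work is not named here, and scikit-learn's StratifiedGroupKFold is shown as one way to obtain a class-stratified, patient-disjoint partition.

```python
# Sketch of a class-stratified, patient-disjoint 80/20 split (assumed approach).
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def train_test_split_80_20(labels, patient_ids, seed=0):
    """Return (train_idx, test_idx): ~80/20, class-stratified, no patient overlap."""
    sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=seed)
    dummy_X = np.zeros(len(labels))                  # features are not needed for splitting
    # Taking one of five folds as the test set yields an ~80/20 partition
    train_idx, test_idx = next(sgkf.split(dummy_X, labels, groups=patient_ids))
    return train_idx, test_idx
```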
Firstly, we train five ProtoPNets using 5-fold Cross Validation (CV) to assess whether the model's accuracy and its explanations remain consistent across different dataset splits. The splitting function replicates the dataset's composition in all folds and prevents data leakage. Images are converted to RGB, resized, and transformed into tensors whose values are scaled to [0, 1]. Additionally, we standardize the images with the dataset's mean and standard deviation for faster convergence. Since the dataset is unbalanced, we apply an offline data augmentation pipeline to the training sets, which includes geometric transformations and brings the two classes to the same number of training images.
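The preprocessing and offline geometric augmentation can be sketched with torchvision transforms as below; the image size, normalization statistics, and transformation parameters are placeholders, not the values used in this thesis.

```python
# Sketch of the preprocessing and offline augmentation pipeline (placeholder values).
from torchvision import transforms

IMG_SIZE = 224                                          # assumed input resolution
DATASET_MEAN, DATASET_STD = [0.48] * 3, [0.23] * 3      # placeholder dataset statistics

# Applied to every image: RGB conversion, resizing, tensor in [0, 1], standardization
preprocess = transforms.Compose([
    transforms.Lambda(lambda img: img.convert("RGB")),  # grayscale X-ray -> 3 channels
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),                              # scales pixel values to [0, 1]
    transforms.Normalize(DATASET_MEAN, DATASET_STD),
])

# Offline geometric augmentation applied to minority-class training images until
# both classes contain the same number of samples
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05), scale=(0.95, 1.05)),
])
```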
The consistency of the models' explanations is assessed by examining both their global and local reproducibility, for which we introduce novel approaches. The prototype average pair distance metric is used to monitor both the convergence of the prototypes over the epochs and their reproducibility in the latent space across folds (global explanation reproducibility). Additionally, we compute the L2 distance between the inter-class prototype centroids in the latent space for each of the five models using hierarchical clustering. We then show that the proposed metric correlates with this distance by computing the Pearson linear correlation coefficient between the five metric values and the corresponding distances obtained from hierarchical clustering.
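A hedged sketch of this analysis follows. The exact definition of the prototype average pair distance is assumed here to be the mean L2 distance over all inter-class prototype pairs, and average-linkage hierarchical clustering (SciPy) stands in for the clustering step; both are illustrative choices, not necessarily those used in this work.

```python
# Hedged sketch of the global-reproducibility analysis (assumed metric definitions).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist
from scipy.stats import pearsonr

def avg_pair_distance(protos_a, protos_b):
    """Mean L2 distance between prototypes of the two classes (assumed metric)."""
    return cdist(protos_a, protos_b, metric="euclidean").mean()

def interclass_centroid_distance(protos_a, protos_b):
    """Height of the final merge in average-linkage clustering of all prototypes."""
    tree = linkage(np.vstack([protos_a, protos_b]), method="average", metric="euclidean")
    return tree[-1, 2]

# Dummy stand-in data: five folds, ten 128-d prototypes per class in each fold
rng = np.random.default_rng(0)
folds = [(rng.normal(0, 1, (10, 128)), rng.normal(3 + k, 1, (10, 128))) for k in range(5)]

metric_vals = [avg_pair_distance(a, b) for a, b in folds]
centroid_dists = [interclass_centroid_distance(a, b) for a, b in folds]
r, _ = pearsonr(metric_vals, centroid_dists)   # correlation between the two quantities
```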
To quantify whether the five models provide consistent explanations for a given test image (local explanation reproducibility), we employ the Dice index to compute the average overlap between the image patches most strongly activated by each model.
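A minimal sketch of this computation, assuming the most activated patch of each model is represented as a binary mask over the image (mask construction from the upsampled activation maps is not shown):

```python
# Sketch of the Dice-based local-reproducibility check over binary patch masks.
import numpy as np

def dice_index(mask_a, mask_b):
    """Dice overlap between two binary masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    total = mask_a.sum() + mask_b.sum()
    return 2.0 * intersection / total if total > 0 else 1.0

def average_pairwise_dice(masks):
    """Mean Dice over all pairs of models' masks for the same test image."""
    pairs = [(i, j) for i in range(len(masks)) for j in range(i + 1, len(masks))]
    return float(np.mean([dice_index(masks[i], masks[j]) for i, j in pairs]))
```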
For part-prototype models, evaluating the correctness of the models' explanations translates into assessing the correctness of the prototypical patch reconstruction. This is done through a single-deletion experiment, which consists of deleting the prototypical patch from its source image and observing how the similarity score changes when the perturbed image is forwarded through the network. For comparison, we also compute, on the same image, the average of the similarity scores obtained by deleting random patches.
To assess their generalization capability, we test all five cross-validated models on the hold-out internal test set. Furthermore, we evaluate the performance of our networks on images from an external dataset, which contains Normal and Pneumonia chest X-ray images from a cohort of subjects that is more heterogeneous in terms of age. This dataset is unbalanced in favor of the Normal class and comprises images collected from an American hospital. We sample two test sets, one maintaining the composition of the internal dataset and the other reflecting that of the external dataset, named 'external 1' and 'external 2', respectively. The model that achieves the highest accuracy on the internal test set is chosen to present examples of local explanations of test images.
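The single-deletion experiment described above can be sketched as follows, assuming that deleting a patch means zeroing its pixels and reusing the ProtoPNet-style model sketched earlier (which returns logits, similarity scores, and maps); patch coordinates, patch size, and the number of random trials are illustrative.

```python
# Sketch of the single-deletion experiment and its random-deletion baseline.
import torch

def similarity_after_deletion(model, image, box, proto_idx):
    """Zero the patch given by box = (y0, y1, x0, x1) and return the new similarity score."""
    y0, y1, x0, x1 = box
    perturbed = image.clone()
    perturbed[:, y0:y1, x0:x1] = 0.0                    # delete the prototypical patch
    with torch.no_grad():
        _, sim_scores, _ = model(perturbed.unsqueeze(0))
    return sim_scores[0, proto_idx].item()

def random_deletion_baseline(model, image, patch_hw, proto_idx, n_trials=20, seed=0):
    """Average similarity score after deleting randomly located patches of the same size."""
    g = torch.Generator().manual_seed(seed)
    _, H, W = image.shape
    ph, pw = patch_hw
    scores = []
    for _ in range(n_trials):
        y0 = torch.randint(0, H - ph + 1, (1,), generator=g).item()
        x0 = torch.randint(0, W - pw + 1, (1,), generator=g).item()
        scores.append(similarity_after_deletion(model, image, (y0, y0 + ph, x0, x0 + pw), proto_idx))
    return sum(scores) / len(scores)
```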
The 5-fold CV leads to similar validation accuracies across the folds (96.94 ± 0.34%), suggesting that the network's performance is independent of the data split. Regarding global explanation reproducibility, the proposed metric converges to the same value (44.14 ± 0.50) in all five models and yields a high correlation coefficient (0.99) with the distances computed through hierarchical clustering. The local explanation reproducibility experiment shows that the models predominantly focus on the same patches to classify the same image, reflected by an average Dice index higher than 0.60 in 3 out of 4 examples. These results indicate that models trained on different data splits provide consistent explanations.
Results from the correctness experiment show that deleting the prototypical patch from its source image significantly decreases the similarity score, suggesting that the patches reconstructed via upsampling are important for the network to identify a prototype in its source image. On the other hand, random patch deletions also, and unexpectedly, lead to low similarity scores, so we cannot exclude the possibility that pixels outside the reconstructed patches are also important in representing the features learned by the prototype layer.
Our 5-fold models exhibit a mean accuracy of 97.27 ± 0.33% on the internal test set, close to the mean accuracy observed on the validation sets. The best model achieves an accuracy of 97.60%, comparable to the benchmark accuracy of 98.99% obtained by a black-box model. However, performance notably declines on the external test sets: the mean accuracies on external 1 and external 2 are 87.96 ± 2.08% and 78.84 ± 6.14%, respectively, a significant decrease compared to the internal test set. The drop is particularly evident in specificity, suggesting that our models struggle to recognize Normal images, whereas recall (sensitivity) remains relatively stable. This discrepancy may be due to several factors, such as statistical differences between the two datasets and the class imbalance of the training data. The local explanations of our best model appear intuitive and self-explanatory; however, we deem a clinical evaluation of these explanations necessary.
In conclusion, our work demonstrates that ProtoPNet is a promising explainable model, capable of matching the performance of black-box models. We have shown the consistency of the model's explanations through novel approaches at both the global and local level. However, our results also indicate that further analyses are needed to assess the correctness of ProtoPNet's explanations. Moreover, the results underline significant challenges in generalizing the models to external data. Lastly, we deem a clinical evaluation of the model's explanations essential to ensure their utility and reliability in medical practice.