
ETD

Digital archive of theses discussed at the University of Pisa

Thesis etd-02032026-133012


Thesis type
Master's thesis
Author
SAGHEDDU, CINZIA
URN
etd-02032026-133012
Title
DEVELOPMENT OF ARTIFICIAL INTELLIGENCE ALGORITHMS FOR THE CHARACTERIZATION OF MULTIPLE MYELOMA FROM PET/CT IMAGES
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
INGEGNERIA BIOMEDICA
Supervisors
supervisor Prof. Callara, Alejandro Luis
supervisor Prof. Positano, Vincenzo
supervisor Dott. Genovesi, Dario
tutor Dott. Arcangeli, Andrea
Keywords
  • binary classification
  • bone lesions
  • convolutional neural network (CNN)
  • deep learning
  • maximum intensity projection (MIP)
  • multiple myeloma (MM)
  • residual networks (ResNets)
  • skeletal involvement
  • standardised uptake value (SUV)
  • [18F]FDG PET/CT
Defense date
23/02/2026
Availability
Not available
Release date
23/02/2029
Abstract (English)
Multiple myeloma (MM) is a rare haematological disorder characterized by the proliferation of neoplastic plasma cells within the bone marrow. It predominantly affects the elderly and presents heterogeneous clinical manifestations summarized by the acronym CRAB: hypercalcemia, renal insufficiency, anaemia, and bone lesions. Among diagnostic and follow-up strategies, [18F]FDG PET/CT plays a central role, as it allows the identification of metabolically active lesions and the quantification of their uptake using standardized uptake values (SUVs).
Lesions in multiple myeloma primarily localize within the axial skeleton, including vertebrae, sternum, ribs, pelvis, humeri, and femora, but they may also involve the appendicular skeleton or soft tissues, displaying significant heterogeneity in terms of localization, number, and morphology. However, the interpretation of PET/CT images remains challenging due to the lack of standardization and the absence of clinically accepted quantitative measures of tumour burden, in addition to the substantial workload required for visual analysis, which is subjective. Artificial intelligence (AI) methods represent a potential solution to these limitations.
Scientific literature documents the combined use of radiomic and clinical features for prognostic prediction in MM patients, while other studies employ neural networks to segment the skeleton from CT images and to classify or segment lesions from PET scans.
The present thesis aims to develop AI methods for the early and non-invasive diagnosis of MM by leveraging PET/CT images and, when appropriate, integrating them with demographic (age and sex), anthropometric (weight), and dosimetric ([18F]FDG injected activity) variables, collectively referred to as clinical variables. The objective is to evaluate the potential of these methodologies in an exploratory scenario, where they could complement and, in the future, reduce the reliance on bone marrow biopsy, which remains the diagnostic gold standard despite its invasiveness.
The primary contribution of this work consists of a systematic comparison of approaches based on engineered features and end-to-end deep learning models for MM classification from PET/CT images within a realistic clinical context using limited data. The dataset comprises 125 scans from 68 patients of the Haematology Unit of the Azienda Ospedaliera Pisana, performed between 2015 and 2025 at the Nuclear Medicine Unit of the San Cataldo Hospital, Fondazione Toscana Gabriele Monasterio. Scans acquired using two different scanners (GE and Siemens) cover varying fields of view (whole-body, head-to-pelvis, head-to-knee), and the study has a retrospective design.
The first phase of the work focused on pre-processing: positivity labels were generated from nuclear medicine reports, PET images were converted into SUV values, and skeleton, liver, and gluteus regions were segmented from CT images using TotalSegmentator. The second phase involved patient classification, evaluating four different approaches based on model complexity and degree of feature engineering, maintaining a consistent data split to ensure comparability of results.
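The SUV conversion mentioned above can be sketched as follows. This is a minimal illustration of body-weight SUV normalisation only; the actual pipeline depends on DICOM metadata (decay correction to injection time, rescale slope/intercept), which is omitted here:

```python
import numpy as np

def suv_bw(activity_conc_bq_ml: np.ndarray,
           injected_dose_bq: float,
           body_weight_kg: float) -> np.ndarray:
    """Body-weight SUV: voxel activity concentration (Bq/mL) divided by
    injected dose (Bq) per gram of body weight. Assumes the activity map
    is already decay-corrected to the injection time."""
    return activity_conc_bq_ml * (body_weight_kg * 1000.0) / injected_dose_bq

# toy example (hypothetical values): 70 kg patient, 300 MBq injected
img = np.array([[4285.7, 8571.4]])  # activity concentration in Bq/mL
suv = suv_bw(img, injected_dose_bq=300e6, body_weight_kg=70.0)
# voxels at the mean whole-body concentration map to SUV ≈ 1
```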
In the first approach, based on engineered features, statistics on bone SUVs were extracted, subsequently normalized against the mean SUV of the liver or gluteus, and used as inputs for neural networks (NN2 and NN3), emulating the manual workflow of a nuclear medicine physician. NN2 and NN3 consist of two and three BasicBlocks, respectively, each composed of a Linear layer followed by Batch Normalization, a ReLU activation, and Dropout. The SUVs were then integrated with clinical variables and the scanner variable. This approach yielded the most stable models, although it required a greater workload. NN2 and NN3 models were found to be nearly equivalent, with no specific input type proving more predictive. Moreover, reducing the number of input features through Principal Component Analysis (PCA) did not lead to any performance improvement. Integration of clinical or scanner variables did not improve performance, although clinical variables were moderately predictive in dedicated models (such as Logistic Regression, Random Forest Classifier, and ClinicalOnlyNet), with Area Under the Curve (AUC) exceeding 0.7 and balanced accuracies between 0.6 and 0.7. ClinicalOnlyNet consists of a Linear layer, ReLU activation, Dropout and a final Linear layer. The best NN2 model, obtained via cross-validation, achieved a test-set Receiver Operating Characteristic (ROC) AUC of 0.76 and a balanced accuracy of 0.72. Explainable AI (XAI) techniques, such as Permutation Feature Importance and SHAP, together with model behaviour analyses such as Feature Ablation Studies and Logit Variation Analysis, demonstrated that network attention was distributed across the bones of the entire skeleton, consistent with lesion variability, with particular emphasis on the eleventh right rib, vertebrae, sacrum, and ulna.
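The BasicBlock and NN2 architecture described above can be sketched in PyTorch. Layer widths, dropout rate, and the number of input features are illustrative assumptions, not the thesis's actual hyperparameters:

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Linear -> Batch Normalization -> ReLU -> Dropout, as described for
    NN2/NN3. Width and dropout probability are placeholder choices."""
    def __init__(self, in_features: int, out_features: int, p_drop: float = 0.3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_features, out_features),
            nn.BatchNorm1d(out_features),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
        )

    def forward(self, x):
        return self.block(x)

class NN2(nn.Module):
    """Two stacked BasicBlocks followed by a single-logit binary head."""
    def __init__(self, n_features: int, hidden=(64, 32)):
        super().__init__()
        self.net = nn.Sequential(
            BasicBlock(n_features, hidden[0]),
            BasicBlock(hidden[0], hidden[1]),
            nn.Linear(hidden[1], 1),  # logit for the MM-positive class
        )

    def forward(self, x):
        return self.net(x)

# batch of 8 hypothetical bone-SUV feature vectors with 20 features each
model = NN2(n_features=20)
logits = model(torch.randn(8, 20))
```

NN3 would differ only by a third stacked BasicBlock before the head.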
The second approach employed 3D Convolutional Neural Networks (CNNs), including a pretrained ResNet18 and a MiniResNet trained from scratch, using PET-SUV images and skeletal masks as inputs. MiniResNet was built by simplifying the ResNet architecture and comprises only two ResidualBlocks. Different integration strategies were explored: in some experiments, the mask or the product between the mask and PET-SUV was provided as a second channel, while in others only the masked PET-SUV was used, with total or partial occlusion of extra-skeletal signals. A persistent overfitting issue emerged, due to the imbalance between the number of learnable parameters and the available samples, compounded by the prevalence of inactive voxels. The best 3D model achieved an AUC of 0.65 and a balanced accuracy of 0.64, while integration of the masks, with or without the skull, as a second channel improved performance, with AUCs of 0.78 and 0.76 and balanced accuracies of 0.71 and 0.73, respectively.
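A MiniResNet of the kind described above — two residual blocks operating on a two-channel volume (PET-SUV plus skeletal mask) — can be sketched as follows. Channel counts, strides, and pooling are placeholder assumptions:

```python
import torch
import torch.nn as nn

class ResidualBlock3D(nn.Module):
    """Two Conv3d/BatchNorm3d layers with a skip connection; a 1x1x1
    projection aligns the shortcut when channels or resolution change."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_ch)
        self.conv2 = nn.Conv3d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm3d(out_ch),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class MiniResNet(nn.Module):
    """Two ResidualBlocks, global average pooling, and a binary head.
    Two input channels accommodate PET-SUV and the skeletal mask."""
    def __init__(self, in_ch: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            ResidualBlock3D(in_ch, 8, stride=2),
            ResidualBlock3D(8, 16, stride=2),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

# batch of 2 toy volumes: (batch, channels, depth, height, width)
vol = torch.randn(2, 2, 32, 32, 32)
logits = MiniResNet()(vol)
```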
The third approach used maximum intensity projections (MIPs) of PET-SUV or masked PET-SUV images as inputs for 2D CNNs, including MultiViewResNet18 and UltraMinimalCNN. MultiViewResNet18 employs a shared backbone based on a two-dimensional ResNet18, adapted for single-channel input and initialized with pretrained weights. UltraMinimalCNN consists of three MinimalBlocks, each composed of a 2D Convolutional layer, GroupNorm normalization, a ReLU activation function, and a MaxPooling operation. Various masked MIPs were calculated (including or excluding the cranial region, considering only axial bones, and splitting the volume into slabs), and in some experiments a majority-voting strategy across models on individual MIPs was applied. The best-performing model was UltraMinimalCNN, owing to its minimal architecture better suited to the limited dataset dimensionality. The best test-set performance was achieved using global MIPs derived from PET-SUV or the global coronal MIP alone, with AUCs of 0.68 and 0.72 and balanced accuracies of 0.70. However, XAI techniques (GradCAM, Occlusion Sensitivity Map, and LIME) revealed predominant attention to organs with high physiological uptake, such as the brain, heart, liver, and bladder. Occlusion of extra-skeletal signals and exclusion of the cranial region led to less effective models, with AUCs of 0.59 and 0.51 and balanced accuracies of 0.51 and 0.44, although XAI analyses qualitatively indicated behaviour more consistent with clinical reasoning, focusing on different skeletal regions depending on lesion distribution. In voting-based approaches, networks operating on individual MIPs showed nearly equivalent performance, suggesting equal informational contribution from the three MIPs.
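The MIP inputs used above can be computed by taking the voxel-wise maximum along each anatomical axis of the PET-SUV volume. The axis order below is an assumption; it must be matched to the actual image orientation:

```python
import numpy as np

def mips(volume: np.ndarray) -> dict:
    """Maximum intensity projections of a PET-SUV volume assumed to be
    stored as (z: axial slices, y: coronal rows, x: sagittal columns)."""
    return {
        "axial":    volume.max(axis=0),  # project along z
        "coronal":  volume.max(axis=1),  # project along y
        "sagittal": volume.max(axis=2),  # project along x
    }

# toy volume with a single hot voxel at (z=2, y=3, x=1)
vol = np.zeros((4, 5, 6))
vol[2, 3, 1] = 7.5
proj = mips(vol)
# the hot voxel survives into all three projections
```

Restricting the projection to masked PET-SUV (skeleton-only, or slab-wise sub-volumes) amounts to zeroing voxels outside the mask before calling `mips`.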
The fourth approach implemented a multimodal model (MultiModalNet) combining features extracted from MIPs with clinical variables, using the best-performing models identified for each input type. MultiModalNet consists of an image encoder, based on UltraMinimalCNN, a clinical encoder, based on ClinicalOnlyNet, and a classification head. The aim was to assess whether integrating additional information could improve performance. Results did not show the expected benefit, with a test-set AUC of 0.47 and balanced accuracy of 0.52, and the model exhibited strong instability across experiments and cross-validation, indicating limited generalizability.
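The late-fusion scheme of MultiModalNet — image embedding and clinical embedding concatenated into a classification head — can be sketched as follows. The encoder internals and embedding sizes here are placeholder stand-ins for the thesis's UltraMinimalCNN and ClinicalOnlyNet:

```python
import torch
import torch.nn as nn

class MultiModalNet(nn.Module):
    """Late fusion: concatenate an image embedding and a clinical
    embedding, then classify. Dimensions are illustrative assumptions."""
    def __init__(self, n_clinical: int = 4, img_dim: int = 16, clin_dim: int = 8):
        super().__init__()
        self.image_encoder = nn.Sequential(   # stand-in for UltraMinimalCNN
            nn.Conv2d(1, img_dim, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.clinical_encoder = nn.Sequential(  # stand-in for ClinicalOnlyNet
            nn.Linear(n_clinical, clin_dim),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(img_dim + clin_dim, 1)

    def forward(self, mip, clinical):
        z = torch.cat([self.image_encoder(mip),
                       self.clinical_encoder(clinical)], dim=1)
        return self.head(z)

# batch of 3 toy MIPs (1 channel, 64x64) with 4 clinical variables each
logits = MultiModalNet()(torch.randn(3, 1, 64, 64), torch.randn(3, 4))
```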
Based on the obtained results, no classification approach for MM patients was clearly superior to the others, although models based on bone SUVs were generally more stable. However, CNN-based models, such as 3D ResNet18 using PET-SUV and skeletal masks as input, and UltraMinimalCNN, showed performance comparable to models emulating the clinical workflow, confirming the ability of CNN architectures to extract relevant clinical features. The addition of available clinical variables did not significantly improve performance. A direct comparison of the achieved results with the literature is not possible, as similar studies have focused on survival prediction, which was not assessed in this thesis.
All models were limited by the small dataset size (79 training samples) and its heterogeneity in terms of scanners and fields of view. Additionally, generalizability was further affected by dataset imbalance, with a predominance of positive cases (78), and likely by the partial representativeness of the cohort relative to the MM population. Another limitation was the absence of an external validation cohort, highlighting the need to evaluate larger and more balanced samples. An additional concern relates to models focused exclusively on the skeleton or excluding the cranial region, as they do not consider extra-skeletal or cranial lesions, which may represent manifestations of extramedullary or paramedullary disease.
A future development of this work will involve integrating clinical variables extracted from haematological biopsy and biochemical reports, such as calcium levels, β2-microglobulin, immunoglobulins, and free light chains. Their inclusion will allow a more comprehensive replication of the clinical diagnostic process for MM, leveraging the ability of models to learn from large amounts of heterogeneous data.
In conclusion, this work demonstrates that, based on the available data, models operating on engineered and interpretable features are slightly more stable than end-to-end deep learning models applied directly to PET-SUV images or related MIPs, although they require a greater workload. These results should be reassessed on larger and more balanced cohorts.