ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-01282025-223029


Thesis type
Master's thesis
Author
ACAMPORA, VITTORIA
URN
etd-01282025-223029
Title
Development of an AI System for the Identification of Surgical Site Infections from Medical Reports
Department
INGEGNERIA DELL'INFORMAZIONE
Degree programme
ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING
Supervisors
supervisor Prof. Marcelloni, Francesco
co-supervisor Prof. Renda, Alessandro
co-supervisor Prof. Bondielli, Alessandro
Keywords
  • artificial intelligence
  • LLM
  • machine learning
  • NLP
  • surgical site infections
  • tf-idf
  • transformer architecture
  • word embedding
Defence session start date
21/02/2025
Availability
Not available for consultation
Release date
21/02/2095
Abstract
This thesis proposes a system based on machine learning, text mining, and natural language processing (NLP) techniques for the detection and classification of Surgical Site Infections (SSIs). The system is built on a real-world dataset, provided by Cisanello Hospital in Pisa, containing unstructured patient medical records written in Italian. Two main approaches to the classification task are analysed: traditional text vectorization techniques and a preliminary use of Large Language Models (LLMs). The first transforms text into vector representations that are fed to machine learning algorithms, while the second relies on the capabilities of pre-trained models to interpret and classify text directly. The text vectorization techniques used are Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT). The resulting embeddings are then classified with several algorithms, including Logistic Regression, Random Forest, Decision Trees, and Extreme Gradient Boosting.
Results show that the best-performing model for this task is a BERT variant that was first further pre-trained on the dataset text and then fine-tuned for the classification task; it outperformed all other techniques. The experiments also show that open generative LLMs, whether trained on biomedical data or on Italian texts, fail to grasp the complexity of the task across the different configurations and inference settings tested.
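As an illustration of the first approach described in the abstract (text vectorization followed by a conventional classifier), the following is a minimal sketch that pairs TF-IDF with Logistic Regression. It is not the thesis's implementation, which is not available in this record. The function names, the smoothed IDF variant, the gradient-descent settings, and the four toy notes (written in English for readability and entirely invented) are all assumptions made for the example.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(doc, idf, vocab):
    """L2-normalised TF-IDF vector; terms unseen at fit time are ignored."""
    tf = Counter(doc.split())
    vec = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def train_logreg(X, y, lr=0.5, epochs=500):
    """Binary logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(x, w, b):
    """Return 1 (infection suspected) when the linear score is positive."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Invented toy notes; 1 = infection suspected, 0 = no infection.
notes = [
    "wound red swollen with purulent discharge",
    "fever and purulent discharge from incision",
    "wound clean and dry healing well",
    "incision dry no fever healing well",
]
labels = [1, 1, 0, 0]

idf = fit_idf(notes)
vocab = sorted(idf)
X = [tfidf_vector(note, idf, vocab) for note in notes]
w, b = train_logreg(X, labels)
train_preds = [predict(x, w, b) for x in X]
print(train_preds)
```

In this sketch the vectorizer and the classifier are decoupled, so the same `train_logreg` step could in principle be fed Word2Vec or BERT embeddings instead of TF-IDF vectors, mirroring the comparison the abstract describes.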