Thesis etd-01282025-223029

Thesis type

Master's thesis

Author

ACAMPORA, VITTORIA

URN

etd-01282025-223029

Title

Development of an AI System for the Identification of Surgical Site Infections from Medical Reports

Department

INFORMATION ENGINEERING

Degree programme

ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING

Supervisors

supervisor Prof. Marcelloni, Francesco
co-supervisor Prof. Renda, Alessandro
co-supervisor Prof. Bondielli, Alessandro

Keywords

artificial intelligence
LLM
machine learning
NLP
surgical site infections
tf-idf
transformer architecture
word embedding

Graduation session start date

21/02/2025

Availability

Not available for consultation

Release date

21/02/2095

Abstract

This thesis develops a system based on machine learning, text mining, and natural language processing (NLP) techniques for detecting and classifying Surgical Site Infections (SSIs). The system uses a real-world dataset, provided by Cisanello Hospital in Pisa, of unstructured patient medical records written in Italian. The work analyses two main approaches to the classification task: traditional text vectorization techniques and a preliminary use of Large Language Models (LLMs). The former transforms text into vector representations that serve as input to machine learning algorithms, while the latter relies on the capabilities of pre-trained models to interpret and classify text directly. The text vectorization techniques used are Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT). Several classification algorithms, such as Logistic Regression, Random Forest, Decision Trees, and Extreme Gradient Boosting, are then used to classify the vector representations generated by the vectorization methods.
Results show that the best-performing model for the task is a BERT variant that was further pre-trained on the dataset text and then fine-tuned for the classification task; it outperformed all other techniques. The experiments also show that open generative LLMs, whether trained on biomedical data or on Italian texts, fail to grasp the complexity of the task despite the different configuration and inference settings tried.
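
As an illustration of the first approach named in the abstract, the following is a minimal sketch of TF-IDF vectorization in plain Python. It is not the thesis implementation, which is not available for consultation: the function names, the whitespace tokenizer, the unsmoothed logarithmic IDF, and the three toy English notes are all assumptions made for this example (the real dataset is in Italian and is not public).

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumed weighting: tf = term count / document length,
    idf = log(N / document frequency), without smoothing.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc)
        vectors.append(
            [(counts[term] / total) * idf[term] if total else 0.0 for term in vocab]
        )
    return vocab, vectors


# Invented toy notes, for illustration only.
reports = [
    "wound clean no infection",
    "wound infection with purulent discharge",
    "no fever wound clean",
]
vocab, vectors = tfidf_vectors(reports)
```

A term that occurs in every note (here "wound") receives zero weight, while a term found in a single note (here "purulent") receives a positive weight only in that note. Vectors of this kind could then be passed to a classifier such as Logistic Regression or Random Forest.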

File

File name	Size
The thesis is not available for consultation. Contact the author

ETD

Digital archive of theses defended at the University of Pisa
