
ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-06102025-173843


Thesis type
Master's degree thesis LM6
Author
LEO, EDOARDO
URN
etd-06102025-173843
Title
Benchmarking Large Language Models with and without RAG on expert clinical knowledge: the case of the Italian Association of Sleep Medicine expert examination
Department
RICERCA TRASLAZIONALE E DELLE NUOVE TECNOLOGIE IN MEDICINA E CHIRURGIA
Degree programme
MEDICINA E CHIRURGIA
Supervisors
supervisor Prof. Faraguna, Ugo
Keywords
  • AI and medicine
  • LLM and medicine
  • Sleep medicine and artificial intelligence
Session start date
15/07/2025
Availability
Not available for consultation
Release date
15/07/2095
Abstract
The application of Large Language Models (LLMs) in medical contexts is rapidly expanding. However, their reliability in addressing highly specialized clinical topics remains a key issue. This study evaluates the accuracy of various LLMs in answering 50 multiple-choice questions from the official test of the Italian Association of Sleep Medicine (AIMS), a mandatory step to obtain the national qualification of “Sleep Disorder Specialist” in Italy.

The experimenter posed each of the 50 questions, together with its four answer options, to each LLM and recorded the correctness of each answer in an Excel sheet. A score of 1 was assigned to each correct answer and 0 to each incorrect answer. To verify consistency, each question was asked five times to each LLM.
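As an illustration only, the scoring scheme described above could be summarized as in the following minimal Python sketch. The function name `summarize`, the dictionary layout, and the notion of "consistency" used here (all repetitions of a question receiving the same score) are assumptions made for this example, not the procedure documented in the thesis; the scores shown are hypothetical placeholders, not study data.

```python
def summarize(scores):
    """Summarize 0/1 scores for one model.

    scores: dict mapping a question id to a list of 0/1 scores,
    one entry per repetition of that question (assumed layout).
    Returns (mean accuracy, fraction of consistently scored questions).
    """
    # Fraction of correct answers for each question across its repetitions.
    per_question = {q: sum(s) / len(s) for q, s in scores.items()}
    # Overall accuracy: average of the per-question fractions.
    accuracy = sum(per_question.values()) / len(per_question)
    # Assumed consistency measure: every repetition received the same score.
    consistent = sum(1 for s in scores.values() if len(set(s)) == 1)
    return accuracy, consistent / len(scores)


# Hypothetical scores for three questions, five repetitions each (not thesis data).
example = {
    "Q1": [1, 1, 1, 1, 1],
    "Q2": [1, 0, 1, 1, 0],
    "Q3": [0, 0, 0, 0, 0],
}
accuracy, consistency = summarize(example)
print(f"accuracy={accuracy:.2f}, consistency={consistency:.2f}")
```

Averaging per question first keeps every question equally weighted even if the number of repetitions were to differ between questions.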
The models tested included Llama 3.2 3B, Llama 3.3 70B, Llama 3.2 3B augmented with Retrieval-Augmented Generation (RAG), Gemini 2.0 Flash, and NotebookLM (which also uses RAG). RAG lets the model retrieve content from supplied documents, such as TXT or PDF files, and consult it before answering. The uploaded documents were the books and papers recommended by AIMS for preparing for its test.
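To make the RAG idea above concrete, the following toy Python sketch retrieves the passage that shares the most words with a question and prepends it to the prompt. It is not the pipeline used in the thesis: production RAG systems typically rank passages by embedding similarity, and simple word overlap is swapped in here to keep the example self-contained. The names `tokenize`, `retrieve`, and `build_prompt`, the prompt wording, and all passages, questions, and options are hypothetical placeholders.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, passages, k=1):
    """Return the k passages sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q & tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, options, passages, k=1):
    """Assemble a prompt: retrieved context, then the question and its options."""
    context = "\n".join(retrieve(question, passages, k))
    choices = "\n".join(f"{label}) {text}" for label, text in zip("ABCD", options))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n{choices}\n"
        "Answer with one letter."
    )


# Placeholder material only (no real exam content or reference texts).
passages = [
    "Placeholder passage about sleep staging rules.",
    "Placeholder passage about insomnia treatment.",
]
question = "Placeholder question about sleep staging?"
options = ["option one", "option two", "option three", "option four"]
prompt = build_prompt(question, options, passages)
print(prompt)
```

The resulting prompt would then be sent to the language model, so that the answer can draw on the retrieved text rather than only on what the model learned during training.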

