ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-02122026-203557


Thesis type
Master's thesis
Author
LEONE, CRISTIAN
URN
etd-02122026-203557
Title
Exploiting LLMs for software code comprehension and explanation
Department
COMPUTER SCIENCE
Degree programme
DATA SCIENCE AND BUSINESS INFORMATICS
Supervisors
Supervisor: Prof. Ruggieri, Salvatore
Keywords
  • Code Summarization
  • Code Understanding
  • Data Lineage
  • Large Language Models
Graduation session start date
27/02/2026
Availability
Not available for consultation
Release date
27/02/2066
Abstract (English)
This thesis investigates how Large Language Models (LLMs) can support software understanding by generating natural-language explanations and structured artifacts directly from source code, with particular emphasis on data lineage. The work considers two complementary tasks. The first is behavior-focused code summarization, where models must produce short descriptions of what a function does under strict generation constraints; the thesis tests an evaluation protocol that combines human assessment and LLM-as-a-judge scoring across multiple quality dimensions. The second task targets column-level backward data lineage for ETL-like scripts, where the goal is to infer dependencies between input and output columns and represent them as a structured edge list, enabling traceability of transformations in data pipelines. Across both tasks, the thesis discusses dataset construction, annotation choices, prompt and output constraints, and the methodological challenges of evaluating model outputs when only static code is available (e.g., ambiguity, underspecification, and format compliance). Overall, the thesis provides a systematic framework for studying LLM-based software explanation and offers practical guidance for designing reliable evaluation setups for code summarization and data lineage extraction.
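To make the idea of a column-level lineage edge list concrete, the following is a minimal illustrative sketch and not the thesis's own format or method. It assumes each edge is a (source column, target column) pair; the column names and the `backward_lineage` helper are hypothetical and were chosen only for this example.

```python
from collections import defaultdict

# Hypothetical edge list for an ETL-like script (an assumption for
# illustration): each pair is (source_column, target_column).
EDGES = [
    ("orders.price", "staging.amount"),
    ("orders.quantity", "staging.amount"),
    ("staging.amount", "report.revenue"),
    ("customers.first_name", "report.customer"),
    ("customers.last_name", "report.customer"),
]


def backward_lineage(edges, column):
    """Return every column that `column` depends on, directly or transitively."""
    # Index the edges by target so we can walk from outputs back to inputs.
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)

    seen = set()
    stack = [column]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen
```

Under these assumptions, `backward_lineage(EDGES, "report.revenue")` yields the intermediate column `staging.amount` together with the two input columns `orders.price` and `orders.quantity`, which is the kind of traceability a backward lineage query is meant to provide.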
Abstract (Italian)
File