ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-01222026-094706

Thesis type

PhD thesis

URN

etd-01222026-094706

Title

Foundation Models for Automatic Labeling in Software Engineering

Scientific disciplinary sector

INF/01 - INFORMATICA (Computer Science)

Degree programme

DOTTORATO NAZIONALE IN INTELLIGENZA ARTIFICIALE (National PhD Programme in Artificial Intelligence)

Keywords

automated labeling
BERT
few-shot learning
foundation models
issue classification
issue tracking systems
large language models
LLMs
natural language processing
NLP
software engineering
zero-shot learning

Defense session start date

02/03/2026

Availability

Not available for consultation

Release date

02/03/2029

Abstract (English)

This thesis investigates the application of foundation models for automating labeling tasks in software engineering, focusing on issue classification as a primary case study. Issue tracking systems are essential for collaborative software development, yet manual labeling of issue reports is often inconsistent and time-consuming, with approximately 33.8% of reports being incorrectly labeled. Traditional supervised machine learning approaches require substantial labeled training data, creating barriers for new or resource-constrained projects.

The research addresses two key questions: the extent to which foundation models can be leveraged for automated issue labeling, and which models offer optimal trade-offs among performance, computational costs, and scalability. Through comprehensive studies, the work evaluates the impact of data quality on classification performance, examines few-shot learning approaches for limited data scenarios, assesses generative language models in zero-shot and few-shot settings, and conducts extensive benchmarking across various foundation models and hardware configurations. The approaches are validated through collaboration with NASA Goddard Space Flight Center on mission-critical flight software systems.

Key findings demonstrate that BERT-based few-shot learning can outperform larger models on high-quality datasets, zero-shot methods achieve performance comparable to supervised approaches, and open-source models can match proprietary systems while offering transparency advantages. The research provides practical guidelines for model selection and supports progressive deployment strategies, enabling organizations to initially adopt zero-shot generative models for rapid automation and transition to fine-tuned models as labeled data becomes available, effectively addressing the cold-start problem in automated classification systems.

File

File name	Size
The thesis is not available for consultation. Contact the author