Thesis etd-09092025-142853

Thesis type

Master's thesis

Author

NAMAKI GHANEH, DANIEL

URN

etd-09092025-142853

Title

Beyond Redundancy: Embedding-Aware Novelty Reranking in Retrieval-Augmented Generation

Department

INGEGNERIA DELL'INFORMAZIONE (Information Engineering)

Degree programme

ARTIFICIAL INTELLIGENCE AND DATA ENGINEERING

Supervisors

supervisor Prof. Tonellotto, Nicola
supervisor Prof. MacAvaney, Sean
supervisor Dr. Pezzuti, Francesca

Keywords

clustering
contrastive learning
diversification
embeddings
evaluation
factual coverage
gpqa
information
lexical retrieval
listwise reranker
loss function
mmlu
neural retrieval
nugget-based evaluation
nuggetizer
rag
redundancy
reranking
retrieval-augmented generation
semantic novelty
set-encoder
trec rag track

Defense session start date

02/10/2025

Availability

Full

Abstract

This thesis addresses informational redundancy in Retrieval-Augmented Generation (RAG) systems, in which large language models are supported by documents retrieved from external collections. Duplicated or paraphrased passages limit factual coverage and reduce the effectiveness of generated answers. To address this, the work studies a novelty-aware reranking approach in which semantic novelty is obtained through clustering in dense embedding spaces. The goal is to promote diversity in the content presented to the model by reducing redundancy and favoring the inclusion of complementary facts. The study evaluates lexical and semantic supervision strategies and proposes fine-tuning the Set-Encoder reranker with contrastive and listwise objectives. Evaluation is carried out on standard benchmarks (MMLU, GPQA) and on the TREC RAG Track 2024, using accuracy and nugget-based coverage metrics. The results show that the benefits of novelty-aware reranking emerge mainly in evaluation settings that reward fine-grained factual coverage, confirming the potential of embedding-aware approaches for building more informative and effective RAG pipelines.

File

File name	Size
thesis_namaki.pdf	2.97 MB

ETD

Digital archive of theses defended at the Università di Pisa
