Thesis etd-10272022-223251

Thesis type

Master's thesis

Author

CALAMITA, LEONARDO

URN

etd-10272022-223251

Title

Improving the Semantic Proficiency of Large Language Models

Department

COMPUTER SCIENCE

Degree programme

COMPUTER SCIENCE

Supervisors

supervisor Prof. Attardi, Giuseppe

Keywords

bleu
data to text
data-to-text
large language models
machine learning
ML
natural language processing
NLP
semantic accuracy
ser
transformers

Exam session start date

02/12/2022

Availability

Full

Abstract

The thesis surveys the Data-to-Text landscape, focusing on models for the E2E, Viggo and WebNLG datasets. Particular attention is paid to the factual accuracy of the generated text and to techniques for improving it, showing that the main metrics used to evaluate Data-to-Text models do not adequately account for the semantic side of the results.
A model based on Datatuner, an LLM created by Amazon, is then proposed with the aim of increasing the semantic fidelity of the generated sentences. The proposed model uses a single pretrained model instead of the two in the Datatuner architecture, yet still improves on its semantic results by adopting an encoder-decoder architecture and additional training techniques. The thesis also introduces JildaD2T, the first Data-to-Text dataset entirely in Italian.
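The abstract's point that overlap-based metrics can miss semantic errors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the meaning representation, the two sentences, the verbatim-match slot check, and the clipped unigram precision, which stands in for BLEU and is not a full BLEU implementation.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, drop full stops, split on whitespace (deliberately naive)."""
    return text.lower().replace(".", "").split()


def unigram_precision(candidate, reference):
    """Clipped unigram precision: a simplified stand-in for BLEU, not BLEU itself."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    return overlap / sum(cand.values())


def slot_error_rate(slots, text):
    """Share of input slot values that do not appear verbatim in the text.

    Assumed simplification: only missing values are counted; hallucinated
    or repeated slots are ignored.
    """
    if not slots:
        return 0.0
    lowered = text.lower()
    missing = [name for name, value in slots.items() if value.lower() not in lowered]
    return len(missing) / len(slots)


# Hypothetical meaning representation and sentences, for illustration only.
mr = {"name": "Aromi", "food": "Italian", "area": "riverside"}
reference = "Aromi is an Italian restaurant in the riverside area."
candidate = "Aromi is an Indian restaurant in the riverside area."

print(unigram_precision(candidate, reference))  # high word overlap
print(slot_error_rate(mr, candidate))           # one of three slots is wrong
```

In this toy case the candidate differs from the reference by a single word, so the overlap score stays high even though the sentence states the wrong cuisine. A slot-level check flags the error directly.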

File

File name	Size
calamita...pleta.pdf	1.10 Mb

ETD

Digital archive of theses defended at the University of Pisa
