Thesis etd-01202026-204122

Thesis type

Master's thesis

URN

etd-01202026-204122

Title

Analysis and comparison of experimental data and computational models for identifying the factors underlying Differential Object Marking

Department

FILOLOGIA, LETTERATURA E LINGUISTICA

Degree programme

LINGUISTICA E TRADUZIONE

Supervisors

supervisor Prof. Lenci, Alessandro
supervisor Prof. Rovai, Francesco

Keywords

differential object marking
large language models
self-paced reading
surprisal
typology

Examination session start date

6 February 2026

Availability

Full

Abstract (English)

The thesis offers an integrated analysis of Differential Object Marking (DOM) in Spanish, namely the phenomenon whereby the direct object is not marked uniformly but, under specific conditions, is introduced by the preposition a. DOM is of particular interest because it lies at the intersection of morphosyntax, semantics, and pragmatics: marking depends on structural properties, but above all on features related to the prominence of the referent and, to a non-negligible extent, on properties of the verbal event. The thesis has a twofold goal: first, to reconstruct the state of the art on the factors that determine DOM, and second, to test them experimentally through a design combining offline and online data, while complementing human results with a computational analysis based on surprisal extracted from Large Language Models (LLMs). Within this framework, a distinctive contribution is the study of the psychometric predictive power of LLMs, that is, their ability to predict psychometric measures of human behavior, in particular processing difficulty as reflected in reading times.
In the typological literature, DOM is interpreted as a strategy that balances economy and disambiguation: when the object is low in prominence, marking can be omitted, whereas when the object is highly prominent, marking becomes more likely or even obligatory in order to make argument structure more transparent. In Spanish, the most stable factor is animacy: human objects are strongly associated with marking, while inanimate objects tend to appear without a. A second fundamental dimension is referentiality (definiteness/specificity): definite and specific objects are more compatible with DOM, whereas non-specific indefinite objects show greater instability and stronger dependence on contextual factors. It is precisely in these “intermediate zones,” where nominal features do not determine the choice categorically, that the possibility emerges of observing modulations driven by additional factors.
Among these, the thesis foregrounds affectedness, understood as the degree to which the object is involved as a patient and undergoes an impact or transformation. The central hypothesis is that affectedness does not replace nominal features, but functions as a modulator, yielding effects that are more visible in contexts where marking is less stabilized. In parallel, the thesis addresses sociolinguistic variation, hypothesizing that bilingual speakers (e.g., Spanish–Catalan) may exhibit systematic differences in tolerance toward marginal structures precisely in the gradient domains of the phenomenon.
The experiment combines a self-paced reading task (an online measure of processing cost) with acceptability judgments (an offline measure of perceived naturalness). The design systematically manipulates the presence vs. absence of DOM by crossing it with animacy, referentiality, and verb classes defined in terms of affectedness. The distinction between “grammatical” and “ungrammatical” sentences is explicitly operationalized: a sentence is grammatical when object marking matches general DOM expectations, and ungrammatical when it introduces a violation (omission where marking is expected, or insertion where it is not expected). This enables a direct comparison across conditions and a controlled measurement of the penalty associated with violations, including potential spillover effects in subsequent regions.
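As a rough sketch of the crossed design described above, the conditions can be enumerated as a full factorial. The two-level coding of each factor, the level names, and the animacy-only rule for when marking is expected are assumptions made for illustration; the abstract does not specify the actual levels, items, or expectation rule.

```python
from itertools import product

# Hypothetical two-level coding of each factor (not taken from the thesis).
ANIMACY = ("animate", "inanimate")
REFERENTIALITY = ("definite", "indefinite")
VERB_CLASS = ("high_affectedness", "low_affectedness")
MARKED = (True, False)  # presence vs. absence of the marker "a"


def marking_expected(animacy: str) -> bool:
    """Toy expectation rule (assumption): marking is expected for animate objects only."""
    return animacy == "animate"


def build_conditions() -> list:
    """Cross all factors and label each cell as grammatical or as a violation type."""
    conditions = []
    for animacy, ref, verb, marked in product(ANIMACY, REFERENTIALITY, VERB_CLASS, MARKED):
        expected = marking_expected(animacy)
        if marked == expected:
            violation = None          # marking matches the expectation
        elif expected:
            violation = "omission"    # marker missing where it is expected
        else:
            violation = "insertion"   # marker present where it is not expected
        conditions.append({
            "animacy": animacy,
            "referentiality": ref,
            "verb_class": verb,
            "marked": marked,
            "grammatical": violation is None,
            "violation": violation,
        })
    return conditions


conditions = build_conditions()
```

Under these assumptions the design has 16 cells, half of them grammatical, with omission and insertion violations evenly split.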
The innovative core of the thesis, however, concerns the use of surprisal as a bridge between computational language models and psychometric data, with a particular emphasis on the psychometric predictive power of LLMs. Surprisal, defined as -log p(w|context), quantifies the unpredictability of a word given the preceding context and connects directly to prediction-based theories of language processing: less predictable input requires greater effort, yielding increased reading times. The thesis does not assume that LLMs “replicate” human competence, but rather evaluates operationally whether their probabilistic distributions are sufficiently informative to predict observable behavioral variation.
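To make the definition concrete, here is a minimal sketch that turns a conditional probability into a surprisal value. The logarithm base (2, giving bits) and the probabilities are assumptions chosen for illustration; the abstract does not state which base was used.

```python
import math


def surprisal(prob: float, base: float = 2.0) -> float:
    """Return -log_base(prob); with base 2 the unit is bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(prob) / math.log(base)


# Hypothetical next-word probabilities given some context (made-up values).
next_word_probs = {"a": 0.5, "la": 0.25, "el": 0.125, "un": 0.125}
surprisals = {word: surprisal(p) for word, p in next_word_probs.items()}
```

The less probable a continuation, the higher its surprisal: halving the probability adds one bit.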
The study focuses on decoder-only autoregressive models, consistent with incremental reading, including Llama 3.1 8B, Qwen3 8B, a Spanish fine-tuned version of Llama, and a Spanish-only GPT-2, alongside baselines of increasing complexity (n-gram models, and RNN and Transformer architectures trained from scratch). Surprisal is calculated specifically on the object noun and, when subword tokenization applies, it is aggregated to the word level by summing the surprisal values of the component tokens. Model adequacy is assessed through three complementary tests: (i) the coherence of surprisal with the grammatical manipulations, (ii) the ability of surprisal to improve the prediction of reading times beyond superficial controls, and (iii) robustness, i.e., verification that grammatical factors retain an independent role even after surprisal is included.
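The word-level aggregation mentioned above follows from the chain rule: the log-probability of a word is the sum of the conditional log-probabilities of its subword tokens, so word surprisal is the sum of token surprisals. A minimal sketch, in which the input format, the example segmentation, and the numeric values are all hypothetical:

```python
from collections import defaultdict


def aggregate_to_words(tokens):
    """Sum subword surprisals into word-level surprisals.

    `tokens` is a list of (word_index, subword, surprisal) triples, where
    word_index says which word each subword token belongs to.
    Returns one surprisal value per word, ordered by word index.
    """
    totals = defaultdict(float)
    for word_idx, _subword, value in tokens:
        totals[word_idx] += value
    return [totals[i] for i in sorted(totals)]


# Hypothetical segmentation of "a veterinario" with made-up surprisal values.
example_tokens = [
    (0, "a", 1.0),
    (1, "veter", 2.5),
    (1, "inario", 0.5),
]
word_level = aggregate_to_words(example_tokens)
```

Summing, rather than averaging, keeps the word-level value equal to the surprisal of the whole word under the model.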
Experimental results confirm that grammatical sentences are more acceptable and processed faster, whereas violations incur processing costs. The effects are not uniform: penalties vary as a function of animacy, referentiality, and verb class, supporting the multifactorial nature of DOM and the relevance of affectedness as a modulator. At the sociolinguistic level, Spanish–Catalan bilinguals show greater tolerance in some ungrammatical conditions, suggesting that variation emerges primarily in the less categorical domains. On the computational side, LLMs differ in their ability to capture DOM: more powerful models display surprisal patterns that align more closely with grammatical expectations, and the Spanish-only GPT-2 appears particularly sensitive even to finer-grained modulations. Crucially, including surprisal in the statistical models improves the prediction of reading times, demonstrating genuine psychometric predictive power; however, grammatical factors continue to contribute independently, indicating that surprisal captures a predictive component of processing but does not fully replace the explanatory structure provided by grammar and semantics.
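The idea of testing whether surprisal improves reading-time prediction beyond superficial controls can be illustrated with a nested model comparison. The sketch below uses ordinary least squares on made-up data with word length as the only control; the abstract does not name the statistical models actually used, so this is a deliberately simplified stand-in, not the thesis's analysis.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols_rss(X, y):
    """Fit y ~ X by least squares (normal equations); return the residual sum of squares."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))


# Made-up per-word data: length in characters, surprisal in bits, reading time in ms.
lengths = [4, 6, 5, 8, 7, 5]
surps = [2.0, 9.5, 3.1, 7.0, 12.2, 4.4]
rts = [310, 420, 335, 415, 480, 350]

baseline = [[1.0, n] for n in lengths]                    # intercept + length
full = [[1.0, n, s] for n, s in zip(lengths, surps)]      # ... + surprisal
rss_base = ols_rss(baseline, rts)
rss_full = ols_rss(full, rts)
```

Because the models are nested, the fuller one can never fit the training data worse, so a reduction in residual error alone is not evidence; in practice the gain would be assessed with a likelihood-ratio test or on held-out data.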
In conclusion, the thesis shows that Spanish DOM is primarily driven by nominal features, but modulated by event-level properties and sensitive to sociolinguistic variation. The integration of LLM-based surprisal constitutes the most innovative aspect of the work: it makes probabilistic expectations quantifiable in a way that is directly comparable to psychometric measures, and it provides a testable framework for assessing the extent to which neural language models can serve as predictors of human behavior in language processing.

Abstract (Italian)

The thesis proposes an integrated analysis of Differential Object Marking (DOM) in Spanish, that is, the phenomenon whereby the direct object is not marked uniformly but, under specific conditions, is introduced by the preposition a. DOM is a case of particular interest because it lies at the intersection of morphosyntax, semantics, and pragmatics: marking depends on structural properties, but above all on features of referent prominence and, to a non-negligible extent, on characteristics of the verbal event. The goal of the thesis is twofold: to reconstruct the state of the art on the factors that determine DOM, and to test them experimentally through a design that combines offline and online data, while also complementing the human results with computational modelling based on surprisal extracted from Large Language Models (LLMs). Within this framework, a distinctive contribution is the study of the psychometric predictive power of LLMs, that is, their ability to predict psychometric measures of human behavior, in particular the processing difficulty observable in reading times.
In the typological literature, DOM is interpreted as a strategy that balances economy and disambiguation: when the object is low in prominence, marking can be omitted, whereas when the object is highly prominent, marking becomes more likely or obligatory in order to make argument structure more transparent. In the case of Spanish, the most stable factor is animacy: human objects are strongly associated with marking, while inanimate objects tend to appear without a. A second fundamental axis is referentiality (definiteness/specificity): definite and specific objects are more compatible with DOM, whereas non-specific indefinite objects show greater instability and dependence on context. It is precisely in the “intermediate zones”, where nominal features do not determine the choice categorically, that the possibility emerges of observing modulations tied to additional factors.
Among these, the thesis foregrounds affectedness, understood as the degree to which the object is involved as a patient and undergoes an impact or transformation. The central hypothesis is that affectedness does not replace nominal features but acts as a modulator, producing effects that are more visible in contexts where marking is less stabilized. In parallel, the thesis considers sociolinguistic variation, hypothesizing that bilingual speakers (e.g., Spanish-Catalan) may show systematic differences in tolerance toward marginal structures precisely in the gradient domains of the phenomenon.
The experiment combines a self-paced reading task (an online measure of processing cost) with acceptability judgments (an offline measure of naturalness). The design manipulates the presence/absence of DOM, crossing it with animacy, referentiality, and verb classes defined in terms of affectedness. The distinction between “correct” and “incorrect” sentences is explicitly operationalized: a sentence is correct if the marking matches general DOM expectations, and incorrect if it introduces a violation (omission where marking is expected, or insertion where it is not). This allows a direct comparison across conditions and a controlled measurement of the penalty associated with violations, including spillover effects in subsequent regions.
The innovative core of the thesis, however, concerns the use of surprisal as a bridge between language models and psychometric data, with an emphasis on the psychometric predictive power of LLMs. Surprisal, defined as -log p(w|context), quantifies the unpredictability of a word given the preceding context and connects directly to prediction-based theories of processing: less predictable input requires greater effort, producing increases in reading times. The thesis does not assume that LLMs “replicate” human competence, but evaluates operationally whether their probability distribution is sufficiently informative to predict observable behavioral variation.
The models selected are decoder-only autoregressive models, consistent with incremental reading, including Llama 3.1 8B, Qwen3 8B, a version fine-tuned on Spanish, and a monolingual Spanish GPT-2, in addition to baselines (n-gram models, and RNN and Transformer architectures trained from scratch). Surprisal is calculated specifically on the object noun and, where subword tokenization applies, it is aggregated to the word level by summing the surprisal values of the constituent tokens. Model adequacy is verified through three complementary tests: (i) the coherence of surprisal with the grammatical manipulations, (ii) the ability of surprisal to improve the prediction of reading times beyond superficial controls, and (iii) robustness, i.e., verification that grammatical factors retain an independent role even after surprisal is included.
The experimental results confirm that correct sentences are more acceptable and are processed more quickly, whereas violations generate processing costs. The effects are not uniform: the penalty varies as a function of animacy, referentiality, and verb class, confirming the multifactorial nature of DOM and the relevance of affectedness as a modulator. At the sociolinguistic level, Spanish-Catalan bilinguals show greater tolerance in some incorrect conditions, suggesting that variation emerges mainly in the less categorical domains. On the computational side, LLMs differ in their ability to capture DOM: the more powerful models show surprisal that is more consistent with grammatical expectations and, above all, the monolingual Spanish GPT-2 proves more sensitive even to fine-grained modulations. Crucially, including surprisal in the statistical models improves the prediction of reading times, demonstrating genuine psychometric predictive power; however, grammatical factors continue to contribute independently, indicating that surprisal captures a predictive component of processing but does not exhaust the explanatory structure offered by grammar and semantics.
In conclusion, the thesis shows that DOM in Spanish is a phenomenon driven primarily by nominal features, but modulated by event-level properties and sensitive to sociolinguistic variation. The integration with LLM surprisal is the most innovative aspect: it makes it possible to quantify probabilistic expectations in a way that is comparable with psychometric measures and to assess, with testable criteria, how far neural models can serve as predictors of human behavior in language processing.

File

File name	Size
Tesi_mag...manna.pdf	3.13 Mb
ETD

Digital archive of theses defended at the Università di Pisa