Thesis etd-09082015-032932

Thesis type

Master's thesis

URN

etd-09082015-032932

Title

Casting Light on Idiom Flexibility: A Corpus-based Approach

Department

PHILOLOGY, LITERATURE AND LINGUISTICS

Degree programme

LINGUISTICS

Supervisors

supervisor Prof. Lenci, Alessandro
co-examiner Prof. Marotta, Giovanna
tutor Prof. Bertinetto, Pier Marco
tutor Dr. Lebani, Gianluca

Keywords

compositionality
Computational Linguistics
Corpus Linguistics
Distributional Semantic Models
Idioms
Psycholinguistic Judgments
Shannon entropy

Exam session start date

28/09/2015

Availability

Not available

Release date

28/09/2085

Abstract (English)

The goal of this work is to assess the cognitive plausibility of corpus-based measures that capture the formal flexibility and the semantic idiosyncrasy of a sample of Italian idiomatic expressions. The 87 idioms in our dataset are taken from the study by Tabossi and colleagues (2011), who elicited normative judgments on 245 Italian idioms from 740 native speakers. We use Shannon entropy (Shannon 1948) to measure the lexical and morphosyntactic variability of our expressions and Distributional Semantic Models (DSMs) (Lenci 2008; Turney & Pantel 2010) to represent their semantics. Our dataset is extracted from the La Repubblica corpus (Baroni et al. 2004) via SYMPAThy (Syntactically Marked PATterns) (Lenci et al. 2014; 2015), a data representation format that combines Part-of-Speech-related and syntactic information to derive word combinations from corpora. Through a series of stepwise multiple regression analyses, we find that psycholinguistic judgments on idiom predictability, literality and syntactic flexibility can be modeled by an array of computational measures comprising our entropic and distributional values, token frequency and the number of fully lexicalized arguments exhibited by each idiom.
This thesis is organized as follows. In Chapter 1 we illustrate the concepts of idiomaticity (Cacciari & Glucksberg 1991; Nunberg et al. 1994) and multiword expressions (MWEs) (Sag et al. 2001; Masini 2012), reviewing the major theoretical, psycholinguistic and computational studies that have been conducted on the subject. In Chapter 2 we define word combinations and describe the constructionist framework (Fillmore et al. 1988; Goldberg 1995; 2006; Croft & Cruse 2004; Hoffmann & Trousdale 2013) adopted in our work. We then survey the pros and cons of Part-of-Speech-based and syntax-based methods for extracting word combinations from corpora and present SYMPAThy (Syntactically Marked PATterns), a data representation format that combines both approaches (Lenci et al. 2014; 2015). In Chapter 3 we describe the entropic indices and the distributional measures we have used and briefly present the normative data by Tabossi and colleagues (2011). Chapter 4 describes our experiment, including data extraction, the calculation of our corpus-based indices and the stepwise multiple regression analyses, followed by an extensive discussion of our results. We then draw some conclusions and suggest directions for future research.
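As a purely illustrative aside, the idea of using Shannon entropy as an index of formal variability can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the thesis: the function name `shannon_entropy` and the variant labels are hypothetical, and the sketch simply assumes that each corpus occurrence of an idiom has already been mapped to a label for its surface pattern.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Shannon entropy, in bits, of the empirical distribution of the labels."""
    counts = Counter(observations)
    total = sum(counts.values())
    # H = sum over labels of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical pattern labels for the corpus occurrences of two idioms
# (placeholders, not data from the thesis).
rigid = ["variant_a"] * 8                                            # always the same form
flexible = ["variant_a", "variant_b", "variant_c", "variant_d"] * 2  # four equally frequent forms

print(shannon_entropy(rigid))     # a single attested form gives zero entropy
print(shannon_entropy(flexible))  # a uniform spread over four forms gives two bits
```

Under these assumptions, an expression attested in a single form receives an entropy of zero, while one whose occurrences are spread evenly over several forms receives a higher value, which is the sense in which entropy can serve as a measure of flexibility.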

File

File name	Size
Thesis not available. Contact the author

ETD

Digital archive of theses defended at the Università di Pisa
