Thesis etd-09082015-032932
Thesis type
Master's thesis
Author
SENALDI, MARCO SILVIO GIUSEPPE
URN
etd-09082015-032932
Title
Casting Light on Idiom Flexibility: A Corpus-based Approach
Department
FILOLOGIA, LETTERATURA E LINGUISTICA
Degree programme
LINGUISTICA
Supervisors
supervisor Prof. Lenci, Alessandro
second reader Prof. Marotta, Giovanna
tutor Prof. Bertinetto, Pier Marco
tutor Dr. Lebani, Gianluca
Keywords
- Compositionality
- Computational Linguistics
- Corpus Linguistics
- Distributional Semantic Models
- Idioms
- Psycholinguistic Judgments
- Shannon entropy
Defense session start date
28/09/2015
Availability
Not available
Release date
28/09/2085
Abstract
The goal of this work is to assess the cognitive plausibility of corpus-based measures that capture the formal flexibility and the semantic idiosyncrasy of a sample of Italian idiomatic expressions. The 87 idioms in our dataset are taken from the study of Tabossi and colleagues (2011), who elicited normative judgments on 245 Italian idioms from 740 native speakers. We use Shannon entropy (Shannon 1948) to measure the lexical and morphosyntactic variability of our expressions and Distributional Semantic Models (DSMs) (Lenci 2008; Turney & Pantel 2010) to represent their semantics. Our dataset is extracted from the La Repubblica corpus (Baroni et al. 2004) via SYMPAThy (Syntactically Marked PATterns) (Lenci et al. 2014; 2015), a data representation format that encompasses both Part-of-Speech and syntactic information to derive word combinations from corpora. Performing a series of stepwise multiple regression analyses, we find that psycholinguistic judgments on idiom predictability, literality and syntactic flexibility can be modeled by an array of computational measures comprising our entropic and distributional values, token frequency and the number of fully lexicalized arguments exhibited by each idiom.
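As a rough illustration of how Shannon entropy can quantify formal variability, the following minimal sketch computes the entropy of a distribution of surface variants of an expression. The function name and the variant counts are hypothetical and are not taken from the thesis; the actual entropic indices used in the work may be defined differently.

```python
import math

def shannon_entropy(counts):
    """Return the Shannon entropy, in bits, of a frequency distribution.

    `counts` is an iterable of non-negative occurrence counts, one per
    observed variant (hypothetical input format, assumed for illustration).
    """
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # H = sum over variants of p * log2(1 / p), with p = c / total
    return sum((c / total) * math.log2(total / c) for c in counts)

# Invented counts: an idiom attested in a single form versus one
# attested equally often in four different forms.
rigid_idiom = [100]
flexible_idiom = [25, 25, 25, 25]

print(shannon_entropy(rigid_idiom))
print(shannon_entropy(flexible_idiom))
```

Under these assumptions, an expression that always occurs in the same form has zero entropy, while one whose occurrences are spread evenly over several forms has higher entropy, which is the intuition behind using entropy as a flexibility index.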
This thesis is organized as follows. In Chapter 1 we illustrate the concepts of idiomaticity (Cacciari & Glucksberg 1991; Nunberg et al. 1994) and multiword expressions (MWEs) (Sag et al. 2001; Masini 2012), reviewing the major theoretical, psycholinguistic and computational studies that have been conducted on the subject. In Chapter 2 we give a definition of word combinations and describe the constructionist framework (Fillmore et al. 1988; Goldberg 1995; 2006; Croft & Cruse 2004; Hoffmann & Trousdale 2013) we have adopted in our work. We then survey the pros and cons of Part-of-Speech-based and syntax-based methods for the extraction of word combinations from corpora and present SYMPAThy (Syntactically Marked PATterns), a data representation format that combines both approaches (Lenci et al. 2014; 2015). In Chapter 3 we expound the entropic indices and the distributional measures we have exploited and briefly present the normative data by Tabossi and colleagues (2011). Chapter 4 describes our experiment, including data extraction, the calculation of our corpus-based indices and the execution of the stepwise multiple regression analyses. There follows an extensive discussion of our results. We then provide some conclusions and suggest future directions of research.
Files
File name | Size |
---|---|
Thesis not available. |