ETD

Digital archive of the theses defended at the Università di Pisa

Thesis etd-04042008-163914


Thesis type
PhD thesis
Author
ESULI, ANDREA
URN
etd-04042008-163914
Title
Automatic Generation of Lexical Resources for Opinion Mining: Models, Algorithms and Applications
Scientific-disciplinary sector
INF/01
Degree programme
INGEGNERIA DELL'INFORMAZIONE
Supervisors
Supervisor Prof. Simoncini, Luca
Supervisor Dr. Sebastiani, Fabrizio
Keywords
  • sentiment classification
  • random walk models
  • opinion mining
  • lexical resources
  • information extraction
  • gloss classification
  • text classification
Defense session start date
10/06/2008
Availability
Full
Abstract
Opinion mining is a recent discipline at the crossroads of Information Retrieval and Computational Linguistics that is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about products or about political candidates as expressed in online forums, to customer relationship management.
Functional to the extraction of opinions from text is the determination of the relevant entities of the language that are used to express opinions, and of their opinion-related properties. An example is determining that the term beautiful casts a positive connotation on its subject.

In this thesis we investigate the automatic recognition of opinion-related properties of terms. This results in the building of opinion-related lexical resources, which can be used in opinion mining applications.
We start from the (relatively) simple problem of determining the orientation of subjective terms.
We propose an original semi-supervised term classification model that is based on the quantitative analysis of the glosses of such terms, i.e. the definitions that these terms are given in on-line dictionaries. This method outperforms all known methods when tested on the recognized standard benchmarks for this task.
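
The idea of classifying terms by their glosses can be illustrated with a minimal sketch. It is not the thesis's actual model: the toy glosses, the synonym links used to grow the seed sets, and the word-overlap scoring are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy glosses; a real system would read them from a dictionary.
GLOSSES = {
    "good": "having desirable or positive qualities",
    "nice": "pleasant and desirable",
    "bad": "having undesirable or negative qualities",
    "nasty": "unpleasant and undesirable",
    "lovely": "pleasant and having positive appeal",
    "awful": "unpleasant with negative qualities",
}
# Hypothetical synonym links used to grow the small labelled seed sets.
SYNONYMS = {"good": ["nice"], "bad": ["nasty"]}


def expand(seeds):
    """Add the synonyms of every seed term (a single expansion step)."""
    grown = set(seeds)
    for term in seeds:
        grown.update(SYNONYMS.get(term, []))
    return grown


def gloss_counts(terms):
    """Sum the word counts of the glosses of the given terms."""
    counts = Counter()
    for term in terms:
        counts.update(GLOSSES[term].split())
    return counts


def classify(term, pos_counts, neg_counts):
    """Label a term by the class whose gloss words its own gloss overlaps more."""
    words = GLOSSES[term].split()
    pos_score = sum(pos_counts[w] for w in words)
    neg_score = sum(neg_counts[w] for w in words)
    return "positive" if pos_score >= neg_score else "negative"


# Semi-supervised step: a few labelled seeds are expanded automatically.
pos_train = expand({"good"})
neg_train = expand({"bad"})
pos_counts = gloss_counts(pos_train)
neg_counts = gloss_counts(neg_train)

# Unlabelled terms are then classified from their glosses alone.
labels = {t: classify(t, pos_counts, neg_counts) for t in ("lovely", "awful")}
```

A real classifier would replace the overlap count with a trained text classifier over weighted gloss vectors; the sketch only shows how a term's orientation can be inferred from its definition instead of from its usage.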

We show that our method is capable of producing good results on more complex tasks, such as discriminating subjective terms (e.g., good) from objective ones (e.g., green), or classifying terms according to a fine-grained attitude taxonomy.

We then propose a relevant refinement of the task, i.e., distinguishing the opinion-related properties of distinct term senses. We present SentiWordNet, a novel high-quality, high-coverage lexical resource, where each one of the 115,424 senses contained in WordNet has been automatically evaluated on the three dimensions of positivity, negativity, and objectivity.
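
One possible way to represent such a per-sense entry is sketched below. The class name, the identifier format, and the numeric values are hypothetical, and the constraint that the three scores sum to 1 is an assumption that the abstract itself does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseScores:
    """Opinion-related scores of one term sense (illustrative layout only)."""

    sense_id: str       # hypothetical identifier, e.g. "lemma#sense-number"
    positivity: float
    negativity: float
    objectivity: float

    def __post_init__(self):
        # Assumption: the three scores form a distribution over the sense.
        total = self.positivity + self.negativity + self.objectivity
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"scores of {self.sense_id} sum to {total}, not 1")

    def polarity(self):
        """Signed orientation: positive minus negative score."""
        return self.positivity - self.negativity


# Made-up values, chosen only to show the shape of an entry.
example = SenseScores("beautiful#1", positivity=0.75, negativity=0.0, objectivity=0.25)
```

Keeping the scores per sense, not per term, lets two senses of the same word carry different orientations.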

We also propose an original and effective use of random-walk models to rank term senses by their positivity or negativity. The random-walk algorithms we present also have great application potential outside the opinion mining area, for example in word sense disambiguation tasks. A result of this experience is the generation of an improved version of SentiWordNet.
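
A random-walk ranking of this kind can be sketched as a personalised PageRank computed by power iteration. The abstract does not specify the graph, the seeds, or the parameters, so the toy link structure, the seed sense, and the damping factor below are assumptions, and the sketch should not be read as the algorithm of the thesis.

```python
def random_walk_rank(links, seeds, damping=0.85, iterations=50):
    """Rank nodes by personalised PageRank over a dict-of-lists graph.

    links: maps every node to the list of nodes it points to.
    seeds: nodes the walker restarts from (e.g. known positive senses).
    """
    nodes = sorted(links)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps back to a seed.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            targets = links[src]
            if targets:
                # Otherwise it follows one outgoing link chosen uniformly.
                share = damping * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
            else:
                # A node without outgoing links sends its mass to the seeds.
                for n in nodes:
                    nxt[n] += damping * score[src] * restart[n]
        score = nxt
    ranking = sorted(nodes, key=lambda n: score[n], reverse=True)
    return ranking, score


# Hypothetical sense graph; the links are invented for illustration.
LINKS = {
    "good#1": ["nice#1"],
    "nice#1": ["good#1", "pleasant#1"],
    "pleasant#1": ["nice#1"],
    "green#1": ["good#1"],
}
ranking, scores = random_walk_rank(LINKS, seeds={"good#1"})
```

In this toy graph the sense that no walk from the seed can reach (`green#1`) ends up last, which is the intuition behind using the stationary scores as a positivity ranking. A negativity ranking would be obtained the same way from negative seeds.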

Finally, we evaluate the various versions of SentiWordNet presented here and compare them with other opinion-related lexical resources well known in the literature, experimenting with their use in an Opinion Extraction application. We show that the use of SentiWordNet produces a significant improvement with respect to a baseline system that does not use any specialized lexical resource, and also with respect to the use of other opinion-related lexical resources.

File