Thesis etd-04042008-163914

Thesis type

PhD thesis

URN

etd-04042008-163914

Title

Automatic Generation of Lexical Resources for Opinion Mining: Models, Algorithms and Applications

Scientific disciplinary sector

INF/01 - Computer Science

Degree programme

Information Engineering

Supervisors

Supervisor Prof. Simoncini, Luca
Supervisor Dr. Sebastiani, Fabrizio

Keywords

gloss classification
information extraction
lexical resources
opinion mining
random walk models
sentiment classification
text classification

Defence session start date

10/06/2008

Availability

Full

Abstract (English)

Opinion mining is a recent discipline at the crossroads of Information Retrieval and Computational Linguistics. It is concerned not with the topic a document is about, but with the opinion it expresses. It has a rich set of applications, ranging from tracking users' opinions about products or political candidates, as expressed in online forums, to customer relationship management.
Functional to the extraction of opinions from text is the determination of the relevant entities of the language that are used to express opinions, and of their opinion-related properties. One example is determining that the term beautiful casts a positive connotation on its subject.

In this thesis we investigate the automatic recognition of opinion-related properties of terms. This results in building opinion-related lexical resources, which can be used in opinion mining applications.
We start from the (relatively) simple problem of determining the orientation of subjective terms.
We propose an original semi-supervised term classification model based on the quantitative analysis of the glosses of such terms, i.e., the definitions that these terms are given in online dictionaries. This method outperforms all known methods when tested on the recognized standard benchmarks for this task.
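
As a rough illustration of classifying terms by their glosses, the following toy sketch trains a word-count classifier on the glosses of a few labelled seed terms and applies it to the glosses of unlabelled terms. It is not the model proposed in the thesis: the glosses, the seed sets, and the names `GLOSSES`, `SEEDS`, `train`, and `classify` are all hypothetical, and a plain Naive Bayes scorer with equal priors is assumed for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy glosses; a real system would read them from a dictionary.
GLOSSES = {
    "good": "having desirable or positive qualities",
    "nice": "pleasant or pleasing in nature",
    "bad": "having undesirable or negative qualities",
    "nasty": "offensive or unpleasant in nature",
    "lovely": "pleasant and appealing with desirable qualities",
    "awful": "offensive and unpleasant with undesirable qualities",
}

# Hypothetical seed terms whose orientation is assumed to be known.
SEEDS = {"good": "positive", "nice": "positive",
         "bad": "negative", "nasty": "negative"}


def tokenize(text):
    """Split a gloss into lowercase whitespace-separated tokens."""
    return text.lower().split()


def train(glosses, seeds):
    """Count gloss words per label, using only the seed terms."""
    counts = {"positive": Counter(), "negative": Counter()}
    for term, label in seeds.items():
        counts[label].update(tokenize(glosses[term]))
    vocab = set(counts["positive"]) | set(counts["negative"])
    return counts, vocab


def classify(gloss, counts, vocab):
    """Return the label with the highest Laplace-smoothed log-likelihood."""
    scores = {}
    for label, label_counts in counts.items():
        total = sum(label_counts.values())
        scores[label] = sum(
            math.log((label_counts[word] + 1) / (total + len(vocab)))
            for word in tokenize(gloss)
            if word in vocab  # words unseen in any seed gloss are ignored
        )
    return max(scores, key=scores.get)


counts, vocab = train(GLOSSES, SEEDS)
for term in ("lovely", "awful"):
    print(term, classify(GLOSSES[term], counts, vocab))
```

In a semi-supervised setting, terms labelled this way could be added to the seed set and the classifier retrained; that loop is omitted here.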

We show that our method is capable of producing good results on more complex tasks, such as discriminating subjective terms (e.g., good) from objective ones (e.g., green), or classifying terms according to a fine-grained attitude taxonomy.

We then propose a relevant refinement of the task, i.e., distinguishing the opinion-related properties of distinct term senses. We present SentiWordNet, a novel high-quality, high-coverage lexical resource, where each one of the 115,424 senses contained in WordNet has been automatically evaluated on the three dimensions of positivity, negativity, and objectivity.

We also propose an original and effective use of random-walk models to rank term senses by their positivity or negativity. The random-walk algorithms we present have great application potential outside the opinion mining area as well, for example in word sense disambiguation tasks. A result of this experience is the generation of an improved version of SentiWordNet.
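
To give a feel for ranking senses with a random walk, the sketch below runs a generic PageRank-style walk with restart on a tiny sense graph. It is only an illustration of the general technique, not the algorithm developed in the thesis: the graph `GRAPH`, the sense identifiers, the seed set, the edge direction, and the function `random_walk_scores` are all assumptions made for the example.

```python
# Hypothetical sense graph: each sense points to the senses it is linked to.
GRAPH = {
    "good#1": ["pleasant#1"],
    "pleasant#1": ["good#1", "nice#1"],
    "nice#1": ["pleasant#1"],
    "table#1": ["furniture#1"],
    "furniture#1": ["table#1"],
}


def random_walk_scores(graph, seeds, damping=0.85, iterations=50):
    """Random walk with restart: score mass flows along edges and,
    with probability (1 - damping), jumps back to the seed senses."""
    nodes = sorted(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            targets = graph[n]
            if targets:
                share = damping * score[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling sense: hand its mass back to the seeds.
                for m in nodes:
                    nxt[m] += damping * score[n] * restart[m]
        score = nxt
    return score


# Assumed seed: one sense taken to be clearly positive.
scores = random_walk_scores(GRAPH, seeds={"good#1"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Senses that cannot be reached from the seed, such as the hypothetical `table#1` and `furniture#1`, keep a score of zero and fall to the bottom of the ranking.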

Finally, we evaluate and compare the various versions of SentiWordNet presented here with other opinion-related lexical resources well known in the literature, testing their use in an Opinion Extraction application. We show that the use of SentiWordNet produces a significant improvement both over a baseline system that uses no specialized lexical resource and over the use of other opinion-related lexical resources.

File

File name	Size
andrea_e...hesis.pdf	1.68 Mb

ETD

Digital archive of theses defended at the Università di Pisa
