ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-07032015-110257


Thesis type
Master's thesis
Author
PONZA, MARCO
URN
etd-07032015-110257
Title
A New Algorithm for Document Aboutness
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
co-supervisor Dr. Cornolti, Marco
Keywords
  • machine learning
  • information retrieval
  • entity salience
  • natural language processing
Exam session start date
24/07/2015
Availability
Full
Abstract
The thesis investigates the document aboutness task and proposes the design, implementation and testing of a system that identifies the main focus of a text by detecting the entities, drawn from Wikipedia, that are salient to its discourse. To design this system we employ several Natural Language Processing tools, such as an entity annotator, a text summarizer and a dependency parser. Using these tools we derive a large set of features, upon which we build a binary classifier that distinguishes salient from non-salient entities. The efficiency and effectiveness of the developed system are evaluated through a large experimental test on the well-known annotated New York Times dataset.
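The abstract does not say which features or which learning algorithm the thesis uses, so the following is only a minimal sketch of the general idea of a feature-based binary salience classifier. Everything in it is an assumption made for illustration: the two features (mention frequency and earliness of the first mention), the toy document, the labels, and the choice of a small logistic regression trained by stochastic gradient descent. It also assumes that entity annotation has already reduced each entity to a single token.

```python
import math

def entity_features(entity, tokens):
    """Hypothetical features: relative mention frequency and earliness of first mention."""
    positions = [i for i, tok in enumerate(tokens) if tok == entity]
    if not positions:
        return [0.0, 0.0]
    frequency = len(positions) / len(tokens)
    earliness = 1.0 - positions[0] / len(tokens)  # 1.0 means the entity opens the text
    return [frequency, earliness]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(samples, labels, epochs=1000, lr=0.5):
    """Fit a logistic regression with plain per-sample gradient descent."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def is_salient(x, weights, bias):
    """Binary decision: salient if the predicted probability is at least 0.5."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5

# Toy document and labels, invented for this sketch (1 = salient, 0 = non-salient).
tokens = ["pisa", "hosts", "a", "university", "pisa", "is", "in",
          "tuscany", "near", "florence"]
labelled = {"pisa": 1, "tuscany": 0, "florence": 0}

samples = [entity_features(entity, tokens) for entity in labelled]
labels = list(labelled.values())
weights, bias = train_logistic(samples, labels)

predictions = {
    entity: is_salient(entity_features(entity, tokens), weights, bias)
    for entity in labelled
}
```

A real system along the lines described in the abstract would presumably replace these two toy features with the much larger set derived from the annotator, summarizer and dependency parser, and would be evaluated on held-out documents rather than on its own training examples.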