ETD system

Electronic theses and dissertations repository

 

Thesis etd-07032015-110257


Thesis type
Master's thesis
Author
PONZA, MARCO
URN
etd-07032015-110257
Title
A New Algorithm for Document Aboutness
Department
INFORMATICA
Degree programme
INFORMATICA
Supervisors
supervisor Prof. Ferragina, Paolo
co-supervisor Dr. Cornolti, Marco
Keywords
  • information retrieval
  • natural language processing
  • machine learning
  • entity salience
Exam session start date
24/07/2015
Availability
Partial
Release date
24/07/2018
Abstract
The thesis investigates the document aboutness task and proposes the design, implementation, and testing of a system that identifies the main focus of a text by detecting the entities, drawn from Wikipedia, that are salient to its discourse. To design this system we deploy several Natural Language Processing tools, such as an entity annotator, a text summarizer, and a dependency parser. Using these tools we derive a large set of features, on which we build a binary classifier that distinguishes salient from non-salient entities. The efficiency and effectiveness of the system are evaluated through a large experimental test on the well-known annotated New York Times dataset.
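As an illustration only, the feature-based binary classification described in the abstract could be sketched as follows. This is not the thesis's system: the three features (earliness of first mention, mention count, presence in the title), the toy document, its labels, and the plain logistic regression trained by stochastic gradient descent are all assumptions chosen to keep the example small and self-contained.

```python
import math

# Hypothetical toy document: token offsets of each entity's mentions.
doc = {
    "n_tokens": 100,
    "title_entities": {"Pisa"},
    "mentions": {
        "Pisa": [2, 15, 40, 77],
        "Galileo": [5, 30, 60],
        "Tuesday": [90],
        "Reuters": [95],
    },
}
# Hypothetical gold labels: 1 = salient, 0 = non-salient.
labels = {"Pisa": 1, "Galileo": 1, "Tuesday": 0, "Reuters": 0}


def features(entity, doc):
    """Build an illustrative feature vector for one entity (assumed features)."""
    offsets = doc["mentions"][entity]
    return [
        1.0 - min(offsets) / doc["n_tokens"],  # earliness of the first mention
        len(offsets) / 10.0,                   # scaled mention count
        1.0 if entity in doc["title_entities"] else 0.0,  # appears in title
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(w, b, x):
    """Estimated probability that the entity with features x is salient."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(X, y, epochs=500, lr=0.5):
    """Fit logistic regression weights with stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = score(w, b, x) - t  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def is_salient(w, b, x):
    return score(w, b, x) >= 0.5


entities = sorted(labels)
X = [features(e, doc) for e in entities]
y = [labels[e] for e in entities]
w, b = train(X, y)
predictions = {e: is_salient(w, b, features(e, doc)) for e in entities}
```

In a realistic setting the feature set would be far larger, derived from the annotator, summarizer, and parser outputs mentioned above, and the classifier would be evaluated on held-out documents rather than on its own training examples.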
File