ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-03242017-173633


Thesis type
PhD thesis
Author
CORNOLTI, MARCO
URN
etd-03242017-173633
Title
Entity Linking on Text and Queries
Scientific disciplinary sector
INF/01
Degree programme
COMPUTER SCIENCE
Supervisors
tutor Prof. Ferragina, Paolo
co-supervisor Prof. Pedreschi, Dino
tutor Dr. Ciaramita, Massimiliano
Keywords
  • knowledge base
  • entity linking on queries
  • natural language processing
  • natural language understanding
  • algorithms evaluation
  • search engines
  • entity linking
Defense session start date
29/04/2017
Availability
Full
Abstract
Representing information is a key challenge for all applications that process and organize documents. Given the growth of digital data produced worldwide and the increasing complexity of users' information-management needs, it has become a priority to move beyond traditional syntax-based text representations towards a deeper understanding of documents. Making a computer understand a document, as opposed to simply analyzing its form, has been identified as a key challenge whose solution would open an unprecedented range of new applications. For this reason, in recent years the shift from a purely syntactic representation of documents to a semantic one has gained consensus and momentum among academic and industrial researchers, and considerable effort has been devoted to it.


Thanks to the possibilities the Web has opened for horizontal cooperation between users, semi-structured knowledge bases have been created with the contribution of millions of people, most notably Wikipedia, DBpedia, and Wikidata. The information contained in these knowledge bases is extremely rich and easy to process, thanks to their open nature and well-formed structure. Once a document (be it a natural language text, an image, a video, etc.) has been correctly mapped to its semantics, these knowledge bases can be exploited to address information retrieval problems more deeply and accurately.

This thesis aims to be a step forward in the task of building semantic representations of short and long texts, from queries to full articles.

Recently, considerable effort has been devoted to the problem of detecting sequences of terms in a natural language text that mention entities, and linking each sequence to the entity it mentions. This process is called entity linking and is a preliminary step for building more sophisticated solutions that aim at reconstructing the semantics of phrases or of the whole document. Entities are usually drawn from a knowledge base, and each entity represents an unambiguous concept. The kinds of entities covered by a knowledge base depend on the application.
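
As an illustration of the two steps named above (detecting mentions and linking them to entities), here is a minimal sketch in Python. It is not the method developed in the thesis: it assumes a tiny, made-up surface-form dictionary (`SURFACE_FORMS`), greedy longest-match mention detection, and disambiguation by the highest prior probability. All names, entity identifiers, and probabilities are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical surface-form dictionary: mention text -> {entity id: prior}.
# Entries and priors are invented for illustration only.
SURFACE_FORMS: Dict[str, Dict[str, float]] = {
    "pisa": {"Pisa": 0.9, "Programme_for_International_Student_Assessment": 0.1},
    "leaning tower": {"Leaning_Tower_of_Pisa": 1.0},
    "tower": {"Tower": 0.7, "Leaning_Tower_of_Pisa": 0.3},
}
MAX_MENTION_TOKENS = 3  # assumed upper bound on mention length, in tokens


def link_entities(text: str) -> List[Tuple[str, str]]:
    """Return (mention, entity) pairs found by greedy longest-match lookup."""
    tokens = text.lower().split()
    annotations: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(MAX_MENTION_TOKENS, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            candidates = SURFACE_FORMS.get(mention)
            if candidates:
                # Disambiguate by picking the candidate with the highest prior.
                entity = max(candidates, key=candidates.get)
                annotations.append((mention, entity))
                i += n
                matched = True
                break
        if not matched:
            i += 1  # no mention starts at this token
    return annotations


if __name__ == "__main__":
    print(link_entities("the leaning tower in pisa"))
```

A prior-only disambiguator ignores context, so it would always link "pisa" to the city; real systems typically also use the surrounding text and the coherence between the entities chosen for a document.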

Since entities are unambiguous, representing a textual document as the set of entities it mentions overcomes the issues of synonymy and polysemy that are inherent to natural language terms. In addition, entities are elements of a knowledge base that offers structured information about each entity and a graph of relations between entities. This way we can capture semantic connections to other entities and other documents. Since the ontology is generated by humans, these connections are highly accurate. In other words, entity linking adds a layer of structured information to the original unstructured text.
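
The synonymy point can be made concrete with a small sketch. Two hypothetical documents describe the same subject with different words; compared as sets of terms they barely overlap, while compared as sets of entities they coincide. The term sets, the entity annotations, and the choice of Jaccard similarity are all assumptions made for this example, not data or measures taken from the thesis.

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity between two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical documents, represented as bags of terms...
doc_a_terms = {"big", "apple", "subway", "delays"}
doc_b_terms = {"nyc", "metro", "delays"}

# ...and as the sets of entities an entity linker might assign to them.
doc_a_entities = {"New_York_City", "New_York_City_Subway"}
doc_b_entities = {"New_York_City", "New_York_City_Subway"}

term_similarity = jaccard(doc_a_terms, doc_b_terms)
entity_similarity = jaccard(doc_a_entities, doc_b_entities)

if __name__ == "__main__":
    print(f"term-based similarity:   {term_similarity:.2f}")
    print(f"entity-based similarity: {entity_similarity:.2f}")
```

Polysemy is the mirror case: the term "apple" would match a document about fruit, whereas the entity `New_York_City` would not.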

A diverse set of algorithms has been proposed for entity linking and, despite being at an early stage of development, these systems achieve surprisingly good results and bring a significant improvement to the applications that employ this representation of texts, such as document clustering, classification, and others.

For all the problems mentioned above, it is crucial to define abstractions such as the similarity of concepts, the disambiguation of mentions contained in a natural language text, and the relevance of a topic, and to find a formal description of each of them, followed by an efficient software implementation.
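
To show what a formal description of one such abstraction might look like, the sketch below scores the similarity of two concepts from the overlap of their incoming links in the knowledge-base graph, in the style of the link-based relatedness measure of Milne and Witten. This is an assumed example, not necessarily the definition adopted in the thesis; the in-link sets in `IN_LINKS` and the knowledge-base size `KB_SIZE` are invented.

```python
import math
from typing import Dict, Set

# Hypothetical in-link sets: entity -> entities whose pages link to it.
IN_LINKS: Dict[str, Set[str]] = {
    "Pisa": {"Tuscany", "Italy", "Leaning_Tower_of_Pisa", "Galileo_Galilei"},
    "Leaning_Tower_of_Pisa": {"Pisa", "Tuscany", "Italy"},
    "Python_(programming_language)": {"Guido_van_Rossum", "Programming_language"},
}
KB_SIZE = 1000  # assumed total number of entities; must exceed every in-link count


def relatedness(a: str, b: str) -> float:
    """Link-based relatedness in [0, 1]; 0.0 when no in-links are shared."""
    links_a = IN_LINKS.get(a, set())
    links_b = IN_LINKS.get(b, set())
    common = links_a & links_b
    if not common:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    # Normalized distance: small when most in-links are shared.
    distance = (math.log(larger) - math.log(len(common))) / (
        math.log(KB_SIZE) - math.log(smaller)
    )
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    print(relatedness("Pisa", "Leaning_Tower_of_Pisa"))
    print(relatedness("Pisa", "Python_(programming_language)"))
```

On a real knowledge base the in-link sets can contain millions of elements, so the set intersection becomes the cost that an efficient implementation has to control.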

Considering the size of the knowledge bases at issue, the tasks presented above pose important algorithmic challenges over several kinds of large data, such as labeled graphs with millions of nodes and edges.

This thesis will investigate both the theoretical aspects and the algorithm engineering issues that arise in this context.
File