ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-03272017-103645


Thesis type
PhD thesis
Author
TRANI, SALVATORE
URN
etd-03272017-103645
Title
Improving the Efficiency and Effectiveness of Document Understanding in Web Search
Scientific disciplinary sector
INF/01
Degree programme
COMPUTER SCIENCE
Supervisors
tutor Dr. Perego, Raffaele
committee member Prof. Venturini, Rossano
committee member Prof. Grossi, Roberto
Keywords
  • learning to rank
  • entity linking
  • efficiency
  • effectiveness
  • saliency detection
  • web search
Defense session start date
29/04/2017
Availability
Full
Abstract
Web Search Engines (WSEs) are probably the most complex information systems in use today, since they must handle an ever-increasing number of web pages and match them with the information needs that a multitude of heterogeneous users express in short and often ambiguous queries. In addressing this challenging task, they have to deal, at an unprecedented scale, with two classic and contrasting IR problems: satisfying effectiveness requirements and meeting efficiency constraints. While the former refers to the user-perceived quality of query results, the latter concerns the time the system spends retrieving and presenting them to the user.
Due to the importance of text data on the Web, natural language understanding techniques have gained popularity in recent years and are profitably exploited by WSEs to overcome ambiguities in natural language queries caused, for example, by polysemy and synonymy. A promising approach in this direction is the so-called Web of Data, a paradigm shift that originates from the Semantic Web and promotes the enrichment of Web documents with the semantic concepts they refer to. Enriching unstructured text with an entity-based representation of documents, where entities can precisely identify persons, companies, locations, etc., allows a remarkable improvement in retrieval effectiveness to be achieved.
In this thesis, we argue that it is possible to improve both the efficiency and the effectiveness of document understanding in Web search by exploiting learning-to-rank, i.e., a supervised technique aimed at learning effective ranking functions from training data. Indeed, on the one hand, enriching documents with machine-learnt semantic annotations leads to an improvement in WSE effectiveness, since the retrieval of relevant documents can exploit a finer comprehension of the documents. On the other hand, by enhancing the efficiency of learning-to-rank techniques we can improve both WSE efficiency and effectiveness, since a faster ranking technique can reduce query processing time or, alternatively, allow a more complex and accurate ranking model to be deployed.
The contributions of this thesis are manifold: i) we discuss a novel machine-learnt measure for estimating the relatedness among entities mentioned in a document, thus enhancing the accuracy of text disambiguation techniques for document understanding; ii) we propose a novel machine-learnt technique to label the mentioned entities according to a notion of saliency, where the most salient entities are those that have the highest utility in understanding the topics discussed; iii) we enhance state-of-the-art ensemble-based ranking models by means of a general learning-to-rank framework that is able to iteratively prune the less useful part of the ensemble and re-weight the remaining part according to the loss function adopted. Finally, we share with the research community working in this area several open source tools to promote collaborative development and favor the reproducibility of research results.
File