ETD System

Digital archive of theses defended at the University of Pisa


Thesis etd-09222014-150614

Thesis type
Master's thesis
Email address
Title
Entity-enhanced query classification
Degree programme
Committee
supervisor Ferragina, Paolo
co-supervisor Cornolti, Marco
second reader Pedreschi, Dino
Keywords
  • query classification
  • information retrieval
Session start date
Abstract
Web query classification aims to assign Web users' queries to one or more predefined categories according to their topics. This classification task plays an important role in many IR applications because it is a key tool for improving the effectiveness and efficiency of general-purpose Web search engines. As an example, consider the following uses:
- Metasearch engines: the user's query is sent to multiple search engines, and the top results from each are blended into one overall list; the metasearch engine can organize the large number of Web pages in the search results according to the potential categories to which the issued query belongs.
- Vertical search: compared to general search, this tool focuses the search on a subset of Web pages that correspond to the intent behind each user query. Once the search engine has predicted the category of information a Web user is looking for, it can select a suitable vertical search engine automatically, without forcing the user to access the vertical search engine explicitly.
- Online advertising: this application aims to provide interesting advertisements to Web users during their search activities. This is the main revenue stream for the (free) search engines available on the Web. Classifying user queries into predefined categories helps improve the selection of the most pertinent advertisements.

While text classification and categorization are well-known topics in the Information Retrieval and Text Mining fields, the query classification problem has not yet been completely addressed; several difficulties underlie this task, since queries are short and ambiguous and query terms can be noisy.
Furthermore, as queries contain fewer than 3 terms on average, classic text classification techniques that use word occurrences as features struggle because of feature-space sparseness.
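
The sparseness issue can be illustrated with a minimal sketch. The queries below are invented for illustration and are not taken from the thesis data; the helper name `bag_of_words` is likewise hypothetical. The sketch builds plain term-count vectors and measures how few features are non-zero.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a term-count vector for `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Toy queries, invented for illustration only (assumption, not thesis data).
queries = [
    "jaguar price",
    "apple store hours",
    "python tutorial",
    "cheap flights rome",
]

# The vocabulary is every distinct term seen in the toy query set.
vocabulary = sorted({term for q in queries for term in q.lower().split()})
vectors = [bag_of_words(q, vocabulary) for q in queries]

# Fraction of non-zero entries in the query-by-term matrix.
nonzero = sum(1 for v in vectors for x in v if x)
density = nonzero / (len(vectors) * len(vocabulary))

# Number of terms shared by each pair of distinct queries.
shared_terms = [
    sum(1 for a, b in zip(vectors[i], vectors[j]) if a and b)
    for i in range(len(vectors))
    for j in range(i + 1, len(vectors))
]

print(f"vocabulary size: {len(vocabulary)}, density: {density:.2f}")
print(f"max shared terms between two queries: {max(shared_terms)}")
```

In this toy set no two queries share a term, so a classifier relying only on word occurrences has nothing to generalize from; this is the gap that external knowledge is meant to fill.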
To address these problems, most research groups have built their query classification systems by extracting extra information from the Web, more specifically from Web ontologies such as WordNet, Wikipedia, BabelNet, DBpedia, and YAGO2.
The effectiveness of these resources lies in their structure as large graphs of concepts, properly interconnected to denote (semantic) relations between pairs of concepts. These labeled graphs are becoming increasingly important in many IR applications and have recently led to the design of powerful tools now known as topic annotators.
The key idea is to identify, in the input text, short and meaningful sequences of terms (also called mentions) and annotate them with unambiguous identifiers (also called entities) drawn from a catalog.
Most recent work adopts anchor texts occurring in Wikipedia as entity mentions and the respective Wikipedia pages as the mentioned entities, because Wikipedia today offers the best trade-off between catalogs with a rigorous structure but low coverage (such as WordNet, CYC, TAP) and a large text collection with wide coverage but unstructured and noisy content (like the whole Web).
Apart from some preliminary approaches to using topic annotators in classification, as far as we know no one has investigated their use in the query classification problem.
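
The mention-to-entity idea described above can be sketched as a toy dictionary lookup. This is not how TagMe, WAT, or SMAPH work internally; the catalog, the prior scores, and the function name `annotate` are all hypothetical, and the sketch disambiguates by prior alone, whereas real annotators also use the surrounding context.

```python
# Hypothetical mini-catalog: mention (anchor text) -> candidate entities,
# each with a made-up prior score. None of these numbers come from Wikipedia.
CATALOG = {
    "jaguar": [("Jaguar_Cars", 0.6), ("Jaguar", 0.4)],
    "new york": [("New_York_City", 0.9), ("New_York_(state)", 0.1)],
    "apple": [("Apple_Inc.", 0.7), ("Apple", 0.3)],
}

def annotate(query, catalog, max_len=3):
    """Greedy longest-match annotation of a query.

    Scans the query left to right, tries the longest token span first,
    and links each matched mention to its highest-prior candidate entity.
    """
    tokens = query.lower().split()
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            if mention in catalog:
                entity = max(catalog[mention], key=lambda c: c[1])[0]
                annotations.append((mention, entity))
                i += n  # skip past the matched mention
                matched = True
                break
        if not matched:
            i += 1  # no mention starts at this token
    return annotations

print(annotate("hotels in new york", CATALOG))
print(annotate("jaguar price", CATALOG))
```

The output is a list of (mention, entity) pairs, which replaces a handful of sparse query terms with identifiers that carry the catalog's structural knowledge.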

This is exactly the goal of this thesis, in which we study, design and test a multi-label classification system that deploys three main ingredients:
- the best topic annotator for texts to date, namely TagMe (version WAT), to annotate user queries with pertinent topics drawn from Wikipedia;
- the best topic annotator for queries to date, namely SMAPH, which was the winning system in the SIGIR competition;
- several novel algorithms and data structures for exploiting the structural knowledge produced by the previous annotators in order to efficiently and effectively classify user queries into a set of 67 categories drawn from the KDDCUP '05 competition.
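
As a rough illustration of what multi-label classification from entities might look like, the sketch below sums per-entity category weights and keeps every category above a threshold. It is not the thesis's algorithm: the entity-to-category table, the weights, the category names, and the `threshold` and `max_labels` parameters are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical table: entity -> {category: weight}. The category names are
# placeholders, not the actual KDDCUP '05 labels, and the weights are invented.
ENTITY_CATEGORIES = {
    "Jaguar_Cars": {"Autos": 0.8, "Companies": 0.5},
    "New_York_City": {"Travel": 0.7, "Local": 0.6},
}

def classify(entities, entity_categories, threshold=0.5, max_labels=5):
    """Return up to `max_labels` categories whose summed weight >= threshold.

    Categories are ranked by descending score, with ties broken by name.
    """
    scores = defaultdict(float)
    for entity in entities:
        for category, weight in entity_categories.get(entity, {}).items():
            scores[category] += weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, score in ranked if score >= threshold][:max_labels]

print(classify(["Jaguar_Cars", "New_York_City"], ENTITY_CATEGORIES, threshold=0.6))
```

Because several categories can pass the threshold at once, a single query may receive more than one label, which is what distinguishes the multi-label setting from ordinary single-label classification.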