ETD system

Electronic theses and dissertations repository

 

Thesis etd-04102018-101354


Thesis type
Master's thesis
Author
CAIAZZA, LUIGI
URN
etd-04102018-101354
Title
User-centric focused Web crawling
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE AND NETWORKING
Committee
supervisor Tonellotto, Nicola
supervisor Catena, Matteo
co-examiner Pagli, Linda
Keywords
  • information retrieval
Graduation session start date
27/04/2018
Availability
complete
Abstract
Search engines are the main hub of information on the Web. They crawl and index Web content so that users can satisfy their information needs. At the same time, many other users create new Web content or modify existing content every day. This continuous growth of the Web poses a challenge to search engines: because of the Web's evolution and hardware limitations, it is impossible for a search engine to crawl the Web in its entirety. Consequently, new crawling approaches are needed to limit the number of Web pages a search engine needs to fetch.

In this work, we review the state of the art in Web crawling, with particular attention to two optimization paradigms: focused Web crawling and user-centric Web crawling. The former aims at fetching only Web content about a particular topic (e.g., sports or games), while the latter tries to fetch only content related to the users' information needs (e.g., by analyzing query logs). Both approaches can crawl high-quality content with lower hardware requirements than traditional Web crawling. To the best of our knowledge, however, no attempt has been made to combine their strengths. Therefore, in this thesis we present a novel paradigm for Web crawling optimization that combines the two aforementioned approaches. Through extensive experimentation on the TREC ClueWeb09 (Category B) Web corpus and on the MSN 2006 query log, we show that our hybrid paradigm outperforms existing focused Web crawlers and user-centric Web crawlers.
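The abstract does not describe how the thesis combines the two signals, so the following is only an illustrative sketch of the general idea, not the author's method. It assumes a crawl frontier ordered by a weighted sum of a topical score (the focused signal) and a query-log score (the user-centric signal), both computed on the text surrounding a link. The function names, the mixing weight `alpha`, and the bag-of-words scoring are all hypothetical choices made for illustration.

```python
import heapq


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately naive assumption)."""
    return text.lower().split()


def topic_score(link_text, topic_terms):
    """Focused signal: fraction of tokens that belong to the topic vocabulary."""
    tokens = tokenize(link_text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in topic_terms) / len(tokens)


def user_score(link_text, query_term_freq):
    """User-centric signal: share of query-log term mass covered by the text."""
    total = sum(query_term_freq.values())
    if total == 0:
        return 0.0
    present = set(tokenize(link_text))
    covered = sum(f for term, f in query_term_freq.items() if term in present)
    return covered / total


def hybrid_score(link_text, topic_terms, query_term_freq, alpha=0.5):
    """Hypothetical combination: linear mix of the two signals."""
    return (alpha * topic_score(link_text, topic_terms)
            + (1 - alpha) * user_score(link_text, query_term_freq))


class Frontier:
    """Priority queue of URLs; the highest-scoring unseen URL is fetched first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = 0  # tie-breaker so equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:
            return  # never enqueue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

With `alpha=1.0` the sketch degenerates to a purely focused crawler, and with `alpha=0.0` to a purely user-centric one; intermediate values trade one signal against the other.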