ETD

Digital archive of theses defended at the University of Pisa

Thesis etd-04102018-101354


Thesis type
Master's thesis
Author
CAIAZZA, LUIGI
URN
etd-04102018-101354
Title
User-centric focused Web crawling
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE AND NETWORKING
Supervisors
supervisor Tonellotto, Nicola
supervisor Catena, Matteo
co-examiner Pagli, Linda
Keywords
  • information retrieval
Defence session start date
27/04/2018
Availability
Full
Abstract
Search engines are the main hub of information on the Web. They crawl and index Web content so that users can satisfy their information needs. At the same time, many other users create new Web content or modify existing content every day. This continuous growth of the Web poses a challenge to search engines: because of the Web's evolution and hardware limitations, it is impossible for a search engine to crawl the Web in its entirety. Consequently, new crawling approaches are needed to limit the number of Web pages a search engine has to fetch.

In this work, we review the state of the art in Web crawling, with particular attention to two optimization paradigms: focused Web crawling and user-centric Web crawling. The former aims to fetch only Web content about a particular topic (e.g., sports or games), while the latter tries to fetch only content related to the users' information needs (e.g., by analyzing query logs). Both approaches can crawl high-quality content with lower hardware requirements than traditional Web crawling. To the best of our knowledge, however, no attempt has been made to combine their strengths. Therefore, in this thesis we present a novel paradigm for Web crawling optimization that combines the two approaches. Through extensive experimentation on the TREC ClueWeb09 (Category B) Web corpus and on the MSN 2006 query log, we show that our hybrid paradigm outperforms existing focused Web crawlers and user-centric Web crawlers.
Files