ETD

Digital archive of theses defended at the Università di Pisa

Thesis etd-12222005-090239


Thesis type
PhD thesis
Author
Gulli', Antonino
Email address
gulli@di.unipi.it, a.gulli@tin.it
URN
etd-12222005-090239
Title
On Two Web IR Boosting Tools: Clustering and Ranking
Scientific disciplinary sector
INF/01
Degree program
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
Keywords
  • Search Engines
  • Ranking
  • Clustering
  • Web-IR
Defense session start date
09/06/2006
Availability
Full
Abstract
This thesis investigates several research problems that arise in modern Web Information Retrieval (WebIR). The Holy Grail of modern WebIR is to find a way to organize and rank results so that the most "relevant" ones come first. The first breakthrough technique was the exploitation of the link structure of the Web graph to rank result pages, using the well-known HITS and PageRank algorithms. These link-analysis approaches have been improved and extended, yet they still seem insufficient to provide a satisfying search experience.
In a number of situations a flat list of search results is not enough, and users may want search results grouped on the fly into folders of similar topics. In addition, the folders should be annotated with meaningful labels so that the desired group of results can be identified rapidly. In other situations, users may have different search goals even when they express them with the same query. In this case the search results should be personalized according to the users' online activities. To address this need, we discuss the algorithmic ideas behind SnakeT, a hierarchical clustering meta-search engine that personalizes searches according to the clusters users select on the fly.
There are also situations where users want access to fresh information. In these cases, traditional link analysis may not be suitable, because there may not have been enough time for many links to point to a recently produced piece of information. To address this need, we discuss the algorithmic and numerical ideas behind a new ranking algorithm suited to fresh types of information, such as news articles or blogs.
When link analysis suffices to produce good-quality search results, the huge amount of Web information calls for fast ranking methodologies. We discuss numerical methodologies for accelerating the eigenvector-like computation commonly used in link analysis.
An important result of this thesis is that we show how to address the above predominant issues of Web Information Retrieval by using clustering and ranking methodologies. We demonstrate that clustering and ranking have a mutual reinforcement property that has not yet been studied intensively. This property can be exploited to boost the precision of both methodologies.
File