ETD

Digital archive of the theses defended at the Università di Pisa

Thesis etd-05132010-103407


Thesis type
PhD thesis
Author
PUPPIN, DIEGO MARIO
URN
etd-05132010-103407
Title
A Search Engine Architecture Based on Collection Selection
Scientific disciplinary sector
INF/01
Degree programme
INFORMATICA
Supervisors
tutor Prof. Vanneschi, Marco
Keywords
  • Collection Selection
  • Information Retrieval
  • Search Engine
Defence session start date
11/12/2007
Availability
Full
Abstract
In this thesis, we present a distributed architecture for a Web search engine, based on the concept of collection selection. We introduce a novel approach to partitioning the collection of documents, which greatly improves the effectiveness of standard collection selection techniques (CORI), and a new selection function that outperforms the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents.
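
As a rough illustration of the QV model, the sketch below (a minimal Python example with hypothetical data structures, not the code used in the thesis) builds, for each document, a vector indexed by the queries of a log, assuming each coordinate is the score the document obtained when a reference engine answered that query.

    from collections import defaultdict

    def build_query_vectors(query_log_results):
        """query_log_results: iterable of (query_id, [(doc_id, score), ...])
        pairs, i.e. the reference results recorded for each query in the log.
        Returns a dict mapping doc_id -> {query_id: weight} (the QV vectors)."""
        qv = defaultdict(dict)
        for query_id, ranked_docs in query_log_results:
            for doc_id, score in ranked_docs:
                # A document's coordinate on a query is the score it obtained
                # for that query; queries that never retrieve it contribute 0.
                qv[doc_id][query_id] = score
        return qv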

Incidentally, our partitioning strategy is able to identify documents that can be safely moved out of the main index (into a supplemental index), with a minimal loss in result accuracy. In our tests, we could move 50% of the collection to the supplemental index with a minimal loss in recall.
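
One way to picture this split, under the assumption (ours, not stated in the abstract) that documents never retrieved for any logged query have empty query vectors, is the following hypothetical helper:

    def split_collection(all_doc_ids, query_vectors):
        """Documents with a non-empty query vector stay in the main index;
        the others are candidates for the supplemental index."""
        main, supplemental = [], []
        for doc_id in all_doc_ids:
            (main if query_vectors.get(doc_id) else supplemental).append(doc_id)
        return main, supplemental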

By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant servers are polled, thus reducing the average computing cost of solving a query.
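
The routing step can be pictured as follows (a minimal sketch with hypothetical per-partition statistics; the actual selection function of the thesis is built on the QV/co-clustering model, not on this simple term sum):

    def select_partitions(query_terms, partition_stats, k):
        """partition_stats: dict partition_id -> {term: weight}, a summary
        of each document partition. Returns the k most promising partitions."""
        scores = {
            pid: sum(stats.get(term, 0.0) for term in query_terms)
            for pid, stats in partition_stats.items()
        }
        # Poll only the top-k partitions instead of broadcasting to all servers.
        return sorted(scores, key=scores.get, reverse=True)[:k]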

We introduce a novel strategy that uses the instantaneous load at each server to drive query routing. We also describe a new approach to caching, able to incrementally improve the quality of the stored results. Our caching strategy is effective both in reducing the computing load and in improving result quality.
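
The idea of incrementally improving cached results can be sketched as follows (hypothetical class and field names, assuming each cache entry records which partitions have already contributed to it):

    class IncrementalCache:
        """Cache whose entries are refined as more partitions are polled."""

        def __init__(self):
            self.entries = {}  # query -> {"results": {doc_id: score}, "polled": set()}

        def update(self, query, partition_id, new_results):
            entry = self.entries.setdefault(query, {"results": {}, "polled": set()})
            entry["polled"].add(partition_id)
            for doc_id, score in new_results:
                # Keep the best score seen so far for each document.
                if score > entry["results"].get(doc_id, float("-inf")):
                    entry["results"][doc_id] = score

        def top_k(self, query, k=10):
            entry = self.entries.get(query)
            if entry is None:
                return []
            ranked = sorted(entry["results"].items(), key=lambda kv: kv[1], reverse=True)
            return ranked[:k]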

By combining these innovations, we can achieve very high precision with a reduced load with respect to full query broadcasting.
Our system can cover 65% of the results offered by a centralized reference index (competitive recall at 5), with a computing load of only 15.6%, i.e. a peak of 156 queries out of a shifting window of 1000 queries. This is about 1/4 of the peak load reached when broadcasting queries. With a slightly higher load (24.6%), the system can cover 78% of the reference results.

Overall, the proposed architecture presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in the face of a dramatic reduction in computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries, and more documents.
File