Tesi etd-09132018-170123

Tipo di tesi

Tesi di laurea magistrale

Autore

VINCIGUERRA, GIORGIO

URN

etd-09132018-170123

Titolo

On Achieving Principled Space-Time Trade-Offs by Novel Indexing Data Structures

Dipartimento

INFORMATICA

Corso di studi

INFORMATICA

Relatori

relatore Prof. Ferragina, Paolo

Parole chiave

big data
data structures
database
dictionary
external memory
indexing
neural networks
searching

Data inizio appello

05/10/2018

Consultabilità

Non consultabile

Data di rilascio

05/10/2088

Riassunto

The explosion of big data poses a serious problem to the efficient retrieval and management of information. Conventional indexes such as B-tree and its variants scale both in space and in time with the number of keys, and this limitation will become more and more severe in the long run. To further complicate the situation, a new generation of applications and paradigms, such as IoT, fog and edge computing, demand strict latency, energy and storage constraints that also vary among devices and users. Traditional algorithms are unable to offer these flexibility and adaptability features in a principled way.

Recent research on external memory, cache-oblivious, and compressed data structures, has tried to tackle these problems but, unfortunately, apart from some few and specific results, we are still far from achieving the above goals, let alone offer tools to ease the work of software engineers in choosing the best solution for a constrained application.

To address these challenges, we propose a novel data structure that exploits a simple yet effective observation: not all datasets should be indexed in the same way, as they differ both in distribution and regularities. Indeed, one would neither use a tree nor a hash table if the dataset has increasing consecutive integer keys, as it is sufficient to use a linear function mapping from keys to positions. Starting from this trivial observation, our strategy builds a piecewise linear representation of the 2D data distribution (key, position), which is then used at query time to find the approximate position of a key in constant time. We were able to show that the piecewise representation is effective on various input datasets and, moreover, that depending on the context of use, one can design via an optimisation process a data structure that given a maximum query time minimises the space occupancy, or that given a maximum space minimises the query time.

We experiment our data structure, which we call Top-Down Regression index (TDR-index), on four real-world datasets: timestamps of IoT sensors events, taxi pickup times, longitude of points-of-interest in a map, and timestamps of requests to a web server. Compared to a popular in-memory B+ tree implementation, our data structure is able to achieve faster query time while reducing the memory occupancy by four orders of magnitude. Compared to the cache-sensitive search tree, our data structure is able to achieve its efficient query performance but with a gain of 74× in space reduction.

Our last contribution is to explore the possibility of improving the piecewise representation through the use of nonlinear regression models, such as neural networks.

File

Nome file	Dimensione
Tesi non consultabile. Contatta l’autore

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-09132018-170123