Thesis etd-09132018-170123
Thesis type
Master's thesis
Author
VINCIGUERRA, GIORGIO
URN
etd-09132018-170123
Title
On Achieving Principled Space-Time Trade-Offs by Novel Indexing Data Structures
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
Keywords
- big data
- data structures
- database
- dictionary
- external memory
- indexing
- neural networks
- searching
Graduation session start date
05/10/2018
Availability
Not available
Release date
05/10/2088
Abstract
The explosion of big data poses a serious problem for the efficient retrieval and management of information. Conventional indexes such as the B-tree and its variants scale both in space and in time with the number of keys, and this limitation will become more and more severe in the long run. To further complicate the situation, a new generation of applications and paradigms, such as IoT, fog and edge computing, impose strict latency, energy and storage constraints that also vary among devices and users. Traditional algorithms are unable to offer this flexibility and adaptability in a principled way.
Recent research on external memory, cache-oblivious, and compressed data structures has tried to tackle these problems but, unfortunately, apart from a few specific results, we are still far from achieving the above goals, let alone offering tools to ease the work of software engineers in choosing the best solution for a constrained application.
To address these challenges, we propose a novel data structure that exploits a simple yet effective observation: not all datasets should be indexed in the same way, as they differ in both distribution and regularities. Indeed, one would use neither a tree nor a hash table if the dataset has increasing consecutive integer keys, as it is sufficient to use a linear function mapping keys to positions. Starting from this trivial observation, our strategy builds a piecewise linear representation of the 2D data distribution (key, position), which is then used at query time to find the approximate position of a key in constant time. We were able to show that the piecewise representation is effective on various input datasets and, moreover, that depending on the context of use, one can design via an optimisation process a data structure that, given a maximum query time, minimises the space occupancy, or that, given a maximum space, minimises the query time.
We evaluate our data structure, which we call the Top-Down Regression index (TDR-index), on four real-world datasets: timestamps of IoT sensor events, taxi pickup times, longitudes of points of interest on a map, and timestamps of requests to a web server. Compared to a popular in-memory B+ tree implementation, our data structure achieves faster query times while reducing the memory occupancy by four orders of magnitude. Compared to the cache-sensitive search tree, our data structure matches its efficient query performance while using 74× less space.
Our last contribution is to explore the possibility of improving the piecewise representation through the use of nonlinear regression models, such as neural networks.
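As an illustration of the piecewise linear (key, position) idea described in the abstract, the following is a minimal sketch and not the TDR-index itself, whose construction and optimisation process are not detailed in this record. It assumes sorted, distinct numeric keys; the names `build_segments`, `PiecewiseLinearIndex` and the error bound `eps` are hypothetical, and the halving rule used to split a range is a simple stand-in chosen for brevity.

```python
import math
from bisect import bisect_left, bisect_right


def build_segments(keys, eps):
    """Cover sorted, distinct keys with line segments (hypothetical helper).

    Each segment is (first_key, slope, first_pos, last_pos). The line through
    a segment's endpoints predicts every position in it within eps; a range
    that violates the bound is halved and each half is fitted again.
    """
    segments = []

    def fit(lo, hi):  # lo and hi are inclusive positions in keys
        slope = 0.0 if hi == lo else (hi - lo) / (keys[hi] - keys[lo])
        err = max(abs(lo + slope * (keys[i] - keys[lo]) - i)
                  for i in range(lo, hi + 1))
        if err <= eps or hi - lo < 2:
            segments.append((keys[lo], slope, lo, hi))
        else:
            mid = (lo + hi) // 2
            fit(lo, mid)        # left half first, so segments stay ordered
            fit(mid + 1, hi)

    if keys:
        fit(0, len(keys) - 1)
    return segments


class PiecewiseLinearIndex:
    """Illustrative index: predict a position, then search a small window."""

    def __init__(self, keys, eps):
        self.keys = keys
        self.eps = eps
        self.segments = build_segments(keys, eps)
        self.first_keys = [seg[0] for seg in self.segments]

    def find(self, key):
        """Return the position of key in keys, or None if it is absent."""
        s = bisect_right(self.first_keys, key) - 1  # segment that may hold key
        if s < 0:
            return None
        first_key, slope, lo, hi = self.segments[s]
        guess = lo + slope * (key - first_key)      # one multiply and one add
        left = min(max(lo, math.floor(guess - self.eps)), hi)
        right = min(hi, math.ceil(guess + self.eps))
        i = bisect_left(self.keys, key, left, right + 1)
        return i if i <= right and self.keys[i] == key else None


# Consecutive integer keys collapse into a single linear segment.
index = PiecewiseLinearIndex(list(range(100, 200)), eps=1)
print(len(index.segments), index.find(150))  # 1 50
```

In this sketch the prediction itself costs a constant number of arithmetic operations, and the final search is confined to at most about 2·eps + 1 positions; locating the right segment still uses a binary search over the segment list, which a real design would need to address separately. Raising `eps` yields fewer segments (less space) at the cost of a wider search window, which is one simple way to picture the space-time trade-off the abstract refers to.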
File
File name | Size |
---|---|
Thesis not available. |