## Tesi etd-02262019-145646 |

Thesis type

Tesi di dottorato di ricerca

Author

PIBIRI, GIULIO ERMANNO

URN

etd-02262019-145646

Title

Space and Time-Efficient Data Structures for Massive Datasets

Settore scientifico disciplinare

INF/01

Corso di studi

INFORMATICA

Supervisors

**tutor**Prof. Venturini, Rossano

Parole chiave

- Algorithm Engineering
- Efficiency
- Data Compression

Data inizio appello

08/03/2019;

Consultabilità

Completa

Riassunto analitico

This thesis concerns the design of compressed data structures for the efficient storage of massive datasets of integer sequences and short strings. The studied problems arise in several real-world scenarios, such as inverted index engineering, thus a consistent part of the thesis focuses on the experimental analysis of the proposed, practical, solutions. All the code used in the experimentation is publicly available in the hope of being useful for further research.

For the problem of representing in compressed space the sorted integer sequences of inverted indexes, we propose three different solutions showing new interesting space/time trade-offs.

The first proposal shows that clustering the inverted lists yields more compact indexes at the expense of a moderate slowdown in query processing speed. The core idea is that clusters of similar inverted lists, i.e., the ones sharing many integers, are compressed together, thus permitting to reduce their redundancy.

The second proposal describes an optimal, linear-time, algorithm that splits an inverted list in variable-size partitions in order to minimize its space when the Variable-Byte encoding is used to compress its partitions. The proposed algorithm is fast and yields indexes that are more than twice smaller than regular Variable- Byte, without compromising their speed efficiency.

The third proposal exploits the repetitiveness of the d-gapped inverted lists to achieve compact storage and very fast decoding speed. Specifically, we show that a dictionary of repeated integer patterns can be built to compress the inverted lists. This permits to represent the inverted lists as references to the dictionary’s patterns, thus allowing the two-fold advantage of: encoding many integers with a single reference identifier; decoding many integers with a single dictionary access.

We then take into account strings of at most n words, i.e., n-grams and address the two fundamental problems of: indexing large and sparse n-gram datasets; esti- mating language models from a large textual collection.

For the problem of indexing, we introduce a compressed trie data structure where the identifier of a word appearing after a context of k preceding words is represented as an integer whose value is proportional to the number of words that can follow only such context. Since the number of words following a given context is small for natural languages, the data structure achieves very compact storage and fast random access to the n-gram satellite values.

For the problem of estimation, we design an algorithm for estimating Kneser-Ney language models in external memory, which have emerged as the de-facto choice for language modelling. We show that exploiting the relation between con- text and suffix order of the extracted n-gram strings significantly improves the running time of the algorithm.

From a theoretical perspective, we also study the fascinating problem of representing dynamic integer dictionaries in succinct space. Many solutions for this problem are known to implement the operation of insertion, deletion and access to individual integers in optimal time, yet without posing any space bound on the problem. We exhibit, instead, a data structure that matches the optimal time bounds for the operations and, at the same time, achieves almost optimal com- pressed space by resorting on the succinctness of the Elias-Fano integer encoding.

For the problem of representing in compressed space the sorted integer sequences of inverted indexes, we propose three different solutions showing new interesting space/time trade-offs.

The first proposal shows that clustering the inverted lists yields more compact indexes at the expense of a moderate slowdown in query processing speed. The core idea is that clusters of similar inverted lists, i.e., the ones sharing many integers, are compressed together, thus permitting to reduce their redundancy.

The second proposal describes an optimal, linear-time, algorithm that splits an inverted list in variable-size partitions in order to minimize its space when the Variable-Byte encoding is used to compress its partitions. The proposed algorithm is fast and yields indexes that are more than twice smaller than regular Variable- Byte, without compromising their speed efficiency.

The third proposal exploits the repetitiveness of the d-gapped inverted lists to achieve compact storage and very fast decoding speed. Specifically, we show that a dictionary of repeated integer patterns can be built to compress the inverted lists. This permits to represent the inverted lists as references to the dictionary’s patterns, thus allowing the two-fold advantage of: encoding many integers with a single reference identifier; decoding many integers with a single dictionary access.

We then take into account strings of at most n words, i.e., n-grams and address the two fundamental problems of: indexing large and sparse n-gram datasets; esti- mating language models from a large textual collection.

For the problem of indexing, we introduce a compressed trie data structure where the identifier of a word appearing after a context of k preceding words is represented as an integer whose value is proportional to the number of words that can follow only such context. Since the number of words following a given context is small for natural languages, the data structure achieves very compact storage and fast random access to the n-gram satellite values.

For the problem of estimation, we design an algorithm for estimating Kneser-Ney language models in external memory, which have emerged as the de-facto choice for language modelling. We show that exploiting the relation between con- text and suffix order of the extracted n-gram strings significantly improves the running time of the algorithm.

From a theoretical perspective, we also study the fascinating problem of representing dynamic integer dictionaries in succinct space. Many solutions for this problem are known to implement the operation of insertion, deletion and access to individual integers in optimal time, yet without posing any space bound on the problem. We exhibit, instead, a data structure that matches the optimal time bounds for the operations and, at the same time, achieves almost optimal com- pressed space by resorting on the succinctness of the Elias-Fano integer encoding.

File

Nome file | Dimensione |
---|---|

Pibiri.PhD.pdf | 7.71 Mb |

Pibiri.P...zione.pdf | 85.22 Kb |

Contatta l'autore |