Tesi etd-05232013-000551

Tipo di tesi

Tesi di dottorato di ricerca

Autore

OTTAVIANO, GIUSEPPE

URN

etd-05232013-000551

Titolo

Space-Efficient Data Structures for Collections of Textual Data

Settore scientifico disciplinare

INF/01

Corso di studi

INFORMATICA

Relatori

tutor Prof. Grossi, Roberto

Parole chiave

compression
semi-structured data
sequences
strings
succinct data structures

Data inizio appello

07/06/2013

Consultabilità

Completa

Riassunto

This thesis focuses on the design of succinct and compressed data structures for collections of string-based data, specifically sequences of semi-structured documents in textual format, sets of strings, and sequences of strings. The study of such collections is motivated by a large number of applications both in theory and practice.
For textual semi-structured data, we introduce the concept of semi-index, a succinct construction that speeds up the access to documents encoded with textual semi-structured formats, such as JSON and XML, by storing separately a compact description of their parse trees, hence avoiding the need to re-parse the documents every time they are read.
For string dictionaries, we describe a data structure based on a path decomposition of the compacted trie built on the string set. The tree topology is encoded using succinct data structures, while the node labels are compressed using a simple dictionary-based scheme. We also describe a variant of the path-decomposed trie for scored string sets, where each string has a score. This data structure can support efficiently top-k completion queries, that is, given a string p and an integer k, return the k highest scored strings among those prefixed by p.
For sequences of strings, we introduce the problem of compressed indexed sequences of strings, that is, representing indexed sequences of strings in nearly-optimal compressed space, both in the static and dynamic settings, while supporting supports random access, searching, and counting operations, both for exact matches and prefix search. We present a new data structure, the Wavelet Trie, that solves the problem by combining a Patricia trie with a wavelet tree. The Wavelet Trie improves on the state-of-the-art compressed data structures for sequences by supporting a dynamic alphabet and prefix queries.
Finally, we discuss the issue of the practical implementation of the succinct primitives used throughout the thesis for the experiments. These primitives are implemented as part of a publicly available library, Succinct, using state-of-the-art algorithms along with some improvements.

File

Nome file	Dimensione
thesis.pdf	1.05 Mb
Contatta l’autore

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-05232013-000551