logo SBA

ETD

Archivio digitale delle tesi discusse presso l’Università di Pisa

Tesi etd-01292022-141555


Tipo di tesi
Tesi di dottorato di ricerca
Autore
VINCIGUERRA, GIORGIO
URN
etd-01292022-141555
Titolo
Learning-based compressed data structures
Settore scientifico disciplinare
INF/01
Corso di studi
INFORMATICA
Relatori
tutor Prof. Ferragina, Paolo
Parole chiave
  • algorithm engineering
  • algorithms
  • compressed data structures
  • data compression
Data inizio appello
23/02/2022
Consultabilità
Completa
Riassunto
This thesis revisits two fundamental problems in data structure design: predecessor search and rank/select primitives. These problems are pervasive in applications, particularly in areas such as database systems, search engines, bioinformatics, and Internet routing.
We show that real data present a peculiar kind of regularity that can be explained in terms of geometric considerations. We name it "approximate linearity" and analyse its algorithmic effectiveness in a variety of possible input data distributions. We then expand the horizon of compressed data structures by presenting solutions for the problems above that discover, or "learn", in a rigorous and efficient algorithmic way, the approximate linearities present in the data. In addition, we show how to combine this new form of compressibility with the classic repetition-aware approaches thus introducing a new class of compressed indexes. We accompany our theoretical results with implementations and experiments on large amounts of data, and we show that, compared to several well-engineered known compressed indexes, our data structures provide improvements in time, in space or both (often of orders of magnitude).
File