Thesis etd-04192024-113322

Thesis type

PhD thesis

URN

etd-04192024-113322

Title

Designing new compressed data structures using data-aware approaches

Scientific disciplinary sector

INF/01 - COMPUTER SCIENCE

Degree programme

COMPUTER SCIENCE

Supervisors

tutor Prof. Ferragina, Paolo

Keywords

algorithms
compressed data structure
database
data compression
data structure

Defence session start date

06/05/2024

Availability

Full

Abstract (English)

Our society is generating an exponentially increasing amount of data that is becoming
progressively repetitive. While compressed data structures have traditionally played a
crucial role in addressing repetitiveness, the current dynamic data flow is characterised
by new emerging patterns and trends. Ignoring this evolving tendency means missing
out on the opportunity to significantly enhance both space and time efficiency in system
performance.
In this thesis, we design, implement, and experimentally validate innovative, distinctive,
and data-aware compressed data structures for a wide set of data types. As a result, our
schemes automatically adapt to new patterns and trends arising from Big Data, using
brand-new algorithms as well as state-of-the-art machine-learning-inspired techniques.
This research introduces a learned approach to address the ubiquitous problem of
compressing and indexing integers. Additionally, it explores data-aware optimisation strategies
for constructing compressed trie structures, thereby indexing and compressing strings.
The exploration extends to theoretically grounded solutions for selecting compression
encodings for table columns within industrial analytical database management systems.
Furthermore, it delves into the compression of huge source code datasets, considering file
similarity based on the actual content. To demonstrate the substantial practical benefit of
these techniques, we compared them thoroughly against well-engineered known
solutions. The datasets range from tens of GB of integers and strings to petabyte-scale
database columns and ultra-large-scale source code datasets.
In conclusion, this PhD thesis represents a contribution to the evolving landscape of data
management and compression in the era of Big Data. The data-aware compressed data
structures proposed and examined herein contribute to the emerging trend of designing
adaptive systems that can automatically tailor themselves to diverse patterns.

File

File name	Size
PhD_Thes...final.pdf	1.94 Mb

ETD

Digital archive of theses defended at the Università di Pisa