Thesis etd-11132024-150728
Thesis type
Master's thesis
Author
RAMACCIOTTI, FEDERICO
URN
etd-11132024-150728
Title
Integrating advanced compression techniques with key-value stores for managing large source-code datasets
Department
COMPUTER SCIENCE
Degree programme
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
co-supervisor Dr. Tosoni, Francesco
Keywords
- compression
- large source-code datasets
- lossless compression
- permute-partition-compress
- rocksdb
- software heritage
Graduation session start date
29/11/2024
Availability
Full
Abstract
Large source-code datasets are increasingly popular for archival and artificial intelligence purposes. We propose new ways to optimize the storage and indexing of these datasets, achieving better compression ratios and higher file-access throughput. To this end, we leverage the Permute-Partition-Compress paradigm and the RocksDB key-value store to make the datasets compressible, dynamic, and quickly accessible. We corroborate the practical efficiency of our solution with experiments on datasets of various sizes, up to 6 TB, taken from The Stack v1 (provided by HuggingFace) and the Software Heritage Archive. Datasets and source code are publicly available on GitHub.
File
File name | Size |
---|---|
Tesi.pdf | 3.03 MB |