ETD

Digital archive of theses discussed at the University of Pisa

 

Thesis etd-11132024-150728


Thesis type
Master's thesis
Author
RAMACCIOTTI, FEDERICO
URN
etd-11132024-150728
Thesis title
Integrating advanced compression techniques with key-value stores for managing large source-code datasets
Department
COMPUTER SCIENCE
Course of study
COMPUTER SCIENCE
Supervisors
supervisor Prof. Ferragina, Paolo
co-supervisor Dr. Tosoni, Francesco
Keywords
  • compression
  • large source-code datasets
  • lossless compression
  • permute-partition-compress
  • rocksdb
  • software heritage
Graduation session start date
29/11/2024
Availability
Full
Summary
Large source-code datasets are increasingly popular for archival and artificial intelligence purposes. We present new ways to optimize the storage and indexing of these datasets, achieving better compression ratios and higher file-access throughput. To this end, we leverage the Permute-Partition-Compress paradigm and the RocksDB key-value store to make the datasets compressible, dynamic, and quickly accessible. We corroborate the practical efficiency of our solution with experiments on datasets of various sizes, up to 6 TB, taken from The Stack v1 (provided by Hugging Face) and the Software Heritage Archive. Datasets and source code are publicly available on GitHub.