Thesis type
Master's thesis
Title
Lossless Compression of Source Code using Large Language Models
Degree programme
COMPUTER SCIENCE
Keywords
- code
- compression
- language
- large
- lossless
- models
- source
Graduation session start date
17/10/2025
Availability
Not available for consultation
Release date
17/10/2028
Abstract
This thesis explores novel approaches to lossless text compression using Large Language Models (LLMs), with a focus on the compression of source code files. The work is motivated by the exponential growth of software repositories, where classical compressors such as bzip or zstd are widely used but limited in the compression they achieve. The study investigates the potential of LLMs to improve compression performance.
The work makes three main contributions. First, it evaluates methods on code-oriented datasets directly tied to large-scale archival challenges. Second, it provides a systematic comparison of multiple LLMs and their quantized variants, emphasizing both compression ratio and execution time. Third, it introduces new Shannon-inspired symbol ranking techniques, not previously explored, which demonstrate slightly improved runtime efficiency compared to existing LLM-based methods.
While the results confirm that LLMs can achieve higher compression ratios than classical compressors, they also reveal persistent limitations in execution time. Nonetheless, the proposed approaches highlight promising research directions for balancing compression effectiveness with practical efficiency.