Thesis etd-09222025-154207
Thesis type
Master's thesis
Author
NARDONE, ANGELO
URN
etd-09222025-154207
Title
Lossless Compression of Source Code using Large Language Models
Department
INFORMATICA
Degree programme
INFORMATICA
Supervisors
supervisor Prof. Ferragina, Paolo
Keywords
- code
- compression
- language
- large
- lossless
- models
- source
Defence session start date
17/10/2025
Availability
Not available for consultation
Release date
17/10/2028
Abstract
This thesis explores novel approaches to lossless text compression using Large Language Models (LLMs), with a focus on the compression of source code files. Motivated by the exponential growth of software repositories, where classical compressors such as bzip or zstd are widely used but limited, the study investigates the potential of LLMs to improve compression performance.
The work makes three main contributions. First, it evaluates methods on code-oriented datasets directly tied to large-scale archival challenges. Second, it provides a systematic comparison of multiple LLMs and their quantized variants, emphasizing both compression ratio and execution time. Third, it introduces new Shannon-inspired symbol ranking techniques, not previously explored, which demonstrate slightly improved runtime efficiency compared to existing LLM-based methods.
While results confirm that LLMs can achieve superior compression ratios, they also reveal persistent limitations in execution time. Nonetheless, the proposed approaches highlight promising research directions for balancing compression effectiveness with practical efficiency.
File
| File name | Size |
|---|---|
| The thesis is not available for consultation. | |