Thesis etd-09222025-154207

Thesis type

Master's thesis

Author

NARDONE, ANGELO

URN

etd-09222025-154207

Title

Lossless Compression of Source Code using Large Language Models

Department

COMPUTER SCIENCE

Degree programme

COMPUTER SCIENCE

Supervisors

Supervisor: Prof. Ferragina, Paolo

Keywords

code
compression
language
large
lossless
models
source

Graduation session start date

17/10/2025

Availability

Not available for consultation

Release date

17/10/2028

Abstract

This thesis explores novel approaches to lossless text compression using Large Language Models (LLMs), with a focus on the compression of source code files. Motivated by the exponential growth of software repositories, where classical compressors such as bzip or zstd are widely used but limited, the study investigates the potential of LLMs to improve compression performance.

The work makes three main contributions. First, it evaluates methods on code-oriented datasets directly tied to large-scale archival challenges. Second, it provides a systematic comparison of multiple LLMs and their quantized variants, emphasizing both compression ratio and execution time. Third, it introduces new Shannon-inspired symbol ranking techniques, not previously explored, which demonstrate slightly improved runtime efficiency compared to existing LLM-based methods.

While results confirm that LLMs can achieve superior compression ratios, they also reveal persistent limitations in execution time. Nonetheless, the proposed approaches highlight promising research directions for balancing compression effectiveness with practical efficiency.

File

File name	Size
The thesis is not available for consultation. Contact the author

ETD

Digital archive of theses defended at the Università di Pisa
