ETD

Archivio digitale delle tesi discusse presso l'Università di Pisa

Tesi etd-05252010-115131


Tipo di tesi
Tesi di dottorato di ricerca
Autore
NITTO, IGOR
URN
etd-05252010-115131
Titolo
Parsing Algorithms for Data Compression
Settore scientifico disciplinare
INF/01
Corso di studi
INFORMATICA
Relatori
tutor Prof. Ferragina, Paolo
Parole chiave
  • Algorithms
  • Data Compression
Data inizio appello
24/06/2010
Consultabilità
Non consultabile
Data di rilascio
24/06/2050
Riassunto
The task of parsing consists of splitting an input text into a sequence of contiguous phrases. Several classes of data compression algorithms rely on parsing techniques either to identify repetitions inside the input string (dictionary-based compression) or to locate homogeneous pieces of data which are separately compressed (Permute- Partition-Compress paradigm).

In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution.

We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation.

We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space.

Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.
File