logo SBA


Digital archive of theses discussed at the University of Pisa


Thesis etd-05252010-115131

Thesis type
Tesi di dottorato di ricerca
Thesis title
Parsing Algorithms for Data Compression
Academic discipline
Course of study
tutor Prof. Ferragina, Paolo
  • Algorithms
  • Data Compression
Graduation session start date
Release date
The task of parsing consists of splitting an input text into a sequence of contiguous phrases. Several classes of data compression algorithms rely on parsing techniques either to identify repetitions inside the input string (dictionary-based compression) or to locate homogeneous pieces of data which are separately compressed (Permute- Partition-Compress paradigm).

In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution.

We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation.

We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space.

Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.