## Tesi etd-05252010-115131 |

Thesis type

Tesi di dottorato di ricerca

Author

NITTO, IGOR

URN

etd-05252010-115131

Title

Parsing Algorithms for Data Compression

Settore scientifico disciplinare

INF/01

Corso di studi

INFORMATICA

Supervisors

**tutor**Prof. Ferragina, Paolo

Parole chiave

- Algorithms
- Data Compression

Data inizio appello

24/06/2010;

Consultabilità

Parziale

Data di rilascio

24/06/2050

Riassunto analitico

The task of parsing consists of splitting an input text into a sequence of contiguous phrases. Several classes of data compression algorithms rely on parsing techniques either to identify repetitions inside the input string (dictionary-based compression) or to locate homogeneous pieces of data which are separately compressed (Permute- Partition-Compress paradigm).

In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution.

We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation.

We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space.

Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.

In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution.

We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation.

We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space.

Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.

File

Nome file | Dimensione |
---|---|

1 file non consultabili su richiesta dell'autore. |