## Tesi etd-03242017-172505 |

Thesis type

Tesi di dottorato di ricerca

Author

FARRUGGIA, ANDREA

URN

etd-03242017-172505

Title

On the use of optimization techniques for designing novel data compression and indexing schemes

Settore scientifico disciplinare

INF/01

Corso di studi

INFORMATICA

Commissione

**tutor**Prof. Ferragina, Paolo

**relatore**Prof. Venturini, Rossano

Parole chiave

- algoritmi
- strutture dati
- indici compressi
- compressione dati
- ottimizzazione

Data inizio appello

29/04/2017;

Consultabilità

completa

Riassunto analitico

The last few years have seen an exponential increase, driven by many disparate fields such as big data analytics, genomic technologies or even high-energy physics, of the amount of data that must be stored, accessed and analyzed.

The possibility offered to the users by social media networks of easily creating and publishing contents gave rise to platforms managing hundred of petabytes, or the appearance of sequencers capable of producing terabytes of data at an inexpensive price, made possible by advancements in DNA sequencing technology, opened the road to fields such as pan-genomic analysis where common motifs have to be found in hundreds of genomes that must be first sequenced and indexed.

Many of these developments, which operate on the data in an on-line fashion, have strict performance requirements. For example, web indexes must be capable of retrieving any indexed webpage in less than a millisecond on average in order to meet operational standards.

This represents an incentive to store as much data as possible on main memory, instead of secondary storage like hard disks or solid-state disks. In fact, since accessing the RAM has a latency of around 100 nanoseconds, while accessing the disk has a latency of few milliseconds, even allowing just 1% of the memory references to be read from disk implies a slowdown of a factor of about 100.

The most effective way of fitting more data in memory is to make use of data compression techniques. However, because of the on-line fashion these applications operate, these techniques must allow to operate over the data with an efficiency that is comparable to that of storing data in an uncompressed format. The kind of techniques that must be used to allow good compression ratios and fast data access depends on the way the data is accessed and on the kind of operations that must be performed on it. Applications can be grouped into two broad categories, depending on the nature of these operational

-) applications that organize data into files that must be accessed on their entirety. Examples of this pattern are high-performing distributed storage systems like Google's BigTable, a distributed key-value database where each value is composed of chunks of $64$MiB data. Applications in this scenario, also known as "compress once, decompress many times", need a good trade-off between compression ratio and decompression speed.

-) applications that need to perform sophisticated queries on their dataset. Examples of this class of applications are web indexes, where a search query needs to fetch only those documents containing a user-supplied list of words, or DNA aligners, where portions of the text that match a given pattern must be found. In this kind of applications, only a small fraction of the data should be accessed, so the compressed representation should allow for some kind of direct access without requiring to decompress the whole text, which is costly.

Theoretical solutions for these needs abound in the literature.

For "compress once, decompress many times" scenarios, there are compressors based either on the Lempel-Ziv parsing scheme or on the Burrows-Wheeler Transform that are asymptotically optimal both in time and space, while for the second category Compressed Full-Text Indexes such as the FM-index allow powerful queries on the compressed representation with almost optimal time complexities.

However, new applications provide plenty of new challenges, such as the need of specific time/space trade-offs that are not provided by existing solutions or the necessity of new, compact storage systems that exploit the specialties of new datasets to support fast queries on it.

In this thesis, our contributions focused on developing tools and techniques that can be helpful in addressing these new challenges.

Efficient and usable space-optimal compressor.

Even though the LZ77 algorithm described by Lempel and Ziv is asymptotically optimal in a strong information-theory sense, it does not necessarily yield the lowest possible compression ratio achievable for every possible string. In fact, an active line of research, both in the industry and in academia, focused on improving the achieved compression ratio attainable by a LZ77 scheme for each individual string. Ferragina, Nitto and Venturini introduced the bit-optimal LZ77 parsing problem, the problem of constructing the most succinct LZ77 compression of any given text, and illustrated an efficient construction algorithm. Their algorithm assumes that universal codes are used for compressing the LZ77 phrases in the parsing. The algorithm illustrated in their work, albeit theoretically efficient when specific coders are used, is not practical because it involves sophisticated transformations on cache-unfriendly data structures, so it is difficult to implement and likely slow in practice. In this chapter we show a practical, memory-friendly algorithm that matches the time and space complexity of the original solution.

Bicriteria data compressor.

In the industry, many data compressors have been developed to trade, in a very specific way, compression ratio for decompression speed.

Unfortunately, these approaches are ad-hoc and thus not readily applicable to other applicative scenarios, a situation that originated in the development of myriads of different data compressors, each targeting a different performance profile. In this chapter we target this issue by introducing the Bicriteria Data Compression problem, which asks for the most succinct LZ77 compression that can be decompressed in a given time budget, or vice-versa. Similarly to the bit-optimal LZ77 parsing problem, we assume that phrases are compressed using universal codes.

The problem is modeled as a Weight-Constrained Shortest Path Problem on a directed, acyclic graph. We then show an algorithm that solves efficiently the problem by exploiting some peculiarities of that graph. Nicely, the solution reduces the problem to the resolution of a small number of instances of the bit-optimal LZ77 problem, which implies in turn the existence of an efficient and practical construction algorithm.

An efficient, engineered data compressor

In this chapter we illustrate bczip, an engineered implementation of the efficient bit-optimal and bicriteria data compressors introduced in the previous chapters. We illustrate some algorithmic improvements that aim at making the algorithm even more efficient in practice.

We also show that conventional means of providing some decompression time/compression ratio trade-offs on LZ77-compressors are not consistent and sometimes even detrimental, a result which further validates our approach.

The benchmarks show that the proposed compressor is extremely competitive with the state-of-the-art when compression ratio and decompression speeds are considered together.

Succinct index for relative data structures

On biological applications, there is a need of indexing a great number of documents that are very similar. Because of this, a new line of research has focused on building relative compressed indexes, that is, compressed indexes that are expressed relatively to the index of a similar document in order to save space. These approaches exploit the similarities of the data structures underneath these indexes to lower their total space consumption. For example, the Relative FM-Index exploits the fact that the BWT transform of two similar documents are similar, and thus expresses a BWT as the difference of the reference. In this chapter we propose a new relative Lempel-Ziv representation that can compactly represent all differences among similar documents and we show a very efficient implementation supporting fast random access to the (relatively) compressed input text. Being a general solution, this approach can be used to help designing new relative data structures for similar settings.

The possibility offered to the users by social media networks of easily creating and publishing contents gave rise to platforms managing hundred of petabytes, or the appearance of sequencers capable of producing terabytes of data at an inexpensive price, made possible by advancements in DNA sequencing technology, opened the road to fields such as pan-genomic analysis where common motifs have to be found in hundreds of genomes that must be first sequenced and indexed.

Many of these developments, which operate on the data in an on-line fashion, have strict performance requirements. For example, web indexes must be capable of retrieving any indexed webpage in less than a millisecond on average in order to meet operational standards.

This represents an incentive to store as much data as possible on main memory, instead of secondary storage like hard disks or solid-state disks. In fact, since accessing the RAM has a latency of around 100 nanoseconds, while accessing the disk has a latency of few milliseconds, even allowing just 1% of the memory references to be read from disk implies a slowdown of a factor of about 100.

The most effective way of fitting more data in memory is to make use of data compression techniques. However, because of the on-line fashion these applications operate, these techniques must allow to operate over the data with an efficiency that is comparable to that of storing data in an uncompressed format. The kind of techniques that must be used to allow good compression ratios and fast data access depends on the way the data is accessed and on the kind of operations that must be performed on it. Applications can be grouped into two broad categories, depending on the nature of these operational

-) applications that organize data into files that must be accessed on their entirety. Examples of this pattern are high-performing distributed storage systems like Google's BigTable, a distributed key-value database where each value is composed of chunks of $64$MiB data. Applications in this scenario, also known as "compress once, decompress many times", need a good trade-off between compression ratio and decompression speed.

-) applications that need to perform sophisticated queries on their dataset. Examples of this class of applications are web indexes, where a search query needs to fetch only those documents containing a user-supplied list of words, or DNA aligners, where portions of the text that match a given pattern must be found. In this kind of applications, only a small fraction of the data should be accessed, so the compressed representation should allow for some kind of direct access without requiring to decompress the whole text, which is costly.

Theoretical solutions for these needs abound in the literature.

For "compress once, decompress many times" scenarios, there are compressors based either on the Lempel-Ziv parsing scheme or on the Burrows-Wheeler Transform that are asymptotically optimal both in time and space, while for the second category Compressed Full-Text Indexes such as the FM-index allow powerful queries on the compressed representation with almost optimal time complexities.

However, new applications provide plenty of new challenges, such as the need of specific time/space trade-offs that are not provided by existing solutions or the necessity of new, compact storage systems that exploit the specialties of new datasets to support fast queries on it.

In this thesis, our contributions focused on developing tools and techniques that can be helpful in addressing these new challenges.

Efficient and usable space-optimal compressor.

Even though the LZ77 algorithm described by Lempel and Ziv is asymptotically optimal in a strong information-theory sense, it does not necessarily yield the lowest possible compression ratio achievable for every possible string. In fact, an active line of research, both in the industry and in academia, focused on improving the achieved compression ratio attainable by a LZ77 scheme for each individual string. Ferragina, Nitto and Venturini introduced the bit-optimal LZ77 parsing problem, the problem of constructing the most succinct LZ77 compression of any given text, and illustrated an efficient construction algorithm. Their algorithm assumes that universal codes are used for compressing the LZ77 phrases in the parsing. The algorithm illustrated in their work, albeit theoretically efficient when specific coders are used, is not practical because it involves sophisticated transformations on cache-unfriendly data structures, so it is difficult to implement and likely slow in practice. In this chapter we show a practical, memory-friendly algorithm that matches the time and space complexity of the original solution.

Bicriteria data compressor.

In the industry, many data compressors have been developed to trade, in a very specific way, compression ratio for decompression speed.

Unfortunately, these approaches are ad-hoc and thus not readily applicable to other applicative scenarios, a situation that originated in the development of myriads of different data compressors, each targeting a different performance profile. In this chapter we target this issue by introducing the Bicriteria Data Compression problem, which asks for the most succinct LZ77 compression that can be decompressed in a given time budget, or vice-versa. Similarly to the bit-optimal LZ77 parsing problem, we assume that phrases are compressed using universal codes.

The problem is modeled as a Weight-Constrained Shortest Path Problem on a directed, acyclic graph. We then show an algorithm that solves efficiently the problem by exploiting some peculiarities of that graph. Nicely, the solution reduces the problem to the resolution of a small number of instances of the bit-optimal LZ77 problem, which implies in turn the existence of an efficient and practical construction algorithm.

An efficient, engineered data compressor

In this chapter we illustrate bczip, an engineered implementation of the efficient bit-optimal and bicriteria data compressors introduced in the previous chapters. We illustrate some algorithmic improvements that aim at making the algorithm even more efficient in practice.

We also show that conventional means of providing some decompression time/compression ratio trade-offs on LZ77-compressors are not consistent and sometimes even detrimental, a result which further validates our approach.

The benchmarks show that the proposed compressor is extremely competitive with the state-of-the-art when compression ratio and decompression speeds are considered together.

Succinct index for relative data structures

On biological applications, there is a need of indexing a great number of documents that are very similar. Because of this, a new line of research has focused on building relative compressed indexes, that is, compressed indexes that are expressed relatively to the index of a similar document in order to save space. These approaches exploit the similarities of the data structures underneath these indexes to lower their total space consumption. For example, the Relative FM-Index exploits the fact that the BWT transform of two similar documents are similar, and thus expresses a BWT as the difference of the reference. In this chapter we propose a new relative Lempel-Ziv representation that can compactly represent all differences among similar documents and we show a very efficient implementation supporting fast random access to the (relatively) compressed input text. Being a general solution, this approach can be used to help designing new relative data structures for similar settings.

File

Nome file | Dimensione |
---|---|

final_report.pdf | 90.10 Kb |

phd_thes...uggia.pdf | 1.23 Mb |

Contatta l'autore |