Practical and Effective Re-Pair Compression
Abstract
Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+\epsilon)n+\sqrt{n}$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $\epsilon>0$; however, this solution relies on complex sub-procedures that make it impractical. In this paper, we present an implementation of the above-mentioned result that uses more practical solutions; our tool further improves the working space to $(1.5+\epsilon)n$ words (text included), for some small constant $\epsilon$. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with $d$ rules is $\log(d!)+2d\approx d\log d+0.557d$ bits, and the most efficient encoding algorithm in the literature uses at most $d\log d+2d$ bits and runs in $O(d^{1.5})$ time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses, on average, only 2.8% more bits than the information-theoretic minimum. In half of the tested cases, our compressor produces smaller output than 7-Zip with the maximum compression rate turned on.
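
To make the pair-replacement idea concrete, here is a minimal sketch in Python. It is not the paper's linear-time, space-efficient algorithm: it is a naive version that recounts all pairs on every round and takes quadratic time in the worst case. The function names (`repair`, `expand`, `grammar_lower_bound_bits`) are hypothetical and chosen only for illustration. Terminals are assumed to be single characters and new grammar symbols are integers. The last helper evaluates the lower bound of log2(d!) + 2d bits for a grammar with d rules, next to its approximation d log2 d + 0.557d.

```python
import math
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair in seq."""
    counts = Counter()
    last_pos = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # In a run such as "aaa", the pair (a, a) at i overlaps the one at i-1.
        if pair[0] == pair[1] and last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def repair(text):
    """Naive Re-Pair: returns (final sequence, rules).

    rules maps each new integer symbol to the pair it replaces.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda kv: kv[1])
        if freq < 2:  # no pair repeats, so a new rule would not pay off
            break
        sym = len(rules)  # next unused nonterminal
        rules[sym] = pair
        out, i = [], 0
        while i < len(seq):  # replace occurrences left to right
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Expand a symbol back into the string it derives."""
    if symbol not in rules:
        return symbol  # terminal character
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def grammar_lower_bound_bits(d):
    """Return (log2(d!) + 2d, d*log2(d) + 0.557*d) for d >= 1 rules."""
    exact = math.lgamma(d + 1) / math.log(2) + 2 * d
    approx = d * math.log2(d) + 0.557 * d
    return exact, approx
```

On the input `abracadabra`, this sketch creates three rules, and expanding the final sequence symbol by symbol reproduces the input. The constant 0.557 follows from Stirling's approximation, since log2(d!) ≈ d log2 d - d log2 e and 2 - log2 e ≈ 0.557.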