Towards a Definitive Compressibility Measure for Repetitive Sequences
Abstract
Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures like the size z of the Lempel-Ziv parse are frequently used to estimate it. The size b ≤ z of the smallest bidirectional macro scheme better captures what can be achieved via copy-paste processes, though it is NP-complete to compute and it is not monotonic upon symbol appends. Recently, a more principled measure, the size γ of the smallest string attractor, was introduced. The measure γ ≤ b lower bounds all the previous relevant ones, yet length-n strings can be represented and efficiently indexed within space O(γ log(n/γ)), which also upper bounds most measures. While γ is certainly a better measure of repetitiveness than b, it is also NP-complete to compute and not monotonic, and it is unknown if one can always represent a string in o(γ log n) space. In this paper, we study an even smaller measure, δ ≤ γ, which can be computed in linear time, is monotonic, and allows encoding every string in O(δ log(n/δ)) space because z = O(δ log(n/δ)). We show that δ better captures the compressibility of repetitive strings. Concretely, we show that (1) δ can be strictly smaller than γ, by up to a logarithmic factor; (2) there are string families needing Ω(δ log(n/δ)) space to be encoded, so this space is optimal for every n and δ; (3) one can build run-length context-free grammars of size O(δ log(n/δ)), whereas the smallest (non-run-length) grammar can be up to Θ(log n / log log n) times larger; and (4) within O(δ log(n/δ)) space we can not only...
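To make the relation between δ and z concrete, the following is a minimal Python sketch that computes both on small strings. The abstract does not define δ; the sketch assumes the usual substring-complexity definition, δ = max over k of d_k / k, where d_k is the number of distinct length-k substrings. It also assumes the Lempel-Ziv variant in which each phrase is either a fresh symbol or the longest prefix of the remaining text that has an occurrence starting earlier (overlaps allowed). The function names `delta` and `lz_phrases` are illustrative, and both routines are deliberately naive; this is not the linear-time algorithm mentioned in the abstract.

```python
def delta(s: str) -> float:
    """Naive delta under the assumed definition: max over k of d_k / k,
    where d_k is the number of distinct length-k substrings of s."""
    n = len(s)
    best = 0.0
    for k in range(1, n + 1):
        distinct = {s[i:i + k] for i in range(n - k + 1)}
        best = max(best, len(distinct) / k)
    return best


def lz_phrases(s: str) -> int:
    """Naive count z of Lempel-Ziv phrases (assumed variant: longest
    previous occurrence, overlaps allowed, else a single fresh symbol)."""
    n = len(s)
    i = 0
    z = 0
    while i < n:
        length = 0
        # Extend the phrase while s[i : i+length+1] also starts at some j < i.
        while i + length < n and any(
            s[j:j + length + 1] == s[i:i + length + 1] for j in range(i)
        ):
            length += 1
        i += max(length, 1)  # a fresh symbol forms a phrase of length 1
        z += 1
    return z


if __name__ == "__main__":
    for text in ["aaaa", "abab", "abracadabra"]:
        print(text, delta(text), lz_phrases(text))
```

On these toy inputs δ never exceeds z, in line with the chain δ ≤ γ ≤ b ≤ z stated above, and appending a symbol never decreases δ, which illustrates the monotonicity claim on a small scale.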