Sensitivity of string compressors and repetitiveness measures
Abstract
The sensitivity of a string compression algorithm C asks how much the output size C(T) for an input string T can increase when a single character edit operation is performed on T. This notion enables one to measure the robustness of compression algorithms in terms of errors and/or dynamic changes occurring in the input string. In this paper, we analyze the worst-case multiplicative sensitivity of string compression algorithms, which is defined by $\max_{T \in \Sigma^n} \{C(T')/C(T) : \mathrm{ed}(T, T') = 1\}$, where $\mathrm{ed}(T, T')$ denotes the edit distance between T and T'. For the most common versions of the Lempel-Ziv 77 compressors, we prove that the worst-case multiplicative sensitivity is upper bounded by a small constant, and give matching lower bounds. We generalize these results to the smallest bidirectional scheme b. In addition, we show that the sensitivity of a grammar-based compressor called GCIS is also a small constant. Further, we extend the notion of the worst-case sensitivity to string repetitiveness measures such as the smallest string attractor size γ and the substring complexity δ, and show that the worst-case sensitivity of δ is also a small constant. These results contrast with previously known related results: the size $z_{78}$ of the Lempel-Ziv 78 factorization can increase by a factor of $\Omega(n^{1/4})$ [Lagarde and Perifel, 2018], and the number r of runs in the Burrows-Wheeler transform can increase by a factor of $\Omega(\log n)$ [Giuliani et al., 2021], when a character is prepended to an input string of length n. By applying our sensitivity bounds of δ or the smallest grammar to known results (cf. [Navarro, 2021]), some non-trivial upper bounds for the sensitivities of important string compressors and repetitiveness measures including γ, r, LZ-End, RePair, LongestMatch, and AVL-grammar are derived.
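To make the definition of worst-case multiplicative sensitivity concrete, the following is a minimal brute-force sketch in Python. It is not the paper's method (the paper proves bounds analytically); it only evaluates the maximum of C(T')/C(T) over all strings T of a small length n and all T' at edit distance 1. The function names `lz77_size`, `single_edits`, and `worst_case_sensitivity` are hypothetical, and the sketch assumes a binary alphabet, a naive greedy self-referencing LZ77 factorization, and that edits only use characters from the same alphabet.

```python
from itertools import product


def lz77_size(t: str) -> int:
    """Number of phrases in a naive greedy LZ77 factorization.

    Assumption: self-referencing variant, where each phrase is the longest
    prefix of the remaining text that also starts at an earlier position
    (overlap allowed), or a single fresh character if no such prefix exists.
    """
    i, phrases, n = 0, 0, len(t)
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and t[j + length] == t[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a fresh character forms a phrase of length 1
        phrases += 1
    return phrases


def single_edits(t: str, alphabet: str) -> set:
    """All strings at edit distance exactly 1 from t (over the given alphabet)."""
    out = set()
    for i in range(len(t)):
        out.add(t[:i] + t[i + 1:])  # deletion
        for c in alphabet:
            out.add(t[:i] + c + t[i + 1:])  # substitution
    for i in range(len(t) + 1):
        for c in alphabet:
            out.add(t[:i] + c + t[i:])  # insertion
    out.discard(t)  # substituting a character by itself is not an edit
    return out


def worst_case_sensitivity(compress, n: int, alphabet: str = "ab") -> float:
    """Brute-force max of compress(T') / compress(T) over all T of length n >= 1."""
    worst = 0.0
    for chars in product(alphabet, repeat=n):
        t = "".join(chars)
        base = compress(t)
        for t2 in single_edits(t, alphabet):
            worst = max(worst, compress(t2) / base)
    return worst
```

For example, `worst_case_sensitivity(lz77_size, 4)` enumerates all 16 binary strings of length 4 together with their single-edit neighbors. The exhaustive search is exponential in n, so it is only useful for sanity-checking small cases against the bounds stated in the abstract.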