Time-Optimal Construction of String Synchronizing Sets
Abstract
A key principle in string processing is local consistency: using short contexts to handle matching fragments of a string consistently. String synchronizing sets [Kempa, Kociumaka; STOC 2019] are an influential instantiation of this principle. A τ-synchronizing set of a length-n string is a set of O(n/τ) positions, chosen via their length-2τ contexts, such that (outside highly periodic regions) at least one position in every length-τ window is selected. Among their applications are faster algorithms for data compression, text indexing, and string similarity in the word RAM model. We show how to preprocess any string T ∈ [0..σ)^n in O(n log σ / log n) time so that, for any τ ∈ [1..n], a τ-synchronizing set of T can be constructed in O((n log τ)/(τ log n)) time. Both bounds are optimal in the word RAM model with word size w = Θ(log n). Previously, the construction time was O(n/τ), either after an O(n)-time preprocessing [Kociumaka, Radoszewski, Rytter, Waleń; SICOMP 2024], or without preprocessing if τ < 0.2 log_σ n [Kempa, Kociumaka; STOC 2019]. A simple version of our method outputs the set as a sorted list in O(n/τ) time, or as a bitmask in O(n / log n) time. Our optimal construction produces a compact fully indexable dictionary, supporting select queries in O(1) time and rank queries in O(log(log τ / log log n)) time, matching unconditional cell-probe lower bounds for τ ≤ n^(1−Ω(1)). We achieve this via a new framework for processing sparse integer sequences in a custom variable-length encoding. For rank and select queries, we augment the optimal variant of van Emde Boas trees [Patrascu, Thorup; STOC 2006] with a deterministic linear-time construction. The above query-time guarantees hold after preprocessing time proportional to the encoding size (in words).
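For intuition only, the two simple output formats named above (sorted list and bitmask) and the two queries (rank and select) can be sketched naively in Python. This is not the paper's data structure: the function names are hypothetical, a Python integer stands in for the bitmask, and binary search on a plain list stands in for the compact dictionary, so none of the stated time bounds apply to this sketch.

```python
from bisect import bisect_left


def to_bitmask(positions):
    """Pack selected positions into an integer: bit i is set iff i is selected."""
    mask = 0
    for p in positions:
        mask |= 1 << p
    return mask


def from_bitmask(mask):
    """Recover the sorted list of selected positions from the bitmask."""
    out = []
    i = 0
    while mask:
        if mask & 1:
            out.append(i)
        mask >>= 1
        i += 1
    return out


def rank(sorted_positions, i):
    """Number of selected positions strictly smaller than i (naive: binary search)."""
    return bisect_left(sorted_positions, i)


def select(sorted_positions, k):
    """The k-th smallest selected position, 0-indexed (naive: list lookup)."""
    return sorted_positions[k]
```

Here the sorted list and the bitmask carry the same information; the paper's contribution concerns building and querying such a set within the optimal bounds, which this sketch does not attempt.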