The anti-lexicographic SUS-anchor: a near-optimal k=1 sampling scheme
Abstract
In recent years, there has been renewed interest in the search for low-density minimizer schemes. These schemes take a window of w consecutive k-mers and sample one of them: the smallest under some specific order. Schemes such as the mod-minimizer provide a low density (the fraction of sampled k-mers) when k ≫ w, while schemes such as the greedy minimizer work well for explicit small parameters, roughly in the regime k ≤ 2w, for k and w up to 15 or so. When k < log_σ w is very small, minimizer schemes cannot do well, and more general sampling schemes are needed that can be richer than just comparing k-mers. Bidirectional-string anchors (bd-anchors) form one such scheme. Inspired by bd-anchors, we introduce the smallest unique substring anchor, or SUS-anchor. Given a window, it considers all suffixes that do not occur as a substring elsewhere in the window. It then samples the start position of the smallest such suffix according to the new anti-lexicographic order, which minimizes the first character and maximizes the remaining characters. We give a linear-time, O(w)-space streaming algorithm to compute all SUS-anchors of a string. For alphabet size σ=4 and k=1, the anti-lexicographic SUS-anchor empirically has a density less than 1% above the density lower bound, significantly improving over bd-anchors, which are often more than 15% above it. For alphabet size σ=2, the density is at most 10% above the lower bound, which again improves over the more than 50% overhead of bd-anchors.
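To make the definition concrete, the following is a minimal, naive sketch of the sampling rule for a single window, written only from the description in the abstract. It is a quadratic-time illustration and not the linear-time streaming algorithm of the paper. The function names (`anti_lex_key`, `sus_anchor`, `sampled_positions`), the default alphabet "ACGT", and the choice to treat each window as w consecutive characters (the k=1 case) are assumptions made for this example.

```python
def anti_lex_key(s, alphabet="ACGT"):
    """Sort key for the anti-lexicographic order (hypothetical helper).

    The first character is ranked ascending (smaller is better) and every
    later character is ranked descending (larger is better).
    """
    rank = {c: i for i, c in enumerate(alphabet)}
    top = len(alphabet) - 1
    return [rank[s[0]]] + [top - rank[c] for c in s[1:]]


def sus_anchor(window, alphabet="ACGT"):
    """Return the start position of the sampled suffix inside `window`.

    A suffix window[i:] can only occur at positions <= i, so it is unique
    exactly when its first occurrence (str.find) is at position i.
    The whole window (i = 0) is always unique, so a candidate always exists.
    No unique suffix is a proper prefix of another unique suffix, so the
    comparison below never depends on how prefix ties are broken.
    """
    candidates = [i for i in range(len(window))
                  if window.find(window[i:]) == i]
    return min(candidates, key=lambda i: anti_lex_key(window[i:], alphabet))


def sampled_positions(text, w, alphabet="ACGT"):
    """Naively anchor every length-w window; return distinct positions."""
    picked = {start + sus_anchor(text[start:start + w], alphabet)
              for start in range(len(text) - w + 1)}
    return sorted(picked)
```

For instance, in the window "ACAA" the unique suffixes start at positions 0, 1, and 2 (the final "A" also occurs at position 0, so it is excluded). Among "ACAA", "CAA", and "AA", the order prefers a small first character followed by large characters, so `sus_anchor("ACAA")` selects position 0.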