Counting Distinct (Non-)Crossing Substrings in Optimal Time

Abstract

Let w be a string of length n. The problem of counting factors crossing a position (Problem 64 from the textbook "125 Problems in Text Algorithms" [Crochemore, Lecroq, and Rytter, 2021]) asks for the number C(w,k) (resp. N(w,k)) of distinct substrings of w that have an occurrence containing (resp. not containing) position k of w. The solutions given in the textbook compute C(w,k) and N(w,k) in O(n) time for a single position k, so applying them directly to all positions k = 1, …, n takes O(n^2) time. They are also designed for constant-size alphabets. In this paper, we present new algorithms that compute C(w,k) for all positions k = 1, …, n in O(n) total time for general ordered alphabets, and N(w,k) for all positions k = 1, …, n in O(n) total time for linearly sortable alphabets. We further derive model-dependent optimal bounds by separating each algorithm into preprocessing and linear-time postprocessing: for C the preprocessing is run reporting, and for N it is based on longest previous non-overlapping factors (LPnF) and longest next factors (LNF). In particular, all values C(w,k) can be computed in O(n log n) time over general unordered alphabets, in which direct access to alphabet characters is restricted to equality tests, and in O(n log σ) time in the word RAM model, where σ denotes the number of distinct characters occurring in w. For N(w,k), the equality-testing complexity over general unordered alphabets is Θ(n^2). We also show that our upper bounds are optimal for all of the aforementioned alphabet assumptions and computation models.
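To make the two quantities concrete, the following is a minimal brute-force sketch of the definitions, not the algorithm of the paper. It assumes that positions are 1-indexed, that only non-empty substrings are counted, and that a substring contributes to C(w,k) if at least one of its occurrences contains position k and to N(w,k) if at least one of its occurrences avoids position k (so a substring may be counted in both). The function name crossing_counts is hypothetical.

```python
def crossing_counts(w: str, k: int):
    """Return (C(w,k), N(w,k)) by enumerating every occurrence of every substring.

    Assumptions (not taken from the paper): k is 1-indexed and
    only non-empty substrings are counted.
    """
    n = len(w)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(w)")

    crossing = set()      # substrings with an occurrence containing position k
    non_crossing = set()  # substrings with an occurrence avoiding position k

    for i in range(n):                 # 0-indexed start of the occurrence
        for j in range(i + 1, n + 1):  # exclusive end of the occurrence
            s = w[i:j]
            # The occurrence covers 0-indexed positions i..j-1;
            # position k corresponds to index k - 1.
            if i <= k - 1 < j:
                crossing.add(s)
            else:
                non_crossing.add(s)

    return len(crossing), len(non_crossing)


if __name__ == "__main__":
    w = "aab"
    for k in range(1, len(w) + 1):
        print(k, crossing_counts(w, k))
```

For w = aab and k = 2, the substrings with an occurrence containing position 2 are a, aa, ab, and aab, and those with an occurrence avoiding it are a and b. This enumeration takes cubic time per position in the worst case, so it is useful only as a reference for checking faster implementations on small inputs.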
