Computing Minimal Absent Words and Extended Bispecial Factors with CDAWG Space

Abstract

A string w is said to be a minimal absent word (MAW) for a string S if w does not occur in S and every proper substring of w occurs in S. We focus on non-trivial MAWs, which are of length at least 2. Finding such non-trivial MAWs for a given string is motivated by applications in bioinformatics and data compression. Fujishige et al. [TCS 2023] proposed a data structure of size Θ(n) that can output the set MAW(S) of all MAWs for a given string S of length n in O(n + |MAW(S)|) time, based on the directed acyclic word graph (DAWG). In this paper, we present a more space-efficient data structure based on the compact DAWG (CDAWG), which can output MAW(S) in O(|MAW(S)|) time with O(e) space, where e denotes the minimum of the sizes of the CDAWGs for S and for its reversal S^R. For any string of length n, it holds that e < 2n, and for highly repetitive strings e can be sublinear (up to logarithmic) in n. We also show that MAWs and their generalization, minimal rare words, have close relationships with extended bispecial factors, via the CDAWG.
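
To make the definition concrete, the following is a minimal brute-force sketch in Python. It is not the CDAWG-based method of the paper and is far slower and larger than O(e) space; it only enumerates non-trivial MAWs directly from the definition. The function name `minimal_absent_words` is hypothetical and chosen for illustration. It relies on the observation that a word w = a·u·b of length at least 2 is a MAW exactly when a·u and u·b both occur in S while a·u·b does not, since every proper substring of w lies inside a·u or u·b.

```python
def minimal_absent_words(s: str) -> set:
    """Brute-force enumeration of non-trivial MAWs of s (illustrative only)."""
    n = len(s)
    # All non-empty substrings of s; quadratic in count, fine for short inputs.
    substrings = {s[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    # A MAW of length >= 2 can only use characters that occur in s.
    alphabet = set(s)
    maws = set()
    for left in substrings:          # left plays the role of a + u
        u = left[1:]                 # u may be empty when len(left) == 1
        for b in alphabet:
            right = u + b            # right plays the role of u + b
            if right in substrings and (left + b) not in substrings:
                maws.add(left + b)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab")))
```

For S = abaab, for instance, the word bb is absent while b occurs, so bb is a MAW; abb is absent too, but it is not minimal because its proper substring bb is already absent.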
