Computing all-vs-all MEMs in grammar-compressed text

Abstract

We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among the strings of a repetitive collection T. The key concept in our work is the construction of a fully-balanced grammar G from T that meets a property we call fix-free: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., a set that is both prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of T incrementally over G using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how to build G from T in linear time and space. We also demonstrate that our MEM algorithm runs on top of G in O(G + occ) time and uses O((G + occ) log G) bits, where G inside the bounds denotes the grammar size and occ is the number of MEMs in T. In the conclusions, we discuss how our idea can be modified to implement approximate pattern matching in compressed space.
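
To make the fix-free definition concrete, the following minimal Python sketch checks whether a set of strings is both prefix-free and suffix-free. It is an illustration of the definition only, not the paper's construction: the function name `is_fix_free` is hypothetical, and the quadratic pairwise comparison is chosen for clarity rather than efficiency.

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if no string is a prefix or a suffix of a different string.

    Illustrative check of the fix-free definition (hypothetical helper,
    not taken from the paper). Duplicates are collapsed first, so only
    distinct strings are compared.
    """
    distinct = set(strings)
    # Compare every ordered pair (a, b) of distinct strings.
    for a, b in permutations(distinct, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


# Example expansions of nonterminals at one height (made-up data).
print(is_fix_free({"acg", "tac", "gt"}))  # no string starts or ends another
print(is_fix_free({"ab", "abc"}))         # "ab" is a prefix of "abc"
print(is_fix_free({"bc", "abc"}))         # "bc" is a suffix of "abc"
```

In the setting of the abstract, the input would be the expansions of all nonterminals that share a height in the parse tree; the sketch materializes those expansions as plain strings, which the paper's algorithm avoids.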


