Faster Approximate Pattern Matching in Compressed Repetitive Texts

Abstract

Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching on the underlying string(s). Bille et al. (2011) recently showed how, given a straight-line program with r rules for a string s of length n, we can build an O(r)-word data structure that allows us to extract any substring of length m in O(log n + m) time. They also showed how, given a pattern p of length m and an edit distance k ≤ m, their data structure supports finding all occ approximate matches to p in s in O(r (min(mk, k^4 + m) + log n) + occ) time. Rytter (2003) and Charikar et al. (2005) showed that r is always at least the number z of phrases in the LZ77 parse of s, and gave algorithms for building straight-line programs with O(z log n) rules. In this paper we give a simple O(z log n)-word data structure that takes the same time for substring extraction but only O(z min(mk, k^4 + m) + occ) time for approximate pattern matching.
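To make the notion of random access on a straight-line program concrete, the following is a minimal sketch and not the data structure of Bille et al. or of this paper. It assumes a grammar stored as a list in which each rule is either a single character or a pair of indices of earlier rules, and it reaches a character by walking down from the root using stored expansion lengths. Each access costs time proportional to the height of the grammar, so extracting m characters this way is slower than the bound quoted above. The names expansion_lengths, char_at and extract are hypothetical.

```python
def expansion_lengths(rules):
    """Length of the string each rule expands to.

    rules[i] is either a one-character string (terminal) or a pair
    (left, right) of indices of earlier rules (assumed representation).
    """
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)
        else:
            left, right = rule
            lengths.append(lengths[left] + lengths[right])
    return lengths


def char_at(rules, lengths, root, i):
    """Return character i (0-based) of the expansion of rule `root`."""
    if not 0 <= i < lengths[root]:
        raise IndexError(i)
    node = root
    while not isinstance(rules[node], str):
        left, right = rules[node]
        if i < lengths[left]:
            node = left            # position lies in the left child
        else:
            i -= lengths[left]     # skip the left child's expansion
            node = right
    return rules[node]


def extract(rules, lengths, root, start, m):
    """Naive substring extraction: m independent root-to-leaf descents."""
    return "".join(char_at(rules, lengths, root, start + j) for j in range(m))


# A small Fibonacci-style grammar: 6 rules describe an 8-character string.
rules = ["b", "a", (1, 0), (2, 1), (3, 2), (4, 3)]
lengths = expansion_lengths(rules)
root = len(rules) - 1
print(extract(rules, lengths, root, 0, lengths[root]))
print(extract(rules, lengths, root, 2, 4))
```

The grammar never materialises the full string: only the rules and their expansion lengths are stored, which is what lets a highly repetitive text be described with far fewer rules than characters.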
