A Faster Grammar-Based Self-Index
Abstract
To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on grammars. In this paper we show how, given a straight-line program with $r$ rules for a string $S[1..n]$ whose LZ77 parse consists of $z$ phrases, we can store a self-index for $S$ in $O(r + z \log \log n)$ space such that, given a pattern $P[1..m]$, we can list the $occ$ occurrences of $P$ in $S$ in $O(m^2 + occ \log \log n)$ time. If the straight-line program is balanced and we accept a small probability of building a faulty index, then we can reduce the $O(m^2)$ term to $O(m \log m)$. All previous self-indexes are larger or slower in the worst case.
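For readers unfamiliar with the term, a straight-line program is a context-free grammar that generates exactly one string. The following Python sketch illustrates the idea on a toy grammar. It is not the self-index described in the abstract: the rule set `RULES` and the helpers `length`, `char_at`, `expand`, and `naive_occurrences` are hypothetical names chosen for illustration, and the occurrence listing simply decompresses the string, which is the baseline a self-index is designed to avoid.

```python
from functools import lru_cache

# Hypothetical toy straight-line program (an assumption for illustration).
# Each nonterminal maps to a single character (terminal rule) or to a pair
# of nonterminals (binary rule).
RULES = {
    "A": "a",
    "B": "b",
    "C": ("A", "B"),  # expands to "ab"
    "D": ("C", "A"),  # expands to "aba"
    "E": ("D", "C"),  # expands to "abaab"
    "F": ("E", "D"),  # expands to "abaababa"
}


@lru_cache(maxsize=None)
def length(symbol):
    """Length of the expansion of `symbol`, computed once per rule."""
    rule = RULES[symbol]
    if isinstance(rule, str):
        return 1
    left, right = rule
    return length(left) + length(right)


def char_at(symbol, i):
    """Return character i (0-based) of the expansion without decompressing it."""
    rule = RULES[symbol]
    while not isinstance(rule, str):
        left, right = rule
        if i < length(left):
            symbol = left            # position falls in the left child
        else:
            i -= length(left)        # skip over the left child
            symbol = right
        rule = RULES[symbol]
    return rule


def expand(symbol):
    """Fully decompress the expansion of `symbol`."""
    rule = RULES[symbol]
    if isinstance(rule, str):
        return rule
    return expand(rule[0]) + expand(rule[1])


def naive_occurrences(symbol, pattern):
    """Baseline only: decompress, then scan for every start position of `pattern`."""
    text = expand(symbol)
    return [i for i in range(len(text) - len(pattern) + 1)
            if text.startswith(pattern, i)]
```

Here the grammar has six rules and the string has length eight; on repetitive inputs such as genomic collections, the number of rules can be far smaller than the string length. `char_at` descends a single root-to-leaf path, so its cost depends on the grammar's height, which gives some intuition for why a balanced straight-line program is a useful assumption.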