Document Listing on Repetitive Collections with Guaranteed Performance

Abstract

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size $N$ over alphabet $[1,\sigma]$ is composed of $D$ copies of a string of size $n$, and $s$ edits are applied on ranges of copies. We introduce the first document listing index with size $\tilde{O}(n+s)$, precisely $O((n\log\sigma + s\log^2 N)\log D)$ bits, and with useful worst-case time guarantees: given a pattern of length $m$, the index reports the $\mathit{ndoc}>0$ strings where it appears in time $O(m\log^{1+\epsilon} N \cdot \mathit{ndoc})$, for any constant $\epsilon>0$ (and tells in time $O(m\log N)$ if $\mathit{ndoc}=0$). Our technique is to augment a range data structure that is commonly used on grammar-based indexes, so that instead of retrieving all the pattern occurrences, it computes useful summaries on them. We show that the idea has independent interest: we introduce the first grammar-based index that, on a text $T[1,N]$ with a grammar of size $r$, uses $O(r\log N)$ bits and counts the number of occurrences of a pattern $P[1,m]$ in time $O(m^2 + m\log^{2+\epsilon} r)$, for any constant $\epsilon>0$. We also give the first index using $O(z\log(N/z)\log N)$ bits, where $T$ is parsed by Lempel-Ziv into $z$ phrases, counting occurrences in time $O(m\log^{2+\epsilon} N)$.
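
To make the two query types concrete, the following is a minimal brute-force sketch of what document listing and occurrence counting return. It is not the index described in the abstract: it scans every string in full, so it has none of the stated space or time guarantees. The function names `list_documents` and `count_occurrences`, and the toy collection, are assumptions chosen for illustration only.

```python
def list_documents(docs, pattern):
    """Document listing by naive scan: indices of the strings containing pattern."""
    return [i for i, d in enumerate(docs) if pattern in d]


def count_occurrences(text, pattern):
    """Count occurrences of a non-empty pattern in text, overlaps included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping occurrences are counted.
        start = text.find(pattern, start + 1)
    return count


# Toy repetitive collection (illustrative): copies of one string, one edited.
docs = ["abracadabra", "abracadabra", "abrakadabra", "abracadabra"]

listed = list_documents(docs, "cad")          # strings 0, 1 and 3 contain "cad"
total = sum(count_occurrences(d, "abra") for d in docs)
```

In this toy case the listing query returns three document indices, while the counting query sees every occurrence across all copies. That gap between the number of occurrences and the number of distinct strings is why, on repetitive collections, summarizing occurrences can be preferable to retrieving each one.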
