Speeding-up $q$-gram mining on grammar-based compressed texts

Masayuki Takeda

doi:10.1007/978-3-642-31265-6_18

Abstract

We present an efficient algorithm for calculating $q$-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP $\mathcal{T}$ of size $n$ that represents a string $T$, the algorithm computes the occurrence frequencies of all $q$-grams in $T$ by reducing the problem to the weighted $q$-gram frequencies problem on a trie-like structure of size $m = |T| - \mathit{dup}(q, \mathcal{T})$, where $\mathit{dup}(q, \mathcal{T})$ is a quantity that represents the amount of redundancy that the SLP captures with respect to $q$-grams. The reduced problem can be solved in linear time. Since $m = O(qn)$, the running time of our algorithm is $O(\min\{|T| - \mathit{dup}(q, \mathcal{T}),\, qn\})$, improving our previous $O(qn)$ algorithm when $q = \Omega(|T|/n)$.
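To make the problem setting concrete, the following is a minimal Python sketch of counting $q$-grams directly on an SLP without decompressing it. It illustrates a simple baseline in the spirit of the $O(qn)$ bound mentioned in the abstract, not the improved trie-based algorithm, whose details the abstract does not give. The rule encoding (a list whose entries are either a single character or a pair of earlier rule indices, with the last rule as the start symbol) and the names `qgram_freqs_slp` and `expand` are assumptions made for this illustration. The idea is that every $q$-gram occurrence crosses the boundary of exactly one node of the derivation tree, so it suffices to inspect the at most $2(q-1)$ characters around each rule's boundary and weight them by how often the rule occurs in the derivation tree.

```python
from collections import Counter


def qgram_freqs_slp(rules, q):
    """Count q-grams of the string derived by an SLP (illustrative sketch).

    rules[i] is either a one-character string (terminal rule) or a pair
    (l, r) of indices of earlier rules; the last rule is the start symbol.
    This encoding is an assumption of the sketch.
    """
    n = len(rules)
    k = q - 1
    pre = [""] * n  # first min(k, length) characters of each expansion
    suf = [""] * n  # last min(k, length) characters of each expansion

    def head(s):
        return s[:k]

    def tail(s):
        return s[max(0, len(s) - k):]

    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i], suf[i] = head(rule), tail(rule)
        else:
            l, r = rule
            pre[i] = head(pre[l] + pre[r])
            suf[i] = tail(suf[l] + suf[r])

    # vocc[i]: number of nodes labelled by rule i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    counts = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # rule is not reachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                counts[rule] += vocc[i]
        else:
            l, r = rule
            # Every q-gram of this short string crosses the l/r boundary.
            window = suf[l] + pre[r]
            for j in range(len(window) - q + 1):
                counts[window[j:j + q]] += vocc[i]
    return counts


def expand(rules):
    """Decompress the SLP; used only to cross-check the sketch."""
    out = []
    for rule in rules:
        out.append(rule if isinstance(rule, str) else out[rule[0]] + out[rule[1]])
    return out[-1]


# Hypothetical example: an SLP with 6 rules deriving "abaababa".
rules = ["a", "b", (0, 1), (2, 0), (3, 2), (4, 3)]
print(expand(rules))              # abaababa
print(qgram_freqs_slp(rules, 2))  # Counter({'ab': 3, 'ba': 3, 'aa': 1})
```

Each rule contributes a window of at most $2(q-1)$ characters, so the sketch inspects $O(qn)$ characters in total. When $q$ is large relative to $|T|/n$, these windows overlap heavily, which is the kind of redundancy the quantity $\mathit{dup}(q, \mathcal{T})$ in the abstract accounts for.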
