Document Counting in Practice

Abstract

We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which are the best options for each situation and help discard practically unappealing solutions, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5--400, depending on the dataset.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…