Secondary Indexing in One Dimension: Beyond B-trees and Bitmap Indexes
Abstract
Let S be a finite, ordered alphabet, and let x = x_1 x_2 ... x_n be a string over S. A "secondary index" for x answers alphabet range queries of the form: given a range [a_l; a_r] over S, return the set I[a_l; a_r] = {i | x_i ∈ [a_l; a_r]}. Secondary indexes are heavily used in relational databases and scientific data analysis. It is well known that the obvious solution, storing a dictionary for the position set associated with each character, does not always give optimal query time. In this paper we give the first theoretically optimal data structure for the secondary indexing problem. In the I/O model, the amount of data read when answering a query is within a constant factor of the minimum space needed to represent I[a_l; a_r], assuming that the size of internal memory is (|S| log n)^delta blocks, for some constant delta > 0. The space usage of the data structure is O(n log |S|) bits in the worst case, and we further show how to bound the size of the data structure in terms of the 0th-order entropy of x. We show how to support updates, achieving various time-space trade-offs. We also consider an approximate version of the basic secondary indexing problem, where a query reports a superset of I[a_l; a_r] containing each element not in I[a_l; a_r] with probability at most epsilon, where epsilon > 0 is the false positive probability. For this problem the amount of data that needs to be read by the query algorithm is reduced to O(|I[a_l; a_r]| log(1/epsilon)) bits.
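To make the query definition concrete, here is a minimal Python sketch of the "obvious solution" the abstract mentions: one position list per character, with a range query that unions the lists of all characters in the range. It is not the paper's data structure. The names `build_index` and `range_query` are hypothetical, positions are 1-based to match x_1 ... x_n, and characters are compared by Python's default ordering as a stand-in for the order on S.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(x):
    """Build the naive secondary index for string x.

    Returns (chars, positions), where chars is the sorted list of distinct
    characters in x and positions[c] lists the 1-based positions i with x_i == c.
    """
    positions = defaultdict(list)
    for i, ch in enumerate(x, start=1):
        positions[ch].append(i)
    return sorted(positions), dict(positions)


def range_query(index, a_l, a_r):
    """Return the set of positions i such that a_l <= x_i <= a_r."""
    chars, positions = index
    # Locate the contiguous run of stored characters that fall in [a_l; a_r].
    lo = bisect_left(chars, a_l)
    hi = bisect_right(chars, a_r)
    result = set()
    for ch in chars[lo:hi]:
        result.update(positions[ch])
    return result


index = build_index("banana")
print(sorted(range_query(index, "a", "b")))  # positions of 'a' and 'b'
```

A query in this sketch touches one list per character in the range, so a wide range over many rare characters reads many small lists. That is one way to see why, as the abstract notes, this approach does not always give optimal query time.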