Finding Subcube Heavy Hitters in Analytics Data Streams
Abstract
Data streams typically contain items with a large number of dimensions. We study the fundamental heavy-hitters problem in this setting. Formally, the data stream consists of $d$-dimensional items $x_1, \ldots, x_m \in [n]^d$. A $k$-dimensional subcube $T$ is a subset of distinct coordinates $\{T_1, \ldots, T_k\} \subseteq [d]$. A subcube heavy hitter query $\mathrm{Query}(T, v)$, with $v \in [n]^k$, outputs YES if $f_T(v) \geq \gamma$ and NO if $f_T(v) < \gamma/4$, where $f_T(v)$ is the fraction of stream items whose coordinates $T$ have joint values $v$. The all subcube heavy hitters query $\mathrm{AllQuery}(T)$ outputs all joint values $v$ that return YES to $\mathrm{Query}(T, v)$. The one-dimensional version of this problem, where $d = 1$, has been heavily studied in data stream theory, databases, networking, and signal processing. The subcube heavy hitters problem is applicable in all these cases. We present a simple reservoir-sampling-based one-pass streaming algorithm that solves the subcube heavy hitters problem in $O(kd/\gamma)$ space. This is optimal up to poly-logarithmic factors given the established lower bound. In the worst case, this is $\Theta(d^2/\gamma)$, which is prohibitive for large $d$, and our goal is to circumvent this quadratic bottleneck. Our main contribution is a model-based approach to the subcube heavy hitters problem. In particular, we assume that the dimensions are related to each other via the Naive Bayes model, with or without a latent dimension. Under this assumption, we present a new two-pass, $O(d/\gamma)$-space algorithm for our problem, and a fast algorithm for answering $\mathrm{AllQuery}(T)$ in $O(k/\gamma^2)$ time. Our work develops the direction of model-based data stream analysis, with much that remains to be explored.
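To make the query definitions concrete, here is a minimal Python sketch of answering Query and AllQuery from a uniform reservoir sample of the stream. It is an illustration, not the paper's algorithm as specified: the class name `SubcubeReservoir`, the caller-chosen `sample_size`, the 0-indexed coordinates, and the decision threshold of `gamma / 2` (picked to sit between the YES bound and the NO bound of a quarter of it) are all assumptions introduced here.

```python
import random
from collections import Counter


class SubcubeReservoir:
    """Uniform reservoir sample of a stream of d-dimensional items (hypothetical helper)."""

    def __init__(self, sample_size, seed=None):
        self.sample_size = sample_size  # assumed to be chosen by the caller
        self.sample = []
        self.seen = 0
        self.rng = random.Random(seed)

    def update(self, item):
        """Process one stream item with standard reservoir sampling (Algorithm R)."""
        self.seen += 1
        if len(self.sample) < self.sample_size:
            self.sample.append(tuple(item))
        else:
            # Keep the new item with probability sample_size / seen.
            j = self.rng.randrange(self.seen)
            if j < self.sample_size:
                self.sample[j] = tuple(item)

    def estimate(self, T, v):
        """Estimate the fraction of items whose coordinates T equal v."""
        if not self.sample:
            return 0.0
        v = tuple(v)
        hits = sum(1 for x in self.sample if tuple(x[t] for t in T) == v)
        return hits / len(self.sample)

    def query(self, T, v, gamma):
        """Answer YES (True) when the sampled frequency reaches gamma / 2 (assumed cutoff)."""
        return self.estimate(T, v) >= gamma / 2

    def all_query(self, T, gamma):
        """Return every joint value on T that query() would accept."""
        counts = Counter(tuple(x[t] for t in T) for x in self.sample)
        threshold = gamma / 2 * len(self.sample)
        return sorted(v for v, c in counts.items() if c >= threshold)
```

A caller would feed each stream item to `update` and then call `query((0, 2), (1, 3), gamma)` or `all_query((1,), gamma)` for any subcube of interest. Because the sample stores whole items, one sample serves every choice of `T`, which is why a single pass suffices in this sketch.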