Beating CountSketch for Heavy Hitters in Insertion Streams
Abstract
Given a stream $p_1, \ldots, p_m$ of items from a universe $\mathcal{U}$, which, without loss of generality we identify with the set of integers $\{1, 2, \ldots, n\}$, we consider the problem of returning all $\ell_2$-heavy hitters, i.e., those items $j$ for which $f_j \geq \epsilon \sqrt{F_2}$, where $f_j$ is the number of occurrences of item $j$ in the stream, and $F_2 = \sum_{i \in [n]} f_i^2$. Such a guarantee is considerably stronger than the $\ell_1$-guarantee, which finds those $j$ for which $f_j \geq \epsilon m$. In 2002, Charikar, Chen, and Farach-Colton suggested the CountSketch data structure, which finds all such $j$ using $\Theta(\log^2 n)$ bits of space (for constant $\epsilon > 0$). The only known lower bound is $\Omega(\log n)$ bits of space, which comes from the need to specify the identities of the items found. In this paper we show it is possible to achieve $O(\log n \log\log n)$ bits of space for this problem. Our techniques, based on Gaussian processes, lead to a number of other new results for data streams, including (1) The first algorithm for estimating $F_2$ simultaneously at all points in a stream using only $O(\log n \log\log n)$ bits of space, improving a natural union bound and the algorithm of Huang, Tai, and Yi (2014). (2) A way to estimate the $\ell_\infty$ norm of a stream up to additive error $\epsilon \sqrt{F_2}$ with $O(\log n \log\log n)$ bits of space, resolving Open Question 3 from the IITK 2006 list for insertion only streams.
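To make the two heavy-hitter definitions concrete, the following is a minimal sketch that computes them exactly on a toy stream. It is an offline reference computation that stores every frequency, not the paper's small-space algorithm and not CountSketch. The function name `heavy_hitters`, the toy stream, and the choice of threshold 0.5 are assumptions made purely for illustration. An item counts as an l2-heavy hitter when its frequency is at least the threshold times the square root of F2, and as an l1-heavy hitter when its frequency is at least the threshold times the stream length m.

```python
from collections import Counter
from math import sqrt


def heavy_hitters(stream, eps):
    """Exact, non-streaming reference (hypothetical helper, not from the paper).

    Returns (l1_hh, l2_hh): the sets of items j with
    f_j >= eps * m and f_j >= eps * sqrt(F2), respectively.
    """
    freq = Counter(stream)                  # f_j for every item j
    m = sum(freq.values())                  # stream length
    f2 = sum(c * c for c in freq.values())  # second moment F2
    l1_hh = {j for j, c in freq.items() if c >= eps * m}
    l2_hh = {j for j, c in freq.items() if c >= eps * sqrt(f2)}
    return l1_hh, l2_hh


# Toy stream (assumed for illustration): item 0 occurs 100 times,
# and each of the items 1..10000 occurs exactly once.
stream = [0] * 100 + list(range(1, 10001))
l1_hh, l2_hh = heavy_hitters(stream, eps=0.5)

print("l1-heavy hitters:", l1_hh)
print("l2-heavy hitters:", l2_hh)
```

In this toy stream m = 10100 and F2 = 20000, so item 0 with frequency 100 clears the l2 threshold (about 70.7) but falls far short of the l1 threshold (5050). Since the square root of F2 never exceeds m, every l1-heavy hitter is also an l2-heavy hitter, which is the sense in which the l2 guarantee is the stronger one.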