Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon

Abstract

Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error. We define a new measure of efficiency for data sketches called the Fisher-Shannon (FiSh) number H/I. It captures the tension between the limiting Shannon entropy (H) of the sketch and its normalized Fisher information (I) that characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our aim in introducing the FiSh-number is to build the mathematical machinery necessary to argue for precise optimality, rather than asymptotic optimality, up to large constant factors. Our results are as follows. [1] We prove that all base-q variants of Flajolet and Martin's PCSA sketch have FiSh-number H0/I0 ≈ 1.98016 and that every base-q variant of HyperLogLog has FiSh-number worse than H0/I0, but that they tend to H0/I0 in the limit as q → ∞. Here H0 and I0 are precisely defined constants. [2] We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. Fishmonger processes a multiset of [U] such that at all times, w.h.p., its space is (1+o(1))(H0/I0)m ≈ 1.98m bits and its standard error is 1/√m. For example, to achieve a 1% standard error, one needs m = 10,000 and hence a little more than 19,800 bits, or ≈ 2.42 kilobytes. [3] Finally, we give circumstantial evidence that H0/I0 is the optimum FiSh-number of mergeable sketches for Cardinality Estimation. We define a natural subset of mergeable sketches called linearizable sketches and prove that no member of this class can beat H0/I0. The popular mergeable sketches are, in fact, also linearizable.
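
The space figure quoted for Fishmonger can be checked with a few lines of arithmetic. The sketch below is only an illustration, not code from the paper: it assumes the standard error is 1/√m (the reading under which a 1% error corresponds to about 19,800 bits), it drops the (1+o(1)) factor, and the names `FISH_NUMBER` and `sketch_bits` are hypothetical.

```python
# Limiting FiSh-number H0/I0 quoted in the abstract (bits per unit of m).
FISH_NUMBER = 1.98016


def sketch_bits(std_err: float) -> float:
    """Leading-order space in bits for a target standard error.

    Assumptions (not taken from the paper's code): standard error
    equals 1/sqrt(m), space equals (H0/I0) * m, and the (1 + o(1))
    factor is ignored.
    """
    if not 0.0 < std_err < 1.0:
        raise ValueError("std_err must be in (0, 1)")
    m = (1.0 / std_err) ** 2  # solve std_err = 1/sqrt(m) for m
    return FISH_NUMBER * m


bits = sketch_bits(0.01)   # 1% standard error
kib = bits / 8 / 1024      # bits -> bytes -> KiB
print(f"{bits:.1f} bits, about {kib:.2f} KiB")
```

For a 1% target this gives m = 10,000 and a little over 19,800 bits, matching the figure in the abstract. Because space grows as 1/ε², halving the target error quadruples the space.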
