A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Abstract

Compressed Counting (CC) was recently proposed for approximating the αth frequency moments of data streams, for 0 < α ≤ 2. Under the relaxed strict-Turnstile model, CC dramatically improves on the standard algorithm based on symmetric stable random projections, especially as α → 1. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial "feature" for data mining. The Rényi entropy and the Tsallis entropy are functions of the αth frequency moments, and both approach the Shannon entropy as α → 1. A recent theoretical work suggested using the αth frequency moment to approximate the Shannon entropy with α = 1+δ and very small |δ| (e.g., < 10^{-4}). In this study, we experiment with using CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions: (1) As α → 1, CC dramatically improves on symmetric stable random projections in estimating frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy. The improvements appear to approach "infinity." (2) Using symmetric stable random projections and α = 1+δ with very small |δ| does not provide a practical algorithm because the required sample size is enormous.
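To make the relationship between the entropies and the αth frequency moments concrete, the following is a minimal Python sketch. It is not the CC estimator: it computes the moments exactly from a small, made-up vector of counts, and it assumes the standard textbook definitions of the Rényi and Tsallis entropies, which the abstract does not spell out. All function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def frequency_moment(counts, alpha):
    """Exact alpha-th frequency moment F_alpha = sum_i A_i^alpha (not a CC estimate)."""
    return sum(c ** alpha for c in counts if c > 0)


def shannon_entropy(counts):
    """Shannon entropy (in nats) of the distribution p_i = A_i / F_1."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def renyi_entropy(counts, alpha):
    """Renyi entropy, assumed form: log(F_alpha / F_1^alpha) / (1 - alpha), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return math.log(fa / f1 ** alpha) / (1.0 - alpha)


def tsallis_entropy(counts, alpha):
    """Tsallis entropy, assumed form: (1 - F_alpha / F_1^alpha) / (alpha - 1), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return (1.0 - fa / f1 ** alpha) / (alpha - 1.0)


# Hypothetical item counts, used only to illustrate the limit alpha -> 1.
counts = [50, 30, 15, 5]
h = shannon_entropy(counts)

for delta in (0.1, 0.01, 0.001):
    alpha = 1.0 + delta
    # The gap to the Shannon entropy (the bias) shrinks as delta shrinks.
    print(delta,
          abs(renyi_entropy(counts, alpha) - h),
          abs(tsallis_entropy(counts, alpha) - h))
```

With exact moments, shrinking |δ| only reduces the bias. The trade-off discussed in the abstract arises once the moments are estimated from a sketch, because the divisions by 1 - α and α - 1 amplify any estimation error in the moment as α approaches 1.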
