Improving Count-Mean Sketch as the Leading Locally Differentially Private Frequency Estimator for Large Dictionaries

Abstract

This paper shows that a group of recent locally differentially private (LDP) algorithms for frequency estimation, including all the Hadamard-matrix-based algorithms, are equivalent to the private Count-Mean Sketch (CMS) algorithm with different parameters. We therefore revisit the private CMS, correct errors in the original CMS paper regarding expectation and variance, modify the CMS implementation to eliminate existing bias, and optimize CMS using randomized response (RR) as the perturbation method. The optimized CMS with RR is shown to outperform CMS variants with other known perturbations in reducing the worst-case mean squared error (MSE), l1 loss, and l2 loss. Additionally, we prove that pairwise-independent hashing is sufficient for CMS, reducing its communication cost to the logarithm of the cardinality of the set of all possible values (i.e., the dictionary). As a result, the optimized CMS with RR is shown, both theoretically and empirically, to be the leading algorithm for reducing the aforementioned loss functions when dealing with a very large dictionary. Furthermore, we demonstrate that randomness is necessary to ensure the correctness of CMS, and that the communication cost of CMS, though low, is unavoidable regardless of whether the randomness is public or private.
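To make the pipeline described above concrete, the following is a minimal sketch of a count-mean-sketch-style estimator with randomized response over hash buckets. It is not the paper's implementation: the function names, the hash family h(x) = ((a*x + b) mod PRIME) mod m (a standard construction that is only approximately pairwise independent after the final reduction mod m), the choice of PRIME, and the parameters m and epsilon are all illustrative assumptions, and the paper's optimized parameter choices are not reproduced here. Each user sends only the two hash coefficients and one bucket index, which is where a communication cost logarithmic in the dictionary size would come from under these assumptions.

```python
import math
import random

# Assumption: dictionary values are non-negative integers smaller than PRIME.
PRIME = (1 << 61) - 1  # a Mersenne prime


def sample_hash(rng):
    """Draw (a, b) defining h(x) = ((a*x + b) mod PRIME) mod m (hypothetical helper)."""
    return rng.randrange(PRIME), rng.randrange(PRIME)


def apply_hash(h, x, m):
    a, b = h
    return ((a * x + b) % PRIME) % m


def keep_probability(m, epsilon):
    """Probability that m-ary randomized response reports the true bucket."""
    return math.exp(epsilon) / (math.exp(epsilon) + m - 1)


def client_report(x, m, epsilon, rng):
    """One user's report: hash coefficients plus a randomized-response bucket."""
    h = sample_hash(rng)  # chosen independently of x
    bucket = apply_hash(h, x, m)
    if rng.random() >= keep_probability(m, epsilon):
        # Replace the true bucket with one of the other m - 1 buckets, uniformly.
        other = rng.randrange(m - 1)
        bucket = other if other < bucket else other + 1
    return h, bucket


def estimate_count(reports, v, m, epsilon):
    """Debiased estimate of how many users hold value v.

    A user holding v matches with probability p; any other user matches with
    probability about 1/m, so E[matches] is about n_v * p + (n - n_v) / m.
    """
    n = len(reports)
    p = keep_probability(m, epsilon)
    matches = sum(1 for h, y in reports if apply_hash(h, v, m) == y)
    return (matches - n / m) / (p - 1 / m)
```

As a usage example, one could generate reports with `client_report(x, 16, 3.0, random.Random(0))` for each user value `x` and then call `estimate_count(reports, v, 16, 3.0)` for any candidate `v`. The estimate is noisy for a single value, and its accuracy depends on the number of users, m, and epsilon.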
