Consistent Subset Sampling

Abstract

Consistent sampling is a technique for specifying, in small space, a subset S of a potentially large universe U such that the elements in S satisfy a suitably chosen sampling condition. Given a subset I⊂eq U it should be possible to quickly compute I S, i.e., the elements in I satisfying the sampling condition. Consistent sampling has important applications in similarity estimation, and estimation of the number of distinct items in a data stream. In this paper we generalize consistent sampling to the setting where we are interested in sampling size-k subsets occurring in some set in a collection of sets of bounded size b, where k is a small integer. This can be done by applying standard consistent sampling to the k-subsets of each set, but that approach requires time (bk). Using a carefully designed hash function, for a given sampling probability p ∈ (0,1], we show how to improve the time complexity to (b k/2 b + pbk) in expectation, while maintaining strong concentration bounds for the sample. The space usage of our method is (b k/4). We demonstrate the utility of our technique by applying it to several well-studied data mining problems. We show how to efficiently estimate the number of frequent k-itemsets in a stream of transactions and the number of bipartite cliques in a graph given as incidence stream. Further, building upon a recent work by Campagna et al., we show that our approach can be applied to frequent itemset mining in a parallel or distributed setting. We also present applications in graph stream mining.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…