Subset Sampling and Its Extensions
Abstract
This paper studies the subset sampling problem. The input is a set $S$ of $n$ records together with a function $p$ that assigns each record $v \in S$ a probability $p(v)$. A query returns a random subset $X$ of $S$, where each record $v \in S$ is sampled into $X$ independently with probability $p(v)$. The goal is to store $S$ in a data structure to answer queries efficiently. If $S$ fits in memory, the problem is interesting when $S$ is dynamic. We develop a dynamic data structure with $O(1 + \mu_S)$ expected query time, $O(n)$ space and $O(1)$ amortized expected update, insert and delete time, where $\mu_S = \sum_{v \in S} p(v)$. The query time and space are optimal. If $S$ does not fit in memory, the problem is difficult even if $S$ is static. Under this scenario, we present an I/O-efficient algorithm that answers a query in $O\big((\log^*_B n)/B + (\mu_S/B) \log_{M/B}(n/B)\big)$ amortized expected I/Os using $O(n/B)$ space, where $M$ is the memory size, $B$ is the block size and $\log^*_B n$ is the number of iterative $\log_2(\cdot)$ operations we need to perform on $n$ before going below $B$. In addition, when each record is associated with a real-valued key, we extend the subset sampling problem to the range subset sampling problem, in which we require that the keys of the sampled records fall within a specified input range $[a, b]$. For this extension, we provide a solution under the dynamic setting, with $O(\log n + \mu_{S \cap [a,b]})$ expected query time, $O(n)$ space and $O(\log n)$ amortized expected update, insert and delete time.
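To make the query semantics concrete, the following is a minimal Python sketch, not the data structure developed in the paper. The function names (`expected_size`, `naive_subset_sample`, `uniform_subset_sample`) are hypothetical and chosen for illustration only. The first sampler flips one coin per record and therefore costs O(n) per query regardless of $\mu_S$. The second covers only the special case where every record has the same probability, and uses the standard geometric-skip technique to jump directly to the next sampled record, which suggests why a query time proportional to $1 + \mu_S$ is a plausible target.

```python
import math
import random


def expected_size(probs):
    """Return mu_S, the sum of p(v) over all records (expected sample size)."""
    return sum(probs.values())


def naive_subset_sample(probs, rng):
    """Baseline query: one independent coin flip per record, O(n) time.

    `probs` maps each record v to p(v); `rng` is a random.Random instance.
    """
    return {v for v, p in probs.items() if rng.random() < p}


def uniform_subset_sample(records, p, rng):
    """Special case only: every record has the same probability p.

    Draws the gap to the next sampled record from a geometric
    distribution, so the expected work is O(1 + n * p) instead of O(n).
    This is a textbook technique, shown here as an assumption-laden
    illustration rather than the paper's method.
    """
    if p <= 0.0:
        return []
    if p >= 1.0:
        return list(records)
    sample = []
    log_q = math.log1p(-p)  # log(1 - p), strictly negative here
    i = -1
    while True:
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        i += 1 + int(math.log(u) / log_q)  # skip the unsampled records
        if i >= len(records):
            return sample
        sample.append(records[i])
```

For example, `naive_subset_sample({"a": 1.0, "b": 0.0}, random.Random(0))` always returns `{"a"}`, since a record with probability 1 is always sampled and a record with probability 0 never is.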