Fast Pseudo-Random Fingerprints

Abstract

We propose a method to exponentially speed up the computation of various fingerprints, such as those used to compute similarity and rarity in massive data sets. Rather than maintaining the full stream of b items from a universe [u], such methods maintain only a concise fingerprint of the stream and perform computations on the fingerprints. The computations are approximate, and the required fingerprint size k depends on the desired accuracy ε and confidence δ. Our technique maintains a single bit per hash function, rather than a single integer, thus requiring a fingerprint of length k = O(ln(1/δ)/ε²) bits, rather than the O(log u · ln(1/δ)/ε²) bits required by previous approaches. The main advantage of the fingerprints we propose is that, rather than computing the fingerprint of a stream of b items in time O(b · k), we can compute it in time O(b log k). This allows an exponential speedup of the fingerprint construction, or alternatively a much higher accuracy within the same computation time. Our methods rely on a specific family of pseudo-random hashes for which we can quickly locate hashes resulting in small values.
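
To make the "single bit per hash function" idea concrete, the following is a minimal sketch of a naive one-bit min-hash fingerprint and the similarity estimate it supports. It is not the fast construction proposed here: it is the straightforward O(b · k) baseline that the abstract compares against. All names (`_hash`, `one_bit_fingerprint`, `estimate_jaccard`) are hypothetical, salted BLAKE2b is only a stand-in for the paper's pseudo-random hash family, and keeping the lowest bit of the minimum hash value is an assumption about how the single bit could be derived.

```python
import hashlib


def _hash(seed: int, item) -> int:
    """Stand-in hash (assumption): 64-bit BLAKE2b value salted by `seed`."""
    digest = hashlib.blake2b(
        str(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def one_bit_fingerprint(items, k: int) -> list[int]:
    """Naive O(b * k) construction: one bit per hash function.

    For each of the k hash functions, find the minimum hash value over
    the stream and keep only its lowest bit (an illustrative choice).
    """
    items = list(items)
    return [min(_hash(seed, x) for x in items) & 1 for seed in range(k)]


def estimate_jaccard(fp_a: list[int], fp_b: list[int]) -> float:
    """Estimate Jaccard similarity J from two one-bit fingerprints.

    With probability J both streams share the same minimum item, so the
    bits agree; otherwise the bits behave like independent fair coins and
    agree half the time. Hence P(agree) = (1 + J) / 2 and J = 2 * P - 1.
    """
    agree = sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)
    return max(0.0, 2.0 * agree - 1.0)


if __name__ == "__main__":
    k = 512
    a = one_bit_fingerprint(range(0, 100), k)
    b = one_bit_fingerprint(range(50, 150), k)  # true Jaccard = 50/150
    print(estimate_jaccard(a, b))
```

Because each comparison carries only one bit, the estimate is noisier per hash function than a full min-hash, which is why the fingerprint length k has to grow with the desired accuracy and confidence.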
