HyperMinHash: MinHash in LogLog space

Abstract

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size $O(\log n)$ to buckets of size $O(\log\log n)$ by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For a multiplicative approximation error $1+\epsilon$ on a Jaccard index $t$, given a random oracle, HyperMinHash needs $O\left(\epsilon^{-2}\left(\log\log n + \log\frac{1}{t\epsilon}\right)\right)$ space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of $10^{19}$ with relative error of around 10% using 64KiB of memory; MinHash can only estimate Jaccard indices for cardinalities of $10^{10}$ with the same memory consumption.
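The floating-point encoding mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `encode`, the 64-bit hash width, and the parameter values `q = 6` and `r = 10` are assumptions chosen for illustration. The idea shown is that a hash value is stored as an exponent (the count of leading zero bits, capped so it fits in `q` bits) plus a mantissa (the next `r` bits), rather than as the full hash.

```python
HASH_BITS = 64  # assumed hash width, for illustration only


def encode(h: int, q: int = 6, r: int = 10) -> tuple:
    """Lossy float-style encoding of a HASH_BITS-bit hash value.

    Returns (exponent, mantissa):
      exponent = number of leading zero bits, capped at 2**q - 1
      mantissa = the r bits following the leading 1 bit; if the zero
                 count was capped, the r bits following the counted zeros
                 (padded with zeros when fewer than r bits remain)
    """
    max_exp = (1 << q) - 1
    leading_zeros = HASH_BITS - h.bit_length()
    exponent = min(leading_zeros, max_exp)

    # Skip the counted zeros, plus the leading 1 bit when the count was not capped.
    skip = exponent + (1 if leading_zeros == exponent else 0)
    remaining_bits = HASH_BITS - skip
    rest = h & ((1 << remaining_bits) - 1)

    if remaining_bits >= r:
        mantissa = rest >> (remaining_bits - r)
    else:
        mantissa = rest << (r - remaining_bits)
    return exponent, mantissa


# A hash with 21 leading zeros, then the bits 1, 0, 1, then zeros:
print(encode(0b101 << 40))  # (21, 256)
```

With these assumed parameters each bucket stores `q + r = 16` bits instead of the full 64-bit hash, which is where the space saving comes from: the exponent needs only about $\log\log n$ bits, and the mantissa length controls how often distinct hash values collide after encoding.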
