Fast Similarity Sketching
Abstract
We consider the Similarity Sketching problem: Given a universe [u] = \0,…, u-1\ we want a random function S mapping subsets A⊂eq [u] into vectors S(A) of size t, such that the Jaccard similarity J(A,B) = |A B|/|A B| between sets A and B is preserved. More precisely, define Xi = [S(A)[i] = S(B)[i]] and X = Σi∈ [t] Xi. We want E[Xi]=J(A,B), and we want X to be strongly concentrated around E[X] = t · J(A,B) (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors S(A) are also called sketches. Strong concentration is critical, for often we want to sketch many sets B1,…,Bn so that we later, for a query set A, can find (one of) the most similar Bi. It is then critical that no Bi looks much more similar to A due to errors in the sketch. The seminal t×MinHash algorithm uses t random hash functions h1,…, ht, and stores ( a∈ A h1(A),…, a∈ A ht(A) ) as the sketch of A. The main drawback of MinHash is, however, its O(t· |A|) running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.