Sign-Full Random Projections
Abstract
The method of 1-bit ("sign-sign") random projections has been a popular tool for efficient search and machine learning on large datasets. Given two $D$-dimensional data vectors $u, v \in \mathbb{R}^D$, one can generate $x = \sum_{i=1}^{D} u_i r_i$ and $y = \sum_{i=1}^{D} v_i r_i$, where $r_i \sim N(0,1)$ i.i.d. The "collision probability" is $\Pr(\mathrm{sgn}(x) = \mathrm{sgn}(y)) = 1 - \frac{\cos^{-1}\rho}{\pi}$, where $\rho = \rho(u,v)$ is the cosine similarity. We develop "sign-full" random projections by estimating $\rho$ from (e.g.) the expectation $E(\mathrm{sgn}(x)\,y) = \sqrt{\frac{2}{\pi}}\,\rho$, an estimator that can be substantially improved by normalizing $y$. For nonnegative data, we recommend an interesting estimator based on $E\left(y_- 1_{x \geq 0} + y_+ 1_{x < 0}\right)$ and its normalized version. The recommended estimator almost matches the accuracy of the (computationally expensive) maximum likelihood estimator. At high similarity ($\rho \to 1$), the asymptotic variance of the recommended estimator is only $\frac{4}{3\pi} \approx 0.4$ of that of the estimator for sign-sign projections. With a small number of projections $k$ and high similarity, the improvement is even more substantial.
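The two expectations quoted above can each be inverted to give a simple estimator of $\rho$. The following is a minimal simulation sketch of that idea, not the paper's implementation. It assumes $u$ and $v$ are scaled to unit norm (so that $x, y \sim N(0,1)$, which the identity $E(\mathrm{sgn}(x)\,y) = \sqrt{2/\pi}\,\rho$ requires). The function name `estimate_cosine` and the parameters `k` and `seed` are hypothetical, and only the Python standard library is used. It covers the basic sign-sign and unnormalized sign-full estimators, not the normalized or recommended variants.

```python
import math
import random


def estimate_cosine(u, v, k=20000, seed=0):
    """Return (sign-sign estimate, sign-full estimate) of the cosine of u and v.

    Illustrative sketch only: k is the number of random projections.
    """
    rng = random.Random(seed)

    # Assumption: scale both vectors to unit norm so that x, y ~ N(0, 1).
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    u = [a / norm_u for a in u]
    v = [b / norm_v for b in v]

    collisions = 0      # counts projections with sgn(x) == sgn(y)
    signed_sum = 0.0    # accumulates sgn(x) * y

    for _ in range(k):
        # One shared Gaussian direction r for both vectors.
        r = [rng.gauss(0.0, 1.0) for _ in u]
        x = sum(a * ri for a, ri in zip(u, r))
        y = sum(b * ri for b, ri in zip(v, r))
        sx = 1.0 if x >= 0 else -1.0
        sy = 1.0 if y >= 0 else -1.0
        collisions += int(sx == sy)
        signed_sum += sx * y

    # Sign-sign: invert Pr(collision) = 1 - arccos(rho) / pi.
    p_hat = collisions / k
    rho_sign_sign = math.cos(math.pi * (1.0 - p_hat))

    # Sign-full: invert E[sgn(x) y] = sqrt(2 / pi) * rho.
    rho_sign_full = math.sqrt(math.pi / 2.0) * signed_sum / k

    return rho_sign_sign, rho_sign_full


# Two unit vectors whose true cosine similarity is 0.8.
u = [1.0, 0.0, 0.0, 0.0]
v = [0.8, 0.6, 0.0, 0.0]
rho_ss, rho_sf = estimate_cosine(u, v)
print(rho_ss, rho_sf)
```

With a large `k`, both estimates should land close to the true cosine. The unnormalized sign-full estimator shown here is only the starting point: the gains described above come from normalizing $y$ and, for nonnegative data, from the recommended estimator.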