OPORP: One Permutation + One Random Projection
Abstract
Consider two D-dimensional data vectors (e.g., embeddings): u, v. In many embedding-based retrieval (EBR) applications where the vectors are generated from trained models, D=256 1024 are common. In this paper, OPORP (one permutation + one random projection) uses a variant of the ``count-sketch'' type of data structures for achieving data reduction/compression. With OPORP, we first apply a permutation on the data vectors. A random vector r is generated i.i.d. with moments: E(ri) = 0, E(ri2)=1, E(ri3) =0, E(ri4)=s. We multiply (as dot product) r with all permuted data vectors. Then we break the D columns into k equal-length bins and aggregate (i.e., sum) the values in each bin to obtain k samples from each data vector. One crucial step is to normalize the k samples to the unit l2 norm. We show that the estimation variance is essentially: (s-1)A + D-kD-11k[ (1-2)2 -2A], where A≥ 0 is a function of the data (u,v). This formula reveals several key properties: (1) We need s=1. (2) The factor D-kD-1 can be highly beneficial in reducing variances. (3) The term 1k(1-2)2 is a substantial improvement compared with 1k(1+2), which corresponds to the un-normalized estimator. We illustrate that by letting the k in OPORP to be k=1 and repeat the procedure m times, we exactly recover the work of ``very spars random projections'' (VSRP). This immediately leads to a normalized estimator for VSRP which substantially improves the original estimator of VSRP. In summary, with OPORP, the two key steps: (i) the normalization and (ii) the fixed-length binning scheme, have considerably improved the accuracy in estimating the cosine similarity, which is a routine (and crucial) task in modern embedding-based retrieval (EBR) applications.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.