Is Dimensionality a Barrier for Retrieval Models?

Abstract

Why does the low dimensionality of representations, typically $d \approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A \in \{0,1\}^{N \times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m > 0$, denoted by $\mathrm{mrd}(d, A)$, for which there exist unit-norm embeddings of the queries and documents, $\{U_j\}_{j=1}^{N}$ and $\{V_i\}_{i=1}^{n}$, with the following property: $\langle U_j, V_i \rangle \geq m$ whenever $A_{ji} = 1$, and $\langle U_j, V_i \rangle \leq -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathrm{mrd}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathrm{mrd}(+\infty, A)^{-2} \log n)$, which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A \in \{0,1\}^{\binom{n}{k} \times n}$ is the matrix containing each possible $k$-sparse row exactly once, dimension $d = O(k \log(n/k))$ is necessary and sufficient for the maximal possible margin $\mathrm{mrd}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k \log(n/k))$. Finally, we empirically test the InfoNCE and sigmoid losses for producing large-margin embeddings and demonstrate a clear advantage of the sigmoid loss.
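To make the margin definition concrete, the following is a minimal Python sketch (using NumPy) that builds the matrix of all k-sparse rows and evaluates the margin achieved by a given pair of embeddings. The function names `all_k_sparse_rows`, `margin`, and `baseline_embeddings` are hypothetical and chosen for illustration. The baseline is not a construction from the paper: it is a simple assumed example that uses dimension n + 1, with no attempt to keep the dimension small, and it reaches margin 1/(2√k + 1), which matches the Θ(k^{-1/2}) scaling stated in the abstract.

```python
from itertools import combinations

import numpy as np


def all_k_sparse_rows(n, k):
    """Return the C(n, k) x n 0/1 matrix containing every k-sparse row once."""
    rows = []
    for support in combinations(range(n), k):
        row = np.zeros(n, dtype=int)
        row[list(support)] = 1
        rows.append(row)
    return np.array(rows)


def margin(U, V, A):
    """Margin of embeddings (U, V) on A.

    Computes min over (j, i) of s_ji * <U_j, V_i>, where s_ji = +1 if
    A_ji = 1 and -1 otherwise. A positive value m means every relevant
    pair scores at least m and every irrelevant pair scores at most -m.
    """
    signs = 2 * A - 1
    return float(np.min(signs * (U @ V.T)))


def baseline_embeddings(A):
    """Illustrative (assumed) baseline in dimension n + 1, not the paper's method.

    Queries are U_j = (a_j, -s) and documents are V_i = (e_i, t), both
    normalised, with s * t = 1/2 so that the unnormalised scores are
    +1/2 for relevant pairs and -1/2 for irrelevant pairs.
    Assumes every row of A has the same number k of ones.
    """
    N, n = A.shape
    k = int(A[0].sum())
    t = (2.0 * np.sqrt(k)) ** -0.5  # choice of t that balances the two norms
    s = 0.5 / t
    U = np.hstack([A.astype(float), -s * np.ones((N, 1))])
    V = np.hstack([np.eye(n), t * np.ones((n, 1))])
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    return U, V


if __name__ == "__main__":
    A = all_k_sparse_rows(n=6, k=2)
    U, V = baseline_embeddings(A)
    print("achieved margin:", margin(U, V, A))
    print("1 / (2 sqrt(k) + 1):", 1.0 / (2.0 * np.sqrt(2) + 1.0))
```

The same `margin` function can be used to score embeddings produced by any training procedure, for example to compare the margins obtained under different losses.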
