R2k is Theoretically Large Enough for Embedding-based Top-k Retrieval

Abstract

This paper studies the Minimal Embeddable Dimension (MED): the least dimension in which there exists a configuration of m object vectors so that every subset of size at most k is exactly retrieved by score comparison. Our result shows MED is Θ(k), independent of m, for inner product, Euclidean distance, and cosine similarity. We then consider Robust MED (RMED), where all vectors are unit normed and an ε gap of scores is required. We derive the m-dependent feasibility ceiling ε(m,k)=m/k(m-1)(m-k), which approaches 1/k when m k, and a Gaussian centroid construction gives a robust witness upper bound in the feasible margin regime. Numerical simulation on synthetic top-2 retrieval with cyclic polytope and centroid query optimization confirmed our theoretical claims. Experiments on LIMIT and LIMIT-small datasets also show that simple embedding-based retrieval baselines can overfit and outperform the reported single-vector LLM embedding baseline. Both theoretical and empirical findings rule out the lack of exact geometric capacity as the obstruction.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…