Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch
Abstract
Embedding-based representations in Euclidean space Rd are a cornerstone of modern machine learning, where a major goal is to use the smallest dimension that faithfully captures data relations. In this work, we prove sharp dimension--accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension d is chosen close to the ground-truth dimension D, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of m anchor--positive--negative triplets (i,j,k) encoding distance comparisons dist(i,j) < dist(i,k). Specifically, given triplets realizable by an unknown ground-truth embedding in D dimensions, we prove that there exists constant c < 1, such that every embedding of dimension at most cD violates half of the triplets, yielding accuracy as low as a trivial one-dimensional solution that ignores the input. We complement our information-theoretic bounds with strong computational hardness results: under the Unique Games Conjecture, even if the given triplets are nearly realizable in D=1 dimension, no polynomial-time algorithm -- regardless of its dimension -- can achieve accuracy above the trivial 50\% baseline.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.