Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
Abstract
Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $\pi$. Let $A_\pi = \max_u |\pi^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_\pi$, an exact finite-block scaling law, a pointwise adaptive budget $\log_2 |\pi^{-1}(u)|$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D(L) = \max(0, 1 - 2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_\pi$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
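The quantities named in the abstract can be illustrated on a toy representation map. The following is a minimal sketch, not the paper's Lean development: the helper names, the example map (first letter of a word), and the rounding of bit budgets up to whole bits are all assumptions made for illustration. Reading $D(L)$ as the residual identity error under a uniform source on one block of size $a$ is likewise our interpretation of the closed-form special case.

```python
from collections import defaultdict


def fibers(entities, pi):
    """Group entities by representation value: u -> pi^{-1}(u)."""
    groups = defaultdict(list)
    for e in entities:
        groups[pi(e)].append(e)
    return dict(groups)


def whole_bits(n):
    """Smallest integer b with 2**b >= n, i.e. ceil(log2(n)) without floats."""
    return (n - 1).bit_length()


def max_fiber(fib):
    """A_pi: size of the largest collision fiber."""
    return max(len(members) for members in fib.values())


def fixed_length_bits(fib):
    """Whole-bit fixed-length budget; assumes rounding log2(A_pi) up."""
    return whole_bits(max_fiber(fib))


def adaptive_bits(fib, u):
    """Whole-bit pointwise budget for representation value u (same rounding assumption)."""
    return whole_bits(len(fib[u]))


def single_block_distortion(L, a):
    """Uniform single-block formula D(L) = max(0, 1 - 2^L / a)."""
    return max(0.0, 1 - 2 ** L / a)


# Hypothetical toy map: a word is represented by its first letter only.
words = ["cat", "cab", "car", "dog", "dot", "emu"]
fib = fibers(words, lambda w: w[0])

print(max_fiber(fib))                 # largest fiber is {"cat", "cab", "car"}
print(fixed_length_bits(fib))         # bits needed if every value gets the same budget
print({u: adaptive_bits(fib, u) for u in fib})
print(single_block_distortion(1, 4))  # one bit on a block of four
```

In this toy case the injective fiber (`"e"`) needs no extra bits at all, while the fixed-length budget is driven entirely by the largest fiber, which is the contrast between the fixed-length converse and the pointwise adaptive budget.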