Entropy Gate: Entropy Quenching for Near-Lossless Token Compression in LLM Pipelines

Abstract

LLM pipelines waste substantial token budgets on low-information content: repeated context, verbose responses, and redundant boilerplate. We introduce Entropy Gate, a token compression framework applying entropy quenching - a thermodynamic process that progressively freezes out low-energy tokens while preserving semantic fidelity. Each token receives a multi-factor information energy E(t) combining statistical, structural, and positional components. An adaptive quenching schedule T(τ) = T0 / (1 + ατ) removes tokens whose Boltzmann survival probability pi = (-Ei / kT) falls below threshold, with a fidelity gate halting compression when energy-weighted similarity drops below θ. We prove token selection by descending E(t) maximizes expected semantic preservation, that quenching produces nested survival sets, and that achievable compression approaches the information-theoretic limit CR 1 - I(P; T)/H(P). A Phase 1 heuristic achieves 40-60% compression across five prompt categories while maintaining SE > 0.80, with energy-squared amplification E E2 adding 10-25 percentage points. Context deduplication adds 50-70% savings on repeated blocks. Output-side quenching, motivated by findings that brevity improves accuracy, further reduces response overhead. Combined with external memory, reduction composes multiplicatively to 88-96% for agentic workloads. The framework is stateless, model-agnostic, and deploys as an OpenAI-compatible HTTP proxy.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…