Entropy-Gated Latent Recursion

Abstract

Inference-time scaling has become the dominant lever for improving language-model reasoning, but existing methods derive rollout diversity from a single source: stochastic token-level sampling. We argue that this single-axis sampling space is limiting, and identify a second, fully deterministic axis: the layer span L, i.e., how many of a frozen model's top decoder layers are recursively re-applied at high-uncertainty tokens. Different choices of L produce distinct rollouts that solve different subsets of problems, with no stochasticity. We instantiate this axis through Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top-L layers for at most K iterations, stopping early once the next-token distribution converges. Combined with T temperature samples, EGLR turns a single-axis stochastic rollout pool into an L × T Cartesian sampling space at almost the same per-rollout cost. We characterize this space across 8 instruction-tuned models and 6 math reasoning benchmarks, and show that the L-axis is complementary to temperature: on MATH-500 with Qwen2.5-3B-Instruct, the joint L × T oracle reaches 91.6%, which is 8.2 percentage points above the temperature-only oracle (83.4%) and 10.4 points above the layer-only oracle (81.2%), so each axis solves problems the other misses. The expanded rollout pool provides richer per-prompt candidates for any downstream procedure that consumes rollouts, including self-consistency, best-of-N with verifiers, and group-relative RL training (GRPO), opening a direction for inference-time scaling that does not rely on stochastic noise.
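The decoding step described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the entropy threshold, the convergence test, or how hidden states are fed back, so the names `eglr_next_token`, `gate`, and `tol`, the total-variation convergence check, and the residual tanh "layers" standing in for decoder layers are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    # Shannon entropy in nats; clipping avoids log(0).
    return float(-(p * np.log(np.clip(p, 1e-12, None))).sum())

def make_layer(W):
    # Hypothetical stand-in for a frozen decoder layer: a residual tanh map.
    return lambda h: h + 0.5 * np.tanh(W @ h)

def eglr_next_token(h0, layers, unembed, span, max_iters, gate, tol):
    """Return (next-token distribution, number of extra top-`span` passes)."""
    h = h0
    for layer in layers:                 # ordinary full forward pass
        h = layer(h)
    p = softmax(unembed @ h)
    if entropy(p) < gate:                # confident token: no recursion
        return p, 0
    top = layers[-span:]                 # the top-L layers (L = span)
    for k in range(1, max_iters + 1):    # at most K = max_iters iterations
        for layer in top:
            h = layer(h)
        p_new = softmax(unembed @ h)
        # Assumed convergence test: total-variation distance below tol.
        converged = 0.5 * np.abs(p_new - p).sum() < tol
        p = p_new
        if converged:
            return p, k
    return p, max_iters

# Toy setup: random frozen weights, no sampling anywhere.
rng = np.random.default_rng(0)
d, vocab, depth = 16, 50, 6
layers = [make_layer(rng.normal(size=(d, d)) / np.sqrt(d)) for _ in range(depth)]
unembed = rng.normal(size=(vocab, d))
h0 = rng.normal(size=d)

for span in (1, 2, 3):                   # each span gives one deterministic variant
    p, passes = eglr_next_token(h0, layers, unembed, span,
                                max_iters=4, gate=1.0, tol=1e-3)
    print(span, passes, int(p.argmax()))
```

Running the same call twice yields identical outputs, since nothing is sampled; in this sketch, variation comes only from changing `span`. Pairing each `span` with each of T temperature samples would give the L × T pool the abstract describes.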
