Capacity, Not Format: Rethinking Structured Reasoning Failures

Hengxin Fan

Abstract

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks, with 0% parse failures on successfully generated responses. We find that the effect of structured formats is capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: 88.7±4.0% JSON vs. 89.3±1.7% CoT on MATH-Hard). In contrast, structured formats severely degrade models operating near their limits, through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp (p < 0.0001), largely due to truncation. Second, even with extended budgets that eliminate truncation, GPT-4o-mini drops 28.0pp (p < 0.001), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar p < 0.0001) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON (-5.3pp; the displayed percentages are independently rounded, and the exact difference is 7/133 = 5.26pp ≈ 5.3pp). A delayed-structure ablation, in which the model reasons freely before formatting, recovers most of the lost accuracy (3-run mean: 80 to 87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
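
The rounding note on the AIME result and the McNemar comparison can be checked with a short sketch. The total of 133 problems and the 7-problem gap come from the abstract; the per-condition counts of 128 and 121 correct are inferred here as the only integers that round to the reported 96.2% and 91.0%. The discordant-pair counts passed to `mcnemar_exact` are hypothetical, since the abstract does not report paired outcomes, and the exact binomial form of the test is an assumption about which variant was used.

```python
from math import comb


def pct(correct: int, total: int) -> float:
    """Accuracy as a percentage, rounded to one decimal as displayed."""
    return round(100 * correct / total, 1)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b and c are the discordant pair counts: items solved under one
    format but not the other. Under the null hypothesis each discordant
    item is equally likely to fall on either side, so the test reduces
    to a binomial test with p = 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


TOTAL = 133          # from the abstract (7/133)
FREE_CORRECT = 128   # inferred: the only count that rounds to 96.2%
JSON_CORRECT = 121   # inferred: the only count that rounds to 91.0%

free_pct = pct(FREE_CORRECT, TOTAL)
json_pct = pct(JSON_CORRECT, TOTAL)

# Subtracting the two displayed (already rounded) percentages
displayed_gap = round(free_pct - json_pct, 1)
# Subtracting the raw counts first, then rounding once
exact_gap = round(100 * (FREE_CORRECT - JSON_CORRECT) / TOTAL, 1)

print(f"free-form: {free_pct}%  JSON: {json_pct}%")
print(f"gap from displayed values: {displayed_gap}pp")
print(f"gap from raw counts:       {exact_gap}pp")

# Hypothetical discordant counts, for illustration only:
# 3 items solved only under JSON, 15 solved only under free-form reasoning.
print(f"McNemar exact p (hypothetical 3 vs 15): {mcnemar_exact(3, 15):.4f}")
```

Rounding each percentage before subtracting loses a tenth of a point here, which is why the abstract reports the gap computed from the raw counts.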
