The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

Abstract

Leaderboards rank frontier models on independent axes but do not reveal whether capabilities reinforce or trade off across releases, and at the frontier this interaction is the more informative signal. We decompose paired SWE-bench and GPQA Diamond scores into a population coupling trend and a per-release residual (the h-field) that diagnoses capability emphasis from two public benchmark scores. Across 34 models from 10 labs (2024-2026), capabilities cooperate (r = +0.72, p < 10⁻⁶), but cooperation varies systematically: per-lab coupling slopes span 5× (Google 1.15 vs. DeepSeek 0.23), and labs pivot. DeepSeek reversed from reasoning-rich to coding-first (Δh = 15.9 pp), while Anthropic oscillates between coding excursions and recovery. The population regression serves as an isocline phase boundary: the same (a/b)·B1 classifier that identifies the base-scale coupling transition [Amin, 2026] classifies frontier models and already detects mixed-phase behavior at the next transition (two models fall below the GPQA-IFEval isocline). The h-field is not just diagnostic: it tells you what to change. Pretraining establishes coupling at 0.871 while RLHF adds 0.081 [Amin, 2026]. Pretraining-level shifts are permanent (DeepSeek's four-release reversal persists), post-training shifts are reversible (Anthropic's three coding excursions each recover within one release), and inference compute alone shifts h by +7.8 pp without retraining. Knowing which component dominates determines whether to retrain or wait. We provide a three-step diagnostic (locate, classify, predict), a per-lab measurement-priority table, and seven falsifiable predictions with timestamped criteria. Five post-cutoff releases fall within the 95% prediction interval. Code, data, and an interactive dashboard: https://zehenlabs.com/cape/.
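
The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes the population trend is an ordinary least-squares line of SWE-bench on GPQA Diamond and that h is the signed residual from that line, in percentage points; the abstract does not state the regression direction or sign convention, so both are assumptions here. The function names and the four score pairs are hypothetical and are not taken from the paper's data.

```python
from statistics import mean


def fit_trend(x, y):
    """Ordinary least-squares fit of y ~ intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def h_field(x, y, slope, intercept):
    """Per-release residual from the population trend, in percentage points.

    Assumed convention: positive h means the release scores higher on y
    (here SWE-bench) than the population trend predicts from x (here GPQA).
    """
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up scores for four hypothetical releases (percent), for illustration only.
gpqa = [50.0, 60.0, 70.0, 80.0]
swe = [30.0, 45.0, 50.0, 65.0]

slope, intercept = fit_trend(gpqa, swe)
h = h_field(gpqa, swe, slope, intercept)

for g, s, r in zip(gpqa, swe, h):
    print(f"GPQA={g:.1f}  SWE={s:.1f}  h={r:+.1f} pp")
```

Under this convention, the change in h between two releases from the same lab plays the role of the Δh quoted above, and fitting the same line per lab instead of on the pooled data would give a per-lab coupling slope.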
