Scaling Laws for Agent Harnesses via Effective Feedback Compute
Abstract
Agent harnesses shape language-model performance by controlling tool use, feedback, verification, memory, and repair. Yet raw test-time expenditure, such as tokens, tool calls, wall time, or cost, cannot distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate for informative, valid, non-redundant, and retained feedback. We further define Estimated-EFC, NRS-EFC, harness efficiency η, and task-demand normalization for realistic traces and heterogeneous tasks. Across synthetic, real, held-out, and prospective evaluations, EFC-based coordinates outperform raw-compute baselines and SAS. Oracle-EFC/D_task reaches R² = 0.99 in controlled scaling, and NRS-EFC/D_task reaches R² = 0.93 on real traces where raw compute has near-zero or negative fit. Finally, using EFC as a companion control layer for existing harnesses improves mean pass rate from 61.2% to 68.2% while reducing mean raw cost from 213.8 to 85.1 under matched settings. These results suggest that harness scaling depends on durable, task-sufficient feedback rather than raw computation alone.
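The abstract's remark that raw compute can have a "negative fit" may seem odd, so a short sketch of how R² can fall below zero is given here. It assumes the standard coefficient of determination, 1 - SS_res / SS_tot, evaluated on fixed predictions; the abstract does not state which estimator the paper uses. The pass rates and predictions below are hypothetical illustrative numbers, not results from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed pass rates for four tasks (not from the paper).
observed = [0.2, 0.4, 0.6, 0.8]

# Hypothetical predictions from a coordinate that tracks the trend.
tracking_pred = [0.25, 0.35, 0.65, 0.75]

# Hypothetical predictions from a coordinate that gets the trend backwards.
reversed_pred = [0.8, 0.6, 0.4, 0.2]

r2_tracking = r_squared(observed, tracking_pred)
r2_reversed = r_squared(observed, reversed_pred)

print(f"tracking coordinate: R^2 = {r2_tracking:.2f}")
print(f"reversed coordinate: R^2 = {r2_reversed:.2f}")
```

A negative value means the predictions do worse than simply guessing the mean outcome for every task, which is the sense in which a scaling coordinate can fit worse than no coordinate at all.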