How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Abstract

SFT-then-RLVR is widely used for post-training reasoning models, but why this specific ordering, and why RLVR-only stalls at cold start, have lacked a unifying theoretical account. We provide that account under a unified loss family $J_Q$ using the Tsallis $q$-logarithm. $J_Q$ is a single-parameter family that interpolates between RLVR (at $q=0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q=1$, the density-estimation pole), under which the standard pipeline corresponds to a stepwise $q = 1 \to 0$ schedule. All members share the same per-example gradient direction, differing only by a per-instance amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. Under gradient flow analysis, we show that the exploitation pole requires $\Omega(1/p_0)$ time to escape cold start but is robust to label noise, while the density-estimation pole escapes in $\Theta(\log(1/p_0))$ but memorizes label noise. This separation explains how SFT ($q=1$) first moves the model out of the cold-start regime, followed by the more robust RLVR ($q=0$), under the SFT-then-RLVR paradigm. We further derive two Monte Carlo estimators that directly optimize fixed-$q$ on the $J_Q$ continuum, without annotated rationales: Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), with shared bias $O\!\left(\frac{q}{M P_\theta^{q}}\right)$ but different variance and stability properties. On FinQA, HotPotQA, and MuSiQue, GARL at sufficiently high $q$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes and PAFT at $q=0.75$ remains stable, reaching 47.9 m@16 on HotPotQA (+13.9 over GRPO).
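
To make the interpolation concrete, here is a minimal sketch of the standard Tsallis q-logarithm, ln_q(x) = (x^(1-q) - 1) / (1 - q), with the natural logarithm as its q = 1 limit. It assumes the per-example objective is ln_q applied to the success probability P_theta, which is an assumption consistent with the abstract and not a definition quoted from the paper. The function names `q_log` and `amplification` are hypothetical. Under that assumption, the derivative of ln_q(p) with respect to p is p^(-q), which is the per-instance amplification described above.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm; reduces to math.log(x) at q = 1."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def amplification(p: float, q: float) -> float:
    """Per-instance factor p**(-q), i.e. d/dp of q_log(p, q).

    Assumption: the objective is q_log applied to the success
    probability p, so this factor multiplies the q = 0 gradient.
    """
    if p <= 0.0:
        raise ValueError("p must be positive")
    return p ** (-q)


if __name__ == "__main__":
    # Compare a moderately likely instance with a cold-start one.
    for p in (0.5, 0.01):
        for q in (0.0, 0.5, 1.0):
            print(f"p={p:<5} q={q:<4} "
                  f"ln_q(p)={q_log(p, q):+.4f} "
                  f"amplification={amplification(p, q):.2f}")
```

At q = 0 every instance gets weight 1, while at q = 1 an instance with p = 0.01 is weighted about 100 times more heavily. This matches the abstract's picture of the density-estimation pole pushing harder on instances the model rarely solves.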
