GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Abstract
This paper investigates whether reward matching is a viable alternative to reward maximization for on-policy RL of LLMs. It proposes Group-Relative Implicit Fine-Tuning (GIFT), which combines GRPO-style group sampling, a DPO-style implicit reward, and a UNA-style MSE loss between implicit and explicit advantages. Applying z-score standardization cancels the intractable partition function $Z(x)$ in the DPO implicit reward and eliminates the KL coefficient $\beta$ from the RLHF and RLVR objectives. The population minimizers of $\mathcal{L}_{\text{GIFT}}$ are characterized in closed form: they coincide exactly with the GRPO/RLHF solution family $\pi^*_\beta(y|x) \propto \pi_{\text{ref}}(y|x)\, e^{\frac{1}{\beta} r_\phi(x,y)}$, with a prompt-dependent, variance-determined KL coefficient $\beta(x) = \sigma_\phi(x)/\sigma_\theta(x)$. GIFT therefore solves for the same parametric policy family as GRPO while replacing GRPO's externally tuned scalar $\beta$ with a prompt-adaptive $\beta(x)$ optimized endogenously by matching reward distributions. Empirically, on 7B-32B backbones, GIFT converges faster than GRPO, DAPO, and GSPO, overfits less on RLVR (GSM8K, MATH, AIME), and produces higher length-controlled win rates on RLHF (AlpacaEval, Arena-Hard). All proofs and detailed background are deferred to the appendix.
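The cancellation claimed in the abstract can be illustrated with a minimal sketch. It assumes the per-prompt loss is a plain mean squared error between the z-scored DPO-style implicit reward and the z-scored explicit reward over one sampled group; the abstract does not give the exact loss, so the function names, the `eps` stabilizer, and the toy numbers below are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def zscore(values, eps=1e-12):
    """Standardize a 1-D group of values to zero mean and unit variance."""
    values = np.asarray(values, dtype=np.float64)
    return (values - values.mean()) / (values.std() + eps)


def gift_loss_sketch(logp_policy, logp_ref, rewards, beta=1.0, log_z=0.0):
    """MSE between standardized implicit and explicit rewards for one prompt.

    beta and log_z appear only to show that they cancel under z-scoring;
    they are illustrative arguments, not part of any published API.
    """
    log_ratio = np.asarray(logp_policy, dtype=np.float64) - np.asarray(
        logp_ref, dtype=np.float64
    )
    # DPO-style implicit reward: beta * log(pi_theta / pi_ref) + beta * log Z(x).
    implicit = beta * log_ratio + beta * log_z
    implicit_adv = zscore(implicit)  # shift (log Z) and scale (beta) drop out
    explicit_adv = zscore(rewards)   # group-relative explicit advantage
    return float(np.mean((implicit_adv - explicit_adv) ** 2))


# Hypothetical group of 4 sampled responses for a single prompt.
logp_policy = np.array([-12.0, -15.5, -11.2, -14.1])  # sequence log-probs under the policy
logp_ref = np.array([-12.5, -15.0, -12.0, -13.8])     # sequence log-probs under the reference
rewards = np.array([1.0, 0.0, 1.0, 0.0])              # e.g. verifiable 0/1 rewards

loss_a = gift_loss_sketch(logp_policy, logp_ref, rewards, beta=1.0, log_z=0.0)
loss_b = gift_loss_sketch(logp_policy, logp_ref, rewards, beta=0.05, log_z=37.0)
print(loss_a, loss_b)  # the two values agree up to floating-point error
```

Because z-scoring removes any additive constant and any positive rescaling within the group, changing `beta` or `log_z` leaves the loss unchanged, which is why neither needs to be known or tuned in this sketch. The loss reaches zero when the log-ratio is a positive affine function of the reward within the group, which matches the exponential-tilt form of the solution family described above.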