f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Abstract

Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) and unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce f-Group Relative Policy Optimization (f-GRPO), a class of on-policy RL objectives, and f-Hybrid Alignment Loss (f-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate f-divergences between the reward-aligned and reward-unaligned distributions induced by responses with above-average and below-average reward, respectively, and we prove that expected reward improves after alignment. Empirically, f-GRPO improves over GRPO on math-reasoning RLVR tasks, while the hybrid f-HAL mitigates reward hacking in on-policy safety alignment when verifiable rewards are unavailable and learned reward models must be used.
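
To make the "above-average" and "below-average" split concrete, the following is a minimal sketch of how a group of sampled responses to one prompt might be partitioned by their scalar rewards. It is only an illustration under stated assumptions, not the paper's implementation: the function name `split_by_group_mean`, the example rewards, and the choice to place ties with the below-average set are all hypothetical.

```python
from statistics import mean


def split_by_group_mean(rewards):
    """Partition response indices around the group's mean reward.

    Assumption (not from the paper): responses whose reward equals the
    mean are placed in the below-average set.
    """
    baseline = mean(rewards)
    above = [i for i, r in enumerate(rewards) if r > baseline]
    below = [i for i, r in enumerate(rewards) if r <= baseline]
    return baseline, above, below


# Hypothetical binary verifiable rewards for 8 sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
baseline, above, below = split_by_group_mean(rewards)
print(baseline, above, below)
```

In this reading, the indices in `above` stand in for samples from the reward-aligned distribution and those in `below` for samples from the reward-unaligned one; how the objectives then weight each set depends on the chosen f-divergence and is not shown here.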
