Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning
Abstract
A recurring challenge in preference fine-tuning (PFT) is handling intransitive (i.e., cyclic) preferences. Intransitive preferences often stem from either (i) inconsistent rankings along a single objective or (ii) scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel, game-theoretic solution concept, the Maximum Entropy Blackwell Winner (MaxEntBW), that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive PROSPER: a provably efficient PFT algorithm. Unlike prior self-play techniques, PROSPER directly handles multiple objectives without requiring scalarization. We then apply PROSPER to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that PROSPER outperforms all baselines considered across both instruction-following and general chat benchmarks, and we release trained model checkpoints at the 7B and 3B parameter scales.
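As a small illustration of source (ii), the sketch below shows how collapsing several objectives into a single pairwise preference can produce a cycle. Everything in it is an assumption made for illustration and is not taken from the paper: the three responses, their per-objective scores, and the choice of a majority vote over objectives as the aggregation rule. It does not implement MaxEntBW or PROSPER.

```python
from itertools import permutations

# Hypothetical scores for three responses on three objectives
# (e.g., helpfulness, concision, safety). Higher is better.
SCORES = {
    "A": (3, 1, 2),
    "B": (2, 3, 1),
    "C": (1, 2, 3),
}


def prefers(x: str, y: str) -> bool:
    """Return True if x beats y on a strict majority of objectives.

    This majority vote is one illustrative way of collapsing several
    objectives into a single preference; it is an assumption of this sketch.
    """
    wins = sum(sx > sy for sx, sy in zip(SCORES[x], SCORES[y]))
    return wins > len(SCORES[x]) / 2


def condorcet_winner():
    """Return a response that beats every other response, or None."""
    for x in SCORES:
        if all(prefers(x, y) for y in SCORES if y != x):
            return x
    return None


if __name__ == "__main__":
    for x, y in permutations(SCORES, 2):
        if prefers(x, y):
            print(f"{x} is preferred to {y}")
    print("Response that beats all others:", condorcet_winner())
```

Under these assumed scores, A beats B, B beats C, and C beats A, so no single response is preferred to all the others. This is the sense in which a best response, and hence an optimal policy, is not well-defined.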