The Viscosity of Logic: Phase Transitions and Hysteresis in DPO Alignment

Abstract

Direct Preference Optimization (DPO) is often tuned as if increasing alignment pressure (controlled by β) yields progressively "better" behavior. We instead treat β as a control parameter and densely sweep it for three 7B open-weight families under a fixed DPO recipe. In Mistral, capability is sharply non-monotonic: aggregated logic-probe margins become positive only in a narrow band near β ≈ 10⁻² and revert outside it, with boundary points that are seed-sensitive. Across architectures under the same sweep, we observe qualitatively different response modes: sharp reorganization in Mistral, selective changes in Llama, and smooth trade-offs in Qwen. Critically, the DPO preference margin can anticorrelate with reasoning capability (Pearson r = -0.91 for Llama logic), so margin-based selection can prefer capability-impaired models. Training path also matters: exposure to high β induces capability losses that persist even after β is reduced (hysteresis). These findings motivate capability-resolved evaluation across the β landscape rather than reliance on margins or aggregate benchmarks.
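
For readers unfamiliar with where β enters, the following is a minimal sketch of the standard per-example DPO objective, not the authors' training code. It assumes the common definition of the preference margin as β times the difference between the chosen and rejected policy-to-reference log-ratios; the paper's exact margin definition may differ. The function names `dpo_margin` and `dpo_loss` and all argument names are hypothetical, chosen for illustration.

```python
import math


def dpo_margin(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Implicit reward margin under the standard DPO formulation (assumed here).

    Each logp argument is the summed log-probability of a full response under
    the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    chosen_log_ratio = logp_chosen - ref_logp_chosen
    rejected_log_ratio = logp_rejected - ref_logp_rejected
    return beta * (chosen_log_ratio - rejected_log_ratio)


def dpo_loss(beta, logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected):
    """Per-example DPO loss: -log(sigmoid(margin)), computed stably."""
    margin = dpo_margin(beta, logp_chosen, logp_rejected,
                        ref_logp_chosen, ref_logp_rejected)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With the log-probabilities held fixed, the margin scales linearly with β, which is one reason to read margins alongside capability probes when comparing points in a β sweep.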
