Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

Abstract

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer jordan2024muon.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…