Geometrically Principled Randomized Optimization for Efficient LLM Training

Abstract

Low-rank gradient optimization for large language models is currently divided into two categories: structured methods that rigorously identify subspaces, and randomized approaches employed primarily for computational efficiency. In this work, we question the intuition behind why random projections are effective. We trace this phenomenon to the geometry of gradient subspaces: the subspace optimization landscape has nearly flat curvature, while a significant portion of the gradient information lies outside the core subspace. Leveraging these insights, and drawing on randomized linear algebra, we theoretically establish that random low-rank projections preserve the geometry, and we introduce GrassWalk and GrassJump, algorithms that navigate the Grassmannian manifold via random walks and jumps. By coupling this randomized exploration with a subspace-aware optimizer and recovering the lost gradient signals, we achieve state-of-the-art results on LLaMA-1B, LLaMA-7B, and Qwen-1.5B pretraining. Our findings reframe randomization not merely as a computational shortcut, but as a geometrically principled approach to high-dimensional optimization.
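To make the notion of a random low-rank projection concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the GrassWalk or GrassJump algorithm itself: the abstract does not specify the update rules, so every function name here (random_orthonormal_basis, project, lift, residual, frob_sq) is hypothetical, as are the toy dimensions. The sketch draws a random r-dimensional subspace (a point on the Grassmannian) by orthonormalizing Gaussian vectors, projects a gradient matrix onto it, and measures the part of the gradient that falls outside the subspace, which is the kind of "lost" signal the abstract refers to.

```python
import math
import random


def random_orthonormal_basis(m, r, rng):
    """Draw r orthonormal vectors in R^m by Gram-Schmidt on Gaussian vectors.

    Their span is a random r-dimensional subspace, i.e. a random point on
    the Grassmannian Gr(r, m). Hypothetical helper, not from the paper.
    """
    basis = []
    while len(basis) < r:
        v = [rng.gauss(0.0, 1.0) for _ in range(m)]
        for b in basis:  # remove components along the vectors found so far
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-12:  # skip the (measure-zero) degenerate draw
            basis.append([vi / norm for vi in v])
    return basis


def project(basis, grad):
    """Low-rank coordinates P^T G: an r x n matrix for an m x n gradient."""
    n = len(grad[0])
    return [[sum(b[i] * grad[i][j] for i in range(len(b))) for j in range(n)]
            for b in basis]


def lift(basis, coords):
    """Map low-rank coordinates back to the full space: P (P^T G)."""
    m, n = len(basis[0]), len(coords[0])
    return [[sum(basis[k][i] * coords[k][j] for k in range(len(basis)))
             for j in range(n)] for i in range(m)]


def residual(grad, lifted):
    """Gradient component lying outside the sampled subspace."""
    return [[g - l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(grad, lifted)]


def frob_sq(mat):
    """Squared Frobenius norm."""
    return sum(x * x for row in mat for x in row)


# Toy sizes chosen for illustration only.
rng = random.Random(0)
m, n, r = 8, 5, 3
grad = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]

basis = random_orthonormal_basis(m, r, rng)   # sample a random subspace
coords = project(basis, grad)                 # r x n: what a low-rank optimizer would see
lifted = lift(basis, coords)                  # in-subspace part of the gradient
resid = residual(grad, lifted)                # out-of-subspace part

# A "jump", in the loosest sense, could be modeled as resampling the subspace.
new_basis = random_orthonormal_basis(m, r, rng)
```

Because the basis is orthonormal, the squared norm of the gradient splits exactly into the in-subspace and out-of-subspace parts, so frob_sq(resid) / frob_sq(grad) gives the fraction of gradient energy that a purely projected update would discard for this particular draw.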
