Riemannian Gradient Descent for Low-Rank Architectures

Abstract

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-r matrices, three geometries for rank-r partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.
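
As a rough illustration of the partial-isometry setting, suppose a rank-r partial isometry is parameterized as W = U V^T, where U and V have orthonormal columns. Under that assumption, one Riemannian gradient step can be sketched as below. The embedded Euclidean metric, the QR retraction, and every function name here are assumptions chosen for illustration; they are not necessarily any of the geometries or implementations the paper examines.

```python
import numpy as np


def sym(a):
    """Symmetric part of a square matrix."""
    return 0.5 * (a + a.T)


def stiefel_project(x, g):
    """Project an ambient gradient g onto the tangent space at x.

    Assumes x has orthonormal columns (x.T @ x = I) and uses the
    embedded Euclidean metric, which is one choice among several.
    """
    return g - x @ sym(x.T @ g)


def qr_retract(y):
    """Map a full-column-rank matrix back to orthonormal columns via QR."""
    q, r = np.linalg.qr(y)
    # Flip column signs so that diag(r) is non-negative, making the
    # retraction well defined.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return q * signs


def riemannian_step(u, v, grad_w, lr):
    """One gradient step for W = U V^T with U, V on Stiefel manifolds.

    grad_w is the Euclidean gradient of the loss with respect to W.
    By the chain rule, dL/dU = grad_w @ V and dL/dV = grad_w.T @ U.
    """
    gu = stiefel_project(u, grad_w @ v)
    gv = stiefel_project(v, grad_w.T @ u)
    return qr_retract(u - lr * gu), qr_retract(v - lr * gv)


# Hypothetical usage with random data: an 8 x 6 weight of rank 3.
rng = np.random.default_rng(0)
u0 = qr_retract(rng.standard_normal((8, 3)))
v0 = qr_retract(rng.standard_normal((6, 3)))
grad = rng.standard_normal((8, 6))
u1, v1 = riemannian_step(u0, v0, grad, lr=0.1)
```

After the step, the columns of both factors are orthonormal again, so W = U V^T stays a rank-3 partial isometry without any separate re-orthogonalization pass.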
