Training Transformers in Cosine Coefficient Space

Abstract

Linear layers hold most of a transformer's parameters. We replace each linear layer with one that stores K of the mn two-dimensional DCT coefficients of its m × n weight matrix and reconstructs the full matrix through an inverse DCT at every forward pass; the K stored coefficients are the only trainable parameters of the layer. A 4-layer, 128-dim transformer trained from scratch on character-level Shakespeare reaches validation loss 1.604 at K = mn/2, against 1.580 for a standard dense baseline. This gap of +0.024, at half the trainable parameter count, lies within the terminal-epoch variation of the dense run. A rank-48 LoRA factorization at the same trainable parameter count reaches only 1.801 (+0.221), a gap roughly nine times larger, so the advantage of sparse-coefficient over low-rank parameterizations at matched K is qualitative rather than marginal. We identify rank flexibility as the mechanism. A random orthonormal basis matches the DCT within noise at K = mn/2, so the effect is not specific to the DCT. A compression sweep through K = mn/10 and K = mn/20 shows that subspaces that can host high-rank matrices keep the loss low, while subspaces that flatten into a low-rank block (zigzag-selection variants) converge in lock-step onto the observed stable rank and the loss line of the rank-48 LoRA reference. Among these orthonormal bases, the DCT is preferred because its separable fast transform admits a fused reconstruction kernel: the materialized weight matrix never leaves on-chip memory, so the parameter saving translates into a bandwidth saving as well.
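The reconstruction step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`dct_matrix`, `matmul`, `transpose`, `reconstruct`), the toy sizes, and the uniformly random choice of which K coefficient positions to keep are all hypothetical, and explicit matrix products stand in for the fast separable transform.

```python
import math
import random


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); rows are basis vectors, so C C^T = I."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n)])
    return rows


def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def reconstruct(coeffs, index, m, n):
    """Inverse 2-D DCT of a sparse coefficient set: W = C_m^T S C_n."""
    s = [[0.0] * n for _ in range(m)]  # all mn coefficient slots, mostly zero
    for value, (u, v) in zip(coeffs, index):
        s[u][v] = value  # scatter the K stored coefficients
    return matmul(matmul(transpose(dct_matrix(m)), s), dct_matrix(n))


m, n = 8, 12                 # toy layer shape (assumption)
k = m * n // 2               # K = mn/2, as in the headline experiment
rng = random.Random(0)
# Assumption: the kept positions are a fixed random subset of the mn slots.
index = rng.sample([(u, v) for u in range(m) for v in range(n)], k)
coeffs = [rng.gauss(0.0, 1.0) for _ in range(k)]  # the K trainable parameters
weight = reconstruct(coeffs, index, m, n)         # dense m x n matrix for the forward pass
```

In a training setup, `coeffs` would be the tensor the optimizer updates and `reconstruct` would run inside the forward pass of an autodiff framework, so gradients reach the coefficients through the inverse transform. Because the basis is orthonormal, the Frobenius norm of `weight` equals the Euclidean norm of `coeffs`.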
