Refresh-Scaling the Memory of Balanced Adam

Abstract

Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, β1 = β2 = β, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, β should not be treated as a dimensionless constant: it defines a statistical memory horizon H_β = (1-β)^{-1}. In terms of the effective learning horizon T_ES, estimated from the validation trajectory, we study the refresh count R_β = (1-β) T_ES, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing β so that R_β ≈ 1000 selects different β values depending on the training scale, yet improves robustness over the best fixed-β baseline. Compared with the strongest fixed choice β = 0.944, the refresh rule improves worst-case robustness, reducing the maximum relative gap in validation loss by 33.4%, while bringing all 11 runs within 1% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is more naturally viewed as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
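
The refresh rule can be made concrete with a little arithmetic. The sketch below is not the authors' code: it only inverts the definition R_β = (1-β) T_ES to get β = 1 - R/T_ES. The function names are hypothetical, and T_ES is assumed to be supplied as a number of optimizer steps, since the abstract does not say how it is estimated from the validation trajectory.

```python
def memory_horizon(beta: float) -> float:
    """Memory horizon H_beta = 1 / (1 - beta), in optimizer steps."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return 1.0 / (1.0 - beta)


def refresh_count(beta: float, t_es: float) -> float:
    """Refresh count R_beta = (1 - beta) * T_ES."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return (1.0 - beta) * t_es


def beta_for_refresh(t_es: float, target_refreshes: float = 1000.0) -> float:
    """Solve R_beta = target_refreshes for beta, given T_ES.

    Assumption: t_es is the effective learning horizon in steps,
    estimated elsewhere. The default target of 1000 follows the
    abstract's R_beta ~ 1000.
    """
    if target_refreshes <= 0:
        raise ValueError("target_refreshes must be positive")
    if t_es < target_refreshes:
        # beta would be negative: the horizon is too short for this target.
        raise ValueError("t_es must be at least target_refreshes")
    return 1.0 - target_refreshes / t_es


if __name__ == "__main__":
    # Illustrative horizons only; these are not values from the paper.
    for t_es in (20_000, 100_000, 1_000_000):
        beta = beta_for_refresh(t_es)
        print(f"T_ES={t_es:>9,d}  beta={beta:.4f}  H_beta={memory_horizon(beta):.0f}")
```

Under these assumptions a longer effective horizon yields a β closer to 1, which is the sense in which the rule selects different β values at different training scales. In a PyTorch setup the result would presumably be passed as `betas=(beta, beta)` to `torch.optim.Adam`, matching the balanced setting β1 = β2.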
