From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

Ding-Xuan Zhou

From One-Pass SGD to Data Reuse: Mini-Batch Scaling Laws in Sketched Linear Regression

Abstract

Scaling laws provide compact descriptions of how prediction error varies with compute, model size, and data, but existing theory mainly treats single-sample SGD or full data reuse, leaving the role of mini-batching unclear. We study batch scaling laws for sketched linear regression under a power-law covariance spectrum and a source condition on the target parameter. We analyze one-pass batch SGD, multi-pass batch SGD with replacement, and multi-pass batch SGD without replacement. Our first result is a risk decomposition: all three procedures share the same irreducible and approximation terms, while their stochastic terms depend on the sampling protocol. One-pass batch SGD splits into bias and variance, whereas the two multi-pass methods split into GD bias, GD variance, and a fluctuation term around a common GD reference trajectory. We then prove source-condition scaling laws for one-pass and multi-pass mini-batch methods. For one-pass batch SGD, mini-batching preserves the approximation and optimization-bias exponents, while the variance scales as O((M,(Teffγ)1/a)/(B Teff)). Thus the usual 1/B covariance reduction holds at fixed update count T, but in the one-pass regime T=N/B it is partly offset by the shorter optimization horizon. For multi-pass batch SGD, with- and without-replacement sampling have identical approximation and GD bias/variance terms; they differ only in the fluctuation covariance prefactor, which is 1/B with replacement and ρN,B=(N-B)/(B(N-1)) without replacement. Hence without-replacement sampling is less noisy for B>1, and when B=N the fluctuation vanishes, recovering deterministic gradient descent. These results place batch size on the same theoretical footing as compute, data, and model dimension in sketched linear regression.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Or compile a full topic from this idea

Discussion (0)

Sign in to join the discussion.

Loading comments…