Inefficiency of K-FAC for Large Batch Size Training
Abstract
In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns, beyond a certain critical batch size. In the hopes of addressing this, it has been suggested that the Kronecker-Factored Approximate Curvature (K-FAC) method allows for greater scalability to large batch sizes, for non-convex machine learning problems such as neural network optimization, as well as greater robustness to variation in model hyperparameters. Here, we perform a detailed empirical analysis of large batch size training %of these two hypotheses, for both K-FAC and SGD, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that both K-FAC and SGD doesn't have ideal scalability behavior beyond a certain batch size, and that K-FAC does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that K-FAC, in addition to requiring more hyperparameters to tune, suffers from similar hyperparameter sensitivity behavior as does SGD. We discuss extensive results using ResNet and AlexNet on CIFAR-10 and SVHN, respectively, as well as more general implications of our findings.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.