On the Local Minima of the Empirical Risk
Abstract
Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex nonsmooth losses (such as modern deep networks), the population risk is generally significantly more well-behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function F (population risk) given only access to an approximation f (empirical risk) that is pointwise close to F (i.e., \|F-f\|∞ ). Our objective is to find the ε-approximate local minima of the underlying function F while avoiding the shallow local minima---arising because of the tolerance ---which exist only in f. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f that is guaranteed to achieve our goal as long as O(ε1.5/d). We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance among all algorithms making a polynomial number of queries of f. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.