Depth Separations in Neural Networks: What is Actually Being Separated?
Abstract
Existing depth separation results for constant-depth networks essentially show that certain radial functions in ℝ^d, which can be easily approximated with depth 3 networks, cannot be approximated by depth 2 networks, even up to constant accuracy, unless their size is exponential in d. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling polynomially with the dimension d (or equivalently, by scaling the function, the hardness result applies to O(1)-Lipschitz functions only when the target accuracy ε is at most poly(1/d)). In this paper, we study whether such depth separations might still hold in the natural setting of O(1)-Lipschitz radial functions, when ε does not scale with d. Perhaps surprisingly, we show that the answer is negative: in contrast to the intuition suggested by previous work, it is possible to approximate O(1)-Lipschitz radial functions with depth 2, size poly(d) networks, for every constant ε. We complement this by showing that approximating such functions is also possible with depth 2, size poly(1/ε) networks, for every constant d. Finally, we show that it is not possible to have polynomial dependence on both d and 1/ε simultaneously. Overall, our results indicate that in order to show depth separations for expressing O(1)-Lipschitz functions with constant accuracy (if at all possible), one would need techniques fundamentally different from the existing ones in the literature.
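The rescaling equivalence mentioned in the abstract can be spelled out in one line. The sketch below uses notation that is not in the abstract and is assumed only for illustration: f is the hard radial function, L = poly(d) is its Lipschitz constant, c > 0 is the constant accuracy that small depth 2 networks cannot reach, and ‖·‖ is whatever error norm the hardness result is stated in. It relies on the fact that multiplying a depth 2 network's output weights by L yields another depth 2 network of the same size.

```latex
% Assumed notation (not from the abstract): f hard radial function,
% L = poly(d) its Lipschitz constant, c > 0 a constant, N and N' depth-2
% networks of sub-exponential size, \|\cdot\| the approximation norm.
\[
  \|f - N\| \ge c \ \text{ for all such } N
  \quad\Longrightarrow\quad
  \Bigl\| \tfrac{f}{L} - N' \Bigr\|
  = \tfrac{1}{L}\,\bigl\| f - L N' \bigr\|
  \ge \tfrac{c}{L}
  = \tfrac{1}{\mathrm{poly}(d)}
  \ \text{ for all such } N'.
\]
```

Under these assumptions f/L is 1-Lipschitz, but the lower bound only bites at accuracy ε ≤ c/L, which shrinks polynomially with d. This is why the existing separations say nothing about O(1)-Lipschitz targets at constant ε.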