Tight Sample Complexity of Transformers

Abstract

We tightly characterize the VC dimension of depth-L Transformers with W parameters in total that map an input sequence of length T to a single output, establishing an upper bound of O(L W log(T W)) and a nearly matching lower bound of Ω(L W log(T W / L)). We also tightly characterize the sample complexity of chain-of-thought learning with such a Transformer. Teacher forcing (i.e., selecting a predictor consistent with the entire chain of thought on the training data) learns with sample complexity O(L W log((T + T) W)), and any learning rule that uses chain-of-thought data requires at least Ω(L W log((T + T) W / L)) examples, where the two terms of the sum T + T are the input length and the number of autoregressive steps, respectively.
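To give a feel for what "nearly matching" means, the following minimal sketch evaluates the two VC-dimension expressions and their ratio. It assumes the bounds are read as L W log(T W) and L W log(T W / L), drops all constants hidden by the O and Ω notation, and uses hypothetical function names and parameter values that do not come from the paper.

```python
import math


def vc_upper(L, W, T):
    """Upper-bound expression L * W * log(T * W), constants dropped (assumed form)."""
    return L * W * math.log(T * W)


def vc_lower(L, W, T):
    """Lower-bound expression L * W * log(T * W / L), constants dropped (assumed form)."""
    return L * W * math.log(T * W / L)


def gap_ratio(L, W, T):
    """Ratio of the two expressions, which simplifies to log(T W) / log(T W / L)."""
    return vc_upper(L, W, T) / vc_lower(L, W, T)


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    L, W, T = 12, 10**8, 1024
    print(f"upper expression: {vc_upper(L, W, T):.3e}")
    print(f"lower expression: {vc_lower(L, W, T):.3e}")
    print(f"ratio:            {gap_ratio(L, W, T):.3f}")
```

Under this reading the two expressions differ only inside the logarithm, so the ratio stays close to 1 whenever L is small relative to T W.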
