Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

Abstract

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain [0,1]^d and on d-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for Transformers that builds local approximations of the target function and aggregates them into a global approximation via a softmax partition of unity. This approach leverages the attention mechanism to achieve spatial localization through affine transformations of the input, while the softmax activation aggregates the local approximations into a global output. From an approximation perspective, we prove that a dense Transformer equipped with only two encoder blocks and standard single-hidden-layer point-wise feed-forward networks can achieve a uniform approximation error of ε for α-Hölder continuous functions with α ∈ (0,1] using O(ε^{-d/α}) total parameters. Building upon this approximation guarantee, we establish a near minimax-optimal generalization error bound of order O(n^{-2α/(2α+d)} log n) for the empirical risk minimizer, where n is the training data size. The Transformer architecture studied in this paper is dense, shallow and wide, and employs softmax activation and sinusoidal positional encodings, closely reflecting practical implementations.
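To make the local-to-global idea concrete, the following is a minimal one-dimensional sketch in Python with NumPy. It is not the paper's Transformer construction, only a toy illustration under stated assumptions: the local approximations are constants f(c_k) at evenly spaced cell centers c_k, the softmax logits are the affine functions 2βc_k·x − βc_k² (which yield the same weights as −β(x − c_k)², because the shared −βx² term cancels inside the softmax), and the function names, the `sharpness` parameter, and the choice β = sharpness·K² are all hypothetical.

```python
import numpy as np

def softmax_pou_weights(x, centers, beta):
    """Softmax weights over cell centers; each row is nonnegative and sums to 1."""
    # Logits are affine in x (assumption of this sketch): 2*beta*c_k*x - beta*c_k**2.
    logits = 2.0 * beta * np.outer(x, centers) - beta * centers**2
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def local_to_global(f, x, K, sharpness=4.0):
    """Aggregate K local constant approximations f(c_k) into a global one."""
    centers = (np.arange(K) + 0.5) / K   # midpoints of K equal cells of [0, 1]
    beta = sharpness * K**2              # hypothetical scaling: weight width ~ 1/K
    weights = softmax_pou_weights(x, centers, beta)
    return weights @ f(centers)          # weighted sum of the local values

def target(x):
    # A 1/2-Hoelder continuous test function on [0, 1].
    return np.sqrt(np.abs(x - 0.5))

x = np.linspace(0.0, 1.0, 1001)
errors = {
    K: float(np.max(np.abs(target(x) - local_to_global(target, x, K))))
    for K in (8, 64)
}
print(errors)
```

In this toy setting the uniform error on the evaluation grid shrinks as the number of cells K grows, which mirrors, in a much simpler form, the trade-off between accuracy and parameter count stated in the abstract.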
