Mathematical Models of Computation in Superposition
Abstract
Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies representational superposition, where superposition is only used when passing information through bottlenecks. In this work, we present mathematical models of computation in superposition, where superposition is actively helpful for efficiently accomplishing the task. We first construct a task of efficiently emulating a circuit that takes the AND of the $\binom{m}{2}$ pairs of each of $m$ features. We construct a 1-layer MLP that uses superposition to perform this task up to $\varepsilon$-error, where the network only requires $O(m^{2/3})$ neurons, even when the input features are themselves in superposition. We generalize this construction to arbitrary sparse boolean circuits of low depth, and then construct ``error correction'' layers that allow deep fully-connected networks of width $d$ to emulate circuits of width $O(d^{1.5})$ and any polynomial depth. We conclude by providing some potential applications of our work for interpreting neural networks that implement computation in superposition.
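
To make the target task concrete, the following is a minimal sketch of the circuit being emulated: the AND of every unordered pair of m boolean features. It is only a reference specification plus a naive baseline that spends one ReLU neuron per pair; it is not the superposition construction from the paper. The function names and the example input are hypothetical, chosen for illustration.

```python
from itertools import combinations
from math import comb


def relu(x: float) -> float:
    """Standard ReLU nonlinearity."""
    return max(0.0, x)


def pairwise_and_reference(features):
    """Target circuit: AND of every unordered pair of boolean features."""
    return {
        (i, j): features[i] & features[j]
        for i, j in combinations(range(len(features)), 2)
    }


def pairwise_and_naive_relu(features):
    """Naive baseline (an assumption for illustration, not the paper's
    construction): one neuron per pair, computing ReLU(x_i + x_j - 1),
    which equals AND(x_i, x_j) on inputs in {0, 1}."""
    return {
        (i, j): int(relu(features[i] + features[j] - 1))
        for i, j in combinations(range(len(features)), 2)
    }


# Hypothetical boolean input with m = 5 features.
features = [1, 0, 1, 1, 0]
reference = pairwise_and_reference(features)
naive = pairwise_and_naive_relu(features)
```

The naive baseline needs as many neurons as there are pairs, which grows quadratically in m. The abstract's claim is that a 1-layer MLP using superposition can approximate the same outputs with far fewer neurons.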