Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers
Abstract
We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments, demonstrating consistent gains over both standard and specialized long-sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to enrich attention mechanisms with continuous-time dynamics and global coherence.
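To make the diffusion case concrete, the following is a minimal sketch of how attention weights might be evolved over pseudo-time. It is not taken from the paper: the explicit Euler discretization, the 1D Laplacian along the key axis, the reflecting boundary, and the names `diffuse_attention`, `steps`, and `alpha` are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def diffuse_attention(attn, steps=5, alpha=0.2):
    """Evolve attention rows with explicit Euler steps of a 1D heat equation.

    Assumed form (illustrative, not the paper's): dA/dt = alpha * d^2A/dj^2,
    where j indexes key positions. Reflecting boundaries keep each row's sum
    unchanged, and alpha <= 0.5 keeps every update a convex combination of
    neighbouring weights, so entries stay non-negative.
    """
    if not 0.0 < alpha <= 0.5:
        raise ValueError("alpha must lie in (0, 0.5] for a stable explicit step")
    a = np.array(attn, dtype=float)
    for _ in range(steps):
        # Replicate edge values so no mass leaks out at the sequence ends.
        padded = np.pad(a, ((0, 0), (1, 1)), mode="edge")
        laplacian = padded[:, :-2] - 2.0 * a + padded[:, 2:]
        a = a + alpha * laplacian
    return a


# Hypothetical toy input: 4 queries attending over 16 keys.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(16, 8))
attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
smoothed = diffuse_attention(attn, steps=5, alpha=0.2)
```

Under these assumptions each row of `smoothed` remains a probability distribution while sharp peaks are spread to neighbouring key positions, which is the smoothing behaviour described above. Wave or reaction-diffusion dynamics would need a different update rule and are not covered by this sketch.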