MatchAttention: Embedding Explicit Matching Constraints into Attention for Efficient Stereo Matching
Abstract
Standard attention mechanisms are not well suited to stereo matching. Global attention scales quadratically and provides no explicit matching constraint, while local attention is efficient but loses long-range correspondences. We propose MatchAttention, an attention mechanism that embeds an explicit matching constraint into attention by treating the relative position between a query and its matched key as a learnable component of attention sampling. Centering a small contiguous sampling window on this learnable relative position enforces the matching constraint and supports long-range correspondence at strictly linear attention complexity. A differentiable contiguous attention sampling (CAS) operator enables sub-pixel accuracy, and cascaded MatchAttention blocks iteratively refine the relative positions through residual connections. We instantiate MatchAttention as a hierarchical coarse-to-fine stereo network with two variants: MatchAttentionXL targets accuracy, and MatchAttentionRT targets real-time edge inference. MatchAttentionXL achieves state-of-the-art accuracy on Middlebury V3 and top results across KITTI 2012/2015 and ETH3D. MatchAttentionRT runs in 9.3 ms on an RTX 4060 Ti and 79.1 ms on a Jetson Orin NX 16 GB at 1024 × 512, making it the first stereo model to deliver real-time edge inference without sacrificing zero-shot generalization. The code is available at https://github.com/TingmanYan/MatchAttention.
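To make the sampling idea concrete, the following is a minimal 1D NumPy sketch of attention over a small contiguous window centered on a per-query relative position. It is an illustration under stated assumptions, not the authors' implementation: the function name `match_attention_1d`, the `radius` parameter, the use of linear interpolation to read keys and values at fractional positions, and the clamping at the image border are all hypothetical choices made here. The actual CAS operator and the network that predicts and refines the relative positions are defined in the paper and the linked repository.

```python
import numpy as np

def match_attention_1d(q, k, v, rel_pos, radius=1):
    """Windowed attention centered on a per-query relative position (1D sketch).

    q, k: (W, C) query and key features along one scanline.
    v: (W, Cv) value features.
    rel_pos: (W,) float offset from each query to its estimated match.
    radius: half-width of the contiguous window (hypothetical parameter).
    """
    W, C = q.shape
    offsets = np.arange(-radius, radius + 1)  # window of 2 * radius + 1 taps
    # Fractional sampling positions: query index + relative position + tap offset.
    pos = np.arange(W)[:, None] + rel_pos[:, None] + offsets[None, :]
    pos = np.clip(pos, 0.0, W - 1.0)  # assumption: clamp at the borders

    # Linear interpolation keeps the samples differentiable in rel_pos.
    i0 = np.floor(pos).astype(int)
    i1 = np.minimum(i0 + 1, W - 1)
    frac = (pos - i0)[..., None]
    k_win = (1.0 - frac) * k[i0] + frac * k[i1]  # (W, K, C)
    v_win = (1.0 - frac) * v[i0] + frac * v[i1]  # (W, K, Cv)

    # Scaled dot-product attention restricted to the window.
    logits = np.einsum("wc,wkc->wk", q, k_win) / np.sqrt(C)
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    out = np.einsum("wk,wkc->wc", attn, v_win)
    return out, attn

rng = np.random.default_rng(0)
W, C = 8, 4
q = rng.standard_normal((W, C))
k = rng.standard_normal((W, C))
v = np.arange(W, dtype=float)[:, None]  # values encode their own position
rel_pos = np.full(W, 0.5)               # half-pixel shift for every query

# With a single-tap window the output is just the value read at x + 0.5.
out, attn = match_attention_1d(q, k, v, rel_pos, radius=0)
```

Each query touches only `2 * radius + 1` samples regardless of how far away its match lies, so the cost in this sketch is proportional to the number of queries times the window size, which is how a small window can cover long-range correspondences at linear cost.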