Probing and Leveraging Video Diffusion Transformer Features for Robust Point Tracking
Abstract
Despite achieving strong results on standard benchmarks, current point tracking methods rely on feature backbones that are rarely designed with the temporal coherence needed for robust real-world performance. While recent works incorporate powerful visual foundation model (VFM) features into tracking pipelines, no prior work has systematically analyzed which VFM provides the most robust representations for point tracking. We present the first such analysis, evaluating diverse VFMs in a zero-shot setting on both standard and robustness benchmarks for point tracking. Our study reveals that video diffusion transformers (DiTs) consistently yield the most temporally coherent and discriminative features, even surpassing ResNet backbones explicitly supervised on tracking data. We hypothesize that this advantage stems from large-scale video pretraining, full 3D spatio-temporal attention, and a diffusion training objective. Motivated by this finding, we propose DiTracker, which integrates video DiT features into existing tracking frameworks through query-key matching cost computation, cost-level fusion with a lightweight ResNet branch, and LoRA adaptation. Under the same tracking head, DiTracker is trained solely on synthetic data with far fewer iterations, yet outperforms CoTracker3 trained with additional real-world videos, with the largest gains under challenging and corrupted scenarios. It further generalizes across tracking heads and scales with backbone size, confirming that generative video pretraining provides real-world priors that reduce the dependence on large-scale real-data supervision.
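To make the matching-cost and cost-level fusion ideas concrete, the following is a minimal illustrative sketch, not the DiTracker implementation. It assumes that a "query" is the feature vector at the tracked point and the "keys" are per-pixel features of a target frame, that the cost is a cosine similarity, and that fusion is a fixed convex combination of two cost maps. The abstract does not specify any of these details; the function names, feature dimensions, and the `weight` parameter are hypothetical, and random arrays stand in for DiT and ResNet features.

```python
import numpy as np

def matching_cost(query_feat, key_feats):
    """Cosine-similarity cost map (assumed form of the matching cost).

    query_feat: (C,) feature vector at the query point.
    key_feats:  (H, W, C) feature map of a target frame.
    Returns an (H, W) map where higher values mean a better match.
    """
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    k = key_feats / (np.linalg.norm(key_feats, axis=-1, keepdims=True) + 1e-8)
    return k @ q

def fuse_costs(cost_a, cost_b, weight=0.5):
    """Hypothetical cost-level fusion: convex combination of two cost maps."""
    return weight * cost_a + (1.0 - weight) * cost_b

def argmax_location(cost):
    """Return the (row, col) position of the highest-scoring match."""
    return tuple(int(i) for i in np.unravel_index(np.argmax(cost), cost.shape))

# Random stand-ins for the two feature branches (not real model outputs).
rng = np.random.default_rng(0)
H, W = 8, 8
dit_keys = rng.normal(size=(H, W, 16))  # placeholder for video DiT features
cnn_keys = rng.normal(size=(H, W, 4))   # placeholder for lightweight ResNet features

# Take the query features from a known location so the match is verifiable.
target = (5, 2)
dit_query = dit_keys[target].copy()
cnn_query = cnn_keys[target].copy()

fused = fuse_costs(
    matching_cost(dit_query, dit_keys),
    matching_cost(cnn_query, cnn_keys),
)
best = argmax_location(fused)
```

The point of fusing at the cost level in this sketch is that the two branches can have different channel widths (16 and 4 here), since only their `(H, W)` cost maps are combined rather than the features themselves.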