TTF: Temporal Token Fusion for Efficient Video-Language Model
Abstract
Video-language models (VLMs) face rapidly growing inference costs as visual token counts scale with video length. For example, 32 frames at 448×448 resolution already yield more than 8,000 visual tokens in Qwen3-VL, making LLM prefill the dominant throughput bottleneck. Existing methods often rely on global similarity or attention-guided compression, incurring overhead that offsets their gains. We propose Temporal Token Fusion (TTF), a training-free, plug-and-play pre-LLM token compression framework that exploits structured temporal redundancy in video. TTF automatically selects an anchor frame and then, for each subsequent frame, performs a local window similarity search (e.g., 3×3), fusing tokens whose similarity exceeds a threshold. The compressed sequence maintains positional consistency across both prefill and decoding through coordinate realignment, enabling seamless integration with existing VLM pipelines. On Qwen3-VL-8B with threshold t=0.70, TTF removes about 67% of visual tokens while retaining 99.5% of the baseline accuracy and introducing only ≈0.16 GFLOPs of matching overhead. Overall, TTF offers a practical, efficient solution for video understanding. The code is available at https://github.com/Cominder/ttf
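The local window matching described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function name `temporal_token_fusion` is hypothetical, the anchor is fixed to frame 0 rather than selected automatically, similarity is taken to be cosine similarity, and a "fused" token is simply dropped and represented by the anchor token it matched. The original (t, y, x) coordinates of the surviving tokens are returned so that position information could be rebuilt from them, which is one plausible reading of the coordinate realignment step.

```python
import numpy as np

def temporal_token_fusion(tokens, threshold=0.70, window=3, anchor_idx=0):
    """Illustrative sketch: tokens has shape (T, H, W, D).

    Returns (kept_tokens, kept_coords, keep_mask).
    Assumptions: fixed anchor frame, cosine similarity, drop-on-match fusion.
    """
    T, H, W, _ = tokens.shape
    r = window // 2
    # Unit-normalize so that dot products are cosine similarities.
    normed = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    # Pad the anchor grid so every token has a full window of neighbors.
    pad_anchor = np.pad(normed[anchor_idx], ((r, r), (r, r), (0, 0)))
    # False marks padded positions that lie outside the real grid.
    pad_valid = np.pad(np.ones((H, W), dtype=bool), r)

    keep = np.ones((T, H, W), dtype=bool)
    for t in range(T):
        if t == anchor_idx:
            continue  # the anchor frame is always kept in full
        best = np.full((H, W), -np.inf)
        for dy in range(2 * r + 1):
            for dx in range(2 * r + 1):
                neigh = pad_anchor[dy:dy + H, dx:dx + W]
                valid = pad_valid[dy:dy + H, dx:dx + W]
                sim = np.sum(normed[t] * neigh, axis=-1)
                best = np.maximum(best, np.where(valid, sim, -np.inf))
        # A token is fused (dropped) when its best local match exceeds the threshold.
        keep[t] = best <= threshold

    coords = np.argwhere(keep)  # rows of original (t, y, x) positions
    return tokens[keep], coords, keep

# Toy input: frame 1 is frame 0 shifted one cell to the right, frame 2 is unrelated.
rng = np.random.default_rng(0)
frame0 = rng.standard_normal((4, 4, 64))
video = np.stack([frame0, np.roll(frame0, 1, axis=1), rng.standard_normal((4, 4, 64))])
kept, coords, keep = temporal_token_fusion(video)
print(keep.sum(axis=(1, 2)))  # tokens kept per frame
```

In the toy input, most of frame 1 finds an exact match one cell away in the anchor and is dropped, while the unrelated frame 2 is kept in full. The window search costs a constant number of comparisons per token, in contrast to a global search whose cost grows with the number of tokens per frame.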