FastSLM: Hierarchical Temporal Abstraction for Efficient Long-Form Speech Adaptation
Abstract
Scaling Multimodal Large Language Models (MLLMs) to long-form speech is bottlenecked by the explosive growth of input tokens. Unlike images or videos, audio lacks overlapping information, making extreme 1-token compression highly susceptible to the loss of fine-grained acoustic cues. To overcome this, we propose FastSLM, a token-efficient architecture featuring the Hierarchical Temporal Abstractor (HTA). HTA progressively distills non-overlapping acoustic features across multiple temporal scales, achieving an extreme compression rate of 1.67 tokens per second a 97% reduction without losing critical context. Experimental results show that FastSLM achieves competitive performance with state-of-the-art models on long-form benchmarks despite operating with significantly fewer FLOPs and parameters. The source code and model checkpoints are available at https://anonymous.4open.science/r/FastSLM-8BD3.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.