Instruction Anchor: Dissecting the Mechanistic Dynamics of Modality Arbitration
Abstract
Modality following is the ability to selectively leverage multimodal contexts based on user instructions. It is fundamental to the safety and reliability of multimodal large language models (MLLMs) in real-world deployments. However, the internal mechanisms governing this decision-making process remain largely under-explored. In this work, we investigate the mechanism underlying modality following through an information flow perspective. Our findings reveal that instruction tokens serve as structural anchor for modality arbitration: Shallow attention layers perform undifferentiated information transfer, aggregating multimodal cues to instruction tokens as a latent buffer; in contrast, deep attention layers selectively strengthen the instruction-compliant subspace and resolve modality arbitration according to the instruction-specified intent, with a sparse subset of attention heads driving this process. Targeted attention-head interventions further validate the functional specificity of these heads: blocking only 5\% of the identified heads substantially degrades modality following while preserving general visual and language capabilities, whereas targeted amplification can restore failed modality-following samples by up to approximately 60\%. Together, this work provides a mechanistic account of modality following and informs future efforts to improve how MLLMs integrate and utilize multimodal evidence under user instructions.
Turn this paper into a lesson
ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.