HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector
Abstract
Many visual requests ("the object to open this bottle", "the person not wearing a helmet") require reasoning, not just category matching. Pure open-vocabulary detectors need an explicit phrase; vision-language models (VLMs) can reason yet "see but mis-speak", attending to the right region but returning the wrong box or label. We argue this is a binding failure: in coordinate-as-text VLMs localization passes through the autoregressive head, coupling it to language generation; in two-stage pipelines the model's intent is squeezed through a single class string. We present HKVLM, which removes localization from the language path. A frozen, language-aligned detector emits class-agnostic region proposals; a frozen language model encodes reasoning instructions as referential query embeddings; a lightweight alignment hook binds queries to regions by contrastive retrieval and bipartite assignment in a shared embedding space. A perception-grounded faithfulness veto forbids naming an object that no region supports. Only the hook is trained, targeting small-data cold-start settings where monolithic VLM tuning struggles. We formalize a say-vs-see decomposition separating localization error (SeeErr) from binding error (SayErr), and evaluate on RefCOCO/RefCOCO+/RefCOCOg and POPE. With frozen Grounding DINO and Qwen2.5-VL, training only the hook lifts grounding accuracy by 50–90× over untrained cross-space matching; the faithfulness veto raises POPE accuracy from near-chance (0.50) to 0.66–0.76 and reduces hallucination from 0.99 to 0.23–0.43, with gains from 200 expressions. Increasing proposals from M=50 to M=300 improves grounding by 19–24% without retraining, confirming that residual error is perceptual (SeeErr) rather than binding (SayErr).
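To make the binding step concrete, the following is a minimal toy sketch of assigning query embeddings to region embeddings by similarity, followed by a veto when no region supports a query. It is not the authors' implementation. The cosine similarity, the fixed `veto_threshold`, and the function name `bind_queries` are all illustrative assumptions, and the brute-force search over assignments stands in for a proper bipartite solver (such as the Hungarian algorithm), so it is only practical for a handful of regions. The abstract does not specify how the trained hook or the veto criterion are actually defined.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def bind_queries(queries, regions, veto_threshold=0.5):
    """Bind each query to a distinct region index, or None if vetoed.

    Hypothetical helper: exhaustively searches one-to-one assignments
    (a toy replacement for a bipartite matching solver), then rejects any
    pairing whose similarity falls below the assumed veto_threshold.
    """
    if len(queries) > len(regions):
        raise ValueError("need at least as many regions as queries")

    # Similarity matrix in the (assumed) shared embedding space.
    sim = [[cosine(q, r) for r in regions] for q in queries]

    # Pick the one-to-one assignment with the highest total similarity.
    best_total, best_perm = -math.inf, ()
    for perm in permutations(range(len(regions)), len(queries)):
        total = sum(sim[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm

    # Veto: a query with no sufficiently similar region gets no binding.
    return [j if sim[i][j] >= veto_threshold else None
            for i, j in enumerate(best_perm)]


# Toy data: three region embeddings, two query embeddings.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
queries = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

bindings = bind_queries(queries, regions)
print(bindings)  # [0, None]
```

In this toy case the first query binds to region 0, while the second query is orthogonal to every region and is vetoed rather than forced onto an unsupported box, which mirrors the intent of the faithfulness veto described above.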