In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification
Abstract
While random demonstration labels barely hurt in-context learning (Min et al., 2022), we show that homogeneous labels--even semantically valid ones--collapse accuracy to <=12% across six models (Pythia, Llama, Qwen; 0.8B--8B) and four tasks. The trigger is label-slot content: the model treats tokens occupying the label position as an exhaustive answer vocabulary, with homogeneity as the maximally collapsed case. A novel set-level fixation finding confirms this: when demonstrations carry varied nonsense tokens from foo,bar,vex,nit,orb, the model places 42--67% of probability on the demonstrated set while P(dog) remains below 0.2%. This is inconsistent with latent-concept Bayesian accounts (Xie et al., 2022) and reveals that ICL output is constrained vocabulary retrieval--the model binds its output to the demonstrated token inventory regardless of semantic plausibility. The effect generalizes to 4-way classification (0% accuracy across three models, 1B--8B) and multi-token verbalizers ("very positive"), where we decompose fixation into format-level (template adoption) and content-level (polarity override) components that are experimentally dissociable. Mechanistically, per-item paired activation patching on Pythia-1B recovers 98.4% of the gap (95% CI [84%, 112%]), localizing fixation to a layer-7-centered circuit (rank 2/560, 99.8th percentile; 4-fold CV mean 103%). Cross-architecture logit lens on Llama-3.2-1B replicates the encode-then-override trajectory with causal confirmation (top-5 layers: 89% recovery).
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.