Hallucination Is Linearly Decodable from Mid-Layer Hidden States in Quantized LLMs

Abstract

We investigate whether open-source LLMs encode a linearly separable truthfulness signal in their hidden states, and at which network depth this signal is strongest. Across three 7B–8B instruction-tuned models (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B) loaded in 4-bit NF4 quantization, we extract per-layer hidden states on four hallucination benchmarks (TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set) and compare four detection approaches: linear and MLP probes, INSIDE EigenScore, self-consistency, and attention entropy. A linear probe on a single mid-network layer achieves 0.904–1.000 AUROC on held-out splits, while sampling-based detectors do not exceed 0.541 AUROC under the same protocol. The truthfulness signal is approximately linear: MLP probes rarely surpass linear probes by more than 0.01 AUROC. Peak probing layers fall in a consistent band across model families on natural-language benchmarks: blocks 13–18 of 32 for Llama and Mistral, and blocks 19–25 of 28 for Qwen. First-block attention entropy provides a complementary signal in knowledge-grounded settings (0.866–0.941 AUROC on HaluEval-QA) at no additional inference cost. The low discriminability of sampling methods under this protocol reflects a structural mismatch between paired-label evaluation and the information these methods access, rather than an inherent limitation of those methods. Code and data are released for full reproducibility on a single 8 GB GPU.
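To make the probing setup concrete, the following is a minimal sketch of a single-layer linear probe scored by held-out AUROC. It is not the released code. The hidden states are replaced by a synthetic stand-in (Gaussian features with one label-dependent direction), and the helper names `auroc`, `fit_linear_probe`, and `make_synthetic_layer`, along with the learning rate, step count, regularization strength, and split size, are illustrative assumptions rather than settings taken from the paper.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def fit_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic-regression probe trained by full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)      # clip logits for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))
        grad = p - y                          # d(loss)/d(logit) per example
        w -= lr * (X.T @ grad / len(y) + l2 * w)
        b -= lr * grad.mean()
    return w, b

def make_synthetic_layer(n=400, dim=64, shift=1.5, seed=0):
    """Hypothetical stand-in for one layer's hidden states: isotropic noise
    plus a single direction whose sign depends on the label."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    X = rng.normal(size=(n, dim)) + shift * np.outer(2 * y - 1, direction)
    return X, y

X, y = make_synthetic_layer()
split = 300                                   # assumed train/held-out split
mu, sigma = X[:split].mean(0), X[:split].std(0) + 1e-8
X_train = (X[:split] - mu) / sigma            # standardize with train statistics only
X_test = (X[split:] - mu) / sigma
w, b = fit_linear_probe(X_train, y[:split])
test_auroc = auroc(X_test @ w + b, y[split:])
print(f"held-out AUROC: {test_auroc:.3f}")
```

In an actual replication, `X` would presumably hold one hidden-state vector per example from a chosen transformer block, and the same fit would be repeated per block to locate the peak layer.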
