OneLatent: Single-Token Compression for Visual Latent Reasoning

Haoxiang Shi

Abstract

Chain-of-thought (CoT) prompting improves reasoning but often increases inference cost by one to two orders of magnitude. To address this cost, we present OneLatent, a framework that compresses intermediate reasoning into a single latent token via supervision from rendered CoT images and DeepSeek-OCR hidden states. By rendering textual steps into images, we obtain a deterministic supervision signal that can be inspected and audited without requiring the model to output verbose textual rationales. Across benchmarks, OneLatent reduces average output length by 11× with only a 2.21% average accuracy drop relative to textual CoT, while improving output token contribution (OTC) by 6.8×. On long-chain logical reasoning, OneLatent reaches 99.80% on ProntoQA and 97.80% on ProsQA with one latent token, with compression up to 87.4×, supporting compression-constrained generalization.
