Identified-Set Geometry of Distributional Model Extraction under Top-K Censored API Access
Abstract
Modern LLM APIs often reveal only top-K logit scores and censor the remaining vocabulary. We study the per-position distribution-recovery limits of this access model. For censoring threshold τ, the compatible teacher distributions form an identified set whose total-variation diameter is exactly $U_K = (V-K)e^{\tau} / (Z_A + (V-K)e^{\tau})$, where $Z_A$ is the observed partition function. For KL recovery, we give a computable binary-endpoint lower bound and an asymptotically matching small-ambiguity upper bound, with an extension to reference-aware attackers. Experiments on a Qwen3 math-reasoning teacher reveal a layered extraction hierarchy: on-task top-K distillation recovers 12% of private capability, full-logit distillation recovers 56% despite 99% KL closure, and generation-based extraction recovers 96%. Top-K censoring therefore limits per-position distribution recovery but does not by itself prevent capability extraction, separating fidelity from transfer in prompt-only logit distillation.
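As a rough illustration of the total-variation diameter quoted in the abstract, the sketch below evaluates it for a single position. It rests on assumptions not spelled out in the abstract: that the censored term is exp(τ) (each of the V−K hidden logits is at most τ), that V is the vocabulary size, and that Z_A is the sum of exp(logit) over the K revealed tokens. The function name `tv_diameter` and its parameters are hypothetical, chosen for this example only.

```python
import math


def tv_diameter(topk_logits, vocab_size, tau):
    """Evaluate U_K = (V-K)*exp(tau) / (Z_A + (V-K)*exp(tau)).

    Assumption: Z_A is the sum of exp(logit) over the K revealed logits,
    and every censored logit is bounded above by tau.
    """
    k = len(topk_logits)
    if not 0 < k <= vocab_size:
        raise ValueError("need 0 < K <= V")
    if k == vocab_size:
        return 0.0  # nothing is censored, so no ambiguity remains

    # Subtract a common shift before exponentiating to avoid overflow;
    # the shift cancels in the ratio.
    shift = max(max(topk_logits), tau)
    z_a = sum(math.exp(x - shift) for x in topk_logits)
    hidden = (vocab_size - k) * math.exp(tau - shift)
    return hidden / (z_a + hidden)


# Two revealed logits at 0, two censored logits bounded by tau = 0.
print(tv_diameter([0.0, 0.0], vocab_size=4, tau=0.0))  # 0.5
```

Under these assumptions the value shrinks toward 0 as τ falls far below the revealed logits, and grows toward 1 as the censored tail is allowed to carry more mass.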