Before and After Temperature: A Distributional View of Creative LLM Generation
Abstract
Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature reshapes the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at T ∈ {0.3, 0.8, 1.5}, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman ρ=0.918 against an averaged gpt-4o / gemini-2.5-pro judge (n=500) and ρ=0.870 against a three-rater human-majority ranking (n=150). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at |ρ| ≈ 0.76 on both ground truths: a gap of +0.165 on averaged-LLM and +0.110 on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at ρ=0.83, above the inter-human ceiling of ρ=0.77, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at T=1.5 the cumulative-mass width n95(q) inflates from 1 to 131 tokens and post-temperature mass leaks off the pre-temperature top-90% plausible set by about 13 percentage points. The per-token aggregates do not separate T=0.8 from T=0.3; discriminating the two coherent regimes is left to sequence-level features.
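The two distributional quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so the following are assumptions: n95(q) is taken to be the smallest number of tokens whose cumulative post-temperature probability reaches 0.95, and the "leak" is taken to be the post-temperature mass that falls outside the smallest pre-temperature (T=1) token set covering 90% of the mass. The function names and the toy logits are hypothetical and are not the paper's code or data.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def n_mass(probs, mass=0.95):
    """Smallest number of tokens whose cumulative probability reaches `mass`
    (assumed reading of n95 when mass=0.95)."""
    cum = np.cumsum(np.sort(probs)[::-1])
    k = int(np.searchsorted(cum, mass - 1e-12)) + 1
    return min(k, len(probs))

def leak(logits, temperature, mass=0.90):
    """Post-temperature mass outside the pre-temperature top-`mass` set
    (assumed reading of the 'leak' described in the abstract)."""
    p = softmax(logits, 1.0)          # pre-temperature distribution
    q = softmax(logits, temperature)  # post-temperature distribution
    order = np.argsort(p)[::-1]
    keep = order[: n_mass(p, mass)]   # plausible set defined on p
    return float(1.0 - q[keep].sum())

# Toy example: one strongly preferred token in a 1000-token vocabulary.
toy_logits = np.zeros(1000)
toy_logits[0] = 10.0

for T in (0.3, 0.8, 1.5):
    q = softmax(toy_logits, T)
    print(T, n_mass(q, 0.95), round(leak(toy_logits, T), 4))
```

Under these assumed definitions the leak at T=1 is at most 0.10 by construction, so values well above that indicate mass moving onto tokens the unscaled model considered implausible. Per-token values like these would still need to be aggregated over a generation before being compared with a creativity ranking; the aggregation used in the paper is not specified in the abstract.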