Emergent Hierarchical Structure in Large Language Models: An Information-Theoretic Framework for Multi-Scale Representation

Abstract

Why do language models from different architecture families respond so differently to the same perturbation? We argue that the answer is not scale, but how architecture shapes information compression. Analyzing eight Transformer models (7B--70B parameters) from the Llama and Qwen families, we show that every model spontaneously develops discrete functional boundaries dividing its layers into Local, Intermediate, and Global processing segments -- yet boundary locations and per-segment brittleness are determined overwhelmingly by architecture family rather than model size or training configuration. We formalize this regularity as the Multi-Scale Probabilistic Generation Theory (MSPGT), which models an autoregressive Transformer as a Hierarchical Variational Information Bottleneck system and derives a tiered set of falsifiable predictions. Three predictions are strongly confirmed: all eight models exhibit two prominent phase-transition boundaries (P1.1); Llama boundary positions are stable across a 10× parameter range (coefficient of variation, CV = 0.067--0.095) while Qwen positions vary widely (CV = 0.465--0.726), precisely matching our strong- and weak-dominance conditions; and cross-architecture local-segment brittleness spans three orders of magnitude (493× ratio) -- a gap that architecture family alone predicts and that dwarfs any within-family or scale-driven variation.
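The boundary-stability claim rests on the coefficient of variation (CV), the standard deviation of a quantity divided by its mean. The following is a minimal sketch of that statistic, not the authors' pipeline. It assumes boundary positions are expressed as relative depths (boundary layer index divided by total layer count) and uses the population standard deviation; the abstract specifies neither choice. The function name and the input lists are hypothetical, illustrative values, not data from the paper.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return CV = population standard deviation / mean.

    Assumption: population (not sample) standard deviation is used;
    the abstract does not say which one the authors chose.
    """
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return pstdev(values) / m


# Hypothetical relative boundary depths (boundary layer / total layers)
# for four models in each of two made-up families. Illustrative only.
stable_family = [0.30, 0.32, 0.28, 0.31]
variable_family = [0.10, 0.45, 0.25, 0.60]

cv_stable = coefficient_of_variation(stable_family)
cv_variable = coefficient_of_variation(variable_family)

print(f"stable family CV:   {cv_stable:.3f}")
print(f"variable family CV: {cv_variable:.3f}")
```

A low CV means the boundary sits at roughly the same relative depth across model sizes, which is the pattern the abstract reports for Llama. A high CV means the position moves with scale, as reported for Qwen.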
