Listen and Chant Before You Read: The Ladder of Beauty in LM Pre-Training
Abstract
We show that pre-training a Transformer on music before language significantly accelerates language acquisition. Using piano performances (MAESTRO dataset), a developmental pipeline (music → poetry → prose) yields a 17.5% perplexity improvement over random initialization (p < 0.001, 5 seeds), with music and poetry improving orthogonal model components (internal computation and embeddings, respectively). Convergence tests confirm that this is not a transient head start: at d = 64, multi-seed validation (5 seeds) shows a persistent 5.5% gap at plateau (p = 0.017), with the pipeline converging faster and to a lower loss in every run. Real music matches the transfer ceiling of synthetic patterns with one-third the data, and scaling experiments reveal that the optimal pre-training data volume shifts with model capacity (the advantage of larger datasets moves from -3% to +3% to +6% as d grows from 16 to 64). Across the scales we study (d ∈ {16, 32, 64}, up to 400K parameters), these results suggest a capacity-dependent data curation principle and indicate that structured human creative outputs can provide an efficient pre-training substrate for small language models; stronger conclusions at modern pre-training scale will require substantially larger experiments.