Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

Abstract

Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling an S3M-based language identification system from 126 to 4,017 languages reshapes this topology, and find a non-linear effect: phylogenetic recovery stays flat up to the 1K scale, but the 4K model undergoes a qualitative shift, resolving both clear lineages and long-term linguistic contact. Most strikingly, a robust Pacific macro-cluster emerges, grouping genealogically unrelated Papuan, Oceanic, and Australian languages, and we trace its driver to a concentrated encoding that captures shared acoustic signatures such as global energy dynamics. These results suggest that massive S3Ms internalize multiple layers of language history, offering a promising perspective for computational phylogenetics and the study of language contact.
