On the Emergence of Linear Analogies in Word Embeddings

Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability P(i,j) of words i and j in text corpora. The resulting vectors W_i not only group semantically similar words but also exhibit a striking linear analogy structure (for example, W_king - W_man + W_woman ≈ W_queen) whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix M(i,j) = P(i,j)/(P(i)P(j)), (ii) strengthens and then saturates as more eigenvectors of M(i,j) are included, where the number of eigenvectors sets the dimension of the embeddings, (iii) is enhanced when using log M(i,j) rather than M(i,j), and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It also offers a fine-grained view of the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and with the analogy benchmark introduced by Mikolov et al.
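As a rough illustration of how binary attributes can yield linear analogies, the following is a minimal sketch under stated assumptions, not the paper's exact model. It assumes a toy form log M(i,j) = sum_k s_k a_i[k] a_j[k], where a_i[k] in {-1, +1} is attribute k of word i and s_k is a hypothetical attribute strength. The function names, the strengths, and the "royal"/"male" attribute labels are all invented for illustration. Embeddings are taken from the top eigenvectors of this matrix, scaled by the square roots of their eigenvalues.

```python
import itertools
import numpy as np

def attribute_words(d):
    """Return all 2**d 'words', each a row of d binary attributes in {-1, +1}."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=d)))

def log_m(attrs, strengths):
    """Assumed toy model: log M(i,j) = sum_k s_k * a_i[k] * a_j[k]."""
    return (attrs * strengths) @ attrs.T

def embed(sym, dim):
    """Embed with the top `dim` eigenvectors, each scaled by sqrt(eigenvalue)."""
    vals, vecs = np.linalg.eigh(sym)          # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dim]        # indices of the largest ones
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def index_of(attrs, pattern):
    """Row index of the word whose attributes equal `pattern`."""
    return int(np.flatnonzero((attrs == np.asarray(pattern)).all(axis=1))[0])

d = 4
strengths = np.array([1.0, 0.8, 0.6, 0.4])    # hypothetical attribute strengths
attrs = attribute_words(d)
W = embed(log_m(attrs, strengths), dim=d)

# Attribute 0 plays the role of "royal", attribute 1 of "male" (illustrative labels).
king = index_of(attrs, [1, 1, 1, 1])
man = index_of(attrs, [-1, 1, 1, 1])
woman = index_of(attrs, [-1, -1, 1, 1])
queen = index_of(attrs, [1, -1, 1, 1])

target = W[king] - W[man] + W[woman]
err = np.linalg.norm(target - W[queen]) / np.linalg.norm(W[queen])
nearest = int(np.argmin(np.linalg.norm(W - target, axis=1)))
print(f"relative analogy residual: {err:.2e}")
print("nearest word is queen:", nearest == queen)
```

In this assumed model, log M is exactly bilinear in the attributes, so with dim equal to the number of attributes the analogy residual vanishes up to floating-point error. Passing np.exp(log_m(attrs, strengths)) to embed instead, or lowering dim, is one way to explore properties (ii) and (iii) within the same toy setting.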
