A Formal Perspective on Byte-Pair Encoding

Abstract

Byte-Pair Encoding (BPE) is a popular algorithm for tokenizing data in NLP, despite having been devised initially as a compression method. At face value BPE appears to be a greedy algorithm, but the underlying optimization problem that it seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\mu^\star)}\left(1 - e^{-\sigma(\mu^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\mu^\star)$ is the total backward curvature with respect to the optimal merge sequence $\mu^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(NM)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
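To make the greedy procedure concrete, the following is a minimal sketch of the textbook version of BPE, not the paper's own implementation. It assumes character-level starting symbols, counts adjacent pairs with overlaps, breaks ties by first occurrence, and applies each merge left to right; the function names `apply_merge` and `greedy_bpe` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # fuse the two symbols into one
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Greedily pick the most frequent adjacent pair, num_merges times.

    Assumption: ties are broken by first occurrence, which is what
    Counter.most_common yields on an insertion-ordered dict.
    """
    seq = list(text)  # assumption: start from single characters
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:  # fewer than two symbols left, nothing to merge
            break
        pair, _ = counts.most_common(1)[0]
        merges.append(pair)
        seq = apply_merge(seq, pair)  # one full pass over the sequence
    return merges, seq


merges, seq = greedy_bpe("aaabdaaabac", 3)
print(merges)  # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(seq)     # ['aaab', 'd', 'aaab', 'a', 'c']
```

Each of the M iterations rescans the whole sequence of length at most N, so this sketch has the naive O(NM) cost mentioned in the abstract; it does not include the faster variant or the memoized brute-force search.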
