A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

Abstract

Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \left( \langle \langle \exp(A_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(A_{j_0} x), A_3 Y_{*,i_0} \rangle - b_{j_0,i_0} \right)^2$. Here $A \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $A_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $A$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. $B \in \mathbb{R}^{n \times d}$ is a given matrix, $b_{j_0,i_0} \in \mathbb{R}$ is its entry at the $j_0$-th row and $i_0$-th column, $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column vector of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to $\epsilon$ and runs in $O( ( \mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega} ) \log(1/\epsilon) )$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
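To make the objective concrete, the following is a minimal NumPy sketch that evaluates $L(X,Y)$ in two ways: block by block, following the definition with the Kronecker product, and in the equivalent matrix form $\| D^{-1} \exp(A_1 X A_2^\top) A_3 Y - B \|_F^2$, where $D$ holds the row sums. It only evaluates the loss and is not the paper's training algorithm. The function names `loss_blockwise` and `loss_matrix` are hypothetical, and the sketch assumes a row-major vectorization of $X$ paired with `np.kron(A1, A2)`; the paper's own convention may differ.

```python
import numpy as np


def loss_blockwise(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) term by term, following the definition."""
    n, d = A1.shape
    A = np.kron(A1, A2)      # shape (n*n, d*d)
    x = X.reshape(-1)        # row-major vectorization (an assumption)
    total = 0.0
    for j0 in range(n):
        A_j0 = A[j0 * n:(j0 + 1) * n, :]   # j0-th block, shape (n, d*d)
        u = np.exp(A_j0 @ x)               # exp(A_{j0} x), shape (n,)
        f = u / u.sum()                    # <u, 1_n>^{-1} u
        for i0 in range(d):
            total += (f @ (A3 @ Y[:, i0]) - B[j0, i0]) ** 2
    return float(total)


def loss_matrix(A1, A2, A3, B, X, Y):
    """Same loss in matrix form: row-wise softmax of A1 X A2^T."""
    S = np.exp(A1 @ X @ A2.T)                 # shape (n, n)
    P = S / S.sum(axis=1, keepdims=True)      # normalize each row
    return float(np.sum((P @ A3 @ Y - B) ** 2))


# Small random instance with A1 = A2 = A3, as in the layer interpretation.
rng = np.random.default_rng(0)
n, d = 4, 3
A_in = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
X = 0.1 * rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))

print(loss_blockwise(A_in, A_in, A_in, B, X, Y))
print(loss_matrix(A_in, A_in, A_in, B, X, Y))
```

The two printed values should agree up to floating-point error. The blockwise version materializes the $n^2 \times d^2$ Kronecker product and is only practical for tiny instances; the matrix form avoids it.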
