Task-Vector Arithmetic for Emotional Expressivity Control in Language-Model-Based Text-to-Speech
Abstract
We investigate whether task-vector arithmetic, which has proven successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study on Qwen3-TTS-12Hz-1.7B over four progressively narrower operands (model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding, or x-vector, produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone), we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i, \mathrm{emo})] - \mathbb{E}_i[x(s_i, \mathrm{neutral})]$ is applied to an unseen target speaker as $x_{\mathrm{new}} = x(\mathrm{target}, \mathrm{neutral}) + \alpha \cdot \tau$. Using ESD (English) as the source of $\tau$ and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains over the ICL baseline of +0.29 in emotion2vec cosine similarity on held-out English speakers and +0.09 on held-out Brazilian Portuguese speakers, while largely preserving identity (WavLM SECS of 0.88 for the multi-speaker $\tau$ variant) and intelligibility (WER ≈ 0 in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic-style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
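The centroid arithmetic described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 8-dimensional toy embeddings, and the random data are all assumptions made for illustration. In practice the vectors would come from the ECAPA-TDNN speaker encoder, and the abstract does not say whether the shifted embedding is renormalized, so no renormalization is applied here.

```python
import numpy as np


def emotion_direction(emo_xvecs, neutral_xvecs):
    """Emotion direction: centroid of emotional x-vectors minus centroid of neutral ones.

    Each argument is an array of shape (num_source_speakers, dim).
    """
    emo_xvecs = np.asarray(emo_xvecs, dtype=np.float64)
    neutral_xvecs = np.asarray(neutral_xvecs, dtype=np.float64)
    return emo_xvecs.mean(axis=0) - neutral_xvecs.mean(axis=0)


def apply_direction(x_target_neutral, tau, alpha=1.0):
    """Shift a target speaker's neutral x-vector by alpha times the emotion direction."""
    return np.asarray(x_target_neutral, dtype=np.float64) + alpha * tau


# Toy stand-ins for real speaker embeddings (hypothetical dimension of 8).
rng = np.random.default_rng(0)
dim = 8
neutral = rng.normal(size=(10, dim))        # ten source speakers, neutral speech
emo = neutral + rng.normal(loc=0.5, scale=0.1, size=(10, dim))  # same speakers, emotional speech

tau = emotion_direction(emo, neutral)       # one direction shared across speakers
x_target = rng.normal(size=dim)             # unseen target speaker, neutral speech
x_new = apply_direction(x_target, tau, alpha=0.5)
```

Here `alpha` plays the role of the intensity knob: `alpha=0.0` returns the target speaker's neutral embedding unchanged, and larger values move it further along the emotion direction.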