Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis

Abstract

Mel-spectrograms are widely used in zero-shot text-to-speech (TTS), but their inherent redundancy makes text-speech alignment inefficient. Compact VAE-based latent representations have emerged as a stronger alternative, yet they face an optimization dilemma: higher-dimensional latents improve reconstruction quality and speaker similarity but degrade intelligibility, while lower-dimensional latents improve intelligibility at the cost of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, which applies semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems and vanilla acoustic VAE baselines with improved training efficiency. Demo and code: https://zhikangniu.github.io/semantic-vae/
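
The abstract does not spell out the exact form of the semantic alignment regularizer, so the following is only a minimal sketch of one plausible reading: a standard VAE objective (reconstruction plus KL) with an extra term that pulls each latent frame toward a frame-level semantic feature via cosine similarity. The function names, the choice of cosine similarity, and the weights `kl_weight` and `align_weight` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def semantic_alignment_loss(projected_latents, semantic_features):
    """Hypothetical regularizer: mean (1 - cosine similarity) over frames.

    projected_latents: list of per-frame latent vectors, assumed to be
        already projected to the semantic feature dimension.
    semantic_features: list of per-frame target vectors, assumed to come
        from some pretrained speech encoder.
    """
    if len(projected_latents) != len(semantic_features):
        raise ValueError("latents and features must have the same number of frames")
    sims = [cosine_similarity(z, s) for z, s in zip(projected_latents, semantic_features)]
    return 1.0 - sum(sims) / len(sims)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)) for one diagonal Gaussian frame."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def total_loss(recon_loss, kl_loss, align_loss, kl_weight=1e-2, align_weight=1.0):
    """Weighted sum of the three terms; both weights are illustrative defaults."""
    return recon_loss + kl_weight * kl_loss + align_weight * align_loss
```

Under this reading, the alignment term is zero when every latent frame points in the same direction as its semantic target, and grows as the two diverge, which is one way a high-dimensional latent could be encouraged to retain semantic structure without shrinking its dimensionality.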
