The Affective Bridge: Preserving Speech Representations while Enhancing Deepfake Detection via Emotional Constraints
Abstract
Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Prior work has shown that affective cues improve DFD, yet existing approaches either fuse emotion with other task-specific features in complex pipelines or directly fine-tune representations toward DFD objectives, risking distortion of the original speech representations that support downstream tasks such as speaker verification (SV) or automatic speech recognition (ASR). We propose a simpler approach: fine-tuning speech encoders on emotion recognition alone, without any DFD supervision, and then training a lightweight support vector machine (SVM) on the frozen emotion-tuned representations for DFD. This preserves the original representation capacity for downstream tasks such as SV and ASR, while emergently improving DFD performance. Crucially, we find that emotion is uniquely effective as this bridging task: replacing it with speaker identity even degrades DFD performance, demonstrating that the benefit stems from emotion's role as a natural bridge between speech representation and DFD. Experiments on FakeOrReal and In-the-Wild show accuracy improvements of up to 6% and 2% with corresponding EER reductions, while analysis on ASVspoof 2019 LA reveals dataset-specific speaker bias in the real-speech subset. Code is available in the supplementary materials.
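The abstract reports results as accuracy and equal error rate (EER) but does not say how EER is computed. As a minimal sketch, assuming the common definition (the operating point where the false acceptance rate equals the false rejection rate), EER can be estimated from per-utterance detector scores as below. The function name `equal_error_rate`, the label convention (1 = positive class, 0 = negative class), and the threshold sweep are illustrative assumptions, not the authors' evaluation code.

```python
def equal_error_rate(scores, labels):
    """Estimate EER from detector scores (hypothetical helper, not from the paper).

    scores: higher value means "more likely positive class".
    labels: 1 for the positive class, 0 for the negative class.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every observed score, plus +inf (reject everything).
    thresholds = sorted(set(scores)) + [float("inf")]
    for t in thresholds:
        far = sum(s >= t for s in neg) / len(neg)  # negatives accepted as positive
        frr = sum(s < t for s in pos) / len(pos)   # positives rejected
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores only; these are not results from the paper.
print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

In the setting the abstract describes, the scores would presumably be the SVM's decision values on held-out utterances; that pairing is an assumption made here for illustration.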