The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models
Abstract
Human speech perception is multimodal. In natural speech, lip movements can precede corresponding voicing by a non-negligible gap of 100-300 ms, especially for specific consonants, affecting the time course of neural phonetic encoding in human listeners. However, it remains unexplored whether self-supervised learning models, which have been used to simulate audio-visual integration in humans, can capture this asynchronicity between audio and visual cues. We compared AV-HuBERT, an audio-visual model, with audio-only HuBERT, by using linear classifiers to track their phonetic decodability over time. We found that phoneme information becomes available in AV-HuBERT embeddings only about 20 ms before HuBERT, likely due to AV-HuBERT's lower temporal resolution and feature concatenation process. It suggests AV-HuBERT does not adequately capture the temporal dynamics of multimodal speech perception, limiting its suitability for modeling the multimodal speech perception process.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.