On the representation of speech and music

Abstract

In most automatic speech recognition (ASR) systems, the audio signal is processed to produce a time series of sensor measurements (e.g., filterbank outputs). This time series encodes semantic information in a speaker-dependent way. An earlier paper showed how to use the sequence of sensor measurements to derive an "inner" time series that is unaffected by any previous invertible transformation of the sensor measurements. The current paper considers two or more speakers, who mimic one another in the following sense: when they say the same words, they produce sensor states that are invertibly mapped onto one another. It follows that the inner time series of their utterances must be the same when they say the same words. In other words, the inner time series encodes their speech in a manner that is speaker-independent. Consequently, the ASR training process can be simplified by collecting and labelling the inner time series of the utterances of just one speaker, instead of training on the sensor time series of the utterances of a large variety of speakers. A similar argument suggests that the inner time series of music is instrument-independent. This is demonstrated in experiments on monophonic electronic music.
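
The invariance argument can be illustrated with a toy sketch. This is not the paper's construction of the inner time series. As a stand-in, it uses a rank (empirical CDF) transform, which is invariant only under strictly increasing maps of a scalar signal, a much narrower class than the invertible transformations considered in the abstract. The synthetic signal and the names `rank_representation` and `mimic_map` are hypothetical, chosen only for illustration.

```python
import math
import random


def rank_representation(samples):
    """Replace each sample by its rank divided by the series length.

    Stand-in for an "inner" representation (an assumption, not the paper's
    method): ranks are unchanged by any strictly increasing map of the samples.
    """
    order = sorted(range(len(samples)), key=lambda i: samples[i])
    ranks = [0.0] * len(samples)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank / len(samples)
    return ranks


def mimic_map(x):
    """Hypothetical invertible (strictly increasing) map between two speakers."""
    return math.exp(0.5 * x) + 2.0


random.seed(0)
n = 500

# Synthetic scalar "sensor" time series for speaker A (illustrative data only).
speaker_a = [
    math.sin(2 * math.pi * 3 * k / n) + 0.1 * random.gauss(0.0, 1.0)
    for k in range(n)
]

# Speaker B "mimics" A: the same utterance seen through an invertible map.
speaker_b = [mimic_map(x) for x in speaker_a]

inner_a = rank_representation(speaker_a)
inner_b = rank_representation(speaker_b)

# The raw sensor series differ, but the stand-in representations coincide.
print("sensor series equal:", speaker_a == speaker_b)
print("rank representations equal:", inner_a == inner_b)
```

In this toy setting, labels attached to `inner_a` would apply unchanged to `inner_b`, which mirrors the abstract's claim that training on one speaker's inner time series could suffice.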
