S2Cap: A Benchmark and a Baseline for Singing Style Captioning

Abstract

Singing voices contain much richer information than common voices, including varied vocal and acoustic properties. However, current open-source audio-text datasets for singing voices capture only a narrow range of attributes and lack acoustic features, leading to limited utility towards downstream tasks, such as style captioning. To fill this gap, we formally define the singing style captioning task and present S2Cap, a dataset of singing voices with detailed descriptions covering diverse vocal, acoustic, and demographic characteristics. Using this dataset, we develop an efficient and straightforward baseline algorithm for singing style captioning. The dataset is available at https://zenodo.org/records/15673764.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…