Learning from Audio-Dependency Errors: Data Curation Strategies Based on Model Confusion Patterns in Audio Question Answering

Abstract

We frame our system as diagnostic data curation for a large audio-language model: before fine-tuning, we probe Qwen3-Omni-30B-A3B-Instruct under normal, empty-audio, and shuffled-audio conditions to identify how its answers change when audio evidence is removed or mismatched. We use these model confusion patterns to bucket training samples into text-prior, shuffle-leak, strong audio-dependent, and hard or misleading cases. Our strongest train-only system fine-tunes on strong-audio items, for which the model answers the normal audio-question pair correctly but fails on both counterfactual variants, together with a small number of empty-audio negatives, and uses a text-only response normalizer for parse-failed generations. On the official development set, the best train-only system reaches 67.27% accuracy after response normalization, compared with 65.90% for our local Qwen3-Omni baseline, a gain of 1.37 percentage points. Our final submissions additionally include models trained on the combined train and development splits and a three-model ensemble.
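The bucketing step can be illustrated with a minimal sketch. Only the strong-audio rule (normal correct, both counterfactual variants incorrect) is stated in the abstract. The rules for the other three buckets, their precedence, and all names below (`ProbeResult`, `bucket`, `select_strong_audio`) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Correctness of one training item under the three probing conditions."""
    normal_correct: bool    # real audio paired with its question
    empty_correct: bool     # audio removed (empty-audio condition)
    shuffled_correct: bool  # audio swapped with another item's audio


def bucket(p: ProbeResult) -> str:
    """Assign an item to a bucket from its probe outcomes.

    Only the "strong_audio" rule comes from the abstract. The other
    rules and the order in which they are checked are assumptions.
    """
    if not p.normal_correct:
        # Assumption: wrong even with the real audio.
        return "hard_or_misleading"
    if p.empty_correct:
        # Assumption: answerable from the question text alone.
        return "text_prior"
    if p.shuffled_correct:
        # Assumption: still answerable with mismatched audio.
        return "shuffle_leak"
    # From the abstract: normal correct, both counterfactuals fail.
    return "strong_audio"


def select_strong_audio(results: dict[str, ProbeResult]) -> list[str]:
    """Return the ids of items that fall into the strong-audio bucket."""
    return [item_id for item_id, p in results.items()
            if bucket(p) == "strong_audio"]


if __name__ == "__main__":
    probes = {
        "a": ProbeResult(True, False, False),
        "b": ProbeResult(True, True, False),
        "c": ProbeResult(True, False, True),
        "d": ProbeResult(False, False, False),
    }
    for item_id, p in probes.items():
        print(item_id, bucket(p))
    print(select_strong_audio(probes))
```

Checking the empty-audio outcome before the shuffled-audio outcome means an item that is correct under both counterfactuals is labeled text-prior. A different precedence would shift items between those two buckets but would leave the strong-audio set unchanged.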
