Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs

Abstract

Automatic mean opinion score (MOS) prediction serves as a principled alternative to both subjective listening tests and objective metrics, providing scalable and consistent audio evaluation. Recent multimodal large language models offer strong perceptual modeling and reasoning capabilities which, in line with the LLM-as-Judge paradigm, make them suitable for audio quality assessment. In this work, we address the challenging problem of audio editing evaluation and propose the first natural language-based automated evaluation framework, built upon Qwen2-Audio. We introduce two caption-based fine-tuning tasks to enhance multi-audio understanding, together with a purpose-built Chain-of-Thought prompting strategy that encourages structured, step-by-step reasoning. Experiments show that our framework produces interpretable and logically consistent text-based evaluations that align closely with human judgments and outperform existing baselines. The code and demo are available at https://github.com/NKU-HLT/EvalReasoning.
