Optimal Transport Audio Distance with Learned Riemannian Ground Metrics
Abstract
In audio generation evaluation, Fr\'echet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 contamination relative to discrete OT. We propose Optimal Transport Audio Distance (OTAD), which corrects each primitive with one dedicated mechanism -- a residual Riemannian ground-metric adapter for the cost and entropic Sinkhorn optimal transport for the coupling. Across eight encoders under a four-axis protocol, coupling-only comparisons at ε = 0.05 show that Sinkhorn's rank-1 sensitivity exceeds FAD's by a factor of 1.9 to 3.6. Furthermore, OTAD achieves a higher mean Spearman correlation with audio-quality MOS (DCASE 2023 Task 7) than baseline metrics. As an intrinsic benefit of the discrete transport plan, OTAD yields per-sample diagnostics with AUROC 0.86, a capability that scalar- or kernel-aggregated metrics structurally lack.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.