LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters

Abstract

Generating high-quality, temporally synchronized audio from video content is essential for video editing and post-production, where it enables the creation of semantically aligned audio for silent videos. However, most existing approaches either focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean, human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 improves multiple metrics (relative improvement in parentheses): FDpasst 450.00 → 327.29 (+27.27%), FDpanns 34.88 → 22.68 (+34.98%), FDvgg 3.75 → 1.28 (+65.87%), KLpanns 2.49 → 2.07 (+16.87%), KLpasst 1.78 → 1.53 (+14.04%), ISpanns 4.17 → 4.30 (+3.12%), IBscore 0.25 → 0.28 (+12.00%), Energy10ms 0.3013 → 0.1349 (+55.23%), Energy10ms(vs.GT) 0.0531 → 0.0288 (+45.76%), and Sem. Rel. 2.73 → 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at https://github.com/deepreasonings/long-form-video2audio.
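The reported percentages mix metrics where a lower value is better (the FD, KL, and Energy scores) with metrics where a higher value is better (IS, IB-score, Sem. Rel.), yet all are written with a plus sign. The following minimal sketch shows one way to reproduce them as relative improvements over the baseline. The function name `relative_improvement` and the lower/higher-is-better flags are illustrative assumptions inferred from the direction of each reported change; they are not taken from the authors' code.

```python
def relative_improvement(before: float, after: float, lower_is_better: bool) -> float:
    """Relative improvement over the baseline, in percent (positive = better)."""
    gain = (before - after) if lower_is_better else (after - before)
    return 100.0 * gain / before


# (baseline, LD-LAudio-V1, lower_is_better) -- values copied from the abstract;
# the direction flags are assumptions inferred from the reported signs.
METRICS = {
    "FDpasst": (450.00, 327.29, True),
    "FDpanns": (34.88, 22.68, True),
    "FDvgg": (3.75, 1.28, True),
    "KLpanns": (2.49, 2.07, True),
    "KLpasst": (1.78, 1.53, True),
    "ISpanns": (4.17, 4.30, False),
    "IBscore": (0.25, 0.28, False),
    "Energy10ms": (0.3013, 0.1349, True),
    "Energy10ms(vs.GT)": (0.0531, 0.0288, True),
    "Sem. Rel.": (2.73, 3.28, False),
}

if __name__ == "__main__":
    for name, (before, after, lower) in METRICS.items():
        pct = relative_improvement(before, after, lower)
        print(f"{name}: {before} -> {after} ({pct:+.2f}%)")
```

Under these assumptions, every printed percentage matches the figure quoted in the abstract to two decimal places.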
