A Fusion-Aware Two-Stage Framework for Mispronunciation Detection and Diagnosis in Low-Resource Modern Standard Arabic
Abstract
Accurate phoneme recognition is pivotal for mispronunciation detection and diagnosis (MDD) in modern standard Arabic (MSA), yet remains constrained by data scarcity and the synthetic-real domain gap. This work proposes a two-stage end-to-end framework. It integrates a pre-trained encoder with causal dilated temporal convolutional networks to preserve fine-grained phonetic variations. A hierarchical two-stage strategy first learns general mappings from native/synthetic corpora, then adapts to scarce real learner data to mitigate domain shift without over-correction. Prediction stability is further enhanced via multi-checkpoint ensemble inference with N-gram rescoring. Evaluated on the QuranMB.v2 test set, our system achieves an F1-score of 0.7201, a 63.1\% relative improvement over baseline (0.4414). This performance ranks at the top of the IqraEval.2 Challenge, establishing a new state-of-the-art for low-resource MSA in MDD.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.