Lost in Localization: Building RabakBench with Human-in-the-Loop Validation to Measure Multilingual Safety Gaps

Abstract

Large language models (LLMs) often fail to maintain safety in low-resource language varieties, such as code-mixed vernaculars and regional dialects. We introduce RabakBench, a multilingual safety benchmark and scalable pipeline localized to Singapore's unique linguistic landscape, covering Singlish, Chinese, Malay, and Tamil. We construct the benchmark through a three-stage pipeline: (1) Generate: augmenting real-world unsafe web content via LLM-driven red teaming; (2) Label: applying semi-automated multi-label annotation using majority-voted LLM labelers; and (3) Translate: performing high-fidelity, toxicity-preserving translation. The resulting dataset contains over 5,000 examples across six fine-grained safety categories. Despite using LLMs for scalability, our framework maintains rigorous human oversight, achieving 0.70-0.80 inter-annotator agreement. Evaluations of 13 state-of-the-art guardrails reveal significant performance degradation, underscoring the need for localized evaluation. RabakBench provides a reproducible framework for building safety benchmarks in underserved communities.
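The Label stage relies on majority-voted LLM labelers producing multi-label annotations. A minimal sketch of how such a vote could be aggregated is shown below. It assumes each labeler returns a set of category names and that a category is kept only when strictly more than half of the labelers assign it. The function name `majority_vote`, the category names, and the tie-handling rule are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def majority_vote(label_sets: list[set[str]]) -> set[str]:
    """Keep a category if strictly more than half of the labelers assigned it.

    Hypothetical aggregation rule; the paper's exact voting scheme is not
    specified in the abstract.
    """
    if not label_sets:
        return set()
    # Each labeler contributes at most one vote per category (sets have no duplicates).
    counts = Counter(label for labels in label_sets for label in labels)
    threshold = len(label_sets) / 2
    return {label for label, n in counts.items() if n > threshold}


# Placeholder category names, one set per LLM labeler.
votes = [
    {"hateful", "insults"},
    {"hateful"},
    {"hateful", "sexual"},
]
print(sorted(majority_vote(votes)))  # ['hateful']
```

With an even number of labelers, a category that receives exactly half the votes is dropped under this rule; a real pipeline might instead route such ties to human review.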
