Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Abstract

Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yor\`ub\'a, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yor\`ub\'a sarcasm by taking culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' = 0.7660; pairwise Cohen's = 0.6732--0.8743), with 83.3\% unanimous consensus. One annotator pair achieved almost perfect agreement ( = 0.8743; 93.8\% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining 16.7\% majority-agreement cases are preserved as soft labels for uncertainty-aware modelling. Yor-Sarchttps://github.com/toheebadura/yor-sarc is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.

0

Turn this paper into a lesson

ArcXiv compiles a structured reading guide from this paper's metadata: plain-English importance, contributions, prerequisite concepts, which sections to read first, flashcards, and a quiz. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…