Fair and Calibrated Toxicity Detection with Robust Training and Abstention

Mokshit Surana

Abstract

Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs (n = 1000). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration (0.013) but is significantly miscalibrated across all identity subgroups (+0.029 to +0.134). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC +0.06 to +0.12) but worsens the calibration-fairness gap by up to +0.232. Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE 0.118). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
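
To make the evaluation axes named in the abstract concrete, the following is a minimal sketch of three of them: per-subgroup ECE, BPSN/BNSP AUC, and a single point on a risk-coverage curve for confidence-based abstention. It is not the paper's implementation. All function names, the 10-bin equal-width ECE on the positive-class probability, the 0.5 decision threshold, and the toy data are assumptions made for illustration only; the BPSN/BNSP definitions follow their standard usage (background positives vs. subgroup negatives, and background negatives vs. subgroup positives).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE on the positive-class probability (assumed variant)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Equal-width bins on [0, 1]; a score of exactly 1.0 goes in the last bin.
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed toxic rate and mean predicted probability,
            # weighted by the fraction of examples in the bin.
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)

def pairwise_auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(((pos > neg) + 0.5 * (pos == neg)).mean())

def bpsn_auc(probs, labels, in_subgroup):
    """Background Positive, Subgroup Negative AUC."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    sub = np.asarray(in_subgroup).astype(bool)
    return pairwise_auc(probs[~sub & (labels == 1)], probs[sub & (labels == 0)])

def bnsp_auc(probs, labels, in_subgroup):
    """Background Negative, Subgroup Positive AUC."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    sub = np.asarray(in_subgroup).astype(bool)
    return pairwise_auc(probs[sub & (labels == 1)], probs[~sub & (labels == 0)])

def risk_at_coverage(probs, labels, coverage, threshold=0.5):
    """Error rate on the most confident `coverage` fraction; the rest is deferred."""
    probs, labels = np.asarray(probs, dtype=float), np.asarray(labels)
    confidence = np.abs(probs - threshold)  # distance from the decision boundary
    n_keep = max(1, int(np.ceil(coverage * len(probs))))
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    preds = (probs[keep] >= threshold).astype(int)
    return float((preds != labels[keep]).mean())

if __name__ == "__main__":
    # Toy data (hypothetical): first four rows are background, last four mention an identity.
    probs = np.array([0.95, 0.85, 0.15, 0.05, 0.75, 0.65, 0.35, 0.45])
    labels = np.array([1, 1, 0, 0, 1, 0, 0, 1])
    sub = np.array([False] * 4 + [True] * 4)
    print("aggregate ECE :", expected_calibration_error(probs, labels))
    print("subgroup ECE  :", expected_calibration_error(probs[sub], labels[sub]))
    print("BPSN AUC      :", bpsn_auc(probs, labels, sub))
    print("BNSP AUC      :", bnsp_auc(probs, labels, sub))
    print("risk @ 50% cov:", risk_at_coverage(probs, labels, 0.5))
```

Computing ECE separately on the rows selected by each identity mask, rather than once over the whole set, is what lets a gap between aggregate and subgroup calibration show up. Likewise, evaluating `risk_at_coverage` over a range of coverage values, per subgroup, is one way to check whether deferral helps identity-mentioning content as much as background content.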
