Evaluating Black-Box Classifiers via Stable Adaptive Two-Sample Inference
Abstract
We consider the problem of evaluating black-box multi-class classifiers. In the standard setup, we observe class labels $Y \in \{0, 1, \ldots, M-1\}$ generated according to the conditional distribution $Y \mid X \sim \mathrm{Multinom}(\eta(X))$, where $X$ denotes the features and $\eta$ maps from the feature space to the $(M-1)$-dimensional simplex. A black-box classifier is an estimate $\hat\eta$ for which we make no assumptions about the training algorithm. Given holdout data, our goal is to evaluate the performance of the classifier $\hat\eta$. Recent work suggests treating this as a goodness-of-fit problem by testing a hypothesis $H_0$ that compares the distance between the distributions of $(X, Y)$ and $(X', Y')$ with a threshold $\delta$, where the distance is some metric between two distributions and $(X', Y') \sim P_X \times \mathrm{Multinom}(\hat\eta(X))$. Combining ideas from algorithmic fairness, the Neyman-Pearson lemma, and conformal p-values, we propose a new methodology for this testing problem. The key idea is to generate a second sample $(X', Y') \sim P_X \times \mathrm{Multinom}(\hat\eta(X))$, which reduces the task to two-sample conditional distribution testing. Using part of the data, we train an auxiliary binary classifier, called a distinguisher, that attempts to distinguish between the two samples. The distinguisher's ability to differentiate the samples, measured using a rank-sum statistic, is then used to assess the difference between $\eta$ and $\hat\eta$. Using techniques from cross-validation central limit theorems, we derive an asymptotically rigorous test under suitable stability conditions on the distinguisher.
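The pipeline described in the abstract (generate a second sample from the classifier, train a distinguisher on part of the data, score the rest with a rank-sum statistic) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's procedure: the simulated data, the names `eta_true`, `eta_uniform`, and `evaluate`, the histogram-based distinguisher, the use of a single train/test split in place of cross-validation, and the textbook Mann-Whitney normal approximation, which only addresses the case where the two samples have the same distribution.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def eta_true(x):
    # Hypothetical ground-truth class probabilities (M = 3), for simulation only.
    return softmax([0.0, 4.0 * x, -4.0 * x])

def eta_uniform(x):
    # Hypothetical poor classifier that ignores the features.
    return [1.0 / 3.0] * 3

def draw_pairs(eta, n, rng):
    # Simulated P_X: X ~ Uniform(-1, 1); then Y | X ~ Multinom(eta(X)).
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.choices(range(3), weights=eta(x))[0]
        pairs.append((x, y))
    return pairs

def cell(pair, bins=10):
    # Map (x, y) to a (feature bin, label) cell.
    x, y = pair
    b = min(int((x + 1.0) / 2.0 * bins), bins - 1)
    return (b, y)

def fit_distinguisher(real, synth):
    # Stand-in distinguisher: smoothed histogram estimate of P(real | x, y).
    counts = {}
    for label, sample in ((1, real), (0, synth)):
        for pair in sample:
            counts.setdefault(cell(pair), [0, 0])[label] += 1

    def score(pair):
        n_synth, n_real = counts.get(cell(pair), [0, 0])
        return (n_real + 1.0) / (n_real + n_synth + 2.0)

    return score

def rank_sum_auc(real_scores, synth_scores):
    # Normalized Mann-Whitney U: P(real score > synthetic score), ties count 1/2.
    wins = 0.0
    for r in real_scores:
        for s in synth_scores:
            wins += 1.0 if r > s else 0.5 if r == s else 0.0
    return wins / (len(real_scores) * len(synth_scores))

def evaluate(eta_hat, n=1000, seed=0):
    rng = random.Random(seed)
    real = draw_pairs(eta_true, 2 * n, rng)   # holdout sample (X, Y)
    synth = draw_pairs(eta_hat, 2 * n, rng)   # generated sample (X', Y')
    score = fit_distinguisher(real[:n], synth[:n])  # train on the first halves
    auc = rank_sum_auc([score(p) for p in real[n:]],
                       [score(p) for p in synth[n:]])
    # Normal approximation for equal-size samples from a common distribution.
    z = (auc - 0.5) / math.sqrt((2 * n + 1) / (12.0 * n * n))
    return auc, z
```

Calling `evaluate(eta_true)` mimics a classifier that matches $\eta$, so the distinguisher should do no better than chance and the statistic should stay near 0.5. Calling `evaluate(eta_uniform)` mimics a poor classifier, for which the distinguisher should separate the two samples and push the statistic well above 0.5.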