Classification with High-Dimensional Sparse Samples

Abstract

The task of the binary classification problem is to determine which of two distributions has generated a length-$n$ test sequence. The two distributions are unknown; two training sequences of length $N$, one from each distribution, are observed. The distributions share an alphabet of size $m$, which is significantly larger than $n$ and $N$. How do $N$, $n$, and $m$ affect the probability of classification error? We characterize the achievable error rate in a high-dimensional setting in which $N$, $n$, and $m$ all tend to infinity, under the assumption that the probability of any symbol is $O(m^{-1})$. The results are:

1. There exists an asymptotically consistent classifier if and only if $m = o(\min\{N^2, Nn\})$. This extends the previous consistency result in [1] to the case $N \neq n$.
2. For the sparse sample case where $\max\{n, N\} = o(m)$, finer results are obtained: the best achievable probability of error decays as $-\log(P_e) = J \min\{N^2, Nn\}(1+o(1))/m$ with $J > 0$.
3. A weighted coincidence-based classifier has a non-zero generalized error exponent $J$.
4. The $\ell_2$-norm based classifier has $J = 0$.
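
As a rough illustration of how the scaling conditions in the abstract interact, consider one hypothetical regime that is not taken from the paper: equal sequence lengths $n = N$ and an alphabet growing as $m = N^{3/2}$. The sketch below only substitutes this assumed choice into the stated conditions; the constant $J$ is left unspecified, as in the abstract.

```latex
% Assumed regime (illustrative only): n = N and m = N^{3/2}.
\begin{align*}
\frac{\max\{n, N\}}{m} &= \frac{N}{N^{3/2}} = N^{-1/2} \to 0
  && \text{(sparse sample case: } \max\{n, N\} = o(m)\text{)} \\
\frac{\min\{N^2, Nn\}}{m} &= \frac{N^2}{N^{3/2}} = N^{1/2} \to \infty
  && \text{(so } m = o(\min\{N^2, Nn\})\text{)} \\
-\log(P_e) &= J\, N^{1/2}\,(1 + o(1))
  && \text{(error decay rate, with } J > 0\text{)}
\end{align*}
```

By contrast, if the alphabet grew as $m = N^2$ with $n = N$, the ratio $\min\{N^2, Nn\}/m$ would stay equal to 1, the condition $m = o(\min\{N^2, Nn\})$ would fail, and by the first result no asymptotically consistent classifier would exist.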
