Categorical distance correlation under general encodings and its application to high-dimensional feature screening

Abstract

In this paper, we extend distance correlation to categorical data with general encodings, such as one-hot encoding for nominal variables and semicircle encoding for ordinal variables. Unlike existing methods, our approach leverages the spacing information between categories, which enhances the performance of distance correlation. We give two estimators, the maximum likelihood estimate and a bias-corrected estimate, together with their limiting distributions under the null and alternative hypotheses. Furthermore, we establish the sure screening property for high-dimensional categorical data under mild conditions. We conduct a simulation study to compare the performance of different encodings and illustrate their practical utility using the 2018 General Social Survey data.
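
To make the idea of encoding-dependent distance correlation concrete, the following is a minimal sketch and not the paper's estimators. It encodes integer category labels and then computes the standard sample distance correlation of Székely et al. (the plain V-statistic form) on the encoded points. The function names `one_hot`, `semicircle`, and `distance_correlation` are hypothetical, and the semicircle formula used here (K ordered categories at equally spaced angles on the upper unit semicircle) is an assumption, since the abstract does not define it. The bias-corrected estimate mentioned in the abstract is not implemented.

```python
import numpy as np


def one_hot(labels, n_categories):
    """Encode integer labels 0..K-1 as rows of the K x K identity matrix."""
    return np.eye(n_categories)[np.asarray(labels)]


def semicircle(labels, n_categories):
    """Assumed semicircle encoding: category k sits at angle pi*k/(K-1).

    This places K >= 2 ordered categories at equally spaced points on the
    upper unit semicircle, so nearby categories are closer together than
    distant ones. The exact form used in the paper may differ.
    """
    theta = np.pi * np.asarray(labels) / (n_categories - 1)
    return np.column_stack([np.cos(theta), np.sin(theta)])


def _double_centered_distances(points):
    """Pairwise Euclidean distances with row, column, and grand means removed."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation (V-statistic form) between encoded samples."""
    a = _double_centered_distances(np.asarray(x, dtype=float))
    b = _double_centered_distances(np.asarray(y, dtype=float))
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:
        # One of the samples is constant, so the correlation is taken as 0.
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Illustrative use on synthetic data (not from the paper).
rng = np.random.default_rng(0)
x_labels = rng.integers(0, 5, size=200)                      # ordinal, 5 levels
y_labels = np.clip(x_labels + rng.integers(-1, 2, size=200), 0, 4)

dcor_one_hot = distance_correlation(one_hot(x_labels, 5), one_hot(y_labels, 5))
dcor_semicircle = distance_correlation(semicircle(x_labels, 5), semicircle(y_labels, 5))
print(dcor_one_hot, dcor_semicircle)
```

Under one-hot encoding every pair of distinct categories is equally far apart, whereas the semicircle encoding keeps adjacent categories closer than distant ones. That difference in pairwise distances is the spacing information the abstract refers to.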
