Managing Map Cardinality in Automatic Disease Classification Mapping: Balancing Precision, Recall and Coverage
Abstract
Automatic mapping between disease classification systems, such as the International Classification of Diseases (ICD), is a challenging yet essential task for integrating health data and conducting longitudinal data analysis. Existing embedding-based methods primarily focus on one-to-one mappings, overlooking more complex one-to-many scenarios. The threshold-based and top-K methods offer natural extensions; however, they involve inherent trade-offs between precision, recall and mapping coverage (the proportion of source codes with at least one mapping to a target code). To address this challenge, we introduce a novel method inspired by the blocking-and-matching pipeline commonly used in entity resolution. In particular, we first generate a block of candidate matches (blocking) and then employ a large language model (LLM) to identify all valid mappings within each block (matching). Empirically, we show that the proposed method achieves higher precision with comparable recall and broader coverage across multiple ICD version pairs (ICD-9-CM to ICD-10-CM and ICD-10-AM to ICD-11). Our source code and dataset are available at: https://tinyurl.com/46kyn7wp.
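To make the trade-off concrete, the following is a minimal sketch of the three strategies named in the abstract (threshold-based, top-K, and blocking followed by matching) together with the coverage measure. It is not the authors' implementation. The two-dimensional embedding vectors, the code names `S1`..`S3` and `T1`..`T3`, the use of cosine similarity, the choice of top-K retrieval as the blocking step, and the `toy_matcher` function standing in for the LLM call are all assumptions made for illustration.

```python
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Mapping = Dict[str, List[str]]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranked_targets(src_vec: Vector, targets: Dict[str, Vector]) -> List[tuple]:
    """Target codes sorted by descending similarity to one source vector."""
    scored = [(code, cosine(src_vec, vec)) for code, vec in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def top_k_mapping(sources: Dict[str, Vector], targets: Dict[str, Vector], k: int) -> Mapping:
    """Map every source code to its k most similar target codes."""
    return {s: [c for c, _ in ranked_targets(v, targets)[:k]] for s, v in sources.items()}


def threshold_mapping(sources: Dict[str, Vector], targets: Dict[str, Vector], tau: float) -> Mapping:
    """Map every source code to all targets with similarity >= tau (may be none)."""
    return {
        s: [c for c, score in ranked_targets(v, targets) if score >= tau]
        for s, v in sources.items()
    }


def block_and_match(
    sources: Dict[str, Vector],
    targets: Dict[str, Vector],
    k: int,
    matcher: Callable[[str, List[str]], List[str]],
) -> Mapping:
    """Blocking: take the top-k candidates. Matching: let `matcher` keep the valid ones."""
    blocks = top_k_mapping(sources, targets, k)
    return {s: matcher(s, block) for s, block in blocks.items()}


def coverage(mapping: Mapping) -> float:
    """Proportion of source codes with at least one mapped target code."""
    if not mapping:
        return 0.0
    return sum(1 for mapped in mapping.values() if mapped) / len(mapping)


# Made-up embeddings, for illustration only.
sources = {"S1": (1.0, 0.0), "S2": (0.0, 1.0), "S3": (1.0, 1.0)}
targets = {"T1": (1.0, 0.0), "T2": (0.9, 0.1), "T3": (0.0, 1.0)}


def toy_matcher(source_code: str, candidates: List[str]) -> List[str]:
    """Hypothetical stand-in for the LLM: keeps candidates listed in a hand-written table."""
    accepted = {"S1": {"T1", "T2"}, "S2": {"T3"}, "S3": {"T2"}}
    return [c for c in candidates if c in accepted.get(source_code, set())]


by_threshold = threshold_mapping(sources, targets, tau=0.95)
by_top_k = top_k_mapping(sources, targets, k=2)
by_block_match = block_and_match(sources, targets, k=2, matcher=toy_matcher)
```

On this toy data, the strict threshold leaves `S3` without any mapping and so lowers coverage, while top-K covers every source code but always returns exactly K targets, including ones a careful reviewer would reject. The blocking-and-matching variant keeps the full candidate block for coverage and leaves the decision about how many targets to retain to the matcher.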