CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
Abstract
Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress on automated metrics, yet their clinical utility remains uncertain due to limited qualitative evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that enables tractable reinforcement learning (RL) through structured multimodal temporal embeddings and high-resolution visual feature compression, for efficient, unified conditioning of an LLM decoder on visual, textual, and temporal context from a study and its prior. This enables group relative policy optimisation (GRPO), where a proposed reward function is used to improve semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates for seven of the eight analysed findings. Preferences for radiologist reports were driven primarily by higher recall, while generated reports were consistently preferred for readability. Together, these results define a clear pathway to clinically acceptable CXR RRG. Improving recall and the detection of subtle findings represents the primary remaining barrier to non-inferiority with radiologist reporting, positioning CXR RRG for prospective evaluation in assistive, radiologist-led workflows.
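To make the GRPO step concrete, the following is a minimal sketch of the group-relative advantage computation that GRPO is generally built on: several candidate reports are sampled for the same study, each is scored by a reward function, and each reward is normalised against its own group. The abstract does not specify the reward function or any hyperparameters, so the reward values, the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are all illustrative assumptions rather than the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the mean and spread of its own group.

    `rewards` holds one scalar per sampled report for the SAME study.
    The population standard deviation and `eps` are assumptions made for
    this sketch; real implementations may differ.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Reports scoring above the group mean get a positive advantage,
    # those below get a negative one; eps avoids division by zero
    # when every sample receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four reports sampled for one study.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
```

Because the baseline comes from the group itself, no separate value network is needed, which is one reason this family of methods is considered comparatively cheap to train.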