LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation
Abstract
Precise localization and delineation of brain tumors using magnetic resonance imaging (MRI) are essential for planning therapy and guiding surgical decisions. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. On BRISC 2025, LoGSAM attains a Dice score of 80.32\%, reaching 98.6\% of a fully fine-tuned GDINO + MedSAM baseline while training fewer than 5\% of its parameters, indicating a favorable accuracy/parameter trade-off. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on unseen MRI scans, achieving 91.7\% case-level class-extraction accuracy. These results highlight the feasibility of constructing a modular speech-to-segmentation pipeline from pretrained foundation models with minimal parameter updates.
Turn this paper into a full lesson
ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.