MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Abstract

Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

0

Turn this paper into a full lesson

ArcXiv compiles a staged curriculum from this paper: 8-12 lessons across beginner → advanced, synthesised section guides, visuals, flashcards, a quiz, exercises, and on-demand deep dives per section. Grounded in the abstract, never invented.

Discussion (0)

Sign in to join the discussion.

Loading comments…