GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Fiaz, Debary, Fraccaro et al.
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.
Reinforcement learning has recently delivered significant gains in reasoning over natural images, yet its potential in the Earth Observation (EO) domain remains largely unexplored. EO tasks pose distinct challenges spanning referred object detection, image and region captioning, change detection, grounding, and temporal analysis, all of which require task-aware reasoning. The paper proposes a novel post-training framework with task-aware reward mechanisms that lets reasoning-based reinforcement learning models adapt effectively to diverse EO tasks. The training strategy enhances reasoning on remote sensing imagery, stabilizes optimization, and improves robustness. Extensive experiments on multiple EO benchmarks show consistent improvements over state-of-the-art general-purpose and specialized vision-language models.
Remote sensing vision-language models (RS-VLMs) perform strongly on high-resolution Earth observation imagery but suffer from shallow reasoning:
Insufficient Reasoning Capability: Existing models rely heavily on text priors and supervised fine-tuning (SFT) and lack chain-of-thought reasoning, which results in poor generalization.
Lack of Task Specificity: Early RL attempts such as UAV-VL-R1 are limited to visual question answering and perform poorly on broader EO tasks, including detection, captioning, and grounding.
Weak Reward Signals: Existing RL methods in the EO domain receive weak, task-agnostic reward signals. These are prone to reward hacking and cannot capture the structured multi-step reasoning that complex EO scenes require.
Supervised Learning Constraints: Traditional SFT and contrastive learning objectives limit model robustness and reasoning capability.
Inapplicability of General RL Methods: Conventional RL methods such as PPO suffer from high variance and unstable policy updates in complex structured reasoning tasks.
Improper Reward Design: Specialized reward mechanisms tailored to the characteristics of EO tasks are lacking.
Proposes GeoVLM-R1 Framework: Develops a post-training RL framework designed specifically to strengthen reasoning across diverse EO tasks.
Innovative Dual-Objective Reward Mechanism: Introduces two rewards within the GRPO framework, one for format compliance and one for answer accuracy, which stabilizes RL training and yields accurate, structured, interpretable reasoning paths.
Task-Aware Reward Design: Designs specialized reward functions for different EO tasks, including recall rewards, detection rewards, and SBERT rewards.
Comprehensive Experimental Validation: Demonstrates superior performance compared to existing VLMs across 28 downstream benchmarks.
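The reward ideas listed above can be made concrete with a minimal Python sketch. It assumes a recall-style reward for multi-label classification, an equally weighted sum of the format and accuracy terms, and the group-relative reward normalization usually associated with GRPO. The function names, the weights, and the use of the population standard deviation are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev


def recall_reward(predicted, reference):
    """Fraction of reference labels that the prediction recovers (hypothetical reward)."""
    ref = {label.strip().lower() for label in reference}
    if not ref:
        return 0.0
    pred = {label.strip().lower() for label in predicted}
    return len(pred & ref) / len(ref)


def combined_reward(format_ok, accuracy, w_format=0.5, w_accuracy=0.5):
    """Dual-objective reward: format compliance plus task accuracy.

    The equal weights are an assumption made for this sketch.
    """
    return w_format * (1.0 if format_ok else 0.0) + w_accuracy * accuracy


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because each advantage is measured against the other responses sampled for the same prompt, no separate value network is needed, which is the usual argument for GRPO over PPO in this setting.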
Given an EO multimodal sample $Q_i = \{i, q_i\}$ consisting of a satellite image $i$ and its corresponding text prompt $q_i$, the objective is to generate a structured output that contains the reasoning steps and the final answer.
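A structured output of this kind can be checked mechanically, which is what makes a format reward possible. The sketch below assumes, purely for illustration, that reasoning and answer are wrapped in `<think>` and `<answer>` tags; the summary does not state the exact template, so the tag names and both function names are hypothetical.

```python
import re

# Assumed template: the tag names are NOT given in the summary.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def parse_response(text):
    """Return (reasoning, answer) if the text follows the assumed template, else None."""
    match = RESPONSE_PATTERN.match(text)
    if match is None:
        return None
    reasoning, answer = (part.strip() for part in match.groups())
    return reasoning, answer


def format_reward(text):
    """Binary format-compliance reward: 1.0 for a well-formed response, 0.0 otherwise."""
    return 1.0 if parse_response(text) is not None else 0.0
```

For example, `parse_response("<think>Two runways are visible.</think><answer>airport</answer>")` yields the reasoning and the answer as separate strings, while a bare `"airport"` fails the check and earns a format reward of 0.0.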
On zero-shot and multi-label classification tasks, GeoVLM-R1 achieves a 7.88% improvement over EarthDial on BigEarthNet, and absolute gains of 2.56% and 6.9% on the temporal datasets xBD and FMoW, respectively.
In referred object detection tasks, GeoVLM-R1 achieves a 21.63% improvement over EarthDial on multi-object detection. On the NWPU VHR-10 dataset, substantial improvements are observed across all object scales.
In region captioning tasks, GeoVLM-R1 surpasses the baseline methods on the ROUGE metrics. In grounding description tasks, the @0.5 and @0.25 metrics reach 38.74% and 61.45%, respectively.
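The @0.5 and @0.25 figures are most naturally read as accuracy at intersection-over-union (IoU) thresholds of 0.5 and 0.25; that reading is an assumption, since the summary does not define the metric. Under that assumption, and with boxes given as `(x_min, y_min, x_max, y_max)` with one prediction per ground-truth box, the metric can be sketched as follows (function names are hypothetical):

```python
def box_iou(a, b):
    """IoU of two horizontal boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(predictions, ground_truths, threshold):
    """Share of predictions whose IoU with the paired ground truth meets the threshold."""
    if not ground_truths:
        return 0.0
    hits = sum(
        box_iou(p, g) >= threshold for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)
```

A prediction that overlaps its target with an IoU of one third counts as a hit at the 0.25 threshold but not at 0.5, which is why the @0.25 figure is always at least as high as the @0.5 figure.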
Using horizontal bounding boxes (HBB) for RL training proves more stable than rotated bounding boxes (RBB), avoiding cumulative angle prediction errors.
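To illustrate the difference between the two box types, the sketch below converts a rotated box into the horizontal box that encloses it. It assumes the rotated box is given as center, width, height, and an angle in degrees, a common convention that the summary does not confirm; the function name is hypothetical. The horizontal form has no angle term, so a model trained on it has no angle to mispredict.

```python
import math


def rotated_to_horizontal(cx, cy, w, h, angle_deg):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing a rotated box.

    Assumes the rotated box is (center x, center y, width, height, angle in degrees).
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Half-extents of the rotated rectangle projected onto the x and y axes.
    half_w = (abs(w * cos_t) + abs(h * sin_t)) / 2.0
    half_h = (abs(w * sin_t) + abs(h * cos_t)) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

The conversion is lossy: the enclosing box is looser than the rotated one for elongated objects at oblique angles, which is the trade-off accepted in exchange for more stable training.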
The paper cites 82 references covering remote sensing VLMs, reinforcement learning, and vision-language models across multiple domains, which gives the work a solid foundation.
Overall Assessment: This is a high-quality computer vision paper that makes a significant contribution to remote sensing image understanding. The methodology is novel, the experiments are comprehensive, and the results are convincing, offering a practical technical path for advancing remote sensing AI.