GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Fiaz, Debary, Fraccaro et al.
Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .

Basic Information

  • Paper ID: 2509.25026
  • Title: GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
  • Authors: Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
  • Institutions: IBM Research, INSAIT, ETH Zürich, MBZUAI, Linköping University, ANU Australia
  • Classification: cs.CV (Computer Vision)
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2509.25026

Abstract

Recent advances in reinforcement learning have demonstrated significant progress in reasoning capabilities for natural images, yet its potential in the Earth Observation (EO) domain remains largely unexplored. EO tasks introduce unique challenges spanning referential object detection, image/region description, change detection, localization, and temporal analysis, requiring task-aware reasoning capabilities. This paper proposes a novel post-training framework incorporating task-aware reward mechanisms, enabling reasoning-based reinforcement learning models to effectively adapt to diverse EO tasks. The training strategy enhances reasoning capabilities for remote sensing imagery, stabilizes the optimization process, and improves robustness. Extensive experiments on multiple EO benchmarks demonstrate consistent performance improvements compared to state-of-the-art general-purpose and specialized vision-language models.

Research Background and Motivation

Problem Definition

Remote sensing vision-language models (RS-VLMs) demonstrate strong performance on high-resolution Earth observation imagery but suffer from shallow reasoning issues:

  1. Insufficient Reasoning Capability: Existing models rely heavily on text priors and supervised fine-tuning (SFT) and lack chain-of-thought reasoning, which results in poor generalization
  2. Lack of Task Specificity: Early RL attempts such as UAV-VL-R1 are limited to visual question answering and perform poorly on broader EO tasks including detection, description, and localization
  3. Weak Reward Signals: Existing RL methods in the EO domain receive weak, task-agnostic reward signals that are prone to reward hacking and cannot capture the structured multi-step reasoning required for complex EO scenes

Research Significance

Earth observation tasks possess unique complexity and diversity, spanning classification, detection, description, change detection, and disaster assessment across multiple dimensions, requiring powerful VLM systems capable of structured reasoning to handle multi-sensor inputs and complex spatiotemporal relationships.

Limitations of Existing Methods

  • Supervised Learning Constraints: Traditional SFT and contrastive learning objectives limit model robustness and reasoning capability
  • Inapplicability of General RL Methods: Conventional RL methods like PPO suffer from high variance and unstable policy updates in complex structured reasoning tasks
  • Improper Reward Design: Lack of specialized reward mechanisms tailored to EO task characteristics

Core Contributions

  1. Proposes GeoVLM-R1 Framework: Develops a post-training RL framework specifically designed for reasoning capabilities across diverse EO tasks
  2. Innovative Dual-Objective Reward Mechanism: Introduces dual rewards for format compliance and accuracy compliance within the GRPO framework, enhancing stable RL learning and producing accurate, structured, interpretable reasoning paths
  3. Task-Aware Reward Design: Designs specialized reward functions for different EO tasks, including recall rewards, detection rewards, SBERT rewards, etc.
  4. Comprehensive Experimental Validation: Demonstrates superior performance compared to existing VLMs across 28 downstream benchmarks

Methodology Details

Task Definition

Given EO multimodal samples $Q_i = \{i, q_i\}$ containing satellite image $i$ and corresponding text prompt $q_i$, the objective is to generate structured output containing reasoning steps and final answer:

<think>reasoning process</think>
<answer>final answer</answer>
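
The output format above can be checked mechanically. The following is a minimal sketch, assuming the two tag pairs appear at most once per response; the function name `split_response` is hypothetical and not taken from the paper.

```python
import re
from typing import Optional, Tuple

# Non-greedy matches so each pattern captures only the first tag pair.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_response(text: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer); an element is None if its tag pair is missing."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )
```

A response without either tag pair yields `(None, None)`, which a downstream reward can treat as a format failure.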

Model Architecture

1. Two-Stage Training Paradigm

Stage One: Supervised Fine-Tuning (SFT)

  • Objective function: $L_{SFT}(\pi_\theta) = -E_{(i,q_i,y_i)\sim D}\left[\sum_{t=1}^T \log \pi_\theta(y_{i,t} \mid i, q_i, y_{i,<t})\right]$
  • Purpose: Provides models with core EO knowledge and foundational reasoning capabilities
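
To make the SFT objective concrete, the sketch below computes the token-level negative log-likelihood for sequences whose per-token probabilities are already known. It is an illustration only: a real pipeline would obtain these probabilities from the model's logits, and the function names are hypothetical.

```python
import math
from typing import Sequence


def sequence_nll(token_probs: Sequence[float]) -> float:
    """Negative log-likelihood of one reference sequence.

    token_probs[t] stands for pi_theta(y_t | image, prompt, y_<t),
    the probability the policy assigns to the reference token at step t.
    """
    return -sum(math.log(p) for p in token_probs)


def sft_loss(batch: Sequence[Sequence[float]]) -> float:
    """Monte Carlo estimate of the expectation over the dataset D."""
    return sum(sequence_nll(seq) for seq in batch) / len(batch)
```

The loss is zero only when every reference token receives probability 1, and it grows as the policy assigns less mass to the reference tokens.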

Stage Two: GRPO-Based Reinforcement Learning

  • Adopts Group Relative Policy Optimization (GRPO) instead of traditional PPO
  • Leverages relative advantages between candidate responses to reduce training variance and enhance structured reasoning

2. GRPO Optimization Mechanism

For multimodal samples $Q_i$, GRPO generates K candidate responses $S_{Q_i} = \{s_1, s_2, ..., s_K\}$, optimizing:

$$J_{GRPO}(\theta) = E_{\{s_i\}_{i=1}^K \sim \pi_{\theta_{old}}(Q_i)}\left[\frac{1}{K}\sum_{i=1}^K \min[\rho_i A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)A_i]\right] - \beta D_{KL}[\pi_\theta \| \pi_{ref}]$$

where $\rho_i$ is the probability ratio between the current and old policies for candidate $s_i$, $\epsilon$ is the clipping range, $\beta$ weights the KL penalty against the reference policy $\pi_{ref}$, and the relative advantage is calculated as: $A_i = \frac{r_i - \bar{r}}{\sigma_r}$, with $r_i$ the reward of candidate $s_i$ and $\bar{r}$, $\sigma_r$ the mean and standard deviation of the rewards within the group.
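
The group-relative advantage and the clipped term inside the sum can be sketched in a few lines. The values `eps=1e-8` and `clip_eps=0.2` and the use of the population standard deviation are assumptions for illustration; the summary does not state the paper's choices.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean) / std over the K candidates of one prompt.

    eps (an assumed value) avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single candidate."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

When every candidate in a group receives the same reward, all advantages are zero and the group contributes no policy gradient, which is one reason graded task rewards are more informative than a single pass/fail signal.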

Technical Innovations

1. Task-Aware Reward Design

Total reward function: $R(a) = R_{format} + R_{task\_acc}$

Format Reward ($R_{format}$):

  • Think Reward: Ensures inclusion of <think>...</think> tags
  • Answer Reward: Ensures inclusion of <answer>...</answer> tags
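
A format reward of this kind can be sketched as two independent tag checks. The equal 0.5 weighting of the think and answer components is an assumption made here for illustration; the summary does not give the paper's weights.

```python
import re


def format_reward(text: str) -> float:
    """Score tag compliance: 0.5 for a <think> pair plus 0.5 for an <answer> pair.

    The 0.5 / 0.5 split is an assumed weighting, not the paper's.
    """
    think_ok = re.search(r"<think>.*?</think>", text, re.DOTALL) is not None
    answer_ok = re.search(r"<answer>.*?</answer>", text, re.DOTALL) is not None
    return 0.5 * think_ok + 0.5 * answer_ok
```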

Task-Aware Accuracy Reward ($R_{task\_acc}$):

  • Recall Reward (Classification tasks): $R_{Recall} = \frac{TP}{TP+FN}$
  • Detection Reward (Object detection): $R_{Detection} = \frac{1}{N}\sum_{n=1}^N \max_m IoU(s_i^m, g_i^n)$
  • SBERT Reward (Region description): $R_{SBERT} = \max(0, \cos(e_{s_i}, e_{g_i}))$
  • Lexical Metric-based Grounding Reward (LMGR): $R_{LMGR} = \frac{R_{LM} + R_{Detection}}{2}$
  • Hybrid SBERT and Lexical Metric Reward (HSLR): $R_{HSLR} = \frac{R_{SBERT} + R_{LM}}{2}$
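
The reward formulas above translate fairly directly into code. The sketch below assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` and takes sentence embeddings as plain vectors instead of calling an SBERT model; all function names are hypothetical.

```python
import math
from typing import Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def recall_reward(predicted: Set[str], truth: Set[str]) -> float:
    """TP / (TP + FN); TP + FN equals the number of ground-truth labels."""
    return len(predicted & truth) / len(truth) if truth else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def detection_reward(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Mean over the N ground-truth boxes of the best IoU with any prediction."""
    if not pred or not gt:
        return 0.0
    return sum(max(iou(p, g) for p in pred) for g in gt) / len(gt)


def sbert_reward(e_pred: Sequence[float], e_gt: Sequence[float]) -> float:
    """max(0, cosine similarity) between two precomputed embeddings."""
    dot = sum(x * y for x, y in zip(e_pred, e_gt))
    norm = math.sqrt(sum(x * x for x in e_pred)) * math.sqrt(sum(y * y for y in e_gt))
    return max(0.0, dot / norm) if norm > 0 else 0.0


def lmgr_reward(r_lm: float, r_detection: float) -> float:
    """Average of a lexical-metric reward and the detection reward."""
    return (r_lm + r_detection) / 2


def hslr_reward(r_sbert: float, r_lm: float) -> float:
    """Average of the SBERT reward and a lexical-metric reward."""
    return (r_sbert + r_lm) / 2
```

Each reward lies in [0, 1] when its inputs do, so the task component stays on a scale comparable to the format component.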

2. Training Stabilization Strategies

  • Uses horizontal bounding boxes (HBB) rather than rotated bounding boxes for RL training, so that angle prediction errors do not distort the IoU-based reward
  • Group-wise relative advantage normalization reduces reward variance
  • KL divergence constraints prevent policy drift
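
To illustrate the HBB choice, the sketch below converts a rotated box to the horizontal box that encloses it. The summary does not say how the paper derives its horizontal boxes, so the `(cx, cy, w, h, angle)` parameterization and the enclosing-box rule are assumptions made for this example.

```python
import math
from typing import Tuple


def rotated_to_hbb(
    cx: float, cy: float, w: float, h: float, angle_deg: float
) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing a rotated box.

    The rotated box is assumed to be given by its center, width, height,
    and rotation angle in degrees.
    """
    theta = math.radians(angle_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    half_w = (w * c + h * s) / 2  # horizontal half-extent after rotation
    half_h = (w * s + h * c) / 2  # vertical half-extent after rotation
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
```

With horizontal boxes, the reward depends only on four coordinates, so a small error in a predicted angle cannot sharply reduce the IoU of an otherwise well-placed box.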

Experimental Setup

Datasets

Multiple EO datasets employed for training and evaluation:

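| Dataset          | Temporal Type   | Task Type          | QA Pairs      | Reward Function        |
|------------------|-----------------|--------------------|---------------|------------------------|
| BigEarthNet      | Single-temporal | Classification     | 30,000        | Recall Reward          |
| RSCIS            | Single-temporal | Image Description  | 43,670        | Levenshtein Similarity |
| RSVQA-LRBEN      | Single-temporal | Visual QA          | 57,223        | Jaccard Similarity     |
| GeoChat-Instruct | Single-temporal | Multi-task         | 69,269-73,000 | Multiple Rewards       |
| xBD              | Bi-temporal     | Disaster Detection | 2,283-4,202   | Detection Reward       |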
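
The two lexical rewards named in the table can be sketched as follows. The normalization of the Levenshtein distance by the longer string and the whitespace tokenization for Jaccard are assumptions; the paper may use different variants.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max length, an assumed normalization to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Overlap of lowercase whitespace-separated token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)
```

Jaccard similarity suits short VQA answers, where token order matters little, while an edit-distance score is sensitive to word order and so fits longer descriptions better.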

Evaluation Metrics

  • Classification Tasks: Accuracy, Recall
  • Detection Tasks: mAP@0.5, mAP@0.25
  • Description Tasks: Rouge-1, Rouge-L, Meteor
  • QA Tasks: Jaccard Similarity

Implementation Details

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Image Size: 448×448
  • SFT Settings: 8×A100 GPUs, 2 epochs, learning rate 1e-5
  • GRPO Settings: 4×A100 GPUs, 2 epochs, learning rate 1e-6, sampling temperature 0.9, KL coefficient 0.04
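
For reference, the reported settings can be collected in a small configuration object. The class and field names below are hypothetical; only the values come from the list above, and settings the summary does not report (such as the number of candidates K) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Settings shared by both training stages (hypothetical field names)."""
    num_gpus: int
    epochs: int
    learning_rate: float


@dataclass(frozen=True)
class GRPOConfig(StageConfig):
    """Extra settings reported only for the GRPO stage."""
    temperature: float = 0.9
    kl_coefficient: float = 0.04


BASE_MODEL = "Qwen2.5-VL-3B-Instruct"
IMAGE_SIZE = (448, 448)
SFT = StageConfig(num_gpus=8, epochs=2, learning_rate=1e-5)
GRPO = GRPOConfig(num_gpus=4, epochs=2, learning_rate=1e-6)
```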

Experimental Results

Main Results

1. Scene Classification Tasks

On zero-shot and multi-label classification tasks, GeoVLM-R1 achieves a 7.88% improvement over EarthDial on BigEarthNet, with absolute gains of 2.56% and 6.9% on the temporal datasets xBD and FMoW respectively.

2. Object Detection and Localization Tasks

In referred object detection tasks, GeoVLM-R1 achieves a 21.63% improvement over EarthDial on multi-object detection. On the NWPU VHR-10 dataset, improvements are observed across all object scales.

3. Description and Localization Tasks

In region description tasks, GeoVLM-R1 surpasses the baseline methods on the reported Rouge metrics. In localization description tasks, the @0.5 and @0.25 metrics (IoU thresholds of 0.5 and 0.25) reach 38.74% and 61.45% respectively.

4. Temporal Disaster Assessment

On the xBD dataset, object detection mAP@0.5 improves by 30.55% in absolute terms, demonstrating advantages in complex temporal analysis tasks.

Ablation Studies

1. Reward Function Effectiveness

  • Classification tasks: Recall reward most effective, reaching 80.91% on BigEarthNet
  • Image description: Levenshtein ratio reward performs best
  • Change detection: Hybrid SBERT and lexical metric reward (HSLR) most effective

2. Bounding Box Representation Impact

Using horizontal bounding boxes (HBB) for RL training proves more stable than rotated bounding boxes (RBB), avoiding cumulative angle prediction errors.

3. GRPO vs Baselines

Compared to the SFT-only baseline (GeoVLM-SFT), incorporating GRPO optimization yields significant improvements across all tasks.

Case Analysis

The paper presents examples of model-generated reasoning processes, showing GeoVLM-R1's ability to:

  1. Generate structured thinking processes
  2. Provide accurate spatial localization
  3. Perform multi-step logical reasoning
  4. Handle complex temporal change analysis

Related Work

Remote Sensing VLM Development

  • Early Work: RS-GPT first introduces EO image-text paired datasets
  • Zero-Shot Capability: RemoteCLIP demonstrates strong zero-shot performance on classification and retrieval tasks
  • Region-Level Understanding: GeoChat, SkyEyeGPT extend to region-level visual grounding
  • Multimodal Fusion: EarthGPT, EarthDial integrate heterogeneous EO modalities

VLM Post-Training Techniques

  • Alignment Techniques: DPO and PPO widely applied to VLM alignment
  • Reasoning Enhancement: GRPO demonstrates excellent structured reasoning capability in DeepSeek-R1
  • Domain Limitations: Existing reasoning models primarily focus on mathematics and programming, overlooking remote sensing task potential

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: GeoVLM-R1 consistently surpasses existing methods across 28 EO benchmarks
  2. Reasoning Capability Enhancement: Structured reasoning significantly improves performance on complex EO tasks
  3. Stable Training: GRPO combined with task-aware rewards achieves stable and effective RL training

Limitations

  1. Computational Cost: RL training requires additional computational resources and time
  2. Reward Design Complexity: Different tasks require carefully designed specialized reward functions
  3. Data Dependency: Performance largely depends on high-quality EO instruction data

Future Directions

  1. Multimodal Extension: Integrate additional EO sensor data (SAR, hyperspectral, etc.)
  2. Zero-Shot Generalization: Enhance model generalization on unseen tasks
  3. Efficiency Optimization: Develop more efficient RL training strategies

In-Depth Evaluation

Strengths

  1. Strong Innovation: Among the first applications of R1-style reasoning training across diverse remote sensing tasks, going beyond the VQA-only scope of earlier attempts such as UAV-VL-R1
  2. Complete Methodology: Comprehensive technical pathway from problem definition to solution
  3. Comprehensive Experiments: Thorough evaluation across multiple datasets and tasks
  4. High Practical Value: Addresses the practical problem of insufficient reasoning capability in remote sensing VLMs

Weaknesses

  1. Base Model Dependency: Method effectiveness largely depends on base VLM quality
  2. Reward Engineering Complexity: Requires manual reward function design for each task type
  3. Computational Overhead: RL training introduces significant computational cost compared to direct fine-tuning
  4. Insufficient Generalization Analysis: Lacks in-depth analysis of cross-domain generalization capability

Impact

  1. Academic Contribution: Introduces new training paradigm to remote sensing AI field
  2. Practical Value: Directly applicable to real-world remote sensing application scenarios
  3. Technical Inspiration: Provides reference for enhancing reasoning capabilities of VLMs in other specialized domains

Applicable Scenarios

  1. Remote Sensing Image Analysis: Satellite image classification, object detection, change detection
  2. Disaster Monitoring: Natural disaster damage assessment, emergency response
  3. Urban Planning: Land use change monitoring, infrastructure planning
  4. Environmental Monitoring: Ecosystem change tracking, climate change research

References

The paper cites 82 relevant references covering remote sensing VLMs, reinforcement learning, and vision-language models across multiple domains, providing solid theoretical foundation for the research.


Overall Assessment: This is a high-quality computer vision paper that makes a significant contribution to remote sensing image understanding. The methodology is novel, the experiments are comprehensive, and the results are convincing, offering a useful technical path for advancing remote sensing AI.