GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Fiaz, Debary, Fraccaro et al.

Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.

Basic Information

Paper ID: 2509.25026
Title: GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
Authors: Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
Institutions: IBM Research, INSAIT, ETH Zürich, MBZUAI, Linköping University, ANU Australia
Classification: cs.CV (Computer Vision)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2509.25026

Abstract

Recent advances in reinforcement learning have demonstrated significant progress in reasoning capabilities for natural images, yet its potential in the Earth Observation (EO) domain remains largely unexplored. EO tasks introduce unique challenges spanning referential object detection, image/region description, change detection, localization, and temporal analysis, requiring task-aware reasoning capabilities. This paper proposes a novel post-training framework incorporating task-aware reward mechanisms, enabling reasoning-based reinforcement learning models to effectively adapt to diverse EO tasks. The training strategy enhances reasoning capabilities for remote sensing imagery, stabilizes the optimization process, and improves robustness. Extensive experiments on multiple EO benchmarks demonstrate consistent performance improvements compared to state-of-the-art general-purpose and specialized vision-language models.

Research Background and Motivation

Problem Definition

Remote sensing vision-language models (RS-VLMs) demonstrate strong performance on high-resolution Earth observation imagery but suffer from shallow reasoning issues:

Insufficient Reasoning Capability: Existing models rely heavily on text priors and supervised fine-tuning (SFT) and lack chain-of-thought reasoning, which results in poor generalization
Lack of Task Specificity: Early RL attempts such as UAV-VL-R1 are limited to visual question answering and perform poorly on broader EO tasks including detection, description, and localization
Weak Reward Signals: Existing RL methods in the EO domain receive weak, task-agnostic reward signals that are prone to reward hacking and cannot capture the structured multi-step reasoning required for complex EO scenes

Research Significance

Earth observation tasks are complex and diverse, spanning classification, detection, description, change detection, and disaster assessment. They call for VLM systems capable of structured reasoning over multi-sensor inputs and complex spatiotemporal relationships.

Limitations of Existing Methods

Supervised Learning Constraints: Traditional SFT and contrastive learning objectives limit model robustness and reasoning capability
Inapplicability of General RL Methods: Conventional RL methods like PPO suffer from high variance and unstable policy updates in complex structured reasoning tasks
Improper Reward Design: Lack of specialized reward mechanisms tailored to EO task characteristics

Core Contributions

Proposes GeoVLM-R1 Framework: Develops a post-training RL framework specifically designed for reasoning capabilities across diverse EO tasks
Innovative Dual-Objective Reward Mechanism: Introduces dual rewards for format compliance and accuracy compliance within the GRPO framework, enhancing stable RL learning and producing accurate, structured, interpretable reasoning paths
Task-Aware Reward Design: Designs specialized reward functions for different EO tasks, including recall rewards, detection rewards, SBERT rewards, etc.
Comprehensive Experimental Validation: Demonstrates superior performance compared to existing VLMs across 28 downstream benchmarks

Methodology Details

Task Definition

Given an EO multimodal sample $Q_i = \{i, q_i\}$ containing a satellite image $i$ and the corresponding text prompt $q_i$, the objective is to generate a structured output containing the reasoning steps and the final answer:

<think>reasoning process</think>
<answer>final answer</answer>

Model Architecture

1. Two-Stage Training Paradigm

Stage One: Supervised Fine-Tuning (SFT)

Objective function: $L_{SFT}(\pi_\theta) = -E_{(i,q_i,y_i)\sim D}\left[\sum_{t=1}^T \log \pi_\theta(y_{i,t} | i, q_i, y_{i,<t})\right]$
Purpose: Provides models with core EO knowledge and foundational reasoning capabilities
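
The SFT objective above is a token-level negative log-likelihood. A minimal sketch, assuming the per-token probabilities $\pi_\theta(y_{i,t} | i, q_i, y_{i,<t})$ have already been read off the model (the function name and the list-of-lists input format are hypothetical, not from the paper), could look like this:

```python
import math

def sft_loss(token_probs_per_sample):
    """Average, over samples, of the summed negative log-probabilities
    that the model assigns to each ground-truth token.

    token_probs_per_sample: one list per training sample, holding the
    probability of every target token given the image, prompt, and prefix.
    """
    total = 0.0
    for token_probs in token_probs_per_sample:
        # Inner sum over t = 1..T of log pi_theta(y_t | ...), negated.
        total += -sum(math.log(p) for p in token_probs)
    # The expectation over the dataset is approximated by a batch mean.
    return total / len(token_probs_per_sample)
```

A sample whose target tokens all receive probability 1 contributes zero loss; lower probabilities increase it.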

Stage Two: GRPO-Based Reinforcement Learning

Adopts Group Relative Policy Optimization (GRPO) instead of traditional PPO
Leverages relative advantages between candidate responses to reduce training variance and enhance structured reasoning

2. GRPO Optimization Mechanism

For each multimodal sample $Q_i$, GRPO samples $K$ candidate responses $S_{Q_i} = \{s_1, s_2, \ldots, s_K\}$ from the old policy $\pi_{\theta_{old}}$ and maximizes:

$J_{GRPO}(\theta) = E_{\{s_i\}_{i=1}^K \sim \pi_{\theta_{old}}(Q_i)}\left[\frac{1}{K}\sum_{i=1}^K \min[\rho_i A_i, \text{clip}(\rho_i, 1-\epsilon, 1+\epsilon)A_i]\right] - \beta D_{KL}[\pi_\theta \| \pi_{ref}]$

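where $\rho_i = \frac{\pi_\theta(s_i \mid Q_i)}{\pi_{\theta_{old}}(s_i \mid Q_i)}$ is the probability ratio between the current and old policies, $\epsilon$ is the clipping range, $\beta$ weights the KL penalty against the reference policy $\pi_{ref}$, and the relative advantage is calculated as: $A_i = \frac{r_i - \bar{r}}{\sigma_r}$, with $\bar{r}$ and $\sigma_r$ the mean and standard deviation of the $K$ rewards in the group.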
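
The group-relative advantage and the clipped term of the objective can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the population standard deviation, the small `eps` added for numerical safety, and the default `clip_eps=0.2` are choices made here, and the KL penalty is omitted.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """A_i = (r_i - mean) / std, computed within one group of K responses.

    eps is an assumption added here to avoid division by zero when all
    responses in the group receive the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratios, advantages, clip_eps=0.2):
    """Mean over the group of min(rho * A, clip(rho, 1-eps, 1+eps) * A).

    The KL term of the full objective is intentionally left out.
    """
    terms = []
    for rho, adv in zip(ratios, advantages):
        clipped_rho = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(rho * adv, clipped_rho * adv))
    return sum(terms) / len(terms)

# Four sampled responses to one prompt, two rewarded and two not.
advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
objective = clipped_surrogate([1.0, 1.0, 1.0, 1.0], advantages)
```

Because advantages are normalized within the group, only the relative ranking of the K responses matters, which is what lets GRPO dispense with a learned value function.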

Technical Innovations

1. Task-Aware Reward Design

Total reward function: $R(a) = R_{format} + R_{task\_acc}$

Format Reward ( $R_{format}$ ):

Think Reward: Ensures inclusion of <think>...</think> tags
Answer Reward: Ensures inclusion of <answer>...</answer> tags
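
A format reward of this kind can be checked with two regular expressions. The sketch below is illustrative only: the equal weight of 0.5 per tag pair and the helper names are assumptions, since the summary does not state how the two format components are weighted.

```python
import re

# Non-greedy matches; DOTALL lets the reasoning span several lines.
THINK_RE = re.compile(r"<think>.+?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.+?)</answer>", re.DOTALL)

def format_reward(response, weight=0.5):
    """Sum of a think reward and an answer reward.

    Each non-empty tag pair found in the response contributes `weight`
    (an assumed value), so a fully compliant response scores 1.0.
    """
    think_reward = weight if THINK_RE.search(response) else 0.0
    answer_reward = weight if ANSWER_RE.search(response) else 0.0
    return think_reward + answer_reward

def extract_answer(response):
    """Return the text inside the first <answer> block, or None if absent."""
    match = ANSWER_RE.search(response)
    return match.group(1).strip() if match else None
```

The extracted answer is what a task-aware accuracy reward would then compare against the ground truth.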

Task-Aware Accuracy Reward ( $R_{task\_acc}$ ):

Recall Reward (Classification tasks): $R_{Recall} = \frac{TP}{TP+FN}$
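Detection Reward (Object detection): $R_{Detection} = \frac{1}{N}\sum_{n=1}^N \max_m IoU(s_i^m, g_i^n)$, the mean over the $N$ ground-truth boxes $g_i^n$ of the best IoU achieved by any predicted box $s_i^m$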
SBERT Reward (Region description): $R_{SBERT} = \max(0, \cos(e_{s_i}, e_{g_i}))$
Lexical Metric-based Grounding Reward (LMGR): $R_{LMGR} = \frac{R_{LM} + R_{Detection}}{2}$, where $R_{LM}$ denotes the lexical-metric reward
Hybrid SBERT and Lexical Metric Reward (HSLR): $R_{HSLR} = \frac{R_{SBERT} + R_{LM}}{2}$
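
The reward formulas above translate directly into code. The following is a sketch under assumptions rather than the paper's implementation: boxes are taken to be axis-aligned `(x1, y1, x2, y2)` tuples, labels are compared as sets, and the SBERT reward receives precomputed embedding vectors instead of calling a sentence encoder.

```python
import math

def recall_reward(predicted_labels, true_labels):
    """TP / (TP + FN): share of ground-truth labels that were predicted."""
    truth = set(true_labels)
    if not truth:
        return 0.0
    return len(truth & set(predicted_labels)) / len(truth)

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_boxes, gt_boxes):
    """Mean over ground-truth boxes of the best IoU with any prediction."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    return sum(max(iou(p, g) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)

def sbert_reward(pred_embedding, gt_embedding):
    """max(0, cosine similarity) between two sentence embeddings."""
    dot = sum(x * y for x, y in zip(pred_embedding, gt_embedding))
    norm = math.sqrt(sum(x * x for x in pred_embedding)) * math.sqrt(
        sum(y * y for y in gt_embedding)
    )
    return max(0.0, dot / norm) if norm > 0 else 0.0

def lmgr_reward(r_lm, r_detection):
    """Equal-weight average of a lexical-metric and a detection reward."""
    return (r_lm + r_detection) / 2

def hslr_reward(r_sbert, r_lm):
    """Equal-weight average of an SBERT and a lexical-metric reward."""
    return (r_sbert + r_lm) / 2
```

Clamping the cosine similarity at zero keeps every component in the range [0, 1], so the two hybrid rewards stay on the same scale as the single-metric ones.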

2. Training Stabilization Strategies

Uses horizontal bounding boxes (HBB) rather than rotated bounding boxes for RL training, which reduces the impact of angle prediction errors on IoU
Group-wise relative advantage normalization reduces reward variance
KL divergence constraints prevent policy drift
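
To illustrate the first point, a rotated box can be replaced by the horizontal box that encloses it, which removes the angle from the quantity the reward depends on. The summary does not say how the HBBs are obtained, so the `(cx, cy, w, h, angle)` parameterization and the enclosing-box conversion below are assumptions made for this sketch.

```python
import math

def rotated_to_hbb(cx, cy, w, h, angle_deg):
    """Axis-aligned (x1, y1, x2, y2) box enclosing a rotated rectangle.

    The rectangle is given by its center, width, height, and rotation
    angle in degrees (a hypothetical parameterization).
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    xs, ys = [], []
    # Rotate each corner offset around the center, then translate.
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        xs.append(cx + dx * cos_t - dy * sin_t)
        ys.append(cy + dx * sin_t + dy * cos_t)
    return (min(xs), min(ys), max(xs), max(ys))
```

With this representation a small error in the predicted angle no longer changes the box that enters the IoU computation, at the cost of a looser fit around elongated, tilted objects.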

Experimental Setup

Datasets

Multiple EO datasets are used for training and evaluation:

Dataset	Temporal Type	Task Type	QA Pairs	Reward Function
BigEarthNet	Single-temporal	Classification	30,000	Recall Reward
RSCIS	Single-temporal	Image Description	43,670	Levenshtein Similarity
RSVQA-LRBEN	Single-temporal	Visual QA	57,223	Jaccard Similarity
GeoChat-Instruct	Single-temporal	Multi-task	69,269-73,000	Multiple Rewards
xBD	Bi-temporal	Disaster Detection	2,283-4,202	Detection Reward
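
Two of the reward functions in the table are simple string metrics. The sketch below shows one common way to compute them; normalizing the edit distance by the longer string and tokenizing on whitespace are assumptions, and the paper's exact variants may differ.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, 1):
        curr = [i]
        for j, ch_b in enumerate(b, 1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a, b):
    """1 - distance / length of the longer string, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest

def jaccard_similarity(a, b):
    """Size of the shared word set divided by the size of the union."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Both metrics return values in [0, 1], so they can serve directly as the accuracy part of the reward.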

Evaluation Metrics

Classification Tasks: Accuracy, Recall
Detection Tasks: mAP@0.5, mAP@0.25
Description Tasks: Rouge-1, Rouge-L, Meteor
QA Tasks: Jaccard Similarity
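
For the description metrics, a simplified view of Rouge-1 and Rouge-L is sketched below. It lowercases and splits on whitespace only and reports the F-measure; official implementations add tokenization and stemming rules, so scores from this sketch should not be expected to match published numbers exactly.

```python
from collections import Counter

def _f_measure(overlap, candidate_len, reference_len):
    if overlap == 0:
        return 0.0
    precision = overlap / candidate_len
    recall = overlap / reference_len
    return 2 * precision * recall / (precision + recall)

def rouge_1_f(candidate, reference):
    """F-measure over overlapping unigrams (with multiplicity)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return _f_measure(overlap, len(cand), len(ref))

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, 1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f(candidate, reference):
    """F-measure based on the longest common subsequence of tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f_measure(lcs_length(cand, ref), len(cand), len(ref))
```

Rouge-1 ignores word order, while Rouge-L rewards captions that keep the reference's words in the same sequence.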

Implementation Details

Base Model: Qwen2.5VL-3B-Instruct
Image Size: 448×448
SFT Settings: 8×A100 GPUs, 2 epochs, learning rate 1e-5
GRPO Settings: 4×A100 GPUs, 2 epochs, learning rate 1e-6, temperature 0.9, KL ratio 0.04
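
For reference, the reported settings can be gathered into a small configuration object. The class and field names below are hypothetical, and reading "KL ratio 0.04" as the coefficient $\beta$ of the KL term is an assumption; only the values come from the summary above.

```python
from dataclasses import dataclass

BASE_MODEL = "Qwen2.5VL-3B-Instruct"
IMAGE_SIZE = (448, 448)  # input resolution in pixels

@dataclass(frozen=True)
class StageConfig:
    """Settings shared by both training stages."""
    num_gpus: int          # number of A100 GPUs
    epochs: int
    learning_rate: float

@dataclass(frozen=True)
class GRPOConfig(StageConfig):
    """Extra settings that apply only to the RL stage."""
    temperature: float = 0.9      # sampling temperature for candidates
    kl_coefficient: float = 0.04  # assumed to be beta in the GRPO objective

SFT = StageConfig(num_gpus=8, epochs=2, learning_rate=1e-5)
GRPO = GRPOConfig(num_gpus=4, epochs=2, learning_rate=1e-6)
```

The RL stage uses a learning rate ten times smaller than the SFT stage, consistent with its role as a refinement step on top of an already fine-tuned model.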

Experimental Results

Main Results

1. Scene Classification Tasks

On zero-shot and multi-label classification tasks, GeoVLM-R1 achieves a 7.88% improvement over EarthDial on BigEarthNet, with absolute gains of 2.56% and 6.9% on the temporal datasets xBD and FMoW, respectively.

2. Object Detection and Localization Tasks

In referential object detection tasks, GeoVLM-R1 achieves a 21.63% improvement over EarthDial on multi-object detection. On the NWPU VHR-10 dataset, it improves substantially across all object scales.

3. Description and Localization Tasks

In region description tasks, GeoVLM-R1 surpasses the baseline methods across the Rouge metrics. In localization description tasks, the @0.5 and @0.25 metrics (IoU thresholds of 0.5 and 0.25) reach 38.74% and 61.45%, respectively.

4. Temporal Disaster Assessment

On the xBD dataset, object detection mAP@0.5 improves by 30.55% in absolute terms, demonstrating advantages in complex temporal analysis tasks.

Ablation Studies

1. Reward Function Effectiveness

Classification tasks: Recall reward most effective, reaching 80.91% on BigEarthNet
Image description: Levenshtein ratio reward performs best
Change detection: Hybrid SBERT and lexical metric reward (HSLR) most effective

2. Bounding Box Representation Impact

Using horizontal bounding boxes (HBB) for RL training proves more stable than rotated bounding boxes (RBB), avoiding cumulative angle prediction errors.

3. GRPO vs Baselines

Compared to the SFT-only GeoVLM-SFT baseline, adding GRPO optimization yields significant improvements across all tasks.

Case Analysis

The paper presents examples of model-generated reasoning, showing GeoVLM-R1's ability to:

Generate structured thinking processes
Provide accurate spatial localization
Perform multi-step logical reasoning
Handle complex temporal change analysis

Remote Sensing VLM Development

Early Work: RS-GPT first introduces EO image-text paired datasets
Zero-Shot Capability: RemoteCLIP demonstrates strong zero-shot performance on classification and retrieval tasks
Region-Level Understanding: GeoChat, SkyEyeGPT extend to region-level visual grounding
Multimodal Fusion: EarthGPT, EarthDial integrate heterogeneous EO modalities

VLM Post-Training Techniques

Alignment Techniques: DPO and PPO widely applied to VLM alignment
Reasoning Enhancement: GRPO demonstrates excellent structured reasoning capability in DeepSeek-R1
Domain Limitations: Existing reasoning models primarily focus on mathematics and programming, overlooking the potential of remote sensing tasks

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: GeoVLM-R1 consistently surpasses existing methods across 28 EO benchmarks
Reasoning Capability Enhancement: Structured reasoning significantly improves performance on complex EO tasks
Stable Training: GRPO combined with task-aware rewards achieves stable and effective RL training

Limitations

Computational Cost: RL training requires additional computational resources and time
Reward Design Complexity: Different tasks require carefully designed specialized reward functions
Data Dependency: Performance largely depends on high-quality EO instruction data

Future Directions

Multimodal Extension: Integrate additional EO sensor data (SAR, hyperspectral, etc.)
Zero-Shot Generalization: Enhance model generalization on unseen tasks
Efficiency Optimization: Develop more efficient RL training strategies

In-Depth Evaluation

Strengths

Strong Innovation: First application of R1-style reasoning training to remote sensing, filling an important gap
Complete Methodology: Comprehensive technical pathway from problem definition to solution
Comprehensive Experiments: Thorough evaluation across multiple datasets and tasks
High Practical Value: Addresses the practical problem of insufficient reasoning capability in remote sensing VLMs

Weaknesses

Base Model Dependency: Method effectiveness largely depends on base VLM quality
Reward Engineering Complexity: Requires manual reward function design for each task type
Computational Overhead: RL training introduces significant computational cost compared to direct fine-tuning
Insufficient Generalization Analysis: Lacks in-depth analysis of cross-domain generalization capability

Impact

Academic Contribution: Introduces new training paradigm to remote sensing AI field
Practical Value: Directly applicable to real-world remote sensing application scenarios
Technical Inspiration: Provides reference for enhancing reasoning capabilities of VLMs in other specialized domains

Applicable Scenarios

Remote Sensing Image Analysis: Satellite image classification, object detection, change detection
Disaster Monitoring: Natural disaster damage assessment, emergency response
Urban Planning: Land use change monitoring, infrastructure planning
Environmental Monitoring: Ecosystem change tracking, climate change research

References

The paper cites 82 references covering remote sensing VLMs, reinforcement learning, and vision-language models across multiple domains, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality computer vision paper that makes a significant contribution to remote sensing image understanding. The methodology is novel, the experiments are comprehensive, and the results are convincing, offering a useful technical pathway for advancing remote sensing AI.