
Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach

Tayar, de Oliveira, Tommaselli et al.

Autonomous UAV inspection of confined industrial infrastructure, such as ventilation ducts, demands robust navigation policies where collisions are unacceptable. While Deep Reinforcement Learning (DRL) offers a powerful paradigm for developing such policies, it presents a critical trade-off between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, a vital trait for minimizing costly and unsafe real-world fine-tuning. In contrast, on-policy methods often exhibit greater training stability, which is essential for reliable convergence in hazard-dense environments. This paper directly investigates this trade-off by comparing a leading on-policy algorithm, Proximal Policy Optimization (PPO), against an off-policy counterpart, Soft Actor-Critic (SAC), for precision flight in procedurally generated ducts within a high-fidelity simulator. Our results show that PPO consistently learned a stable, collision-free policy that completed the entire course. In contrast, SAC failed to find a complete solution, converging to a suboptimal policy that navigated only the initial segments before failure. This work provides evidence that for high-precision, safety-critical navigation tasks, the reliable convergence of a well-established on-policy method can be more decisive than the nominal sample efficiency of an off-policy algorithm.

Basic Information

Paper ID: 2508.16807
Title: Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach
Authors: Marco S. Tayar, Lucas K. de Oliveira, Felipe Andrade G. Tommaselli, Juliano D. Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker (University of São Paulo)
Classification: cs.RO cs.AI cs.LG cs.SY eess.SY
Publication Date: October 11, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2508.16807

Abstract

This paper investigates autonomous UAV inspection of confined industrial infrastructure such as ventilation ducts, a task that requires robust navigation policies and tolerates no collisions. While deep reinforcement learning (DRL) provides a powerful paradigm for developing such policies, it involves a critical trade-off between on-policy and off-policy algorithms. Off-policy methods promise high sample efficiency, which is crucial for minimizing expensive and unsafe real-world fine-tuning. Conversely, on-policy methods typically exhibit greater training stability, which is essential for reliable convergence in hazard-dense environments. The paper examines this trade-off by comparing a leading on-policy algorithm, Proximal Policy Optimization (PPO), against an off-policy counterpart, Soft Actor-Critic (SAC), on precision flight through procedurally generated ducts in a high-fidelity simulator. The results show that PPO consistently learns a stable, collision-free policy that completes the entire course, while SAC fails to find a complete solution and converges to a suboptimal policy that navigates only the initial segments.

Research Background and Motivation

Problem Definition

Manual inspection of industrial infrastructure such as pipelines and ventilation ducts is complex, expensive, and time-consuming, yet critical to maintaining operational integrity. Unmanned aerial vehicles (UAVs) are a significant advance for industrial inspection because they enable automated, safe data collection in environments that are inaccessible or unsafe for humans.

Challenge Analysis

Navigating UAVs in confined spaces such as ducts presents unique challenges:

Complex Aerodynamic Effects: Proximity to walls creates complex aerodynamic phenomena, increasing collision risk
Limitations of Classical Methods: Traditional motion planning approaches lack adaptability and struggle with unmodeled aerodynamic phenomena (such as ground effect in narrow ducts)
Safety-Critical Nature: In these environments, collisions are unacceptable, requiring highly reliable control strategies

Research Motivation

Deep reinforcement learning provides a powerful paradigm for addressing these challenges, but algorithm selection is critical. The core question is: for tasks requiring high precision and safety, is the stability of on-policy methods more important than the sample efficiency of off-policy algorithms?

Core Contributions

Direct Comparative Analysis: Direct comparison of mature on-policy and off-policy algorithms on autonomous UAV navigation tasks in confined industrial ducts
Empirical Evidence: Provides empirical evidence that, for hazard-dense, high-precision tasks, the training stability of on-policy methods can be more decisive than the sample efficiency of off-policy methods
Simulation Workflow Validation: Validates a simulation workflow using procedurally generated environments and high-fidelity physics engines as a testing platform for developing and benchmarking industrial UAV control strategies

Methodology

Task Definition

The paper models goal-oriented UAV control as a Markov Decision Process (MDP): $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$

State Space:

$s_t = [\, p_{rel},\ \hat{p}^{B}_{rel},\ q,\ v^{B}_{lin},\ v^{B}_{ang},\ a_{t-1} \,] \in \mathbb{R}^{20}$

where:

$p_{rel} \in \mathbb{R}^{3}$: position vector from the UAV to the next waypoint
$\hat{p}^{B}_{rel} \in \mathbb{R}^{3}$: unit-normalized $p_{rel}$ expressed in body coordinates
$q \in \mathbb{R}^{4}$: unit quaternion (world to body)
$v^{B}_{lin}, v^{B}_{ang} \in \mathbb{R}^{3}$: linear and angular velocities in body coordinates
$a_{t-1} \in \mathbb{R}^{4}$: motor command vector from the previous timestep
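
As a minimal sketch of how the 20 components could be assembled, the following assumes that p_rel is given in world coordinates, that q is stored as (w, x, y, z), and that the body-frame direction is obtained by rotating p_rel with q and normalizing. The function names `rotate` and `build_state` are hypothetical and are not taken from the paper.

```python
import math

def rotate(q, v):
    """Rotate 3-vector v by the unit quaternion q = (w, x, y, z)."""
    w, ux, uy, uz = q
    # t = 2 * (u x v), where u is the vector part of q
    tx = 2.0 * (uy * v[2] - uz * v[1])
    ty = 2.0 * (uz * v[0] - ux * v[2])
    tz = 2.0 * (ux * v[1] - uy * v[0])
    # v' = v + w * t + (u x t)
    return (
        v[0] + w * tx + (uy * tz - uz * ty),
        v[1] + w * ty + (uz * tx - ux * tz),
        v[2] + w * tz + (ux * ty - uy * tx),
    )

def build_state(p_rel, q, v_lin_body, v_ang_body, prev_action):
    """Concatenate the observation terms into a flat list of 20 floats."""
    # Assumption: q maps world-frame vectors into the body frame.
    p_body = rotate(q, p_rel)
    norm = math.sqrt(sum(c * c for c in p_body))
    # Guard against a zero-length vector when the UAV sits on the waypoint.
    p_hat_body = tuple(c / norm for c in p_body) if norm > 1e-9 else (0.0, 0.0, 0.0)
    state = [*p_rel, *p_hat_body, *q, *v_lin_body, *v_ang_body, *prev_action]
    if len(state) != 20:
        raise ValueError(f"expected 20 state components, got {len(state)}")
    return state
```

The length check mirrors the dimension count 3 + 3 + 4 + 3 + 3 + 4 = 20 implied by the component list.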

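Action Space: Continuous actions $a_t \in [-1, 1]^{4}$, one component per rotor, parameterize the rotor speed commands:

$\omega_i = (1 + 0.8\, a_{t,i})\, \omega_{hover}, \quad i = 1, \dots, 4$

where $\omega_{hover} = 14.47$ krpm is the calibrated hover speed, so each rotor command ranges from $0.2\,\omega_{hover}$ to $1.8\,\omega_{hover}$.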
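
The action mapping can be written in a few lines. This is an illustrative sketch only: the constant and function names are hypothetical, and clipping out-of-range actions to [-1, 1] is an assumption that the source does not state.

```python
OMEGA_HOVER_KRPM = 14.47  # calibrated hover speed reported in the text
ACTION_GAIN = 0.8         # scaling factor from the rotor command equation

def rotor_speeds(action):
    """Map a 4-dimensional action in [-1, 1] to rotor speeds in krpm."""
    if len(action) != 4:
        raise ValueError("expected one action component per rotor")
    # Assumption: actions outside [-1, 1] are clipped before scaling.
    clipped = [max(-1.0, min(1.0, a)) for a in action]
    return [(1.0 + ACTION_GAIN * a) * OMEGA_HOVER_KRPM for a in clipped]
```

A zero action commands hover speed on every rotor, while the extremes give roughly 2.89 krpm (a = -1) and 26.05 krpm (a = +1).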

Simulation Environment Design

Genesis Physics Engine: Employs the Genesis high-fidelity physics engine for GPU-accelerated parallel rigid-body simulation.

Procedural Duct Generation:

Different ducts are generated for each episode, ensuring the policy learns to navigate diverse and challenging scenarios
Ducts are constructed by connecting Ns straight duct segments end-to-end
Angular deviation between adjacent segments is controlled using Rodrigues' rotation formula:

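$v' = v \cos\theta + (k \times v) \sin\theta + k\,(k \cdot v)(1 - \cos\theta)$

where $k$ is a unit vector along the rotation axis, $\theta$ is the rotation angle, and $v'$ is the rotated vector $v$.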
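
The generation procedure can be sketched as follows. Only the Rodrigues formula itself comes from the text. The random choice of rotation axis, the uniform angle distribution, the fixed segment length, and the names `rodrigues` and `generate_centerline` are assumptions made for illustration.

```python
import math
import random

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def normalize(a):
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def rodrigues(v, k, theta):
    """Rotate v about axis k by angle theta (Rodrigues' rotation formula)."""
    k = normalize(k)
    c, s = math.cos(theta), math.sin(theta)
    kxv, kv = cross(k, v), dot(k, v)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kv * (1.0 - c) for i in range(3))

def generate_centerline(num_segments, segment_length, max_angle, rng):
    """Return the end points of straight segments joined end-to-end."""
    point, direction = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
    waypoints = [point]
    for _ in range(num_segments):
        point = tuple(point[i] + segment_length * direction[i] for i in range(3))
        waypoints.append(point)
        # Assumption: the bend axis is a random direction perpendicular to the duct.
        while True:
            r = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
            axis = cross(direction, r)
            if dot(axis, axis) > 1e-12:
                break
        # Assumption: the bend angle is uniform in [0, max_angle].
        direction = normalize(rodrigues(direction, axis, rng.uniform(0.0, max_angle)))
    return waypoints
```

Calling `generate_centerline(6, 1.0, math.radians(20), random.Random(0))` yields seven points. Whether the paper's 7 waypoints correspond to segment end points in this way is not stated in the source.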

UAV Model: Uses a simulation model of the Bitcraze Crazyflie 2 (92×92×29 mm nano-quadrotor).

Learning Algorithm Comparison

Both algorithms are implemented with the skrl framework and share an identical network architecture to keep the comparison fair:

Network Structure: actor-critic with two hidden layers (256, 128 units, ELU activation)
PPO Configuration: rollout horizon 256, 4096 parallel environments, adaptive KL target 0.01, γ=0.99, λ=0.95, ε=0.2
SAC Configuration: twin critics, replay buffer 10⁶, batch size 512, τ=0.005, γ=0.99, automatic entropy adjustment
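
To make the role of these hyperparameters concrete, the sketch below writes out the standard textbook forms of generalized advantage estimation (γ, λ), the PPO clipped surrogate term (ε), and the SAC soft target update (τ). It is plain Python for illustration, not the skrl implementation used in the paper, and it ignores episode termination inside a rollout for brevity.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout without terminations."""
    advantages = [0.0] * len(rewards)
    running, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging of target-network parameters toward the online network."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

With ε = 0.2, a probability ratio of 1.5 on a positive advantage is capped at 1.2, which limits how far a single PPO update can move the policy. With τ = 0.005, SAC's target critics move only 0.5% of the way toward the online critics per update.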

Reward Function Design

The reward is a weighted sum of modular terms: $R_t = \sum_k w_k r_k$, where $r_k$ is an individual reward term and $w_k$ is its weight.

Three Main Categories:

Guidance Rewards:
- Progress: rewards motion toward next waypoint
- Centerline Deviation: penalizes deviation from duct centerline
- Velocity Tracking: encourages target forward velocity
Stability Rewards:
- Orientation Alignment: rewards yaw/level attitude
- Angular Velocity Damping: penalizes rotational velocity
- Action Smoothness: penalizes abrupt motor command changes
Event Rewards:
- Waypoint Pass: sparse reward for passing waypoints
- Duct Finish: large terminal reward for completing duct
- Crash Penalty: large penalty for collision/violation
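
The weighted-sum structure can be sketched as below. The term definitions, the term names, and the weights in the usage note are hypothetical placeholders. The source gives neither the exact formulas of the individual terms nor the tuned weights.

```python
def progress_term(prev_distance, distance):
    """Positive when the UAV moved closer to the next waypoint (assumed form)."""
    return prev_distance - distance

def centerline_term(deviation):
    """Negative value that grows with distance from the centerline (assumed form)."""
    return -deviation

def total_reward(terms, weights):
    """Compute R_t as the sum over k of w_k * r_k.

    Raises KeyError if a term has no matching weight.
    """
    return sum(weights[name] * value for name, value in terms.items())
```

For example, with hypothetical weights `{"progress": 10.0, "centerline": 2.0, "crash": -100.0}`, a step that gains 0.05 m toward the waypoint at 0.1 m deviation without a crash scores 10 × 0.05 − 2 × 0.1 = 0.3. Keeping terms and weights in separate mappings makes it straightforward to sweep the weights without touching the term definitions.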

Experimental Setup

Experimental Environment

Platform: Genesis physics engine
Duct Configuration: procedurally generated, $R_d = 0.5$ m, 7 waypoints
Training Configuration: PPO and SAC each trained for 500 checkpoints

Evaluation Metrics

Average Reward: mean reward per episode
Waypoints Passed: number of waypoints successfully navigated
Collisions per Episode: collision frequency
Average/Maximum Deviation: mean and maximum centerline deviation
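
A minimal sketch of how these metrics could be aggregated from per-episode logs follows. The log layout (a dict with the keys shown) and the choice to average deviation over all logged steps, rather than over per-episode means, are assumptions and not details from the paper.

```python
from statistics import mean

def summarize(episodes):
    """Aggregate evaluation metrics over a list of per-episode log dicts.

    Each episode is assumed to look like:
      {"reward": float, "waypoints": int, "collisions": int,
       "deviations_m": [float, ...]}  # per-step distance from the centerline
    """
    all_deviations = [d for ep in episodes for d in ep["deviations_m"]]
    return {
        "average_reward": mean([ep["reward"] for ep in episodes]),
        "waypoints_passed": mean([ep["waypoints"] for ep in episodes]),
        "collisions_per_episode": mean([ep["collisions"] for ep in episodes]),
        "average_deviation_m": mean(all_deviations),
        "maximum_deviation_m": max(all_deviations),
    }
```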

Hyperparameter Optimization

Reward weights are tuned with Weights & Biases sweeps. For SAC, the sweep ranges for the primary guidance terms are widened to accommodate its replay-buffer characteristics.

Experimental Results

PPO Training Results

Checkpoint	50	75	100	150	200	300	400	500
Average Reward	1.3k	2.7k	4.5k	6.4k	7.2k	9.9k	10.2k	9.6k
Waypoints Passed	1/7	2/7	4/7	5/7	6/7	7/7	7/7	7/7
Collisions/Episode	1.00	0.70	0.30	0.00	0.00	0.00	0.00	0.00
Avg Deviation (m)	0.123	0.113	0.084	0.065	0.094	0.064	0.063	0.094

Key Findings:

Reaches a 100% flight completion rate (7/7 waypoints) with zero collisions by checkpoint 300; collisions per episode already drop to zero at checkpoint 150
Average centerline deviation decreases from 0.113 m at checkpoint 75 to 0.064 m at checkpoint 300, although the trend is not monotonic (0.094 m at checkpoints 200 and 500)
Optimal performance achieved at checkpoint 400 (average reward 10.2k)

SAC Training Results

Checkpoint	50	75	100	150	200	300
Average Reward	2.0k	3.0k	3.6k	4.1k	5.4k	4.4k
Waypoints Passed	0/7	1/7	2/7	3/7	3/7	3/7
Collisions/Episode	1.00	1.00	1.00	1.00	1.00	1.00

Key Findings:

Flight completion rate remains 0% at every reported checkpoint
Every episode ends in a collision (1.00 collisions per episode), so terminal failure is the standard outcome
Converges to a local optimum in which the policy passes at most 3 of the 7 waypoints before crashing

Performance Comparison Analysis

PPO Success Factors:

On-policy updates provide consistent learning signals
Capable of escaping local optima and solving end-to-end tasks
Demonstrates classical learning pattern: first mastering primary objectives, then optimizing trajectories

SAC Failure Factors:

Replay buffer becomes saturated with experiences from initial simple segments
Biased toward refining trajectory beginning, neglecting later challenges
Sample efficiency becomes counterproductive in this context

Related Work

DRL Applications in Robotics

DRL learns complex control policies through trial-and-error interaction, which suits robotic tasks that are difficult to model precisely
DRL has achieved breakthroughs in dynamic motion skill generation for legged robots

Importance of High-Fidelity Simulation

Simulation is essential for DRL research because real-world interaction is costly and carries safety risks
Techniques such as domain randomization are critical for sim-to-real transfer

Autonomous UAV Navigation

DRL has demonstrated superhuman performance in high-speed dynamic tasks such as drone racing
Navigation in confined environments presents greater challenges than open-space navigation, requiring more stable and reliable learning algorithms

Conclusions and Discussion

Main Conclusions

Stability Over Efficiency: For high-precision, safety-critical navigation tasks, the training stability of on-policy methods can be more decisive than the sample efficiency of off-policy methods
Criticality of Algorithm Selection: PPO successfully learns robust collision-free policies while SAC converges to suboptimal solutions
Limitations of Replay Buffers: SAC's replay buffer may lead to exploration bias in complex sequential tasks

Limitations

Limited Algorithm Scope: Compares only PPO and SAC algorithms
Reward Engineering Dependency: Performance heavily depends on carefully designed reward functions
Sim-to-Real Gap: Validation on actual physical systems remains absent

Future Directions

Sim-to-Real Transfer: Transfer successful PPO policies to physical UAV testing platforms
Domain Randomization: Combine domain randomization and curriculum learning to enhance policy robustness
Hybrid Algorithms: Investigate advanced algorithms that unify on-policy stability and off-policy data efficiency

In-Depth Evaluation

Strengths

Problem Specificity: Addresses practical safety-critical problems in industrial inspection
Rigorous Experimental Design: Ensures fair comparison using unified framework; procedurally generated environments enhance generalization
Clear and Compelling Conclusions: Provides clear guidance principles for algorithm selection
High Engineering Value: Offers valuable technical pathways for practical industrial applications

Weaknesses

Narrow Algorithm Coverage: Compares only two algorithms, lacking comprehensive algorithm evaluation
Insufficient Theoretical Analysis: Failure analysis primarily based on empirical observations, lacking theoretical support
Absence of Real-World Validation: All experiments conducted in simulation environments without real-world verification
Reward Design Sensitivity: Different reward weights for different algorithms may affect generalizability of conclusions

Impact

Academic Contribution: Provides empirical guidance for DRL algorithm selection in safety-critical tasks
Industrial Value: Offers technical reference for industrial inspection UAV development
Methodological Value: Validates effectiveness of procedurally generated environments in DRL training

Applicable Scenarios

High-precision, safety-critical UAV navigation tasks
Robot control in confined spaces
Reinforcement learning applications that require reliable convergence

References

The paper cites 26 relevant references covering DRL foundational theory, UAV navigation, simulation techniques, and other domains, providing solid theoretical foundation. Key references include original PPO and SAC papers, breakthrough work in drone racing, and important research on sim-to-real transfer.