Simultaneous Localization and 3D-Semi Dense Mapping for Micro Drones Using Monocular Camera and Inertial Sensors

Danial, Asher, Klein

Monocular simultaneous localization and mapping (SLAM) algorithms estimate drone poses and build a 3D map using a single camera. Current algorithms include sparse methods that lack detailed geometry, while learning-driven approaches produce dense maps but are computationally intensive. Monocular SLAM also faces scale ambiguities, which affect its accuracy. To address these challenges, we propose an edge-aware lightweight monocular SLAM system combining sparse keypoint-based pose estimation with dense edge reconstruction. Our method employs deep learning-based depth prediction and edge detection, followed by optimization to refine keypoints and edges for geometric consistency, without relying on global loop closure or heavy neural computations. We fuse inertial data with vision by using an extended Kalman filter to resolve scale ambiguity and improve accuracy. The system operates in real time on low-power platforms, as demonstrated on a DJI Tello drone with a monocular camera and inertial sensors. In addition, we demonstrate robust autonomous navigation and obstacle avoidance in indoor corridors and on the TUM RGBD dataset. Our approach offers an effective, practical solution to real-time mapping and navigation in resource-constrained environments.

Basic Information

Paper ID: 2511.14335
Title: Simultaneous Localization and 3D-Semi Dense Mapping for Micro Drones Using Monocular Camera and Inertial Sensors
Authors: Jeryes Danial (University of Haifa), Yosi Ben Asher (University of Haifa), Itzik Klein (University of Haifa)
Category: cs.RO (Robotics)
Publication Date: November 18, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.14335

Abstract

This paper addresses the challenges of simultaneous localization and mapping (SLAM) for micro drones using monocular cameras by proposing an edge-aware lightweight monocular SLAM system. The system combines sparse keypoint pose estimation with dense edge reconstruction, employing deep learning for depth prediction and edge detection, while achieving geometric consistency through optimization without relying on global loop closure or heavy neural network computation. The system fuses inertial data with visual information using an Extended Kalman Filter to address scale ambiguity and improve accuracy. Real-time operation is demonstrated on a DJI Tello drone, with robust autonomous navigation and obstacle avoidance shown in indoor corridors, and the system is also evaluated on the TUM RGB-D dataset.

Research Background and Motivation

Core Problems to Address

Sparse Map Problem: Traditional feature-point-based SLAM systems (e.g., ORB-SLAM) effectively estimate pose but generate overly sparse 3D point clouds lacking structural richness, making them unsuitable for tasks requiring dense 3D understanding.
Computational Resource Constraints: Existing learning-driven dense SLAM methods (e.g., NeRF, NICE-SLAM) are computationally intensive and difficult to run in real-time on resource-constrained embedded platforms.
Scale Ambiguity: The inherent scale uncertainty in monocular SLAM affects localization accuracy.
Global Optimization Overhead: Traditional SLAM relies on loop closure detection and global bundle adjustment, incurring significant computational costs.

Research Significance

Autonomous navigation of micro drones requires real-time, accurate 3D perception capabilities for navigation, obstacle avoidance, and environmental interaction. Achieving this on resource-constrained embedded platforms represents a core challenge in robotics.

Limitations of Existing Methods

ORB-SLAM: Generates only sparse 3D points lacking structural details
Edge SLAM: Produces semi-dense maps but relies on global optimization with high computational cost; optical flow-based tracking introduces noise
DeepTAM/D3VO: Large parameter counts and high computational complexity unsuitable for low-power devices
NeRF/NICE-SLAM: Requires high-end GPUs, assumes static scenes, lacks real-time capability

Research Motivation

Develop a lightweight, real-time SLAM system capable of generating semi-dense maps on resource-constrained platforms while maintaining high-precision pose estimation.

Core Contributions

Lightweight SLAM Pipeline: Integrates sparse epipolar geometry with dense depth prediction and edge extraction, enabling edge-anchored semi-dense map construction.
Edge Cycle Consistency Loss: Proposes multi-view edge projection consistency constraints without explicit 2D-2D edge matching.
Shape-Aware Structural Constraints: Geometric regularization based on L-shaped structures, enhancing structural consistency in indoor environments.
Local Geometric Optimization: Multi-objective bundle adjustment jointly optimizing camera poses, keypoints, and edge segments without global loop closure or dense voxel fusion.
Visual-Inertial Fusion: Uses Extended Kalman Filter to fuse inertial data, addressing scale ambiguity.

Methodology Details

Task Definition

Input:

Monocular camera image sequence
Inertial Measurement Unit (IMU) data (linear velocity, Euler angles)
Camera intrinsic matrix K

Output:

Camera pose trajectory {Ti} ∈ SE(3)
Semi-dense 3D edge map
Sparse 3D keypoint map

Constraints: Real-time requirements on resource-constrained platforms (e.g., DJI Tello drone)

System Architecture

The system employs a four-thread parallel architecture (as shown in Figure 1):

Thread 1: Image Preprocessing and Feature Extraction (Blue)

ORB Keypoint Detection: Extract ORB features and descriptors
Canny Edge Detection: Detect image edges
Depth Prediction: Use pretrained FastDepth CNN (based on MobileNet-NNConv5 architecture) to predict dense depth maps
Feature Matching: Match ORB descriptors using Hamming distance, with KD-tree acceleration for nearest neighbor search
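
The descriptor-matching step can be illustrated with a minimal NumPy sketch. It assumes 256-bit ORB-style descriptors packed into 32 bytes, and it uses brute-force search with a mutual-nearest-neighbour check rather than the KD-tree acceleration mentioned above. The function name `hamming_match` and the `max_dist` threshold are illustrative choices, not taken from the paper.

```python
import numpy as np

def hamming_match(desc_a, desc_b, max_dist=64):
    """Brute-force nearest-neighbour matching of binary descriptors.

    desc_a, desc_b: uint8 arrays of shape (N, 32) and (M, 32).
    Returns a list of (index_a, index_b, distance) tuples.
    """
    # XOR every pair of descriptors, then count the differing bits per pair.
    xor = desc_a[:, None, :] ^ desc_b[None, :, :]
    dist = np.unpackbits(xor, axis=2).sum(axis=2)
    best = dist.argmin(axis=1)
    matches = []
    for i, j in enumerate(best):
        # Keep only mutual nearest neighbours under the distance threshold.
        if dist[:, j].argmin() == i and dist[i, j] <= max_dist:
            matches.append((i, int(j), int(dist[i, j])))
    return matches

# Synthetic example: three descriptors copied from desc_b, one with a flipped bit.
rng = np.random.default_rng(0)
desc_b = rng.integers(0, 256, size=(5, 32), dtype=np.uint8)
desc_a = desc_b[[3, 0, 4]].copy()
desc_a[0, 0] ^= 0b00000001
matches = hamming_match(desc_a, desc_b)
```

The mutual check discards one-sided matches, which tends to reduce the number of outliers passed on to RANSAC in the pose-estimation thread.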

Thread 2: Pose Estimation and Sensor Fusion (Green)

Relative Pose Estimation:

Estimate essential matrix E from matched ORB features via epipolar geometry:
```
u_j^T E_ij u_i = 0
```
Use RANSAC to remove outliers; recover relative rotation R_ij and translation t_ij via singular value decomposition (SVD) of E_ij
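
As a rough illustration of the SVD step, the sketch below uses the textbook decomposition of an essential matrix into four (R, t) candidates. It is not the authors' code: the helper names are hypothetical, RANSAC is omitted, and selecting the correct candidate (normally done by checking that triangulated points lie in front of both cameras) is left out.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations (determinant +1); E is only defined up to sign.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]  # translation is recovered only up to scale and sign
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]

# Synthetic check: build E from a known rotation about z and a unit translation.
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 0.2, -0.1])
t_true /= np.linalg.norm(t_true)
E = skew(t_true) @ R_true
candidates = decompose_essential(E)
```

Because t is recovered with unit norm, the metric scale has to come from another source, which is the role of the inertial fusion described next.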

Extended Kalman Filter Fusion:

State vector:

x = [p, α]^T = [x, y, z, φ, θ, ψ]^T

where p is global position and α is Euler angles (roll, pitch, yaw)

Prediction step:

p_{k|k-1} = p_{k-1} + R_imu(α_{k-1}) · v_imu · Δt

Adaptive Process Noise:

Q_k = β · (1 - b_k + λτ) · I_6

where b_k is battery level and τ is time since last monocular update, accounting for SDK data accuracy degradation with battery depletion and time progression
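
A minimal sketch of the prediction step and the adaptive process noise follows. Several details are assumptions because the summary does not specify them: the Z-Y-X Euler convention, battery level normalized to [0, 1], Euler angles held constant during prediction, and an identity state-transition Jacobian. The numeric values of β and λ are placeholders.

```python
import numpy as np

def euler_to_rotation(roll, pitch, yaw):
    """Body-to-world rotation, assuming the Z-Y-X (yaw-pitch-roll) convention."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def adaptive_process_noise(beta, battery, lam, tau):
    """Q_k = beta * (1 - b_k + lambda * tau) * I_6, with battery in [0, 1]."""
    return beta * (1.0 - battery + lam * tau) * np.eye(6)

def ekf_predict(x, P, v_imu, dt, Q):
    """Propagate position with body-frame velocity; angles are held constant."""
    p, alpha = x[:3], x[3:]
    p_pred = p + euler_to_rotation(*alpha) @ v_imu * dt
    x_pred = np.concatenate([p_pred, alpha])
    # Assumption: the state-transition Jacobian is approximated by identity.
    P_pred = P + Q
    return x_pred, P_pred

# Drone facing +y (yaw = 90 degrees), moving forward at 1 m/s for 0.1 s.
x0 = np.array([0.0, 0.0, 0.0, 0.0, 0.0, np.pi / 2])
Q = adaptive_process_noise(beta=0.01, battery=0.5, lam=0.1, tau=2.0)
x_pred, P_pred = ekf_predict(x0, np.eye(6), np.array([1.0, 0.0, 0.0]), 0.1, Q)
```

With these placeholder values Q grows as the battery drains or as the time since the last visual update increases, so the filter trusts the SDK-driven prediction less in both cases.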

Measurement update:

Observation 1: Euler angles from SDK z_api = α_api
Observation 2: Global pose estimation from visual odometry (via accumulated relative poses)
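
The two observations can be handled by the standard Kalman update with different measurement matrices. The sketch below assumes linear measurement models (the SDK observes the three Euler angles, visual odometry observes the full six-dimensional pose) and ignores angle wrap-around. The noise covariance is a placeholder, not a value from the paper.

```python
import numpy as np

def ekf_update(x, P, z, H, R):
    """Standard Kalman measurement update for a linear observation z = H x + noise."""
    y = z - H @ x                       # innovation
    S = H @ P @ H.T + R                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Observation 1: Euler angles from the SDK (last three state entries).
H_api = np.hstack([np.zeros((3, 3)), np.eye(3)])
# Observation 2: full pose from accumulated visual odometry.
H_vo = np.eye(6)

x = np.zeros(6)
P = np.eye(6)
z_api = np.array([0.1, 0.0, 0.0])
x1, P1 = ekf_update(x, P, z_api, H_api, 0.01 * np.eye(3))
```

In this example only the angle block of the state and covariance changes, since H_api does not observe position.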

Thread 3: Dense Edge Map and 3D Anchor Point Generation (Yellow)

Reconstruct 3D points (anchors) via triangulation using depth maps and estimated camera poses:

P^k* = argmin_P ||u_i^k - π(K P)||^2 + ||u_j^k - π(K[R_ij* P + t_ij*])||^2
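
The objective above is a nonlinear least-squares problem in P. A common way to obtain a starting point is linear (DLT) triangulation, sketched below as a stand-in technique rather than the paper's solver. The point is expressed in frame i, so the two projection matrices are K[I | 0] and K[R_ij | t_ij]. Intrinsics and poses are made-up test values.

```python
import numpy as np

def project(K, R, t, P):
    """Pinhole projection of a 3D point P after the rigid transform (R, t)."""
    p = K @ (R @ P + t)
    return p[:2] / p[2]

def triangulate_dlt(K, R_ij, t_ij, u_i, u_j):
    """Linear triangulation of one point from two views (frame i is the reference)."""
    M_i = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    M_j = K @ np.hstack([R_ij, t_ij.reshape(3, 1)])
    # Each view contributes two linear constraints on the homogeneous point.
    A = np.vstack([
        u_i[0] * M_i[2] - M_i[0],
        u_i[1] * M_i[2] - M_i[1],
        u_j[0] * M_j[2] - M_j[0],
        u_j[1] * M_j[2] - M_j[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R_ij = np.eye(3)
t_ij = np.array([-0.2, 0.0, 0.0])
P_true = np.array([0.3, -0.1, 2.0])
u_i = project(K, np.eye(3), np.zeros(3), P_true)
u_j = project(K, R_ij, t_ij, P_true)
P_est = triangulate_dlt(K, R_ij, t_ij, u_i, u_j)
```

With noise-free observations the linear solution coincides with the minimizer of the reprojection objective; with noisy data it would typically be refined by the local optimization in Thread 4.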

Thread 4: Edge-Aware Local Optimization (Pink)

Multi-Loss Function Design:

Reprojection Loss (sparse keypoints):

L_reproj = Σ_i,k ||u_ik - u_ik^proj||^2

where u_ik^proj = π(R_i P^k + t_i)

Cycle Consistency Loss (dense edge points): Verify edge point consistency through closed-loop transformation:

P_i = π^{-1}(u_i*, d_i) → P_j = T_{i,j} · P_i → u_j = π(P_j)
→ P'_j = π^{-1}(u_j, d_j) → P'_i = T_{i,j}^{-1} · P'_j → u'_i = π(P'_i)

L_cycle = Σ_{u_i* ∈ E} ||u_i* - u'_i||^2
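
The closed-loop chain can be written out directly. The sketch below assumes a pinhole model, a 4x4 homogeneous transform T_ij from frame i to frame j, and a hypothetical callable `depth_j` that returns the predicted depth at a pixel of frame j. How depth is sampled at sub-pixel locations is not specified in the summary.

```python
import numpy as np

def backproject(K, u, d):
    """pi^{-1}: pixel u with depth d -> 3D point in the camera frame."""
    return d * (np.linalg.inv(K) @ np.array([u[0], u[1], 1.0]))

def project(K, P):
    """pi: 3D point in the camera frame -> pixel."""
    p = K @ P
    return p[:2] / p[2]

def cycle_residual(K, T_ij, u_i, d_i, depth_j):
    """Return u_i - u'_i for one edge pixel."""
    R, t = T_ij[:3, :3], T_ij[:3, 3]
    P_i = backproject(K, u_i, d_i)
    P_j = R @ P_i + t
    u_j = project(K, P_j)
    P_j_prime = backproject(K, u_j, depth_j(u_j))
    P_i_prime = R.T @ (P_j_prime - t)   # apply T_ij^{-1}
    return u_i - project(K, P_i_prime)

def cycle_loss(K, T_ij, edge_pixels, depths_i, depth_j):
    """Sum of squared cycle residuals over all edge pixels."""
    return sum(float(np.sum(cycle_residual(K, T_ij, u, d, depth_j) ** 2))
               for u, d in zip(edge_pixels, depths_i))

# Fronto-parallel wall at 2 m; the camera translates 0.1 m sideways.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_ij = np.eye(4)
T_ij[0, 3] = 0.1
edge_pixels = [np.array([400.0, 260.0]), np.array([100.0, 50.0])]
depths_i = [2.0, 2.0]
loss_ok = cycle_loss(K, T_ij, edge_pixels, depths_i, lambda u: 2.0)
loss_bad = cycle_loss(K, T_ij, edge_pixels, depths_i, lambda u: 2.5)
```

When the depth in frame j agrees with the geometry the loop closes and the loss vanishes. An inconsistent depth (2.5 m instead of 2 m here) shifts each re-projected pixel by 5 px, which is what the loss penalizes without ever matching edges between the two images.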

L-Shaped Structure Loss (geometric regularization):

Angle Consistency:

L_angle = (1/N) Σ_i (cos(θ_proj^(i)) - cos(θ_expected^(i)))^2

Collinearity Constraint:

L_collinear = (1/N) Σ_i [(1/M_1^(i)) Σ_j d_{j,1}^2 + (1/M_2^(i)) Σ_k d_{k,2}^2]

Combined Loss:

L_Lshape = λ_θ L_angle + λ_col L_collinear
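
One possible reading of the two terms is sketched below. The summary does not say how arms are fitted or how d is measured, so the following are assumptions: each L-shape is given as two sets of 3D points (its arms), each arm is fitted with a least-squares line, d is the point-to-line distance, and the expected angle is 90 degrees (cosine 0).

```python
import numpy as np

def fit_line(points):
    """Least-squares 3D line through points: returns (centroid, unit direction)."""
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c)
    return c, Vt[0]

def point_line_sq_dist(points, c, d):
    """Squared distance of each point to the line through c with direction d."""
    diff = points - c
    perp = diff - np.outer(diff @ d, d)
    return np.sum(perp ** 2, axis=1)

def l_shape_loss(shapes, cos_expected=0.0, lam_theta=1.0, lam_col=1.0):
    """lam_theta * L_angle + lam_col * L_collinear over a list of (arm1, arm2) pairs."""
    angle_terms, col_terms = [], []
    for arm1, arm2 in shapes:
        c1, d1 = fit_line(arm1)
        c2, d2 = fit_line(arm2)
        # abs() removes the sign ambiguity of the fitted directions.
        angle_terms.append((abs(d1 @ d2) - cos_expected) ** 2)
        col_terms.append(point_line_sq_dist(arm1, c1, d1).mean()
                         + point_line_sq_dist(arm2, c2, d2).mean())
    return lam_theta * np.mean(angle_terms) + lam_col * np.mean(col_terms)

arm_x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
arm_y = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
arm_diag = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 2.0, 0.0]])
loss_right_angle = l_shape_loss([(arm_x, arm_y)])
loss_skewed = l_shape_loss([(arm_x, arm_diag)])
```

A perfect right-angle corner gives zero loss, while a 45-degree corner with perfectly straight arms is penalized only through the angle term (cos² 45° = 0.5).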

Total Optimization Objective:

min_{P_w, T_w, D_w} L_total = λ_reproj L_reproj + λ_cycle L_cycle + λ_shape L_Lshape

Optimization Algorithm: Levenberg-Marquardt algorithm for solving nonlinear least squares problems, balancing Gauss-Newton and gradient descent
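
The balance between Gauss-Newton and gradient descent comes from the damping factor μ in the normal equations. The sketch below is a generic Levenberg-Marquardt loop on a toy curve-fitting problem, not the paper's bundle-adjustment solver. The update schedule for μ (halve on success, multiply by 10 on failure) is one common heuristic.

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, mu=1e-2, iters=50):
    """Minimize sum(residual(x)**2) with a damped Gauss-Newton iteration."""
    x = np.asarray(x0, dtype=float)
    cost = np.sum(residual(x) ** 2)
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        # Damped normal equations: (J^T J + mu I) dx = -J^T r
        dx = np.linalg.solve(J.T @ J + mu * np.eye(len(x)), -J.T @ r)
        new_cost = np.sum(residual(x + dx) ** 2)
        if new_cost < cost:
            # Step accepted: reduce damping, behaving more like Gauss-Newton.
            x, cost, mu = x + dx, new_cost, mu * 0.5
        else:
            # Step rejected: increase damping, behaving more like gradient descent.
            mu *= 10.0
    return x

# Toy problem: recover (a, b) in y = a * exp(b * t) from noise-free samples.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
x_fit = levenberg_marquardt(residual, jacobian, [1.0, 0.0])
```

The μI term also keeps the linear system solvable when J^T J is singular or badly conditioned, for example when parallax is low.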

Technical Innovations

Edge-Aware Semi-Dense Mapping: Combines sparse keypoints and dense edges, balancing computational efficiency and map detail.
No Explicit Edge Matching: The cycle consistency loss avoids a complex edge correspondence search.
Structure-Aware Regularization: Leverages L-shaped geometric priors in indoor environments to enhance reconstruction quality.
Local Optimization Strategy: Eliminates global loop closure detection, reducing computational complexity.
Adaptive Sensor Fusion: Process noise modeling considering battery level and time.

Strategies for Addressing Optimization Challenges

Nonlinear Problems: Use regularization and Levenberg-Marquardt algorithm for stable convergence
Singularity: Diagonal regularization (μI) ensures invertibility
Ill-Conditioned Jacobian: Enhance parallax through oblique camera motion (e.g., zigzag trajectories)
Loss Imbalance: Adaptive weight adjustment based on uncertainty

Experimental Setup

Datasets

TUM RGB-D Benchmark Dataset
- 23 indoor sequences, 2-10 minutes duration
- Synchronized RGB-D images with ground truth poses
- Diverse motion patterns, viewpoints, and lighting conditions
- Released by TUM CVPR team under Creative Commons license
Depth Estimation Training Set
- FastDepth model pretrained on NYU Depth v2 dataset
- MobileNet as backbone network
- Depth-wise separable convolutions for reduced complexity
Real-World Test Platform
- DJI Tello drone
- Monocular camera + inertial sensors
- Indoor corridor environment

Evaluation Metrics

Absolute Pose Error (APE):

APE_i = ||t_est^i - t_gt^i||_2

Measures instantaneous Euclidean distance error at each timestamp

Absolute Trajectory Error (ATE):

ATE_RMS = sqrt((1/N) Σ_i ||(T_gt^i)^{-1} T_est^i||_F^2)

Evaluates global drift over entire sequence (including translation and rotation)
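
The summary statistics reported in the results table (RMSE, mean, standard deviation) can be computed from the per-frame errors as sketched below. This covers only the translational part, assumes the two trajectories are already time-synchronized and aligned, and uses made-up positions.

```python
import numpy as np

def ape(t_est, t_gt):
    """Per-timestamp Euclidean position error, APE_i = ||t_est^i - t_gt^i||_2."""
    return np.linalg.norm(t_est - t_gt, axis=1)

def summarize(errors):
    """RMSE, mean and (population) standard deviation of the error samples."""
    return {"rmse": float(np.sqrt(np.mean(errors ** 2))),
            "mean": float(np.mean(errors)),
            "std": float(np.std(errors))}

t_gt = np.zeros((4, 3))
t_est = np.array([[0.03, 0.0, 0.0],
                  [0.0, 0.04, 0.0],
                  [0.0, 0.0, 0.05],
                  [0.0, 0.0, 0.0]])
errors = ape(t_est, t_gt)
stats = summarize(errors)
```

For statistics computed this way, rmse² = mean² + std², so the standard deviation can never exceed the RMSE of the same samples.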

Comparison Methods

ORB-SLAM2: Baseline method, representing traditional sparse feature SLAM

Implementation Details

Platform: Ubuntu 16.04 laptop
Depth Network: Pretrained FastDepth (MobileNet-NNConv5)
Feature Detection: ORB + Canny edge detection
Optimization Window: Local sliding window bundle adjustment
Weight Parameters: λ_reproj, λ_cycle, λ_shape (specific values not provided in paper)
EKF Parameters: β, λ for adaptive process noise

Experimental Results

Main Results

Quantitative Evaluation on TUM RGB-D Dataset (Table I):

Method	RMSE (m)	Mean (m)	Std (m)
ORB-SLAM2 (baseline)	0.182	0.17	0.71
Edge-Aware SLAM (this work)	0.046	0.040	0.011
Improvement	74.7%	76.5%	98.4%

Key Findings:

74.7% RMSE reduction demonstrates significant trajectory accuracy improvement
98.4% standard deviation reduction indicates more stable pose estimation
76.5% mean error reduction shows reduced systematic bias
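
The improvement percentages follow from the table values as relative reductions, (baseline - proposed) / baseline. The short check below recomputes them; the dictionary layout and the function name are only for illustration.

```python
baseline = {"rmse": 0.182, "mean": 0.17, "std": 0.71}
proposed = {"rmse": 0.046, "mean": 0.040, "std": 0.011}

def reduction_pct(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before

improvement = {k: round(reduction_pct(baseline[k], proposed[k]), 1) for k in baseline}
# improvement == {"rmse": 74.7, "mean": 76.5, "std": 98.5}
```

The RMSE and mean figures match the table. The standard deviation reduction evaluates to 98.45%, which rounds to 98.5%, so the reported 98.4% appears to be truncated rather than rounded. Separately, the baseline standard deviation of 0.71 m is larger than its RMSE of 0.182 m, which cannot hold if all three statistics describe the same error samples, so that entry may be a typo in the source.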

Qualitative Map Evaluation

Early-Stage Mapping (Figure 4):

Proposed method generates clear, accurate 3D edge maps from initial frames
ORB-SLAM2 point clouds show poor interpretability in early stages

Complete Sequence Mapping (Figure 5):

Proposed method maintains high precision after processing complete sequences without drift
ORB-SLAM2 maps show lower clarity and interpretability

Laboratory Environment (Figure 6):

Proposed method maintains high-precision 3D edge maps from sequence start to end
No drift or error accumulation, validating system robustness and reliability

Computational Efficiency

Key Performance Indicators:

ORB-based edge map creation approximately 100 times faster than ORB-SLAM
Supports deployment on small hardware like Raspberry Pi Zero
Achieves true real-time processing

Experimental Findings

Edge Enhancement Advantages: Semi-dense edge maps provide richer structural information than sparse point clouds
Local Optimization Effectiveness: Maintains long-term consistency without global loop closure
Sensor Fusion Value: EKF fusion effectively addresses monocular scale ambiguity
Lightweight Deep Learning: FastDepth maintains accuracy while meeting real-time requirements
Structural Prior Impact: L-shaped constraints significantly improve reconstruction quality in indoor environments

Related Work

Traditional SLAM Methods

ORB-SLAM Series: Classical sparse feature-based methods relying on global optimization
Voxel Map: Improved retrieval and visibility reasoning, but still sparse
SfM: Foundational technique for 3D structure reconstruction from multiple images

Visual-Inertial Odometry

EKF-Based Methods: Fast and efficient pose estimation (e.g., VINS-Mono, MSCKF-DVIO)
Limitations: Typically generate sparse 3D point clouds

Learning-Driven Dense SLAM

DeepTAM: Dense depth maps from deep neural networks, but limited accuracy and high computation
D3VO: High precision but complex models unsuitable for low-power devices
NeRF/NICE-SLAM: High-fidelity reconstruction requiring high-end GPUs and static scenes
NeuralRecon: Fuses depth and pose, but is computationally infeasible on low-power platforms

Edge SLAM

Edge SLAM: Generates semi-dense maps but relies on global optimization; optical flow-based tracking introduces noise

Advantages of This Work

Combines traditional geometric methods with lightweight deep learning
Replaces global loop closure with local optimization
Suitable for real-time execution on resource-constrained platforms

Conclusions and Discussion

Main Conclusions

The proposed edge-aware SLAM system achieves real-time, accurate 3D mapping on resource-constrained platforms
Compared to ORB-SLAM2, trajectory RMSE improves by 74.7% (0.182 m to 0.046 m)
Generated semi-dense maps are more accurate and detailed
Processing speed approximately 100 times faster than ORB-SLAM, supporting embedded deployment

Limitations

Environmental Assumptions: L-shaped structure constraints primarily suit indoor artificial environments; may not apply to natural scenes
Depth Dependency: Relies on pretrained FastDepth model; performance may degrade on out-of-distribution scenes
Dynamic Scenes: Paper does not explicitly address dynamic object handling
Parameter Tuning: Multiple weight parameters (λ_reproj, λ_cycle, λ_shape) require manual adjustment
Long-Term Drift: While local consistency is good, lack of global loop closure may accumulate errors in ultra-long sequences
Insufficient Quantitative Analysis: Comparison only with ORB-SLAM2; lacks comparison with other modern methods

Future Directions

While not explicitly stated, potential directions include:

Extension to outdoor and unstructured environments
Integration of lightweight loop closure detection
Handling dynamic objects and occlusions
Adaptive weight learning
Multi-sensor fusion (e.g., LiDAR)

In-Depth Evaluation

Strengths

Technical Innovation:

Hybrid Architecture Design: Cleverly combines sparse geometry and dense learning, balancing accuracy and efficiency
Cycle Consistency Loss: Innovative constraint design avoiding explicit edge matching
Structure-Aware Regularization: Leverages environmental priors to enhance reconstruction quality
Adaptive Sensor Fusion: Process noise modeling considering battery level has practical significance

Experimental Sufficiency:

Validation on standard dataset (TUM RGB-D) and real platform (DJI Tello)
Quantitative and qualitative results mutually reinforce
Reports a computational efficiency gain (approximately 100× speedup), though without a detailed breakdown

Result Convincingness:

74.7% RMSE improvement is significant
98.4% standard deviation reduction demonstrates improved stability
Visualization clearly shows semi-dense map advantages

Writing Clarity:

Clear problem definition and rigorous mathematical derivations
Intuitive system architecture diagrams
Four-thread design easy to understand

Weaknesses

Method Limitations:

Generalization Capability: L-shaped constraints limit application scope
Long-Term Consistency: Lack of global loop closure may cause issues in large-scale scenarios
Depth Quality Dependency: FastDepth may fail in certain scenarios

Experimental Setup Defects:

Single Comparison Method: Only compared with ORB-SLAM2; lacks comparison with Edge SLAM, VINS-Mono, etc.
Missing Parameter Settings: Does not provide values for λ_reproj, λ_cycle, λ_shape
Insufficient Ablation Studies: Does not separately analyze contribution of each loss term
Dataset Limitations: Primarily tested in indoor scenes; outdoor performance unknown

Insufficient Analysis:

Failure Cases: Does not discuss method failure scenarios
Computational Analysis: Lacks detailed time and memory consumption analysis
Robustness Testing: Does not test sensitivity to noise, occlusion, lighting changes
Theoretical Analysis: Lacks convergence guarantees and error bounds

Impact

Contribution to Field:

Provides practical SLAM solution for resource-constrained platforms
Demonstrates potential of combining traditional methods with lightweight deep learning
Edge-aware mapping approach can inspire subsequent research

Practical Value:

Successful deployment on DJI Tello demonstrates practicality
100× speedup enables embedded applications
Semi-dense maps suitable for navigation and obstacle avoidance

Reproducibility:

Moderate: Paper provides method details but lacks code, complete parameter settings, and training details
Use of public FastDepth model aids reproduction
Clear four-thread architecture but implementation details need supplementation

Applicable Scenarios

Suitable Applications:

Indoor Drone Navigation: Corridors, warehouses, building interiors
Resource-Constrained Robots: Low-power mobile platforms
Real-Time Obstacle Avoidance: Fast-response scenarios
Structured Environments: Artificial buildings, industrial facilities

Unsuitable Scenarios:

Outdoor Natural Environments: Lack L-shaped structures
Highly Dynamic Scenes: Fast-moving objects
Ultra-Large-Scale Mapping: Lack global loop closure
High-Precision Applications: Precision measurement (trajectory RMSE is still 4.6 cm)

References

Key Citations:

ORB-SLAM Series: Classical sparse SLAM baseline
FastDepth (Wofk et al., ICRA 2019): Lightweight depth estimation network
TUM RGB-D (Sturm et al., 2012): Standard SLAM evaluation dataset
Bundle Adjustment (Triggs et al., 1999): Classical optimization technique
Epipolar Geometry (Zhang, 1998): Foundational epipolar geometry theory
Extended Kalman Filter: Standard sensor fusion method
Edge SLAM (Maity et al., ICCV 2017): Pioneer work in edge SLAM
NeRF/NICE-SLAM: Learning-based dense reconstruction methods

Overall Assessment: This is a practical SLAM research work targeting resource-constrained platforms, with a reasonable technical approach and convincing experimental results. Its main contributions lie in system engineering and method integration rather than in a single algorithmic breakthrough. The 74.7% accuracy improvement and 100× speedup have practical value. However, the paper has room for improvement in experimental comparison, ablation analysis, and theoretical depth. It is suitable for publication in robotics application conferences or journals.