Simultaneous Localization and 3D-Semi Dense Mapping for Micro Drones Using Monocular Camera and Inertial Sensors

Danial, Asher, Klein
Monocular simultaneous localization and mapping (SLAM) algorithms estimate drone poses and build a 3D map using a single camera. Current algorithms include sparse methods that lack detailed geometry, while learning-driven approaches produce dense maps but are computationally intensive. Monocular SLAM also faces scale ambiguities, which affect its accuracy. To address these challenges, we propose an edge-aware lightweight monocular SLAM system combining sparse keypoint-based pose estimation with dense edge reconstruction. Our method employs deep learning-based depth prediction and edge detection, followed by optimization to refine keypoints and edges for geometric consistency, without relying on global loop closure or heavy neural computations. We fuse inertial data with vision by using an extended Kalman filter to resolve scale ambiguity and improve accuracy. The system operates in real time on low-power platforms, as demonstrated on a DJI Tello drone with a monocular camera and inertial sensors. In addition, we demonstrate robust autonomous navigation and obstacle avoidance in indoor corridors and on the TUM RGBD dataset. Our approach offers an effective, practical solution to real-time mapping and navigation in resource-constrained environments.

Basic Information

  • Paper ID: 2511.14335
  • Title: Simultaneous Localization and 3D-Semi Dense Mapping for Micro Drones Using Monocular Camera and Inertial Sensors
  • Authors: Jeryes Danial (University of Haifa), Yosi Ben Asher (University of Haifa), Itzik Klein (University of Haifa)
  • Category: cs.RO (Robotics)
  • Publication Date: November 18, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.14335

Abstract

This paper addresses the challenges of simultaneous localization and mapping (SLAM) for micro drones using monocular cameras by proposing an edge-aware lightweight monocular SLAM system. The system combines sparse keypoint pose estimation with dense edge reconstruction, employing deep learning for depth prediction and edge detection, while achieving geometric consistency through optimization without relying on global loop closure or heavy neural network computation. The system fuses inertial data with visual information using an Extended Kalman Filter to address scale ambiguity and improve accuracy. Real-time operation is demonstrated on a DJI Tello drone, and robust autonomous navigation and obstacle avoidance are shown in indoor corridors and on the TUM RGB-D dataset.

Research Background and Motivation

Core Problems to Address

  1. Sparse Map Problem: Traditional feature-point-based SLAM systems (e.g., ORB-SLAM) effectively estimate pose but generate overly sparse 3D point clouds lacking structural richness, making them unsuitable for tasks requiring dense 3D understanding.
  2. Computational Resource Constraints: Existing learning-driven dense SLAM methods (e.g., NeRF, NICE-SLAM) are computationally intensive and difficult to run in real-time on resource-constrained embedded platforms.
  3. Scale Ambiguity: The inherent scale uncertainty in monocular SLAM affects localization accuracy.
  4. Global Optimization Overhead: Traditional SLAM relies on loop closure detection and global bundle adjustment, incurring significant computational costs.

Research Significance

Autonomous navigation of micro drones requires real-time, accurate 3D perception capabilities for navigation, obstacle avoidance, and environmental interaction. Achieving this on resource-constrained embedded platforms represents a core challenge in robotics.

Limitations of Existing Methods

  • ORB-SLAM: Generates only sparse 3D points lacking structural details
  • Edge SLAM: Produces semi-dense maps but relies on global optimization with high computational cost; optical flow-based tracking introduces noise
  • DeepTAM/D3VO: Large parameter counts and high computational complexity unsuitable for low-power devices
  • NeRF/NICE-SLAM: Requires high-end GPUs, assumes static scenes, lacks real-time capability

Research Motivation

Develop a lightweight, real-time SLAM system capable of generating semi-dense maps on resource-constrained platforms while maintaining high-precision pose estimation.

Core Contributions

  1. Lightweight SLAM Pipeline: Integrates sparse epipolar geometry with dense depth prediction and edge extraction, enabling edge-anchored semi-dense map construction.
  2. Edge Cycle Consistency Loss: Proposes multi-view edge projection consistency constraints without explicit 2D-2D edge matching.
  3. Shape-Aware Structural Constraints: Geometric regularization based on L-shaped structures, enhancing structural consistency in indoor environments.
  4. Local Geometric Optimization: Multi-objective bundle adjustment jointly optimizing camera poses, keypoints, and edge segments without global loop closure or dense voxel fusion.
  5. Visual-Inertial Fusion: Uses Extended Kalman Filter to fuse inertial data, addressing scale ambiguity.

Methodology Details

Task Definition

Input:

  • Monocular camera image sequence
  • Inertial Measurement Unit (IMU) data (linear velocity, Euler angles)
  • Camera intrinsic matrix K

Output:

  • Camera pose trajectory {Ti} ∈ SE(3)
  • Semi-dense 3D edge map
  • Sparse 3D keypoint map

Constraints: Real-time requirements on resource-constrained platforms (e.g., DJI Tello drone)

System Architecture

The system employs a four-thread parallel architecture (as shown in Figure 1):

Thread 1: Image Preprocessing and Feature Extraction (Blue)

  1. ORB Keypoint Detection: Extract ORB features and descriptors
  2. Canny Edge Detection: Detect image edges
  3. Depth Prediction: Use pretrained FastDepth CNN (based on MobileNet-NNConv5 architecture) to predict dense depth maps
  4. Feature Matching: Match ORB descriptors using Hamming distance, with KD-tree acceleration for nearest neighbor search
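The matching step can be illustrated with a minimal sketch. It uses an exhaustive search rather than the KD-tree acceleration mentioned above, and the distance threshold and ratio test are assumptions added for illustration, not values taken from the paper. Descriptors are modelled as equal-length `bytes` objects.

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def match_descriptors(query, train, max_distance=64, ratio=0.8):
    """Brute-force nearest-neighbour matching on Hamming distance.

    max_distance and ratio are illustrative assumptions.
    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    for qi, q in enumerate(query):
        # Rank every candidate by distance; ties are broken by index.
        ranked = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        if not ranked:
            continue
        best_distance, best_index = ranked[0]
        if best_distance > max_distance:
            continue  # best candidate is still too far away
        if len(ranked) > 1 and best_distance >= ratio * ranked[1][0]:
            continue  # ambiguous: second-best candidate is nearly as close
        matches.append((qi, best_index, best_distance))
    return matches
```

A real pipeline would replace the exhaustive loop with an index structure suited to binary descriptors; the sketch only shows the distance measure and the accept/reject logic.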

Thread 2: Pose Estimation and Sensor Fusion (Green)

Relative Pose Estimation:

  • Estimate essential matrix E from matched ORB features via epipolar geometry:
    u_j^T E_ij u_i = 0
    
  • Use RANSAC to remove outliers; recover relative rotation R_ij and translation t_ij from E via singular value decomposition (SVD)
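As a sanity check on the epipolar constraint above, the following sketch builds an essential matrix from a known relative pose and evaluates u_j^T E u_i for a synthetic point. It assumes the convention P_j = R P_i + t and normalized homogeneous image coordinates (intrinsics already removed); both are assumptions of this sketch, and the helper names are hypothetical.

```python
import math


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) applied to v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def essential_from_pose(R, t):
    """E = [t]_x R under the assumed convention P_j = R P_i + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, u_i, u_j):
    """u_j^T E u_i; zero for a geometrically consistent correspondence."""
    Eu = matvec(E, u_i)
    return sum(a * b for a, b in zip(u_j, Eu))


def normalize(P):
    """Perspective division to normalized homogeneous coordinates (x/z, y/z, 1)."""
    return [P[0] / P[2], P[1] / P[2], 1.0]
```

In a RANSAC loop, the magnitude of this residual (or a distance derived from it) is what separates inliers from outliers for each candidate E.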

Extended Kalman Filter Fusion:

State vector:

x = [p, α]^T = [x, y, z, φ, θ, ψ]^T

where p is global position and α is Euler angles (roll, pitch, yaw)

Prediction step:

p_{k|k-1} = p_{k-1} + R_imu(α_{k-1}) · v_imu · Δt
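The prediction step can be sketched as follows. The summary does not state which Euler-angle convention R_imu uses, so the ZYX (yaw-pitch-roll) order below is an assumption, as are the function names.

```python
import math


def rotation_from_euler(roll, pitch, yaw):
    """Body-to-world rotation R = Rz(yaw) Ry(pitch) Rx(roll) (assumed convention)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def predict_position(p_prev, euler_prev, v_imu, dt):
    """p_{k|k-1} = p_{k-1} + R_imu(alpha_{k-1}) * v_imu * dt."""
    R = rotation_from_euler(*euler_prev)
    return [
        p_prev[i] + dt * sum(R[i][j] * v_imu[j] for j in range(3))
        for i in range(3)
    ]
```

For example, with a yaw of 90 degrees a forward body velocity moves the predicted position along the world y axis rather than x.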

Adaptive Process Noise:

Q_k = β · (1 - b_k + λτ) · I_6

where b_k is battery level and τ is time since last monocular update, accounting for SDK data accuracy degradation with battery depletion and time progression
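A minimal sketch of the adaptive process noise, assuming the battery level b_k is normalized to the range [0, 1] (the summary does not state its units). The parameter values in any call are placeholders, since the paper's β and λ are not given here.

```python
def adaptive_process_noise(beta, battery_level, lam, tau, dim=6):
    """Q_k = beta * (1 - b_k + lam * tau) * I_dim as a nested list.

    battery_level is assumed to lie in [0, 1]; tau is the time since the
    last monocular update.
    """
    if not 0.0 <= battery_level <= 1.0:
        raise ValueError("battery_level is assumed to be normalized to [0, 1]")
    scale = beta * (1.0 - battery_level + lam * tau)
    return [[scale if i == j else 0.0 for j in range(dim)] for i in range(dim)]
```

The noise grows both as the battery drains and as the time since the last visual update increases, so the filter trusts the inertial prediction less in exactly those conditions.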

Measurement update:

  • Observation 1: Euler angles from SDK z_api = α_api
  • Observation 2: Global pose estimation from visual odometry (via accumulated relative poses)
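Both observations measure state components directly, so a simplified view of the update is a per-component Kalman correction. The sketch below assumes a diagonal covariance and an identity measurement model, which is a simplification for illustration and not necessarily how the paper implements the filter; the angle wrapping for Euler-angle components is also an addition of this sketch.

```python
import math


def wrap_angle(a):
    """Map an angle in radians to the interval [-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))


def scalar_update(x, var, z, meas_var, is_angle=False):
    """Kalman update for one directly observed state component.

    x, var      : predicted value and its variance
    z, meas_var : measurement and its variance
    Returns the corrected value and variance.
    """
    innovation = z - x
    if is_angle:
        innovation = wrap_angle(innovation)  # avoid a jump across +/- pi
    gain = var / (var + meas_var)
    x_new = x + gain * innovation
    if is_angle:
        x_new = wrap_angle(x_new)
    return x_new, (1.0 - gain) * var
```

A full implementation would use the 6x6 covariance and handle the two observation sources with their own measurement noise.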

Thread 3: Dense Edge Map and 3D Anchor Point Generation (Yellow)

Reconstruct 3D points (anchors) via triangulation using depth maps and estimated camera poses:

P^k* = argmin_P ||u_i^k - π(K P)||^2 + ||u_j^k - π(K[R_ij* P + t_ij*])||^2
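The objective above can be evaluated directly for a candidate point, which is useful for checking a solver or an initial guess taken from the depth map. The sketch below only computes the cost; it does not perform the minimization. The function names and the example intrinsics used with it are hypothetical.

```python
def project(K, P):
    """Pinhole projection pi(K P): apply intrinsics, then divide by depth."""
    x = [sum(K[i][k] * P[k] for k in range(3)) for i in range(3)]
    return (x[0] / x[2], x[1] / x[2])


def transform(R, t, P):
    """Rigid transform R P + t."""
    return [sum(R[i][k] * P[k] for k in range(3)) + t[i] for i in range(3)]


def two_view_cost(P, u_i, u_j, K, R_ij, t_ij):
    """||u_i - pi(K P)||^2 + ||u_j - pi(K [R_ij P + t_ij])||^2."""
    p_i = project(K, P)
    p_j = project(K, transform(R_ij, t_ij, P))
    return ((u_i[0] - p_i[0]) ** 2 + (u_i[1] - p_i[1]) ** 2
            + (u_j[0] - p_j[0]) ** 2 + (u_j[1] - p_j[1]) ** 2)
```

A point with the correct depth yields zero cost, while a depth error shows up as a reprojection error in both views.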

Thread 4: Edge-Aware Local Optimization (Pink)

Multi-Loss Function Design:

  1. Reprojection Loss (sparse keypoints):
L_reproj = Σ_{i,k} ||u_ik - u_ik^proj||^2

where u_ik^proj = π(R_i P^k + t_i)

  2. Cycle Consistency Loss (dense edge points): Verify edge point consistency through closed-loop transformation:
P_i = π^{-1}(u_i*, d_i) → P_j = T_{i,j} · P_i → u_j = π(P_j)
→ P'_j = π^{-1}(u_j, d_j) → P'_i = T_{i,j}^{-1} · P'_j → u'_i = π(P'_i)

L_cycle = Σ_{u_i* ∈ E} ||u_i* - u'_i||^2

  3. L-Shaped Structure Loss (geometric regularization):
    • Angle Consistency:
    L_angle = (1/N) Σ_i (cos(θ_proj^(i)) - cos(θ_expected^(i)))^2

    • Collinearity Constraint:
    L_collinear = (1/N) Σ_i [(1/M_1^(i)) Σ_j d_{j,1}^2 + (1/M_2^(i)) Σ_k d_{k,2}^2]

    • Combined Loss:
    L_Lshape = λ_θ L_angle + λ_col L_collinear
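The L-shaped structure loss can be sketched as follows. The data layout is hypothetical: each L-shape is a dictionary holding a corner, two arm directions, the edge points assigned to each arm, and the expected angle. Distances d are taken as point-to-line distances from each arm's line through the corner, which is one plausible reading of the collinearity term rather than a confirmed detail of the paper.

```python
import math


def cos_angle(d1, d2):
    """Cosine of the angle between two direction vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    n1 = math.sqrt(sum(a * a for a in d1))
    n2 = math.sqrt(sum(b * b for b in d2))
    return dot / (n1 * n2)


def point_line_distance_sq(p, origin, direction):
    """Squared distance from p to the line origin + s * direction."""
    diff = [a - b for a, b in zip(p, origin)]
    s = sum(a * d for a, d in zip(diff, direction)) / sum(d * d for d in direction)
    return sum((a - s * d) ** 2 for a, d in zip(diff, direction))


def l_shape_loss(shapes, lam_theta=1.0, lam_col=1.0):
    """lam_theta * L_angle + lam_col * L_collinear, averaged over all shapes."""
    angle_sum, col_sum = 0.0, 0.0
    for s in shapes:
        angle_sum += (cos_angle(s["dir1"], s["dir2"])
                      - math.cos(s["expected_angle"])) ** 2
        for pts, direction in ((s["pts1"], s["dir1"]), (s["pts2"], s["dir2"])):
            col_sum += sum(point_line_distance_sq(p, s["corner"], direction)
                           for p in pts) / len(pts)
    n = len(shapes)
    return lam_theta * angle_sum / n + lam_col * col_sum / n
```

The weights default to 1.0 only as placeholders, since the paper's λ_θ and λ_col are not given in this summary.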
    

Total Optimization Objective:

min_{P_w, T_w, D_w} L_total = λ_reproj L_reproj + λ_cycle L_cycle + λ_shape L_Lshape
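The cycle consistency term in this objective follows the chain of back-projections and transforms listed above. A minimal sketch, assuming a pinhole camera with intrinsics (fx, fy, cx, cy), T_{i,j} given as a rotation and translation, and depth maps exposed as callables that return the depth at a pixel (a hypothetical interface):

```python
def backproject(u, depth, fx, fy, cx, cy):
    """pi^{-1}(u, d): lift pixel u = (col, row) with depth d to a 3D point."""
    return [(u[0] - cx) / fx * depth, (u[1] - cy) / fy * depth, depth]


def project(P, fx, fy, cx, cy):
    """pi(P): pinhole projection of a 3D point to pixel coordinates."""
    return (fx * P[0] / P[2] + cx, fy * P[1] / P[2] + cy)


def transform(R, t, P):
    """Apply T = (R, t): R P + t."""
    return [sum(R[i][k] * P[k] for k in range(3)) + t[i] for i in range(3)]


def inverse_transform(R, t, P):
    """Apply T^{-1}: R^T (P - t)."""
    d = [P[k] - t[k] for k in range(3)]
    return [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]


def cycle_residual(u_i, d_i, depth_j, R_ij, t_ij, intr):
    """Squared pixel error ||u_i - u'_i||^2 after the i -> j -> i round trip."""
    P_i = backproject(u_i, d_i, *intr)
    P_j = transform(R_ij, t_ij, P_i)
    u_j = project(P_j, *intr)
    P_j_prime = backproject(u_j, depth_j(u_j), *intr)
    P_i_prime = inverse_transform(R_ij, t_ij, P_j_prime)
    u_i_prime = project(P_i_prime, *intr)
    return (u_i[0] - u_i_prime[0]) ** 2 + (u_i[1] - u_i_prime[1]) ** 2


def cycle_loss(edge_pixels, depth_i, depth_j, R_ij, t_ij, intr):
    """L_cycle: sum of round-trip residuals over the edge pixels of frame i."""
    return sum(cycle_residual(u, depth_i(u), depth_j, R_ij, t_ij, intr)
               for u in edge_pixels)
```

No edge pixel in frame j is ever matched explicitly: the predicted depth at the projected location closes the loop, so inconsistent depth or pose shows up as a non-zero residual.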

Optimization Algorithm: Levenberg-Marquardt algorithm for solving nonlinear least squares problems, balancing Gauss-Newton and gradient descent

Technical Innovations

  1. Edge-Aware Semi-Dense Mapping: Combines sparse keypoints and dense edges, balancing computational efficiency and map detail.
  2. No Explicit Edge Matching: Avoids complex edge correspondence search through the cycle consistency loss.
  3. Structure-Aware Regularization: Leverages L-shaped geometric priors in indoor environments to enhance reconstruction quality.
  4. Local Optimization Strategy: Eliminates global loop closure detection, reducing computational complexity.
  5. Adaptive Sensor Fusion: Process noise modeling considering battery level and time.

Strategies for Addressing Optimization Challenges

  1. Nonlinear Problems: Use regularization and Levenberg-Marquardt algorithm for stable convergence
  2. Singularity: Diagonal regularization (μI) ensures invertibility
  3. Ill-Conditioned Jacobian: Enhance parallax through oblique camera motion (e.g., zigzag trajectories)
  4. Loss Imbalance: Adaptive weight adjustment based on uncertainty
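The role of the diagonal regularization μI can be seen in a small Levenberg-Marquardt loop. The sketch below fits a toy two-parameter model y = a·exp(b·x), not the SLAM objective, and the update schedule for μ (divide by 10 on success, multiply by 10 on failure) is a common heuristic assumed here rather than a detail from the paper.

```python
import math


def residuals(theta, xs, ys):
    """Residuals of the toy model a * exp(b * x) against the data."""
    a, b = theta
    return [a * math.exp(b * x) - y for x, y in zip(xs, ys)]


def jacobian(theta, xs):
    """Rows of (d r / d a, d r / d b) for each sample."""
    a, b = theta
    return [(math.exp(b * x), a * x * math.exp(b * x)) for x in xs]


def levenberg_marquardt(theta, xs, ys, mu=1e-3, iterations=200, tol=1e-12):
    theta = list(theta)
    cost = sum(r * r for r in residuals(theta, xs, ys))
    for _ in range(iterations):
        r = residuals(theta, xs, ys)
        J = jacobian(theta, xs)
        # Damped normal equations: (J^T J + mu I) delta = -J^T r
        h00 = sum(j[0] * j[0] for j in J) + mu
        h01 = sum(j[0] * j[1] for j in J)
        h11 = sum(j[1] * j[1] for j in J) + mu
        g0 = sum(j[0] * ri for j, ri in zip(J, r))
        g1 = sum(j[1] * ri for j, ri in zip(J, r))
        det = h00 * h11 - h01 * h01  # positive whenever mu > 0
        d0 = (-g0 * h11 + g1 * h01) / det
        d1 = (-g1 * h00 + g0 * h01) / det
        candidate = [theta[0] + d0, theta[1] + d1]
        new_cost = sum(v * v for v in residuals(candidate, xs, ys))
        if new_cost < cost:
            theta, cost = candidate, new_cost
            mu = max(mu / 10.0, 1e-12)  # behave more like Gauss-Newton
        else:
            mu *= 10.0  # behave more like damped gradient descent
        if cost < tol:
            break
    return theta, cost
```

Because μ is added to the diagonal, the 2x2 system stays invertible even when J^T J alone is singular or badly conditioned, which is the point of strategy 2 above.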

Experimental Setup

Datasets

  1. TUM RGB-D Benchmark Dataset
    • 23 indoor sequences, 2-10 minutes duration
    • Synchronized RGB-D images with ground truth poses
    • Diverse motion patterns, viewpoints, and lighting conditions
    • Released by the TUM Computer Vision Group under a Creative Commons license
  2. Depth Estimation Training Set
    • FastDepth model pretrained on NYU Depth v2 dataset
    • MobileNet as backbone network
    • Depth-wise separable convolutions for reduced complexity
  3. Real-World Test Platform
    • DJI Tello drone
    • Monocular camera + inertial sensors
    • Indoor corridor environment

Evaluation Metrics

  1. Absolute Pose Error (APE):
APE_i = ||t_est^i - t_gt^i||_2

Measures instantaneous Euclidean distance error at each timestamp

  2. Absolute Trajectory Error (ATE):
ATE_RMS = sqrt((1/N) Σ_i ||(T_gt^i)^{-1} T_est^i||_F^2)

Evaluates global drift over entire sequence (including translation and rotation)
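The translational error statistics reported later (RMSE, mean, standard deviation) can be computed from per-timestamp APE values as in the sketch below. It assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame; trajectory alignment, which benchmark tools normally perform first, is left out.

```python
import math


def ape(estimated, ground_truth):
    """Per-timestamp Euclidean distance between estimated and true positions."""
    return [math.dist(e, g) for e, g in zip(estimated, ground_truth)]


def summarize(errors):
    """RMSE, mean, and population standard deviation of a list of errors."""
    n = len(errors)
    mean = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    std = math.sqrt(sum((e - mean) ** 2 for e in errors) / n)
    return {"rmse": rmse, "mean": mean, "std": std}
```

With the population standard deviation used here, the three statistics satisfy rmse² = mean² + std², which is a quick consistency check on any reported set of values.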

Comparison Methods

  • ORB-SLAM2: Baseline method, representing traditional sparse feature SLAM

Implementation Details

  • Platform: Ubuntu 16.04 laptop
  • Depth Network: Pretrained FastDepth (MobileNet-NNConv5)
  • Feature Detection: ORB + Canny edge detection
  • Optimization Window: Local sliding window bundle adjustment
  • Weight Parameters: λ_reproj, λ_cycle, λ_shape (specific values not provided in paper)
  • EKF Parameters: β, λ for adaptive process noise

Experimental Results

Main Results

Quantitative Evaluation on TUM RGB-D Dataset (Table I):

Method | RMSE (m) | Mean (m) | Std (m)
ORB-SLAM2 (baseline) | 0.182 | 0.17 | 0.71
Edge-Aware SLAM (this work) | 0.046 | 0.040 | 0.011
Improvement | 74.7% | 76.5% | 98.4%

Key Findings:

  • 74.7% RMSE reduction demonstrates significant trajectory accuracy improvement
  • 98.4% standard deviation reduction indicates more stable pose estimation
  • 76.5% mean error reduction shows reduced systematic bias
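The improvement percentages follow from the relative reduction (baseline − proposed) / baseline applied to the Table I values. A short check, using only the numbers quoted above:

```python
def relative_reduction(baseline, proposed):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Values as quoted from Table I (metres).
table = {
    "rmse": (0.182, 0.046),
    "mean": (0.17, 0.040),
    "std": (0.71, 0.011),
}
reductions = {name: relative_reduction(b, p) for name, (b, p) in table.items()}
```

The computed reductions agree with the reported 74.7%, 76.5%, and 98.4% to within about 0.1 percentage point, the small differences being a matter of rounding.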

Qualitative Map Evaluation

Early-Stage Mapping (Figure 4):

  • Proposed method generates clear, accurate 3D edge maps from initial frames
  • ORB-SLAM2 point clouds show poor interpretability in early stages

Complete Sequence Mapping (Figure 5):

  • Proposed method maintains high precision after processing complete sequences without drift
  • ORB-SLAM2 maps show lower clarity and interpretability

Laboratory Environment (Figure 6):

  • Proposed method maintains high-precision 3D edge maps from sequence start to end
  • No drift or error accumulation, validating system robustness and reliability

Computational Efficiency

Key Performance Indicators:

  • ORB-based edge map creation approximately 100 times faster than ORB-SLAM
  • Supports deployment on small hardware like Raspberry Pi Zero
  • Achieves true real-time processing

Experimental Findings

  1. Edge Enhancement Advantages: Semi-dense edge maps provide richer structural information than sparse point clouds
  2. Local Optimization Effectiveness: Maintains long-term consistency without global loop closure
  3. Sensor Fusion Value: EKF fusion effectively addresses monocular scale ambiguity
  4. Lightweight Deep Learning: FastDepth maintains accuracy while meeting real-time requirements
  5. Structural Prior Impact: L-shaped constraints significantly improve reconstruction quality in indoor environments

Related Work

Traditional SLAM Methods

  • ORB-SLAM Series: Classical sparse feature-based methods relying on global optimization
  • Voxel Map: Improved retrieval and visibility reasoning, but still sparse
  • SfM: Foundational technique for 3D structure reconstruction from multiple images

Visual-Inertial Odometry

  • EKF-Based Methods: Fast and efficient pose estimation (e.g., VINS-Mono, MSCKF-DVIO)
  • Limitations: Typically generate sparse 3D point clouds

Learning-Driven Dense SLAM

  • DeepTAM: Dense depth maps from deep neural networks, but limited accuracy and high computation
  • D3VO: High precision but complex models unsuitable for low-power devices
  • NeRF/NICE-SLAM: High-fidelity reconstruction requiring high-end GPUs and static scenes
  • NeuralRecon: Fuses depth and pose, but is too computationally demanding for low-power platforms

Edge SLAM

  • Edge SLAM: Generates semi-dense maps but relies on global optimization; optical flow-based tracking introduces noise

Advantages of This Work

  • Combines traditional geometric methods with lightweight deep learning
  • Replaces global loop closure with local optimization
  • Suitable for real-time execution on resource-constrained platforms

Conclusions and Discussion

Main Conclusions

  1. The proposed edge-aware SLAM system achieves real-time, accurate 3D mapping on resource-constrained platforms
  2. Compared to ORB-SLAM2, trajectory RMSE is reduced by 74.7%
  3. Generated semi-dense maps are more accurate and detailed
  4. Processing speed approximately 100 times faster than ORB-SLAM, supporting embedded deployment

Limitations

  1. Environmental Assumptions: L-shaped structure constraints primarily suit indoor artificial environments; may not apply to natural scenes
  2. Depth Dependency: Relies on pretrained FastDepth model; performance may degrade on out-of-distribution scenes
  3. Dynamic Scenes: Paper does not explicitly address dynamic object handling
  4. Parameter Tuning: Multiple weight parameters (λ_reproj, λ_cycle, λ_shape) require manual adjustment
  5. Long-Term Drift: While local consistency is good, lack of global loop closure may accumulate errors in ultra-long sequences
  6. Insufficient Quantitative Analysis: Comparison only with ORB-SLAM2; lacks comparison with other modern methods

Future Directions

Although the paper does not state them explicitly, potential directions include:

  1. Extension to outdoor and unstructured environments
  2. Integration of lightweight loop closure detection
  3. Handling dynamic objects and occlusions
  4. Adaptive weight learning
  5. Multi-sensor fusion (e.g., LiDAR)

In-Depth Evaluation

Strengths

Technical Innovation:

  1. Hybrid Architecture Design: Cleverly combines sparse geometry and dense learning, balancing accuracy and efficiency
  2. Cycle Consistency Loss: Innovative constraint design avoiding explicit edge matching
  3. Structure-Aware Regularization: Leverages environmental priors to enhance reconstruction quality
  4. Adaptive Sensor Fusion: Process noise modeling considering battery level has practical significance

Experimental Sufficiency:

  1. Validation on standard dataset (TUM RGB-D) and real platform (DJI Tello)
  2. Quantitative and qualitative results mutually reinforce
  3. Comprehensive computational efficiency analysis (100× speedup)

Result Convincingness:

  1. 74.7% RMSE improvement is significant
  2. 98.4% standard deviation reduction demonstrates improved stability
  3. Visualization clearly shows semi-dense map advantages

Writing Clarity:

  1. Clear problem definition and rigorous mathematical derivations
  2. Intuitive system architecture diagrams
  3. Four-thread design easy to understand

Weaknesses

Method Limitations:

  1. Generalization Capability: L-shaped constraints limit application scope
  2. Long-Term Consistency: Lack of global loop closure may cause issues in large-scale scenarios
  3. Depth Quality Dependency: FastDepth may fail in certain scenarios

Experimental Setup Defects:

  1. Single Comparison Method: Only compared with ORB-SLAM2; lacks comparison with Edge SLAM, VINS-Mono, etc.
  2. Missing Parameter Settings: Does not provide values for λ_reproj, λ_cycle, λ_shape
  3. Insufficient Ablation Studies: Does not separately analyze contribution of each loss term
  4. Dataset Limitations: Primarily tested in indoor scenes; outdoor performance unknown

Insufficient Analysis:

  1. Failure Cases: Does not discuss method failure scenarios
  2. Computational Analysis: Lacks detailed time and memory consumption analysis
  3. Robustness Testing: Does not test sensitivity to noise, occlusion, lighting changes
  4. Theoretical Analysis: Lacks convergence guarantees and error bounds

Impact

Contribution to Field:

  1. Provides practical SLAM solution for resource-constrained platforms
  2. Demonstrates potential of combining traditional methods with lightweight deep learning
  3. Edge-aware mapping approach can inspire subsequent research

Practical Value:

  1. Successful deployment on DJI Tello demonstrates practicality
  2. 100× speedup enables embedded applications
  3. Semi-dense maps suitable for navigation and obstacle avoidance

Reproducibility:

  • Moderate: Paper provides method details but lacks code, complete parameter settings, and training details
  • Use of public FastDepth model aids reproduction
  • Clear four-thread architecture but implementation details need supplementation

Applicable Scenarios

Suitable Applications:

  1. Indoor Drone Navigation: Corridors, warehouses, building interiors
  2. Resource-Constrained Robots: Low-power mobile platforms
  3. Real-Time Obstacle Avoidance: Fast-response scenarios
  4. Structured Environments: Artificial buildings, industrial facilities

Unsuitable Scenarios:

  1. Outdoor Natural Environments: Lack L-shaped structures
  2. Highly Dynamic Scenes: Fast-moving objects
  3. Ultra-Large-Scale Mapping: Lack global loop closure
  4. High-Precision Applications: Precision measurement (trajectory RMSE is still 4.6 cm)

References

Key Citations:

  1. ORB-SLAM Series: Classical sparse SLAM baseline
  2. FastDepth (Wofk et al., ICRA 2019): Lightweight depth estimation network
  3. TUM RGB-D (Sturm et al., 2012): Standard SLAM evaluation dataset
  4. Bundle Adjustment (Triggs et al., 1999): Classical optimization technique
  5. Epipolar Geometry (Zhang, 1998): Foundational epipolar geometry theory
  6. Extended Kalman Filter: Standard sensor fusion method
  7. Edge SLAM (Maity et al., ICCV 2017): Pioneer work in edge SLAM
  8. NeRF/NICE-SLAM: Learning-based dense reconstruction methods

Overall Assessment: This is a practical SLAM research work targeting resource-constrained platforms with reasonable technical approach and convincing experimental results. Main contributions lie in system engineering and method integration rather than single algorithmic breakthrough. The 74.7% accuracy improvement and 100× speedup have practical value. However, the paper has room for improvement in experimental comparison, ablation analysis, and theoretical depth. Suitable for publication in robotics application conferences or journals.