HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation

Sun, Wang, Zhang et al.
Seamless loco-manipulation in unstructured environments requires robots to leverage autonomous exploration alongside whole-body control for physical interaction. In this work, we introduce HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with manipulators to perform human-centered mobile manipulation tasks. The first layer utilizes a goal-conditioned autonomous exploration policy to guide the robot to semantically specified targets, such as a black office chair in a dynamic environment. The second layer employs a unified whole-body loco-manipulation policy to coordinate the arm and legs for precise interaction tasks, for example, handing a drink to a person seated on the chair. We have conducted an initial deployment of the navigation module, and will continue to pursue finer-grained deployment of whole-body loco-manipulation.

Basic Information

  • Paper ID: 2510.09221
  • Title: HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation
  • Authors: Jingyuan Sun, Chaoran Wang, Mingyu Zhang, Cui Miao, Hongyu Ji, Zihan Qu, Han Sun, Bing Wang, Qingyi Si
  • Category: cs.RO (Robotics)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09221
  • Video Demonstration: https://youtu.be/YD0qx3vRsfc

Abstract

This paper presents HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with robotic arms to execute human-centric mobile manipulation tasks. The first layer employs a goal-conditioned autonomous exploration policy to guide the robot to semantically specified targets; the second layer uses a unified whole-body mobile manipulation policy that coordinates arm and leg movements for precise interaction tasks. The authors have completed a preliminary deployment of the navigation module and plan to continue refining the deployment of whole-body mobile manipulation.

Research Background and Motivation

Problem Definition

This research addresses the challenge of seamless mobile manipulation in unstructured environments, particularly the human-robot interaction challenges in last-mile delivery scenarios. Traditional delivery methods rely on pre-built maps and precise localization, which are costly and have limited scalability in dynamic or customized environments.

Significance

Last-mile delivery represents a critical application for service robots, requiring robots to not only traverse complex environments but also engage in physical interaction with humans. Quadruped robot platforms equipped with robotic arms combine agile locomotion capabilities with manipulation functionality, providing an ideal implementation platform for complex delivery scenarios.

Limitations of Existing Approaches

  1. Navigation: Most delivery strategies still rely on maps and perform poorly in environments that change frequently or where the robot must be deployed rapidly
  2. Manipulation: Lack of effective whole-body coordination control, making complex human-robot interaction difficult
  3. Integration Challenges: Sim-to-real deployment faces perception gaps, terrain variations, and hardware constraints

Research Motivation

The goal is to develop a hierarchical, integrated framework that unifies map-free navigation with whole-body mobile manipulation in a deployable system, so that the robot can autonomously navigate unknown spaces and execute dexterous manipulation actions.

Core Contributions

  1. Proposed a novel map-free navigation module: Employs vision-language models for cross-scene reasoning and graph matching, driving a three-stage exploration strategy to achieve zero-cost navigation
  2. Designed mobile manipulation policies: Integrates quadruped locomotion and arm control, achieving whole-body interaction behaviors through end-effector trajectory guidance
  3. System integration and validation: Integrates and validates the system on a real quadruped-arm platform, demonstrating end-to-end last-mile delivery combining semantic navigation and whole-body interaction

Methodology Details

Task Definition

The HANDO framework enables quadruped robots equipped with robotic arms to execute complete delivery tasks in unstructured environments, including:

  • Input: Semantic target descriptions (e.g., "black office chair"), environmental perception data, human hand trajectories
  • Output: Robot motion control commands, arm joint commands
  • Constraints: No pre-built maps, real-time requirements, safety constraints

Model Architecture

Layer 1: Goal-Oriented Map-Free Navigation

Three-Stage Exploration Process:

  1. Initial Exploration Stage: When the matching score $s_t < \sigma_1$, the system decomposes the semantic target graph $G_g$ into sub-goals using boundary-based exploration strategies
  2. Coordinate Projection and Alignment Stage: When $\sigma_1 \leq s_t < \sigma_2$, the system aligns the target graph $G_g$ with the current scene graph $G_t$
  3. Target Verification Stage: When $s_t \geq \sigma_2$, the system executes target verification and scene graph correction
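
The stage selection above amounts to comparing the matching score against two thresholds. The following is a minimal sketch of that comparison only, not the authors' implementation: the names `Stage` and `select_stage` are hypothetical, and the threshold values in the usage note are placeholders, since this summary does not report the values of $\sigma_1$ and $\sigma_2$.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical labels for the three exploration stages."""
    INITIAL_EXPLORATION = 1
    PROJECTION_AND_ALIGNMENT = 2
    TARGET_VERIFICATION = 3


def select_stage(s_t: float, sigma_1: float, sigma_2: float) -> Stage:
    """Pick the exploration stage from the matching score s_t.

    s_t < sigma_1             -> initial exploration
    sigma_1 <= s_t < sigma_2  -> coordinate projection and alignment
    s_t >= sigma_2            -> target verification
    """
    if not sigma_1 < sigma_2:
        raise ValueError("expected sigma_1 < sigma_2")
    if s_t < sigma_1:
        return Stage.INITIAL_EXPLORATION
    if s_t < sigma_2:
        return Stage.PROJECTION_AND_ALIGNMENT
    return Stage.TARGET_VERIFICATION
```

With placeholder thresholds such as `sigma_1=0.3` and `sigma_2=0.7`, a score of 0.5 would fall into the projection and alignment stage. How the matching score itself is computed from the graphs is outside the scope of this sketch.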

Action Generation: A VLM-based action decoder selects discrete actions $a_t \in \{\text{move forward}, \text{turn left}, \text{turn right}, \text{stop}\}$, which are mapped to the continuous velocity commands $(0.1\ \text{m s}^{-1},\ \pi/12\ \text{rad s}^{-1},\ -\pi/12\ \text{rad s}^{-1},\ 0)$, respectively.
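
The discrete-to-continuous mapping can be illustrated with a small lookup table. This is a sketch under assumptions: the function and dictionary names are hypothetical, each command is represented as a (forward velocity, yaw rate) pair, and a positive yaw rate is taken to mean a left turn, which matches the order of the values listed above but is not stated explicitly in the source.

```python
import math

# (forward velocity in m/s, yaw rate in rad/s) for each discrete action.
# Assumption: positive yaw rate corresponds to turning left.
ACTION_TO_VELOCITY = {
    "move forward": (0.1, 0.0),
    "turn left": (0.0, math.pi / 12),
    "turn right": (0.0, -math.pi / 12),
    "stop": (0.0, 0.0),
}


def to_velocity_command(action: str) -> tuple[float, float]:
    """Map a discrete action name to a (linear, angular) velocity pair."""
    try:
        return ACTION_TO_VELOCITY[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
```

At these rates, a 90° turn ($\pi/2$ rad) takes $(\pi/2)/(\pi/12) = 6$ s of continuous turning, and one metre of forward travel takes 10 s.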

Layer 2: Whole-Body Mobile Manipulation Policy

Hand Trajectory Generator:

  • Detects operator hand and selects keyframes through hand velocity valleys
  • Retargets hand position/orientation to the robot gripper tool center point (TCP): $x^{tcp}_t = SE(3)(T_{cam \rightarrow world}) \cdot SE(3)(h_t) \cdot {}^{tcp}T_{hand}$
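
The two steps above can be sketched with plain 4x4 homogeneous transforms. This is an illustrative sketch, not the paper's code: all function names are hypothetical, poses are assumed to be given as 4x4 matrices, and a "velocity valley" is interpreted here as a strict local minimum of the hand speed computed by finite differences.

```python
import math

Matrix = list[list[float]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Pure-translation transform, used here only to build examples."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def retarget_hand_pose(T_cam_to_world: Matrix, hand_pose_cam: Matrix,
                       T_offset: Matrix) -> Matrix:
    """Compose camera-to-world, hand pose, and the fixed hand/TCP offset,
    in the same left-to-right order as the retargeting equation."""
    return matmul4(matmul4(T_cam_to_world, hand_pose_cam), T_offset)


def speed_valley_keyframes(positions: list[tuple[float, float, float]],
                           dt: float) -> list[int]:
    """Return indices i where the hand speed is a strict local minimum.

    speeds[i] is the finite-difference speed between samples i and i+1.
    """
    speeds = [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]
    return [i for i in range(1, len(speeds) - 1)
            if speeds[i] < speeds[i - 1] and speeds[i] < speeds[i + 1]]
```

For example, composing `translation(1, 0, 0)`, `translation(0, 2, 0)`, and `translation(0, 0, 3)` yields a pose whose translation is (1, 2, 3). A real pipeline would likely smooth the speed signal before searching for valleys; that step is omitted here.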

Whole-Body Mobile Manipulation Policy:

  • State Space: Includes previous action, leg state, arm state, base state, and end-effector trajectory
  • Action Space: Uses position PD control with target position $q^*_t = q_{default} + \Delta q_t$
  • Reward Function:
    • TCP tracking reward: $r_{track} = \exp\left(-\frac{\|p^{tcp}_t - p^{tar}_t\|}{\sigma_p}\right) \cdot \exp\left(-\frac{\angle(R^{tcp}_t (R^{tar}_t)^T)}{\sigma_o}\right)$
    • Regularization reward: $r_{reg} = -\lambda_\tau \|\tau_t\|^2 - \lambda_{\Delta q} \|a_t - a_{t-1}\|^2 - \lambda_{\ddot{q}} \|\ddot{q}_t\|^2$
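
The action and reward definitions above translate fairly directly into code. The sketch below is one possible reading rather than the authors' implementation: function names are hypothetical, rotations are assumed to be 3x3 matrices given as nested lists, and the angle $\angle(\cdot)$ is computed as the rotation angle recovered from the trace. The scales $\sigma_p$, $\sigma_o$ and the weights $\lambda$ are not reported in this summary, so any values passed in are placeholders.

```python
import math


def pd_target(q_default: list[float], delta_q: list[float]) -> list[float]:
    """Joint position target q* = q_default + delta_q for the PD controller."""
    return [qd + dq for qd, dq in zip(q_default, delta_q)]


def rotation_angle(R_a, R_b) -> float:
    """Rotation angle of R_a @ R_b^T, using trace(R_a R_b^T) = sum_ij R_a[i][j] * R_b[i][j]."""
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def tracking_reward(p_tcp, p_tar, R_tcp, R_tar,
                    sigma_p: float, sigma_o: float) -> float:
    """Product of exponentials of the position and orientation errors."""
    pos_err = math.dist(p_tcp, p_tar)
    rot_err = rotation_angle(R_tcp, R_tar)
    return math.exp(-pos_err / sigma_p) * math.exp(-rot_err / sigma_o)


def regularization_reward(tau, a_t, a_prev, qdd,
                          lam_tau: float, lam_dq: float, lam_qdd: float) -> float:
    """Penalize squared torque, action change, and joint acceleration."""
    def sq_norm(v):
        return sum(x * x for x in v)

    action_diff = [x - y for x, y in zip(a_t, a_prev)]
    return (-lam_tau * sq_norm(tau)
            - lam_dq * sq_norm(action_diff)
            - lam_qdd * sq_norm(qdd))
```

The tracking reward equals 1 when the TCP pose matches the target exactly and decays toward 0 as either error grows, while the regularization reward is always non-positive for non-negative weights.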

Technical Innovations

  1. Cross-Modal Scene Understanding: Combines vision-language models to achieve direct mapping from semantic targets to navigation behaviors
  2. Hierarchical Control Architecture: Effectively separates high-level semantic reasoning from low-level motion control
  3. Real-Time Hand Tracking Integration: Guides robot end-effector through human hand trajectories, enhancing naturalness of human-robot interaction
  4. Unified Whole-Body Control: Coordinates leg locomotion and arm manipulation within a single policy framework

Experimental Setup

Hardware Platform

  • Robot Platform: Unitree Go1 EDU quadruped robot + AGILEX PIPER lightweight robotic arm
  • Computing Device: NVIDIA RTX 4090 GPU
  • Control Frequency: Both the locomotion policy and the whole-body mobile manipulation policy run at 50 Hz
  • Communication: Wired Ethernet connection supporting low-latency reliable deployment

Experimental Environment

Real-world evaluation was conducted in a café with the following characteristics:

  • Unstructured layout with irregularly arranged tables, chairs, and clutter
  • Partial observability: robot has no prior knowledge of target locations
  • Relies solely on visual input and semantic instructions

Evaluation Metrics

  • Navigation success rate
  • Trajectory smoothness and continuity
  • Target localization accuracy
  • System stability and robustness

Experimental Results

Main Results

In the real-world deployment, the goal-oriented map-free navigation layer is reported to perform well qualitatively:

  • Successfully explores environments and approaches targets
  • Recorded base trajectories are smooth and continuous
  • Maintains stable and robust navigation performance despite irregular layouts

Experimental Findings

  1. Navigation Module Validation: The preliminary deployment was completed successfully, demonstrating the feasibility of map-free navigation
  2. System Integration: Multi-threaded control achieves real-time operation
  3. Environmental Adaptability: Demonstrates good adaptation in dynamic, unstructured environments

Related Work

Autonomous Navigation

  • Traditional Methods: Map-based approaches using SLAM and graph planning, effective in static structured environments but costly
  • Map-Free Methods: Frameworks like UniGoal and NaviLa leverage language and visual cues to guide navigation, significantly reducing deployment costs

End-to-End Imitation Learning

  • ACT: Employs Transformer backbone and image encoders
  • Diffusion Policy: Introduces generative diffusion processes to model multimodal action distributions
  • RISE: Utilizes sparse point cloud encoders for continuous control

Mobile Manipulation

  • Early Methods: Optimization-based footstep planning and whole-body trajectory generation with high computational costs
  • Reinforcement Learning Methods: End-to-end control for multiple mobile manipulation tasks
  • MLM: Combines trajectory libraries with diffusion policy-based inference

Conclusions and Discussion

Main Conclusions

The HANDO framework successfully bridges semantic task understanding with low-level physical control, providing an effective solution for complex last-mile delivery tasks in unstructured and human environments.

Limitations

  1. Incomplete Manipulation Module: Whole-body mobile manipulation control is still under development
  2. Limited Experimental Scope: Primarily validates navigation functionality; manipulation requires further testing
  3. Environmental Complexity: Adaptation capability to extreme dynamic environments requires verification

Future Directions

  1. Refined Whole-Body Mobile Manipulation: Improve coordinated control for grasping and handover
  2. Real-Time Hand Tracking Integration: Enhance safety, robustness, and naturalness of human-robot interaction
  3. Extended Application Scenarios: Validate performance in more complex real-world environments

In-Depth Evaluation

Strengths

  1. Systematic Design: Proposes a complete hierarchical framework effectively separating high-level reasoning from low-level control
  2. Strong Practicality: Designed for real-world applications (last-mile delivery)
  3. Technical Innovation: Organic combination of map-free navigation and whole-body control
  4. Real-World Validation: Preliminary validation on real hardware platform

Weaknesses

  1. Incomplete System: The manipulation module is still under development, so a complete system demonstration is lacking
  2. Limited Experimental Depth: Primarily showcases navigation functionality, lacking quantitative performance analysis
  3. Missing Comparative Experiments: Lacks detailed comparison with existing methods
  4. Insufficient Robustness Analysis: Limited analysis of failure cases and boundary conditions

Impact

  1. Academic Value: Provides new system architecture insights for mobile manipulation robots
  2. Practical Value: Has application potential in service robotics and delivery robot domains
  3. Reproducibility: Provides detailed technical descriptions but lacks open-source code

Applicable Scenarios

  • Last-mile delivery services
  • Indoor service robot applications
  • Human-robot collaborative tasks
  • Mobile manipulation tasks in unstructured environments

References

The paper cites multiple important related works, including:

  • UniGoal [5]: Universal zero-shot goal-oriented navigation
  • NaviLa [3]: Vision-language-action navigation model for legged robots
  • MLM [7]: Multi-task mobile manipulation whole-body control learning
  • Diffusion Policy [8]: Diffusion-based visuomotor policy learning

Overall Assessment: This is a practically oriented systems paper that proposes a complete framework design for mobile manipulation robots. Although the manipulation module is still under development, the successful preliminary deployment of the navigation module supports the feasibility of the approach. The paper's main contributions lie in system architecture design and preliminary real-world validation, laying a foundation for further development in this field.