HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation

Sun, Wang, Zhang et al.

Seamless loco-manipulation in unstructured environments requires robots to leverage autonomous exploration alongside whole-body control for physical interaction. In this work, we introduce HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with manipulators to perform human-centered mobile manipulation tasks. The first layer utilizes a goal-conditioned autonomous exploration policy to guide the robot to semantically specified targets, such as a black office chair in a dynamic environment. The second layer employs a unified whole-body loco-manipulation policy to coordinate the arm and legs for precise interaction tasks, for example handing a drink to a person seated on the chair. We have conducted an initial deployment of the navigation module, and will continue to pursue finer-grained deployment of whole-body loco-manipulation.

Basic Information

Paper ID: 2510.09221
Title: HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation
Authors: Jingyuan Sun, Chaoran Wang, Mingyu Zhang, Cui Miao, Hongyu Ji, Zihan Qu, Han Sun, Bing Wang, Qingyi Si
Category: cs.RO (Robotics)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09221
Video Demonstration: https://youtu.be/YD0qx3vRsfc

Abstract

This paper presents HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with robotic arms to execute human-centric mobile manipulation tasks. The first layer employs a goal-conditioned autonomous exploration strategy to guide the robot to semantically specified targets; the second layer uses a unified whole-body mobile manipulation policy that coordinates arm and leg movements for precise interaction tasks. The authors have completed a preliminary deployment of the navigation module and plan to continue refining the deployment of whole-body mobile manipulation.

Research Background and Motivation

Problem Definition

This research addresses the challenge of seamless mobile manipulation in unstructured environments, particularly the human-robot interaction challenges in last-mile delivery scenarios. Traditional delivery methods rely on pre-built maps and precise localization, which are costly and have limited scalability in dynamic or customized environments.

Significance

Last-mile delivery represents a critical application for service robots, requiring robots to not only traverse complex environments but also engage in physical interaction with humans. Quadruped robot platforms equipped with robotic arms combine agile locomotion capabilities with manipulation functionality, providing an ideal implementation platform for complex delivery scenarios.

Limitations of Existing Approaches

Navigation: Most delivery strategies still rely on maps and perform poorly in environments that change frequently or that require rapid deployment
Manipulation: Lack of effective whole-body coordination control, making complex human-robot interaction difficult
Integration Challenges: Sim-to-real deployment faces perception gaps, terrain variations, and hardware constraints

Research Motivation

Develop a hierarchical and integrated framework that unifies map-free navigation with whole-body mobile manipulation in a deployable system, achieving comprehensive autonomy for navigating unknown spaces and executing dexterous manipulation actions.

Core Contributions

Proposed a novel map-free navigation module: Employs vision-language models for cross-scene reasoning and graph matching, driving a three-stage exploration strategy to achieve zero-cost navigation
Designed mobile manipulation policies: Integrates quadruped locomotion and arm control, achieving whole-body interaction behaviors through end-effector trajectory guidance
System integration and validation: Integrates and validates the system on a real quadruped-arm platform, demonstrating end-to-end last-mile delivery combining semantic navigation and whole-body interaction

Methodology Details

Task Definition

The HANDO framework enables quadruped robots equipped with robotic arms to execute complete delivery tasks in unstructured environments, including:

Input: Semantic target descriptions (e.g., "black office chair"), environmental perception data, human hand trajectories
Output: Robot motion control commands, arm joint commands
Constraints: No pre-built maps, real-time requirements, safety constraints

Model Architecture

Layer 1: Goal-Oriented Map-Free Navigation

Three-Stage Exploration Process:

Initial Exploration Stage: When the matching score satisfies $s_t < \sigma_1$, the system decomposes the semantic target graph $G_g$ into sub-goals and pursues them with a frontier-based (boundary-based) exploration strategy
Coordinate Projection and Alignment Stage: When $\sigma_1 \leq s_t < \sigma_2$, the system aligns the target graph $G_g$ with the current scene graph $G_t$
Target Verification Stage: When $s_t \geq \sigma_2$, the system verifies the target and corrects the scene graph
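
The threshold logic of the three stages can be sketched as a small dispatch function. This is a minimal illustration, not the authors' implementation: the names `Stage` and `select_stage` are hypothetical, and the threshold values are placeholders, since the summary does not state the values of $\sigma_1$ and $\sigma_2$.

```python
from enum import Enum


class Stage(Enum):
    EXPLORE = "initial_exploration"
    ALIGN = "coordinate_projection_and_alignment"
    VERIFY = "target_verification"


def select_stage(s_t: float, sigma_1: float, sigma_2: float) -> Stage:
    """Pick the exploration stage from the graph matching score s_t."""
    if not sigma_1 < sigma_2:
        raise ValueError("expected sigma_1 < sigma_2")
    if s_t < sigma_1:
        return Stage.EXPLORE   # s_t < sigma_1
    if s_t < sigma_2:
        return Stage.ALIGN     # sigma_1 <= s_t < sigma_2
    return Stage.VERIFY        # s_t >= sigma_2


# Placeholder thresholds for illustration only; not values from the paper.
SIGMA_1, SIGMA_2 = 0.3, 0.7
for score in (0.1, 0.3, 0.69, 0.7):
    print(score, select_stage(score, SIGMA_1, SIGMA_2).value)
```

Note that the boundaries follow the inequalities as written: a score exactly equal to $\sigma_1$ falls in the alignment stage, and a score exactly equal to $\sigma_2$ falls in the verification stage.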

Action Generation: A VLM-based action decoder selects a discrete action $a_t \in \{\text{move forward}, \text{turn left}, \text{turn right}, \text{stop}\}$, which is mapped, in that order, to the continuous velocity commands $(0.1\,\text{m s}^{-1},\ \pi/12\,\text{rad s}^{-1},\ -\pi/12\,\text{rad s}^{-1},\ 0)$
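
As a rough sketch, the discrete-to-continuous mapping can be written as a lookup table. The source gives one scalar per action, so the code below assumes that "move forward" sets a forward speed with zero yaw rate and that the turn actions set a yaw rate with zero forward speed (turning in place). That split, and the names `ACTION_TO_VELOCITY` and `to_velocity_command`, are assumptions made for illustration.

```python
import math

# (forward speed in m/s, yaw rate in rad/s) for each discrete action.
# Positive yaw rate is taken to mean a left turn, matching the sign
# of the pi/12 value listed for "turn left".
ACTION_TO_VELOCITY = {
    "move forward": (0.1, 0.0),
    "turn left": (0.0, math.pi / 12),
    "turn right": (0.0, -math.pi / 12),
    "stop": (0.0, 0.0),
}


def to_velocity_command(action: str) -> tuple[float, float]:
    """Map a decoder action string to a (linear, angular) velocity pair."""
    try:
        return ACTION_TO_VELOCITY[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None


for name in ACTION_TO_VELOCITY:
    print(name, to_velocity_command(name))
```

Rejecting unknown strings explicitly is a deliberate choice here: a language-model decoder can emit text outside the action set, and failing loudly is safer than silently commanding a default velocity.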

Layer 2: Whole-Body Mobile Manipulation Policy

Hand Trajectory Generator:

Detects the operator's hand and selects keyframes at valleys (local minima) of the hand velocity
Retargets hand position/orientation to robot gripper tool center point (TCP): $x^{tcp}_t = SE(3)(T_{cam \rightarrow world}) \cdot SE(3)(h_t) \cdot {}^{tcp}T_{hand}$
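
The two steps of the hand trajectory generator can be illustrated with NumPy. This is a sketch under stated assumptions: poses are represented as 4x4 homogeneous matrices, a "valley" is taken to be a strict local minimum of finite-difference speed, and the last factor of the formula is treated as a fixed hand-to-TCP offset. The function names and the example numbers are hypothetical and do not come from the paper.

```python
import numpy as np


def velocity_valley_keyframes(positions: np.ndarray, dt: float) -> list[int]:
    """Return indices of strict local minima of hand speed.

    positions has shape (N, 3); speed[i] is the finite-difference
    speed between samples i and i + 1.
    """
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    return [
        i for i in range(1, len(speed) - 1)
        if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]
    ]


def make_transform(rotation: np.ndarray, translation) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a rotation and translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T


def retarget_to_tcp(T_cam_to_world, T_hand_in_cam, T_hand_to_tcp):
    """Compose camera extrinsics, hand pose, and the fixed offset."""
    return T_cam_to_world @ T_hand_in_cam @ T_hand_to_tcp


# Illustrative values only: identity rotations, simple translations.
T_cam_to_world = make_transform(np.eye(3), [1.0, 0.0, 0.0])
T_hand_in_cam = make_transform(np.eye(3), [0.0, 0.0, 0.5])
T_hand_to_tcp = make_transform(np.eye(3), [0.0, 0.0, -0.1])
x_tcp = retarget_to_tcp(T_cam_to_world, T_hand_in_cam, T_hand_to_tcp)
print(x_tcp[:3, 3])
```

In practice the detected hand trajectory would likely need smoothing before the valley search, since raw finite differences amplify detection noise; that step is omitted here.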

Whole-Body Mobile Manipulation Policy:

State Space: Includes previous action, leg state, arm state, base state, and end-effector trajectory
Action Space: Uses position PD control with target position $q^*_t = q_{default} + \Delta q_t$
Reward Function:
- TCP tracking reward: $r_{track} = \exp\left(-\frac{\|p^{tcp}_t - p^{tar}_t\|}{\sigma_p}\right) \cdot \exp\left(-\frac{\angle(R^{tcp}_t(R^{tar}_t)^T)}{\sigma_o}\right)$
- Regularization reward: $r_{reg} = -\lambda_\tau\|\tau_t\|^2 - \lambda_{\Delta q}\|a_t - a_{t-1}\|^2 - \lambda_{\ddot{q}}\|\ddot{q}_t\|^2$
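
The action parameterization and the two reward terms above translate fairly directly into NumPy. The following is a minimal sketch rather than the authors' training code: the function names are hypothetical, the rotation error is computed as the geodesic angle of $R^{tcp}_t (R^{tar}_t)^T$, and all scale and weight parameters are left as arguments because the summary does not give their values.

```python
import numpy as np


def pd_target(q_default, delta_q):
    """Joint position target q* = q_default + delta_q for the PD controller."""
    return np.asarray(q_default) + np.asarray(delta_q)


def rotation_angle(R: np.ndarray) -> float:
    """Geodesic angle (rad) of a 3x3 rotation matrix."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def tracking_reward(p_tcp, R_tcp, p_tar, R_tar, sigma_p, sigma_o):
    """Product of exponential position and orientation tracking terms."""
    pos_err = np.linalg.norm(np.asarray(p_tcp) - np.asarray(p_tar))
    ang_err = rotation_angle(np.asarray(R_tcp) @ np.asarray(R_tar).T)
    return float(np.exp(-pos_err / sigma_p) * np.exp(-ang_err / sigma_o))


def regularization_reward(tau, a_t, a_prev, qdd, lam_tau, lam_dq, lam_qdd):
    """Penalize torque, action rate, and joint acceleration (squared norms)."""
    return float(
        -lam_tau * np.sum(np.square(tau))
        - lam_dq * np.sum(np.square(np.asarray(a_t) - np.asarray(a_prev)))
        - lam_qdd * np.sum(np.square(qdd))
    )


# Perfect tracking yields the maximum tracking reward of 1.
print(tracking_reward(np.zeros(3), np.eye(3), np.zeros(3), np.eye(3), 0.1, 0.5))
```

Because the tracking term is a product of two exponentials, it lies in $(0, 1]$ and drops toward zero if either the position or the orientation error grows, so the policy cannot trade one off entirely against the other.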

Technical Innovations

Cross-Modal Scene Understanding: Combines vision-language models to achieve direct mapping from semantic targets to navigation behaviors
Hierarchical Control Architecture: Effectively separates high-level semantic reasoning from low-level motion control
Real-Time Hand Tracking Integration: Guides robot end-effector through human hand trajectories, enhancing naturalness of human-robot interaction
Unified Whole-Body Control: Coordinates leg locomotion and arm manipulation within a single policy framework

Experimental Setup

Hardware Platform

Robot Platform: Unitree Go1 EDU quadruped robot + AGILEX PIPER lightweight robotic arm
Computing Device: NVIDIA RTX 4090 GPU
Control Frequency: Both the motion policy and the whole-body mobile manipulation policy run at 50 Hz
Communication: Wired Ethernet connection supporting low-latency reliable deployment

Experimental Environment

Real-world evaluation was conducted in a café with the following characteristics:

Unstructured layout with irregularly arranged tables, chairs, and clutter
Partial observability: robot has no prior knowledge of target locations
Relies solely on visual input and semantic instructions

Evaluation Metrics

Navigation success rate
Trajectory smoothness and continuity
Target localization accuracy
System stability and robustness

Experimental Results

Main Results

The goal-oriented map-free navigation layer performs well in the real-world trial; the reported evidence is qualitative:

Successfully explores environments and approaches targets
Recorded base trajectories are smooth and continuous
Maintains stable and robust navigation performance despite irregular layouts

Experimental Findings

Navigation Module Validation: Preliminary deployment was completed successfully, supporting the feasibility of map-free navigation
System Integration: Multi-threaded control achieves real-time operation
Environmental Adaptability: Demonstrates good adaptation in dynamic, unstructured environments

Related Work

Autonomous Navigation

Traditional Methods: Map-based approaches using SLAM and graph planning, effective in static structured environments but costly
Map-Free Methods: Frameworks like UniGoal and NaviLa leverage language and visual cues to guide navigation, significantly reducing deployment costs

End-to-End Imitation Learning

ACT: Employs Transformer backbone and image encoders
Diffusion Policy: Introduces generative diffusion processes to model multimodal action distributions
RISE: Utilizes sparse point cloud encoders for continuous control

Mobile Manipulation

Early Methods: Optimization-based footstep planning and whole-body trajectory generation with high computational costs
Reinforcement Learning Methods: End-to-end control for multiple mobile manipulation tasks
MLM: Combines trajectory libraries with diffusion policy-based inference

Conclusions and Discussion

Main Conclusions

The HANDO framework successfully bridges semantic task understanding with low-level physical control, providing an effective solution for complex last-mile delivery tasks in unstructured and human environments.

Limitations

Incomplete Manipulation Module: Whole-body mobile manipulation control is still under development
Limited Experimental Scope: Primarily validates navigation functionality; manipulation requires further testing
Environmental Complexity: Adaptation capability to extreme dynamic environments requires verification

Future Directions

Refined Whole-Body Mobile Manipulation: Improve coordinated control for grasping and handover
Real-Time Hand Tracking Integration: Enhance safety, robustness, and naturalness of human-robot interaction
Extended Application Scenarios: Validate performance in more complex real-world environments

In-Depth Evaluation

Strengths

Systematic Design: Proposes a complete hierarchical framework effectively separating high-level reasoning from low-level control
Strong Practicality: Designed for real-world applications (last-mile delivery)
Technical Innovation: Organic combination of map-free navigation and whole-body control
Real-World Validation: Preliminary validation on real hardware platform

Weaknesses

Incomplete System: Manipulation module remains in design phase, lacking complete system demonstration
Limited Experimental Depth: Primarily showcases navigation functionality, lacking quantitative performance analysis
Missing Comparative Experiments: Lacks detailed comparison with existing methods
Insufficient Robustness Analysis: Limited analysis of failure cases and boundary conditions

Impact

Academic Value: Provides new system architecture insights for mobile manipulation robots
Practical Value: Has application potential in service robotics and delivery robot domains
Reproducibility: Provides detailed technical descriptions but lacks open-source code

Applicable Scenarios

Last-mile delivery services
Indoor service robot applications
Human-robot collaborative tasks
Mobile manipulation tasks in unstructured environments

References

The paper cites multiple important related works, including:

UniGoal [5]: Universal zero-shot goal-oriented navigation
NaviLa [3]: Vision-language-action navigation model for legged robots
MLM [7]: Multi-task mobile manipulation whole-body control learning
Diffusion Policy [8]: Diffusion-based visuomotor policy learning

Overall Assessment: This is a systems-oriented work with practical value, proposing a complete framework design for mobile manipulation robots. Although the manipulation module is still under development, the successful preliminary deployment of the navigation module supports the feasibility of the approach. The paper's main contributions lie in system architecture design and preliminary real-world validation, laying a foundation for further development in this field.