Seamless loco-manipulation in unstructured environments requires robots to leverage autonomous exploration alongside whole-body control for physical interaction. In this work, we introduce HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with manipulators to perform human-centered mobile manipulation tasks. The first layer utilizes a goal-conditioned autonomous exploration policy to guide the robot to semantically specified targets, such as a black office chair in a dynamic environment. The second layer employs a unified whole-body loco-manipulation policy to coordinate the arm and legs for precise interaction tasks, for example, handing a drink to a person seated on the chair. We have conducted an initial deployment of the navigation module, and will continue to pursue finer-grained deployment of whole-body loco-manipulation.
HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation
- Paper ID: 2510.09221
- Title: HANDO: Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation
- Authors: Jingyuan Sun, Chaoran Wang, Mingyu Zhang, Cui Miao, Hongyu Ji, Zihan Qu, Han Sun, Bing Wang, Qingyi Si
- Category: cs.RO (Robotics)
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.09221
- Video Demonstration: https://youtu.be/YD0qx3vRsfc
This paper presents HANDO (Hierarchical Autonomous Navigation and Dexterous Omni-loco-manipulation), a two-layer framework designed for legged robots equipped with robotic arms to execute human-centric mobile manipulation tasks. The first layer employs a goal-conditioned autonomous exploration policy to guide the robot to semantically specified targets; the second layer uses a unified whole-body mobile manipulation policy that coordinates arm and leg movements for precise interaction tasks. The authors have completed a preliminary deployment of the navigation module and plan to continue refining the deployment of whole-body mobile manipulation.
This research addresses the challenge of seamless mobile manipulation in unstructured environments, particularly the human-robot interaction challenges in last-mile delivery scenarios. Traditional delivery methods rely on pre-built maps and precise localization, which are costly and have limited scalability in dynamic or customized environments.
Last-mile delivery represents a critical application for service robots, requiring robots to not only traverse complex environments but also engage in physical interaction with humans. Quadruped robot platforms equipped with robotic arms combine agile locomotion capabilities with manipulation functionality, providing an ideal implementation platform for complex delivery scenarios.
- Navigation: Most delivery strategies still rely on maps and perform poorly in environments that change frequently or must be deployed to quickly
- Manipulation: Lack of effective whole-body coordination control, making complex human-robot interaction difficult
- Integration Challenges: Sim-to-real deployment faces perception gaps, terrain variations, and hardware constraints
Develop a hierarchical and integrated framework that unifies map-free navigation with whole-body mobile manipulation in a deployable system, achieving comprehensive autonomy for navigating unknown spaces and executing dexterous manipulation actions.
- Proposed a novel map-free navigation module: Employs vision-language models for cross-scene reasoning and graph matching, driving a three-stage exploration strategy to achieve zero-cost navigation
- Designed mobile manipulation policies: Integrates quadruped locomotion and arm control, achieving whole-body interaction behaviors through end-effector trajectory guidance
- System integration and validation: Integrates and validates the system on a real quadruped-arm platform, demonstrating end-to-end last-mile delivery combining semantic navigation and whole-body interaction
The HANDO framework enables quadruped robots equipped with robotic arms to execute complete delivery tasks in unstructured environments, including:
- Input: Semantic target descriptions (e.g., "black office chair"), environmental perception data, human hand trajectories
- Output: Robot motion control commands, arm joint commands
- Constraints: No pre-built maps, real-time requirements, safety constraints
Three-Stage Exploration Process:
- Initial Exploration Stage: When the matching score $s_t < \sigma_1$, the system decomposes the semantic target graph $G_g$ into sub-goals using boundary-based exploration strategies
- Coordinate Projection and Alignment Stage: When $\sigma_1 \le s_t < \sigma_2$, the system aligns the target graph $G_g$ with the current scene graph $G_t$
- Target Verification Stage: When $s_t \ge \sigma_2$, the system executes target verification and scene graph correction
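The threshold logic of the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Stage` and `select_stage` are hypothetical, and the threshold values in the usage line are placeholders, since the summary does not report the values used.

```python
from enum import Enum


class Stage(Enum):
    INITIAL_EXPLORATION = 1   # matching score below the first threshold
    PROJECTION_ALIGNMENT = 2  # score between the two thresholds
    TARGET_VERIFICATION = 3   # score at or above the second threshold


def select_stage(score: float, sigma_1: float, sigma_2: float) -> Stage:
    """Pick the exploration stage from the graph matching score."""
    if not sigma_1 < sigma_2:
        raise ValueError("expected sigma_1 < sigma_2")
    if score < sigma_1:
        return Stage.INITIAL_EXPLORATION
    if score < sigma_2:
        return Stage.PROJECTION_ALIGNMENT
    return Stage.TARGET_VERIFICATION


# Placeholder thresholds (assumed, not from the paper).
stage = select_stage(0.5, sigma_1=0.3, sigma_2=0.7)
```

Keeping the stage choice in a pure function of the score makes the boundary cases (a score exactly equal to either threshold) easy to check in isolation.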
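Action Generation: A VLM-based action decoder selects a discrete action $a_t \in \{\text{move forward}, \text{turn left}, \text{turn right}, \text{stop}\}$, which is mapped to the corresponding continuous velocity command:
$(0.1\,\mathrm{m\,s^{-1}},\ \pi/12\,\mathrm{rad\,s^{-1}},\ -\pi/12\,\mathrm{rad\,s^{-1}},\ 0)$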
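A lookup table is one simple way to realize this discrete-to-continuous mapping. The sketch below assumes each action sets either the forward speed or the yaw rate and leaves the other at zero, and that a positive yaw rate means turning left; the summary only lists the four scalar values, so the `VelocityCommand` structure and the function name are hypothetical.

```python
import math
from typing import NamedTuple


class VelocityCommand(NamedTuple):
    linear: float   # forward speed in m/s
    angular: float  # yaw rate in rad/s (positive = left, assumed convention)


# Values follow the order of the action set: forward, left, right, stop.
ACTION_TO_COMMAND = {
    "move forward": VelocityCommand(0.1, 0.0),
    "turn left": VelocityCommand(0.0, math.pi / 12),
    "turn right": VelocityCommand(0.0, -math.pi / 12),
    "stop": VelocityCommand(0.0, 0.0),
}


def decode_action(action: str) -> VelocityCommand:
    """Map a discrete action label to a velocity command."""
    try:
        return ACTION_TO_COMMAND[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
```

Rejecting unknown labels explicitly matters here because the labels come from a language model's output, which may not always match the expected vocabulary.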
Hand Trajectory Generator:
- Detects operator hand and selects keyframes through hand velocity valleys
- Retargets hand position/orientation to robot gripper tool center point (TCP):
$x_t^{tcp} = SE(3)(T_{cam \to world}) \cdot SE(3)(h_t) \cdot {}^{tcp}T_{hand}$
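The two steps of the hand trajectory generator can be sketched with plain 4x4 homogeneous transforms. This is an illustration under assumptions: a keyframe is taken to be a local minimum of hand speed between uniformly sampled positions, the retargeting is read as a left-to-right product of the camera-to-world pose, the detected hand pose, and a fixed hand-to-TCP offset, and all function names are hypothetical.

```python
import math

Matrix = list[list[float]]  # 4x4 homogeneous transform, row-major


def speed_profile(positions: list[tuple[float, float, float]], dt: float) -> list[float]:
    """Hand speed between consecutive samples taken dt seconds apart."""
    return [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]


def velocity_valleys(speeds: list[float]) -> list[int]:
    """Indices where the speed is a local minimum (candidate keyframes)."""
    return [i for i in range(1, len(speeds) - 1)
            if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x: float, y: float, z: float) -> Matrix:
    """Pure-translation transform, used here only for the usage example."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def retarget_hand_to_tcp(T_cam_to_world: Matrix, T_hand: Matrix, T_offset: Matrix) -> Matrix:
    """Compose camera pose, hand pose, and hand-to-TCP offset (assumed order)."""
    return matmul4(matmul4(T_cam_to_world, T_hand), T_offset)


valleys = velocity_valleys([1.0, 0.2, 0.9, 0.5, 0.1, 0.6])
tcp = retarget_hand_to_tcp(translation(1, 0, 0), translation(0, 2, 0), translation(0, 0, 3))
```

In practice the speed signal would likely need smoothing before valley detection, since raw hand detections are noisy; that step is omitted here.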
Whole-Body Mobile Manipulation Policy:
- State Space: Includes previous action, leg state, arm state, base state, and end-effector trajectory
- Action Space: Uses position PD control with target position $q_t^* = q_{default} + \Delta q_t$
- Reward Function:
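- TCP tracking reward: $r_{track} = \exp(-\sigma_p \| p_t^{tcp} - p_t^{tar} \|) \cdot \exp(-\sigma_o \angle(R_t^{tcp} (R_t^{tar})^T))$
- Regularization reward: $r_{reg} = -\lambda_\tau \| \tau_t \|^2 - \lambda_{\Delta q} \| a_t - a_{t-1} \|^2 - \lambda_{\ddot{q}} \| \ddot{q}_t \|^2$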
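The action target and the two reward terms can be written out in a few lines. The sketch below is one reading of the formulas, with assumptions stated: the orientation error is taken as the geodesic angle of the relative rotation (computed from its trace), the regularization norms are taken as squared Euclidean norms, and all weights and function names are hypothetical rather than the paper's values.

```python
import math


def pd_target(q_default: list[float], delta_q: list[float]) -> list[float]:
    """Joint position target: default pose plus the policy's offset."""
    return [q0 + dq for q0, dq in zip(q_default, delta_q)]


def rotation_angle(R_a: list[list[float]], R_b: list[list[float]]) -> float:
    """Angle of the relative rotation R_a R_b^T, from its trace."""
    # trace(R_a R_b^T) equals the elementwise product summed over all entries.
    trace = sum(R_a[i][j] * R_b[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for safety
    return math.acos(cos_theta)


def tracking_reward(p_tcp, p_tar, R_tcp, R_tar, sigma_p: float, sigma_o: float) -> float:
    """Product of position and orientation tracking terms, each in (0, 1]."""
    pos_err = math.dist(p_tcp, p_tar)
    ang_err = rotation_angle(R_tcp, R_tar)
    return math.exp(-sigma_p * pos_err) * math.exp(-sigma_o * ang_err)


def sq_norm(v: list[float]) -> float:
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def regularization_reward(tau, action, prev_action, q_ddot,
                          lam_tau: float, lam_dq: float, lam_qdd: float) -> float:
    """Penalty on torque, action change, and joint acceleration."""
    delta = [a - b for a, b in zip(action, prev_action)]
    return -(lam_tau * sq_norm(tau)
             + lam_dq * sq_norm(delta)
             + lam_qdd * sq_norm(q_ddot))
```

Because the tracking reward is a product of two exponentials, it equals 1 only when both position and orientation match the target, and it decays toward 0 if either error grows.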
- Cross-Modal Scene Understanding: Combines vision-language models to achieve direct mapping from semantic targets to navigation behaviors
- Hierarchical Control Architecture: Effectively separates high-level semantic reasoning from low-level motion control
- Real-Time Hand Tracking Integration: Guides robot end-effector through human hand trajectories, enhancing naturalness of human-robot interaction
- Unified Whole-Body Control: Coordinates leg locomotion and arm manipulation within a single policy framework
- Robot Platform: Unitree Go1 EDU quadruped robot + AGILEX PIPER lightweight robotic arm
- Computing Device: NVIDIA RTX 4090 GPU
- Control Frequency: Both the locomotion policy and the whole-body mobile manipulation policy run at 50 Hz
- Communication: Wired Ethernet connection supporting low-latency reliable deployment
Real-world evaluation was conducted in a café with the following characteristics:
- Unstructured layout with irregularly arranged tables, chairs, and clutter
- Partial observability: robot has no prior knowledge of target locations
- Relies solely on visual input and semantic instructions
- Navigation success rate
- Trajectory smoothness and continuity
- Target localization accuracy
- System stability and robustness
The goal-oriented map-free navigation layer performs well in the real environment, based on qualitative observations:
- Successfully explores environments and approaches targets
- Recorded base trajectories are smooth and continuous
- Maintains stable and robust navigation performance despite irregular layouts
- Navigation Module Validation: Successfully completed preliminary deployment, demonstrating the feasibility of map-free navigation
- System Integration: Multi-threaded control achieves real-time operation
- Environmental Adaptability: Demonstrates good adaptation in dynamic, unstructured environments
- Traditional Methods: Map-based approaches using SLAM and graph planning, effective in static structured environments but costly
- Map-Free Methods: Frameworks like UniGoal and NaviLa leverage language and visual cues to guide navigation, significantly reducing deployment costs
- ACT: Employs Transformer backbone and image encoders
- Diffusion Policy: Introduces generative diffusion processes to model multimodal action distributions
- RISE: Utilizes sparse point cloud encoders for continuous control
- Early Methods: Optimization-based footstep planning and whole-body trajectory generation with high computational costs
- Reinforcement Learning Methods: End-to-end control for multiple mobile manipulation tasks
- MLM: Combines trajectory libraries with diffusion policy-based inference
The HANDO framework successfully bridges semantic task understanding with low-level physical control, providing an effective solution for complex last-mile delivery tasks in unstructured and human environments.
- Incomplete Manipulation Module: Whole-body mobile manipulation control is still under development
- Limited Experimental Scope: Primarily validates navigation functionality; manipulation requires further testing
- Environmental Complexity: Adaptation capability to extreme dynamic environments requires verification
- Refined Whole-Body Mobile Manipulation: Improve coordinated control for grasping and handover
- Real-Time Hand Tracking Integration: Enhance safety, robustness, and naturalness of human-robot interaction
- Extended Application Scenarios: Validate performance in more complex real-world environments
- Systematic Design: Proposes a complete hierarchical framework effectively separating high-level reasoning from low-level control
- Strong Practicality: Designed for real-world applications (last-mile delivery)
- Technical Innovation: Organic combination of map-free navigation and whole-body control
- Real-World Validation: Preliminary validation on real hardware platform
- Incomplete System: Manipulation module remains in design phase, lacking complete system demonstration
- Limited Experimental Depth: Primarily showcases navigation functionality, lacking quantitative performance analysis
- Missing Comparative Experiments: Lacks detailed comparison with existing methods
- Insufficient Robustness Analysis: Limited analysis of failure cases and boundary conditions
- Academic Value: Provides new system architecture insights for mobile manipulation robots
- Practical Value: Has application potential in service robotics and delivery robot domains
- Reproducibility: Provides detailed technical descriptions but lacks open-source code
- Last-mile delivery services
- Indoor service robot applications
- Human-robot collaborative tasks
- Mobile manipulation tasks in unstructured environments
The paper cites multiple important related works, including:
- UniGoal [5]: Universal zero-shot goal-oriented navigation
- NaviLa [3]: Vision-language-action navigation model for legged robots
- MLM [7]: Multi-task mobile manipulation whole-body control learning
- Diffusion Policy [8]: Diffusion-based visuomotor policy learning
Overall Assessment: This is a practically valuable systems paper that proposes a complete framework design for mobile manipulation robots. Although the manipulation module is still under development, the successful preliminary deployment of the navigation module supports the feasibility of the approach. The paper's main contributions lie in system architecture design and preliminary real-world validation, laying a foundation for further development in this field.