Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation

Almeida, Lazzarini, Negri et al.

This paper presents a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, with a focus on improving precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work proves that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.

Basic Information

Paper ID: 2508.17466
Title: Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation
Authors: Dilermando Almeida, Guilherme Lazzarini, Juliano Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker
Classification: cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY
Publication Date: October 11, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2508.17466v2
Funding Agency: Petróleo Brasileiro S/A - Petrobras

Abstract

This paper proposes a deep learning framework designed to enhance the grasping capabilities of quadruped robots equipped with robotic arms, with emphasis on improving precision and adaptability. The approach employs a sim-to-real methodology to minimize reliance on physical data collection. The authors developed a pipeline in the Genesis simulation environment to generate synthetic datasets of grasping attempts on common objects. By simulating thousands of interactions from various viewpoints, pixel-level annotated grasp quality maps were created as ground truth for the model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multimodal inputs from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs grasp quality heatmaps to identify optimal grasp points. The authors validated the complete framework on a quadruped robot, demonstrating successful execution of a complete loco-manipulation task: autonomous navigation to target objects, object perception with sensors, using the model to predict optimal grasp poses, and executing precise grasping.

Research Background and Motivation

Problem Definition

Precise and adaptive grasping by quadruped robots in complex unstructured environments remains a significant challenge. Traditional methods typically require extensive real-world calibration and pre-programmed grasp configurations, which limits their flexibility.

Significance

Application Value: Quadruped robots equipped with robotic arms can achieve loco-manipulation, with important applications in industrial automation, search and rescue operations, and assistive technologies
Technical Challenges: Requires robust object recognition in dynamic scenes, accurate grasp planning, and seamless integration with locomotion systems
Environmental Adaptability: Ability to operate effectively in unpredictable unstructured environments

Limitations of Existing Methods

Dependence on Predefined Configurations: Traditional methods rely on predefined grasp configurations or intensive manual calibration
Lack of Generalization: Existing solutions are typically context-specific and lack adaptability across scenarios
Data Collection Costs: Requires extensive real-world data collection, which is costly and time-consuming

Research Motivation

Inspired by recent successful applications of deep learning in robotic grasping, the authors propose a deep learning framework specifically designed for quadruped robots, overcoming the limitations of traditional methods through simulation-based training.

Core Contributions

Developed a training pipeline based on the Genesis simulator enabling large-scale parallel data collection without requiring real-world data
Integrated advanced perception methods (such as D2NT) to improve depth-based grasping accuracy and reduce computational costs of ML execution
Developed a flexible framework capable of integrating with high-level control APIs and commercial robots lacking low-level access
Validated the method on a physical robot, demonstrating that it works in real-world scenarios

Methodology Details

Task Definition

Input: RGB-D camera data (RGB images, depth maps, segmentation masks, surface normal maps)
Output: Grasp quality heatmap identifying 3D coordinates and orientation of optimal grasp points
Constraints: Achieve precise grasping in quadruped robot loco-manipulation scenarios
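As a rough illustration of how a grasp point might be read off the output heatmap, the sketch below picks the highest-scoring pixel. Restricting the search to the segmentation mask is an assumption made here for illustration, and `best_grasp_pixel` is a hypothetical helper, not a function from the paper's code.

```python
import numpy as np

def best_grasp_pixel(heatmap, mask):
    """Return (row, col) of the highest-scoring pixel inside the object mask."""
    # Assumption: pixels outside the mask are excluded by scoring them as -inf.
    scores = np.where(mask, heatmap, -np.inf)
    flat_index = int(np.argmax(scores))
    row, col = np.unravel_index(flat_index, scores.shape)
    return int(row), int(col)

# Toy 480x640 heatmap with made-up values.
heatmap = np.zeros((480, 640), dtype=np.float32)
mask = np.zeros((480, 640), dtype=bool)
mask[200:300, 300:340] = True
heatmap[250, 320] = 0.9    # best score inside the object
heatmap[10, 10] = 0.99     # spurious peak outside the object, ignored
best = best_grasp_pixel(heatmap, mask)
```

The selected pixel would still need a depth value and camera pose to become a 3D grasp target.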

Dataset Generation

Simulation Environment Setup

Utilized the Genesis framework for physics simulation
Selected water bottle 3D models as grasping targets
Configured virtual RGB-D cameras to extract object images

Camera Position Sampling

Sampled 1000 camera positions on a 2D grid
100 points along the X axis and 10 along the Z axis (100 × 10 = 1000; range -0.5m to 0.5m)
Y-axis fixed at y=0.5m
Added random perturbations to each position (X,Y: ±0.03m, Z: 0-0.09m)
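The sampling scheme above can be sketched as follows. This is a minimal illustration, assuming the -0.5m to 0.5m range applies to both grid axes and that the perturbations are drawn uniformly; `sample_camera_positions` and the seed are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_camera_positions(seed=0):
    """Return a (1000, 3) array of perturbed camera positions (x, y, z) in metres."""
    rng = np.random.default_rng(seed)
    xs = np.linspace(-0.5, 0.5, 100)   # 100 grid points along X
    zs = np.linspace(-0.5, 0.5, 10)    # 10 grid points along Z (range assumed)
    grid_x, grid_z = np.meshgrid(xs, zs, indexing="ij")
    n = grid_x.size
    # Y is fixed at 0.5 m before perturbation.
    positions = np.column_stack([grid_x.ravel(), np.full(n, 0.5), grid_z.ravel()])
    # Assumption: uniform noise, X and Y in [-0.03, 0.03], Z in [0, 0.09].
    jitter = np.column_stack([
        rng.uniform(-0.03, 0.03, n),
        rng.uniform(-0.03, 0.03, n),
        rng.uniform(0.0, 0.09, n),
    ])
    return positions + jitter

positions = sample_camera_positions()
```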

Grasp Annotation Generation

Performed grasping attempts for each pixel:

Converted pixel coordinates to global coordinate system
Computed corresponding surface normal vectors
Initiated end-effector grasping attempts starting 1.0m from object, at 0.35m from surface
Determined grasp success (1) or failure (0) based on collision detection
Marked areas outside objects as uncertain (-1)
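A minimal sketch of how such a three-valued annotation map could be assembled is shown below. The simulator call is replaced by a placeholder callback: `attempt_grasp` and `toy_attempt` are hypothetical stand-ins for the Genesis grasp trial and collision check, which are not reproduced here.

```python
import numpy as np

UNCERTAIN, FAILURE, SUCCESS = -1, 0, 1

def build_label_map(object_mask, attempt_grasp):
    """Label every object pixel by the outcome of a grasp attempt at that pixel."""
    # Everything starts as uncertain; only object pixels are overwritten.
    labels = np.full(object_mask.shape, UNCERTAIN, dtype=np.int8)
    for row, col in zip(*np.nonzero(object_mask)):
        succeeded = attempt_grasp(int(row), int(col))
        labels[row, col] = SUCCESS if succeeded else FAILURE
    return labels

# Toy example: a 4x4 object in a 6x8 image.
object_mask = np.zeros((6, 8), dtype=bool)
object_mask[1:5, 2:6] = True

def toy_attempt(row, col):
    # Hypothetical rule standing in for the simulator: only the two
    # central columns of the object are graspable.
    return col in (3, 4)

labels = build_label_map(object_mask, toy_attempt)
```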

Model Architecture

Network Design

Architecture: Fully convolutional encoder-decoder structure based on U-Net
Encoder: MobileNetV2 as backbone network
Input: 480×640×8 channels (RGB + depth + normal map + segmentation mask)
Output: Single-channel grasp quality map
Parameters: Approximately 5.44 million trainable parameters
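The 8 input channels follow from 3 (RGB) + 1 (depth) + 3 (surface normals) + 1 (segmentation mask). The sketch below assembles such a tensor; the channel order, the scaling of RGB to [0, 1], and the helper name `stack_inputs` are assumptions for illustration only.

```python
import numpy as np

def stack_inputs(rgb, depth, normals, mask):
    """Concatenate the four modalities into one H x W x 8 float32 array."""
    channels = [
        rgb.astype(np.float32) / 255.0,         # 3 channels, scaled to [0, 1] (assumed)
        depth.astype(np.float32)[..., None],    # 1 channel
        normals.astype(np.float32),             # 3 channels
        mask.astype(np.float32)[..., None],     # 1 channel
    ]
    return np.concatenate(channels, axis=-1)

h, w = 480, 640
x = stack_inputs(
    np.zeros((h, w, 3), dtype=np.uint8),
    np.ones((h, w), dtype=np.float32),
    np.zeros((h, w, 3), dtype=np.float32),
    np.zeros((h, w), dtype=bool),
)
```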

Key Technical Details

Employed GroupNorm for improved training stability
Skip connections to fuse fine-grained features from encoder
Transposed convolutions for upsampling
1×1 convolutions to generate final output
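For readers unfamiliar with GroupNorm, the following NumPy sketch shows the normalization it performs: statistics are computed per sample over groups of channels, so the result does not depend on batch size. The learnable scale and shift are omitted, and the group count used here is arbitrary; the summary does not state the value used in the paper.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, H, W) within channel groups, per sample."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    # Each group holds C / num_groups consecutive channels and all spatial positions.
    grouped = x.reshape(n, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    normalized = (grouped - mean) / np.sqrt(var + eps)
    return normalized.reshape(n, c, h, w)

rng = np.random.default_rng(0)
features = rng.normal(3.0, 2.0, size=(2, 8, 4, 4))
out = group_norm(features, num_groups=4)
```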

Technical Innovations

Multimodal Fusion: Effectively combines RGB, depth, normal vectors, and segmentation information
Sim-to-Real Transfer: Successfully trained entirely on simulated data and deployed to physical robots
End-to-End Pipeline: Complete automation from perception to execution
Surface Normal Integration: Utilizes D2NT algorithm to estimate surface normals from depth maps
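To make the idea of depth-derived normals concrete, here is a simple finite-difference estimate: back-project each pixel to 3D, take the cross product of the two image-space tangent vectors, and normalize. This is a generic baseline and not the D2NT algorithm itself; the function name and the choice to orient normals toward the camera are assumptions.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, u0, v0):
    """Estimate unit surface normals (H x W x 3, camera frame) from a depth map."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float64)
    # Pinhole back-projection of every pixel to a 3D point.
    points = np.stack(
        [(u - u0) * depth / fx, (v - v0) * depth / fy, depth], axis=-1
    )
    # Tangent vectors along image rows (v) and columns (u).
    d_dv, d_du = np.gradient(points, axis=(0, 1))
    # This cross-product order yields normals pointing toward the camera (-Z).
    normals = np.cross(d_dv, d_du)
    length = np.linalg.norm(normals, axis=-1, keepdims=True)
    return normals / np.maximum(length, 1e-12)

# A fronto-parallel plane 1 m away should give normals of (0, 0, -1).
flat = np.full((48, 64), 1.0)
flat_normals = normals_from_depth(flat, fx=554.26, fy=554.26, u0=32.0, v0=24.0)
```

Finite differences of this kind amplify depth noise, which is one reason dedicated estimators are used in practice.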

Experimental Setup

Dataset

Simulated Data: Synthetic data generated from 1000 viewpoints in Genesis environment
Resolution: 480×640 pixels
Annotation Method: Pixel-level grasp quality annotation (success/failure/uncertain)
Object Types: Water bottle models (later extended to thermal bottles)
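One plausible way to train on success/failure/uncertain labels is to exclude the uncertain pixels from the loss. The summary does not state which loss the authors used, so the masked binary cross-entropy below is purely an illustrative assumption, and `masked_bce` is a hypothetical name.

```python
import numpy as np

def masked_bce(pred, labels, eps=1e-7):
    """Mean binary cross-entropy over pixels labelled 0 or 1; -1 pixels are ignored."""
    valid = labels >= 0
    if not valid.any():
        return 0.0
    p = np.clip(pred[valid], eps, 1.0 - eps)
    y = labels[valid].astype(np.float64)
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(loss.mean())

labels = np.array([[1, 0, -1], [-1, -1, 1]])
pred = np.array([[0.9, 0.1, 0.5], [0.2, 0.8, 0.9]])
loss = masked_bce(pred, labels)
```

Predictions at the three uncertain pixels have no effect on the returned value.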

Evaluation Metrics

Grasp success rate
Localization accuracy
Real-time performance

Experimental Platform

Robot: Boston Dynamics Spot quadruped robot
Sensors: End-effector RGB-D camera
Control: Boston Dynamics SDK
Object Detection: YOLOv11 pre-trained model

Implementation Details

Camera Intrinsics: fx, fy ≈ 554.26 pixels, principal point (u0=320, v0=240)
Maximum Torque: 3.0 Nm
Grasping Distance: 0.35m from object surface
Force Control: Force-limited control based on SDK
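With these intrinsics, a pixel and its depth can be back-projected to a 3D point in the camera frame using the standard pinhole model, as sketched below. The function names are illustrative, and the transform from camera frame to world frame (which needs the camera pose) is left out. As a side observation, 554.26 px is numerically consistent with a 60° horizontal field of view at 640 px width; the summary itself does not state the field of view.

```python
import math

FX = FY = 554.26          # focal lengths in pixels (from the summary)
U0, V0 = 320.0, 240.0     # principal point in pixels (from the summary)

def pixel_to_camera(u, v, depth_m):
    """Back-project pixel (u, v) with depth in metres to camera-frame (x, y, z)."""
    x = (u - U0) * depth_m / FX
    y = (v - V0) * depth_m / FY
    return (x, y, depth_m)

def focal_from_fov(width_px, hfov_deg):
    """Focal length in pixels implied by a horizontal field of view."""
    return (width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

center = pixel_to_camera(320, 240, 0.35)   # pixel at the principal point
edge = pixel_to_camera(640, 240, 1.0)      # right image border at 1 m depth
implied_fx = focal_from_fov(640, 60.0)     # assumption: 60 degree FOV
```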

Experimental Results

Main Results

The paper successfully demonstrated a complete loco-manipulation task:

Autonomous Navigation: Robot successfully identified and approached target objects
Perception Accuracy: RGB-D data successfully acquired and processed
Grasp Prediction: CNN model accurately predicted optimal grasp points
Execution Success: Physical robot successfully grasped thermal bottles

System Performance

Real-time Processing: Capable of processing 480×640 resolution multimodal inputs in real-time
Robustness: Demonstrated good adaptability in real-world environments
Precision: Successfully achieved precise force-controlled grasping

Case Analysis

From Figure 8, the following observations are evident:

RGB images clearly capture target objects
Depth maps provide accurate spatial information
YOLOv11 generates precise segmentation masks
D2NT algorithm successfully generates surface normal maps
The model's output grasp heatmaps accurately identify optimal regions

Related Work

Loco-Manipulation Research

Early research focused on developing stable locomotion systems and basic end-effector integration
Traditional methods were based on rigid kinematic models and fixed rule-based control strategies
Recent advances include high-precision sensors, computer vision techniques, and motion planning architectures

Deep Learning Applications in Grasping

Machine learning algorithms typically return end-effector opening, orientation, and grasp quality
Deep learning methods learn generalized grasping strategies from data
Sim-to-real transfer has become an important direction for reducing data collection costs

Quadruped Robot Manipulation

Quadruped robots demonstrate excellent performance in complex terrain navigation
Equipped with robotic arms, they achieve loco-manipulation capabilities
Promising applications in industrial automation, search and rescue, and assistive technologies

Conclusions and Discussion

Main Conclusions

Method Effectiveness: Simulation-based deep learning successfully achieves precise grasping for quadruped robots
Technical Feasibility: The combination of multimodal perception and CNN prediction validates the technical approach
Practical Value: Complete loco-manipulation pipeline provides feasible solutions for practical applications

Limitations

Limited Generalization: Model generalization constrained by object geometry and texture variations
Sensor Quality: Lower quality depth sensors on end-effectors result in noisy depth maps
Preprocessing Consistency: Segmentation mask resizing occasionally affects preprocessing consistency
Object Diversity: Currently primarily targets specific-shaped objects (bottle-like)

Future Directions

Dataset Expansion: Include more diverse object shapes, sizes, and textures
Sensor Improvements: Implement smoothing filters or dedicated ML models for depth map denoising
Control Strategies: Explore locomotion and manipulation strategies beyond SDK tools
Complex Environments: Test in complex scenarios with multiple objects and irregular surfaces

In-Depth Evaluation

Strengths

Strong Innovation: Successfully applies sim-to-real methodology to quadruped robot grasping
Complete System: End-to-end solution from perception to execution
Good Practicality: Validated method effectiveness on physical robots
Advanced Technology: Effectively fuses multimodal information with modern deep learning techniques

Weaknesses

Limited Evaluation: Lacks quantitative success rate statistics and comparisons with other methods
Single Object Type: Primarily targets bottle-shaped objects; generalization requires further verification
Simple Environment: Experimental environment relatively simple; performance in complex scenarios unknown
Theoretical Analysis: Lacks in-depth analysis of theoretical foundations and failure cases

Impact

Academic Contribution: Provides new technical pathway for quadruped robot loco-manipulation
Practical Value: Offers reference for industrial applications and service robot development
Reproducibility: Provides GitHub repository facilitating research reproduction and extension
Interdisciplinary Impact: Combines robotics, computer vision, and deep learning

Applicable Scenarios

Industrial Automation: Material handling and manipulation in complex environments
Search and Rescue: Object recognition and rescue operations at disaster sites
Service Robots: Object manipulation in home and office environments
Research Platform: Development and validation platform for loco-manipulation algorithms

References

The paper cites 14 relevant references covering key works in loco-manipulation, quadruped robots, and deep learning-based grasping, providing a solid theoretical foundation for the research.

Overall Assessment: This is an application-oriented research paper with a clear technical approach and a complete implementation. While it has some limitations in theoretical innovation and comprehensive evaluation, its complete system implementation and physical robot validation provide valuable contributions to quadruped robot loco-manipulation research. This work establishes a solid foundation for subsequent research, particularly in sim-to-real transfer and multimodal perception fusion.