This paper presents a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, with a focus on improving precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work proves that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.
- Paper ID: 2508.17466
- Title: Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation
- Authors: Dilermando Almeida, Guilherme Lazzarini, Juliano Negri, Thiago H. Segreto, Ricardo V. Godoy, Marcelo Becker
- Classification: cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY
- Publication Date: October 11, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2508.17466v2
- Funding Agency: Petróleo Brasileiro S/A - Petrobras
This paper proposes a deep learning framework designed to enhance the grasping capabilities of quadruped robots equipped with robotic arms, with emphasis on improving precision and adaptability. The approach employs a sim-to-real methodology to minimize reliance on physical data collection. The authors developed a pipeline in the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various viewpoints, they created pixel-level annotated grasp quality maps as ground truth for the model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multimodal inputs from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp quality heatmap from which the optimal grasp point is identified. The authors validated the complete framework on a quadruped robot, which executed a full loco-manipulation task: autonomous navigation to the target object, sensor-based perception, model-based prediction of the optimal grasp pose, and execution of a precise grasp.
Precise and adaptive grasping by quadruped robots in complex unstructured environments remains a significant challenge. Traditional methods typically require extensive real-world calibration and pre-programmed grasp configurations, which limits their flexibility.
- Application Value: Quadruped robots equipped with robotic arms can achieve loco-manipulation, with important applications in industrial automation, search and rescue operations, and assistive technologies
- Technical Challenges: Requires robust object recognition in dynamic scenes, accurate grasp planning, and seamless integration with locomotion systems
- Environmental Adaptability: Ability to operate effectively in unpredictable unstructured environments
- Dependence on Predefined Configurations: Traditional methods rely on predefined grasp configurations or intensive manual calibration
- Lack of Generalization: Existing solutions are typically context-specific and lack adaptability across scenarios
- Data Collection Costs: Requires extensive real-world data collection, which is costly and time-consuming
Inspired by recent successful applications of deep learning in robotic grasping, the authors propose a deep learning framework specifically designed for quadruped robots, overcoming the limitations of traditional methods through simulation-based training.
- Developed a training pipeline based on the Genesis simulator enabling large-scale parallel data collection without requiring real-world data
- Integrated advanced perception methods (such as D2NT) to improve depth-based grasping accuracy and reduce computational costs of ML execution
- Developed a flexible framework capable of integrating with high-level control APIs and commercial robots lacking low-level access
- Validated the method on physical robots, demonstrating its effectiveness in real-world scenarios
Input: RGB-D camera data (RGB images, depth maps, segmentation masks, surface normal maps)
Output: Grasp quality heatmap from which the 3D coordinates and orientation of the optimal grasp point are derived
Constraints: Achieve precise grasping in quadruped robot loco-manipulation scenarios
- Utilized the Genesis framework for physics simulation
- Selected water bottle 3D models as grasping targets
- Configured virtual RGB-D cameras to extract object images
- Sampled 1000 different positions on a 2D grid
- 100 points along the X axis and 10 along the Z axis (range -0.5 m to 0.5 m), giving 100 × 10 = 1000 positions
- Y-axis fixed at y=0.5m
- Added random perturbations to each position (X,Y: ±0.03m, Z: 0-0.09m)
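The sampling scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `sample_positions` is hypothetical, the perturbations are assumed to be uniformly distributed, the -0.5 m to 0.5 m range is assumed to apply to both X and Z, and the Y perturbation is assumed to be applied around the fixed value y = 0.5 m.

```python
import numpy as np

def sample_positions(seed=0):
    """Return a (1000, 3) array of perturbed (x, y, z) positions in metres."""
    rng = np.random.default_rng(seed)

    # Regular grid: 100 points on X, 10 points on Z, Y fixed at 0.5 m.
    xs = np.linspace(-0.5, 0.5, 100)
    zs = np.linspace(-0.5, 0.5, 10)
    gx, gz = np.meshgrid(xs, zs, indexing="ij")
    n = gx.size  # 100 * 10 = 1000
    base = np.stack([gx.ravel(), np.full(n, 0.5), gz.ravel()], axis=1)

    # Random perturbations (assumed uniform): X, Y in +/-0.03 m, Z in [0, 0.09] m.
    noise = np.column_stack([
        rng.uniform(-0.03, 0.03, n),
        rng.uniform(-0.03, 0.03, n),
        rng.uniform(0.0, 0.09, n),
    ])
    return base + noise
```

Fixing the seed makes the set of positions reproducible, which helps when regenerating or extending a synthetic dataset.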
Performed grasping attempts for each pixel:
- Converted pixel coordinates to global coordinate system
- Computed corresponding surface normal vectors
- Initiated end-effector grasping attempts starting 1.0m from object, at 0.35m from surface
- Determined grasp success (1) or failure (0) based on collision detection
- Marked areas outside objects as uncertain (-1)
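As a rough sketch of the labeling steps above, the snippet below back-projects a pixel to 3D with the standard pinhole camera model and builds a label map using the 1 / 0 / -1 convention. The intrinsics are the values reported for the camera (fx, fy ≈ 554.26, u0 = 320, v0 = 240); the function names, the 4×4 camera-to-world pose matrix, and the `outcomes` dictionary are hypothetical and stand in for whatever the simulator provides.

```python
import numpy as np

FX = FY = 554.26          # focal lengths in pixels (values reported for the camera)
U0, V0 = 320.0, 240.0     # principal point

def pixel_to_camera(u, v, depth_m):
    """Back-project pixel (u, v) with metric depth to camera coordinates."""
    x = (u - U0) * depth_m / FX
    y = (v - V0) * depth_m / FY
    return np.array([x, y, depth_m])

def camera_to_world(T_world_cam, p_cam):
    """Apply a 4x4 homogeneous camera-to-world transform (assumed to be known)."""
    return (T_world_cam @ np.append(p_cam, 1.0))[:3]

def build_label_map(mask, outcomes):
    """mask: bool (H, W) object mask; outcomes: {(v, u): bool} grasp results.

    Returns an int8 map with 1 = success, 0 = failure, -1 = uncertain.
    """
    labels = np.full(mask.shape, -1, dtype=np.int8)  # everything starts uncertain
    for (v, u), success in outcomes.items():
        if mask[v, u]:  # only pixels on the object receive a 0/1 label
            labels[v, u] = 1 if success else 0
    return labels
```

In this sketch, object pixels with no recorded attempt also stay at -1, so an incomplete simulation run never produces a misleading failure label.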
- Architecture: Fully convolutional encoder-decoder structure based on U-Net
- Encoder: MobileNetV2 as backbone network
- Input: 480×640×8 channels (RGB + depth + normal map + segmentation mask)
- Output: Single-channel grasp quality map
- Parameters: Approximately 5.44 million trainable parameters
- Employed GroupNorm for improved training stability
- Skip connections to fuse fine-grained features from encoder
- Transposed convolutions for upsampling
- 1×1 convolutions to generate final output
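To make the 8-channel input concrete, the following sketch stacks the four modalities into a single 480×640×8 array (3 RGB + 1 depth + 3 normal + 1 mask channels). The channel order and the scaling of RGB to [0, 1] are assumptions made for illustration; the summary does not specify how the authors normalize or order the channels.

```python
import numpy as np

def stack_inputs(rgb, depth, normals, mask):
    """Combine modalities into one (H, W, 8) float32 array.

    rgb:     (H, W, 3) uint8 image
    depth:   (H, W)    depth in metres
    normals: (H, W, 3) unit surface normals
    mask:    (H, W)    boolean segmentation mask
    """
    rgb = rgb.astype(np.float32) / 255.0          # assumed scaling to [0, 1]
    depth = depth.astype(np.float32)[..., None]   # add a channel axis
    normals = normals.astype(np.float32)
    mask = mask.astype(np.float32)[..., None]
    x = np.concatenate([rgb, depth, normals, mask], axis=-1)
    assert x.shape[-1] == 8, "expected 3 + 1 + 3 + 1 channels"
    return x
```

Frameworks such as PyTorch expect channels first, so a real pipeline would likely transpose this array to (8, H, W) before feeding it to the network.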
- Multimodal Fusion: Effectively combines RGB, depth, normal vectors, and segmentation information
- Sim-to-Real Transfer: Successfully trained entirely on simulated data and deployed to physical robots
- End-to-End Pipeline: Complete automation from perception to execution
- Surface Normal Integration: Utilizes D2NT algorithm to estimate surface normals from depth maps
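The idea of estimating surface normals directly from a depth map can be illustrated with a simplified gradient-based estimator. This is not the D2NT implementation (which, as published, adds discontinuity-aware gradient filtering and a refinement step); it only applies the closed-form relation between depth gradients and normals under a pinhole model, n ∝ (fx·∂Z/∂u, fy·∂Z/∂v, -(Z + (u-u0)·∂Z/∂u + (v-v0)·∂Z/∂v)), using plain finite differences. The function name and default intrinsics are assumptions.

```python
import numpy as np

def normals_from_depth(depth, fx=554.26, fy=554.26, u0=320.0, v0=240.0):
    """Estimate unit surface normals (H, W, 3) from a metric depth map (H, W)."""
    depth = depth.astype(np.float64)
    h, w = depth.shape

    # Finite-difference depth gradients: axis 0 is rows (v), axis 1 is columns (u).
    dz_dv, dz_du = np.gradient(depth)

    u = np.arange(w)[None, :]
    v = np.arange(h)[:, None]

    nx = fx * dz_du
    ny = fy * dz_dv
    nz = -(depth + (u - u0) * dz_du + (v - v0) * dz_dv)

    n = np.stack([nx, ny, nz], axis=-1)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, 1e-12)  # guard against division by zero
```

With this sign convention, a flat surface facing the camera yields normals of (0, 0, -1), that is, pointing back toward the sensor. Plain finite differences are sensitive to depth noise, which is consistent with the sensor-quality limitation noted for the end-effector camera.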
- Simulated Data: Synthetic data generated from 1000 viewpoints in Genesis environment
- Resolution: 480×640 pixels
- Annotation Method: Pixel-level grasp quality annotation (success/failure/uncertain)
- Object Types: Water bottle models (later extended to thermal bottles)
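Because the annotation uses three values (success, failure, uncertain), a training loss needs some way to handle the uncertain pixels. One common option, shown here purely as an assumption since the summary does not state the loss the authors used, is a binary cross-entropy computed only over pixels labeled 0 or 1, with the -1 pixels ignored. The function name `masked_bce` is hypothetical.

```python
import numpy as np

def masked_bce(pred, labels, eps=1e-7):
    """Mean binary cross-entropy over labeled pixels only.

    pred:   (H, W) predicted grasp quality in [0, 1]
    labels: (H, W) with 1 = success, 0 = failure, -1 = uncertain (ignored)
    """
    valid = labels >= 0
    if not valid.any():
        return 0.0  # nothing to supervise in this sample
    p = np.clip(pred[valid], eps, 1.0 - eps)  # avoid log(0)
    y = labels[valid].astype(np.float64)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).mean())
```

Ignoring uncertain pixels means the network is never penalized for its predictions on the background, so only the object region shapes the heatmap.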
- Grasp success rate
- Localization accuracy
- Real-time performance
- Robot: Boston Dynamics Spot quadruped robot
- Sensors: End-effector RGB-D camera
- Control: Boston Dynamics SDK
- Object Detection: YOLOv11 pre-trained model
- Camera Intrinsics: fx, fy ≈ 554.26 pixels, principal point (u0=320, v0=240)
- Maximum Torque: 3.0 Nm
- Grasping Distance: 0.35m from object surface
- Force Control: Force-limited control based on SDK
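The parameters above suggest how a predicted heatmap might be turned into a grasp target. The sketch below picks the highest-scoring pixel inside the segmentation mask, back-projects it with the listed intrinsics, and places a pre-grasp point 0.35 m from the surface along the normal. It is an illustration under stated assumptions: the function name is hypothetical, normals are assumed to point away from the surface toward the camera, and the result is expressed in the camera frame; the actual pipeline issues commands through the Boston Dynamics SDK, which is not shown here.

```python
import numpy as np

STANDOFF_M = 0.35  # grasping distance from the object surface

def select_grasp(heatmap, mask, depth, normals,
                 fx=554.26, fy=554.26, u0=320.0, v0=240.0):
    """Return ((v, u), surface_point, pregrasp_point) in the camera frame."""
    if not mask.any():
        raise ValueError("segmentation mask is empty")

    # Restrict the search to object pixels, then take the best score.
    scores = np.where(mask, heatmap, -np.inf)
    v, u = np.unravel_index(np.argmax(scores), scores.shape)

    # Pinhole back-projection of the selected pixel.
    z = float(depth[v, u])
    point = np.array([(u - u0) * z / fx, (v - v0) * z / fy, z])

    # Offset along the (assumed outward-pointing) unit normal.
    n = normals[v, u] / np.linalg.norm(normals[v, u])
    pregrasp = point + STANDOFF_M * n
    return (int(v), int(u)), point, pregrasp
```

Masking before the argmax keeps a spurious high score on the background from being selected as the grasp point.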
The paper successfully demonstrated a complete loco-manipulation task:
- Autonomous Navigation: Robot successfully identified and approached target objects
- Perception Accuracy: RGB-D data successfully acquired and processed
- Grasp Prediction: CNN model accurately predicted optimal grasp points
- Execution Success: Physical robot successfully grasped thermal bottles
- Real-time Processing: Capable of processing 480×640 resolution multimodal inputs in real-time
- Robustness: Demonstrated good adaptability in real-world environments
- Precision: Successfully achieved precise force-controlled grasping
From Figure 8, the following observations are evident:
- RGB images clearly capture target objects
- Depth maps provide accurate spatial information
- YOLOv11 generates precise segmentation masks
- D2NT algorithm successfully generates surface normal maps
- Model output grasp heatmaps accurately identify optimal regions
- Early research focused on developing stable locomotion systems and basic end-effector integration
- Traditional methods are based on rigid kinematic models and fixed rule-based control strategies
- Recent advances include high-precision sensors, computer vision techniques, and motion planning architectures
- Machine learning algorithms typically return end-effector opening, orientation, and grasp quality
- Deep learning methods learn generalized grasping strategies from data
- Sim-to-real transfer has become an important direction for reducing data collection costs
- Quadruped robots demonstrate excellent performance in complex terrain navigation
- Equipped with robotic arms, they achieve loco-manipulation capabilities
- Promising applications in industrial automation, search and rescue, and assistive technologies
- Method Effectiveness: Simulation-based deep learning successfully achieves precise grasping for quadruped robots
- Technical Feasibility: The combination of multimodal perception and CNN prediction validates the technical approach
- Practical Value: Complete loco-manipulation pipeline provides feasible solutions for practical applications
- Limited Generalization: Model generalization constrained by object geometry and texture variations
- Sensor Quality: Lower quality depth sensors on end-effectors result in noisy depth maps
- Preprocessing Consistency: Resizing the segmentation masks occasionally introduces inconsistencies in preprocessing
- Object Diversity: Currently targets mainly objects of a specific shape (bottle-like)
- Dataset Expansion: Include more diverse object shapes, sizes, and textures
- Sensor Improvements: Implement smoothing filters or dedicated ML models for depth map denoising
- Control Strategies: Explore locomotion and manipulation strategies beyond SDK tools
- Complex Environments: Test in complex scenarios with multiple objects and irregular surfaces
- Strong Innovation: Successfully applies sim-to-real methodology to quadruped robot grasping
- Complete System: End-to-end solution from perception to execution
- Good Practicality: Validated method effectiveness on physical robots
- Advanced Technology: Effectively fuses multimodal information with modern deep learning techniques
- Limited Evaluation: Lacks quantitative success rate statistics and comparisons with other methods
- Single Object Type: Primarily targets bottle-shaped objects; generalization requires further verification
- Simple Environment: Experimental environment relatively simple; performance in complex scenarios unknown
- Theoretical Analysis: Lacks in-depth analysis of theoretical foundations and failure cases
- Academic Contribution: Provides a new technical pathway for quadruped robot loco-manipulation
- Practical Value: Offers a reference for industrial applications and service robot development
- Reproducibility: Provides a GitHub repository that facilitates reproduction and extension of the research
- Interdisciplinary Impact: Combines robotics, computer vision, and deep learning
- Industrial Automation: Material handling and manipulation in complex environments
- Search and Rescue: Object recognition and rescue operations at disaster sites
- Service Robots: Object manipulation in home and office environments
- Research Platform: Development and validation platform for loco-manipulation algorithms
The paper cites 14 relevant references covering key works in loco-manipulation, quadruped robots, and deep learning-based grasping, providing a solid theoretical foundation for the research.
Overall Assessment: This is an application-oriented research paper with a clear technical approach and a complete implementation. While it has some limitations in theoretical innovation and comprehensive evaluation, its complete system implementation and physical robot validation provide valuable contributions to quadruped robot loco-manipulation research. This work establishes a solid foundation for subsequent research, particularly in sim-to-real transfer and multimodal perception fusion.