Recent advances in machine learning force fields (MLFFs) are revolutionizing molecular simulations by bridging the gap between quantum-mechanical (QM) accuracy and the computational efficiency of mechanistic potentials. However, the development of reliable MLFFs for biomolecular systems remains constrained by the scarcity of high-quality, chemically diverse QM datasets that span all of the major classes of biomolecules expressed in living cells. Crucially, such a comprehensive dataset must be computed using non-empirical or minimally empirical approximations to solving the Schrödinger equation. To address these limitations, we introduce the QCell dataset -- a curated collection of 525k new QM calculations for biomolecular fragments encompassing carbohydrates, nucleic acids, lipids, dimers, and ion clusters. QCell complements existing datasets, bringing the total number of available data points to 41 million molecular systems, all calculated using hybrid density functional theory with nonlocal many-body dispersion interactions, as captured by the PBE0+MBD(-NL) level of quantum mechanics. The QCell dataset therefore provides a valuable resource for training next-generation MLFFs capable of modeling the intricate interactions that govern biomolecular dynamics beyond small molecules and proteins.
- Paper ID: 2510.09939
- Title: QCell: Comprehensive Quantum-Mechanical Dataset Spanning Diverse Biomolecular Fragments
- Authors: Adil Kabylda, Sergio Suárez-Dou, Nils Davoine, Florian N. Brünig, Alexandre Tkatchenko
- Classification: physics.chem-ph
- Publication Date: October 11, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.09939
Recent advances in machine learning force fields (MLFFs) are transforming molecular simulation by bridging the gap between quantum-mechanical accuracy and the computational efficiency of classical potentials. However, the development of reliable MLFFs for biomolecular systems remains constrained by the scarcity of high-quality, chemically diverse quantum-mechanical datasets that cover all major classes of biomolecules expressed in living cells. Critically, such a comprehensive dataset must be computed using non-empirical or minimally empirical approximations to the solution of the Schrödinger equation. To address these limitations, the authors introduce the QCell dataset, a curated collection of approximately 525,000 new quantum-mechanical calculations spanning biomolecular fragments of carbohydrates, nucleic acids, lipids, dimers, and ion clusters. QCell complements existing datasets, bringing the total number of available data points to 41 million molecular systems, all computed using hybrid density functional theory with nonlocal many-body dispersion interactions at the PBE0+MBD(-NL) level of theory.
- Core Issue: Existing quantum mechanical datasets primarily cover small molecules and proteins, with significant gaps in three major biomolecular categories—nucleic acids, lipids, and carbohydrates—which collectively constitute approximately 40% of cellular biomass.
- Significance:
  - Biomolecular chemical space has a distinctive character: its complexity arises primarily from the conformational space of a relatively small set of repeated chemical building blocks
  - Accurate modeling of biomolecular interactions is essential for computational chemistry and biophysics
  - MLFFs require diverse, high-quality QM datasets to faithfully represent the chemical space encountered in biomolecular systems
- Limitations of Existing Approaches:
  - Traditional QM methods offer high accuracy but low computational efficiency
  - Empirical force fields provide high efficiency but limited accuracy
  - Existing datasets such as GEMS, QCML, and OMol25 represent progress but still leave significant gaps in the three major biomolecular categories
- Research Motivation:
  - Fill gaps in biomolecular datasets
  - Employ a consistent, non-empirical quantum-mechanical level of theory
  - Provide comprehensive training resources for next-generation MLFFs
- Constructed the QCell Dataset: Contains 525,881 new QM calculations of biomolecular fragments, spanning nucleic acids, lipids, carbohydrates, ions/water, and non-covalent dimers
- Extended Data Coverage: Combined with existing datasets, total data points reach 41 million molecular systems, encompassing 82 chemical elements
- Unified Theoretical Level: All calculations employ the PBE0+MBD(-NL) level, ensuring data consistency
- Deep Conformational Sampling: Focuses on conformational diversity in biologically relevant chemical environments
- Technical Validation: Dataset quality verified through structural analysis and machine learning force field training
The QCell dataset employs a five-step workflow:
- Building Block Library Management and Initial 3D Structure Generation
- Extensive Conformational Sampling (molecular dynamics or dedicated conformational generation tools)
- Representative Fragment Selection
- Pre-optimization with the DFTB+MBD Method
- High-Quality PBE0+MBD(-NL) Quantum Mechanical Calculations
- Nucleic Acids:
  - Used Nucleic Acid Builder to construct solvated DNA heptamer double helices (A-, B-, and Z-DNA forms)
  - Ran molecular dynamics simulations with the OL21 force field
  - Extracted the central double-stranded trinucleotide fragments from the heptamer trajectories
  - Included DNA base-pair dimers and gas-phase RNA fragments
- Lipids:
  - Generated phospholipid membrane structures using CHARMM-GUI Membrane Builder
  - Covered the phospholipids POPC, POPE, POPG, and POPS, as well as cholesterol
  - Ran 500 ns production simulations with the Lipid21 force field
  - Selected fatty acid monomers, dimers, and trimers based on geometric proximity
- Carbohydrates:
  - Constructed a library of 52 common monosaccharides, including the α/β anomeric configurations of pentoses and hexoses
  - Built disaccharides and sugar-peptide linkages using PyMOL
  - Generated conformations using the CREST program with a maximum energy threshold of 12 kcal/mol
  - Clustered conformers by the dihedral angles of the linkage and selected representative conformations
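The energy threshold and dihedral-based clustering described above can be illustrated with a minimal sketch. It assumes each conformer is already available as a pair of (relative energy in kcal/mol, linkage dihedrals in degrees). The greedy clustering rule, the 30-degree tolerance, and all function names are illustrative assumptions, not the authors' procedure, and CREST itself is not called here.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (energy in kcal/mol, linkage dihedrals in degrees)
Conformer = Tuple[float, Tuple[float, ...]]


def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (periodic)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def select_representatives(
    conformers: Sequence[Conformer],
    window: float = 12.0,  # energy window from the text, kcal/mol
    tol: float = 30.0,     # assumed dihedral tolerance, degrees
) -> List[Conformer]:
    """Keep conformers within `window` of the minimum, then pick one
    representative (the lowest-energy member) per dihedral cluster."""
    e_min = min(energy for energy, _ in conformers)
    kept = sorted(
        (c for c in conformers if c[0] - e_min <= window),
        key=lambda c: c[0],
    )
    reps: List[Conformer] = []
    for energy, dihedrals in kept:
        # A conformer starts a new cluster if at least one dihedral differs
        # by more than `tol` from every representative chosen so far.
        is_new = all(
            max(angular_diff(x, y) for x, y in zip(dihedrals, rep[1])) > tol
            for rep in reps
        )
        if is_new:
            reps.append((energy, dihedrals))
    return reps


conformers = [
    (0.0, (60.0, 180.0)),
    (1.0, (65.0, -178.0)),   # same basin as the first (periodic wrap)
    (3.0, (-60.0, 180.0)),   # different rotamer
    (15.0, (180.0, 60.0)),   # outside the 12 kcal/mol window
]
representatives = select_representatives(conformers)
```

Because the conformers are visited in order of increasing energy, each cluster is represented by its lowest-energy member; a different choice (for example, the cluster centroid) would be equally plausible.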
- Ions/Water:
  - Prepared solvated ionic systems with the ion centered in a water box
  - Applied the MB-pol force field for monovalent ions and the AMBER force field for divalent ions
  - Captured solvation effects at varying hydration levels (1-100 water molecules)
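One way to obtain clusters at different hydration levels from a solvated snapshot is to keep the waters closest to the ion. The sketch below is an assumption about how such carving could be done, not the authors' implementation; the `Water` layout, the ranking by ion-oxygen distance, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]
Water = Tuple[Vec, Vec, Vec]  # hypothetical layout: (O, H1, H2) positions in Å


def carve_cluster(ion: Vec, waters: Sequence[Water], n_waters: int) -> List[Water]:
    """Return the `n_waters` molecules whose oxygen atom is closest to the ion."""
    if not 0 <= n_waters <= len(waters):
        raise ValueError("n_waters must be between 0 and the number of waters")
    ranked = sorted(waters, key=lambda w: math.dist(ion, w[0]))
    return ranked[:n_waters]


def hydration_series(
    ion: Vec, waters: Sequence[Water], sizes: Sequence[int]
) -> Dict[int, List[Water]]:
    """Carve one cluster per requested hydration level from the same snapshot."""
    return {n: carve_cluster(ion, waters, n) for n in sizes}


ion = (0.0, 0.0, 0.0)
waters = [
    ((3.0, 0.0, 0.0), (3.5, 0.8, 0.0), (3.5, -0.8, 0.0)),
    ((2.0, 0.0, 0.0), (2.5, 0.8, 0.0), (2.5, -0.8, 0.0)),
    ((5.0, 0.0, 0.0), (5.5, 0.8, 0.0), (5.5, -0.8, 0.0)),
]
series = hydration_series(ion, waters, sizes=(1, 3))
```

Whole molecules are kept or dropped together, so no water is ever split across the cluster boundary.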
- Theoretical Level: PBE0+MBD(-NL) - non-empirical hybrid functional with many-body dispersion treatment
- Software: FHI-aims code
- Basis Set: "Tight" basis set for small molecules, "intermediate" basis set for molecules >350 atoms
- Convergence Criteria: Total energy 10^-5 eV, eigenvalue sum 10^-3 eV, charge density 10^-5 electrons/ų, forces 10^-4 eV/Å
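The settings above can be collected in a small helper, shown here as a sketch. The class and function names are hypothetical and are not FHI-aims keywords; treating exactly 350 atoms as "tight" is an assumption, since the text only says the "intermediate" basis is used above 350 atoms.

```python
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ScfThresholds:
    """Convergence thresholds as listed in the text."""
    total_energy_ev: float = 1e-5       # eV
    eigenvalue_sum_ev: float = 1e-3     # eV
    charge_density: float = 1e-5        # electrons/Å^3
    forces_ev_per_ang: float = 1e-4     # eV/Å


def choose_basis(n_atoms: int, large_system_cutoff: int = 350) -> str:
    """Return the basis-set tier for a fragment of `n_atoms` atoms."""
    return "intermediate" if n_atoms > large_system_cutoff else "tight"


def is_converged(
    changes: Dict[str, float], thresholds: ScfThresholds = ScfThresholds()
) -> bool:
    """True if every monitored change between SCF iterations is below its limit."""
    limits = asdict(thresholds)
    return all(abs(changes[name]) < limit for name, limit in limits.items())


last_step = {
    "total_energy_ev": 4e-6,
    "eigenvalue_sum_ev": 2e-4,
    "charge_density": 3e-6,
    "forces_ev_per_ang": 5e-5,
}
converged = is_converged(last_step)
```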
| Category | Count | Atom Count | Elements | Theoretical Level |
|---|---|---|---|---|
| Nucleic Acids | 34,838 | 14-382 | H,C,N,O,Na,Mg,S,P | PBE0+MBD-NL |
| Lipids | 16,000 | 125-402 | H,C,N,O,P | PBE0+MBD |
| Carbohydrates | 74,087 | 35-75 | H,C,N,O | PBE0+MBD |
| Ions/Water | 30,000 | 4-303 | H,O,Na,Cl,K,Mg,Ca | PBE0+MBD-NL |
| Non-covalent Dimers | 370,956 | 2-34 | 20 elements | PBE0+MBD-NL |
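As a quick consistency check, the per-category counts in the table can be summed and compared with the 525,881 new calculations stated for the dataset. The snippet only uses numbers from the table; the variable names are arbitrary.

```python
# Per-category counts copied from the table above.
counts = {
    "Nucleic Acids": 34_838,
    "Lipids": 16_000,
    "Carbohydrates": 74_087,
    "Ions/Water": 30_000,
    "Non-covalent Dimers": 370_956,
}

total = sum(counts.values())

# Share of each category in percent of the total.
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:22s} {n:8d}  {shares[name]:5.1f}%")
print(f"{'Total':22s} {total:8d}")
```

The total is 525,881, and non-covalent dimers account for roughly 70% of the new calculations, so per-subset metrics are more informative than a single pooled average.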
- Structural geometric descriptor validation
- Mean absolute error (MAE) of forces in machine learning force fields
- Radial distribution function comparison with experimental reference values
To assess dataset quality, MLFFs were trained using the SO3LR architecture:
- Three model sizes: small, medium, large
- Combined loss function: forces, dipole moments, Hirshfeld ratios, energies (weights 100:10:10:1)
- 10 Å long-range cutoff, trained on A100 GPU for 180 hours
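The 100:10:10:1 weighting of forces, dipole moments, Hirshfeld ratios, and energies can be written as a weighted sum of per-property errors. The sketch below assumes a mean-squared error for each term and flat lists of values; the source gives only the weights, so the error metric, the dictionary keys, and the function names are assumptions rather than the SO3LR training code.

```python
from typing import Dict, Sequence

# Weights from the text, in the order forces : dipoles : Hirshfeld ratios : energies.
WEIGHTS: Dict[str, float] = {
    "forces": 100.0,
    "dipole": 10.0,
    "hirshfeld": 10.0,
    "energy": 1.0,
}


def mse(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean-squared error between two equally long sequences (assumed metric)."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def combined_loss(
    pred: Dict[str, Sequence[float]],
    ref: Dict[str, Sequence[float]],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weighted sum of per-property MSE terms."""
    return sum(w * mse(pred[key], ref[key]) for key, w in weights.items())


pred = {"forces": [1.0, 0.0], "dipole": [0.1], "hirshfeld": [0.9], "energy": [2.0]}
ref = {"forces": [0.0, 0.0], "dipole": [0.0], "hirshfeld": [0.9], "energy": [0.0]}
loss = combined_loss(pred, ref)  # 100*0.5 + 10*0.01 + 10*0 + 1*4
```

Weighting forces most heavily is a common choice for force fields, since each structure contributes many force components but only one energy.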
- Nucleic Acids: Phosphate-phosphate distances and backbone bending angle distributions of DNA fragments reproduced expected values for A-, B-, Z-DNA
- Lipids: Radius of gyration distributions of fatty acid fragments reasonably reflected chain extension and packing
- Carbohydrates: N/O-glycosidic dihedral angles spanned the full conformational space, reproducing all major rotamers
- Ions/Water: Radial distribution functions matched experimental hydration distances, with accurate peak positions for monovalent ion-oxygen and O-O
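The ion-oxygen comparison above relies on locating peaks in a radial distribution. A minimal sketch for isolated clusters is given below: it histograms ion-oxygen distances over frames and divides by the shell volume. It is not normalized by a bulk density, so the result is proportional to g(r) rather than equal to it, which is sufficient for reading off peak positions. The function names, bin settings, and input layout are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def radial_histogram(
    ion_positions: Sequence[Vec],
    oxygen_positions: Sequence[Sequence[Vec]],
    r_max: float = 6.0,
    n_bins: int = 60,
) -> Tuple[List[float], List[float]]:
    """Shell-volume-normalized histogram of ion-oxygen distances (one entry per frame)."""
    dr = r_max / n_bins
    counts = [0] * n_bins
    for ion, oxygens in zip(ion_positions, oxygen_positions):
        for oxygen in oxygens:
            r = math.dist(ion, oxygen)
            if r < r_max:
                counts[min(int(r / dr), n_bins - 1)] += 1
    n_frames = len(ion_positions)
    centers, density = [], []
    for i, count in enumerate(counts):
        r_lo, r_hi = i * dr, (i + 1) * dr
        shell_volume = 4.0 / 3.0 * math.pi * (r_hi**3 - r_lo**3)
        centers.append(r_lo + dr / 2.0)
        density.append(count / (n_frames * shell_volume))
    return centers, density


def highest_peak(centers: Sequence[float], density: Sequence[float]) -> float:
    """Bin center with the largest normalized density."""
    i_max = max(range(len(density)), key=lambda i: density[i])
    return centers[i_max]


# Two toy frames with the ion at the origin (coordinates in Å, made up for illustration).
ions = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
oxygens = [
    [(2.35, 0.0, 0.0), (0.0, 4.5, 0.0)],
    [(0.0, 0.0, 2.38)],
]
centers, density = radial_histogram(ions, oxygens)
peak = highest_peak(centers, density)
```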
Force MAE results for different dataset subsets:
- Nucleic Acids: ~0.8 kcal/mol/Å (large model)
- Lipids: ~0.6 kcal/mol/Å (large model)
- Carbohydrates: ~0.5 kcal/mol/Å (large model)
- Ions/Water: ~0.7 kcal/mol/Å (large model)
- DES370k: ~0.8 kcal/mol/Å (large model)
Errors systematically decreased with model capacity, with most subsets achieving below 1 kcal/mol/Å, demonstrating dataset internal consistency and the generalization capability of modern MLFFs across chemically diverse systems.
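For reference, a force MAE in kcal/mol/Å can be computed from predicted and reference forces as sketched below. The sketch assumes forces are stored in eV/Å and that the MAE is averaged over all Cartesian components; the source does not state the averaging convention, so this is an assumption, and the function name is hypothetical.

```python
from typing import Sequence, Tuple

Vec = Tuple[float, float, float]

EV_TO_KCAL_PER_MOL = 23.060548  # 1 eV expressed in kcal/mol


def force_mae(pred: Sequence[Vec], ref: Sequence[Vec]) -> float:
    """Mean absolute error over all Cartesian force components (input units preserved)."""
    diffs = [
        abs(p - r)
        for f_pred, f_ref in zip(pred, ref)
        for p, r in zip(f_pred, f_ref)
    ]
    return sum(diffs) / len(diffs)


# Toy example: two atoms, forces in eV/Å.
predicted = [(0.01, 0.0, 0.0), (0.0, 0.02, 0.0)]
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]

mae_ev = force_mae(predicted, reference)   # eV/Å
mae_kcal = mae_ev * EV_TO_KCAL_PER_MOL     # kcal/mol/Å
```

On this scale, the 1 kcal/mol/Å level mentioned above corresponds to roughly 0.043 eV/Å.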
- QM7-X: Small organic molecules, 4.19 million data points
- MD22: Molecular dynamics trajectories
- GEMS: Hierarchical protein fragmentation strategy
- SPICE: Drug-like molecules and peptides
- QCML: Systematic mapping of small molecule chemical space
- OMol25: Chemically heterogeneous ensemble
- First systematic coverage of three major biomolecular categories: nucleic acids, lipids, and carbohydrates
- Unified non-empirical theoretical level ensures data consistency
- Deep conformational sampling focuses on biologically relevant chemical environments
- Direct compatibility with existing datasets computed at the same level of theory, enabling unified training
- The QCell dataset successfully fills important gaps in biomolecular QM data
- The unified PBE0+MBD(-NL) theoretical level ensures compatibility with existing datasets
- Structural validation confirms the chemical reasonableness and diversity of the dataset
- Machine learning validation yields force MAEs below 1 kcal/mol/Å for most subsets, with errors decreasing as model capacity increases
- Radial distribution functions for divalent ions show slight deviations from experimental values
- Fragment size limited to 402 atoms
- Primarily focuses on biologically relevant elements with relatively limited elemental diversity
- Balance between gas-phase and solution-phase environments requires further optimization
- Extension to larger biomolecular fragments
- Inclusion of additional solvent effects and environmental conditions
- Further validation and calibration against experimental data
- Development of new MLFF architectures specifically designed for biomolecules
- Fills Important Gaps: First systematic solution to data insufficiency in nucleic acids, lipids, and carbohydrates
- Rigorous Methodology: Employs non-empirical quantum mechanical methods with solid theoretical foundations
- High Data Quality: Multiple validations ensure structural and energetic reasonableness
- High Practical Value: Compatible with existing datasets, directly applicable to MLFF training
- Open Access: Publicly available dataset promotes field development
- Computational Cost: PBE0+MBD(-NL) calculations are computationally expensive, limiting dataset expansion
- Fragment Constraints: The maximum fragment size of 402 atoms may not adequately capture long-range interactions
- Environmental Simplification: Primarily considers gas-phase and simple solvation, insufficient for complex biological environments
- Limited Validation: Lacks direct comparison with high-accuracy methods (e.g., CCSD(T))
- Academic Contribution: Provides important data foundation for biomolecular MLFF development
- Practical Value: Directly applicable to drug design, biomolecular simulation, and related fields
- Reproducibility: Detailed methodology and open data ensure reproducibility
- Field Advancement: May catalyze the development of new biomolecular modeling methods
- Biomolecular MLFF Training: Direct application to training universal force fields covering multiple biomolecular types
- Drug Design: Provides data for protein-ligand and DNA-drug interaction modeling
- Membrane Biology: Lipid data applicable to membrane protein and membrane interaction research
- Glycobiology: Carbohydrate data supports glycoprotein and glycolipid research
- Method Development: Provides benchmark testing data for new quantum chemistry methods and MLFF architectures
This paper cites 58 important references, encompassing key works in quantum chemistry methods, machine learning force fields, biomolecular simulation, and related datasets, providing solid theoretical foundations and technical support for the research.