Generative Deep Learning Framework for Inverse Design of Fuels

Yalamanchi, Pal, Mohan et al.

In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model can be adapted to different target properties, enabling systematic exploration of large chemical spaces relevant to fuel design applications. Furthermore, the demonstrated framework can be readily extended by incorporating additional synthesizability criteria to improve applicability and reliability for de novo design of new fuels.

Basic Information

Paper ID: 2504.12075
Title: Generative Deep Learning Framework for Inverse Design of Fuels
Authors: Kiran K. Yalamanchi, Pinaki Pal, Balaji Mohan, Abdullah S. AlRamadan, Jihad A. Badra, Yuanjiang Pei
Classification: cs.LG physics.chem-ph
Publication Date: October 13, 2025 (arXiv v3)
Paper Link: https://arxiv.org/abs/2504.12075v3

Abstract

This study develops a generative deep learning framework that combines a co-optimized variational autoencoder (Co-VAE) with quantitative structure-property relationship (QSPR) techniques for the inverse design of fuels. The Co-VAE couples a property prediction component to the VAE latent space, improving both molecular reconstruction and research octane number (RON) estimation. A subset of the GDB-13 database, enriched with a curated RON database, is used for training. Hyperparameter tuning balances reconstruction fidelity, chemical validity, and RON prediction accuracy. An independent regression model then refines the RON prediction, and a differential evolution algorithm searches the VAE latent space for candidate fuel molecules with high RON.

Research Background and Motivation

Problem Definition

Advances in modern automotive technology and implementation of stringent environmental regulations create an urgent need for innovative fuels with the following characteristics:

High anti-knock performance to support advanced engine operation
Clean combustion properties to reduce emissions
Efficient engine performance

Problem Significance

Traditional fuel development methods heavily rely on experimental trial-and-error and expert intuition, an approach that is not only time-consuming but also fails to adequately explore the vast chemical space of potential fuel molecules. Given the complexity of chemical space and experimental costs, data-driven approaches are needed to accelerate fuel discovery and optimization.

Limitations of Existing Methods

QSPR Method Limitations: QSPR models can predict properties of known structures but cannot generate new molecular candidates; they typically rely on limited datasets and hand-crafted features and may fail to generalize across broad chemical spaces
Traditional Generative Models: Lack targeted optimization for specific fuel properties
Decoupled Approaches: Generation and prediction modules are trained independently, lacking synergistic optimization

Research Motivation

Building on the successful application of generative deep learning in drug molecule design, researchers have begun applying these methods to fuel molecule design. This study aims to develop an integrated generation-prediction framework capable of efficiently navigating chemical space to identify molecules with desired fuel properties.

Core Contributions

Proposed Co-VAE Architecture: Directly integrates property prediction components into the VAE, enabling joint optimization of molecular reconstruction and RON prediction
Developed Modular Framework: After joint training, an independent regression model is fitted on the Co-VAE latent embeddings, so the RON predictor can be trained and tuned separately from the generator, improving robustness and performance
Constructed Comprehensive Dataset: Combines GDB-13 database subset with carefully curated RON database, covering 357,907 molecules
Implemented Efficient Screening Strategy: Uses differential evolution algorithm to search for high-RON molecules in latent space, generating 921 novel high-performance fuel candidates
Established Complete Validation Pipeline: Includes chemical validity checks and property prediction consistency verification

Methodology Details

Task Definition

Input: SMILES representation of molecules (one-hot encoded)
Output: Novel fuel molecules with high research octane number (RON > 110)
Constraints:

Molecules must be chemically valid
Contain only C, H, O atoms
Maximum 10 heavy atoms
Maximum 2 ring structures
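
The element, size, and ring limits above can be illustrated with a small screening function. This is a rough sketch only: it assumes simple SMILES strings with implicit hydrogens and single-digit ring closures, the name `passes_constraints` is hypothetical, and it does not check chemical validity (valence), which in practice would be delegated to a cheminformatics toolkit such as RDKit.

```python
# Crude, toolkit-free screen for the stated constraints (illustrative only).
# Assumption: simple SMILES with implicit H, no brackets, no %nn ring closures.
ALLOWED_CHARS = set("COco0123456789()=#")

def passes_constraints(smiles: str, max_heavy: int = 10, max_rings: int = 2) -> bool:
    """Return True if a simple SMILES string meets the element, size and ring limits."""
    if not smiles or any(ch not in ALLOWED_CHARS for ch in smiles):
        return False  # rejects N, S, halogens, brackets, charges, etc.
    heavy_atoms = sum(ch in "COco" for ch in smiles)
    # Each ring closure uses one digit twice, so pairs of digits count rings.
    ring_digits = sum(ch.isdigit() for ch in smiles)
    if ring_digits % 2:
        return False  # unmatched ring-closure digit
    return heavy_atoms <= max_heavy and ring_digits // 2 <= max_rings

candidates = ["CC(C)(C)OC", "CCN", "C1CC1C2CC2C3CC3", "CCCCCCCCCCC", "c1ccccc1C"]
kept = [s for s in candidates if passes_constraints(s)]
print(kept)
```

Here the nitrogen-containing string, the three-ring string, and the 11-carbon chain are rejected, while the branched ether and the single-ring aromatic pass.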

Model Architecture

Co-VAE Architecture

Co-VAE extends the standard VAE with three main components:

Encoder: Bidirectional LSTM network processes one-hot encoded SMILES strings, generating mean and log-variance of latent space through fully connected layers
Decoder: Reconstructs molecular structure from latent variables using fully connected layers and LSTM networks
Property Predictor: Feedforward neural network predicting RON values from the latent space mean
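
To make the data flow concrete, the sketch below shows character-level one-hot encoding of a SMILES string and the standard reparameterized sampling of a latent vector from a mean and log-variance. It is a minimal illustration, not the paper's code: the helper names, the padding convention (index 0), and the character-level tokenization are assumptions, and the LSTM and fully connected layers themselves are omitted.

```python
import math
import random

def build_vocab(smiles_list):
    """Map every character seen in the data to an index; index 0 is reserved for padding."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def one_hot(smiles, vocab, max_len):
    """Encode a SMILES string as a max_len x (len(vocab) + 1) matrix of 0/1.

    Strings longer than max_len are truncated; shorter ones are padded.
    """
    width = len(vocab) + 1
    rows = []
    for pos in range(max_len):
        idx = vocab[smiles[pos]] if pos < len(smiles) else 0
        rows.append([1 if j == idx else 0 for j in range(width)])
    return rows

def sample_latent(mu, logvar, rng):
    """Reparameterization: z = mu + exp(0.5 * logvar) * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

data = ["CCO", "CC(C)C"]
vocab = build_vocab(data)               # {'(': 1, ')': 2, 'C': 3, 'O': 4}
x = one_hot("CCO", vocab, max_len=6)    # 6 rows, one active entry per row
z = sample_latent([0.0, 1.0], [-50.0, -50.0], random.Random(0))
```

With a very negative log-variance the sampled `z` is essentially the mean, which is consistent with the summary's statement that the property predictor reads the latent mean rather than a noisy sample.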

Loss Function

Loss = BCE + β × KLD + L_RON

Where:

BCE: Binary cross-entropy reconstruction loss
KLD: Kullback-Leibler divergence regularization term
L_RON: Mean absolute error of RON prediction
β: Balancing parameter, gradually increased from 0 to 0.25 (over 75 epochs)
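
The loss terms above can be sketched for a single sample as follows. Several details are assumptions made for illustration: the β ramp is taken to be linear, the reconstruction loss is summed over one-hot entries, and the RON term is applied to one labelled molecule. How the paper treats molecules without a RON label is not stated in this summary.

```python
import math

def beta_schedule(epoch, beta_max=0.25, warmup_epochs=75):
    """Ramp beta from 0 to beta_max over warmup_epochs (linear shape is an assumption)."""
    return beta_max * min(1.0, epoch / warmup_epochs)

def bce(targets, probs, eps=1e-7):
    """Binary cross-entropy summed over all one-hot entries."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total

def kld(mu, logvar):
    """KL divergence between N(mu, exp(logvar)) and a standard normal prior."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def co_vae_loss(targets, probs, mu, logvar, ron_true, ron_pred, epoch):
    """Loss = BCE + beta * KLD + L_RON, with L_RON as the absolute RON error."""
    l_ron = abs(ron_true - ron_pred)
    return bce(targets, probs) + beta_schedule(epoch) * kld(mu, logvar) + l_ron

# At epoch 0 the KL term is switched off, so only BCE and the RON error contribute.
loss_start = co_vae_loss([1, 0], [0.9, 0.1], [1.0], [0.0], 100.0, 95.0, epoch=0)
loss_late = co_vae_loss([1, 0], [0.9, 0.1], [1.0], [0.0], 100.0, 95.0, epoch=75)
```

Starting with β = 0 lets the decoder learn to reconstruct before the KL penalty pulls the posterior toward the prior, which is the usual rationale for annealing.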

Regression Model Optimization

Independent regression models trained on latent space embeddings:

Evaluated 13 different algorithms (XGBoost, CatBoost, LightGBM, etc.)
Hyperparameter tuning via NSGA-II multi-objective optimization
CatBoost showed best performance: R² = 0.929, MAE = 5.365, RMSE = 8.090

Technical Innovations

Joint Optimization Strategy: Co-VAE simultaneously optimizes molecular reconstruction and property prediction, enabling the latent space to learn features meaningful for RON prediction
Modular Design: Separates generation and prediction components, allowing use of more sophisticated regression algorithms and optimization strategies
Progressive β Annealing: Avoids posterior collapse, balancing reconstruction fidelity and latent space regularization
Dual Validation Mechanism: Ensures both chemical validity of generated molecules and consistency of property predictions

Experimental Setup

Dataset

GDB-13 Subset:

Original data: GDB-13 enumerates roughly 977 million small molecules (≤13 heavy atoms)
Filtering criteria: C, H, O atoms only, ≤10 heavy atoms, ≤2 rings
Final scale: 357,907 molecules

RON Dataset:

Source: Literature ASTM standard RON values
Scale: 332 molecules with RON values
Data split: Training set (the remaining 80%), validation set (10%), test set (10%)
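
As an illustration of the split proportions, the sketch below partitions 332 indices into training, validation, and test sets. The use of a plain random shuffle, the seed, and the function name `split_indices` are assumptions; the summary does not say whether the split was random or stratified.

```python
import random

def split_indices(n, val_frac=0.10, test_frac=0.10, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = round(n * val_frac)
    n_test = round(n * test_frac)
    val = idx[:n_val]
    test = idx[n_val:n_val + n_test]
    train = idx[n_val + n_test:]
    return train, val, test

train, val, test = split_indices(332)
print(len(train), len(val), len(test))
```

For 332 molecules this leaves 266 for training and 33 each for validation and testing.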

Evaluation Metrics

Reconstruction Accuracy: Accuracy of SMILES string reconstruction
Chemical Validity: Proportion of generated molecules validated by RDKit
RON Prediction Performance: MAE, RMSE, R²
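
For reference, the three regression metrics can be computed as in the following sketch. The function name and the toy RON values are illustrative placeholders, not data from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and the coefficient of determination (R^2)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Toy values only: errors are -2, 0 and +4 RON units.
m = regression_metrics([90.0, 100.0, 110.0], [92.0, 100.0, 106.0])
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and grows faster when a few predictions are far off, which is why the tables in this summary report both.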

Baseline Methods

Evaluated 13 regression algorithms:

Ensemble methods: XGBoost, CatBoost, LightGBM, RandomForest
Linear methods: LinearRegression, Ridge, Lasso, ElasticNet
Others: SVR, KNeighbors, DecisionTree, TabNet, AutoTS

Implementation Details

Hyperparameter Optimization: Bayesian optimization (bayes_opt package)
Search Budget: 16 random initial evaluations followed by 40 sequential Bayesian optimization steps
Validation Method: 10-fold cross-validation
Search Algorithm: Differential evolution (SciPy implementation)
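
The latent-space search can be pictured with a small differential evolution loop. The paper is reported to use the SciPy implementation; the code below is instead a self-contained, pure-Python DE/rand/1/bin sketch written for illustration. The objective `toy_predicted_ron` is a hypothetical stand-in for "decode the latent vector and predict RON", and the population size, mutation factor, and crossover rate are arbitrary choices.

```python
import random

def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=100, seed=0):
    """Minimise objective over box bounds with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals other than the current one.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def toy_predicted_ron(z):
    """Hypothetical surrogate: predicted RON peaks at 120 for z = (1, -2, 0.5)."""
    target = (1.0, -2.0, 0.5)
    return 120.0 - sum((zi - ti) ** 2 for zi, ti in zip(z, target))

best_z, best_score = differential_evolution(
    lambda z: -toy_predicted_ron(z),  # maximise RON by minimising its negative
    bounds=[(-3.0, 3.0)] * 3,
)
```

Differential evolution needs only objective values, not gradients, which suits a pipeline where the objective passes through a decoder, a validity check, and a tree-based regressor.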

Experimental Results

Main Results

Co-VAE Performance (Optimal Configuration)

Reconstruction Accuracy: 77.56%
Chemical Validity: 55.19%
RON MAE: 9.26

Regression Model Performance Ranking

Model	MAE	RMSE	R²
CatBoost	5.365	8.090	0.929
XGBoost	6.513	10.496	0.880
LightGBM	6.959	10.556	0.878
RandomForest	7.310	10.689	0.872

Final CatBoost Model (10-fold Cross-Validation)

R² = 0.869 ± 0.102
MAE = 4.935 ± 1.041
RMSE = 7.879 ± 2.964
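
The "mean ± standard deviation" figures come from repeating the fit over ten folds. The sketch below shows one way to generate the fold indices and summarize per-fold scores. It is an assumption-laden illustration: the data are not shuffled here, the sample count of 332 is used only as an example, and whether the paper reports a sample or population standard deviation is not stated in this summary.

```python
import statistics

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    idx = list(range(n))
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size

def summarize(scores):
    """Return the mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)

folds = list(k_fold_indices(332, k=10))
mean_score, std_score = summarize([0.8, 0.9, 1.0])  # toy per-fold scores
```

With only about 33 test molecules per fold, a few hard molecules can move a fold's score noticeably, which is consistent with the relatively wide spread reported for R² and RMSE.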

Molecular Generation Results

Total Generated: 1,189 unique valid SMILES
Unique Molecules: 1,185 chemical entities
Novel Molecules: 921 molecules not present in training set
Target Performance: All generated molecules have predicted RON > 110
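
The funnel from generated strings to unique and then novel molecules can be sketched as a filtering step. Everything named here is hypothetical: `canonicalize` stands in for a real canonical-SMILES routine (for example from RDKit), `toy_canonicalize` only handles unbranched, ring-free strings, and the RON numbers are made-up placeholders rather than measured or predicted values from the paper.

```python
def filter_candidates(generated, training_set, canonicalize, predict_ron, threshold=110.0):
    """Return (unique valid molecules, novel molecules above the RON threshold)."""
    training_canon = {canonicalize(s) for s in training_set}
    seen = set()
    novel = []
    for smiles in generated:
        canon = canonicalize(smiles)
        if canon is None or canon in seen:
            continue  # invalid string, or a duplicate of an earlier molecule
        seen.add(canon)
        if canon not in training_canon and predict_ron(canon) > threshold:
            novel.append(canon)
    return seen, novel

def toy_canonicalize(smiles):
    """Toy canonical form: valid only for linear strings made of C, O and '='."""
    if not smiles or any(ch not in "CO=" for ch in smiles):
        return None
    return min(smiles, smiles[::-1])  # a chain read backwards is the same molecule

placeholder_ron = {"CCO": 111.0, "CCOC": 115.0, "CCCO": 104.0}  # made-up numbers

unique, novel = filter_candidates(
    generated=["CCO", "OCC", "CCOC", "C(C", "CCCO"],
    training_set=["OCC"],
    canonicalize=toy_canonicalize,
    predict_ron=placeholder_ron.__getitem__,
)
```

In this toy run two strings collapse to one molecule, one string is invalid, one molecule is already in the training set, and one falls below the threshold, leaving a single novel candidate.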

Ablation Studies

Hyperparameter optimization (rather than component-wise ablation) selected the following settings:

LSTM layers: 2 layers optimal
Hidden layer size: 151 optimal
Latent space dimensionality: 73 optimal
Effectiveness of β annealing strategy verified

Case Analysis

Main characteristics of generated high-RON molecules:

Rich branched structures
Containing alcohol, ether, and aldehyde functional groups
Carbon atom distribution: 4-10 atoms
Oxygen atom distribution: 0-4 atoms

Experimental Findings

Structure-Property Relationships: Branching degree and oxygen-containing functional groups positively correlate with high RON
Model Generalization: Successfully generates valid high-performance molecules outside training set
Search Efficiency: Differential evolution algorithm effectively navigates 73-dimensional latent space

Related Work

Generative Molecular Design

Application of VAE, GAN, and reinforcement learning in drug design
Liu et al.'s multi-objective imitation learning framework for fuel design
Rittig et al.'s graph machine learning approach for high-octane fuel design

QSPR Methods

Traditional group contribution methods
vom Lehn et al.'s machine learning QSPR models
Chen et al.'s large-scale fuel candidate screening

Integrated Methods

Liu et al.'s VAE co-optimization architecture
Advantages of this study's modular design compared to integrated approaches

Conclusions and Discussion

Main Conclusions

Co-VAE successfully co-optimizes generation and prediction tasks, learning latent representations meaningful for RON prediction
Modular design allows use of advanced regression algorithms, significantly improving prediction accuracy
Differential evolution search strategy effectively identifies high-performance fuel candidates
Framework demonstrates good scalability, adaptable to different target properties

Limitations

Imbalanced Data Scale: RON dataset significantly smaller than GDB-13 subset
Chemical Space Constraints: Only considers C, H, O atoms, excluding other important fuel components
Single-Property Optimization: Focuses only on RON, neglecting other fuel properties
Absence of Experimental Validation: Generated molecules require experimental verification of actual performance

Future Directions

Multi-Property Optimization: Integrate multiple fuel properties including energy density, volatility, and emission characteristics
Synthesizability Constraints: Incorporate synthesis difficulty, cost, and toxicity considerations
Dataset Expansion: Include more elements and larger RON databases
Blended Fuel Design: Extend to multi-component fuel mixture design
Uncertainty Quantification: Integrate UQ methods to enhance prediction credibility

In-Depth Evaluation

Strengths

Methodological Innovation: The Co-VAE trains the generator and the RON predictor jointly, so the latent space is shaped by the target property rather than by reconstruction alone
Experimental Rigor: Systematic hyperparameter optimization, comparison of 13 regression algorithms, and 10-fold cross-validation
Result Credibility: 921 novel, chemically valid candidates with predicted RON > 110 illustrate practical utility, although the RON values are predictions only
Writing Clarity: Clear structure and detailed technical descriptions support understanding and reproducibility

Weaknesses

Evaluation Limitations: Lacks experimental validation; reliance on computational predictions may introduce biases
Restricted Chemical Space: Consideration of only simple C, H, O compounds limits application scope
Single-Objective Optimization: Actual fuel design requires consideration of multiple competing properties
Overlooked Synthesizability: Generated molecules may face practical synthesis challenges

Impact

Academic Contribution: Provides new methodological framework for AI-driven fuel design
Practical Value: Accelerates fuel screening process, reducing experimental costs
Reproducibility: Provides detailed implementation details and hyperparameter settings
Extensibility: Framework design demonstrates good scalability for other chemical design tasks

Applicable Scenarios

Fuel Pre-screening: Computational screening before large-scale experiments
Molecular Optimization: Structure improvement based on known molecules
Chemical Space Exploration: Discovery of novel fuel molecules difficult to identify through traditional methods
Educational Research: Teaching and research case study for AI chemistry applications

References

The paper cites 32 references covering:

Application of generative deep learning in molecular design
QSPR methods and machine learning in fuel property prediction
VAE architectures and optimization strategies
Cheminformatics tools and databases

Overall Assessment: The paper proposes a generative framework for fuel molecule design that couples a Co-VAE with an independent regression model and a latent-space search. Its main limitations are the small RON dataset, the C/H/O-only chemical space, and the absence of experimental validation. Within those limits, it offers a useful methodological reference for AI-driven fuel and chemical design.