
Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction

Adrian, Chung, Boyd et al.

Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KPGT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.



Basic Information

Paper ID: 2510.12719
Title: Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction
Authors: Matthew Adrian, Yunsie Chung, Kevin Boyd, Saee Paliwal, Srimukh Prasad Veccham, Alan C. Cheng
Institutions: Merck & Co., Inc. and NVIDIA BioNeMo
Classification: cs.LG (Machine Learning), q-bio.QM (Quantitative Biology Methods)
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2510.12719v1

Abstract

Chemical pretrained models (also referred to as foundation models) have garnered significant attention in drug discovery applications. General chemical knowledge extracted through self-supervised training has the potential to improve predictions of critical drug discovery endpoints, including target potency and ADMET properties. This study demonstrates that enabling multitask learning during finetuning of chemical pretrained graph neural network models (such as KERMT and KPGT) significantly enhances performance compared to non-pretrained graph neural network models. Notably, the performance improvements from KERMT multitask finetuning are most pronounced at larger data scales. Additionally, the authors release two multitask ADMET dataset splits and provide an accelerated implementation of the KERMT model.

Research Background and Motivation

Core Challenges

Data Scarcity: In drug discovery, particularly for target potency prediction tasks, annotated data is typically limited (10¹ to 10⁶ molecules), while the estimated chemical space contains approximately 10⁶⁰ molecules
Limitations of Traditional Approaches: Supervised graph neural networks show limited performance in small data scenarios, often requiring reliance on classical methods such as random forests
Multitask Learning Potential: Correlations exist between ADMET properties, providing opportunities for multitask learning that has not been fully explored in chemical pretrained model finetuning

Research Motivation

Leverage large-scale unlabeled chemical data for pretraining to learn general chemical knowledge and patterns
Explore the potential of multitask learning in finetuning chemical pretrained models
Address computational efficiency challenges in industrial-scale drug discovery workflows

Core Contributions

First Systematic Study: Introduces multitask learning methods in finetuning chemical pretrained models
KERMT Model Enhancement: Proposes an enhanced version of GROVER supporting distributed pretraining and accelerated inference
Counterintuitive Finding: Demonstrates that KERMT performs better at larger data scales, challenging the assumption that pretrained models primarily excel in small data scenarios
Benchmark Datasets: Releases two multitask ADMET dataset splits to facilitate comparative method evaluation
Engineering Optimization: Provides accelerated implementation supporting industrial-scale applications

Methodology Details

Task Definition

Input: SMILES strings or molecular graph representations of molecules
Output: Predictions of multiple ADMET properties or target potency values
Objective: Enhance performance of chemical pretrained models on drug property prediction tasks through multitask learning

Model Architecture

KERMT (Kinetic GROVER Multi-Task)

Base Architecture: Graph transformer model based on GROVER
Pretraining Tasks:
- Node/edge-level classification: Identify k-hop local subgraphs from node/edge embeddings
- Graph-level multi-label classification: Identify functional groups present in molecules from graph embeddings
Parameter Scale: ~51 million parameters (base version)
Pretraining Data: 11 million compounds (from ZINC15 and ChEMBL)

KPGT (Knowledge-guided Pre-training of Graph Transformer)

Distinctive Features: Uses molecular line graph representation + knowledge nodes (K-nodes)
Pretraining Tasks:
- Predict masked node and K-node features
- Predict RDKit fingerprints
- Predict 200 molecular descriptors
Parameter Scale: ~100 million parameters
Pretraining Data: 2 million molecules (ChEMBL29)

Multitask Finetuning Strategy

Single-task Finetuning: Updates both encoder and feedforward network weights to predict a single property
Multitask Finetuning: The feedforward network outputs n values, one per property, and the encoder weights are updated jointly across all n tasks
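
One practical detail of multitask finetuning is that most compounds are measured on only some of the n endpoints. A common way to handle this is to mask unmeasured labels out of the loss. The following is a minimal pure-Python sketch of that idea; the masking scheme, the function name `masked_mse`, and the use of `None` for missing labels are illustrative assumptions and are not taken from the paper or the KERMT code.

```python
from typing import Optional, Sequence


def masked_mse(
    preds: Sequence[Sequence[float]],
    labels: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over observed (non-None) labels only.

    preds[i][t]  : model output for molecule i, task t (n outputs per molecule)
    labels[i][t] : measured value, or None if task t was not measured
    """
    total = 0.0
    count = 0
    for pred_row, label_row in zip(preds, labels):
        for p, y in zip(pred_row, label_row):
            if y is None:  # unmeasured endpoint contributes no loss
                continue
            total += (p - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed labels in batch")
    return total / count


# Two molecules, three hypothetical endpoints; the second molecule
# has only one measured endpoint.
preds = [[1.0, 2.0, 3.0], [0.5, 0.0, 1.0]]
labels = [[1.0, 1.0, 3.0], [None, None, 2.0]]
loss = masked_mse(preds, labels)
```

In a real training loop the same masking would be applied to tensors so that gradients flow to the shared encoder only from measured endpoints.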

Technical Innovations

Distributed Pretraining: Implements multi-GPU parallel pretraining using PyTorch DDP, achieving 86% scaling efficiency with 8 GPUs
Accelerated Inference: Integrates cuik-molmaker package, achieving 2.2× finetuning acceleration and 2.9× inference acceleration
Automatic Hyperparameter Optimization: Integrates Optuna for hyperparameter search
Memory Optimization: Dynamically generates molecular graphs and descriptors, reducing memory usage by 34%
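
To make the 86% figure concrete, the sketch below uses the conventional definition of scaling efficiency (observed speedup divided by the number of GPUs). That the paper uses exactly this definition is an assumption; the function names are illustrative.

```python
def scaling_efficiency(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Observed speedup divided by the ideal (linear) speedup."""
    speedup = t_single / t_multi
    return speedup / n_gpus


def implied_speedup(efficiency: float, n_gpus: int) -> float:
    """Speedup over a single GPU implied by a given scaling efficiency."""
    return efficiency * n_gpus


# Under the assumed definition, 86% efficiency on 8 GPUs corresponds
# to roughly a 6.9x speedup over one GPU.
speedup_8gpu = implied_speedup(0.86, 8)
```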

Experimental Setup

Datasets

Internal Dataset (Merck)

ADMET Data: 30 endpoints, 800,733 compounds (as of 2024)
Target Potency: Target 1 (744 compounds), Target 2 (1,163 compounds)
Split Strategy: 80-20 temporal split (April 2018 as boundary)
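
A temporal split assigns compounds to train or test by date rather than at random, which mimics prospective use of a model. The sketch below is a hypothetical illustration: the record layout, the compound identifiers, and the exact cutoff day (April 1, 2018) are assumptions, since the summary only states that April 2018 is the boundary.

```python
from datetime import date
from typing import List, Tuple

Record = Tuple[str, date]  # (compound identifier, measurement date)


def temporal_split(
    records: List[Record], cutoff: date
) -> Tuple[List[Record], List[Record]]:
    """Compounds dated before the cutoff go to train, the rest to test."""
    train = [r for r in records if r[1] < cutoff]
    test = [r for r in records if r[1] >= cutoff]
    return train, test


CUTOFF = date(2018, 4, 1)  # assumed day within April 2018

records = [
    ("cmpd-001", date(2016, 7, 12)),
    ("cmpd-002", date(2017, 11, 3)),
    ("cmpd-003", date(2018, 3, 31)),
    ("cmpd-004", date(2018, 4, 1)),
    ("cmpd-005", date(2019, 1, 20)),
]
train, test = temporal_split(records, CUTOFF)
```

In this sketch the train/test ratio is simply whatever the cutoff date yields for the data at hand.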

Public Datasets

Literature ADMET Data: 25 endpoints, 114,112 compounds
Biogen Dataset: 6 endpoints, 3,521 compounds
BindingDB: EGFR (9,462 compounds), BTK (9,337 compounds), etc.
Split Strategy: Clustering-based split using PCA-reduced Morgan fingerprints

Evaluation Metrics

Primary Metric: Squared Pearson correlation coefficient (Pearson r²)
Secondary Metrics: Coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE)
Classification Evaluation: Classification enrichment plots, assessing correct classification rates of high-potency molecules
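
The regression metrics above can be computed as in the following pure-Python sketch. The function names are illustrative, and the example values are made up to show that Pearson r² and the coefficient of determination (R²) can differ: a constant offset leaves Pearson r² untouched but lowers R².

```python
import math
from typing import Sequence


def pearson_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Pearson correlation between measurements and predictions."""
    n = len(y_true)
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mt) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Predictions that are perfectly correlated but offset by +1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
metrics = {
    "pearson_r2": pearson_r2(y_true, y_pred),
    "r2": r2_score(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
}
```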

Comparison Methods

Baseline: Chemprop (D-MPNN)
Pretrained Models: MoLFormer, KPGT, KERMT
Evaluation Modes: Single-task (ST) and multitask (MT) variants

Experimental Results

Main Results

Internal ADMET Data Performance

In temporal split testing on Merck internal data:

KERMT MT: Achieves best or tied-best performance on 5 key endpoints
Performance Improvement: Outperforms Chemprop MT on 18 of 30 endpoints
Average Improvement: Pearson r² improvement of 0.02 (vs. Chemprop) and 0.04 (vs. KPGT)

Specific Results (Pearson r²):

Papp: KERMT MT (0.712) vs Chemprop MT (0.657)
EPSA: KERMT MT (0.822) vs Chemprop MT (0.805)
Fu,p human: KERMT MT (0.666) vs Chemprop MT (0.641)

Public Dataset Performance

Public ADMET Data: KPGT performs better (best on 9/25 endpoints), KERMT MT best on only 3/25
Biogen Data: Lower confidence due to small sample size
Data Scale Dependency: KERMT performs better on large datasets (>10k samples), KPGT superior on small datasets (<3k samples)

Data Scale Analysis

Key Finding: KERMT's advantages are more pronounced at larger data scales

Critical Point: When training set exceeds 60k data points, KERMT significantly outperforms Chemprop
Parameter Scale Impact: KERMT (51 million parameters) more prone to overfitting on small data than Chemprop (5 million parameters)
Multitask Benefits: Performance continuously improves with increasing task numbers (1→30 tasks)

Chemical Space Generalization

Through Tanimoto similarity analysis:

Consistent Advantage: KERMT outperforms Chemprop across all similarity intervals (0.35-0.7)
Generalization Capability: Despite not specifically targeting low-similarity compounds, demonstrates stronger overall generalization
Cyclic Peptide Prediction: Both models show comparable performance on cyclic peptide subsets (Pearson r² = 0.36)
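
For readers unfamiliar with this kind of analysis, the sketch below shows one common way to set it up: compute each test compound's maximum Tanimoto similarity to the training set, then group test compounds into similarity bins. Fingerprints are represented as sets of on-bit indices. The nearest-neighbor formulation, the bin edges, and the function names are assumptions for illustration and may differ from the paper's procedure.

```python
import bisect
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def max_train_similarity(query: Set[int], train_fps: Sequence[Set[int]]) -> float:
    """Similarity of a test compound to its nearest training compound."""
    return max(tanimoto(query, fp) for fp in train_fps)


def bin_index(sim: float, edges: Sequence[float]) -> int:
    """Index of the similarity bin; edges must be sorted ascending."""
    return bisect.bisect_right(edges, sim)


# Toy fingerprints (hypothetical bit indices).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
query_fp = {1, 2, 3, 5}

sim = max_train_similarity(query_fp, train_fps)   # 3 shared / 5 total bits
which_bin = bin_index(sim, edges=[0.35, 0.5, 0.7])  # assumed bin edges
```

Per-bin metrics can then be computed by evaluating each model only on the test compounds that fall into a given bin.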

Pretraining Data Impact

Experiments using internally retrained models show:

Limited Improvement: Even with pretraining data more similar to downstream tasks, performance gains are modest
Cyclic Peptide Performance: Base KERMT model still outperforms internally pretrained models on cyclic peptide tasks (5/12 vs 1/12 tasks)
Insight: Improved pretraining tasks needed to better capture relevant information

Related Work

Chemical Pretrained Models

GROVER: Graph transformer using atomic and bond message passing
MoLFormer: SMILES-based language model with rotary position encoding
KPGT: Knowledge-guided graph transformer integrating molecular descriptors

Multitask Learning

Traditional Applications: Primarily applied to deep learning architectures trained from scratch
This Work's Contribution: First systematic application of multitask learning to chemical pretrained model finetuning

Conclusions and Discussion

Main Conclusions

Multitask Finetuning Effectiveness: KERMT multitask finetuning significantly improves performance, particularly in large data scenarios
Data Scale Dependency: Challenges the conventional view that pretrained models primarily excel in small data settings
Model Selection Guidance: Recommends KERMT MT for medium-to-large datasets and KPGT ST for small data
Engineering Feasibility: Accelerated implementation enables industrial-scale applications

Limitations

Pretraining Task Optimization: Current pretraining tasks may not sufficiently capture downstream task-relevant information
Cyclic Peptide Prediction: Limited improvements on specialized molecular types like cyclic peptides
Dataset Inconsistency: Inconsistent results between internal and public datasets affect generalization assessment
Computational Resource Requirements: Large parameter models demand more computational resources

Future Directions

Pretraining Task Improvement: Design pretraining objectives better suited for downstream multitask learning
Modular Finetuning: Investigate effects of partial encoder freezing across different data scales
Cross-modal Extension: Explore joint protein-small molecule pretraining
Benchmark Development: Create additional high-quality multitask benchmarks

In-Depth Evaluation

Strengths

High Practical Value: Directly addresses real-world challenges in industrial drug discovery
Comprehensive Experiments: Covers multiple datasets, models, and evaluation dimensions
Counterintuitive Findings: Challenges domain assumptions and provides novel insights
Engineering Contributions: Provides complete open-source implementation with acceleration optimizations
Data Contributions: Releases standardized multitask benchmark datasets

Weaknesses

Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why KERMT performs better at larger data scales
Pretraining Strategy: Insufficient exploration of pretraining methods optimized for multitask learning
Statistical Significance: Some results could benefit from more rigorous statistical significance analysis
Computational Cost Analysis: Lacks detailed computational cost comparison analysis

Impact

Academic Impact: Provides important reference for the intersection of cheminformatics and multitask learning
Industrial Application: Directly applicable to pharmaceutical companies' ADMET prediction workflows
Open-Source Contribution: Open code and data promote field development
Methodological Contribution: Establishes new standards for evaluating chemical pretrained models

Applicable Scenarios

Large Pharmaceutical Companies: Organizations with large-scale ADMET data
Multi-property Optimization: Scenarios requiring simultaneous prediction of multiple molecular properties
Industrial Workflows: Production environments requiring efficient inference
Research Benchmarks: Standard baselines for multitask chemical property prediction

References

The paper cites 47 important references covering:

Foundational work on chemical pretrained models (GROVER, MoLFormer, KPGT)
Classical methods and datasets for ADMET prediction
Theoretical foundations of multitask learning
Molecular representation learning and graph neural networks
Comprehensive reviews of machine learning applications in drug discovery

Overall Assessment: This is a high-quality applied research paper with significant value in theoretical contributions, experimental validation, and engineering implementation. Particularly notable are its counterintuitive findings and comprehensive open-source contributions, which hold important significance for advancing the field of cheminformatics.