Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction

Adrian, Chung, Boyd et al.
Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KPGT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.

Multitask Finetuning and Acceleration of Chemical Pretrained Models for Small Molecule Drug Property Prediction

Basic Information

  • Paper ID: 2510.12719
  • Title: Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction
  • Authors: Matthew Adrian, Yunsie Chung, Kevin Boyd, Saee Paliwal, Srimukh Prasad Veccham, Alan C. Cheng
  • Institutions: Merck & Co., Inc. and NVIDIA BioNeMo
  • Classification: cs.LG (Machine Learning), q-bio.QM (Quantitative Biology Methods)
  • Publication Date: October 14, 2025
  • Paper Link: https://arxiv.org/abs/2510.12719v1

Abstract

Chemical pretrained models (also referred to as foundation models) have garnered significant attention in drug discovery applications. General chemical knowledge extracted through self-supervised training has the potential to improve predictions of critical drug discovery endpoints, including target potency and ADMET properties. This study demonstrates that enabling multitask learning during finetuning of chemical pretrained graph neural network models (such as KERMT and KPGT) significantly enhances performance compared to non-pretrained graph neural network models. Notably, the performance improvements from KERMT multitask finetuning are most pronounced at larger data scales. Additionally, the authors release two multitask ADMET dataset splits and provide an accelerated implementation of the KERMT model.

Research Background and Motivation

Core Challenges

  1. Data Scarcity: In drug discovery, particularly for target potency prediction tasks, annotated data is typically limited (10¹ to 10⁶ molecules), while the estimated chemical space contains approximately 10⁶⁰ molecules
  2. Limitations of Traditional Approaches: Supervised graph neural networks show limited performance in small data scenarios, so practitioners often fall back on classical methods such as random forests
  3. Multitask Learning Potential: Correlations exist between ADMET properties, providing opportunities for multitask learning that has not been fully explored in chemical pretrained model finetuning

Research Motivation

  • Leverage large-scale unlabeled chemical data for pretraining to learn general chemical knowledge and patterns
  • Explore the potential of multitask learning in finetuning chemical pretrained models
  • Address computational efficiency challenges in industrial-scale drug discovery workflows

Core Contributions

  1. First Systematic Study: Introduces multitask learning methods in finetuning chemical pretrained models
  2. KERMT Model Enhancement: Proposes an enhanced version of GROVER supporting distributed pretraining and accelerated inference
  3. Counterintuitive Finding: Demonstrates that the gain from multitask finetuning of KERMT is largest at larger data sizes, challenging the assumption that pretrained models primarily excel in small data scenarios
  4. Benchmark Datasets: Releases two multitask ADMET dataset splits to facilitate comparative method evaluation
  5. Engineering Optimization: Provides accelerated implementation supporting industrial-scale applications

Methodology Details

Task Definition

  • Input: SMILES strings or molecular graph representations of molecules
  • Output: Predictions of multiple ADMET properties or target potency values
  • Objective: Enhance performance of chemical pretrained models on drug property prediction tasks through multitask learning

Model Architecture

KERMT (Kinetic GROVER Multi-Task)

  • Base Architecture: Graph transformer model based on GROVER
  • Pretraining Tasks:
    • Node/edge-level classification: Identify k-hop local subgraphs from node/edge embeddings
    • Graph-level multi-label classification: Identify functional groups present in molecules from graph embeddings
  • Parameter Scale: ~51 million parameters (base version)
  • Pretraining Data: 11 million compounds (from ZINC15 and ChEMBL)

KPGT (Knowledge-guided Pre-training of Graph Transformer)

  • Distinctive Features: Uses molecular line graph representation + knowledge nodes (K-nodes)
  • Pretraining Tasks:
    • Predict masked node and K-node features
    • Predict RDKit fingerprints
    • Predict 200 molecular descriptors
  • Parameter Scale: ~100 million parameters
  • Pretraining Data: 2 million molecules (ChEMBL29)

Multitask Finetuning Strategy

  • Single-task Finetuning: Updates the encoder and feedforward network weights to predict a single property
  • Multitask Finetuning: The feedforward network outputs n values, one per property, and the shared encoder weights are updated against all n properties simultaneously
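
The multitask setup above can be sketched with a loss that averages over the n outputs of the head. The summary does not state how the paper handles molecules that lack a measurement for some endpoints, so the masking below (skipping missing labels) is an assumption, as are the function name and the use of mean squared error. It is a minimal illustration in plain Python, not the KERMT training code.

```python
from typing import Optional, Sequence


def masked_multitask_mse(
    predictions: Sequence[Sequence[float]],
    labels: Sequence[Sequence[Optional[float]]],
) -> float:
    """Mean squared error over observed (molecule, task) pairs only.

    Each row is one molecule; each column is one of the n properties.
    A label of None means the endpoint was not measured (assumed convention).
    """
    total, count = 0.0, 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if label is None:  # skip missing measurements
                continue
            total += (pred - label) ** 2
            count += 1
    if count == 0:
        raise ValueError("no observed labels in batch")
    return total / count


# Two molecules, three endpoints; None marks a missing measurement.
preds = [[1.0, 2.0, 3.0], [0.5, 1.5, 2.5]]
labels = [[1.0, None, 2.0], [None, 1.0, None]]
loss = masked_multitask_mse(preds, labels)
```

Because every observed endpoint contributes to one scalar loss, a gradient step on it would update the shared encoder using all tasks at once, which is the mechanism the multitask bullet describes.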

Technical Innovations

  1. Distributed Pretraining: Implements multi-GPU parallel pretraining using PyTorch DDP, achieving 86% scaling efficiency with 8 GPUs
  2. Accelerated Finetuning and Inference: Integrates the cuik-molmaker package, achieving 2.2× faster finetuning and 2.9× faster inference
  3. Automatic Hyperparameter Optimization: Integrates Optuna for hyperparameter search
  4. Memory Optimization: Dynamically generates molecular graphs and descriptors, reducing memory usage by 34%
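
To put the 86% figure in concrete terms, the arithmetic can be sketched as follows. This assumes the usual definition of scaling efficiency (measured speedup divided by the number of GPUs); the summary does not say which definition the paper uses, and the timing values in the example are hypothetical.

```python
def speedup_from_efficiency(efficiency: float, n_gpus: int) -> float:
    """Speedup over one GPU implied by a scaling efficiency (assumed: speedup / n_gpus)."""
    return efficiency * n_gpus


def efficiency_from_times(t_single: float, t_multi: float, n_gpus: int) -> float:
    """Scaling efficiency computed from wall-clock times on 1 GPU and on n_gpus GPUs."""
    return (t_single / t_multi) / n_gpus


# 86% efficiency on 8 GPUs corresponds to roughly a 6.9x speedup over one GPU.
speedup = speedup_from_efficiency(0.86, 8)

# Round trip with hypothetical timings: 80 h on one GPU, 80 / 6.88 h on eight.
efficiency = efficiency_from_times(80.0, 80.0 / 6.88, 8)
```

Under this definition, perfect linear scaling would give an efficiency of 1.0, so the reported value leaves about 14% of the ideal 8× speedup unrealized.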

Experimental Setup

Datasets

Internal Dataset (Merck)

  • ADMET Data: 30 endpoints, 800,733 compounds (as of 2024)
  • Target Potency: Target 1 (744 compounds), Target 2 (1,163 compounds)
  • Split Strategy: 80-20 temporal split (April 2018 as boundary)
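
A temporal split of this kind can be sketched in a few lines. The record layout, field names, and the exact cutoff day are hypothetical (the summary only gives April 2018 as the boundary), and the toy data are invented for illustration.

```python
from datetime import date
from typing import List, NamedTuple, Sequence, Tuple


class Record(NamedTuple):
    smiles: str
    measured_on: date  # hypothetical field: date the assay value was registered
    value: float


def temporal_split(
    records: Sequence[Record], cutoff: date
) -> Tuple[List[Record], List[Record]]:
    """Train on everything measured before the cutoff, test on the rest."""
    train = [r for r in records if r.measured_on < cutoff]
    test = [r for r in records if r.measured_on >= cutoff]
    return train, test


records = [
    Record("CCO", date(2016, 5, 1), 0.3),
    Record("c1ccccc1", date(2017, 11, 12), 1.2),
    Record("CC(=O)O", date(2018, 3, 30), -0.4),
    Record("CCN", date(2018, 4, 2), 0.9),
    Record("CCCC", date(2019, 1, 15), 2.1),
]
cutoff = date(2018, 4, 1)  # assumed day; the source only states "April 2018"
train, test = temporal_split(records, cutoff)
```

Unlike a random split, the train/test ratio here is a consequence of where the boundary falls, and the test set contains only compounds made after every training compound, which mimics prospective use.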

Public Datasets

  • Literature ADMET Data: 25 endpoints, 114,112 compounds
  • Biogen Dataset: 6 endpoints, 3,521 compounds
  • BindingDB: EGFR (9,462 compounds), BTK (9,337 compounds), etc.
  • Split Strategy: Clustering-based split using PCA-reduced Morgan fingerprints

Evaluation Metrics

  • Primary Metric: Squared Pearson correlation coefficient (Pearson r²)
  • Secondary Metrics: Coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE)
  • Classification Evaluation: Classification enrichment plots, assessing correct classification rates of high-potency molecules
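
The regression metrics listed above are easy to confuse, in particular Pearson r² and the coefficient of determination R². The sketch below uses their standard textbook definitions in plain Python; the function names and toy values are illustrative and not taken from the paper's code.

```python
import math
from typing import Sequence


def pearson_r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_t, mean_p = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def coefficient_of_determination(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R^2 = 1 - SS_res / SS_tot; penalizes systematic offsets, unlike Pearson r^2."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Predictions that rank perfectly but are shifted by a constant +1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
metrics = {
    "pearson_r2": pearson_r2(y_true, y_pred),
    "r2": coefficient_of_determination(y_true, y_pred),
    "mae": mae(y_true, y_pred),
    "rmse": rmse(y_true, y_pred),
}
```

In this toy case Pearson r² is 1.0 while R² is only 0.2, which is why a correlation-based primary metric is usually reported alongside error-based secondary metrics.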

Comparison Methods

  • Baseline: Chemprop (D-MPNN)
  • Pretrained Models: MoLFormer, KPGT, KERMT
  • Evaluation Modes: Single-task (ST) and multitask (MT) variants

Experimental Results

Main Results

Internal ADMET Data Performance

In temporal split testing on Merck internal data:

  • KERMT MT: Achieves best or tied-best performance on 5 key endpoints
  • Performance Improvement: Outperforms Chemprop MT on 18 of 30 endpoints
  • Average Improvement: Pearson r² improvement of 0.02 (vs. Chemprop) and 0.04 (vs. KPGT)

Specific Results (Pearson r²):

  • Papp: KERMT MT (0.712) vs Chemprop MT (0.657)
  • EPSA: KERMT MT (0.822) vs Chemprop MT (0.805)
  • Fu,p human: KERMT MT (0.666) vs Chemprop MT (0.641)

Public Dataset Performance

  • Public ADMET Data: KPGT performs better, ranking best on 9 of 25 endpoints, while KERMT MT ranks best on only 3 of 25
  • Biogen Data: Conclusions carry lower confidence because of the small sample size (3,521 compounds)
  • Data Scale Dependency: KERMT performs better on large datasets (>10k samples), KPGT superior on small datasets (<3k samples)

Data Scale Analysis

Key Finding: KERMT's advantages are more pronounced at larger data scales

  • Critical Point: When training set exceeds 60k data points, KERMT significantly outperforms Chemprop
  • Parameter Scale Impact: KERMT (51 million parameters) more prone to overfitting on small data than Chemprop (5 million parameters)
  • Multitask Benefits: Performance continuously improves with increasing task numbers (1→30 tasks)

Chemical Space Generalization

Through Tanimoto similarity analysis:

  • Consistent Advantage: KERMT outperforms Chemprop across all similarity intervals (0.35-0.7)
  • Generalization Capability: Despite not specifically targeting low-similarity compounds, demonstrates stronger overall generalization
  • Cyclic Peptide Prediction: Both models show comparable performance on cyclic peptide subsets (Pearson r² = 0.36)
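
An analysis of this kind can be sketched as follows: compute each test compound's Tanimoto similarity to its nearest training compound, then group test compounds into similarity intervals before scoring each group. Fingerprints are represented here as sets of on-bit indices; the fingerprint type, the nearest-neighbor definition, and the bin edges are assumptions, since the summary only gives the overall 0.35-0.7 range.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity_to_train(query: Fingerprint, train: Sequence[Fingerprint]) -> float:
    """Similarity of a test compound to its nearest training compound."""
    return max(tanimoto(query, t) for t in train)


def bin_by_similarity(
    test: Sequence[Fingerprint],
    train: Sequence[Fingerprint],
    edges: Sequence[float],
) -> Dict[Tuple[float, float], List[int]]:
    """Group test indices into half-open intervals [lo, hi) of nearest-neighbor similarity."""
    bins: Dict[Tuple[float, float], List[int]] = {
        (lo, hi): [] for lo, hi in zip(edges, edges[1:])
    }
    for i, fp in enumerate(test):
        s = max_similarity_to_train(fp, train)
        for lo, hi in bins:
            if lo <= s < hi:
                bins[(lo, hi)].append(i)
                break  # compounds outside every interval are left unassigned
    return bins


train_fps = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7})]
test_fps = [frozenset({1, 2, 3, 9}), frozenset({5, 6, 10, 11})]
bins = bin_by_similarity(test_fps, train_fps, edges=[0.35, 0.5, 0.7])
```

A metric such as Pearson r² would then be computed separately on the predictions within each interval, so that model performance can be compared as test compounds become less similar to the training set.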

Pretraining Data Impact

Experiments using internally retrained models show:

  • Limited Improvement: Even with pretraining data more similar to downstream tasks, performance gains are modest
  • Cyclic Peptide Performance: Base KERMT model still outperforms internally pretrained models on cyclic peptide tasks (5/12 vs 1/12 tasks)
  • Insight: Improved pretraining tasks needed to better capture relevant information

Related Work

Chemical Pretrained Models

  • GROVER: Graph transformer using atomic and bond message passing
  • MoLFormer: SMILES-based language model with rotary position encoding
  • KPGT: Knowledge-guided graph transformer integrating molecular descriptors

Multitask Learning

  • Traditional Applications: Primarily applied to deep learning architectures trained from scratch
  • This Work's Contribution: First systematic application of multitask learning to chemical pretrained model finetuning

Conclusions and Discussion

Main Conclusions

  1. Multitask Finetuning Effectiveness: KERMT multitask finetuning significantly improves performance, particularly in large data scenarios
  2. Data Scale Dependency: Challenges the conventional view that pretrained models primarily excel in small data settings
  3. Model Selection Guidance: Recommends KERMT MT for medium-to-large datasets and KPGT ST for small data
  4. Engineering Feasibility: Accelerated implementation enables industrial-scale applications

Limitations

  1. Pretraining Task Optimization: Current pretraining tasks may not sufficiently capture downstream task-relevant information
  2. Cyclic Peptide Prediction: Limited improvements on specialized molecular types like cyclic peptides
  3. Dataset Inconsistency: Inconsistent results between internal and public datasets affect generalization assessment
  4. Computational Resource Requirements: Large parameter models demand more computational resources

Future Directions

  1. Pretraining Task Improvement: Design pretraining objectives better suited for downstream multitask learning
  2. Modular Finetuning: Investigate effects of partial encoder freezing across different data scales
  3. Cross-modal Extension: Explore joint protein-small molecule pretraining
  4. Benchmark Development: Create additional high-quality multitask benchmarks

In-Depth Evaluation

Strengths

  1. High Practical Value: Directly addresses real-world challenges in industrial drug discovery
  2. Comprehensive Experiments: Covers multiple datasets, models, and evaluation dimensions
  3. Counterintuitive Findings: Challenges domain assumptions and provides novel insights
  4. Engineering Contributions: Provides complete open-source implementation with acceleration optimizations
  5. Data Contributions: Releases standardized multitask benchmark datasets

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks deep theoretical explanation for why KERMT performs better at larger data scales
  2. Pretraining Strategy: Insufficient exploration of pretraining methods optimized for multitask learning
  3. Statistical Significance: Some results could benefit from more rigorous statistical significance analysis
  4. Computational Cost Analysis: Lacks detailed computational cost comparison analysis

Impact

  1. Academic Impact: Provides important reference for the intersection of cheminformatics and multitask learning
  2. Industrial Application: Directly applicable to pharmaceutical companies' ADMET prediction workflows
  3. Open-Source Contribution: Open code and data promote field development
  4. Methodological Contribution: Establishes new standards for evaluating chemical pretrained models

Applicable Scenarios

  1. Large Pharmaceutical Companies: Organizations with large-scale ADMET data
  2. Multi-property Optimization: Scenarios requiring simultaneous prediction of multiple molecular properties
  3. Industrial Workflows: Production environments requiring efficient inference
  4. Research Benchmarks: Standard baselines for multitask chemical property prediction

References

The paper cites 47 important references covering:

  • Foundational work on chemical pretrained models (GROVER, MoLFormer, KPGT)
  • Classical methods and datasets for ADMET prediction
  • Theoretical foundations of multitask learning
  • Molecular representation learning and graph neural networks
  • Comprehensive reviews of machine learning applications in drug discovery

Overall Assessment: This is a high-quality applied research paper whose main value lies in its experimental validation and engineering implementation. Particularly notable are its counterintuitive findings on data scale and its open-source contributions, which should help advance the field of cheminformatics.