
Do Large Language Models Speak Scientific Workflows?

Yildiz, Peterka

With the advent of large language models (LLMs), there is a growing interest in applying LLMs to scientific tasks. In this work, we conduct an experimental study to explore the applicability of LLMs for configuring, annotating, translating, explaining, and generating scientific workflows. We use 5 different workflow-specific experiments and evaluate several open- and closed-source language models using state-of-the-art workflow systems. Our studies reveal that LLMs often struggle with workflow-related tasks due to their lack of knowledge of scientific workflows. We further observe that the performance of LLMs varies across experiments and workflow systems. Our findings can help workflow developers and users in understanding LLMs' capabilities in scientific workflows, and motivate further research applying LLMs to workflows.


Basic Information

Paper ID: 2412.10606
Title: Do Large Language Models Speak Scientific Workflows?
Authors: Orcun Yildiz (Argonne National Laboratory), Tom Peterka (Argonne National Laboratory)
Classification: cs.HC (Human-Computer Interaction)
Conference: SC-W'25 (Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis)
Paper Link: https://arxiv.org/abs/2412.10606

Abstract

With the emergence of large language models (LLMs), there is growing interest in applying LLMs to scientific tasks. This study experimentally explores the applicability of LLMs in configuring, annotating, and translating scientific workflows. Using three distinct workflow-specific experiments, the research evaluates the performance of multiple open-source and closed-source language models on state-of-the-art workflow systems. The study finds that LLMs frequently encounter difficulties due to insufficient training data on scientific workflows, and their performance varies across different experiments and workflow systems.

Research Background and Motivation

Problem Definition

Scientific workflows play an important role in high-performance computing (HPC) environments. A workflow consists of a series of cooperating tasks whose scheduling and communication must be coordinated. However, many scientists find workflow systems difficult to use and often choose to run tasks manually or develop their own workflow solutions.

Research Significance

Usability Challenges: The complexity of scientific workflow systems hinders widespread adoption
Learning Curve: Even when adopting general-purpose workflow systems, scientists often lack understanding of these systems
LLM Potential: Large language models may help address these challenges, but their capabilities in HPC workflows need to be understood

Limitations of Existing Approaches

Existing research primarily focuses on specific HPC-related tasks, such as code generation, annotation, and query answering
There is no comprehensive study of the broad applicability of LLMs across complete workflow systems
There is no systematic evaluation of LLM performance on scientific workflow-specific tasks

Core Contributions

First Systematic Evaluation: Comprehensive experimental assessment of multiple LLMs' capabilities on scientific workflow tasks
Multi-dimensional Experimental Design: Designed three different types of workflow-specific experiments (configuration, annotation, translation)
Multi-system Evaluation: Evaluation across five state-of-the-art workflow systems
Performance Benchmarks: Established performance benchmarks for LLMs on scientific workflow tasks
Improvement Strategies: Explored techniques such as few-shot prompting to enhance LLM performance

Methodology Details

Task Definition

The research defines three core tasks:

Workflow Configuration: Generating workflow configuration scripts based on natural language input
Task Code Annotation: Automatically annotating user task code to adapt to workflow systems
Task Code Translation: Translating annotated task code between different workflow systems

Evaluation Framework

LLM Selection

o3: OpenAI's closed-source model with strong reasoning capabilities
Claude-Sonnet-4: Anthropic's hybrid reasoning model
Gemini-2.5-Pro: Google's advanced model with strong reasoning and coding capabilities
LLaMA-3.3-70B-Instruct: Meta's 70-billion parameter open-source model

Workflow Systems

ADIOS2: Flexible I/O library and middleware for scientific codes
Henson: Collaborative multi-task system for in-situ processing
Parsl: Python parallel programming library supporting task-based execution
PyCOMPSs: Task-based programming model
Wilkins: In-situ workflow system supporting dynamic heterogeneous task specifications

Evaluation Metrics

BLEU: Machine translation evaluation metric based on n-gram precision
ChrF: Character-based evaluation metric calculating character n-gram precision and recall
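
As a rough illustration of how the two metrics work, the sketch below implements simplified sentence-level versions in plain Python. It is not the scoring code used in the study (the summary does not say which tool was used). Whitespace tokenization, the absence of smoothing, and the chrF defaults (character n-grams up to length 6, beta = 2) are assumptions made here for illustration.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return a multiset of the n-grams in a sequence (token list or string)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU on whitespace tokens, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    cand, ref = candidate.replace(" ", ""), reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        if not cand_counts or not ref_counts:
            continue  # string too short for this n-gram order
        overlap = sum((cand_counts & ref_counts).values())
        precisions.append(overlap / sum(cand_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

BLEU works on word n-grams and is precision-oriented, so a single wrong token in a short configuration line can drive it to zero. chrF works on character n-grams and weights recall more heavily, so it still gives partial credit for near-matches such as a correct field name with a wrong value.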

Experimental Design

Workflow Configuration Experiment

Users provide natural language descriptions, and LLMs generate corresponding workflow configuration files. For example:

User Prompt: I want a 3-node workflow with one producer and two consumer tasks.
The producer generates mesh and particle datasets, consumer1 reads mesh data, 
consumer2 reads particle data. The producer requires 3 processes, each consumer 
runs on a single process. Please provide a workflow configuration file for the 
Wilkins workflow system.
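
To make the requirements in this prompt concrete, the sketch below captures them as structured data and checks that every dataset a consumer reads is written by some task. The `TaskSpec` class and its field names are hypothetical and chosen only for illustration; they are not the Wilkins configuration format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """One workflow task (hypothetical structure, not a real workflow-system schema)."""
    name: str
    nprocs: int
    writes: set = field(default_factory=set)
    reads: set = field(default_factory=set)


def check_workflow(tasks):
    """Return a list of problems: datasets that are read but never written."""
    produced = set().union(*(t.writes for t in tasks))
    problems = []
    for t in tasks:
        for dset in sorted(t.reads - produced):
            problems.append(f"{t.name} reads '{dset}', which no task writes")
    return problems


# The three tasks described in the user prompt.
workflow = [
    TaskSpec("producer", nprocs=3, writes={"mesh", "particles"}),
    TaskSpec("consumer1", nprocs=1, reads={"mesh"}),
    TaskSpec("consumer2", nprocs=1, reads={"particles"}),
]
total_procs = sum(t.nprocs for t in workflow)

print(f"{len(workflow)} tasks, {total_procs} processes, problems: {check_workflow(workflow)}")
```

A generated configuration file has to encode exactly these facts (task names, process counts, and which datasets flow to which consumer) in the target system's own syntax, which is where the models tend to go wrong.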

Task Code Annotation Experiment

LLMs are given simple producer code written in C and asked to annotate it with the relevant workflow system API calls.

Task Code Translation Experiment

Annotated task code from one workflow system is provided, and LLMs are asked to translate it to another workflow system's code.

Experimental Setup

Experimental Environment

Hardware: Apple M1 Max, 10-core CPU, 24-core GPU, 32GB unified memory
Framework: Inspect AI framework for experiments
Repetitions: Each experiment repeated 5 times to account for variability in LLM responses
Parameter Settings: temperature=0.2, top_p=0.95
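
The results tables report scores in the form mean±spread over the repeated runs. A minimal sketch of that aggregation follows, assuming the spread is the sample standard deviation; the summary does not state whether it is a standard deviation, a standard error, or a confidence interval. The scores used are made-up placeholders, not data from the paper.

```python
from statistics import mean, stdev


def summarize(scores):
    """Format repeated-run scores as 'mean±spread' with one decimal place."""
    m = mean(scores)
    # Sample standard deviation needs at least two values.
    s = stdev(scores) if len(scores) > 1 else 0.0
    return f"{m:.1f}±{s:.1f}"


# Placeholder scores for five repetitions of one experiment.
runs = [10.0, 20.0, 30.0, 40.0, 50.0]
print(summarize(runs))

# Identical outputs on every run give a spread of zero.
print(summarize([72.1] * 5))
```

A spread of ±0.0 in a results table therefore means the model produced the same score on all five repetitions.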

Prompt Strategy Evaluation

Five different prompt variants were designed:

Original prompt
Different styles
Paraphrasing
Reordering
Detailed prompt (including technical details)

Experimental Results

Main Results

Workflow Configuration Experiment

LLM	ADIOS2	Henson	Wilkins	Overall
o3	59.1±2.3	20.2±2.3	30.0±1.5	36.5±4.5
Gemini-2.5-Pro	73.0±1.8	26.9±1.9	31.6±3.4	43.8±5.7
Claude-Sonnet-4	72.1±0.0	25.0±0.0	36.8±0.8	44.6±5.3
LLaMA-3.3-70B	35.9±0.7	27.7±1.0	39.0±0.0	34.2±1.3

Task Code Annotation Experiment

LLM	ADIOS2	Henson	PyCOMPSs	Parsl	Overall
Gemini-2.5-Pro	51.9±0.7	42.7±9.4	89.3±3.1	35.6±6.3	54.9±5.5
o3	60.3±2.1	38.1±5.0	72.4±1.8	39.3±6.0	52.8±4.1

Task Code Translation Experiment

Translation Direction	Best LLM	BLEU Score
Henson→ADIOS2	o3	56.2±2.1
ADIOS2→Henson	Gemini-2.5-Pro	35.4±1.6
Parsl→PyCOMPSs	Gemini-2.5-Pro	78.4±7.5
PyCOMPSs→Parsl	Gemini-2.5-Pro	39.7±3.3

Key Findings

System Differences: LLMs perform better on well-documented systems such as ADIOS2 and PyCOMPSs
Task Differences: Code annotation tasks show overall better performance than configuration generation
Model Differences: No single model consistently performs best across all tasks
Hallucination Issues: LLMs frequently generate non-existent API calls or configuration fields
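
The hallucination finding suggests a simple automated check: compare the fields in a generated configuration against the fields the target system actually defines. The sketch below does this for nested dictionaries. The schema and the generated configuration are invented for illustration and do not correspond to any of the evaluated workflow systems.

```python
def find_unknown_fields(config, schema, path=""):
    """Recursively list keys in `config` that `schema` does not define."""
    unknown = []
    for key, value in config.items():
        here = f"{path}.{key}" if path else key
        if key not in schema:
            unknown.append(here)
        elif isinstance(value, dict) and isinstance(schema[key], dict):
            unknown.extend(find_unknown_fields(value, schema[key], here))
    return unknown


# Hypothetical schema: nested dicts name the allowed fields, None marks a leaf.
SCHEMA = {"task": {"name": None, "nprocs": None, "inputs": None, "outputs": None}}

# Hypothetical LLM output containing two fields the schema does not know.
generated = {
    "task": {"name": "producer", "nprocs": 3, "gpu_affinity": "auto"},
    "scheduler": "fifo",
}

print(find_unknown_fields(generated, SCHEMA))
```

A check like this only catches invented field names; it says nothing about whether the values of valid fields are correct.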

Few-shot Prompting Effects

LLM	Zero-shot	Few-shot	Improvement
o3	36.5±4.5	89.3±2.7	+144%
Gemini-2.5-Pro	43.8±5.7	86.7±2.3	+98%
Claude-Sonnet-4	44.6±5.3	91.5±3.0	+105%
LLaMA-3.3-70B	34.2±1.3	84.1±2.1	+146%
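
The Improvement column is the relative gain of the few-shot score over the zero-shot score. The sketch below recomputes it from the rounded means in the table; because the inputs are already rounded, a recomputed value can differ from the reported one by about a percentage point.

```python
def relative_improvement(zero_shot, few_shot):
    """Relative gain of few-shot over zero-shot, in percent."""
    return 100.0 * (few_shot - zero_shot) / zero_shot


# (zero-shot mean, few-shot mean) taken from the table above.
scores = {
    "o3": (36.5, 89.3),
    "Gemini-2.5-Pro": (43.8, 86.7),
    "Claude-Sonnet-4": (44.6, 91.5),
    "LLaMA-3.3-70B": (34.2, 84.1),
}

improvements = {name: relative_improvement(z, f) for name, (z, f) in scores.items()}
for name, gain in improvements.items():
    print(f"{name}: +{gain:.1f}%")
```

The models with the lowest zero-shot scores show the largest relative gains, and all four models end up within a few points of each other once examples are supplied.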

Related Work

Scientific Workflow Research

Distributed Workflows: Run across multiple independent systems, exchanging data through files
In-situ Workflows: Run within a single HPC system, with tasks executing concurrently and exchanging data through memory

LLM Applications in HPC

Duque et al. explored using LLMs to build and execute workflows
Sanger et al. investigated GPT-3.5's applicability in understanding, modifying, and extending scientific workflows
This research uses more recent models and provides broader coverage of workflow systems and scientific tasks

Conclusions and Discussion

Main Conclusions

Knowledge Deficiency: LLMs frequently encounter difficulties due to insufficient training data in the scientific workflow domain
Performance Variability: LLM performance shows significant variation across different experiments and workflow systems
Context Importance: Few-shot prompting significantly improves LLM performance
System Dependency: Well-documented systems (such as ADIOS2, PyCOMPSs) receive better LLM support
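
Few-shot prompting amounts to placing known-good request/configuration pairs ahead of the actual request. The sketch below shows one way to assemble such a prompt. The wording of the instructions and the layout of the examples are assumptions; the summary does not describe the exact prompt format used in the study, and the example configuration text is a placeholder.

```python
def build_few_shot_prompt(system_name, examples, request):
    """Assemble a prompt with worked examples placed before the real request."""
    parts = [f"You write configuration files for the {system_name} workflow system."]
    for i, (description, config) in enumerate(examples, start=1):
        parts.append(
            f"Example {i} request:\n{description}\n"
            f"Example {i} configuration:\n{config}"
        )
    parts.append(f"Request:\n{request}\nConfiguration:")
    return "\n\n".join(parts)


# Placeholder example pair; a real one would contain a verified configuration file.
examples = [
    ("A 2-node workflow with one producer and one consumer.",
     "<known-good configuration text goes here>"),
]
prompt = build_few_shot_prompt(
    "Wilkins",
    examples,
    "I want a 3-node workflow with one producer and two consumer tasks.",
)
print(prompt)
```

With an empty `examples` list the same function produces the zero-shot prompt, which makes it straightforward to compare the two conditions on identical requests.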

Limitations

Training Data Constraints: Scientific workflow documentation is relatively scarce in LLM training data
API Hallucination: LLMs frequently generate non-existent API calls
Configuration Understanding: LLMs struggle to distinguish between workflow configurations and task code
System Specificity: Performance is highly dependent on documentation availability for specific workflow systems

Future Directions

Retrieval-Augmented Generation (RAG): Combine external knowledge bases to enhance LLM performance
Fine-tuning: Specialized model fine-tuning for scientific workflows
Iterative Error Correction: Introduce automatic error detection and correction mechanisms
Multimodal Integration: Combine code, documentation, and visualization information

In-depth Evaluation

Strengths

Systematic Evaluation: First comprehensive assessment of LLMs in the scientific workflow domain
Multi-dimensional Analysis: Covers three key tasks: configuration, annotation, and translation
Practical Value: Provides valuable reference benchmarks for workflow developers and users
Methodological Rigor: Well-designed experiments, appropriate evaluation metrics, and reproducible results

Weaknesses

Evaluation Scope: Covers only three workflow tasks, which may not be comprehensive enough
Dataset Scale: Relatively small experimental scale, which may affect generalizability of conclusions
Depth of Analysis: Analysis of LLM failure causes could be more thorough
Real-world Deployment: Lacks validation in actual scientific computing environments

Impact

Academic Contribution: Provides important benchmarks for LLM applications in scientific computing
Practical Value: Helps researchers understand the capability boundaries of LLMs in workflow tasks
Future Research: Points directions for improving LLM applications in scientific workflows

Applicable Scenarios

Workflow System Development: Provides reference for integrating LLM-assisted features
Scientific Computing Education: Helps understand LLM limitations in specialized domains
HPC Tool Development: Provides foundation for developing intelligent scientific computing tools

References

The paper cites 33 references covering scientific workflows, large language models, HPC, and related fields.

Summary: This paper presents what it describes as the first systematic evaluation of large language models on scientific workflow tasks. It reveals significant limitations of LLMs in this domain while showing that techniques such as few-shot prompting can substantially improve performance, laying groundwork for future research in this area.