Do Large Language Models Speak Scientific Workflows?

Yildiz, Peterka
With the advent of large language models (LLMs), there is a growing interest in applying LLMs to scientific tasks. In this work, we conduct an experimental study to explore the applicability of LLMs for configuring, annotating, translating, explaining, and generating scientific workflows. We use five workflow-specific experiments and evaluate several open- and closed-source language models using state-of-the-art workflow systems. Our studies reveal that LLMs often struggle with workflow-related tasks due to their lack of knowledge of scientific workflows. We further observe that the performance of LLMs varies across experiments and workflow systems. Our findings can help workflow developers and users understand LLMs' capabilities in scientific workflows, and motivate further research applying LLMs to workflows.

Basic Information

  • Paper ID: 2412.10606
  • Title: Do Large Language Models Speak Scientific Workflows?
  • Authors: Orcun Yildiz (Argonne National Laboratory), Tom Peterka (Argonne National Laboratory)
  • Classification: cs.HC (Human-Computer Interaction)
  • Conference: SC-W'25 (Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis)
  • Paper Link: https://arxiv.org/abs/2412.10606

Abstract

With the emergence of large language models (LLMs), there is growing interest in applying LLMs to scientific tasks. This study experimentally explores the applicability of LLMs in configuring, annotating, and translating scientific workflows. Using three distinct workflow-specific experiments, the research evaluates the performance of multiple open-source and closed-source language models on state-of-the-art workflow systems. The study finds that LLMs frequently encounter difficulties due to insufficient training data on scientific workflows, and their performance varies across different experiments and workflow systems.

Research Background and Motivation

Problem Definition

Scientific workflows play an important role in high-performance computing (HPC) environments. A workflow consists of a set of cooperating tasks whose scheduling and communication must be coordinated. However, many scientists find workflow systems difficult to use and often choose to run tasks manually or develop their own workflow solutions.

Research Significance

  1. Usability Challenges: The complexity of scientific workflow systems hinders widespread adoption
  2. Learning Curve: Even when adopting general-purpose workflow systems, scientists often lack understanding of these systems
  3. LLM Potential: Large language models may help address these challenges, but their capabilities in HPC workflows need to be understood

Limitations of Existing Approaches

  • Existing research primarily focuses on specific HPC-related tasks, such as code generation, annotation, and query answering
  • No comprehensive study covers the broad applicability of LLMs to complete workflow systems
  • No systematic evaluation exists of LLM performance on tasks specific to scientific workflows

Core Contributions

  1. First Systematic Evaluation: Comprehensive experimental assessment of multiple LLMs' capabilities on scientific workflow tasks
  2. Multi-dimensional Experimental Design: Designed three different types of workflow-specific experiments (configuration, annotation, translation)
  3. Multi-system Evaluation: Evaluation across five state-of-the-art workflow systems
  4. Performance Benchmarks: Established performance benchmarks for LLMs on scientific workflow tasks
  5. Improvement Strategies: Explored techniques such as few-shot prompting to enhance LLM performance

Methodology Details

Task Definition

The research defines three core tasks:

  1. Workflow Configuration: Generating workflow configuration scripts based on natural language input
  2. Task Code Annotation: Automatically annotating user task code to adapt to workflow systems
  3. Task Code Translation: Translating annotated task code between different workflow systems

Evaluation Framework

LLM Selection

  • o3: OpenAI's closed-source model with strong reasoning capabilities
  • Claude-Sonnet-4: Anthropic's hybrid reasoning model
  • Gemini-2.5-Pro: Google's advanced model with strong reasoning and coding capabilities
  • LLaMA-3.3-70B-Instruct: Meta's 70-billion parameter open-source model

Workflow Systems

  • ADIOS2: Flexible I/O library and middleware for scientific codes
  • Henson: Cooperative multitasking system for in-situ processing
  • Parsl: Python parallel programming library supporting task-based execution
  • PyCOMPSs: Task-based programming model
  • Wilkins: In-situ workflow system supporting dynamic heterogeneous task specifications

Evaluation Metrics

  • BLEU: Machine translation evaluation metric based on n-gram precision
  • ChrF: Character-based evaluation metric calculating character n-gram precision and recall
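
To make the two metrics concrete, the following is a minimal sketch of the quantities they are built on, not the scoring code used in the study. It assumes whitespace tokenization for the BLEU-style precision, a single n-gram order per call, and a recall weight of beta=2 for the character F-score; full BLEU and ChrF implementations combine several n-gram orders and add further details such as BLEU's brevity penalty.

```python
from collections import Counter


def ngrams(items, n):
    """Return a Counter of all contiguous n-grams in a sequence."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def clipped_precision(candidate, reference, n):
    """BLEU-style modified n-gram precision over whitespace tokens."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def char_fscore(candidate, reference, n=3, beta=2.0):
    """ChrF-style F-score for a single character n-gram order."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
```

Because the character-level score gives partial credit for near-miss identifiers (for example, a generated field name that differs from the reference by one character), it tends to be more forgiving than token-level precision on code and configuration text.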

Experimental Design

Workflow Configuration Experiment

Users provide natural language descriptions, and LLMs generate corresponding workflow configuration files. For example:

User Prompt: I want a 3-node workflow with one producer and two consumer tasks.
The producer generates mesh and particle datasets, consumer1 reads mesh data, 
consumer2 reads particle data. The producer requires 3 processes, each consumer 
runs on a single process. Please provide a workflow configuration file for the 
Wilkins workflow system.
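
The information an LLM has to extract from this prompt can be written down as a small data structure. The sketch below is only an illustration of the task graph described in the prompt; the `Task` class and its field names are hypothetical and are not Wilkins configuration syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical description of one workflow task (not Wilkins syntax)."""
    name: str
    nprocs: int
    writes: list = field(default_factory=list)
    reads: list = field(default_factory=list)


# The three tasks and their process counts, as stated in the prompt.
workflow = [
    Task("producer", nprocs=3, writes=["mesh", "particles"]),
    Task("consumer1", nprocs=1, reads=["mesh"]),
    Task("consumer2", nprocs=1, reads=["particles"]),
]


def data_links(tasks):
    """Pair each dataset's writer with every task that reads it."""
    links = []
    for writer in tasks:
        for dataset in writer.writes:
            for reader in tasks:
                if dataset in reader.reads:
                    links.append((writer.name, reader.name, dataset))
    return links


links = data_links(workflow)
total_procs = sum(task.nprocs for task in workflow)
```

A correct configuration file has to encode the same two producer-to-consumer links and the same process counts in the target system's own format, which is where models without knowledge of that format tend to go wrong.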

Task Code Annotation Experiment

A simple producer code written in C is provided, and LLMs are asked to annotate it with the relevant workflow system API calls.

Task Code Translation Experiment

Annotated task code from one workflow system is provided, and LLMs are asked to translate it to another workflow system's code.

Experimental Setup

Experimental Environment

  • Hardware: Apple M1 Max, 10-core CPU, 24-core GPU, 32GB unified memory
  • Framework: Inspect AI framework for experiments
  • Repetitions: Each experiment repeated 5 times to account for variability in LLM responses
  • Parameter Settings: temperature=0.2, top_p=0.95
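
Repeated runs are typically reported as a mean with a spread. The sketch below shows one way to aggregate five repetitions; the scores are made up for illustration, and using the sample standard deviation as the spread is an assumption, since this summary does not say how the ± values in the result tables were computed.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-run scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical scores from five repetitions of one experiment.
runs = [40.0, 42.0, 41.0, 44.0, 43.0]
mean, spread = summarize(runs)
print(f"{mean:.1f}±{spread:.1f}")
```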

Prompt Strategy Evaluation

Five different prompt variants were designed:

  1. Original prompt
  2. Different styles
  3. Paraphrasing
  4. Reordering
  5. Detailed prompt (including technical details)

Experimental Results

Main Results

Workflow Configuration Experiment

LLM              ADIOS2     Henson     Wilkins    Overall
o3               59.1±2.3   20.2±2.3   30.0±1.5   36.5±4.5
Gemini-2.5-Pro   73.0±1.8   26.9±1.9   31.6±3.4   43.8±5.7
Claude-Sonnet-4  72.1±0.0   25.0±0.0   36.8±0.8   44.6±5.3
LLaMA-3.3-70B    35.9±0.7   27.7±1.0   39.0±0.0   34.2±1.3

Task Code Annotation Experiment

LLM              ADIOS2     Henson     PyCOMPSs   Parsl      Overall
Gemini-2.5-Pro   51.9±0.7   42.7±9.4   89.3±3.1   35.6±6.3   54.9±5.5
o3               60.3±2.1   38.1±5.0   72.4±1.8   39.3±6.0   52.8±4.1

Task Code Translation Experiment

Translation Direction   Best LLM         BLEU Score
Henson→ADIOS2           o3               56.2±2.1
ADIOS2→Henson           Gemini-2.5-Pro   35.4±1.6
Parsl→PyCOMPSs          Gemini-2.5-Pro   78.4±7.5
PyCOMPSs→Parsl          Gemini-2.5-Pro   39.7±3.3

Key Findings

  1. System Differences: LLMs perform better on well-documented systems such as ADIOS2 and PyCOMPSs
  2. Task Differences: Code annotation tasks show overall better performance than configuration generation
  3. Model Differences: No single model consistently performs best across all tasks
  4. Hallucination Issues: LLMs frequently generate non-existent API calls or configuration fields
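
Hallucinated API calls of the kind described in the last finding can be caught mechanically when the set of valid calls is known. The sketch below is a hedged illustration only: the `wf_` prefix and the names in `KNOWN_CALLS` are hypothetical and do not belong to any of the evaluated workflow systems.

```python
import re

# Hypothetical API surface; a real check would load the names from the
# workflow system's headers or documentation.
KNOWN_CALLS = {"wf_init", "wf_put", "wf_finalize"}


def unknown_calls(code, prefix="wf_"):
    """Return prefixed function calls in `code` that are not in KNOWN_CALLS."""
    called = set(re.findall(rf"\b({re.escape(prefix)}\w+)\s*\(", code))
    return sorted(called - KNOWN_CALLS)


# Example of LLM output containing one call that does not exist.
generated = """
wf_init(&argc, &argv);
wf_put("mesh", data, n);
wf_publish_all();
wf_finalize();
"""
suspect = unknown_calls(generated)
```

A regular expression is enough for a first pass; a production check would parse the code so that names inside comments or string literals are not flagged.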

Few-shot Prompting Effects

LLM              Zero-shot   Few-shot   Improvement
o3               36.5±4.5    89.3±2.7   +144%
Gemini-2.5-Pro   43.8±5.7    86.7±2.3   +98%
Claude-Sonnet-4  44.6±5.3    91.5±3.0   +105%
LLaMA-3.3-70B    34.2±1.3    84.1±2.1   +146%
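
The improvement column appears to be the relative gain of the few-shot mean over the zero-shot mean. The short check below recomputes it from the rounded means in the table, under that assumption; the results agree with the reported percentages to within about one percentage point, which is what rounding of the published means would explain.

```python
# Mean scores copied from the zero-shot and few-shot columns of the table.
scores = {
    "o3": (36.5, 89.3),
    "Gemini-2.5-Pro": (43.8, 86.7),
    "Claude-Sonnet-4": (44.6, 91.5),
    "LLaMA-3.3-70B": (34.2, 84.1),
}


def relative_improvement(zero_shot, few_shot):
    """Percentage gain of the few-shot score over the zero-shot score."""
    return 100.0 * (few_shot - zero_shot) / zero_shot


improvements = {
    model: relative_improvement(*pair) for model, pair in scores.items()
}
for model, gain in improvements.items():
    print(f"{model}: +{gain:.1f}%")
```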

Related Work

Scientific Workflow Research

  • Distributed Workflows: Run across multiple independent systems, exchanging data through files
  • In-situ Workflows: Run within a single HPC system, with tasks executing concurrently and exchanging data through memory

LLM Applications in HPC

  • Duque et al. explored using LLMs to build and execute workflows
  • Sanger et al. investigated GPT-3.5's applicability in understanding, modifying, and extending scientific workflows
  • This research uses more recent models and provides broader coverage of workflow systems and scientific tasks

Conclusions and Discussion

Main Conclusions

  1. Knowledge Deficiency: LLMs frequently encounter difficulties due to insufficient training data in the scientific workflow domain
  2. Performance Variability: LLM performance shows significant variation across different experiments and workflow systems
  3. Context Importance: Few-shot prompting significantly improves LLM performance
  4. System Dependency: Well-documented systems (such as ADIOS2, PyCOMPSs) receive better LLM support
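
Few-shot prompting, as referred to in the third conclusion, means placing a few worked examples ahead of the actual request. The sketch below shows one plausible way to assemble such a prompt; the layout, the function name `build_few_shot_prompt`, and the placeholder example are assumptions, since this summary does not describe the exact prompt format used in the study.

```python
def build_few_shot_prompt(examples, request):
    """Prepend worked (request, answer) pairs to the new request."""
    parts = []
    for i, (example_request, example_answer) in enumerate(examples, start=1):
        parts.append(f"Example {i} request:\n{example_request}")
        parts.append(f"Example {i} answer:\n{example_answer}")
    parts.append(f"Request:\n{request}")
    return "\n\n".join(parts)


# Placeholder example; real ones would be verified configuration files.
examples = [
    (
        "A 2-node workflow with one producer and one consumer.",
        "<reference configuration for example 1>",
    ),
]
prompt = build_few_shot_prompt(
    examples,
    "A 3-node workflow with one producer and two consumer tasks.",
)
```

The examples supply the system-specific syntax that the model otherwise lacks, which is consistent with the large zero-shot to few-shot gains reported above.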

Limitations

  1. Training Data Constraints: Scientific workflow documentation is relatively scarce in LLM training data
  2. API Hallucination: LLMs frequently generate non-existent API calls
  3. Configuration Understanding: LLMs struggle to distinguish between workflow configurations and task code
  4. System Specificity: Performance is highly dependent on documentation availability for specific workflow systems

Future Directions

  1. Retrieval-Augmented Generation (RAG): Combine external knowledge bases to enhance LLM performance
  2. Fine-tuning: Specialized model fine-tuning for scientific workflows
  3. Iterative Error Correction: Introduce automatic error detection and correction mechanisms
  4. Multimodal Integration: Combine code, documentation, and visualization information

In-depth Evaluation

Strengths

  1. Systematic Evaluation: First comprehensive assessment of LLMs in the scientific workflow domain
  2. Multi-dimensional Analysis: Covers three key tasks: configuration, annotation, and translation
  3. Practical Value: Provides valuable reference benchmarks for workflow developers and users
  4. Methodological Rigor: Well-designed experiments, appropriate evaluation metrics, and reproducible results

Weaknesses

  1. Evaluation Scope: Covers only three workflow tasks, which may not be comprehensive enough
  2. Dataset Scale: Relatively small experimental scale, which may affect generalizability of conclusions
  3. Depth of Analysis: Analysis of LLM failure causes could be more thorough
  4. Real-world Deployment: Lacks validation in actual scientific computing environments

Impact

  1. Academic Contribution: Provides important benchmarks for LLM applications in scientific computing
  2. Practical Value: Helps researchers understand the capability boundaries of LLMs in workflow tasks
  3. Future Research: Points directions for improving LLM applications in scientific workflows

Applicable Scenarios

  1. Workflow System Development: Provides reference for integrating LLM-assisted features
  2. Scientific Computing Education: Helps understand LLM limitations in specialized domains
  3. HPC Tool Development: Provides foundation for developing intelligent scientific computing tools

References

This research cites 33 relevant papers covering important works in scientific workflows, large language models, HPC, and other related fields, providing a solid theoretical foundation for the research.


Summary: This is a pioneering research paper that systematically evaluates large language models' capabilities in the scientific workflow domain for the first time. The research reveals significant limitations of LLMs while also demonstrating the potential for performance improvement through appropriate techniques (such as few-shot prompting), laying the foundation for future research in this important area.