The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation

Gao, Wang, Gao et al.
Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, existing work relies primarily on human-written plain prompts, which often leads to suboptimal results because LLM performance is highly sensitive to the prompt. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs may be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM is a significant challenge. Although automated prompt optimization methods exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, yielding prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance on the task.

Basic Information

  • Paper ID: 2501.01329
  • Title: The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation
  • Authors: Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, Michael R. Lyu
  • Classification: cs.SE cs.AI cs.CL
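  • Publication Date/Venue: arXiv preprint arXiv:2501.01329, January 2025 (the manuscript header "JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020" is the unedited IEEE template placeholder, not the actual venue)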
  • Paper Link: https://arxiv.org/abs/2501.01329

Abstract

Test cases are critical for verifying the reliability and quality of software applications. Recent research demonstrates that Large Language Models (LLMs) possess the capability to generate useful test cases for given source code. However, existing work primarily relies on manually crafted simple prompts, which often yields suboptimal results because LLM performance is highly dependent on prompt quality. Furthermore, these approaches use identical prompts across all LLMs, overlooking the fact that different LLMs may be optimally suited to different prompts. This paper proposes MAPS, which achieves automated prompt optimization tailored to different LLMs through three core modules: diversity-guided prompt generation, failure-driven rule induction, and domain contextual knowledge extraction.

Research Background and Motivation

1. Core Problem

Test case generation is a critical task in software engineering. Traditional tools such as EvoSuite and Randoop rely on search-based and constraint-based techniques. LLM-based approaches are promising but suffer from the following issues:

  • Dependence on manually crafted simple prompts, leading to suboptimal performance
  • Use of identical prompts across all LLMs, ignoring inter-LLM differences
  • Lack of specialized optimization for test case generation tasks

2. Problem Significance

  • Manual test case writing is time-consuming and difficult
  • High-quality test cases are crucial for software quality assurance
  • The powerful capabilities of LLMs in code understanding and generation need to be fully leveraged through prompt optimization

3. Limitations of Existing Methods

Through preliminary experiments, the authors identify three major problems with existing Automated Prompt Optimization (APO) methods on test case generation tasks:

  • Low Diversity: Generated prompts lack diversity and easily fall into local optima
  • Repeated Errors: Optimized prompts still produce the same errors as original prompts
  • Lack of Domain Knowledge: Missing necessary project-level contextual information, such as inheritance relationships and class call information

Core Contributions

  1. First Study: To the authors' knowledge, this is the first study specifically addressing LLM-tailored prompt optimization for test case generation tasks
  2. Novel Method: Proposes MAPS, integrating diversity-guided prompt generation, failure-driven rule induction, and domain contextual knowledge extraction
  3. Significant Improvements: Experiments on three popular LLMs demonstrate that MAPS achieves an average improvement of 6.19% in line coverage and 5.03% in branch coverage compared to the strongest baseline
  4. LLM Customization: Demonstrates the effectiveness of generating customized prompts for different LLMs

Method Details

Task Definition

Given a black-box model M, a small development set D_dev, a test set D_test, and a scoring function s(·), APO aims to discover an optimized prompt p from the natural language space based on D_dev that maximizes M's performance on the test set D_test.
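
The objective above can be written compactly. The following is one possible formalization, not taken from the paper: it assumes that the score s is averaged over the test instances, and it introduces the symbol \mathcal{P} for the space of natural-language prompts and p^{*} for the optimized prompt.

```latex
p^{*} \;=\; \arg\max_{p \in \mathcal{P}} \;
\frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} s\bigl(M(p, x)\bigr)
```

Because the search only has access to D_dev, candidate prompts are scored on D_dev during optimization, and that score serves as a proxy for performance on D_test.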

Model Architecture

MAPS comprises three core modules:

1. Domain Contextual Knowledge Extraction

This module provides LLMs with relevant project-level contextual information:

Intra-file Contextual Knowledge:

  • Class signatures: Type and name of the class containing the focal method
  • Focal method: The specific method for which test cases need to be generated
  • Member method signatures: Function signatures of other methods within the class

Cross-file Contextual Knowledge:

  • Class inheritance information: For abstract or private classes, scan the entire project to locate their subclasses
  • Class call information: Identify parameter types of the focal method and trace definitions and constructors of user-defined types
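
To make the two kinds of context concrete, here is a minimal sketch of how the extracted items might be assembled into a prompt section. The `FocalContext` container, the `render_context` function, the section titles, and the toy Java snippets are all hypothetical illustrations; the summary does not describe the paper's actual data structures or prompt layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FocalContext:
    """Hypothetical container for the context items listed above."""
    class_signature: str    # intra-file: type and name of the focal class
    focal_method: str       # intra-file: source of the method under test
    member_signatures: List[str] = field(default_factory=list)  # intra-file
    subclasses: List[str] = field(default_factory=list)         # cross-file: inheritance
    parameter_types: List[str] = field(default_factory=list)    # cross-file: call info


def render_context(ctx: FocalContext) -> str:
    """Render the context as plain-text prompt sections, skipping empty ones."""
    sections = [
        ("Class signature", [ctx.class_signature]),
        ("Focal method", [ctx.focal_method]),
        ("Member method signatures", ctx.member_signatures),
        ("Known subclasses", ctx.subclasses),
        ("Parameter type definitions", ctx.parameter_types),
    ]
    lines = []
    for title, items in sections:
        if not items:
            continue  # omit sections for which nothing was extracted
        lines.append(f"### {title}")
        lines.extend(items)
    return "\n".join(lines)


# Toy example: an abstract focal class whose subclass was found elsewhere in the project.
ctx = FocalContext(
    class_signature="public abstract class Shape",
    focal_method="public double scaledArea(Scale s) { return area() * s.factor(); }",
    member_signatures=["public abstract double area()"],
    subclasses=["public class Circle extends Shape { public Circle(double r) }"],
    parameter_types=["public class Scale { public Scale(double factor) }"],
)
print(render_context(ctx))
```

In this toy case the subclass entry is what would let a model instantiate `Circle` instead of attempting to construct the abstract `Shape` directly.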

2. Diversity-guided Prompt Generation

This module creates diverse prompts by exploring different modification paths:

Algorithm 2: PROMPTIMPROVEMENT
1. Select K best-performing prompts
2. Generate N different modification methods
3. Generate new prompts based on each modification method
4. Merge selected prompts and newly generated prompts
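
The four steps can be sketched in Python as follows. This is an illustrative reading of the listing, not the authors' implementation: `llm` and `score` are hypothetical callables, the instruction wording is invented, and rewriting only the top-ranked prompt in step 3 is a simplifying assumption.

```python
from typing import Callable, List

# Hypothetical stand-ins: `llm` maps an instruction to generated text, and
# `score` returns a development-set score (e.g., line coverage) for a prompt.
LLM = Callable[[str], str]
Scorer = Callable[[str], float]


def prompt_improvement(prompts: List[str], score: Scorer, llm: LLM,
                       k: int = 2, n: int = 2) -> List[str]:
    # Step 1: keep the K best-performing prompts.
    best = sorted(prompts, key=score, reverse=True)[:k]

    # Step 2: ask for N distinct modification methods, one per line.
    listing = "\n".join(best)
    reply = llm(f"Propose {n} different ways to modify these prompts, one per line:\n{listing}")
    methods = [m.strip() for m in reply.splitlines() if m.strip()][:n]

    # Step 3: generate one new prompt per modification method
    # (assumption: only the top-ranked prompt is rewritten).
    new_prompts = [llm(f"Rewrite the prompt below. Modification: {m}\n{best[0]}")
                   for m in methods]

    # Step 4: merge selected and newly generated prompts, dropping duplicates.
    merged: List[str] = []
    for p in best + new_prompts:
        if p not in merged:
            merged.append(p)
    return merged


def fake_llm(instruction: str) -> str:
    """Deterministic stub so the sketch runs without any API access."""
    if instruction.startswith("Propose"):
        return "add a worked example\nask for edge cases"
    modification = instruction.splitlines()[0].split("Modification: ")[1]
    return f"Write JUnit tests for the focal method; {modification}."


seeds = ["Write tests.", "Write JUnit tests for the focal method.", "Test it."]
# Prompt length is used as a toy score here; a real run would measure coverage.
pool = prompt_improvement(seeds, score=lambda p: float(len(p)), llm=fake_llm)
for p in pool:
    print(p)
```

Forcing each new prompt to follow a different modification method is what keeps the candidate pool from collapsing onto near-identical rewrites.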

3. Failure-driven Rule Induction

This module avoids repeated errors by analyzing failed test cases to induce rules:

Failure Information Selection:

  • Collect failed test cases and error messages
  • Aggregate failure information using DBSCAN clustering algorithm
  • Perform weighted sampling based on cluster size and similarity to historical failures
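
A minimal sketch of the weighted sampling step is shown below. It assumes the failures have already been clustered (for example by DBSCAN) and that larger clusters are favored while clusters resembling previously handled failures are down-weighted. The summary does not give the actual weighting formula, so the `size * (1 - similarity)` weight, the Jaccard token similarity, and all names here are assumptions.

```python
import random
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Token-set similarity in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_weights(clusters: Dict[str, List[str]], history: List[str]) -> Dict[str, float]:
    """Assumed weight: cluster size times novelty, where novelty is 1 minus the
    highest similarity between the cluster's first message and any past failure."""
    past = [set(h.split()) for h in history]
    weights = {}
    for name, messages in clusters.items():
        tokens = set(messages[0].split())
        seen = max((jaccard(tokens, p) for p in past), default=0.0)
        weights[name] = len(messages) * (1.0 - seen)
    return weights


def sample_cluster(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one cluster name with probability proportional to its weight."""
    names = list(weights)
    values = [weights[n] for n in names]
    if sum(values) <= 0:
        return rng.choice(names)  # every cluster already seen: fall back to uniform
    return rng.choices(names, weights=values, k=1)[0]


# Toy error messages grouped into two pre-computed clusters.
clusters = {
    "missing_import": ["cannot find symbol class Mockito",
                       "cannot find symbol class Assert"],
    "abstract_init": ["Shape is abstract; cannot be instantiated"],
}
history = ["Shape is abstract; cannot be instantiated"]  # already reflected on

weights = cluster_weights(clusters, history)
print(weights)
print(sample_cluster(weights, random.Random(0)))
```

In this toy run the `abstract_init` cluster receives zero weight because an identical failure is already in the history, so sampling always selects `missing_import`.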

Error Reflection:

  • Construct reflection prompts using representative failure cases
  • Request LLM to provide detailed explanations and solutions
  • Convert explanations and solutions into concise rules

Rule Validation:

  • Validate the effectiveness of each newly generated rule
  • Retain rules with the best performance
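
Rule validation can be pictured as a simple selection loop. The sketch below is an assumption-laden illustration: it appends each candidate rule to the prompt on its own, scores the result with a hypothetical `score` function, and keeps the single best rule only if it beats the rule-free prompt. The paper may retain rules differently.

```python
from typing import Callable, List, Optional, Tuple


def validate_rules(prompt: str, candidate_rules: List[str],
                   score: Callable[[str], float]) -> Tuple[str, Optional[str]]:
    """Try each rule on its own and keep the one with the highest score.
    The prompt is returned unchanged if no rule beats the baseline."""
    best_prompt, best_rule, best_score = prompt, None, score(prompt)
    for rule in candidate_rules:
        candidate = f"{prompt}\nRule: {rule}"
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best_prompt, best_rule, best_score = candidate, rule, candidate_score
    return best_prompt, best_rule


def toy_score(p: str) -> float:
    # Hypothetical scorer: rewards exception-handling guidance, penalizes verbosity.
    # A real scorer would generate tests with the prompt and measure coverage.
    return 0.5 + (0.1 if "exception" in p else 0.0) - (0.05 if "verbose" in p else 0.0)


base = "Write JUnit tests for the focal method."
rules = [
    "Always be verbose in comments.",
    "Wrap calls that may throw in exception assertions.",
]
new_prompt, kept = validate_rules(base, rules, toy_score)
print(kept)
print(new_prompt)
```

Validating before retaining guards against rules that sound plausible in the reflection output but lower coverage once added to the prompt.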

Technical Innovations

  1. Diversity Assurance: Ensures prompt diversity by enforcing different modification methods, avoiding local optima
  2. Failure Learning: Learns from failure cases through rule induction to guide the optimization process
  3. Context Enhancement: Provides project-level contextual information to help LLMs generate accurate test cases
  4. Soft Integration: Converts reflection outputs into concise rules, avoiding performance degradation from verbose prompts

Experimental Setup

Dataset

The evaluation uses the widely adopted Defects4J benchmark, covering five Java projects:

  • Apache Commons CLI (29 bugs)
  • Apache Commons CSV (15 bugs)
  • Google Gson (17 bugs)
  • JFreeChart (26 bugs)
  • Apache Commons Lang (60 bugs)
  • Total: 147 bugs, 85 focal classes, 5,278 focal methods

Evaluation Metrics

  • Line Coverage (%): Percentage of code lines executed during testing
  • Branch Coverage (%): Percentage of branches executed during testing
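
Written as formulas, the two metrics take the usual form below. The symbols L and B (sets of executable lines and branches in the code under test) and the subscript "exec" (the subset exercised by the generated tests) are notation introduced here for illustration.

```latex
\text{LineCov} = \frac{|L_{\text{exec}}|}{|L|} \times 100\%,
\qquad
\text{BranchCov} = \frac{|B_{\text{exec}}|}{|B|} \times 100\%
```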

Models and Baselines

LLMs:

  • ChatGPT (gpt-3.5-turbo-0125)
  • Llama-3.1-70B-Instruct
  • Qwen2-72B-Instruct

Baseline Methods:

  • Basic: Best seed prompt performance
  • APE: Directly requesting LLM to generate semantically-preserving prompt variants
  • OPRO: Generating new prompts incorporating performance information
  • EVOPROMPT (GA/DE): State-of-the-art evolution algorithm-based prompt optimization methods

Implementation Details

  • Number of seed prompts: 5
  • Number of prompts generated per iteration: 2
  • Maximum iterations: 5
  • Development set: 10 randomly sampled bugs
  • Experiments repeated 3 times with average results reported

Experimental Results

Main Results

Performance on ChatGPT:

  • Line coverage: MAPS reaches 53.80% versus 46.63% for the strongest baseline, EVOPROMPT (GA), a gain of 7.17 percentage points
  • Branch coverage: MAPS reaches 41.84% versus 35.88% for the strongest baseline, a gain of 5.96 percentage points

Performance on Llama-3.1:

  • Line coverage: MAPS reaches 50.59% versus 46.52% for the strongest baseline, a gain of 4.07 percentage points
  • Branch coverage: MAPS reaches 39.50% versus 35.07% for the strongest baseline, a gain of 4.43 percentage points

Performance on Qwen2:

  • Line coverage: MAPS reaches 45.51% versus 39.41% for the strongest baseline, a gain of 6.10 percentage points
  • Branch coverage: MAPS reaches 32.71% versus 28.92% for the strongest baseline, a gain of 3.79 percentage points
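
The per-model improvements are absolute differences between two coverage percentages. A short check of that arithmetic, using only the figures quoted in the bullets above (the dictionary layout and the `gains` helper are illustrative):

```python
# (MAPS, strongest baseline) coverage in percent, copied from the bullets above.
results = {
    "ChatGPT":   {"line": (53.80, 46.63), "branch": (41.84, 35.88)},
    "Llama-3.1": {"line": (50.59, 46.52), "branch": (39.50, 35.07)},
    "Qwen2":     {"line": (45.51, 39.41), "branch": (32.71, 28.92)},
}


def gains(metric: str) -> dict:
    """Absolute gain of MAPS over the strongest baseline, per model."""
    return {model: round(r[metric][0] - r[metric][1], 2)
            for model, r in results.items()}


print(gains("line"))
print(gains("branch"))
```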

Ablation Study

Module contribution analysis (using ChatGPT as example):

  • Removing domain contextual knowledge extraction: Line coverage decreases by 9.64%, branch coverage by 8.53%
  • Removing diversity-guided prompt generation: Line coverage decreases by 8.21%, branch coverage by 7.80%
  • Removing failure-driven rule induction: Line coverage decreases by 6.94%, branch coverage by 4.76%

LLM Customization Effects

Experiments validate that MAPS can generate customized prompts for different LLMs:

  • Each LLM performs best on its own optimized prompts
  • On ChatGPT, the prompt optimized for ChatGPT outperforms the prompts optimized for the other two LLMs by 2.45% and 2.66% in line coverage, respectively
  • MAPS-optimized prompts consistently outperform manually designed prompts

Case Studies

  • Case 1 (Llama-3.1): Guided by the second induced rule, the model correctly generated test cases with exception handling
  • Case 2 (ChatGPT): Using cross-file contextual knowledge, the model correctly initialized abstract classes

Related Work

Automated Prompt Optimization

  • APE: Directly requesting LLM to generate semantically-preserving prompt variants
  • OPRO: Combining performance information to guide prompt generation
  • EVOPROMPT: State-of-the-art evolution algorithm-based methods

Test Case Generation

  • Traditional methods: Randoop (feedback-directed random testing), EvoSuite (search-based generation)
  • Deep learning methods: AthenaTest (fine-tuned BART), A3Test (assertion knowledge-enhanced)
  • LLM methods: ChatUniTest, ChatTESTER, etc.

Conclusions and Discussion

Main Conclusions

  1. MAPS significantly outperforms existing prompt optimization methods across all LLMs
  2. Different LLMs indeed require customized prompts
  3. All three core modules contribute importantly to performance improvement, with domain contextual knowledge extraction contributing the most

Limitations

  1. LLM Limitations: Evaluation conducted on only three representative LLMs
  2. Language Limitations: Experiments limited to Java projects, not covering other programming languages
  3. Dataset Scope: Only Defects4J benchmark used

Future Directions

  1. Extension to more LLMs and programming languages
  2. Integration with existing LLM test generation methods
  3. Exploration of more complex project-level contextual information

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: First systematic study of prompt optimization for LLM-based test case generation
  2. Strong Method Innovation: Three well-designed modules addressing key problems of existing methods
  3. Comprehensive Experiments: Full evaluation across multiple LLMs and projects
  4. High Practical Value: General method applicable to different LLMs and projects

Weaknesses

  1. Computational Cost: Iterative optimization may require substantial API calls, incurring high costs
  2. Rule Quality: Failure-driven rule induction depends on LLM's reflection capability, with potentially unstable rule quality
  3. Context Extraction: Completeness and accuracy of cross-file context extraction requires further verification

Impact

  1. Academic Contribution: Opens new research direction in prompt optimization for LLM-based test case generation
  2. Practical Value: Directly applicable to test case generation in real software development
  3. Reproducibility: Provides complete reproduction package facilitating subsequent research

Applicable Scenarios

  1. Software projects requiring automatic generation of high-quality test cases
  2. Teams using different LLMs for code generation
  3. Software engineering tasks requiring LLM performance optimization

References

The paper cites 48 relevant references covering important works in software testing, prompt engineering, and large language models, providing solid theoretical foundation for the research.


Overall Assessment: This is a high-quality software engineering research paper with significant theoretical and practical value in the LLM-based test case generation domain. The method design is sound, experimental evaluation is comprehensive, and results are convincing. While some limitations exist, the overall contribution is substantial and provides important momentum for the field's development.