The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation

Gao, Wang, Gao et al.

Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated that Large Language Models (LLMs) can generate useful test cases for given source code. However, existing work primarily relies on human-written plain prompts, which often leads to suboptimal results because LLM performance is highly sensitive to the prompt. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs may be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM is a significant challenge. Although automated prompt optimization methods exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, which limits LLM performance on the task.

Basic Information

Paper ID: 2501.01329
Title: The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation
Authors: Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Xiaoqian Jiao, Chun Yong Chong, Shan Gao, Michael R. Lyu
Classification: cs.SE cs.AI cs.CL
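Publication Date/Venue: arXiv preprint, January 2025 (the manuscript header reads "JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020", which is default IEEE template text rather than a venue)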
Paper Link: https://arxiv.org/abs/2501.01329

Abstract

Test cases are critical for verifying the reliability and quality of software applications. Recent research demonstrates that Large Language Models (LLMs) possess the capability to generate useful test cases for given source code. However, existing work primarily relies on manually crafted simple prompts, which often yields suboptimal results because LLM performance is highly dependent on prompt quality. Furthermore, these approaches use identical prompts across all LLMs, overlooking the fact that different LLMs may be optimally suited to different prompts. This paper proposes MAPS, which achieves automated prompt optimization tailored to different LLMs through three core modules: diversity-guided prompt generation, failure-driven rule induction, and domain contextual knowledge extraction.

Research Background and Motivation

1. Core Problem

Test case generation is a critical task in software engineering. Traditional tools such as EvoSuite and Randoop rely on search-based and constraint-based techniques. LLM-based approaches are promising but suffer from the following issues:

Dependence on manually crafted simple prompts, leading to suboptimal performance
Use of identical prompts across all LLMs, ignoring inter-LLM differences
Lack of specialized optimization for test case generation tasks

2. Problem Significance

Manual test case writing is time-consuming and difficult
High-quality test cases are crucial for software quality assurance
The powerful capabilities of LLMs in code understanding and generation need to be fully leveraged through prompt optimization

3. Limitations of Existing Methods

Through preliminary experiments, the authors identify three major problems with existing Automated Prompt Optimization (APO) methods on test case generation tasks:

Low Diversity: Generated prompts lack diversity and easily fall into local optima
Repeated Errors: Optimized prompts still produce the same errors as original prompts
Lack of Domain Knowledge: Missing necessary project-level contextual information, such as inheritance relationships and class call information

Core Contributions

First Study: To the authors' knowledge, this is the first study specifically addressing LLM-tailored prompt optimization for test case generation tasks
Novel Method: Proposes MAPS, integrating diversity-guided prompt generation, failure-driven rule induction, and domain contextual knowledge extraction
Significant Improvements: Experiments on three popular LLMs demonstrate that MAPS achieves an average improvement of 6.19% in line coverage and 5.03% in branch coverage compared to the strongest baseline
LLM Customization: Demonstrates the effectiveness of generating customized prompts for different LLMs

Method Details

Task Definition

Given a black-box model M, a small development set D_dev, a test set D_test, and a scoring function s(·), APO aims to discover an optimized prompt p from the natural language space based on D_dev that maximizes M's performance on the test set D_test.
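
The objective above can be sketched as a selection over candidate prompts. This is a minimal illustration rather than the authors' implementation: `select_best_prompt`, `toy_model`, `toy_score`, and the sample data are hypothetical stand-ins, and real APO methods search the prompt space iteratively instead of enumerating a fixed list.

```python
from statistics import mean
from typing import Callable, Sequence


def select_best_prompt(
    prompts: Sequence[str],
    dev_set: Sequence[str],
    model: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate prompt with the highest mean score on the dev set."""

    def dev_score(prompt: str) -> float:
        return mean(score(model(prompt, x), x) for x in dev_set)

    return max(prompts, key=dev_score)


# Hypothetical black-box model: emits more tests when the prompt mentions edge cases.
def toy_model(prompt: str, focal_method: str) -> str:
    n_tests = 3 if "edge cases" in prompt else 1
    return "\n".join(f"test_{focal_method}_{i}" for i in range(n_tests))


# Hypothetical scoring function: pretends coverage grows with the number of tests.
def toy_score(output: str, focal_method: str) -> float:
    return min(1.0, 0.3 * len(output.splitlines()))


candidates = ["Write a unit test.", "Write unit tests that cover edge cases."]
best = select_best_prompt(candidates, ["parse", "format"], toy_model, toy_score)
print(best)
```

The prompt is chosen on D_dev only; D_test is held out so that the reported performance is not inflated by the selection step.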

Model Architecture

MAPS comprises three core modules:

1. Domain Contextual Knowledge Extraction

This module provides LLMs with relevant project-level contextual information:

Intra-file Contextual Knowledge:

Class signatures: Type and name of the class containing the focal method
Focal method: The specific method for which test cases need to be generated
Member method signatures: Function signatures of other methods within the class

Cross-file Contextual Knowledge:

Class inheritance information: For abstract or private classes, scan the entire project to locate their subclasses
Class call information: Identify parameter types of the focal method and trace definitions and constructors of user-defined types
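
One way to picture how the intra-file and cross-file knowledge listed above might be assembled into a prompt section is the sketch below. The `FocalContext` container, its field names, the section titles, and the Java snippets are all assumptions made for illustration; the summary does not specify the format MAPS actually uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FocalContext:
    # Intra-file knowledge
    class_signature: str
    focal_method: str
    member_signatures: List[str] = field(default_factory=list)
    # Cross-file knowledge
    subclasses: List[str] = field(default_factory=list)       # class inheritance information
    param_type_defs: List[str] = field(default_factory=list)  # class call information


def render_context(ctx: FocalContext) -> str:
    """Render the collected knowledge as a plain-text prompt section."""
    sections = [
        ("Class signature", [ctx.class_signature]),
        ("Focal method", [ctx.focal_method]),
        ("Member method signatures", ctx.member_signatures),
        ("Known subclasses", ctx.subclasses),
        ("Parameter type definitions and constructors", ctx.param_type_defs),
    ]
    lines: List[str] = []
    for title, items in sections:
        if items:  # skip empty sections to keep the prompt short
            lines.append(f"{title}:")
            lines.extend(f"  {item}" for item in items)
    return "\n".join(lines)


# Hypothetical example: an abstract class whose concrete subclass was found elsewhere.
ctx = FocalContext(
    class_signature="public abstract class Shape",
    focal_method="public double scaledArea(Scale s)",
    subclasses=["public class Circle extends Shape"],
    param_type_defs=["public Scale(double factor)"],
)
text = render_context(ctx)
print(text)
```

Listing a concrete subclass gives the model something it can instantiate, since an abstract class cannot be constructed directly in a test.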

2. Diversity-guided Prompt Generation

This module creates diverse prompts by exploring different modification paths:

Algorithm 2: PROMPTIMPROVEMENT
1. Select K best-performing prompts
2. Generate N different modification methods
3. Generate new prompts based on each modification method
4. Merge selected prompts and newly generated prompts
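
The four steps above can be sketched as follows. This is an illustrative reading, not the paper's code: the `llm` callable, the request wording, and the choice to ask for modification methods separately for each selected prompt are assumptions, and `toy_llm` is a hypothetical stand-in for a real model call.

```python
from typing import Callable, Dict, List


def prompt_improvement(
    scored_prompts: Dict[str, float],
    llm: Callable[[str], str],
    k: int,
    n: int,
) -> List[str]:
    # Step 1: keep the K best-performing prompts.
    top_k = sorted(scored_prompts, key=scored_prompts.get, reverse=True)[:k]

    new_prompts: List[str] = []
    for prompt in top_k:
        # Step 2: ask for N distinct modification methods, one per line.
        reply = llm(f"Propose {n} different ways to modify this prompt, one per line:\n{prompt}")
        methods = list(dict.fromkeys(m.strip() for m in reply.splitlines() if m.strip()))[:n]
        # Step 3: generate one new prompt per modification method.
        for method in methods:
            new_prompts.append(
                llm(f"Rewrite the prompt using this modification: {method}\nPrompt: {prompt}")
            )

    # Step 4: merge the selected prompts with the new ones, dropping exact duplicates.
    return list(dict.fromkeys(top_k + new_prompts))


# Hypothetical LLM stub so the sketch runs without network access.
def toy_llm(request: str) -> str:
    if request.startswith("Propose"):
        return "add an example\nstate the output format"
    method = request.splitlines()[0].split(": ", 1)[1]
    prompt = request.splitlines()[1].split(": ", 1)[1]
    return f"{prompt} ({method})"


scored = {"Write tests.": 0.40, "Write JUnit tests.": 0.55, "Test it.": 0.10}
pool = prompt_improvement(scored, toy_llm, k=1, n=2)
print(pool)
```

Forcing each new prompt to follow a different modification method is what separates this step from plain paraphrasing, which tends to produce near-identical candidates.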

3. Failure-driven Rule Induction

This module avoids repeated errors by analyzing failed test cases to induce rules:

Failure Information Selection:

Collect failed test cases and error messages
Aggregate failure information using DBSCAN clustering algorithm
Perform weighted sampling based on cluster size and similarity to historical failures
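
The weighted sampling step can be illustrated with the sketch below. It assumes the clusters have already been produced (for example by DBSCAN) and takes them as input. The exact weighting is not given in this summary, so the formula here, cluster size multiplied by dissimilarity to previously selected failures, is an assumption, as are the function name and the use of `difflib` for string similarity.

```python
import random
from difflib import SequenceMatcher
from typing import Dict, List


def sample_failure_cluster(
    clusters: Dict[int, List[str]],
    history: List[str],
    rng: random.Random,
) -> int:
    """Pick a cluster id, favouring large clusters that differ from past failures."""

    def novelty(messages: List[str]) -> float:
        if not history:
            return 1.0
        representative = messages[0]  # assumption: first message stands for the cluster
        closest = max(SequenceMatcher(None, representative, h).ratio() for h in history)
        return 1.0 - closest

    ids = list(clusters)
    # The small epsilon keeps every cluster selectable even if it matches history exactly.
    weights = [len(clusters[i]) * novelty(clusters[i]) + 1e-6 for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


# Hypothetical error messages grouped into two clusters.
clusters = {
    0: ["NullPointerException at Foo.bar"] * 5,
    1: ["cannot find symbol: class Baz"],
}
history = ["NullPointerException at Foo.bar"]
picked = sample_failure_cluster(clusters, history, random.Random(0))
print(picked)
```

Under this assumed weighting, a large cluster that has already been reflected on contributes almost nothing, so the sampler moves on to failure types that have not yet produced a rule.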

Error Reflection:

Construct reflection prompts using representative failure cases
Request LLM to provide detailed explanations and solutions
Convert explanations and solutions into concise rules
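
A rough sketch of the reflection step is shown below. The prompt wording, the `RULE:` line convention, and both function names are assumptions introduced for illustration; the summary only states that explanations and solutions are requested and then condensed into concise rules.

```python
from typing import List, Tuple


def build_reflection_prompt(failures: List[Tuple[str, str]]) -> str:
    """Format (test_code, error_message) pairs into a reflection request."""
    parts = ["The following generated test cases failed."]
    for idx, (test_code, error) in enumerate(failures, start=1):
        parts.append(f"Failure {idx}:\nTest:\n{test_code}\nError:\n{error}")
    parts.append(
        "Explain why each failure happened and how to avoid it. "
        "Then summarize each solution as one concise rule per line, starting with 'RULE:'."
    )
    return "\n\n".join(parts)


def extract_rules(llm_reply: str) -> List[str]:
    """Keep only the lines the model marked as rules."""
    prefix = "RULE:"
    return [
        line[len(prefix):].strip()
        for line in llm_reply.splitlines()
        if line.startswith(prefix)
    ]


# Hypothetical failure and hypothetical model reply.
request = build_reflection_prompt(
    [("assertEquals(1, parser.parse(null));", "java.lang.NullPointerException")]
)
reply = (
    "The test passes null to a method that dereferences its argument.\n"
    "RULE: Do not pass null unless the method documents null handling.\n"
    "RULE: Use assertThrows when an exception is the expected outcome."
)
rules = extract_rules(reply)
print(rules)
```

Keeping only the short rule lines, rather than the full explanation, matches the stated goal of avoiding verbose prompts.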

Rule Validation:

Validate the effectiveness of each newly generated rule
Retain rules with the best performance
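
Rule validation can be pictured as a small selection loop. The sketch assumes a single best rule is retained and only if it beats the prompt without any new rule; the summary does not say how many rules are kept or how ties are handled. `evaluate` is a hypothetical callable that would score a prompt on the development set.

```python
from typing import Callable, List, Optional


def validate_rules(
    prompt: str,
    candidate_rules: List[str],
    evaluate: Callable[[str], float],
) -> Optional[str]:
    """Return the rule whose addition scores best, or None if no rule helps."""
    best_rule: Optional[str] = None
    best_score = evaluate(prompt)  # baseline: the prompt without any new rule
    for rule in candidate_rules:
        score = evaluate(f"{prompt}\nRule: {rule}")
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule


# Hypothetical scorer standing in for "run the LLM, execute tests, measure coverage".
def toy_evaluate(prompt: str) -> float:
    score = 0.40
    if "import" in prompt.lower():
        score += 0.10
    if "never" in prompt.lower():
        score -= 0.05
    return score


base_prompt = "Write JUnit tests for the focal method."
rules = ["Never use assertions.", "Import every class referenced in the test."]
kept = validate_rules(base_prompt, rules, toy_evaluate)
print(kept)
```

Validating each rule before adopting it guards against reflections that sound plausible but lower coverage in practice.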

Technical Innovations

Diversity Assurance: Ensures prompt diversity by enforcing different modification methods, avoiding local optima
Failure Learning: Learns from failure cases through rule induction to guide the optimization process
Context Enhancement: Provides project-level contextual information to help LLMs generate accurate test cases
Soft Integration: Converts reflection outputs into concise rules, avoiding performance degradation from verbose prompts

Experimental Setup

Dataset

Uses the widely-adopted Defects4J benchmark, comprising 5 Java projects:

Apache Commons CLI (29 bugs)
Apache Commons CSV (15 bugs)
Google Gson (17 bugs)
JFreeChart (26 bugs)
Apache Commons Lang (60 bugs)
Total: 147 bugs, 85 focal classes, 5,278 focal methods

Evaluation Metrics

Line Coverage (%): Percentage of code lines executed during testing
Branch Coverage (%): Percentage of branches executed during testing
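
Both metrics are ratios of covered items to total items. The helper below is a minimal sketch; the function name, the sample counts, and the convention of reporting 0.0 when there is nothing to cover are assumptions, and coverage tools differ on that last point.

```python
def coverage_percent(covered: int, total: int) -> float:
    """Percentage of lines (or branches) executed by the test suite."""
    if total < 0 or not 0 <= covered <= total:
        raise ValueError("covered must satisfy 0 <= covered <= total")
    if total == 0:
        return 0.0  # assumed convention for code with nothing to cover
    return 100.0 * covered / total


# Hypothetical counts for one focal class.
line_cov = coverage_percent(covered=45, total=60)
branch_cov = coverage_percent(covered=9, total=24)
print(f"line {line_cov:.2f}%, branch {branch_cov:.2f}%")
```

Branch coverage is usually the lower of the two, because executing a line does not guarantee that both outcomes of a condition on it were taken.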

Baseline Methods

LLM Models:

ChatGPT (gpt-3.5-turbo-0125)
Llama-3.1-70B-Instruct
Qwen2-72B-Instruct

Baseline Methods:

Basic: Best seed prompt performance
APE: Directly requesting LLM to generate semantically-preserving prompt variants
OPRO: Generating new prompts incorporating performance information
EVOPROMPT (GA/DE): State-of-the-art prompt optimization methods based on evolutionary algorithms (genetic algorithm and differential evolution variants)

Implementation Details

Number of seed prompts: 5
Number of prompts generated per iteration: 2
Maximum iterations: 5
Development set: Randomly sampled 10 bugs
Experiments repeated 3 times with average results reported

Experimental Results

Main Results

Performance on ChatGPT:

Line coverage: MAPS achieves 53.80% versus 46.63% for the strongest baseline, EVOPROMPT(GA), a gain of 7.17 percentage points
Branch coverage: MAPS achieves 41.84% versus 35.88% for the strongest baseline, a gain of 5.96 percentage points

Performance on Llama-3.1:

Line coverage: MAPS achieves 50.59% versus 46.52% for the strongest baseline, a gain of 4.07 percentage points
Branch coverage: MAPS achieves 39.50% versus 35.07% for the strongest baseline, a gain of 4.43 percentage points

Performance on Qwen2:

Line coverage: MAPS achieves 45.51% versus 39.41% for the strongest baseline, a gain of 6.10 percentage points
Branch coverage: MAPS achieves 32.71% versus 28.92% for the strongest baseline, a gain of 3.79 percentage points
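
The per-LLM gains can be recomputed from the coverage figures listed above. The short script below only does that arithmetic; the variable and function names are arbitrary. The unweighted means of the three per-LLM gains come out at about 5.78 points (line) and 4.73 points (branch), which differ from the 6.19% and 5.03% averages quoted under Core Contributions. This summary does not state how those headline averages were aggregated, so the two sets of numbers should not be assumed to measure the same quantity.

```python
# (MAPS, strongest baseline) coverage percentages as listed in the results.
results = {
    "ChatGPT": {"line": (53.80, 46.63), "branch": (41.84, 35.88)},
    "Llama-3.1": {"line": (50.59, 46.52), "branch": (39.50, 35.07)},
    "Qwen2": {"line": (45.51, 39.41), "branch": (32.71, 28.92)},
}


def gains(metric: str) -> dict:
    """Absolute gain of MAPS over the strongest baseline, in percentage points."""
    return {
        llm: round(scores[metric][0] - scores[metric][1], 2)
        for llm, scores in results.items()
    }


def mean_gain(metric: str) -> float:
    """Unweighted mean of the per-LLM gains."""
    values = list(gains(metric).values())
    return round(sum(values) / len(values), 2)


for metric in ("line", "branch"):
    print(metric, gains(metric), "mean:", mean_gain(metric))
```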

Ablation Study

Module contribution analysis (using ChatGPT as example):

Removing domain contextual knowledge extraction: Line coverage decreases 9.64%, branch coverage decreases 8.53%
Removing diversity-guided prompt generation: Line coverage decreases 8.21%, branch coverage decreases 7.80%
Removing failure-driven rule induction: Line coverage decreases 6.94%, branch coverage decreases 4.76%

LLM Customization Effects

Experiments validate that MAPS can generate customized prompts for different LLMs:

Each LLM performs best on its own optimized prompts
On ChatGPT, the final prompt optimized for ChatGPT outperforms the prompts optimized for the other two LLMs by 2.45% and 2.66% in line coverage, respectively
MAPS-optimized prompts consistently outperform manually designed prompts

Case Studies

Case 1 - Llama-3.1: Through the second induced rule, the model correctly generated test cases with exception handling
Case 2 - ChatGPT: Through cross-file contextual knowledge, the model correctly initialized abstract classes

Related Work

Automated Prompt Optimization

APE: Directly requesting LLM to generate semantically-preserving prompt variants
OPRO: Combining performance information to guide prompt generation
EVOPROMPT: State-of-the-art methods based on evolutionary algorithms

Test Case Generation

Traditional methods: Randoop (feedback-directed random test generation), EvoSuite (search-based test generation)
Deep learning methods: AthenaTest (fine-tuned BART), A3Test (assertion knowledge-enhanced)
LLM methods: ChatUniTest, ChatTESTER, etc.

Conclusions and Discussion

Main Conclusions

MAPS significantly outperforms existing prompt optimization methods across all LLMs
Different LLMs indeed require customized prompts
All three core modules contribute importantly to performance improvement, with domain contextual knowledge extraction contributing the most

Limitations

LLM Limitations: Evaluation conducted on only three representative LLMs
Language Limitations: Experiments limited to Java projects, not covering other programming languages
Dataset Scope: Only Defects4J benchmark used

Future Directions

Extension to more LLMs and programming languages
Integration with existing LLM test generation methods
Exploration of more complex project-level contextual information

In-Depth Evaluation

Strengths

Clear Problem Definition: First systematic study of prompt optimization for LLM-based test case generation
Strong Method Innovation: Three well-designed modules addressing key problems of existing methods
Comprehensive Experiments: Full evaluation across multiple LLMs and projects
High Practical Value: General method applicable to different LLMs and projects

Weaknesses

Computational Cost: Iterative optimization may require substantial API calls, incurring high costs
Rule Quality: Failure-driven rule induction depends on LLM's reflection capability, with potentially unstable rule quality
Context Extraction: Completeness and accuracy of cross-file context extraction requires further verification

Impact

Academic Contribution: Opens new research direction in prompt optimization for LLM-based test case generation
Practical Value: Directly applicable to test case generation in real software development
Reproducibility: Provides complete reproduction package facilitating subsequent research

Applicable Scenarios

Software projects requiring automatic generation of high-quality test cases
Teams using different LLMs for code generation
Software engineering tasks requiring LLM performance optimization

References

The paper cites 48 relevant references covering important works in software testing, prompt engineering, and large language models, providing solid theoretical foundation for the research.

Overall Assessment: This is a high-quality software engineering research paper with significant theoretical and practical value in the LLM-based test case generation domain. The method design is sound, experimental evaluation is comprehensive, and results are convincing. While some limitations exist, the overall contribution is substantial and provides important momentum for the field's development.