The Prompt Alchemist: Automated LLM-Tailored Prompt Optimization for Test Case Generation
Gao, Wang, Gao et al.
Test cases are essential for validating the reliability and quality of software applications. Recent studies have demonstrated the capability of Large Language Models (LLMs) to generate useful test cases for given source code. However, existing work primarily relies on human-written plain prompts, which often leads to suboptimal results since the performance of LLMs can be highly influenced by the prompts. Moreover, these approaches use the same prompt for all LLMs, overlooking the fact that different LLMs might be best suited to different prompts. Given the wide variety of possible prompt formulations, automatically discovering the optimal prompt for each LLM presents a significant challenge. Although methods for automated prompt optimization exist in the natural language processing field, they struggle to produce effective prompts for the test case generation task. First, these methods iteratively optimize prompts by simply combining and mutating existing ones without proper guidance, resulting in prompts that lack diversity and tend to repeat the same errors in the generated test cases. Second, the prompts generally lack domain contextual knowledge, limiting LLMs' performance on the task.
Test cases are critical for verifying the reliability and quality of software applications. Recent research shows that Large Language Models (LLMs) can generate useful test cases for given source code. However, existing work primarily relies on manually crafted simple prompts, which often yields suboptimal results because LLM performance depends heavily on the prompt. Furthermore, these approaches use identical prompts across all LLMs, overlooking the fact that different LLMs may be best suited to different prompts. This paper proposes MAPS, which achieves automated prompt optimization tailored to different LLMs through three core modules: diversity-guided prompt generation, failure-driven rule induction, and domain contextual knowledge extraction.
Test case generation is a critical task in software engineering. Traditional tools such as EvoSuite and Randoop rely on search-based and constraint-based techniques. LLM-based approaches are promising but suffer from the following issues:
Dependence on manually crafted simple prompts, leading to suboptimal performance
Use of identical prompts across all LLMs, ignoring inter-LLM differences
Lack of specialized optimization for test case generation tasks
Through preliminary experiments, the authors identify three major problems with existing Automated Prompt Optimization (APO) methods on test case generation tasks:
Low Diversity: Generated prompts lack diversity and easily fall into local optima
Repeated Errors: Optimized prompts still produce the same errors as original prompts
Lack of Domain Knowledge: Missing necessary project-level contextual information, such as inheritance relationships and class call information
First Study: To the authors' knowledge, this is the first study specifically addressing LLM-tailored prompt optimization for test case generation tasks
Significant Improvements: Experiments on three popular LLMs demonstrate that MAPS achieves an average improvement of 6.19% in line coverage and 5.03% in branch coverage compared to the strongest baseline
LLM Customization: Demonstrates the effectiveness of generating customized prompts for different LLMs
Given a black-box model M, a small development set D_dev, a test set D_test, and a scoring function s(·), APO aims to discover an optimized prompt p from the natural language space based on D_dev that maximizes M's performance on the test set D_test.
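The formulation above can be illustrated with a minimal Python sketch. It assumes a finite pool of candidate prompts, whereas APO searches the open natural language space iteratively, and it averages the score over the dataset, which the text does not specify. The names `mean_score`, `select_prompt`, `model`, and `score` are hypothetical and do not come from the paper.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type aliases for illustration only.
Model = Callable[[str, str], str]      # black-box M: (prompt, input) -> output
Scorer = Callable[[str, str], float]   # s(.): (input, output) -> score


def mean_score(model: Model, score: Scorer, prompt: str,
               dataset: Sequence[str]) -> float:
    """Average score of `prompt` over a dataset (assumed aggregation)."""
    return mean(score(x, model(prompt, x)) for x in dataset)


def select_prompt(model: Model, score: Scorer,
                  candidates: Sequence[str], d_dev: Sequence[str]) -> str:
    """Pick the candidate with the highest mean score on D_dev.

    D_test is deliberately not used here: it is reserved for the
    final evaluation of the selected prompt.
    """
    return max(candidates, key=lambda p: mean_score(model, score, p, d_dev))
```

The chosen prompt would then be evaluated once on D_test with `mean_score`, keeping selection and evaluation data separate.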
This module creates diverse prompts by exploring different modification paths:
Algorithm 2: PROMPTIMPROVEMENT
1. Select K best-performing prompts
2. Generate N different modification methods
3. Generate new prompts based on each modification method
4. Merge selected prompts and newly generated prompts
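The four steps can be sketched in Python as follows. This is an illustrative reading of the steps, not the authors' implementation: the function name, the `propose_methods` and `apply_method` callables (which stand in for LLM calls), and the choice to request N modification methods per selected prompt are all assumptions.

```python
from typing import Callable, Dict, List


def prompt_improvement(
    scored_prompts: Dict[str, float],
    k: int,
    n: int,
    propose_methods: Callable[[str, int], List[str]],  # hypothetical LLM call
    apply_method: Callable[[str, str], str],           # hypothetical LLM call
) -> List[str]:
    # Step 1: select the K best-performing prompts by score.
    selected = sorted(scored_prompts, key=scored_prompts.get, reverse=True)[:k]

    generated: List[str] = []
    for prompt in selected:
        # Step 2: obtain N different modification methods
        # (assumed here to be requested per selected prompt).
        methods = propose_methods(prompt, n)
        # Step 3: generate one new prompt for each modification method.
        for method in methods:
            generated.append(apply_method(prompt, method))

    # Step 4: merge selected and new prompts, dropping exact duplicates
    # while preserving order.
    return list(dict.fromkeys(selected + generated))
```

With K selected prompts and N methods each, one round yields at most K + K*N candidates for the next evaluation.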
Case 1 - Llama-3.1: Guided by the second induced rule, the model correctly generated test cases with exception handling
Case 2 - ChatGPT: Through cross-file contextual knowledge, the model correctly initialized abstract classes
The paper cites 48 references covering important work in software testing, prompt engineering, and large language models, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality software engineering research paper with significant theoretical and practical value in the LLM-based test case generation domain. The method design is sound, experimental evaluation is comprehensive, and results are convincing. While some limitations exist, the overall contribution is substantial and provides important momentum for the field's development.