High-quality software is essential in software engineering and requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system's high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.
- Paper ID: 2501.00217
- Title: The Potential of LLMs in Automating Software Testing: From Generation to Reporting
- Authors: Betim Sherifi, Khaled Slhoub, Fitzroy Nembhard (Florida Institute of Technology)
- Classification: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
- Publication Date: December 31, 2024
- Paper Link: https://arxiv.org/abs/2501.00217
The development of high-quality software in software engineering requires robust verification and validation processes. While manual testing is effective, it is time-consuming and costly, creating a growing demand for automated approaches. Recent advances in Large Language Models (LLMs) have significantly impacted software engineering, particularly in requirements analysis, test automation, and debugging. This paper explores an agent-oriented automated software testing approach that leverages LLMs to reduce manual intervention and improve testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluation on multiple applications in Python and Java demonstrates that the system achieves high test coverage and efficient execution.
- Core Issue: Traditional software testing methods suffer from low efficiency, high costs, and excessive manual intervention
- Practical Need: Software quality assurance requires comprehensive verification and validation processes, but manual testing cannot meet the efficiency requirements of modern software development
- Software testing is recognized as one of the most important fields in software engineering education
- Manual testing methods, particularly regression testing, are especially time-consuming and expensive
- Ensuring software products execute as expected and meet quality standards is critical to software engineering
- Manual Testing: Effective but time-consuming and costly
- Traditional Automated Testing: Cannot completely replace manual methods; human involvement is still required in scenarios such as GUI testing
- Traditional Agent-Based Software Testing (ABST): Lacks intelligent test case generation capabilities
The goal is to combine the capabilities of LLMs with multi-agent systems to build an intelligent testing framework that generates test cases dynamically, substantially reduces manual input, and shortens the time needed to create and execute test cases.
- Proposed an LLM-based multi-agent software testing framework that achieves end-to-end automation from test generation to reporting
- Designed a four-layer architecture comprising an audio web client, a software testing agent, LLMs, and a development environment
- Implemented dynamic test case generation utilizing LLMs to automatically generate customized unit tests and test rationales
- Integrated visualization functionality to automatically generate call graphs in DOT format to display application interactions
- Validated the system on Python and Java projects, achieving average test coverage of 93.45% (Python) and 97.71% (Java)
Input: Testing requests provided by users via voice or text (containing project name, subfolder, programming language, etc.)
Output: Comprehensive PDF report containing test results, coverage analysis, test rationales, and call graphs
Constraints: Supports Python and Java projects, focused on unit testing level
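The input fields and the language constraint above can be illustrated with a small validation step. The sketch below assumes the LLM returns the extracted entities as a JSON object with `project`, `subfolder`, and `language` keys; that reply format and the `parse_entities` helper are assumptions made for illustration, not details given in the paper.

```python
import json

# Languages the framework is reported to support.
SUPPORTED_LANGUAGES = {"python", "java"}
REQUIRED_FIELDS = ("project", "subfolder", "language")


def parse_entities(llm_reply: str) -> dict:
    """Validate a (hypothetical) JSON reply from the entity-extraction step."""
    data = json.loads(llm_reply)
    # Reject replies where a required field is absent or empty.
    missing = [key for key in REQUIRED_FIELDS if not data.get(key)]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    language = str(data["language"]).strip().lower()
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language}")
    return {
        "project": str(data["project"]).strip(),
        "subfolder": str(data["subfolder"]).strip(),
        "language": language,
    }
```

Rejecting incomplete or unsupported requests at this point is one way to surface ambiguous prompts before any test generation time is spent.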
The system comprises four main components:
- Audio Web Client: Captures user input (voice commands or text) and initiates the testing workflow via HTTP GET requests
- Software Testing Agent: The core system component that coordinates interactions between components, serving as an abstraction layer for test script generation, execution, and report creation
- Large Language Models (LLMs): Execute entity extraction, test generation, and DOT graph generation tasks
- Development Environment: Provides project code access, executes generated test cases, and displays results
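As a rough illustration of how the client might hand a prompt to the testing agent over HTTP GET, the sketch below builds and decodes a request URL using only the standard library. The endpoint `http://localhost:8000/generate-tests` and the `prompt` query parameter are hypothetical; the summary does not specify the real route or parameter names.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; the actual route is not given in the source.
BASE_URL = "http://localhost:8000/generate-tests"


def build_test_request(prompt: str) -> str:
    """Return a GET URL that carries the user's (transcribed) prompt."""
    return f"{BASE_URL}?{urlencode({'prompt': prompt})}"


def read_prompt(url: str) -> str:
    """Agent-side counterpart: recover the prompt from the query string."""
    return parse_qs(urlsplit(url).query)["prompt"][0]


url = build_test_request("Test the Cinema project in the src folder using Python")
print(url)
print(read_prompt(url))
```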
- Initialization: Client sends voice command to test generation API
- Entity Extraction: LLM extracts project name, subfolder, and programming language from user prompt
- File Location: FileLocator module locates specified project folder and extracts file contents
- Test Generation: LLM (using Gemini) generates unit tests and corresponding rationales
- Graph Generation: LLM generates DOT graph strings for call graph visualization
- Execution and Reporting: Test executor runs tests; PDF report generator creates comprehensive report containing results, coverage, and call graphs
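The workflow steps above can be sketched as a simple chain of stages. This is a minimal outline, not the authors' implementation: `RunRequest`, `RunReport`, and `run_pipeline` are hypothetical names, and every stage is passed in as a callable so that LLM and file-system calls can be replaced by stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunRequest:
    project: str
    subfolder: str
    language: str


@dataclass
class RunReport:
    request: RunRequest
    tests: str
    dot_graph: str
    passed: bool


def run_pipeline(
    prompt: str,
    extract: Callable[[str], RunRequest],
    locate: Callable[[RunRequest], str],
    generate_tests: Callable[[str, str], str],
    generate_dot: Callable[[str], str],
    execute: Callable[[str, RunRequest], bool],
) -> RunReport:
    """Chain the workflow steps; each callable is a placeholder for one stage."""
    request = extract(prompt)                         # entity extraction (LLM)
    source = locate(request)                          # file location
    tests = generate_tests(source, request.language)  # test generation (LLM)
    dot_graph = generate_dot(source)                  # graph generation (LLM)
    passed = execute(tests, request)                  # test execution
    return RunReport(request, tests, dot_graph, passed)  # data for the PDF report


# Usage with trivial stand-ins in place of the LLM and file-system calls.
report = run_pipeline(
    "Test the Cinema project in Python",
    extract=lambda p: RunRequest("Cinema", "src", "python"),
    locate=lambda r: "def add(a, b):\n    return a + b\n",
    generate_tests=lambda src, lang: "assert add(1, 2) == 3",
    generate_dot=lambda src: "digraph calls {}",
    execute=lambda tests, r: True,
)
```

Passing the stages as parameters keeps the orchestration testable without network access, which is a design choice of this sketch rather than something the paper describes.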
- Intelligent Entity Extraction: Automatically extracts critical testing parameters from natural language instructions using LLM
- Dynamic Test Generation: Automatically generates test scripts containing basic and edge cases based on code analysis
- Rationale Generation: Provides detailed test rationales and coverage scenario descriptions for each test case
- Integrated Visualization: Automatically generates call graphs to help understand code repository interaction relationships
- End-to-End Automation: Complete automation from user input to final report
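To make the call-graph output format concrete, the sketch below produces a DOT string for a small Python module. In the framework the DOT string is generated by an LLM; here Python's `ast` module is swapped in as a deterministic stand-in, purely to show what such a graph looks like. The `call_graph_dot` helper is hypothetical and only tracks direct calls between top-level functions.

```python
import ast


def call_graph_dot(source: str) -> str:
    """Build a DOT digraph of caller -> callee edges for top-level functions."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    edges = set()
    for func in tree.body:
        if not isinstance(func, ast.FunctionDef):
            continue
        # Record every direct call to another function defined in this module.
        for node in ast.walk(func):
            if (
                isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in defined
            ):
                edges.add((func.name, node.func.id))
    lines = ["digraph calls {"]
    lines += [f'    "{caller}" -> "{callee}";' for caller, callee in sorted(edges)]
    lines.append("}")
    return "\n".join(lines)


SAMPLE = '''
def add(a, b):
    return a + b

def total(values):
    result = 0
    for value in values:
        result = add(result, value)
    return result
'''

print(call_graph_dot(SAMPLE))
```

The resulting string can be rendered with Graphviz, for example by saving it to a file and running `dot -Tpng`.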
Four applications of varying complexity:
Python Projects:
- Experiment: Basic calculator functionality (47 lines of code)
- Cinema: Movie theater management system (183 lines of code)
Java Projects:
- StudentAverage: Student grade calculation (114 lines of code)
- LibrarySystem: Library management system (269 lines of code)
- Execution Success Rate: Proportion of runs completing all steps (test generation, execution, PDF report generation)
- Test Coverage: Percentage of code covered by generated test cases
- Execution Time: Time analysis of each operational phase
- Language Comparison: Performance differences between Python and Java projects
- LLM Model: Primarily uses Google Gemini; comparative experiments use ChatGPT
- Test Runs: 20 executions for Python projects, 24 executions for Java projects
- Input Format: Testing with multiple natural language prompt formats
- Python Projects: All 20 executions successful (100% success rate)
- Java Projects: 3 failures out of 24 executions (87.5% success rate)
- Failure Causes: Primarily due to ambiguous prompts and generated test script compilation errors
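The success rates above follow directly from the run counts. A minimal sketch of the calculation, where `success_rate` is a hypothetical helper and the combined figure across both languages is derived here rather than reported in the paper:

```python
def success_rate(total_runs: int, failed_runs: int) -> float:
    """Percentage of runs that completed generation, execution, and reporting."""
    if total_runs <= 0 or not 0 <= failed_runs <= total_runs:
        raise ValueError("invalid run counts")
    return 100.0 * (total_runs - failed_runs) / total_runs


python_rate = success_rate(20, 0)   # Python: 20 runs, no failures
java_rate = success_rate(24, 3)     # Java: 24 runs, 3 failures
combined_rate = success_rate(44, 3)  # derived here, not reported in the paper

print(f"Python {python_rate:.1f}%, Java {java_rate:.1f}%, combined {combined_rate:.1f}%")
```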
- Average Total Execution Time: 83.5 seconds
- Test Generation Time: 62.8 seconds (largest proportion)
- Folder Location: 9.7 seconds
- DOT Graph Generation: 5.4 seconds
- Test Execution: 3.2 seconds
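The breakdown above can be turned into shares of the total run time, which shows that test generation accounts for roughly three quarters of each run. The sketch below uses only the figures listed; the four itemized phases do not add up to the full 83.5 seconds, and attributing the remainder to steps that are not itemized here (such as entity extraction or report generation) is an assumption.

```python
TOTAL_SECONDS = 83.5
PHASE_SECONDS = {
    "test generation": 62.8,
    "folder location": 9.7,
    "DOT graph generation": 5.4,
    "test execution": 3.2,
}


def phase_shares(phases: dict, total: float) -> dict:
    """Return each phase's share of the total time, in percent."""
    return {name: 100.0 * seconds / total for name, seconds in phases.items()}


shares = phase_shares(PHASE_SECONDS, TOTAL_SECONDS)
# Time not covered by the itemized phases (assumed to be other pipeline steps).
unlisted_seconds = TOTAL_SECONDS - sum(PHASE_SECONDS.values())

for name, percent in shares.items():
    print(f"{name}: {percent:.1f}%")
print(f"not itemized: {unlisted_seconds:.1f}s")
```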
| Metric | Java | Python |
|---|---|---|
| Average Total Execution Time | 86.7s | 80s |
| Test Generation Time | 62.4s | 63.3s |
| Test Execution Time | 5.44s | 0.87s |
| Average Test Coverage | 97.71% | 93.45% |

| Project | Language | Lines of Code | Total Time | Test Generation | Test Execution | Coverage |
|---|---|---|---|---|---|---|
| LibrarySystem | Java | 269 | 119.06s | 92.54s | 5.39s | 94.67% |
| StudentManager | Java | 114 | 62.55s | 39.79s | 5.48s | 100.00% |
| Cinema | Python | 183 | 110.13s | 92.43s | 0.79s | 88.30% |
| Experiment | Python | 47 | 49.78s | 34.17s | 0.96s | 98.60% |
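The per-project coverage figures can be cross-checked against the per-language averages reported earlier. In the sketch below, the unweighted Python mean reproduces 93.45%, while the unweighted Java mean comes out near 97.34%, slightly below the reported 97.71%. That gap would be consistent with averaging over individual runs rather than over projects, but the per-project run counts are not given in this summary, so the weights used in the last call are hypothetical and serve only to show the effect of weighting.

```python
# (project, language, coverage %) taken from the table above.
ROWS = [
    ("LibrarySystem", "Java", 94.67),
    ("StudentManager", "Java", 100.00),
    ("Cinema", "Python", 88.30),
    ("Experiment", "Python", 98.60),
]


def mean_coverage(rows, language, weights=None):
    """Mean coverage for one language; weights maps project name -> run count."""
    picked = [(name, cov) for name, lang, cov in rows if lang == language]
    if weights is None:
        weights = {name: 1 for name, _ in picked}  # unweighted mean
    total_weight = sum(weights[name] for name, _ in picked)
    return sum(weights[name] * cov for name, cov in picked) / total_weight


python_mean = mean_coverage(ROWS, "Python")
java_mean = mean_coverage(ROWS, "Java")
# Hypothetical run counts, chosen only to illustrate a run-weighted average.
java_weighted = mean_coverage(ROWS, "Java", {"LibrarySystem": 9, "StudentManager": 12})

print(f"Python {python_mean:.2f}%, Java {java_mean:.2f}%, Java weighted {java_weighted:.2f}%")
```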
ChatGPT vs Gemini (LibrarySystem project):
- ChatGPT generation time: ~180 seconds (approximately 2x Gemini)
- ChatGPT test coverage: 98%
- Note: Using ChatGPT web application rather than API may affect generation time
Cinema Project - rent_movie Function:
- Basic Case: "Test renting available movie to existing member"
- Edge Cases: "Test renting non-existent movie, renting movie to non-existent member, renting already-rented movie"
Library Project - getTitle Function:
- Basic Case: "Test retrieving book title after object creation"
- Edge Cases: Not applicable
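The `rent_movie` rationales above map naturally onto one basic test and three edge-case tests. The sketch below shows what such a generated test file might look like. The `Cinema` class here is a hypothetical stand-in written for this example (the project's real code is not reproduced in the source), so the method names other than `rent_movie` and the exception types are assumptions.

```python
import unittest


class Cinema:
    """Hypothetical stand-in for the Cinema project under test."""

    def __init__(self):
        self.movies = {}      # title -> name of renting member, or None
        self.members = set()

    def add_movie(self, title):
        self.movies[title] = None

    def add_member(self, name):
        self.members.add(name)

    def rent_movie(self, title, member):
        if title not in self.movies:
            raise LookupError(f"unknown movie: {title}")
        if member not in self.members:
            raise LookupError(f"unknown member: {member}")
        if self.movies[title] is not None:
            raise ValueError(f"already rented: {title}")
        self.movies[title] = member


class RentMovieTests(unittest.TestCase):
    def setUp(self):
        self.cinema = Cinema()
        self.cinema.add_movie("Alien")
        self.cinema.add_member("Ana")
        self.cinema.add_member("Ben")

    def test_rent_available_movie_to_existing_member(self):
        # Basic case: renting succeeds and records the member.
        self.cinema.rent_movie("Alien", "Ana")
        self.assertEqual(self.cinema.movies["Alien"], "Ana")

    def test_rent_nonexistent_movie(self):
        with self.assertRaises(LookupError):
            self.cinema.rent_movie("Missing Title", "Ana")

    def test_rent_to_nonexistent_member(self):
        with self.assertRaises(LookupError):
            self.cinema.rent_movie("Alien", "Nobody")

    def test_rent_already_rented_movie(self):
        self.cinema.rent_movie("Alien", "Ana")
        with self.assertRaises(ValueError):
            self.cinema.rent_movie("Alien", "Ben")
```

Saved as a module, these tests can be run with `python -m unittest`. A trivial getter such as `getTitle` would only need the basic case, which matches the "Not applicable" entry for edge cases above.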
- Development Timeline: Has received research attention since 1999, with activity peaking in the past decade
- Application Focus: Primarily emphasizes system-level testing; Java as main target language
- Representative Works:
  - Automated testing framework for web systems (multi-agent collaboration)
  - Industrial coffee machine testing (fuzzy logic priority sorting)
- Industry Survey: 48% of practitioners have integrated LLMs into testing activities
- Application Domains: Requirements analysis, test plan development, test automation
- Common Tools: ChatGPT, GitHub Copilot
- Research Trends: Analysis of 102 related papers shows LLMs have significant value in test case generation, assertion creation, and other areas
- High Success Rate: LLM-driven agents demonstrate excellent performance in automated software testing, achieving 100% success rate on Python projects
- High Coverage: Average test coverage exceeds 93%, demonstrating the effectiveness of generated test cases
- Efficiency Improvement: Significantly reduces manual intervention, achieving end-to-end automation from test generation to reporting
- Language Adaptability: Framework successfully supports two mainstream programming languages: Python and Java
- Java Project Stability: Relatively high failure rate in Java runs (3 of 24); prompt interpretation and the syntactic accuracy of generated tests need improvement
- Testing Scope Restriction: Currently focuses only on unit testing; lacks integration and system-level testing
- Visualization Capabilities: While call graphs are useful, advanced features such as coverage heat maps are lacking
- Input Dependency: Sensitive to prompt quality; ambiguous prompts may lead to failures
- Extend Testing Types: Introduce support for integration and system-level testing
- Enhance Language Support: Expand to more programming languages
- Improve Visualization: Add heat maps showing coverage and defect proneness
- Requirements Integration: Incorporate requirements specification documents as prompt input to improve precision
- Error Handling: Improve handling of ambiguous prompts and error recovery mechanisms
- Strong Innovation: First systematic combination of LLM with multi-agent architecture for end-to-end software testing automation
- High Practical Value: Addresses real pain points in software testing domain with strong engineering application value
- Comprehensive Evaluation: Cross-language, multi-project assessment with convincing results
- Clear Architecture: Well-designed high and low-level architecture with high modularity, facilitating extension and maintenance
- Limited Testing Scope: Supports only unit testing; cannot meet complete software testing requirements
- Insufficient Error Analysis: Limited in-depth analysis of Java project failure causes
- Missing Benchmark Comparisons: Lacks detailed comparison with existing automated testing tools
- Unverified Scalability: System scalability not validated on large, complex projects
- Academic Contribution: Provides new research direction for LLM applications in software engineering
- Practical Value: Can be directly applied to software development processes to improve testing efficiency
- Technology Promotion: Demonstrates enormous potential of LLMs in automated testing domain
- Reproducibility: Detailed architecture description facilitates reproduction and improvement by other researchers
- Small to Medium Projects: Particularly suitable for projects of up to a few hundred lines of code, the scale covered by the evaluation
- Unit Test Automation: Can significantly reduce manual unit test writing work
- Rapid Prototype Validation: Suitable for scenarios requiring quick test case generation
- Education and Training: Applicable to software testing teaching and training scenarios
The paper cites 13 important references covering traditional ABST methods, LLM applications in software testing, and fundamental software testing theory, providing solid theoretical foundation for the research.
Overall Assessment: This is a research paper of significant value at the intersection of software engineering and artificial intelligence. Despite certain limitations, its innovative methodology and practical results open new directions for LLM applications in software test automation, demonstrating good academic value and practical prospects.