The Potential of LLMs in Automating Software Testing: From Generation to Reporting

Sherifi, Slhoub, Nembhard

Having a high quality software is essential in software engineering, which requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system's high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.

Basic Information

Paper ID: 2501.00217
Title: The Potential of LLMs in Automating Software Testing: From Generation to Reporting
Authors: Betim Sherifi, Khaled Slhoub, Fitzroy Nembhard (Florida Institute of Technology)
Classification: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Publication Date: December 31, 2024
Paper Link: https://arxiv.org/abs/2501.00217

Abstract

The development of high-quality software in software engineering requires robust verification and validation processes. While manual testing is effective, it is time-consuming and costly, creating a growing demand for automated approaches. Recent advances in Large Language Models (LLMs) have significantly impacted software engineering, particularly in requirements analysis, test automation, and debugging. This paper explores an agent-oriented automated software testing approach that leverages LLMs to reduce manual intervention and improve testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluation on multiple applications in Python and Java demonstrates that the system achieves high test coverage and efficient execution.

Research Background and Motivation

Problem Definition

Core Issue: Traditional software testing methods suffer from low efficiency, high costs, and excessive manual intervention
Practical Need: Software quality assurance requires comprehensive verification and validation processes, but manual testing cannot meet the efficiency requirements of modern software development

Importance Analysis

Software testing is recognized as one of the most important fields in software engineering education
Manual testing methods, particularly regression testing, are especially time-consuming and expensive
Ensuring software products execute as expected and meet quality standards is critical to software engineering

Limitations of Existing Approaches

Manual Testing: Effective but time-consuming and costly
Traditional Automated Testing: Cannot completely replace manual methods; human involvement is still required in scenarios such as GUI testing
Traditional Agent-Based Software Testing (ABST): Lacks intelligent test case generation capabilities

Research Motivation

The work aims to combine LLMs with multi-agent systems in an intelligent testing framework that generates test cases dynamically, reduces manual input, and shortens the time needed to create and execute test cases.

Core Contributions

Proposed an LLM-based multi-agent software testing framework that achieves end-to-end automation from test generation to reporting
Designed a four-layer architecture system comprising an audio web client, software testing agent, LLMs, and development environment
Implemented dynamic test case generation utilizing LLMs to automatically generate customized unit tests and test rationales
Integrated visualization functionality to automatically generate call graphs in DOT format to display application interactions
Validated the system on Python and Java projects, with average test coverage of 93.45% (Python) and 97.71% (Java)
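
The dynamic test generation contribution listed above can be illustrated with a minimal prompt-assembly sketch. The function name `build_test_prompt`, the prompt wording, and the framework mapping (`unittest`, `JUnit`) are assumptions made for illustration; the prompts actually used in the paper are not reproduced in this summary.

```python
from typing import Dict

# Assumed mapping from language to test framework; not taken from the paper.
FRAMEWORKS = {"python": "unittest", "java": "JUnit"}


def build_test_prompt(sources: Dict[str, str], language: str) -> str:
    """Assemble one prompt asking for unit tests plus a rationale per test."""
    framework = FRAMEWORKS[language.lower()]  # KeyError for unsupported languages
    parts = [
        f"Write {framework} unit tests for the {language} files below.",
        "Cover basic cases and edge cases.",
        "For every test, add a one-sentence rationale explaining what it checks.",
    ]
    # Append each source file under a simple path marker, in a stable order.
    for path in sorted(sources):
        parts.append(f"\n--- {path} ---\n{sources[path]}")
    return "\n".join(parts)


prompt = build_test_prompt(
    {"calculator.py": "def add(a, b):\n    return a + b\n"}, "Python"
)
print(prompt)
```

Asking for the rationale in the same request keeps each test and its explanation together, which is one plausible way to obtain the per-test rationales that appear in the final report.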

Methodology Details

Task Definition

Input: Testing requests provided by users via voice or text (containing project name, subfolder, programming language, etc.)
Output: Comprehensive PDF report containing test results, coverage analysis, test rationales, and call graphs
Constraints: Supports Python and Java projects, focused on unit testing level
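
A minimal sketch of how the request fields named in the task definition could be recovered from an LLM reply. It assumes the model is asked to answer in JSON with the keys `project`, `subfolder`, and `language`; that reply format, the `RequestEntities` type, and both function names are hypothetical and not taken from the paper.

```python
import json
from dataclasses import dataclass

# The summary states that only Python and Java projects are supported.
SUPPORTED_LANGUAGES = {"python", "java"}


@dataclass
class RequestEntities:
    project: str
    subfolder: str
    language: str


def build_extraction_prompt(user_prompt: str) -> str:
    """Ask the model for a JSON object with the three fields (assumed format)."""
    return (
        "Extract the project name, subfolder, and programming language from "
        "the request below. Reply with JSON using the keys "
        '"project", "subfolder", and "language".\n\n'
        f"Request: {user_prompt}"
    )


def parse_entities(llm_reply: str) -> RequestEntities:
    """Validate the model's JSON reply and normalize the language name."""
    data = json.loads(llm_reply)
    language = str(data["language"]).strip().lower()
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    return RequestEntities(
        project=str(data["project"]).strip(),
        subfolder=str(data.get("subfolder", "")).strip(),
        language=language,
    )


reply = '{"project": "Cinema", "subfolder": "src", "language": "Python"}'
print(parse_entities(reply))
```

Rejecting unsupported or malformed replies at this step is one way to surface an ambiguous prompt early, before any test generation time is spent.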

Model Architecture

High-Level Architecture

The system comprises four main components:

Audio Web Client: Captures user input (voice commands or text) and initiates the testing workflow via HTTP GET requests
Software Testing Agent: The core system component that coordinates interactions between components, serving as an abstraction layer for test script generation, execution, and report creation
Large Language Models (LLMs): Execute entity extraction, test generation, and DOT graph generation tasks
Development Environment: Provides project code access, executes generated test cases, and displays results
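
The client-to-agent hand-off over HTTP GET can be sketched as follows. The base URL, the `/generate-tests` endpoint, and the `prompt` query parameter are hypothetical names chosen for this example; the summary only states that the client starts the workflow with a GET request.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical address and endpoint; the paper's actual values are not given here.
AGENT_BASE_URL = "http://localhost:8000"
GENERATE_ENDPOINT = "/generate-tests"


def build_test_request_url(prompt: str) -> str:
    """Encode the transcribed or typed prompt as a GET query string."""
    query = urlencode({"prompt": prompt})
    return f"{AGENT_BASE_URL}{GENERATE_ENDPOINT}?{query}"


def read_prompt(url: str) -> str:
    """Recover the prompt on the agent side (inverse of the function above)."""
    params = parse_qs(urlsplit(url).query)
    return params["prompt"][0]


url = build_test_request_url("Test the Cinema project, subfolder src, in Python")
print(url)
print(read_prompt(url))
```

Percent-encoding the prompt matters because free-form speech transcripts can contain characters such as `&` or `?` that would otherwise corrupt the query string.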

Low-Level Architecture Workflow

Initialization: Client sends voice command to test generation API
Entity Extraction: LLM extracts project name, subfolder, and programming language from user prompt
File Location: FileLocator module locates specified project folder and extracts file contents
Test Generation: LLM (using Gemini) generates unit tests and corresponding rationales
Graph Generation: LLM generates DOT graph strings for call graph visualization
Execution and Reporting: Test executor runs tests; PDF report generator creates comprehensive report containing results, coverage, and call graphs
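
The workflow steps above can be sketched as a small orchestration function with one injectable callable per stage. All names here (`ExtractedRequest`, `PipelineResult`, `run_pipeline`, and the stage parameters) are hypothetical, the LLM-backed stages are replaced by stubs in the usage example, and PDF report creation is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ExtractedRequest:
    project: str
    subfolder: str
    language: str


@dataclass
class PipelineResult:
    tests: str
    dot_graph: str
    passed: bool


def run_pipeline(
    prompt: str,
    extract_entities: Callable[[str], ExtractedRequest],
    locate_files: Callable[[ExtractedRequest], Dict[str, str]],
    generate_tests: Callable[[Dict[str, str], str], str],
    generate_dot: Callable[[Dict[str, str]], str],
    execute_tests: Callable[[str, str], bool],
) -> PipelineResult:
    """Run the stages in the order given in the workflow description."""
    request = extract_entities(prompt)                   # entity extraction
    sources = locate_files(request)                      # file location
    tests = generate_tests(sources, request.language)    # test generation
    dot_graph = generate_dot(sources)                    # graph generation
    passed = execute_tests(tests, request.language)      # execution
    return PipelineResult(tests, dot_graph, passed)


# Usage with stub stages standing in for the LLM and the test executor.
result = run_pipeline(
    "Test the Cinema project in Python",
    extract_entities=lambda p: ExtractedRequest("Cinema", "", "python"),
    locate_files=lambda r: {"cinema.py": "def rent_movie(): ..."},
    generate_tests=lambda sources, lang: "# generated tests",
    generate_dot=lambda sources: "digraph G {}",
    execute_tests=lambda tests, lang: True,
)
print(result.passed)
```

Passing the stages in as callables keeps the coordinating agent independent of any particular LLM, which fits the summary's description of the agent as an abstraction layer.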

Technical Innovations

Intelligent Entity Extraction: Automatically extracts critical testing parameters from natural language instructions using LLM
Dynamic Test Generation: Automatically generates test scripts containing basic and edge cases based on code analysis
Rationale Generation: Provides detailed test rationales and coverage scenario descriptions for each test case
Integrated Visualization: Automatically generates call graphs to help understand code repository interaction relationships
End-to-End Automation: Complete automation from user input to final report
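
To show what a call graph in DOT format looks like, the sketch below renders a DOT string from a caller-to-callee mapping. In the framework the DOT string is produced by the LLM; this deterministic helper is a substitute used only for illustration, and the function names in the mapping are invented rather than taken from the Cinema project.

```python
from typing import Dict, List


def to_dot(calls: Dict[str, List[str]], name: str = "CallGraph") -> str:
    """Render a caller -> callees mapping as a DOT digraph string."""
    lines = [f"digraph {name} {{"]
    # Sort callers and callees so the output is stable across runs.
    for caller in sorted(calls):
        for callee in sorted(calls[caller]):
            lines.append(f'    "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical call relationships, for illustration only.
calls = {
    "main": ["add_member", "rent_movie"],
    "rent_movie": ["find_movie", "find_member"],
}
print(to_dot(calls))
```

The resulting text can be rendered to an image with Graphviz, for example by saving it to a file and running `dot -Tpng`.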

Experimental Setup

Dataset

Four applications of varying complexity:

Python Projects:

Experiment: Basic calculator functionality (47 lines of code)
Cinema: Movie theater management system (183 lines of code)

Java Projects:

StudentAverage: Student grade calculation (114 lines of code)
LibrarySystem: Library management system (269 lines of code)

Evaluation Metrics

Execution Success Rate: Proportion of runs completing all steps (test generation, execution, PDF report generation)
Test Coverage: Percentage of code covered by generated test cases
Execution Time: Time analysis of each operational phase
Language Comparison: Performance differences between Python and Java projects

Implementation Details

LLM Model: Primarily uses Google Gemini; comparative experiments use ChatGPT
Test Runs: 20 executions for Python projects, 24 executions for Java projects
Input Format: Testing with multiple natural language prompt formats

Experimental Results

Main Results

Success Rate Performance

Python Projects: All 20 executions successful (100% success rate)
Java Projects: 3 failures out of 24 executions (87.5% success rate)
Failure Causes: Primarily due to ambiguous prompts and generated test script compilation errors
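
The success rates above follow directly from the run counts; a short check, with `success_rate` as an illustrative helper name:

```python
def success_rate(total_runs: int, failed_runs: int) -> float:
    """Percentage of runs that completed every pipeline step."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return 100.0 * (total_runs - failed_runs) / total_runs


print(success_rate(20, 0))  # Python runs: 100.0
print(success_rate(24, 3))  # Java runs: 87.5
```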

Execution Time Analysis

Average Total Execution Time: 83.5 seconds
Test Generation Time: 62.8 seconds (largest proportion)
Folder Location: 9.7 seconds
DOT Graph Generation: 5.4 seconds
Test Execution: 3.2 seconds
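
The share of each phase in the total can be worked out from the figures above. The helper name `phase_shares` is illustrative; the numbers are copied from the list.

```python
TOTAL_SECONDS = 83.5
PHASES = {
    "Test generation": 62.8,
    "Folder location": 9.7,
    "DOT graph generation": 5.4,
    "Test execution": 3.2,
}


def phase_shares(phases, total):
    """Return each phase's share of the total time, in percent."""
    return {name: 100 * seconds / total for name, seconds in phases.items()}


for name, share in phase_shares(PHASES, TOTAL_SECONDS).items():
    print(f"{name}: {share:.1f}% of total")

# Time not attributed to any of the four listed phases.
unlisted = TOTAL_SECONDS - sum(PHASES.values())
print(f"Not itemized: {unlisted:.1f}s")
```

Test generation accounts for roughly three quarters of the total (about 75.2%), and about 2.4 seconds are not attributed to any of the four listed phases.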

Language Comparison Results

Metric	Java	Python
Average Total Execution Time	86.7s	80s
Test Generation Time	62.4s	63.3s
Test Execution Time	5.44s	0.87s
Average Test Coverage	97.71%	93.45%

Detailed Project Analysis

Project	Language	Lines of Code	Total Time	Test Generation	Test Execution	Coverage
LibrarySystem	Java	269	119.06s	92.54s	5.39s	94.67%
StudentManager	Java	114	62.55s	39.79s	5.48s	100.00%
Cinema	Python	183	110.13s	92.43s	0.79s	88.30%
Experiment	Python	47	49.78s	34.17s	0.96s	98.60%
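
As a cross-check, the per-language coverage averages can be recomputed from the per-project rows above. The values are copied from the table; `mean_coverage` is an illustrative helper name.

```python
from statistics import mean

# (project, language, lines of code, coverage %) copied from the table above.
PROJECTS = [
    ("LibrarySystem", "Java", 269, 94.67),
    ("StudentManager", "Java", 114, 100.00),
    ("Cinema", "Python", 183, 88.30),
    ("Experiment", "Python", 47, 98.60),
]


def mean_coverage(language: str) -> float:
    """Unweighted mean coverage over the projects in one language."""
    return mean(cov for _, lang, _, cov in PROJECTS if lang == language)


for language in ("Python", "Java"):
    print(f"{language}: {mean_coverage(language):.3f}%")
```

The Python mean reproduces the 93.45% in the language comparison. The unweighted Java mean of the two rows is about 97.34%, slightly below the 97.71% reported there; this summary does not state how the reported figure was aggregated.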

LLM Comparative Experiments

ChatGPT vs Gemini (LibrarySystem project):

ChatGPT generation time: ~180 seconds (approximately 2x Gemini)
ChatGPT test coverage: 98%
Note: The ChatGPT web application was used rather than the API, which may have affected the measured generation time

Case Analysis

Test Rationale Examples

Cinema Project - rent_movie Function:

Basic Case: "Test renting available movie to existing member"
Edge Cases: "Test renting non-existent movie, renting movie to non-existent member, renting already-rented movie"

Library Project - getTitle Function:

Basic Case: "Test retrieving book title after object creation"
Edge Cases: Not applicable
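
The `rent_movie` rationale above maps naturally onto one basic test and three edge-case tests. The sketch below uses a hypothetical `Cinema` class written for this example, since the project's real code is not shown in this summary; the exception types are likewise assumptions.

```python
import unittest


class Cinema:
    """Hypothetical stand-in for the Cinema project, for illustration only."""

    def __init__(self):
        self.members = set()
        self.rented_by = {}  # movie title -> member name, or None if available

    def add_member(self, name):
        self.members.add(name)

    def add_movie(self, title):
        self.rented_by[title] = None

    def rent_movie(self, title, member):
        if title not in self.rented_by:
            raise KeyError(f"unknown movie: {title}")
        if member not in self.members:
            raise KeyError(f"unknown member: {member}")
        if self.rented_by[title] is not None:
            raise ValueError(f"already rented: {title}")
        self.rented_by[title] = member


class RentMovieTests(unittest.TestCase):
    def setUp(self):
        self.cinema = Cinema()
        self.cinema.add_member("alice")
        self.cinema.add_movie("Alien")

    def test_rent_available_movie_to_existing_member(self):
        # Basic case: available movie, existing member.
        self.cinema.rent_movie("Alien", "alice")
        self.assertEqual(self.cinema.rented_by["Alien"], "alice")

    def test_rent_nonexistent_movie(self):
        # Edge case: the movie is not in the catalog.
        with self.assertRaises(KeyError):
            self.cinema.rent_movie("Missing", "alice")

    def test_rent_to_nonexistent_member(self):
        # Edge case: the member is not registered.
        with self.assertRaises(KeyError):
            self.cinema.rent_movie("Alien", "bob")

    def test_rent_already_rented_movie(self):
        # Edge case: the movie is already rented out.
        self.cinema.rent_movie("Alien", "alice")
        with self.assertRaises(ValueError):
            self.cinema.rent_movie("Alien", "alice")
```

Saved as a module, the tests can be run with `python -m unittest`.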

Related Work

Traditional Agent-Based Software Testing (ABST)

Development Timeline: Has received attention since 1999, with research activity peaking in the past decade
Application Focus: Primarily emphasizes system-level testing; Java as main target language
Representative Works:
- Automated testing framework for web systems (multi-agent collaboration)
- Industrial coffee machine testing (fuzzy logic priority sorting)

LLM-Enhanced Software Testing

Industry Survey: 48% of practitioners have integrated LLMs into testing activities
Application Domains: Requirements analysis, test plan development, test automation
Common Tools: ChatGPT, GitHub Copilot
Research Trends: Analysis of 102 related papers shows LLMs have significant value in test case generation, assertion creation, and other areas

Conclusions and Discussion

Main Conclusions

High Success Rate: LLM-driven agents perform well in automated software testing, with a 100% success rate on Python projects and 87.5% on Java projects
High Coverage: Average test coverage exceeds 93%, demonstrating the effectiveness of generated test cases
Efficiency Improvement: Significantly reduces manual intervention, achieving end-to-end automation from test generation to reporting
Language Adaptability: Framework successfully supports two mainstream programming languages: Python and Java

Limitations

Java Project Stability: Java runs failed more often (3 of 24); prompt interpretation and the syntactic accuracy of generated tests need improvement
Testing Scope Restriction: Currently focuses only on unit testing; lacks integration and system-level testing
Visualization Capabilities: While call graphs are useful, advanced features such as coverage heat maps are lacking
Input Dependency: Sensitive to prompt quality; ambiguous prompts may lead to failures

Future Directions

Extend Testing Types: Introduce support for integration and system-level testing
Enhance Language Support: Expand to more programming languages
Improve Visualization: Add heat maps for coverage and defect proneness
Requirements Integration: Incorporate requirements specification documents as prompt input to improve precision
Error Handling: Improve handling of ambiguous prompts and error recovery mechanisms

In-Depth Evaluation

Strengths

Strong Innovation: Systematically combines LLMs with a multi-agent architecture for end-to-end software testing automation
High Practical Value: Addresses real pain points in software testing domain with strong engineering application value
Comprehensive Evaluation: Cross-language, multi-project assessment with convincing results
Clear Architecture: Well-designed high and low-level architecture with high modularity, facilitating extension and maintenance

Weaknesses

Limited Testing Scope: Supports only unit testing; cannot meet complete software testing requirements
Insufficient Error Analysis: Limited in-depth analysis of Java project failure causes
Missing Benchmark Comparisons: Lacks detailed comparison with existing automated testing tools
Unverified Scalability: System scalability not validated on large, complex projects

Impact

Academic Contribution: Provides new research direction for LLM applications in software engineering
Practical Value: Can be directly applied to software development processes to improve testing efficiency
Technology Promotion: Demonstrates enormous potential of LLMs in automated testing domain
Reproducibility: Detailed architecture description facilitates reproduction and improvement by other researchers

Applicable Scenarios

Small to Medium Projects: Best suited to projects of up to a few hundred lines of code, the scale evaluated in the paper
Unit Test Automation: Can significantly reduce manual unit test writing work
Rapid Prototype Validation: Suitable for scenarios requiring quick test case generation
Education and Training: Applicable to software testing teaching and training scenarios

References

The paper cites 13 important references covering traditional ABST methods, LLM applications in software testing, and fundamental software testing theory, providing solid theoretical foundation for the research.

Overall Assessment: This is a research paper of significant value at the intersection of software engineering and artificial intelligence. Despite certain limitations, its innovative methodology and practical results open new directions for LLM applications in software test automation, demonstrating good academic value and practical prospects.