The Potential of LLMs in Automating Software Testing: From Generation to Reporting

Sherifi, Slhoub, Nembhard
High-quality software is essential in software engineering and requires robust validation and verification processes during testing activities. Manual testing, while effective, can be time-consuming and costly, leading to an increased demand for automated methods. Recent advancements in Large Language Models (LLMs) have significantly influenced software engineering, particularly in areas like requirements analysis, test automation, and debugging. This paper explores an agent-oriented approach to automated software testing, using LLMs to reduce human intervention and enhance testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluations across multiple applications in Python and Java demonstrate the system's high test coverage and efficient operation. This research underscores the potential of LLM-powered agents to streamline software testing workflows while addressing challenges in scalability and accuracy.

Basic Information

  • Paper ID: 2501.00217
  • Title: The Potential of LLMs in Automating Software Testing: From Generation to Reporting
  • Authors: Betim Sherifi, Khaled Slhoub, Fitzroy Nembhard (Florida Institute of Technology)
  • Classification: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
  • Publication Date: December 31, 2024
  • Paper Link: https://arxiv.org/abs/2501.00217

Abstract

The development of high-quality software in software engineering requires robust verification and validation processes. While manual testing is effective, it is time-consuming and costly, creating a growing demand for automated approaches. Recent advances in Large Language Models (LLMs) have significantly impacted software engineering, particularly in requirements analysis, test automation, and debugging. This paper explores an agent-oriented automated software testing approach that leverages LLMs to reduce manual intervention and improve testing efficiency. The proposed framework integrates LLMs to generate unit tests, visualize call graphs, and automate test execution and reporting. Evaluation on multiple applications in Python and Java demonstrates that the system achieves high test coverage and efficient execution.

Research Background and Motivation

Problem Definition

  1. Core Issue: Traditional software testing methods suffer from low efficiency, high costs, and excessive manual intervention
  2. Practical Need: Software quality assurance requires comprehensive verification and validation processes, but manual testing cannot meet the efficiency requirements of modern software development

Importance Analysis

  • Software testing is recognized as one of the most important fields in software engineering education
  • Manual testing methods, particularly regression testing, are especially time-consuming and expensive
  • Ensuring software products execute as expected and meet quality standards is critical to software engineering

Limitations of Existing Approaches

  • Manual Testing: Effective but time-consuming and costly
  • Traditional Automated Testing: Cannot completely replace manual methods; human involvement is still required in scenarios such as GUI testing
  • Traditional Agent-Based Software Testing (ABST): Lacks intelligent test case generation capabilities

Research Motivation

To leverage the powerful capabilities of LLMs combined with multi-agent systems to construct an intelligent testing framework capable of dynamically generating test cases, significantly reducing manual input, and minimizing the time required for test case creation and execution.

Core Contributions

  1. Proposed an LLM-based multi-agent software testing framework that achieves end-to-end automation from test generation to reporting
  2. Designed a four-layer architecture system comprising an audio web client, software testing agent, LLMs, and development environment
  3. Implemented dynamic test case generation utilizing LLMs to automatically generate customized unit tests and test rationales
  4. Integrated visualization functionality to automatically generate call graphs in DOT format to display application interactions
  5. Validated the system on Python and Java projects, reaching average test coverage of 93.45% (Python) and 97.71% (Java)

Methodology Details

Task Definition

  • Input: Testing requests provided by users via voice or text (containing project name, subfolder, programming language, etc.)
  • Output: Comprehensive PDF report containing test results, coverage analysis, test rationales, and call graphs
  • Constraints: Supports Python and Java projects, focused on unit testing level
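
The input side of this task definition can be pictured as a small validated record. The sketch below is illustrative only: the `RequestEntities` class, the `parse_request` helper, and the JSON shape of the entity-extraction reply are assumptions made for this example, not details taken from the paper.

```python
import json
from dataclasses import dataclass

# The two languages the summary says the framework supports.
SUPPORTED_LANGUAGES = {"python", "java"}


@dataclass(frozen=True)
class RequestEntities:
    """Parameters needed before tests can be generated (hypothetical structure)."""
    project: str
    subfolder: str
    language: str


def parse_request(llm_json: str) -> RequestEntities:
    """Validate a (hypothetical) JSON reply from the entity-extraction step."""
    data = json.loads(llm_json)
    language = str(data.get("language", "")).strip().lower()
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported or missing language: {language!r}")
    project = str(data.get("project", "")).strip()
    if not project:
        raise ValueError("project name is missing; the prompt may be ambiguous")
    subfolder = str(data.get("subfolder", "")).strip()
    return RequestEntities(project, subfolder, language)
```

Rejecting incomplete requests at this point is one plausible way to surface ambiguous prompts before any test generation is attempted.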

Model Architecture

High-Level Architecture

The system comprises four main components:

  1. Audio Web Client: Captures user input (voice commands or text) and initiates the testing workflow via HTTP GET requests
  2. Software Testing Agent: The core system component that coordinates interactions between components, serving as an abstraction layer for test script generation, execution, and report creation
  3. Large Language Models (LLMs): Execute entity extraction, test generation, and DOT graph generation tasks
  4. Development Environment: Provides project code access, executes generated test cases, and displays results

Low-Level Architecture Workflow

  1. Initialization: Client sends voice command to test generation API
  2. Entity Extraction: LLM extracts project name, subfolder, and programming language from user prompt
  3. File Location: FileLocator module locates specified project folder and extracts file contents
  4. Test Generation: LLM (using Gemini) generates unit tests and corresponding rationales
  5. Graph Generation: LLM generates DOT graph strings for call graph visualization
  6. Execution and Reporting: Test executor runs tests; PDF report generator creates comprehensive report containing results, coverage, and call graphs
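
The workflow above is essentially an ordered chain of steps that share state. The following is a minimal sketch of such an orchestrator, assuming each step is a plain function; the step functions are hard-coded stand-ins (no real LLM, file access, or PDF generation), and the `add` function is a hypothetical piece of code under test.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Context = Dict[str, str]
Step = Callable[[Context], Context]


def run_pipeline(
    steps: Dict[str, Step], context: Context
) -> Tuple[Context, Dict[str, float], Optional[str]]:
    """Run steps in order and time each one.

    Returns (context, timings, failed_step); failed_step is None on success.
    """
    timings: Dict[str, float] = {}
    for name, step in steps.items():  # dicts preserve insertion order
        start = time.perf_counter()
        try:
            context.update(step(context))
        except Exception:
            timings[name] = time.perf_counter() - start
            return context, timings, name
        timings[name] = time.perf_counter() - start
    return context, timings, None


# Hypothetical stand-ins for the real components.
def extract_entities(ctx: Context) -> Context:
    return {"project": "Experiment", "language": "python"}


def locate_files(ctx: Context) -> Context:
    return {"source": "def add(a, b):\n    return a + b\n"}


def generate_tests(ctx: Context) -> Context:
    return {
        "tests": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
        "rationale": "Basic case: two positives. Edge case: operands cancel out.",
    }


def generate_dot(ctx: Context) -> Context:
    return {"dot": "digraph calls { tests -> add; }"}


def execute_and_report(ctx: Context) -> Context:
    namespace: dict = {}
    exec(ctx["source"], namespace)  # load the code under test
    exec(ctx["tests"], namespace)   # raises AssertionError if a test fails
    return {"report": f"{ctx['project']}: generated tests passed"}


STEPS: Dict[str, Step] = {
    "entity_extraction": extract_entities,
    "file_location": locate_files,
    "test_generation": generate_tests,
    "graph_generation": generate_dot,
    "execution_and_reporting": execute_and_report,
}
```

Calling `run_pipeline(STEPS, {"prompt": "..."})` returns the accumulated context, the per-phase timings, and the name of the first failing step, which is the kind of information needed for success-rate and execution-time metrics.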

Technical Innovations

  1. Intelligent Entity Extraction: Automatically extracts critical testing parameters from natural language instructions using LLM
  2. Dynamic Test Generation: Automatically generates test scripts containing basic and edge cases based on code analysis
  3. Rationale Generation: Provides detailed test rationales and coverage scenario descriptions for each test case
  4. Integrated Visualization: Automatically generates call graphs to help understand code repository interaction relationships
  5. End-to-End Automation: Complete automation from user input to final report
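
To make the call-graph output concrete, here is a small sketch of what a DOT string for a call graph looks like. In the framework the LLM produces this string; the `to_dot` helper and the function names in the usage note are hypothetical and only illustrate the format.

```python
from typing import Dict, List


def to_dot(calls: Dict[str, List[str]], name: str = "call_graph") -> str:
    """Render a caller -> callees mapping as a Graphviz DOT digraph."""
    lines = [f"digraph {name} {{"]
    for caller in sorted(calls):
        for callee in sorted(calls[caller]):
            # Quote node names so identifiers with dots or spaces stay valid.
            lines.append(f'    "{caller}" -> "{callee}";')
    lines.append("}")
    return "\n".join(lines)
```

For example, `to_dot({"main": ["rent_movie", "add_member"]})` yields a three-node digraph with two edges that Graphviz tools can render.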

Experimental Setup

Dataset

Four applications of varying complexity:

Python Projects:

  • Experiment: Basic calculator functionality (47 lines of code)
  • Cinema: Movie theater management system (183 lines of code)

Java Projects:

  • StudentAverage: Student grade calculation (114 lines of code)
  • LibrarySystem: Library management system (269 lines of code)

Evaluation Metrics

  1. Execution Success Rate: Proportion of runs completing all steps (test generation, execution, PDF report generation)
  2. Test Coverage: Percentage of code covered by generated test cases
  3. Execution Time: Time analysis of each operational phase
  4. Language Comparison: Performance differences between Python and Java projects
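
The first two metrics are simple ratios. A minimal sketch follows, assuming coverage is measured per line (the summary does not say which coverage criterion the paper uses); both function names are illustrative.

```python
def success_rate(successful_runs: int, total_runs: int) -> float:
    """Percentage of runs that completed generation, execution, and reporting."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    return 100.0 * successful_runs / total_runs


def coverage_percent(covered_lines: int, total_lines: int) -> float:
    """Percentage of executable lines exercised by the tests (line coverage assumed)."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return round(100.0 * covered_lines / total_lines, 2)
```

With the run counts reported in this summary, `success_rate(21, 24)` gives 87.5 for Java and `success_rate(20, 20)` gives 100.0 for Python.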

Implementation Details

  • LLM Model: Primarily uses Google Gemini; comparative experiments use ChatGPT
  • Test Runs: 20 executions for Python projects, 24 executions for Java projects
  • Input Format: Testing with multiple natural language prompt formats

Experimental Results

Main Results

Success Rate Performance

  • Python Projects: All 20 executions successful (100% success rate)
  • Java Projects: 3 failures out of 24 executions (87.5% success rate)
  • Failure Causes: Primarily due to ambiguous prompts and generated test script compilation errors

Execution Time Analysis

  • Average Total Execution Time: 83.5 seconds
  • Test Generation Time: 62.8 seconds (largest proportion)
  • Folder Location: 9.7 seconds
  • DOT Graph Generation: 5.4 seconds
  • Test Execution: 3.2 seconds
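
As a quick arithmetic check on these figures, the sketch below computes each phase's share of the average total. The numbers are the ones listed above; the variable names are illustrative.

```python
# Average phase times in seconds, as listed in this summary.
PHASES = {
    "test generation": 62.8,
    "folder location": 9.7,
    "DOT graph generation": 5.4,
    "test execution": 3.2,
}
TOTAL_SECONDS = 83.5

# Share of the average total run time spent in each listed phase.
shares = {name: round(100 * t / TOTAL_SECONDS, 1) for name, t in PHASES.items()}

# Time not attributed to any of the four listed phases.
remainder = round(TOTAL_SECONDS - sum(PHASES.values()), 1)
```

Test generation accounts for roughly 75% of the total, and about 2.4 seconds are not attributed to any of the four listed phases.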

Language Comparison Results

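| Metric                       | Java   | Python |
|------------------------------|--------|--------|
| Average Total Execution Time | 86.7 s | 80 s   |
| Test Generation Time         | 62.4 s | 63.3 s |
| Test Execution Time          | 5.44 s | 0.87 s |
| Average Test Coverage        | 97.71% | 93.45% |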
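
A short calculation on the comparison figures above shows where the two languages differ. The values are taken from the comparison; the variable names are illustrative.

```python
# Average times in seconds from the language comparison.
java = {"total": 86.7, "generation": 62.4, "execution": 5.44}
python = {"total": 80.0, "generation": 63.3, "execution": 0.87}

# How many times longer Java test execution takes than Python test execution.
execution_ratio = round(java["execution"] / python["execution"], 2)

# Absolute gaps in seconds (Java minus Python).
total_gap = round(java["total"] - python["total"], 2)
execution_gap = round(java["execution"] - python["execution"], 2)
generation_gap = round(java["generation"] - python["generation"], 2)
```

Java test execution is about 6.25 times slower than Python's, and that 4.57-second difference makes up most of the 6.7-second gap in total time, while generation times differ by less than a second.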

Detailed Project Analysis

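| Project        | Language | Lines of Code | Total Time | Test Generation | Test Execution | Coverage |
|----------------|----------|---------------|------------|-----------------|----------------|----------|
| LibrarySystem  | Java     | 269           | 119.06 s   | 92.54 s         | 5.39 s         | 94.67%   |
| StudentManager | Java     | 114           | 62.55 s    | 39.79 s         | 5.48 s         | 100.00%  |
| Cinema         | Python   | 183           | 110.13 s   | 92.43 s         | 0.79 s         | 88.30%   |
| Experiment     | Python   | 47            | 49.78 s    | 34.17 s         | 0.96 s         | 98.60%   |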

LLM Comparative Experiments

ChatGPT vs Gemini (LibrarySystem project):

  • ChatGPT generation time: ~180 seconds (approximately 2x Gemini)
  • ChatGPT test coverage: 98%
  • Note: ChatGPT was accessed through its web application rather than the API, which may have affected generation time

Case Analysis

Test Rationale Examples

Cinema Project - rent_movie Function:

  • Basic Case: "Test renting available movie to existing member"
  • Edge Cases: "Test renting non-existent movie, renting movie to non-existent member, renting already-rented movie"

Library Project - getTitle Function:

  • Basic Case: "Test retrieving book title after object creation"
  • Edge Cases: Not applicable
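
The rent_movie rationales map naturally onto one test per scenario. The sketch below shows what such a generated test file might look like; the `Cinema` class here is a hypothetical stand-in written for this example, not the project code evaluated in the paper, and the exception types are assumptions.

```python
import unittest


class Cinema:
    """Hypothetical stand-in for the Cinema project (not the paper's code)."""

    def __init__(self):
        self.members = set()
        self.movies = {}  # title -> member currently renting it, or None

    def add_member(self, name):
        self.members.add(name)

    def add_movie(self, title):
        self.movies[title] = None

    def rent_movie(self, title, member):
        if title not in self.movies:
            raise KeyError(f"unknown movie: {title}")
        if member not in self.members:
            raise KeyError(f"unknown member: {member}")
        if self.movies[title] is not None:
            raise ValueError(f"{title} is already rented")
        self.movies[title] = member


class RentMovieTests(unittest.TestCase):
    def setUp(self):
        self.cinema = Cinema()
        self.cinema.add_member("alice")
        self.cinema.add_movie("Metropolis")

    def test_rent_available_movie_to_existing_member(self):
        # Basic case
        self.cinema.rent_movie("Metropolis", "alice")
        self.assertEqual(self.cinema.movies["Metropolis"], "alice")

    def test_rent_nonexistent_movie(self):
        # Edge case: the movie is not in the catalogue
        with self.assertRaises(KeyError):
            self.cinema.rent_movie("Unknown Film", "alice")

    def test_rent_to_nonexistent_member(self):
        # Edge case: the member is not registered
        with self.assertRaises(KeyError):
            self.cinema.rent_movie("Metropolis", "bob")

    def test_rent_already_rented_movie(self):
        # Edge case: the movie is already out on rent
        self.cinema.rent_movie("Metropolis", "alice")
        with self.assertRaises(ValueError):
            self.cinema.rent_movie("Metropolis", "alice")
```

Saved as a module, these tests can be run with `python -m unittest`; a simple getter such as getTitle would need only the single basic case.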

Related Work

Traditional Agent-Based Software Testing (ABST)

  • Development Timeline: Has received attention since 1999, with research activity peaking in the past decade
  • Application Focus: Primarily emphasizes system-level testing; Java as main target language
  • Representative Works:
    • Automated testing framework for web systems (multi-agent collaboration)
    • Industrial coffee machine testing (fuzzy logic priority sorting)

LLM-Enhanced Software Testing

  • Industry Survey: 48% of practitioners have integrated LLMs into testing activities
  • Application Domains: Requirements analysis, test plan development, test automation
  • Common Tools: ChatGPT, GitHub Copilot
  • Research Trends: Analysis of 102 related papers shows LLMs have significant value in test case generation, assertion creation, and other areas

Conclusions and Discussion

Main Conclusions

  1. High Success Rate: LLM-driven agents demonstrate excellent performance in automated software testing, achieving 100% success rate on Python projects
  2. High Coverage: Average test coverage exceeds 93%, demonstrating the effectiveness of generated test cases
  3. Efficiency Improvement: Significantly reduces manual intervention, achieving end-to-end automation from test generation to reporting
  4. Language Adaptability: Framework successfully supports two mainstream programming languages: Python and Java

Limitations

  1. Java Project Stability: Java runs failed more often (3 of 24); prompt interpretation and the syntactic correctness of generated test scripts need improvement
  2. Testing Scope Restriction: Currently focuses only on unit testing; lacks integration and system-level testing
  3. Visualization Capabilities: While call graphs are useful, advanced features such as heat map coverage are lacking
  4. Input Dependency: Sensitive to prompt quality; ambiguous prompts may lead to failures

Future Directions

  1. Extend Testing Types: Introduce support for integration and system-level testing
  2. Enhance Language Support: Expand to more programming languages
  3. Improve Visualization: Add heat maps showing coverage and defect proneness
  4. Requirements Integration: Incorporate requirements specification documents as prompt input to improve precision
  5. Error Handling: Improve handling of ambiguous prompts and error recovery mechanisms

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic combination of LLM with multi-agent architecture for end-to-end software testing automation
  2. High Practical Value: Addresses real pain points in software testing domain with strong engineering application value
  3. Comprehensive Evaluation: Cross-language, multi-project assessment with convincing results
  4. Clear Architecture: Well-designed high and low-level architecture with high modularity, facilitating extension and maintenance

Weaknesses

  1. Limited Testing Scope: Supports only unit testing; cannot meet complete software testing requirements
  2. Insufficient Error Analysis: Limited in-depth analysis of Java project failure causes
  3. Missing Benchmark Comparisons: Lacks detailed comparison with existing automated testing tools
  4. Unverified Scalability: System scalability not validated on large, complex projects

Impact

  1. Academic Contribution: Provides new research direction for LLM applications in software engineering
  2. Practical Value: Can be directly applied to software development processes to improve testing efficiency
  3. Technology Promotion: Demonstrates enormous potential of LLMs in automated testing domain
  4. Reproducibility: Detailed architecture description facilitates reproduction and improvement by other researchers

Applicable Scenarios

  1. Small to Medium Projects: Particularly suitable for projects of up to a few hundred lines of code, the scale covered by the evaluation (47 to 269 lines)
  2. Unit Test Automation: Can significantly reduce manual unit test writing work
  3. Rapid Prototype Validation: Suitable for scenarios requiring quick test case generation
  4. Education and Training: Applicable to software testing teaching and training scenarios

References

The paper cites 13 important references covering traditional ABST methods, LLM applications in software testing, and fundamental software testing theory, providing solid theoretical foundation for the research.


Overall Assessment: This is a research paper of significant value at the intersection of software engineering and artificial intelligence. Despite certain limitations, its innovative methodology and practical results open new directions for LLM applications in software test automation, demonstrating good academic value and practical prospects.