2025-11-16T06:52:11.231184

VerilogReader: LLM-Aided Hardware Test Generation

Ma, Yang, Liu et al.
Test generation has been a critical and labor-intensive process in hardware design verification. Recently, the emergence of Large Language Model (LLM) with their advanced understanding and inference capabilities, has introduced a novel approach. In this work, we investigate the integration of LLM into the Coverage Directed Test Generation (CDG) process, where the LLM functions as a Verilog Reader. It accurately grasps the code logic, thereby generating stimuli that can reach unexplored code branches. We compare our framework with random testing, using our self-designed Verilog benchmark suite. Experiments demonstrate that our framework outperforms random testing on designs within the LLM's comprehension scope. Our work also proposes prompt engineering optimizations to augment LLM's understanding scope and accuracy.
academic

VerilogReader: LLM-Aided Hardware Test Generation

Basic Information

Abstract

Test generation has been a critical and labor-intensive process in hardware design verification. In recent years, Large Language Models (LLMs), leveraging their advanced comprehension and reasoning capabilities, have introduced novel approaches to this domain. This research explores the integration of LLMs into coverage-directed test generation (CDG) processes, where LLMs serve as Verilog code readers to accurately understand code logic and generate stimuli capable of reaching unexplored code branches. Using a custom-designed Verilog benchmark suite, the authors compare this framework against random testing. Experimental results demonstrate that the framework outperforms random testing on designs within the LLM's comprehension scope, and propose prompt engineering optimizations to enhance the LLM's understanding range and accuracy.

Research Background and Motivation

Problem Background

  1. Importance of Hardware Verification: With the exponential growth in hardware complexity, hardware verification has become increasingly critical in the development process. Undetected hardware errors can lead to severe consequences and substantial economic losses.
  2. Existing Verification Methods: Engineers primarily employ two verification approaches:
    • Formal verification: Using mathematical techniques to prove system correctness
    • Dynamic verification: Generating diverse test cases to simulate the design under test (DUT)
  3. Test Generation Challenges: Achieving coverage objectives requires high-quality test inputs, imposing significant manual burden on verification engineers.

Research Motivation

  1. Automation Requirements: To reduce manual intervention, coverage-directed test generation (CDG) has become a key technology for automating hardware test generation.
  2. LLM Opportunities: The powerful capabilities of LLMs in comprehension and reasoning provide new opportunities for hardware test generation.
  3. Differentiated Positioning: Unlike previous research focusing on functional coverage points, this paper focuses on code coverage—a more fundamental testing objective—positioning LLMs as "VerilogReaders."

Core Contributions

  1. Open Source Framework: First to open-source a framework integrating LLMs into the CDG process, utilizing LLMs as VerilogReaders to understand Verilog code and coverage, aiming to generate tests that achieve code coverage closure.
  2. Prompt Optimization Modules: Proposes Coverage Explainer and DUT Explainer modules to enrich prompts, enhancing LLM understanding of designs and test intentions, improving framework scalability.
  3. Benchmark Test Suite: Creates a benchmark suite containing 24 Verilog designs at simple, intermediate, and complex levels. Experiments demonstrate that the framework outperforms random testing on simple and intermediate-level DUTs.
  4. Capability Boundary Exploration: Clearly delineates the maximum capability boundaries of current LLMs in Verilog reading.

Methodology Details

Task Definition

The task is to leverage LLMs to understand Verilog code logic and current coverage status, generating multi-cycle input stimuli capable of triggering uncovered code branches to achieve improved code coverage.

Model Architecture

Basic Framework

The entire framework employs an iterative CDG process containing the following core components:

  1. LLM Core: Generates multi-cycle inputs in JSON format at each iteration
  2. Input Decoder: Decodes JSON-formatted inputs into hardware stimuli
  3. Coverage Monitor: Provides current code coverage information
  4. Explainer Modules: Including Coverage Explainer and DUT Explainer

Prompt Generator

Employs a two-round question-answer mechanism:

  • First Round: Informs the LLM of test objectives, provides DUT information and coverage data, LLM responds in natural language
  • Second Round: Guides the LLM to convert responses into standardized JSON format

Coverage Explainer

Converts complex coverage reports into LLM-comprehensible formats:

  1. Raw Format: Verilator coverage reports containing unique identifiers and hit counts
  2. Annotated Format: Reports created using the verilator_coverage tool with annotations
  3. LLM-Readable Format: Specially designed format adding 'TO BE COVERED' flags for uncovered lines

DUT Explainer

Provides two functionalities to enhance LLM understanding of DUTs:

  1. Design Description: Provides natural language explanations of DUT functionality and internal logic
  2. Test Guidance: Provides supplementary information for creating specific DUT tests and basic test logic rules

Technical Innovations

  1. Code Coverage Focus: First application of LLMs to hardware code coverage rather than functional coverage points
  2. Phased Processing: Decomposes test generation into DUT understanding and input logic reasoning stages
  3. Natural Language Tagging: Uses natural language tags for uncovered lines, simplifying LLM reasoning
  4. Dual-Round Interaction: Promotes step-by-step reasoning through two-round question-answer exchanges

Experimental Setup

Dataset

Custom-built benchmark suite containing 24 Verilog designs across three difficulty levels:

  1. Simple Level (s01-s10): 10 basic combinational logic circuits (multiplexers, ALUs, etc.)
  2. Intermediate Level (m01-m08): 8 sequential logic circuits (FSMs, counters, arbiters, etc.)
  3. Complex Level (c01-c06): 6 large-scale FSM circuits (16-128 states)

Evaluation Metrics

  • Primary Metric: Total length of input stimuli required to achieve full coverage (measured in clock cycles)
  • Coverage: Line coverage percentage
  • Time Efficiency: Number of iterations to reach target coverage

Comparison Methods

  • Random Testing: Baseline comparison method
  • Different Coverage Report Formats: Raw format, annotated format, LLM-readable format

Implementation Details

  • Language Models: OpenAI GPT-4 and GPT-4-Turbo-0125
  • Simulator: Verilator
  • Code Parsing: Pyverilog
  • Experimental Repetition: Each experiment repeated 5 times to account for LLM stochasticity

Experimental Results

Main Results

Coverage Explanation Method Comparison

Experimental results on intermediate-level DUTs using GPT-4 show:

  • LLM-readable coverage reports significantly outperform raw and annotated formats
  • Raw unreadable coverage reports present the greatest challenge to LLMs
  • Natural language tagging with 'TO BE COVERED' flags effectively improves comprehension

Comparison with Random Testing

Experimental results on simple and intermediate-level DUTs demonstrate:

  • LLM framework achieves 100% coverage using significantly fewer inputs
  • Random testing frequently fails to achieve full coverage within one minute on sequential designs with hard-to-reach branches
  • GPT-4 and GPT-4-Turbo perform similarly on hardware test generation tasks

DUT Explanation Optimization Effects

  • Design Description: LLM-generated design descriptions improve design understanding during test generation
  • Test Guidance: Effects are inconsistent, sometimes reducing input diversity on certain designs (m05, m06)

LLM Scalability Analysis

Experimental results on complex-level FSMs show:

  • 16-state FSM: Approaches 100% line coverage after 20 iterations
  • 64+ state FSM: Coverage cannot exceed 50%
  • Test generation quality degrades sharply as DUT scale increases

Experimental Findings

  1. LLMs perform excellently on simple and moderately complex designs
  2. Current LLMs exhibit significant limitations when handling large-scale Verilog designs
  3. Appropriate prompt engineering can substantially improve LLM performance
  4. Coverage report format is critical for LLM comprehension

Traditional CDG Methods

  • Explore input space using heuristic-based approaches
  • Use coverage state as fundamental feedback for generating new test cases
  • Leverage circuit structure information (control/data flow graphs, module connection diagrams) to guide test generation

LLM Applications in Hardware Domain

  • RTL Writing: Automatic generation of hardware description language code
  • Assertion Generation: Generation of verification assertions
  • Error Fixing: Fixing errors in hardware designs
  • Functional Verification: Functional coverage point verification work by Zhang et al.

Differentiated Positioning of This Work

Compared to prior work focusing on high-level functional test plan descriptions, this paper focuses on more fundamental code coverage objectives, requiring LLMs to deeply understand Verilog code logic.

Conclusions and Discussion

Main Conclusions

  1. LLMs can effectively understand simple and moderately complex Verilog designs and generate targeted test inputs
  2. Appropriate prompt engineering (particularly coverage explanation) is critical for LLM performance
  3. Current LLMs exhibit significant limitations when handling large-scale, complex hardware designs

Limitations

  1. Scale Constraints: LLM performance significantly degrades for complex designs exceeding 100 lines of code
  2. Structural Understanding: LLMs struggle to understand the highly structured nature of Verilog
  3. End-to-End Application: Direct application of LLMs in industrial-grade hardware design remains challenging

Future Directions

  1. Enhanced DUT Explainer: Provide more comprehensive high-level design abstractions and verification intentions
  2. Multimodal Fusion: Combine LLMs and Graph Neural Networks (GNNs), leveraging LLMs for semantic interpretation and GNNs for structural information processing
  3. Hierarchical Processing: Guide LLMs from macroscopic perspectives to handle test generation tasks

In-Depth Evaluation

Strengths

  1. Strong Innovation: First application of LLMs to hardware code coverage test generation with clear positioning
  2. Systematic Approach: Proposes a complete framework with multiple optimization modules
  3. Comprehensive Experiments: Constructs multi-level benchmark suite with well-designed experiments
  4. Open Source Contribution: Provides open-source framework promoting field development
  5. Boundary Exploration: Clearly delineates current LLM capability boundaries

Weaknesses

  1. Scale Limitations: Effective only on relatively simple designs, with significant gaps from practical industrial applications
  2. Benchmark Constraints: Custom benchmarks may lack comprehensiveness, missing comparisons with industrial standard benchmarks
  3. Cost Analysis: Lacks detailed comparison of LLM invocation costs versus traditional methods
  4. Theoretical Analysis: Lacks in-depth theoretical analysis of why LLMs are effective for this task

Impact

  1. Academic Value: Opens new application directions for LLMs in hardware verification
  2. Practical Potential: Possesses practical application value in educational settings and medium-scale design verification
  3. Inspirational Value: Provides valuable baselines and insights for subsequent research

Applicable Scenarios

  1. Educational Environments: Suitable for automated test generation in hardware design courses
  2. Prototype Verification: Applicable to rapid verification in early design phases
  3. Small to Medium-Scale Designs: Practical value for Verilog modules under 100 lines
  4. Auxiliary Tool: Can serve as an auxiliary tool for verification engineers, reducing workload

References

The paper cites 19 relevant references covering formal verification, dynamic verification, CDG methods, and LLM applications in hardware, providing a solid theoretical foundation for the research.


Overall Assessment: This is an innovative work at the intersection of LLMs and hardware verification. While it has limitations in scalability, it provides valuable exploration and foundation for field development. The paper demonstrates systematic methodology, comprehensive experiments, and clear open-source contributions, possessing good academic value and inspirational significance.