2025-11-16T06:52:11.231184

VerilogReader: LLM-Aided Hardware Test Generation

Ma, Yang, Liu et al.

Test generation has been a critical and labor-intensive process in hardware design verification. Recently, the emergence of Large Language Model (LLM) with their advanced understanding and inference capabilities, has introduced a novel approach. In this work, we investigate the integration of LLM into the Coverage Directed Test Generation (CDG) process, where the LLM functions as a Verilog Reader. It accurately grasps the code logic, thereby generating stimuli that can reach unexplored code branches. We compare our framework with random testing, using our self-designed Verilog benchmark suite. Experiments demonstrate that our framework outperforms random testing on designs within the LLM's comprehension scope. Our work also proposes prompt engineering optimizations to augment LLM's understanding scope and accuracy.

academic

VerilogReader: LLM-Aided Hardware Test Generation

Basic Information

Paper ID: 2406.04373
Title: VerilogReader: LLM-Aided Hardware Test Generation
Authors: Ruiyang Ma, Yuxin Yang, Ziqian Liu, Jiaxi Zhang, Min Li, Junhua Huang, Guojie Luo
Categories: cs.SE cs.AI
Publication Date: June 3, 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2406.04373
Open Source Code: https://github.com/magicYang1573/llm-hardware-test-generation

Abstract

Test generation has been a critical and labor-intensive process in hardware design verification. In recent years, Large Language Models (LLMs), leveraging their advanced comprehension and reasoning capabilities, have introduced novel approaches to this domain. This research explores the integration of LLMs into coverage-directed test generation (CDG) processes, where LLMs serve as Verilog code readers to accurately understand code logic and generate stimuli capable of reaching unexplored code branches. Using a custom-designed Verilog benchmark suite, the authors compare this framework against random testing. Experimental results demonstrate that the framework outperforms random testing on designs within the LLM's comprehension scope, and propose prompt engineering optimizations to enhance the LLM's understanding range and accuracy.

Research Background and Motivation

Problem Background

Importance of Hardware Verification: With the exponential growth in hardware complexity, hardware verification has become increasingly critical in the development process. Undetected hardware errors can lead to severe consequences and substantial economic losses.
Existing Verification Methods: Engineers primarily employ two verification approaches:
- Formal verification: Using mathematical techniques to prove system correctness
- Dynamic verification: Generating diverse test cases to simulate the design under test (DUT)
Test Generation Challenges: Achieving coverage objectives requires high-quality test inputs, imposing significant manual burden on verification engineers.

Research Motivation

Automation Requirements: To reduce manual intervention, coverage-directed test generation (CDG) has become a key technology for automating hardware test generation.
LLM Opportunities: The powerful capabilities of LLMs in comprehension and reasoning provide new opportunities for hardware test generation.
Differentiated Positioning: Unlike previous research focusing on functional coverage points, this paper focuses on code coverage—a more fundamental testing objective—positioning LLMs as "VerilogReaders."

Core Contributions

Open Source Framework: First to open-source a framework integrating LLMs into the CDG process, utilizing LLMs as VerilogReaders to understand Verilog code and coverage, aiming to generate tests that achieve code coverage closure.
Prompt Optimization Modules: Proposes Coverage Explainer and DUT Explainer modules to enrich prompts, enhancing LLM understanding of designs and test intentions, improving framework scalability.
Benchmark Test Suite: Creates a benchmark suite containing 24 Verilog designs at simple, intermediate, and complex levels. Experiments demonstrate that the framework outperforms random testing on simple and intermediate-level DUTs.
Capability Boundary Exploration: Clearly delineates the maximum capability boundaries of current LLMs in Verilog reading.

Methodology Details

Task Definition

The task is to leverage LLMs to understand Verilog code logic and current coverage status, generating multi-cycle input stimuli capable of triggering uncovered code branches to achieve improved code coverage.

Model Architecture

Basic Framework

The entire framework employs an iterative CDG process containing the following core components:

LLM Core: Generates multi-cycle inputs in JSON format at each iteration
Input Decoder: Decodes JSON-formatted inputs into hardware stimuli
Coverage Monitor: Provides current code coverage information
Explainer Modules: Including Coverage Explainer and DUT Explainer

Prompt Generator

Employs a two-round question-answer mechanism:

First Round: Informs the LLM of test objectives, provides DUT information and coverage data, LLM responds in natural language
Second Round: Guides the LLM to convert responses into standardized JSON format

Coverage Explainer

Converts complex coverage reports into LLM-comprehensible formats:

Raw Format: Verilator coverage reports containing unique identifiers and hit counts
Annotated Format: Reports created using the verilator_coverage tool with annotations
LLM-Readable Format: Specially designed format adding 'TO BE COVERED' flags for uncovered lines

DUT Explainer

Provides two functionalities to enhance LLM understanding of DUTs:

Design Description: Provides natural language explanations of DUT functionality and internal logic
Test Guidance: Provides supplementary information for creating specific DUT tests and basic test logic rules

Technical Innovations

Code Coverage Focus: First application of LLMs to hardware code coverage rather than functional coverage points
Phased Processing: Decomposes test generation into DUT understanding and input logic reasoning stages
Natural Language Tagging: Uses natural language tags for uncovered lines, simplifying LLM reasoning
Dual-Round Interaction: Promotes step-by-step reasoning through two-round question-answer exchanges

Experimental Setup

Dataset

Custom-built benchmark suite containing 24 Verilog designs across three difficulty levels:

Simple Level (s01-s10): 10 basic combinational logic circuits (multiplexers, ALUs, etc.)
Intermediate Level (m01-m08): 8 sequential logic circuits (FSMs, counters, arbiters, etc.)
Complex Level (c01-c06): 6 large-scale FSM circuits (16-128 states)

Evaluation Metrics

Primary Metric: Total length of input stimuli required to achieve full coverage (measured in clock cycles)
Coverage: Line coverage percentage
Time Efficiency: Number of iterations to reach target coverage

Comparison Methods

Random Testing: Baseline comparison method
Different Coverage Report Formats: Raw format, annotated format, LLM-readable format

Implementation Details

Language Models: OpenAI GPT-4 and GPT-4-Turbo-0125
Simulator: Verilator
Code Parsing: Pyverilog
Experimental Repetition: Each experiment repeated 5 times to account for LLM stochasticity

Experimental Results

Main Results

Coverage Explanation Method Comparison

Experimental results on intermediate-level DUTs using GPT-4 show:

LLM-readable coverage reports significantly outperform raw and annotated formats
Raw unreadable coverage reports present the greatest challenge to LLMs
Natural language tagging with 'TO BE COVERED' flags effectively improves comprehension

Comparison with Random Testing

Experimental results on simple and intermediate-level DUTs demonstrate:

LLM framework achieves 100% coverage using significantly fewer inputs
Random testing frequently fails to achieve full coverage within one minute on sequential designs with hard-to-reach branches
GPT-4 and GPT-4-Turbo perform similarly on hardware test generation tasks

DUT Explanation Optimization Effects

Design Description: LLM-generated design descriptions improve design understanding during test generation
Test Guidance: Effects are inconsistent, sometimes reducing input diversity on certain designs (m05, m06)

LLM Scalability Analysis

Experimental results on complex-level FSMs show:

16-state FSM: Approaches 100% line coverage after 20 iterations
64+ state FSM: Coverage cannot exceed 50%
Test generation quality degrades sharply as DUT scale increases

Experimental Findings

LLMs perform excellently on simple and moderately complex designs
Current LLMs exhibit significant limitations when handling large-scale Verilog designs
Appropriate prompt engineering can substantially improve LLM performance
Coverage report format is critical for LLM comprehension

Traditional CDG Methods

Explore input space using heuristic-based approaches
Use coverage state as fundamental feedback for generating new test cases
Leverage circuit structure information (control/data flow graphs, module connection diagrams) to guide test generation

LLM Applications in Hardware Domain

RTL Writing: Automatic generation of hardware description language code
Assertion Generation: Generation of verification assertions
Error Fixing: Fixing errors in hardware designs
Functional Verification: Functional coverage point verification work by Zhang et al.

Differentiated Positioning of This Work

Compared to prior work focusing on high-level functional test plan descriptions, this paper focuses on more fundamental code coverage objectives, requiring LLMs to deeply understand Verilog code logic.

Conclusions and Discussion

Main Conclusions

LLMs can effectively understand simple and moderately complex Verilog designs and generate targeted test inputs
Appropriate prompt engineering (particularly coverage explanation) is critical for LLM performance
Current LLMs exhibit significant limitations when handling large-scale, complex hardware designs

Limitations

Scale Constraints: LLM performance significantly degrades for complex designs exceeding 100 lines of code
Structural Understanding: LLMs struggle to understand the highly structured nature of Verilog
End-to-End Application: Direct application of LLMs in industrial-grade hardware design remains challenging

Future Directions

Enhanced DUT Explainer: Provide more comprehensive high-level design abstractions and verification intentions
Multimodal Fusion: Combine LLMs and Graph Neural Networks (GNNs), leveraging LLMs for semantic interpretation and GNNs for structural information processing
Hierarchical Processing: Guide LLMs from macroscopic perspectives to handle test generation tasks

In-Depth Evaluation

Strengths

Strong Innovation: First application of LLMs to hardware code coverage test generation with clear positioning
Systematic Approach: Proposes a complete framework with multiple optimization modules
Comprehensive Experiments: Constructs multi-level benchmark suite with well-designed experiments
Open Source Contribution: Provides open-source framework promoting field development
Boundary Exploration: Clearly delineates current LLM capability boundaries

Weaknesses

Scale Limitations: Effective only on relatively simple designs, with significant gaps from practical industrial applications
Benchmark Constraints: Custom benchmarks may lack comprehensiveness, missing comparisons with industrial standard benchmarks
Cost Analysis: Lacks detailed comparison of LLM invocation costs versus traditional methods
Theoretical Analysis: Lacks in-depth theoretical analysis of why LLMs are effective for this task

Impact

Academic Value: Opens new application directions for LLMs in hardware verification
Practical Potential: Possesses practical application value in educational settings and medium-scale design verification
Inspirational Value: Provides valuable baselines and insights for subsequent research

Applicable Scenarios

Educational Environments: Suitable for automated test generation in hardware design courses
Prototype Verification: Applicable to rapid verification in early design phases
Small to Medium-Scale Designs: Practical value for Verilog modules under 100 lines
Auxiliary Tool: Can serve as an auxiliary tool for verification engineers, reducing workload

References

The paper cites 19 relevant references covering formal verification, dynamic verification, CDG methods, and LLM applications in hardware, providing a solid theoretical foundation for the research.

Overall Assessment: This is an innovative work at the intersection of LLMs and hardware verification. While it has limitations in scalability, it provides valuable exploration and foundation for field development. The paper demonstrates systematic methodology, comprehensive experiments, and clear open-source contributions, possessing good academic value and inspirational significance.