
LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models

Huang, Zhao, Chen

The rapid development of large language models (LLMs) has revolutionized software testing, particularly fuzz testing, by automating the generation of diverse and effective test inputs. This advancement holds great promise for improving software reliability. Meanwhile, the introduction of MOJO, a high-performance AI programming language blending Python's usability with the efficiency of C and C++, presents new opportunities to enhance AI model scalability and programmability. However, as a new language, MOJO lacks comprehensive testing frameworks and a sufficient corpus for LLM-based testing, which exacerbates model hallucination. In this setting, LLMs tend to generate syntactically valid but semantically incorrect code, significantly reducing the effectiveness of fuzz testing. To address this challenge, we propose MOJOFuzzer, the first adaptive LLM-based fuzzing framework designed for zero-shot learning environments of emerging programming languages. MOJOFuzzer integrates a multi-phase framework that systematically eliminates low-quality generated inputs before execution, significantly improving test case validity. Furthermore, MOJOFuzzer dynamically adapts LLM prompts based on runtime feedback for test case mutation, enabling an iterative learning process that continuously enhances fuzzing efficiency and bug detection performance. Our experimental results demonstrate that MOJOFuzzer significantly enhances test validity, API coverage, and bug detection performance, outperforming traditional fuzz testing and state-of-the-art LLM-based fuzzing approaches. Using MOJOFuzzer, we have conducted the first large-scale fuzz testing evaluation of MOJO, uncovering 13 previously unknown bugs. This study not only advances the field of LLM-driven software testing but also establishes a foundational methodology for leveraging LLMs in the testing of emerging programming languages.

Basic Information

Paper ID: 2510.10179
Title: LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
Authors: Linghan Huang, Peizhou Zhao, Huaming Chen (University of Sydney)
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10179

Abstract

The rapid advancement of Large Language Models (LLMs) has revolutionized software testing, particularly fuzzing, through automatic generation of diverse and effective test inputs. Concurrently, the introduction of MOJO—a high-performance AI programming language that combines Python's ease of use with C/C++ efficiency—presents new opportunities for enhancing AI model scalability and programmability. However, as an emerging language, MOJO lacks comprehensive testing frameworks and sufficient LLM training corpora, exacerbating model hallucination issues. To address this challenge, this paper proposes MOJOFuzzer, the first adaptive LLM fuzzing framework designed for zero-shot learning environments in emerging programming languages. Experimental results demonstrate that MOJOFuzzer significantly outperforms traditional fuzzing and state-of-the-art LLM-based fuzzing methods in testing effectiveness, API coverage, and error detection performance, successfully discovering 13 previously unknown errors in MOJO.

Research Background and Motivation

Core Problem

The core problem addressed by this research is fuzz testing for emerging programming languages, particularly how to test effectively in zero-shot learning environments that lack sufficient training data.

Problem Significance

AI Development Requirements: With AI's widespread application in critical domains such as autonomous driving, medical diagnosis, and financial services, efficient programming languages are essential
MOJO Language Potential: MOJO is reported to run up to 68,000 times faster than Python, making it a crucial tool for AI development
Missing Testing Frameworks: As an emerging language, MOJO lacks mature testing frameworks, so software errors and security vulnerabilities can remain undiscovered

Limitations of Existing Approaches

Traditional LLM Fuzzers rely on large amounts of domain-specific training data, limiting their applicability to emerging languages
Model Hallucination Problem: In zero-shot environments, LLMs tend to generate syntactically correct but semantically erroneous code
Lack of Specificity: Existing tools are not specifically optimized for MOJO language characteristics

Research Motivation

Develop the first LLM fuzzing framework specifically designed for MOJO, leveraging innovative prompt engineering and fine-tuning techniques to achieve effective error detection in zero-shot learning environments.

Core Contributions

First Zero-Shot LLM Fuzzing Framework: MOJOFuzzer is the first LLM-driven fuzzing framework designed for zero-shot learning environments, effectively mitigating LLM hallucination issues
Multi-Stage Quality Control Mechanism: Integrates systematic low-quality input filtering mechanisms, significantly improving test case validity
Adaptive Mutation Strategy: Dynamically adjusts LLM prompts based on runtime feedback, enabling iterative learning processes
Practical Error Discovery: Successfully discovered 13 previously unknown errors in MOJO, with 9 confirmed and fixed by the official team
Significant Performance Improvements: Substantially outperforms existing methods in test validity (98%), API coverage (77.3%), and error detection capability

Methodology Details

Task Definition

Input: MOJO programming language environment with limited syntax rules and historical error reports
Output: Valid test cases capable of triggering MOJO errors
Constraints: Zero-shot learning environment without extensive MOJO-specific training data

Model Architecture

Overall Framework

MOJOFuzzer employs a multi-stage architecture comprising the following core components:

Data Preparation Stage
- Collects approximately 300 error reports and 1,500 syntax samples from GitHub and official documentation
- Data cleaning and standardization processing
Initialization Stage
- Prompt Bank: Stores structured prompt templates
- Seed Bank: Manages generation and storage of test seeds
Mutation Strategies
- Mutation Scoring Mechanism: Calculates scores based on API call count and code complexity
- Half Mutation: Code-level mutations targeting high-scoring seeds
- Full Mutation: Prompt-level mutations targeting low-scoring seeds

Key Technical Details

Mutation Scoring Formula:

S_mutation = N_API + C_complexity

Where:

N_API: Number of API calls
C_complexity: Code complexity score (assigned different scores for time complexity from O(1) to O(n³))
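
The scoring formula can be illustrated with a minimal Python sketch. The numeric weights in `COMPLEXITY_SCORES` are hypothetical, since the summary states only that complexity classes from O(1) to O(n³) receive different scores. The regex-based call counter and the names `count_api_calls` and `mutation_score` are likewise assumptions made for illustration, not the authors' implementation.

```python
import re

# Hypothetical weights: the source gives no concrete values, only that
# time-complexity classes from O(1) to O(n^3) are scored differently.
COMPLEXITY_SCORES = {
    "O(1)": 1,
    "O(log n)": 2,
    "O(n)": 3,
    "O(n log n)": 4,
    "O(n^2)": 5,
    "O(n^3)": 6,
}


def count_api_calls(source: str, api_names: set[str]) -> int:
    """Count call sites whose callee name appears in api_names.

    This is a crude textual approximation (identifier followed by "("),
    not a real parser for MOJO code.
    """
    callees = re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(", source)
    return sum(1 for name in callees if name in api_names)


def mutation_score(n_api: int, complexity_class: str) -> int:
    """S_mutation = N_API + C_complexity."""
    return n_api + COMPLEXITY_SCORES[complexity_class]


seed_source = "fn main():\n    var x = random_si64(0, 10)\n    print(x)\n"
n_api = count_api_calls(seed_source, {"random_si64", "print"})
score = mutation_score(n_api, "O(1)")
```

Here `main(` is matched by the regex but not counted, because `main` is not in the supplied API set; only `random_si64` and `print` contribute to N_API.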

Prompt Engineering Strategy: Employs chain-of-thought (CoT) and role-based prompting techniques, incorporating 5 core components:

Syntax analysis instructions
Role-based framework
Automated data filtering
Content summarization
Prompt seed generation
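
As a rough illustration of how role-based and chain-of-thought prompting might be combined with a filtering step, the following Python sketch assembles a prompt seed from syntax snippets and a bug summary. The function names, the filtering rules, and the prompt wording are all hypothetical; the summary does not give the actual templates stored in the Prompt Bank.

```python
def filter_snippets(snippets: list[str], max_chars: int = 400) -> list[str]:
    """Drop empty, oversized, and duplicate snippets, keeping first-seen order."""
    seen = set()
    kept = []
    for snippet in snippets:
        text = snippet.strip()
        if not text or len(text) > max_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


def build_prompt(role: str, syntax_snippets: list[str], bug_summary: str) -> str:
    """Assemble a role-based prompt with explicit step-by-step instructions."""
    snippets = filter_snippets(syntax_snippets)
    sections = [
        f"You are {role}.",
        "Step 1: Analyze the MOJO syntax examples below and note the rules they illustrate.",
        "\n".join(f"- {s}" for s in snippets),
        "Step 2: Consider this summary of previously reported bugs:",
        bug_summary.strip(),
        "Step 3: Reasoning step by step, write one MOJO test program that exercises the APIs involved.",
    ]
    return "\n".join(sections)


prompt = build_prompt(
    role="a MOJO compiler tester",
    syntax_snippets=["var x: Int = 1", "", "var x: Int = 1"],
    bug_summary="random_si64 returns a fixed value on every call",
)
```

The duplicate and empty snippets are removed before the prompt is built, which is one simple reading of "automated data filtering".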

Fine-Tuning Strategy

Employs two-stage fine-tuning using the LLAMA2 13B model:

First Stage: Learns language structure based on MOJO syntax dataset
Second Stage: Learns defect patterns based on historical error records

Technical Innovations

Zero-Shot Adaptability: First successful implementation of effective LLM fuzzing without extensive training data
Dual-Layer Mutation Mechanism: Combines code-level and prompt-level mutations to enhance test diversity
Adaptive Scoring System: Dynamically evaluates seed quality to optimize resource allocation
Multi-Stage Quality Control: Systematically filters low-quality inputs to reduce hallucination issues

Experimental Setup

Datasets

MOJO Syntax Data: Approximately 1,500 syntax rules and code examples
Historical Error Reports: Approximately 300 error records from GitHub
Testing Environment: MOJO compiler and runtime environment

Evaluation Metrics

Number of Unique Valid Programs: Distinct test programs that are both syntactically and semantically correct, reported as a proportion of generated programs
Mutation Efficiency: Improvements in test diversity, validity, and error detection capability
API Coverage: Number of unique MOJO API functions invoked during testing
Number of Detected Errors: Quantity of distinct software defects discovered
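
A minimal Python sketch of how such metrics could be computed is shown below. It assumes that API coverage is expressed as the share of a known API set invoked at least once, which matches the percentages reported in the results but is not spelled out in the summary. All function names are hypothetical.

```python
def api_coverage(invoked: set[str], known_apis: set[str]) -> float:
    """Fraction of known API functions invoked at least once."""
    if not known_apis:
        raise ValueError("known_apis must not be empty")
    return len(invoked & known_apis) / len(known_apis)


def valid_program_rate(results: list[bool]) -> float:
    """Fraction of generated programs judged valid."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def unique_valid_programs(programs: list[tuple[str, bool]]) -> int:
    """Count distinct valid programs, ignoring whitespace-only differences."""
    return len({" ".join(src.split()) for src, ok in programs if ok})


coverage = api_coverage(
    invoked={"print", "random_si64", "not_an_api"},
    known_apis={"print", "random_si64", "random_float64", "random_ui64"},
)
rate = valid_program_rate([True, True, False, True])
unique = unique_valid_programs([("a  b", True), ("a b", True), ("c", False)])
```

Names outside the known API set, such as `not_an_api` here, do not raise coverage because only the intersection is counted.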

Comparison Methods

Traditional Method: MojoCoder
LLM Fuzzers: Fuzz4All, TitanFuzz
General-Purpose LLMs: GPT-4o, LLAMA3-8B, LLAMA2-7B

Implementation Details

Hardware Platform: NVIDIA A6000 Ada
Fine-Tuning Technique: LoRA (Low-Rank Adaptation)
Maximum Iterations: 10 rounds
Mutation Threshold: Score 50 as the boundary between half mutation and full mutation
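
The threshold and iteration limit above suggest a simple routing loop, sketched here in Python. The threshold of 50 and the limit of 10 rounds come from the summary; the `Seed` type, the `choose_mutation` and `fuzz` functions, and the treatment of a score of exactly 50 as high-scoring are assumptions. The two mutators are passed in as callables because the summary does not describe their internals.

```python
from dataclasses import dataclass
from typing import Callable

MUTATION_THRESHOLD = 50  # boundary between half and full mutation (from the summary)
MAX_ITERATIONS = 10      # maximum fuzzing rounds (from the summary)


@dataclass(frozen=True)
class Seed:
    code: str
    prompt: str
    score: int


Mutator = Callable[[Seed], Seed]


def choose_mutation(seed: Seed, threshold: int = MUTATION_THRESHOLD) -> str:
    """High-scoring seeds get code-level (half) mutation, others prompt-level (full).

    Assumption: a score equal to the threshold counts as high-scoring.
    """
    return "half" if seed.score >= threshold else "full"


def fuzz(
    seeds: list[Seed],
    half_mutate: Mutator,
    full_mutate: Mutator,
    max_iterations: int = MAX_ITERATIONS,
) -> list[Seed]:
    """Route every seed through the matching mutator once per round."""
    current = list(seeds)
    for _ in range(max_iterations):
        current = [
            half_mutate(s) if choose_mutation(s) == "half" else full_mutate(s)
            for s in current
        ]
    return current


# Stub mutators, for illustration only.
def stub_half(seed: Seed) -> Seed:
    return Seed(seed.code + "#h", seed.prompt, seed.score)


def stub_full(seed: Seed) -> Seed:
    return Seed(seed.code, seed.prompt + "!", seed.score + 30)


final = fuzz([Seed("x", "p", 10)], stub_half, stub_full, max_iterations=3)
```

With these stubs, the seed receives two prompt-level mutations (raising its score from 10 to 70) and then one code-level mutation once it crosses the threshold.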

Experimental Results

Main Results

API Coverage Comparison

Model	API Coverage
MOJOFuzzer	77.3%
Fine-tuned MojoCoder	68.2%
Fuzz4All	37.8%
TitanFuzz	17.2%
GPT-4o	25.6%

Valid Program Generation Rate

Model	Valid Program Rate
MOJOFuzzer	98%
Mojo-Coder-it 7B	66.4%
GPT-4o	~25%
LLaMA3-8B	~10%
LLaMA2-7B	~10%

Error Detection Capability

Total Errors Discovered: 13 previously unknown errors
Confirmed and Fixed: 9 errors confirmed and fixed by the MOJO team
Error Types: Include random number generator defects, Python library integration issues, etc.

Ablation Study

The ablation study evaluates the contribution of three key components:

Component Configuration	Hallucination Rate	Valid Code Rate	Semantic Correctness
Baseline	40%	60%	50%
Prompt Engineering Only (PE)	28%	75%	65%
Fine-Tuning Only (FT)	15%	88%	78%
Half Mutation Only (HM)	35%	68%	55%
PE + FT	8%	95%	88%
PE + FT + HM (All)	5%	98%	90%
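
To make the per-component contributions easier to compare, the following Python sketch recomputes each configuration's change relative to the baseline, in percentage points. The figures are copied from the table above; the dictionary layout and function name are illustrative choices only.

```python
# (hallucination rate, valid code rate, semantic correctness), all in percent
ABLATION = {
    "Baseline": (40, 60, 50),
    "PE": (28, 75, 65),
    "FT": (15, 88, 78),
    "HM": (35, 68, 55),
    "PE + FT": (8, 95, 88),
    "PE + FT + HM": (5, 98, 90),
}


def deltas_vs_baseline(table: dict[str, tuple[int, int, int]]) -> dict[str, tuple[int, ...]]:
    """Percentage-point change of each configuration relative to the baseline."""
    base = table["Baseline"]
    return {
        name: tuple(v - b for v, b in zip(values, base))
        for name, values in table.items()
        if name != "Baseline"
    }


deltas = deltas_vs_baseline(ABLATION)

# Rank the single-component configurations by hallucination reduction
# (most negative change first).
ranked = sorted(["PE", "FT", "HM"], key=lambda name: deltas[name][0])

for name, (hall, valid, sem) in deltas.items():
    print(f"{name:14s} hallucination {hall:+d}  valid code {valid:+d}  semantic {sem:+d}")
```

Among the single-component rows, fine-tuning alone cuts the hallucination rate by 25 points, against 12 for prompt engineering and 5 for half mutation.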

Case Analysis

Examples of Key Errors Discovered:

Random Number Generator Bug:
- Functions random_si64, random_float64, random_ui64 consistently return fixed values
- Affects the correctness of random number generation
Python Library Integration Bug:
- Module retrieval failure when calling numpy functions
- Indicates underlying logic errors in MOJO's Python library integration

Experimental Findings

Critical Role of Fine-Tuning: Fine-tuning is the single most effective factor in reducing hallucination issues
Component Synergy: The three components achieve optimal results when combined
Zero-Shot Learning Feasibility: Demonstrates the viability of effective testing without extensive training data

Related Work

LLM Fuzzing Development

LLM-Based Fuzzers: TitanFuzz, ChatAFL, Fuzz4All, etc., leverage LLMs to improve seed generation and mutation
Fine-Tuned Fuzzers: FuzzGPT and similar approaches enhance performance through domain-specific data fine-tuning
Traditional Fuzzing: Limitations of conventional tools like OSS-Fuzz on emerging languages

Advantages of This Work

Compared to existing work, MOJOFuzzer's main advantages are:

Zero-Shot Capability: No requirement for extensive pre-training data
Dual-Layer Mutation: Simultaneous mutation at both code and prompt levels
Adaptive Mechanism: Dynamic strategy adjustment based on runtime feedback

Conclusions and Discussion

Main Conclusions

MOJOFuzzer successfully addresses fuzz testing challenges for emerging programming languages
Zero-shot LLM fuzzing demonstrates practical feasibility in real-world applications
The combination of fine-tuning, prompt engineering, and adaptive mutation significantly outperforms single-technique approaches

Limitations

Temporal Validity Threat: As advanced LLMs gradually integrate MOJO knowledge, zero-shot advantages may diminish
Data Dependency: Still requires minimal amounts of syntax rules and error reports
Computational Resource Requirements: Fine-tuning and inference processes demand substantial computational resources

Future Directions

Full Automation: Progress toward completely automated fuzz testing
Additional Emerging Languages: Extend the methodology to other emerging programming languages
Pre-Training Data Optimization: Investigate better utilization of limited training data

In-Depth Evaluation

Strengths

Strong Innovation: First zero-shot LLM fuzzing framework specifically for emerging languages
High Practical Value: Successfully discovered 13 actual errors, demonstrating method effectiveness
Complete Technical Solution: Comprehensive pipeline from data collection to error detection
Comprehensive Experiments: Includes thorough comparative experiments and ablation studies
Clear Presentation: Accurate technical detail descriptions and well-designed experiments

Weaknesses

Limited Evaluation Scope: Primarily focuses on MOJO language; generalization capability requires verification
Baseline Comparison: Some baseline methods may not represent optimal choices
Long-Term Validity: Method advantages may diminish as MOJO ecosystem matures
Computational Cost Analysis: Lacks detailed analysis of computational resource consumption

Impact

Academic Contribution: Provides important methodological foundation for emerging language testing
Practical Value: Directly assists MOJO language improvement with immediate impact
Reproducibility: Authors commit to open-sourcing code and data, facilitating future research
Field Advancement: May catalyze more AI testing methods for emerging technologies

Applicable Scenarios

Emerging Programming Languages: Languages lacking mature testing frameworks
Zero-Shot Testing Environments: Scenarios with scarce training data
AI System Testing: AI development environments requiring efficient testing tools
Safety-Critical Systems: Applications requiring potential error discovery

References

The paper cites 58 relevant references covering important works in LLMs, fuzzing, software engineering, and related fields, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality software engineering research paper that proposes innovative solutions to practical problems with rigorous experimental design and convincing results. Beyond technical breakthroughs, this work provides viable methodologies for testing emerging technologies, possessing significant academic and practical value.