LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models

Huang, Zhao, Chen
The rapid development of large language models (LLMs) has revolutionized software testing, particularly fuzz testing, by automating the generation of diverse and effective test inputs. This advancement holds great promise for improving software reliability. Meanwhile, the introduction of MOJO, a high-performance AI programming language blending Python's usability with the efficiency of C and C++, presents new opportunities to enhance AI model scalability and programmability. However, as a new language, MOJO lacks comprehensive testing frameworks and a sufficient corpus for LLM-based testing, which exacerbates model hallucination. In this case, LLMs will generate syntactically valid but semantically incorrect code, significantly reducing the effectiveness of fuzz testing. To address this challenge, we propose MOJOFuzzer, the first adaptive LLM-based fuzzing framework designed for zero-shot learning environments of emerging programming languages. MOJOFuzzer integrates a multi-phase framework that systematically eliminates low-quality generated inputs before execution, significantly improving test case validity. Furthermore, MOJOFuzzer dynamically adapts LLM prompts based on runtime feedback for test case mutation, enabling an iterative learning process that continuously enhances fuzzing efficiency and bug detection performance. Our experimental results demonstrate that MOJOFuzzer significantly enhances test validity, API coverage, and bug detection performance, outperforming traditional fuzz testing and state-of-the-art LLM-based fuzzing approaches. Using MOJOFuzzer, we have conducted the first large-scale fuzz testing evaluation of MOJO, uncovering 13 previously unknown bugs. This study not only advances the field of LLM-driven software testing but also establishes a foundational methodology for leveraging LLMs in the testing of emerging programming languages.

Basic Information

  • Paper ID: 2510.10179
  • Title: LLMs are All You Need? Improving Fuzz Testing for MOJO with Large Language Models
  • Authors: Linghan Huang, Peizhou Zhao, Huaming Chen (University of Sydney)
  • Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence)
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10179

Abstract

The rapid advancement of Large Language Models (LLMs) has revolutionized software testing, particularly fuzzing, through automatic generation of diverse and effective test inputs. Concurrently, the introduction of MOJO, a high-performance AI programming language that combines Python's ease of use with C/C++ efficiency, presents new opportunities for enhancing AI model scalability and programmability. However, as an emerging language, MOJO lacks comprehensive testing frameworks and sufficient LLM training corpora, which exacerbates model hallucination. To address this challenge, this paper proposes MOJOFuzzer, the first adaptive LLM-based fuzzing framework designed for zero-shot learning environments in emerging programming languages. Experimental results demonstrate that MOJOFuzzer significantly outperforms traditional fuzzing and state-of-the-art LLM-based fuzzing methods in test validity, API coverage, and error detection performance, and that it discovered 13 previously unknown errors in MOJO.

Research Background and Motivation

Core Problem

The core problem addressed by this research is fuzz testing for emerging programming languages, in particular how to test effectively in zero-shot settings where little language-specific training data exists.

Problem Significance

  1. AI Development Requirements: With AI's widespread application in critical domains such as autonomous driving, medical diagnosis, and financial services, efficient programming languages are essential
  2. MOJO Language Potential: MOJO is reported to run up to 68,000 times faster than Python, making it a crucial tool for AI development
  3. Missing Testing Frameworks: As an emerging language, MOJO lacks mature testing frameworks, leaving undiscovered software errors and security vulnerabilities

Limitations of Existing Approaches

  1. Traditional LLM Fuzzers rely on large amounts of domain-specific training data, limiting their applicability to emerging languages
  2. Model Hallucination Problem: In zero-shot environments, LLMs tend to generate syntactically correct but semantically erroneous code
  3. Lack of Specificity: Existing tools are not specifically optimized for MOJO language characteristics

Research Motivation

Develop the first LLM fuzzing framework specifically designed for MOJO, leveraging innovative prompt engineering and fine-tuning techniques to achieve effective error detection in zero-shot learning environments.

Core Contributions

  1. First Zero-Shot LLM Fuzzing Framework: MOJOFuzzer is the first LLM-driven fuzzing framework designed for zero-shot learning environments, effectively mitigating LLM hallucination issues
  2. Multi-Stage Quality Control Mechanism: Integrates systematic low-quality input filtering mechanisms, significantly improving test case validity
  3. Adaptive Mutation Strategy: Dynamically adjusts LLM prompts based on runtime feedback, enabling iterative learning processes
  4. Practical Error Discovery: Successfully discovered 13 previously unknown errors in MOJO, with 9 confirmed and fixed by the official team
  5. Significant Performance Improvements: Substantially outperforms existing methods in test validity (98%), API coverage (77.3%), and error detection capability

Methodology Details

Task Definition

  • Input: MOJO programming language environment with limited syntax rules and historical error reports
  • Output: Valid test cases capable of triggering MOJO errors
  • Constraints: Zero-shot learning environment without extensive MOJO-specific training data

Model Architecture

Overall Framework

MOJOFuzzer employs a multi-stage architecture comprising the following core components:

  1. Data Preparation Stage
    • Collects approximately 300 error reports and 1,500 syntax samples from GitHub and official documentation
    • Data cleaning and standardization processing
  2. Initialization Stage
    • Prompt Bank: Stores structured prompt templates
    • Seed Bank: Manages generation and storage of test seeds
  3. Mutation Strategies
    • Mutation Scoring Mechanism: Calculates scores based on API call count and code complexity
    • Half Mutation: Code-level mutations targeting high-scoring seeds
    • Full Mutation: Prompt-level mutations targeting low-scoring seeds

Key Technical Details

Mutation Scoring Formula:

S_mutation = N_API + C_complexity

Where:

  • N_API: Number of API calls
  • C_complexity: Code complexity score (assigned different scores for time complexity from O(1) to O(n³))
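As a worked illustration of the formula above, the following Python sketch adds an API-call count to a complexity score. The per-class values in `COMPLEXITY_SCORES` are placeholders chosen for illustration: the summary only states that different scores are assigned for complexities from O(1) to O(n³), not what those scores are.

```python
# Hypothetical complexity scores. The source gives the range O(1)..O(n^3)
# but not the actual values, so these numbers are assumptions.
COMPLEXITY_SCORES = {
    "O(1)": 10,
    "O(n)": 20,
    "O(n^2)": 30,
    "O(n^3)": 40,
}


def mutation_score(n_api: int, complexity_class: str) -> int:
    """Return S_mutation = N_API + C_complexity for one seed."""
    if n_api < 0:
        raise ValueError("n_api must be non-negative")
    return n_api + COMPLEXITY_SCORES[complexity_class]


# A seed with 25 API calls and quadratic complexity: 25 + 30 = 55
print(mutation_score(n_api=25, complexity_class="O(n^2)"))
```

Because the two terms are simply added, the relative scale of the complexity scores determines how strongly complexity weighs against the raw API-call count.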

Prompt Engineering Strategy: Employs chain-of-thought (CoT) and role-based prompting techniques, incorporating 5 core components:

  1. Syntax analysis instructions
  2. Role-based framework
  3. Automated data filtering
  4. Content summarization
  5. Prompt seed generation
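The five components listed above might be combined roughly as in the following Python sketch. The template wording, the `build_prompt` helper, and the simple deduplicate-and-truncate step standing in for filtering and summarization are all assumptions made for illustration; the summary does not give the actual prompts.

```python
from string import Template

# Hypothetical template: the wording is illustrative, not taken from the paper.
PROMPT_TEMPLATE = Template(
    "You are $role.\n"
    "Syntax notes:\n$syntax_summary\n"
    "Known bug summary:\n$bug_summary\n"
    "Think step by step: first list the MOJO APIs you will call, "
    "then explain what behaviour the test targets, "
    "then write the complete test program.\n"
)


def _filter_and_summarize(items: list[str], max_items: int) -> list[str]:
    """Drop blank and duplicate entries, then keep the first max_items."""
    kept: list[str] = []
    for item in items:
        text = item.strip()
        if text and text not in kept:
            kept.append(text)
    return kept[:max_items]


def build_prompt(role: str, syntax_rules: list[str],
                 bug_reports: list[str], max_items: int = 3) -> str:
    """Assemble a role-based, chain-of-thought prompt seed."""
    syntax = "\n".join(f"- {r}" for r in _filter_and_summarize(syntax_rules, max_items))
    bugs = "\n".join(f"- {b}" for b in _filter_and_summarize(bug_reports, max_items))
    return PROMPT_TEMPLATE.substitute(
        role=role, syntax_summary=syntax, bug_summary=bugs
    )


print(build_prompt(
    role="a MOJO compiler tester",
    syntax_rules=["fn defines a function", "fn defines a function", ""],
    bug_reports=["random_si64 returns a fixed value"],
))
```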

Fine-Tuning Strategy

Employs two-stage fine-tuning of the LLAMA2 13B model:

  1. First Stage: Learns language structure based on MOJO syntax dataset
  2. Second Stage: Learns defect patterns based on historical error records

Technical Innovations

  1. Zero-Shot Adaptability: First successful implementation of effective LLM fuzzing without extensive training data
  2. Dual-Layer Mutation Mechanism: Combines code-level and prompt-level mutations to enhance test diversity
  3. Adaptive Scoring System: Dynamically evaluates seed quality to optimize resource allocation
  4. Multi-Stage Quality Control: Systematically filters low-quality inputs to reduce hallucination issues

Experimental Setup

Datasets

  • MOJO Syntax Data: Approximately 1,500 syntax rules and code examples
  • Historical Error Reports: Approximately 300 error records from GitHub
  • Testing Environment: MOJO compiler and runtime environment

Evaluation Metrics

  1. Number of Unique Valid Programs: Count of distinct generated test programs that are syntactically and semantically correct, reported as a proportion of all generated programs
  2. Mutation Efficiency: Improvements in test diversity, validity, and error detection capability
  3. API Coverage: Number of unique MOJO API functions invoked during testing
  4. Number of Detected Errors: Quantity of distinct software defects discovered

Comparison Methods

  • Traditional Method: MojoCoder
  • LLM Fuzzers: Fuzz4All, TitanFuzz
  • General-Purpose LLMs: GPT-4o, LLAMA3-8B, LLAMA2-7B

Implementation Details

  • Hardware Platform: NVIDIA A6000 Ada
  • Fine-Tuning Technique: LoRA (Low-Rank Adaptation)
  • Maximum Iterations: 10 rounds
  • Mutation Threshold: Score 50 as the boundary between half mutation and full mutation
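The threshold and iteration limit above suggest a simple routing loop, sketched here in Python. The callbacks `score_fn`, `half_mutate`, and `full_mutate` are hypothetical stand-ins supplied by the caller, and treating a score of exactly 50 as "high-scoring" is an assumption, since the summary does not say which side the boundary falls on. The real framework also executes the tests and feeds runtime results back into the prompts, which this sketch omits.

```python
from typing import Callable, TypeVar

Seed = TypeVar("Seed")

MUTATION_THRESHOLD = 50  # boundary between half and full mutation (from the summary)
MAX_ITERATIONS = 10      # maximum number of rounds (from the summary)


def run_rounds(
    seeds: list[Seed],
    score_fn: Callable[[Seed], int],
    half_mutate: Callable[[Seed], Seed],
    full_mutate: Callable[[Seed], Seed],
    max_iterations: int = MAX_ITERATIONS,
) -> list[Seed]:
    """Route each seed to half (code-level) or full (prompt-level) mutation."""
    for _ in range(max_iterations):
        seeds = [
            # Assumption: a score equal to the threshold counts as high-scoring.
            half_mutate(s) if score_fn(s) >= MUTATION_THRESHOLD else full_mutate(s)
            for s in seeds
        ]
    return seeds


# Toy usage: seeds are plain integers that act as their own score.
result = run_rounds(
    seeds=[45, 60],
    score_fn=lambda s: s,
    half_mutate=lambda s: s + 1,   # small, code-level change
    full_mutate=lambda s: s + 10,  # larger, prompt-level regeneration
)
print(result)  # [64, 70]
```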

Experimental Results

Main Results

API Coverage Comparison

Model                | API Coverage
MOJOFuzzer           | 77.3%
Fine-tuned MojoCoder | 68.2%
Fuzz4All             | 37.8%
TitanFuzz            | 17.2%
GPT-4o               | 25.6%

Valid Program Generation Rate

Model            | Valid Program Rate
MOJOFuzzer       | 98%
Mojo-Coder-it 7B | 66.4%
GPT-4o           | ~25%
LLaMA3-8B        | ~10%
LLaMA2-7B        | ~10%

Error Detection Capability

  • Total Errors Discovered: 13 previously unknown errors
  • Confirmed and Fixed: 9 errors confirmed and fixed by the MOJO team
  • Error Types: Include random number generator defects, Python library integration issues, etc.

Ablation Study

The ablation study evaluates the contribution of three key components:

Component Configuration      | Hallucination Rate | Valid Code Rate | Semantic Correctness
Baseline                     | 40%                | 60%             | 50%
Prompt Engineering Only (PE) | 28%                | 75%             | 65%
Fine-Tuning Only (FT)        | 15%                | 88%             | 78%
Half Mutation Only (HM)      | 35%                | 68%             | 55%
PE + FT                      | 8%                 | 95%             | 88%
PE + FT + HM (All)           | 5%                 | 98%             | 90%
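To make the per-component effect in the ablation table easier to read, the short Python sketch below computes the drop in hallucination rate relative to the baseline, in percentage points. The numbers are copied from the table; the dictionary layout and the function name are illustrative.

```python
# (hallucination rate, valid code rate, semantic correctness), all in percent,
# copied from the ablation table.
ABLATION = {
    "Baseline": (40, 60, 50),
    "PE": (28, 75, 65),
    "FT": (15, 88, 78),
    "HM": (35, 68, 55),
    "PE + FT": (8, 95, 88),
    "PE + FT + HM": (5, 98, 90),
}


def hallucination_reduction(config: str) -> int:
    """Percentage-point drop in hallucination rate versus the baseline."""
    return ABLATION["Baseline"][0] - ABLATION[config][0]


for name in ("PE", "FT", "HM", "PE + FT + HM"):
    print(f"{name}: -{hallucination_reduction(name)} points")
# PE: -12 points
# FT: -25 points
# HM: -5 points
# PE + FT + HM: -35 points
```

Taken alone, fine-tuning removes 25 points of hallucination, about twice the effect of prompt engineering (12 points) and five times that of half mutation (5 points).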

Case Analysis

Examples of Key Errors Discovered:

  1. Random Number Generator Bug:
    • Functions random_si64, random_float64, random_ui64 consistently return fixed values
    • Affects the correctness of random number generation
  2. Python Library Integration Bug:
    • Module retrieval failure when calling numpy functions
    • Indicates underlying logic errors in MOJO's Python library integration
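A bug such as the fixed-value random number generator can be caught by a simple property check: sample the function repeatedly and flag it if every value is identical. The Python sketch below illustrates the idea with stand-in generators. It is not MOJO code and not the oracle used in the paper, and the `looks_constant` helper and its sample count are assumptions.

```python
import random
from typing import Callable


def looks_constant(generator: Callable[[], float], samples: int = 100) -> bool:
    """Return True if every sampled value is identical."""
    if samples < 2:
        raise ValueError("need at least two samples to compare")
    values = {generator() for _ in range(samples)}
    return len(values) == 1


# Stand-in for a broken generator that always returns the same value.
def stuck_generator() -> float:
    return 42.0


# Stand-in for a healthy generator, seeded so the demo is reproducible.
healthy_generator = random.Random(0).random

print(looks_constant(stuck_generator))    # True
print(looks_constant(healthy_generator))  # False
```

A check like this needs no reference implementation, which is convenient for a new language where no trusted oracle exists yet.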

Experimental Findings

  1. Critical Role of Fine-Tuning: Fine-tuning is the single most effective factor in reducing hallucination issues
  2. Component Synergy: The three components achieve optimal results when combined
  3. Zero-Shot Learning Feasibility: Demonstrates the viability of effective testing without extensive training data

Related Work

LLM Fuzzing Development

  1. LLM-Based Fuzzers: TitanFuzz, ChatAFL, Fuzz4All, etc., leverage LLMs to improve seed generation and mutation
  2. Fine-Tuned Fuzzers: FuzzGPT and similar approaches enhance performance through domain-specific data fine-tuning
  3. Traditional Fuzzing: Limitations of conventional tools like OSS-Fuzz on emerging languages

Advantages of This Work

Compared to existing work, MOJOFuzzer's main advantages are:

  1. Zero-Shot Capability: No requirement for extensive pre-training data
  2. Dual-Layer Mutation: Simultaneous mutation at both code and prompt levels
  3. Adaptive Mechanism: Dynamic strategy adjustment based on runtime feedback

Conclusions and Discussion

Main Conclusions

  1. MOJOFuzzer successfully addresses fuzz testing challenges for emerging programming languages
  2. Zero-shot LLM fuzzing demonstrates practical feasibility in real-world applications
  3. The combination of fine-tuning, prompt engineering, and adaptive mutation significantly outperforms single-technique approaches

Limitations

  1. Temporal Validity Threat: As advanced LLMs gradually integrate MOJO knowledge, zero-shot advantages may diminish
  2. Data Dependency: Still requires minimal amounts of syntax rules and error reports
  3. Computational Resource Requirements: Fine-tuning and inference processes demand substantial computational resources

Future Directions

  1. Full Automation: Progress toward completely automated fuzz testing
  2. Additional Emerging Languages: Extend the methodology to other emerging programming languages
  3. Pre-Training Data Optimization: Investigate better utilization of limited training data

In-Depth Evaluation

Strengths

  1. Strong Innovation: First zero-shot LLM fuzzing framework specifically for emerging languages
  2. High Practical Value: Successfully discovered 13 actual errors, demonstrating method effectiveness
  3. Complete Technical Solution: Comprehensive pipeline from data collection to error detection
  4. Comprehensive Experiments: Includes thorough comparative experiments and ablation studies
  5. Clear Presentation: Accurate technical detail descriptions and well-designed experiments

Weaknesses

  1. Limited Evaluation Scope: Primarily focuses on MOJO language; generalization capability requires verification
  2. Baseline Comparison: Some baseline methods may not represent optimal choices
  3. Long-Term Validity: Method advantages may diminish as MOJO ecosystem matures
  4. Computational Cost Analysis: Lacks detailed analysis of computational resource consumption

Impact

  1. Academic Contribution: Provides important methodological foundation for emerging language testing
  2. Practical Value: Directly assists MOJO language improvement with immediate impact
  3. Reproducibility: Authors commit to open-sourcing code and data, facilitating future research
  4. Field Advancement: May catalyze more AI testing methods for emerging technologies

Applicable Scenarios

  1. Emerging Programming Languages: Languages lacking mature testing frameworks
  2. Zero-Shot Testing Environments: Scenarios with scarce training data
  3. AI System Testing: AI development environments requiring efficient testing tools
  4. Safety-Critical Systems: Applications requiring potential error discovery

References

The paper cites 58 relevant references covering important works in LLMs, fuzzing, software engineering, and related fields, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality software engineering research paper that proposes innovative solutions to practical problems with rigorous experimental design and convincing results. Beyond technical breakthroughs, this work provides viable methodologies for testing emerging technologies, possessing significant academic and practical value.