2025-11-13T01:07:10.778375

MASA: LLM-Driven Multi-Agent Systems for Autoformalization

Zhang, Valentino, Freitas

Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.

academic

MASA: LLM-Driven Multi-Agent Systems for Autoformalization

Basic Information

Paper ID: 2510.08988
Title: MASA: LLM-Driven Multi-Agent Systems for Autoformalization
Authors: Lan Zhang (University of Manchester), Marco Valentino (University of Sheffield), André Freitas (University of Manchester, Idiap Research Institute, National Biomarker Centre)
Classification: cs.CL (Computational Linguistics), cs.FL (Formal Languages)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08988

Abstract

Autoformalization plays a crucial role in bridging natural language and formal reasoning. This paper proposes MASA, a multi-agent system construction framework driven by Large Language Models (LLMs) for autoformalization tasks. MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA emphasizes modularity, flexibility, and scalability, enabling seamless integration of new agents and tools to adapt to the rapidly evolving field. The authors demonstrate the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems driven by LLMs and theorem prover interactions in improving the efficiency and reliability of autoformalization.

Research Background and Motivation

Problem Definition

Autoformalization is the task of automatically converting natural language mathematical statements into machine-verifiable formal logical formulas. The core challenges of this problem include:

Semantic Gap: Significant differences exist between the ambiguity of natural language and the rigor of formal language
Complexity: Real-world mathematical statements often involve complex concepts and reasoning chains
Accuracy Requirements: Formalization results must be correct both syntactically and semantically

Problem Significance

Autoformalization holds important significance:

Mathematical Verification: Provides the foundation for theorem proving and mathematical verification
Reasoning Transparency: Formal reasoning is more systematic, transparent, and rigorous compared to natural language reasoning
Automation Necessity: Manual formalization requires substantial expertise and time investment

Limitations of Existing Approaches

Existing LLM-based autoformalization systems suffer from the following issues:

Monolithic Architecture Constraints: Single LLMs struggle to handle complex real-world autoformalization problems
Lack of Modularity: Existing implementations lack modularity, flexibility, and scalability
Insufficient Component Interaction: Unable to effectively integrate multiple interacting components

Research Motivation

The authors' motivation for proposing the MASA framework is to:

Construct modular multi-agent systems to address the complexity of autoformalization
Provide a flexible and scalable framework supporting researchers in rapid system development and extension
Improve the efficiency and reliability of autoformalization through agent collaboration

Core Contributions

Proposed a Modular Multi-Agent Framework: MASA provides a modular framework for constructing multi-agent systems for autoformalization with excellent flexibility and scalability
Demonstrated Real-World Applications: Through agents formalizing real-world mathematical definitions, highlighting the practical potential of the framework
Systematic Evaluation: Evaluated the framework under three multi-agent settings, proving its effectiveness and providing valuable insights. The final iterative self-improvement system achieves 35.25% and 61.89% syntactically correct and semantically aligned formalization rates on Qwen2.5-7B and GPT-4.1-mini, respectively
Open-Source Implementation: Provides complete code and data, lowering research barriers

Methodology Details

Task Definition

Autoformalization can be defined as a transformation function A that maps natural language statement s to its formal representation φ = A(s). Traditional approaches implement this through LLM prompting: A = LLM(prompt). MASA extends this definition by implementing more complex transformation processes through multi-agent systems.

Model Architecture

The MASA framework contains five core components:

1. Agent

Agents are fundamental elements executing specific functions, including:

AutoformalizationAgent: Performs few-shot autoformalization
HardCritiqueAgent: Provides syntax-level critique based on theorem prover
SoftCritiqueAgent: Provides semantic-level critique based on LLM
FormalRefinementAgent: Refines formalization code based on hard critique
InformalRefinementAgent: Refines formalization code based on soft critique
DenoisingAgent: Tool agent for denoising
ImportRetrievalAgent: Tool agent for import retrieval

2. Large Language Model

LLMs provide reasoning and language capabilities to agents, supporting:

OpenAI models (e.g., GPT-4.1-mini)
HuggingFace local models (e.g., Qwen2.5-7B, DeepSeek-Math)

3. Knowledge Base

Stores information about formal language libraries, supporting relevant knowledge retrieval. Current implementation includes knowledge bases of Isabelle/HOL formal statements and proofs.

4. Retriever

Ranks data points in mathematical libraries by relevance, with current implementation based on BM25 method.

5. Theorem Prover

Verifies syntactic correctness and logical validity of formalizations, providing error messages. Supports:

Isabelle (via dedicated server)
Lean4 (via REPL)

Technical Innovations

Modular Design: Employs abstract base class design for easy extension of new agents and tools
Multi-Level Critique Mechanism: Combines hard critique (syntax) and soft critique (semantics)
Iterative Improvement Process: Supports multi-round self-improvement agent collaboration
Tool Agent Integration: Integrates practical tools such as denoising and import retrieval

Experimental Setup

Datasets

miniF2F: Provides real formalizations in Isabelle/HOL and Lean4 for benchmarking
ProofNet: Provides real formalizations of Lean4 code

Evaluation Metrics

Syntactic Correctness

Pass Rate: Percentage of syntactically correct formalizations

Semantic Evaluation

BLEU-4: N-gram overlap with real formalizations
ChrF: Character-level F-score
RUBY: Code migration evaluation metric

Agent Evaluation

Alignment Faithfulness (AF): Whether formalization code accurately aligns with natural language semantics
Formalization Correctness (FC): Whether formalization code itself is valid, natural, and well-formatted

Comparison Methods

Zero-Shot Prompting (ZS): Direct LLM formalization
Few-Shot Prompting (FS): Formalization using 3 examples
Different LLM Combinations: GPT-4.1-mini, DeepSeek-Math-7B, Qwen2.5-7B

Implementation Details

Uses GPT-4.1-mini as backend LLM for soft critique agent
Supports both Isabelle/HOL and Lean4 formal languages
Provides complete Python implementation and Jupyter Notebook examples

Experimental Results

Main Results

Hard Critique Formal Refinement System (Table 1)

miniF2F-Test (Isabelle/HOL):

GPT-4.1-mini zero-shot: Pass Rate 65.57% → 77.05% (+11.48%)
GPT-4.1-mini few-shot: Pass Rate 76.23% → 86.48% (+10.25%)
DeepSeek-Math few-shot: Pass Rate 29.10% → 36.48% (+7.38%)

ProofNet-Test (Lean4):

GPT-4.1-mini zero-shot: Pass Rate 3.30% → 3.85% (+0.55%)
GPT-4.1-mini few-shot: Pass Rate 12.09% → 14.84% (+2.75%)

Soft Critique Informal Refinement System (Table 2)

miniF2F (Lean4):

DeepSeek-Math + GPT-4.1-mini refinement: AF 38.52% → 90.57%, FC 47.54% → 79.92%
Qwen2.5-7B + GPT-4.1-mini refinement: AF 54.51% → 93.44%, FC 62.70% → 85.25%

Iterative Self-Improvement Experiments

Figure 2 Results Show:

Qwen2.5-7B: After 4 iterations, syntactically correct and semantically aligned proportion reaches 35.25%
GPT-4.1-mini: After 4 iterations, syntactically correct and semantically aligned proportion reaches 61.89%

Experimental Findings

Few-Shot Outperforms Zero-Shot: Few-shot learning significantly improves performance across all settings
Model Capability Impact: Stronger models (GPT-4.1-mini) perform better in formal refinement
Iterative Improvement Effectiveness: Multiple iterations continuously improve formalization quality
Cross-Model Refinement: Stronger models can refine outputs from weaker models

Autoformalization Research

Prompting Methods: Wu et al. (2022) and others use LLM prompting for autoformalization
Data Generation: Jiang et al. (2024) and Liu et al. (2025b) develop large-scale parallel corpora
System Implementation: Existing systems lack modularity and scalability

Multi-Agent Systems

Application Domains: Operating systems, medical education, answer verification, etc.
Reasoning Tasks: Arithmetic reasoning and general reasoning
Autoformalization Applications: Limited research on multi-agent systems in autoformalization

Conclusions and Discussion

Main Conclusions

Framework Effectiveness: MASA successfully constructs a modular multi-agent autoformalization system
Performance Improvement: Multi-agent collaboration significantly improves autoformalization accuracy
Practical Value: The framework provides researchers with a flexible development platform

Limitations

Lack of Central Control: The system lacks a central agent for allocating and controlling different agents
Semantic Evaluation Limitations: Semantic evaluation is limited to high-level judgments, requiring finer-grained evaluation standards
Model Dependency: System performance largely depends on the capabilities of underlying LLMs

Future Directions

Enhanced Central Control: Develop more advanced multi-agent system control mechanisms
Refined Evaluation: Establish more fine-grained semantic evaluation standards
Extended Applications: Apply the framework to broader formalization tasks

In-Depth Evaluation

Strengths

Strong Innovation: First systematic multi-agent autoformalization framework
Reasonable Design: Elegant modular architecture design, easy to extend
Comprehensive Experiments: Covers multiple settings and evaluation metrics
High Practical Value: Open-source implementation lowers research barriers
Convincing Results: Validates method effectiveness across multiple datasets

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical analysis of multi-agent collaboration mechanisms
Computational Cost: Insufficient analysis of computational overhead of multi-agent systems
Error Propagation: Insufficient in-depth analysis of error propagation between agents
Evaluation Limitations: Semantic evaluation still relies on LJM judgment, potentially introducing bias

Impact

Academic Contribution: Provides new paradigm for autoformalization research
Practical Value: Framework can be directly used for practical application development
Reproducibility: Complete open-source implementation ensures reproducibility
Advancement: Likely to promote multi-agent applications in formalization domain

Applicable Scenarios

Mathematical Formalization: Suitable for autoformalization of complex mathematical theorems and definitions
Educational Applications: Can be used for formalization training in mathematics education
Research Tool: Provides foundational platform for autoformalization research
Industrial Applications: Can be integrated into software systems requiring formal verification

References

Key references include:

Wu et al. (2022): Autoformalization with large language models
Zheng et al. (2022): miniF2F: a cross-system benchmark for formal olympiad-level mathematics
Azerbayev et al. (2023): ProofNet: Autoformalizing and formally proving undergraduate-level mathematics
Jiang et al. (2023): Draft, sketch, and prove: Guiding formal theorem provers with informal proofs

Summary: MASA is an innovative multi-agent autoformalization framework that significantly improves autoformalization effectiveness through modular design and agent collaboration. This work excels in technical innovation, experimental validation, and practical value, opening new directions for autoformalization research. Despite some limitations, its open-source implementation and good scalability make it an important contribution to the field.