Autoformalization serves a crucial role in connecting natural language and formal reasoning. This paper presents MASA, a novel framework for building multi-agent systems for autoformalization driven by Large Language Models (LLMs). MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA is designed with a strong emphasis on modularity, flexibility, and extensibility, allowing seamless integration of new agents and tools to adapt to a fast-evolving field. We showcase the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems powered by the interaction of LLMs and theorem provers in enhancing the efficiency and reliability of autoformalization, providing valuable insights and support for researchers and practitioners in the field.
- Paper ID: 2510.08988
- Title: MASA: LLM-Driven Multi-Agent Systems for Autoformalization
- Authors: Lan Zhang (University of Manchester), Marco Valentino (University of Sheffield), André Freitas (University of Manchester, Idiap Research Institute, National Biomarker Centre)
- Classification: cs.CL (Computational Linguistics), cs.FL (Formal Languages)
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.08988
Autoformalization plays a crucial role in bridging natural language and formal reasoning. This paper proposes MASA, a multi-agent system construction framework driven by Large Language Models (LLMs) for autoformalization tasks. MASA leverages collaborative agents to convert natural language statements into their formal representations. The architecture of MASA emphasizes modularity, flexibility, and scalability, enabling seamless integration of new agents and tools to adapt to the rapidly evolving field. The authors demonstrate the effectiveness of MASA through use cases on real-world mathematical definitions and experiments on formal mathematics datasets. This work highlights the potential of multi-agent systems driven by LLMs and theorem prover interactions in improving the efficiency and reliability of autoformalization.
Autoformalization is the task of automatically converting natural language mathematical statements into machine-verifiable formal logical formulas. The core challenges of this problem include:
- Semantic Gap: Significant differences exist between the ambiguity of natural language and the rigor of formal language
- Complexity: Real-world mathematical statements often involve complex concepts and reasoning chains
- Accuracy Requirements: Formalization results must be correct both syntactically and semantically
Autoformalization holds important significance:
- Mathematical Verification: Provides the foundation for theorem proving and mathematical verification
- Reasoning Transparency: Formal reasoning is more systematic, transparent, and rigorous compared to natural language reasoning
- Automation Necessity: Manual formalization requires substantial expertise and time investment
Existing LLM-based autoformalization systems suffer from the following issues:
- Monolithic Architecture Constraints: Single LLMs struggle to handle complex real-world autoformalization problems
- Lack of Modularity: Existing implementations lack modularity, flexibility, and scalability
- Insufficient Component Interaction: Unable to effectively integrate multiple interacting components
The authors' motivation for proposing the MASA framework is to:
- Construct modular multi-agent systems to address the complexity of autoformalization
- Provide a flexible and scalable framework supporting researchers in rapid system development and extension
- Improve the efficiency and reliability of autoformalization through agent collaboration
- Proposed a Modular Multi-Agent Framework: MASA provides a modular framework for constructing multi-agent systems for autoformalization with excellent flexibility and scalability
- Demonstrated Real-World Applications: Through agents formalizing real-world mathematical definitions, highlighting the practical potential of the framework
- Systematic Evaluation: Evaluated the framework under three multi-agent settings, proving its effectiveness and providing valuable insights. The final iterative self-improvement system achieves 35.25% and 61.89% syntactically correct and semantically aligned formalization rates on Qwen2.5-7B and GPT-4.1-mini, respectively
- Open-Source Implementation: Provides complete code and data, lowering research barriers
Autoformalization can be defined as a transformation function A that maps natural language statement s to its formal representation φ = A(s). Traditional approaches implement this through LLM prompting: A = LLM(prompt). MASA extends this definition by implementing more complex transformation processes through multi-agent systems.
The MASA framework contains five core components:
Agents are fundamental elements executing specific functions, including:
- AutoformalizationAgent: Performs few-shot autoformalization
- HardCritiqueAgent: Provides syntax-level critique based on theorem prover
- SoftCritiqueAgent: Provides semantic-level critique based on LLM
- FormalRefinementAgent: Refines formalization code based on hard critique
- InformalRefinementAgent: Refines formalization code based on soft critique
- DenoisingAgent: Tool agent for denoising
- ImportRetrievalAgent: Tool agent for import retrieval
LLMs provide reasoning and language capabilities to agents, supporting:
- OpenAI models (e.g., GPT-4.1-mini)
- HuggingFace local models (e.g., Qwen2.5-7B, DeepSeek-Math)
Stores information about formal language libraries, supporting relevant knowledge retrieval. Current implementation includes knowledge bases of Isabelle/HOL formal statements and proofs.
Ranks data points in mathematical libraries by relevance, with current implementation based on BM25 method.
Verifies syntactic correctness and logical validity of formalizations, providing error messages. Supports:
- Isabelle (via dedicated server)
- Lean4 (via REPL)
- Modular Design: Employs abstract base class design for easy extension of new agents and tools
- Multi-Level Critique Mechanism: Combines hard critique (syntax) and soft critique (semantics)
- Iterative Improvement Process: Supports multi-round self-improvement agent collaboration
- Tool Agent Integration: Integrates practical tools such as denoising and import retrieval
- miniF2F: Provides real formalizations in Isabelle/HOL and Lean4 for benchmarking
- ProofNet: Provides real formalizations of Lean4 code
- Pass Rate: Percentage of syntactically correct formalizations
- BLEU-4: N-gram overlap with real formalizations
- ChrF: Character-level F-score
- RUBY: Code migration evaluation metric
- Alignment Faithfulness (AF): Whether formalization code accurately aligns with natural language semantics
- Formalization Correctness (FC): Whether formalization code itself is valid, natural, and well-formatted
- Zero-Shot Prompting (ZS): Direct LLM formalization
- Few-Shot Prompting (FS): Formalization using 3 examples
- Different LLM Combinations: GPT-4.1-mini, DeepSeek-Math-7B, Qwen2.5-7B
- Uses GPT-4.1-mini as backend LLM for soft critique agent
- Supports both Isabelle/HOL and Lean4 formal languages
- Provides complete Python implementation and Jupyter Notebook examples
miniF2F-Test (Isabelle/HOL):
- GPT-4.1-mini zero-shot: Pass Rate 65.57% → 77.05% (+11.48%)
- GPT-4.1-mini few-shot: Pass Rate 76.23% → 86.48% (+10.25%)
- DeepSeek-Math few-shot: Pass Rate 29.10% → 36.48% (+7.38%)
ProofNet-Test (Lean4):
- GPT-4.1-mini zero-shot: Pass Rate 3.30% → 3.85% (+0.55%)
- GPT-4.1-mini few-shot: Pass Rate 12.09% → 14.84% (+2.75%)
miniF2F (Lean4):
- DeepSeek-Math + GPT-4.1-mini refinement: AF 38.52% → 90.57%, FC 47.54% → 79.92%
- Qwen2.5-7B + GPT-4.1-mini refinement: AF 54.51% → 93.44%, FC 62.70% → 85.25%
Figure 2 Results Show:
- Qwen2.5-7B: After 4 iterations, syntactically correct and semantically aligned proportion reaches 35.25%
- GPT-4.1-mini: After 4 iterations, syntactically correct and semantically aligned proportion reaches 61.89%
- Few-Shot Outperforms Zero-Shot: Few-shot learning significantly improves performance across all settings
- Model Capability Impact: Stronger models (GPT-4.1-mini) perform better in formal refinement
- Iterative Improvement Effectiveness: Multiple iterations continuously improve formalization quality
- Cross-Model Refinement: Stronger models can refine outputs from weaker models
- Prompting Methods: Wu et al. (2022) and others use LLM prompting for autoformalization
- Data Generation: Jiang et al. (2024) and Liu et al. (2025b) develop large-scale parallel corpora
- System Implementation: Existing systems lack modularity and scalability
- Application Domains: Operating systems, medical education, answer verification, etc.
- Reasoning Tasks: Arithmetic reasoning and general reasoning
- Autoformalization Applications: Limited research on multi-agent systems in autoformalization
- Framework Effectiveness: MASA successfully constructs a modular multi-agent autoformalization system
- Performance Improvement: Multi-agent collaboration significantly improves autoformalization accuracy
- Practical Value: The framework provides researchers with a flexible development platform
- Lack of Central Control: The system lacks a central agent for allocating and controlling different agents
- Semantic Evaluation Limitations: Semantic evaluation is limited to high-level judgments, requiring finer-grained evaluation standards
- Model Dependency: System performance largely depends on the capabilities of underlying LLMs
- Enhanced Central Control: Develop more advanced multi-agent system control mechanisms
- Refined Evaluation: Establish more fine-grained semantic evaluation standards
- Extended Applications: Apply the framework to broader formalization tasks
- Strong Innovation: First systematic multi-agent autoformalization framework
- Reasonable Design: Elegant modular architecture design, easy to extend
- Comprehensive Experiments: Covers multiple settings and evaluation metrics
- High Practical Value: Open-source implementation lowers research barriers
- Convincing Results: Validates method effectiveness across multiple datasets
- Insufficient Theoretical Analysis: Lacks theoretical analysis of multi-agent collaboration mechanisms
- Computational Cost: Insufficient analysis of computational overhead of multi-agent systems
- Error Propagation: Insufficient in-depth analysis of error propagation between agents
- Evaluation Limitations: Semantic evaluation still relies on LJM judgment, potentially introducing bias
- Academic Contribution: Provides new paradigm for autoformalization research
- Practical Value: Framework can be directly used for practical application development
- Reproducibility: Complete open-source implementation ensures reproducibility
- Advancement: Likely to promote multi-agent applications in formalization domain
- Mathematical Formalization: Suitable for autoformalization of complex mathematical theorems and definitions
- Educational Applications: Can be used for formalization training in mathematics education
- Research Tool: Provides foundational platform for autoformalization research
- Industrial Applications: Can be integrated into software systems requiring formal verification
Key references include:
- Wu et al. (2022): Autoformalization with large language models
- Zheng et al. (2022): miniF2F: a cross-system benchmark for formal olympiad-level mathematics
- Azerbayev et al. (2023): ProofNet: Autoformalizing and formally proving undergraduate-level mathematics
- Jiang et al. (2023): Draft, sketch, and prove: Guiding formal theorem provers with informal proofs
Summary: MASA is an innovative multi-agent autoformalization framework that significantly improves autoformalization effectiveness through modular design and agent collaboration. This work excels in technical innovation, experimental validation, and practical value, opening new directions for autoformalization research. Despite some limitations, its open-source implementation and good scalability make it an important contribution to the field.