2025-11-23T10:13:16.980830

Closure Properties of General Grammars -- Formally Verified

Dvorak, Blanchette
We formalized general (i.e., type-0) grammars using the Lean 3 proof assistant. We defined basic notions of rewrite rules and of words derived by a grammar, and used grammars to show closure of the class of type-0 languages under four operations: union, reversal, concatenation, and the Kleene star. The literature mostly focuses on Turing machine arguments, which are possibly more difficult to formalize. For the Kleene star, we could not follow the literature and came up with our own grammar-based construction.
academic

Closure Properties of General Grammars -- Formally Verified

Basic Information

  • Paper ID: 2302.06420
  • Title: Closure Properties of General Grammars -- Formally Verified
  • Authors: Martin Dvorak (Institute of Science and Technology Austria), Jasmin Blanchette (Ludwig-Maximilians-Universität München)
  • Classification: cs.FL (Formal Languages and Automata Theory)
  • Conference: 14th International Conference on Interactive Theorem Proving (ITP 2023)
  • Paper Link: https://arxiv.org/abs/2302.06420

Abstract

This paper formalizes general grammars (i.e., type-0 grammars) using the Lean 3 proof assistant. The authors define fundamental concepts of rewriting rules and word derivation from grammars, and prove that the class of type-0 languages is closed under four operations: union, reversal, concatenation, and Kleene star. The literature primarily focuses on Turing machine arguments, which may be more difficult to formalize. For the Kleene star operation, the authors could not follow the literature and proposed their own grammar-based construction.

Research Background and Motivation

Problem Background

  1. Importance of Formal Language Theory: Formal language concepts are central to computer science and can be recognized through multiple formalisms, including Turing machines and formal grammars
  2. Equivalence of Type-0 Grammars and Turing Machines: Both Turing machines and general grammars characterize the same class of recursively enumerable or type-0 languages
  3. Limitations of Existing Formalization Work: While substantial work exists on formalizing Turing machines in proof assistants, formalization of general grammars remains relatively scarce

Research Motivation

  1. Advantages of Grammars: General grammars are easier to define than Turing machines, and certain proofs about general grammars are considerably simpler than analogous Turing machine proofs
  2. Importance of Closure Properties: Closure properties of type-0 languages are fundamental results in formal language theory
  3. Need for Formal Verification: Machine-checked rigorous proofs are needed to ensure the correctness of these foundational results

Core Contributions

  1. First Complete Formalization of General Grammars: Comprehensive definition of type-0 grammar concepts and operations in Lean 3
  2. Formal Proofs of Four Closure Properties:
    • Union closure
    • Reversal closure
    • Concatenation closure
    • Kleene star closure
  3. Novel Kleene Star Construction: Due to the absence of grammar-based constructions in the literature, the authors developed their own syntax-based construction method
  4. Reusable Abstract Framework: Development of the lifted_grammar structure to reduce code duplication and provide generic proof patterns
  5. Open-Source Lean Codebase of Approximately 12,500 Lines: Complete formalization implementation available for community use

Methodology Details

Fundamental Definition Structures

Symbol System

inductive symbol (T : Type) (N : Type)
| terminal : T → symbol  
| nonterminal : N → symbol

Grammar Rule Representation

structure grule (T : Type) (N : Type) :=
( input_L : list (symbol T N))
( input_N : N)  
( input_R : list (symbol T N))
( output_string : list (symbol T N))

Grammar Definition

structure grammar (T : Type) :=
(nt : Type)
(initial : nt)
(rules : list (grule T nt))

Core Operation Definitions

Grammar Transformation Relation

def grammar_transforms (g : grammar T) (w1 w2 : list (symbol T g.nt)) : Prop :=
∃ r : grule T g.nt,
  r ∈ g.rules ∧
  ∃ u v : list (symbol T g.nt),
    w1 = u ++ r.input_L ++ [symbol.nonterminal r.input_N] ++ r.input_R ++ v ∧
    w2 = u ++ r.output_string ++ v

Derivation Relation

def grammar_derives (g : grammar T) : 
  list (symbol T g.nt) → list (symbol T g.nt) → Prop :=
relation.refl_trans_gen (grammar_transforms g)

Technical Innovations

1. Lifted Grammar Abstract Framework

To reduce code duplication, the authors developed an abstract structure:

  • Contains a smaller grammar g0 and a larger grammar g
  • Provides lift_nt and sink_nt functions for conversion between different nonterminal types
  • Ensures injectivity and correctness of corresponding rules

2. Innovative Handling of Concatenation

Traditional context-free grammar concatenation constructions fail for general grammars. The authors' solution:

  • Creates proxy nonterminals for each terminal
  • Ensures complete separation of nonterminals used by g1 and g2
  • Avoids string matching problems across concatenation boundaries

3. Original Kleene Star Construction

Due to the absence of grammar-based constructions in the literature, the authors invented a novel approach:

  • Introduces separator # to construct isolated "compartments" for words
  • Uses a cleaner R to traverse from beginning to end and remove separators
  • New rule set: P* = P ∪ {Z → ZS#, Z → R#, R# → R, R# → ε} ∪ {Rt → tR | t ∈ T}

Experimental Setup

Formalization Environment

  • Proof Assistant: Lean 3
  • Mathematical Library: mathlib
  • Code Scale: Approximately 12,500 lines of well-formatted Lean code
  • Metaprogramming: Utilizes Lean's metaprogramming framework to develop small-scale automation

Verification Methods

  • Structural Induction: Uses structural induction on derivation relations for proofs
  • Case Analysis: Exhaustive case analysis for different rule applications
  • Invariant Maintenance: Maintains critical invariants in complex proofs

Experimental Results

Main Theorems

  1. Union Closure: theorem T0_of_T0_u_T0 (L1 L2 : language T) : is_T0 L1 ∧ is_T0 L2 → is_T0 (L1 + L2)
  2. Reversal Closure: theorem T0_of_reverse_T0 (L : language T) : is_T0 L → is_T0 (reverse_lang L)
  3. Concatenation Closure: theorem T0_of_T0_c_T0 (L1 L2 : language T) : is_T0 L1 ∧ is_T0 L2 → is_T0 (L1 * L2)
  4. Kleene Star Closure: theorem T0_of_star_T0 (L : language T) : is_T0 L → is_T0 L.star

Proof Complexity Analysis

  • Union and Reversal: Relatively straightforward, primarily using standard constructions
  • Concatenation: Moderate complexity, requiring handling of boundary matching issues
  • Kleene Star: Most complex, with the star_induction lemma alone exceeding 3,000 lines of code

Byproducts

  • Context-Free Grammars: Proofs can be reused to establish closure properties of context-free languages
  • Concatenation Lemmas: theorem CF_of_CF_u_CF and similar results can be directly applied to context-free languages

Grammar Formalization

  • Context-Free Grammars: Carlson et al. (Mizar), Minamide (Isabelle/HOL), Barthwal and Norrish (HOL4), Firsov and Uustalu (Agda), Ramos (Coq)
  • General Grammars: This paper represents the first complete formalization

Other Computational Models

  • Finite Automata: Thompson and Dillies (Lean), regular expression formalization
  • Turing Machines: Implementations in multiple proof assistants, with recent work by Balbach even proving the Cook-Levin theorem
  • Lambda Calculus: Norrish (HOL4), Forster et al. (Coq)
  • Partial Recursive Functions: Norrish (HOL4), Carneiro (Lean)

Conclusions and Discussion

Main Conclusions

  1. Grammars Superior to Turing Machines: For closure property proofs, grammars may be more convenient than Turing machines
  2. Feasibility of Formalization: Complex formal language theory results can be successfully formalized in modern proof assistants
  3. Importance of Abstraction: Good abstractions (such as lifted_grammar) are crucial for large-scale formalization

Limitations

  1. Complexity Classes: Grammars cannot define important complexity classes (such as P), still requiring models like Turing machines
  2. Constructivity: The authors did not attempt to make the development constructive
  3. Intersection Closure: Intersection closure was not formalized due to lack of knowledge of elegant grammar-based constructions

Future Directions

  1. Complete Chomsky Hierarchy: Incorporate context-sensitive, context-free, and regular grammars into the library
  2. Equivalence Proofs: Attempt to prove equivalence between general grammars and Turing machines
  3. Integration with mathlib: Connect mathlib results on regular languages to this library

In-Depth Evaluation

Strengths

  1. Pioneering Work: First complete formalization of general grammars, filling an important gap
  2. Technical Innovation: The original Kleene star construction demonstrates profound theoretical expertise
  3. Engineering Quality: 12,500 lines of high-quality code with excellent abstract design
  4. Theoretical Contribution: Demonstrates that grammar-based approaches are superior to Turing machine methods in certain cases
  5. Reusability: Abstractions like lifted_grammar provide foundations for subsequent work

Weaknesses

  1. Limited Expressiveness: Cannot handle important concepts in complexity theory
  2. Missing Constructivity: Does not address requirements of constructive mathematics
  3. Incompleteness: Lacks treatment of important operations such as intersection
  4. Documentation: Explanations of certain technical details could be more thorough

Impact

  1. Theoretical Significance: Establishes foundations for machine verification of formal language theory
  2. Methodological Value: Demonstrates best practices for large-scale formalization projects
  3. Community Contribution: Open-source codebase will facilitate related research
  4. Educational Value: Serves as an excellent case study for learning formalization methods

Applicable Scenarios

  1. Theoretical Computer Science: Research in language theory and automata theory
  2. Formal Mathematics: Mathematical research requiring rigorous proofs
  3. Compiler Verification: Correctness assurance for parsing and language processing
  4. Education: Supplementary materials for formal language courses

References

The paper cites 26 important references, covering:

  • Classical textbooks: Aho & Ullman's parsing theory, Hopcroft et al.'s automata theory
  • Formalization work: Implementations of computational models in various proof assistants
  • Tool documentation: Relevant materials for Lean 3 and mathlib

Summary: This is a high-quality theoretical computer science paper that makes important technical contributions and provides valuable experience in formalization methodology. The authors' work establishes a solid foundation for constructing a complete formalized Chomsky hierarchy and holds significant value for both the formal language theory community and the proof assistant community.