We present a method for generating large numbers of isomorphic physics problems using generative AI services such as ChatGPT, through prompt chaining and tool use. This approach enables precise control over structural variations, such as numeric values and spatial relations, while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against two simpler prompt-based approaches. Results show that prompt chaining produces significantly higher-quality and more consistent outputs than simpler, non-chaining prompts. We also show that GenAI services can be used to validate the quality of the generated isomorphic problems. This work demonstrates a promising method for efficient and scalable problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.
- Paper ID: 2508.14755
- Title: Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use
- Author: Zhongzhou Chen (University of Central Florida)
- Classification: physics.ed-ph cs.AI
- Publication Year: 2025
- Paper Link: https://arxiv.org/abs/2508.14755
This paper proposes a method for generating large quantities of isomorphic physics problems using generative AI services (such as ChatGPT) through prompt-chaining and tool use. The method enables precise control over structural variations (such as numerical values and spatial relationships) while supporting diverse contextual variations in problem content. By leveraging a Python code interpreter, the method supports automated solution verification and simple diagram generation, addressing key limitations of existing LLM-based approaches. Two exemplary isomorphic problem repositories were generated and compared with two simpler prompt-based methods. Results demonstrate that outputs produced by prompt-chaining exhibit significantly higher and more consistent quality.
This research addresses the challenge of isomorphic physics problem generation in educational contexts. Isomorphic problems are those that assess the same underlying concepts and principles but differ in surface characteristics. Such problems hold significant value in personalized assessment, repeated testing, and deliberate practice.
- Growing Educational Demand: With the development of personalized learning and adaptive testing, there is a need for large quantities of high-quality isomorphic problems
- Limitations of Traditional Methods: Template-based approaches are costly to develop and require specialized programming
- Assessment Quality Control: Precise control over problem difficulty and structure is needed while maintaining novelty
- Early AQG/AIG Methods: Primarily rely on hard-coded templates, time-consuming to develop and requiring domain-specific programming
- Direct LLM Application: Difficult to control difficulty and cognitive complexity; often defaults to factual recall questions
- Numerical Computation Issues: LLMs tend to hallucinate on numerical problems, producing incorrect answers
- Diagram Generation Difficulties: Existing LLMs have limited ability to precisely control visual elements
- Proposed a prompt-chaining and tool-use based method for isomorphic problem generation, achieving precise control over structural variations and contextual diversity
- Developed a seven-step generation workflow that systematically separates construct-relevant and construct-irrelevant variations
- Implemented automated solution verification and diagram generation through Python code interpreter, addressing key LLM limitations
- Constructed two exemplary problem repositories with systematic comparison, demonstrating method effectiveness
- Demonstrated the feasibility of GenAI services for quality verification, establishing a complete generation-verification feedback loop
Input: Template problem or problem type
Output: Large quantities of isomorphic physics problems, including problem statements, solutions, and (optionally) diagrams
Constraints:
- Maintain identical cognitive difficulty and physics concepts
- Precisely control structural variations (numerical values, spatial relationships, etc.)
- Support diverse contextual variations
- Template Identification: Identify the template problem or problem type
- Component Decomposition: Identify various components of the problem
- Variation Definition: Define structural and contextual variations and their constraints
- Prompt-Chain Design: Design prompt chains for generating component variations
- Execution Optimization: Execute prompt chains and iteratively improve
- Output Combination: Combine components into complete problems and format
- Quality Verification: Use GenAI to verify correctness of generated results
Structural Variations:
- Construct-relevant changes to the core structure of the problem
- Must remain within precisely defined user-specified ranges
- Include numerical values, spatial arrangements, object quantities, etc.
- Implemented through combination of LLM generation and Python interpreter tools
Contextual Variations:
- Changes in surface characteristics of problems
- Less constrained but requiring LLM creativity
- Consider student reading level, language proficiency, cultural background, etc.
- Primarily implemented through LLM generative capabilities
- Prompt-Chaining Technology: Decompose complex tasks into multiple subtasks, executed through chained prompts, overcoming limitations of single prompts
- Tool Use Integration: Leverage Python code interpreter for numerical computation, constraint checking, and diagram generation
- Variation Type Separation: Systematically distinguish and independently handle structural and contextual variations
- Data Table Transmission: Use tabular format to store and transmit information within prompt chains, improving reliability
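The table-passing idea can be sketched as follows. This is a minimal illustration, not the paper's actual prompts: `call_llm` stands for whatever GenAI service is used, `fake_llm` is a hypothetical offline stub so the example runs without network access, and the column names and instructions are invented for illustration.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]


def table_to_csv(rows: List[Row]) -> str:
    """Serialize the shared table so it can be pasted into the next prompt."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_table(text: str) -> List[Row]:
    """Parse the CSV returned by the model back into rows."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def run_chain(rows: List[Row], instructions: List[str],
              call_llm: Callable[[str], str]) -> List[Row]:
    """Run one prompt per instruction; each step reads and rewrites the whole table."""
    for instruction in instructions:
        prompt = f"{instruction}\n\nReturn the full table as CSV.\n\n{table_to_csv(rows)}"
        rows = csv_to_table(call_llm(prompt))
    return rows


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GenAI call: counts how many steps touched each row."""
    rows = csv_to_table(prompt.split("\n\n")[-1])
    for row in rows:
        row["steps_applied"] = str(int(row.get("steps_applied", "0")) + 1)
    return table_to_csv(rows)


seed_table = [{"id": "1", "context": "warehouse"}, {"id": "2", "context": "farm"}]
steps = ["Add numerical values.", "Write the problem statement.", "Write the solution."]
final_table = run_chain(seed_table, steps, fake_llm)
```

Because every step receives and returns the complete table, an error in one step can be located by inspecting the table after that step, which is one plausible reason a tabular hand-off is more reliable than free-form text.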
Problem Repository 1:
- Template: Object pushed/pulled by inclined force on rough surface, moving at constant velocity
- Structural Variations: Force direction and nature, variable numerical values, unknown variable selection
- Constraints: Angles 10-60 degrees, horizontal force component balances kinetic friction
- Prompt Chain: 5 prompts generating context → numerical values → problem statement → solution → formatting
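For the first template (an object pushed or pulled at constant velocity by an inclined force on a rough surface), the kind of numerical generation and checking that a code interpreter could perform might look like the sketch below. The mass and friction ranges, the function names, and the choice of applied force as the unknown are assumptions made for illustration; only the 10 to 60 degree angle range and the force-balance condition come from the summary above.

```python
import math
import random

G = 9.8  # gravitational acceleration in m/s^2


def required_force(mass_kg: float, mu_k: float, angle_deg: float, mode: str) -> float:
    """Applied force that keeps the object at constant velocity.

    Pulling (force angled upward) reduces the normal force: N = m*g - F*sin(theta).
    Pushing (force angled downward) increases it:          N = m*g + F*sin(theta).
    Setting F*cos(theta) = mu_k * N and solving for F gives the expression below.
    """
    theta = math.radians(angle_deg)
    sign = 1.0 if mode == "pull" else -1.0
    denominator = math.cos(theta) + sign * mu_k * math.sin(theta)
    if denominator <= 0:
        raise ValueError("no constant-velocity solution for these values")
    return mu_k * mass_kg * G / denominator


def generate_variant(rng: random.Random) -> dict:
    """Draw one set of values inside the allowed ranges (ranges other than angle are assumed)."""
    mode = rng.choice(["push", "pull"])
    angle_deg = rng.randint(10, 60)            # constraint stated in the summary
    mass_kg = rng.randint(5, 50)               # assumed range
    mu_k = round(rng.uniform(0.10, 0.50), 2)   # assumed range; keeps pushing solvable up to 60 degrees
    return {
        "mode": mode,
        "angle_deg": angle_deg,
        "mass_kg": mass_kg,
        "mu_k": mu_k,
        "force_N": required_force(mass_kg, mu_k, angle_deg, mode),
    }


def check_variant(v: dict) -> bool:
    """Independently verify that the horizontal force component balances kinetic friction."""
    theta = math.radians(v["angle_deg"])
    vertical = v["force_N"] * math.sin(theta)
    weight = v["mass_kg"] * G
    normal = weight - vertical if v["mode"] == "pull" else weight + vertical
    horizontal = v["force_N"] * math.cos(theta)
    return normal > 0 and math.isclose(horizontal, v["mu_k"] * normal, rel_tol=1e-9)


rng = random.Random(0)
variants = [generate_variant(rng) for _ in range(10)]
```

Computing the answer in code and then re-checking the force balance separately mirrors the motivation for tool use: the numbers are calculated rather than recalled by the language model.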
Problem Repository 2:
- Template: Projectile motion trajectory comparison, same starting point with different heights and ranges
- Structural Variations: Answer relationships, trajectory parameters, distractor design
- Constraints: No visual overlap, relationship certainty, sufficient visual differentiation
- Prompt Chain: 9 prompts handling more complex structural variations and diagram generation
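For the projectile template, a trajectory launched from the origin is fully determined by its peak height and horizontal range, which makes it straightforward to compute points for plotting and to compare flight quantities. The sketch below assumes ideal projectile motion without air resistance; the function names and the `min_ratio` threshold for "sufficient visual differentiation" are illustrative assumptions, not values from the paper.

```python
import math
from typing import List, Tuple

G = 9.8  # gravitational acceleration in m/s^2


def height_at(x: float, peak_height: float, horizontal_range: float) -> float:
    """Parabola through (0, 0) and (R, 0) with maximum H at x = R/2."""
    return 4.0 * peak_height * x * (horizontal_range - x) / horizontal_range ** 2


def flight_time(peak_height: float) -> float:
    """Total time in the air; it depends only on the peak height."""
    return math.sqrt(8.0 * peak_height / G)


def launch_speed(peak_height: float, horizontal_range: float) -> float:
    """Initial speed from the vertical and horizontal velocity components."""
    v_y = math.sqrt(2.0 * G * peak_height)
    v_x = horizontal_range / flight_time(peak_height)
    return math.hypot(v_x, v_y)


def sample_points(peak_height: float, horizontal_range: float,
                  n: int = 50) -> List[Tuple[float, float]]:
    """Points from launch to landing, suitable for passing to a plotting library."""
    xs = [horizontal_range * i / (n - 1) for i in range(n)]
    return [(x, height_at(x, peak_height, horizontal_range)) for x in xs]


def clearly_different(a: float, b: float, min_ratio: float = 1.25) -> bool:
    """Assumed readability rule: two positive values differ by at least min_ratio."""
    return max(a, b) / min(a, b) >= min_ratio


points = sample_points(peak_height=4.9, horizontal_range=20.0)
```

Sampling only between launch and landing keeps every plotted point at or above ground level, which avoids the kind of underground trajectory reported for the single-prompt baseline.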
- Single Prompt Method: Consolidate prompt chain into one or two prompts
- Simple Prompt Method: Simplified prompts based on single examples (for Problem Repository 1 only)
- Output Quality: Problem completeness, numerical accuracy, format consistency
- Structural Control: Degree of constraint compliance
- Contextual Diversity: Variation in scenarios and descriptions
- Answer Correctness: Accuracy verified through GenAI
- Successful Generation: 20 isomorphic problems (10 from GPT-4o and 10 from Gemini 2.5 Pro)
- Quality Control: Each problem features unique background story, appropriate random values, correct answers
- Example Problem: Worker pushing wooden box problem, including complete physical parameters and solution
- Systematic Generation: 26 variations (13 possible relationships × 2 main distractors)
- Diagram Quality: Parabolic trajectory diagrams automatically generated by Python, clearly distinguishable
- Problem Completeness: Each problem includes situational description, diagram, and four choice options
Problem Repository 1:
- Single Prompt Defects: Completely ignored numerical generation instructions; all 10 versions lacked numerical values
- Prompt-Chain Advantages: Precisely followed all constraints, generated complete problems
Problem Repository 2:
- Single Prompt Issues: Trajectories appeared underground or invisible
- Insufficient Generation: Only 7 scenarios and 13 combinations generated, versus expected 10 scenarios and 26 combinations
- Answer Accuracy: Simple prompt-generated answers mostly incorrect (e.g., 140 kg vs. correct answer 148.6 kg)
- Tool Usage: Simple prompt did not activate Python tools, directly hallucinating answers
- Text Quality: Text generated by simple prompt noticeably shorter with reduced quality
- Problem Repository 1: GenAI identified and corrected 6 formula derivation errors (out of 20 problems)
- Problem Repository 2: Identified 3 problems where distractors were equivalent to correct answers
- Student Verification: Problem repository used in midterm exam; students reported no additional errors
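The distractor issue found in the second repository (a distractor that is logically the same as the correct answer) can also be screened for deterministically when answer choices are stored in a structured form. The sketch below is a complement to the GenAI-based check described above, not a reproduction of it; the `(left, relation, right)` representation and the quantity labels are hypothetical.

```python
from typing import List, Tuple

# Hypothetical structured form of a comparison statement,
# e.g. ("t_A", ">", "t_B") for "A stays in the air longer than B".
Statement = Tuple[str, str, str]

FLIP = {">": "<", "<": ">", "=": "="}


def canonical(statement: Statement) -> Statement:
    """Order the operands alphabetically, flipping the relation if they are swapped."""
    left, relation, right = statement
    if left > right:
        return (right, FLIP[relation], left)
    return (left, relation, right)


def equivalent_distractors(correct: Statement,
                           distractors: List[Statement]) -> List[Statement]:
    """Return every distractor that says the same thing as the correct answer."""
    target = canonical(correct)
    return [d for d in distractors if canonical(d) == target]


correct_answer = ("t_A", ">", "t_B")
choices = [("t_B", "<", "t_A"), ("t_A", "<", "t_B"), ("t_A", "=", "t_B")]
flagged = equivalent_distractors(correct_answer, choices)
```

Here the first distractor is flagged because "B is shorter than A" restates "A is longer than B". A rule-based check like this only catches equivalences it has been told about, so it would supplement rather than replace a GenAI or human review.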
- Early Methods: Hard-coded templates with high development costs
- LLM Applications: Dijkstra et al. trained GPT-3 to generate multiple-choice questions; Chan et al. used GPT-3.5/4 to generate STEM problems
- Isomorphic Problems: Arendasy and Sommer generated algebra problems through templates; Norberg et al. used GPT-4 to rewrite mathematical problem explanations
- Traditional AIG: Precise control but lacking creativity
- Direct LLM Application: Strong creativity but difficult control
- This Paper's Method: Combines advantages of both, achieving balance between precise control and creativity
- Prompt-Chaining Significantly Outperforms Single Prompts: Demonstrates superior performance in quality consistency and constraint compliance
- Tool Use is Critical: Python interpreter resolves key issues in numerical computation and diagram generation
- GenAI Quality Verification is Effective: Successfully identifies and corrects errors in the generation process
- Method is Scalable: Can generate nearly unlimited quantities of isomorphic problems
- Single-Rater Quality Assessment: Output quality was evaluated only by the author, without independent systematic review
- Unknown Psychometric Properties: Lacks student test data to evaluate psychometric properties of isomorphic problems
- Limited Contextual Control: Primarily focuses on structural variations with limited control over contextual variations
- Diagram Complexity Constraints: Supports only simple diagram generation
- Systematic Quality Assessment: Conduct more comprehensive quality review and student testing
- Fine-Grained Contextual Control: Explore control of contextual variations such as different writing styles
- Complex Diagram Generation: Extend to more complex diagram types
- Automated Prompt-Chain Design: Use GenAI to assist in prompt-chain design
- Real-Time Generation System: Implement instant problem generation for fully personalized assessment
- Strong Method Innovation: First systematic combination of prompt-chaining and tool use for isomorphic problem generation
- High Practical Value: Provides accessible and efficient problem creation method for ordinary teachers
- Well-Designed Experiments: Two different problem types validate method generalizability
- Detailed Technical Implementation: Provides complete prompt chains and implementation details with strong reproducibility
- Complete Quality Control: Establishes complete generation-verification feedback loop
- Limited Evaluation Scope: Validated only on two problem types in physics discipline
- Relatively Small Scale: Generated problem quantities are limited (20+26 problems)
- Missing Cost Analysis: Lacks cost-benefit comparison with traditional methods
- Insufficient User Research: Lacks teacher and student experience studies
- Disciplinary Contribution: Provides new problem generation paradigm for educational technology field
- Practical Value: Directly applicable to personalized learning and adaptive testing
- Technical Demonstration: Showcases possibilities for precise LLM control in educational applications
- Method Generalizability: Technical framework extensible to other disciplines and problem types
- Personalized Learning Platforms: Provide unlimited practice problems for students
- Adaptive Testing Systems: Generate equivalent alternative problems of comparable difficulty
- Teacher Assistance Tools: Help teachers rapidly create high-quality problem repositories
- Online Education Platforms: Support large-scale personalized content generation
The paper cites 14 relevant references covering key works in automatic question generation, isomorphic problem creation, and LLM applications, providing solid theoretical foundation for the research.
Overall Assessment: This is a high-quality applied research paper that makes important contributions at the intersection of educational technology and AI applications. The method is novel and practical, the experimental design is sound, and the results are convincing. While there is room for improvement in evaluation scale and disciplinary coverage, the work points to important directions for the field.