2025-11-25T12:37:17.809472

Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use

Chen
We present a method for generating large numbers of isomorphic physics problems using generative AI services such as ChatGPT, through prompt chaining and tool use. This approach enables precise control over structural variations, such as numeric values and spatial relations, while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against two simpler prompt-based approaches. Results show that prompt-chaining produces significantly higher quality and more consistent outputs than simpler, non-chaining prompts. We also show that GenAI services can be used to validate the quality of the generated isomorphic problems. This work demonstrates a promising method for efficient and scalable problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.

Basic Information

  • Paper ID: 2508.14755
  • Title: Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use
  • Author: Zhongzhou Chen (University of Central Florida)
  • Classification: physics.ed-ph cs.AI
  • Publication Year: 2025
  • Paper Link: https://arxiv.org/abs/2508.14755

Abstract

This paper proposes a method for generating large quantities of isomorphic physics problems using generative AI services (such as ChatGPT) through prompt-chaining and tool use. The method enables precise control over structural variations (such as numerical values and spatial relationships) while supporting diverse contextual variations in problem content. By leveraging a Python code interpreter, the method supports automated solution verification and simple diagram generation, addressing key limitations of existing LLM-based approaches. Two exemplary isomorphic problem repositories were generated and compared with two simpler prompt-based methods. Results demonstrate that outputs produced by prompt-chaining exhibit significantly higher and more consistent quality.

Research Background and Motivation

Research Problem

This research addresses the challenge of isomorphic physics problem generation in educational contexts. Isomorphic problems are those that assess the same underlying concepts and principles but differ in surface characteristics. Such problems hold significant value in personalized assessment, repeated testing, and deliberate practice.

Problem Significance

  1. Growing Educational Demand: With the development of personalized learning and adaptive testing, there is a need for large quantities of high-quality isomorphic problems
  2. Limitations of Traditional Methods: Template-based approaches are costly to develop and require specialized programming
  3. Assessment Quality Control: Precise control over problem difficulty and structure is needed while maintaining novelty

Limitations of Existing Approaches

  1. Early AQG/AIG Methods: Early automatic question generation (AQG) and automatic item generation (AIG) methods rely primarily on hard-coded templates, which are time-consuming to develop and require domain-specific programming
  2. Direct LLM Application: Difficult to control difficulty and cognitive complexity; often defaults to factual recall questions
  3. Numerical Computation Issues: LLMs tend to hallucinate on numerical problems, producing incorrect answers
  4. Diagram Generation Difficulties: Existing LLMs have limited ability to precisely control visual elements

Core Contributions

  1. Proposed a prompt-chaining and tool-use based method for isomorphic problem generation, achieving precise control over structural variations and contextual diversity
  2. Developed a seven-step generation workflow that systematically separates construct-relevant (structural) and construct-irrelevant (contextual) variations
  3. Implemented automated solution verification and diagram generation through Python code interpreter, addressing key LLM limitations
  4. Constructed two exemplary problem repositories with systematic comparison, demonstrating method effectiveness
  5. Demonstrated the feasibility of GenAI services for quality verification, establishing a complete generation-verification feedback loop

Methodology Details

Task Definition

Input: Template problem or problem type

Output: Large quantities of isomorphic physics problems, including problem statements, solutions, and (optionally) diagrams

Constraints:

  • Maintain identical cognitive difficulty and physics concepts
  • Precisely control structural variations (numerical values, spatial relationships, etc.)
  • Support diverse contextual variations

Core Method Architecture

Seven-Step Generation Workflow

  1. Template Identification: Identify the template problem or problem type
  2. Component Decomposition: Identify various components of the problem
  3. Variation Definition: Define structural and contextual variations and their constraints
  4. Prompt-Chain Design: Design prompt chains for generating component variations
  5. Execution Optimization: Execute prompt chains and iteratively improve
  6. Output Combination: Combine components into complete problems and format
  7. Quality Verification: Use GenAI to verify correctness of generated results
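
The chaining idea behind steps 4 to 6 can be sketched as a loop in which each prompt receives the outputs of all earlier prompts. This is a minimal sketch under stated assumptions, not the author's implementation: `call_llm` is a hypothetical stub standing in for a GenAI service call, and the step names are illustrative labels modelled on the five-prompt chain described for Problem Repository 1.

```python
from typing import Callable, Dict, List


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GenAI service call; returns a canned reply."""
    return f"[model output for: {prompt.splitlines()[0]}]"


def run_chain(steps: List[str], llm: Callable[[str], str] = call_llm) -> Dict[str, str]:
    """Run prompts in order, appending every earlier output to the next prompt."""
    outputs: Dict[str, str] = {}
    for step in steps:
        # Earlier results are passed forward as plain "name: text" lines.
        context = "\n".join(f"{name}: {text}" for name, text in outputs.items())
        prompt = f"{step}\n{context}" if context else step
        outputs[step] = llm(prompt)
    return outputs


# Illustrative step labels (assumed, not the paper's actual prompts).
steps = [
    "Generate context",
    "Generate numerical values",
    "Write problem statement",
    "Write solution",
    "Format output",
]
results = run_chain(steps)
```

Passing a different `llm` callable makes it possible to swap the stub for a real service or to record the prompts for inspection during step 5 (iterative improvement).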

Key Concept Distinctions

Structural Variations:

  • Core construct-relevant structural changes
  • Must remain within precisely defined user-specified ranges
  • Include numerical values, spatial arrangements, object quantities, etc.
  • Implemented through combination of LLM generation and Python interpreter tools

Contextual Variations:

  • Changes in surface characteristics of problems
  • Less tightly constrained; rely on LLM creativity
  • Consider student reading level, language proficiency, cultural background, etc.
  • Primarily implemented through LLM generative capabilities

Technical Innovations

  1. Prompt-Chaining Technology: Decompose complex tasks into multiple subtasks, executed through chained prompts, overcoming limitations of single prompts
  2. Tool Use Integration: Leverage Python code interpreter for numerical computation, constraint checking, and diagram generation
  3. Variation Type Separation: Systematically distinguish and independently handle structural and contextual variations
  4. Data Table Transmission: Use tabular format to store and transmit information within prompt chains, improving reliability
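
One way to picture the data table transmission in item 4 is a round trip through a text table that one prompt emits and the next prompt consumes. The sketch below assumes CSV as the tabular format and uses hypothetical column names (`problem_id`, `context`, `angle_deg`); the source only states that a tabular format is used.

```python
import csv
import io


def table_to_csv(rows):
    """Serialize a list of dicts to CSV text that can be pasted into the next prompt."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def csv_to_table(text):
    """Parse CSV text returned by a step into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical rows: one row per problem variant, one column per component.
rows = [
    {"problem_id": "1", "context": "worker pushes crate", "angle_deg": "25"},
    {"problem_id": "2", "context": "child pulls sled", "angle_deg": "40"},
]
csv_text = table_to_csv(rows)
restored = csv_to_table(csv_text)
```

Keeping one row per variant means a later prompt can add columns (for example, a solution) without rewriting what earlier prompts produced.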

Experimental Setup

Problem Repository Design

Problem Repository 1: Numerical Computation Problems

  • Template: Object pushed/pulled by inclined force on rough surface, moving at constant velocity
  • Structural Variations: Force direction and nature, variable numerical values, unknown variable selection
  • Constraints: Angles 10-60 degrees, horizontal force component balances kinetic friction
  • Prompt Chain: 5 prompts generating context → numerical values → problem statement → solution → formatting

Problem Repository 2: Conceptual Multiple-Choice Questions (with Diagrams)

  • Template: Projectile motion trajectory comparison, same starting point with different heights and ranges
  • Structural Variations: Answer relationships, trajectory parameters, distractor design
  • Constraints: No visual overlap, relationship certainty, sufficient visual differentiation
  • Prompt Chain: 9 prompts handling more complex structural variations and diagram generation
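
For trajectories launched and landing at the same level, a parabola with maximum height H and range R is y(x) = 4 H x (R - x) / R^2, and the time of flight depends only on H. The sketch below samples such a curve and compares flight times of two trajectories. It is a hedged illustration of the kind of computation a Python tool could perform here, not the paper's code; g = 9.8 m/s^2 and the function names are assumptions, and the sampled points would still need to be passed to a plotting library to produce a diagram.

```python
import math

G = 9.8  # m/s^2, assumed value of gravitational acceleration


def flight_time(max_height_m):
    """Time of flight on level ground; depends only on the maximum height."""
    return 2.0 * math.sqrt(2.0 * max_height_m / G)


def trajectory(max_height_m, range_m, n=50):
    """Sample n+1 points of y(x) = 4 H x (R - x) / R^2 on 0 <= x <= R."""
    points = []
    for i in range(n + 1):
        x = range_m * i / n
        y = 4.0 * max_height_m * x * (range_m - x) / range_m ** 2
        points.append((x, y))
    return points


def compare_flight_times(h_a, h_b):
    """Return '>', '<' or '=' for the flight time of A relative to B."""
    if math.isclose(h_a, h_b):
        return "="
    return ">" if h_a > h_b else "<"


def stays_above_ground(points):
    """Guard against curves dipping below y = 0 (cf. 'underground' trajectories)."""
    return all(y >= 0.0 for _, y in points)


curve_a = trajectory(max_height_m=5.0, range_m=20.0)
curve_b = trajectory(max_height_m=3.0, range_m=30.0)
relation = compare_flight_times(5.0, 3.0)
```

Because the correct answer follows from the heights alone, the answer key can be derived from the same parameters that draw the diagram, which keeps the figure and the key consistent.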

Comparison Methods

  1. Single Prompt Method: Consolidate prompt chain into one or two prompts
  2. Simple Prompt Method: Simplified prompts based on single examples (for Problem Repository 1 only)

Evaluation Metrics

  1. Output Quality: Problem completeness, numerical accuracy, format consistency
  2. Structural Control: Degree of constraint compliance
  3. Contextual Diversity: Variation in scenarios and descriptions
  4. Answer Correctness: Accuracy verified through GenAI

Experimental Results

Main Results

Problem Repository 1 Generation Performance

  • Successful Generation: 20 isomorphic problems (10 from GPT-4o + 10 from Gemini 2.5 Pro)
  • Quality Control: Each problem features unique background story, appropriate random values, correct answers
  • Example Problem: Worker pushing wooden box problem, including complete physical parameters and solution

Problem Repository 2 Generation Performance

  • Systematic Generation: 26 variations (13 possible relationships × 2 main distractors)
  • Diagram Quality: Parabolic trajectory diagrams automatically generated by Python, clearly distinguishable
  • Problem Completeness: Each problem includes situational description, diagram, and four choice options

Comparative Experimental Results

Single Prompt vs. Prompt-Chain

Problem Repository 1:

  • Single Prompt Defects: Completely ignored numerical generation instructions; all 10 versions lacked numerical values
  • Prompt-Chain Advantages: Precisely followed all constraints, generated complete problems

Problem Repository 2:

  • Single Prompt Issues: Trajectories appeared underground or invisible
  • Insufficient Generation: Only 7 scenarios and 13 combinations generated, versus expected 10 scenarios and 26 combinations

Simple Prompt vs. Prompt-Chain (Problem Repository 1)

  • Answer Accuracy: Simple prompt-generated answers mostly incorrect (e.g., 140 kg vs. correct answer 148.6 kg)
  • Tool Usage: Simple prompt did not activate Python tools, directly hallucinating answers
  • Text Quality: Text generated by simple prompt noticeably shorter with reduced quality
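
A mismatch such as 140 kg against 148.6 kg is easy to flag once an independently computed value is available. The sketch below is a simple deterministic tolerance check, which is a different technique from the GenAI-based verification described in this summary. The 1% tolerance, the record layout, and the function names are assumptions; the second record is a made-up passing case for contrast.

```python
import math


def answer_matches(reported, recomputed, rel_tol=0.01):
    """True if the reported answer agrees with the recomputed one within rel_tol."""
    return math.isclose(reported, recomputed, rel_tol=rel_tol)


def audit(problems, rel_tol=0.01):
    """Return the ids of problems whose reported answer disagrees with the recomputed value."""
    return [
        p["id"]
        for p in problems
        if not answer_matches(p["reported"], p["recomputed"], rel_tol)
    ]


problems = [
    # Values quoted in the text: simple-prompt answer vs. correct answer (kg).
    {"id": "simple-prompt-example", "reported": 140.0, "recomputed": 148.6},
    # Hypothetical record that differs only by rounding.
    {"id": "rounding-only-example", "reported": 148.6, "recomputed": 148.58},
]
flagged = audit(problems)
```

A relative tolerance is used so that ordinary rounding in the reported answer is not treated as an error.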

Quality Verification Results

  • Problem Repository 1: GenAI identified and corrected 6 formula derivation errors (out of 20 problems)
  • Problem Repository 2: Identified 3 problems where distractors were equivalent to correct answers
  • Student Verification: Problem repository used in midterm exam; students reported no additional errors

Related Work

Development of Automatic Question Generation (AQG)

  1. Early Methods: Hard-coded templates with high development costs
  2. LLM Applications: Dijkstra et al. trained GPT-3 to generate multiple-choice questions; Chan et al. used GPT-3.5/4 to generate STEM problems
  3. Isomorphic Problems: Arendasy and Sommer generated algebra problems through templates; Norberg et al. used GPT-4 to rewrite mathematical problem explanations

Technical Method Comparison

  • Traditional AIG: Precise control but lacking creativity
  • Direct LLM Application: Strong creativity but difficult control
  • This Paper's Method: Combines advantages of both, achieving balance between precise control and creativity

Conclusions and Discussion

Main Conclusions

  1. Prompt-Chaining Significantly Outperforms Single Prompts: Demonstrates superior performance in quality consistency and constraint compliance
  2. Tool Use is Critical: Python interpreter resolves key issues in numerical computation and diagram generation
  3. GenAI Quality Verification is Effective: Successfully identifies and corrects errors in the generation process
  4. Method is Scalable: Can, in principle, generate very large numbers of isomorphic problems

Limitations

  1. Single-Rater Quality Assessment: Outputs were evaluated only by the author, without a systematic independent quality review
  2. Unknown Psychometric Properties: Lacks student test data to evaluate psychometric properties of isomorphic problems
  3. Limited Contextual Control: Primarily focuses on structural variations with limited control over contextual variations
  4. Diagram Complexity Constraints: Supports only simple diagram generation

Future Directions

  1. Systematic Quality Assessment: Conduct more comprehensive quality review and student testing
  2. Fine-Grained Contextual Control: Explore control of contextual variations such as different writing styles
  3. Complex Diagram Generation: Extend to more complex diagram types
  4. Automated Prompt-Chain Design: Use GenAI to assist in prompt-chain design
  5. Real-Time Generation System: Implement instant problem generation for fully personalized assessment

In-Depth Evaluation

Strengths

  1. Strong Method Innovation: First systematic combination of prompt-chaining and tool use for isomorphic problem generation
  2. High Practical Value: Provides accessible and efficient problem creation method for ordinary teachers
  3. Well-Designed Experiments: Two different problem types validate method generalizability
  4. Detailed Technical Implementation: Provides complete prompt chains and implementation details with strong reproducibility
  5. Complete Quality Control: Establishes complete generation-verification feedback loop

Weaknesses

  1. Limited Evaluation Scope: Validated only on two problem types in physics discipline
  2. Relatively Small Scale: Generated problem quantities are limited (20+26 problems)
  3. Missing Cost Analysis: Lacks cost-benefit comparison with traditional methods
  4. Insufficient User Research: Lacks teacher and student experience studies

Impact

  1. Disciplinary Contribution: Provides new problem generation paradigm for educational technology field
  2. Practical Value: Directly applicable to personalized learning and adaptive testing
  3. Technical Demonstration: Showcases possibilities for precise LLM control in educational applications
  4. Method Generalizability: Technical framework extensible to other disciplines and problem types

Applicable Scenarios

  1. Personalized Learning Platforms: Provide unlimited practice problems for students
  2. Adaptive Testing Systems: Generate equivalent alternative problems of comparable difficulty
  3. Teacher Assistance Tools: Help teachers rapidly create high-quality problem repositories
  4. Online Education Platforms: Support large-scale personalized content generation

References

The paper cites 14 relevant references covering key works in automatic question generation, isomorphic problem creation, and LLM applications, providing solid theoretical foundation for the research.


Overall Assessment: This is a high-quality applied research paper that makes important contributions at the intersection of educational technology and AI applications. The method is novel and practical, the experimental design is sound, and the results are convincing. While there is room for improvement in evaluation scale and disciplinary coverage, the work points to important directions for the field.