Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use

Chen

We present a method for generating large numbers of isomorphic physics problems using generative AI services such as ChatGPT, through prompt chaining and tool use. This approach enables precise control over structural variations, such as numeric values and spatial relations, while supporting diverse contextual variations in the problem body. By utilizing the Python code interpreter, the method supports automatic solution validation and simple diagram generation, addressing key limitations in existing LLM-based methods. We generated two example isomorphic problem banks and compared the outcome against two simpler prompt-based approaches. Results show that prompt-chaining produces significantly higher quality and more consistent outputs than simpler, non-chaining prompts. We also show that GenAI services can be used to validate the quality of the generated isomorphic problems. This work demonstrates a promising method for efficient and scalable problem creation accessible to the average instructor, which opens new possibilities for personalized adaptive testing and automated content development.

Basic Information

Paper ID: 2508.14755
Title: Reliable generation of isomorphic physics problems using Generative AI with prompt-chaining and tool use
Author: Zhongzhou Chen (University of Central Florida)
Classification: physics.ed-ph cs.AI
Publication Year: 2025
Paper Link: https://arxiv.org/abs/2508.14755

Abstract

This paper proposes a method for generating large quantities of isomorphic physics problems using generative AI services (such as ChatGPT) through prompt-chaining and tool use. The method enables precise control over structural variations (such as numerical values and spatial relationships) while supporting diverse contextual variations in problem content. By leveraging a Python code interpreter, the method supports automated solution verification and simple diagram generation, addressing key limitations of existing LLM-based approaches. Two exemplary isomorphic problem repositories were generated and compared with two simpler prompt-based methods. Results demonstrate that outputs produced by prompt-chaining exhibit significantly higher and more consistent quality.

Research Background and Motivation

Research Problem

This research addresses the challenge of isomorphic physics problem generation in educational contexts. Isomorphic problems are those that assess the same underlying concepts and principles but differ in surface characteristics. Such problems hold significant value in personalized assessment, repeated testing, and deliberate practice.

Problem Significance

Growing Educational Demand: With the development of personalized learning and adaptive testing, there is a need for large quantities of high-quality isomorphic problems
Limitations of Traditional Methods: Template-based approaches are costly to develop and require specialized programming
Assessment Quality Control: Precise control over problem difficulty and structure is needed while maintaining novelty

Limitations of Existing Approaches

Early AQG/AIG Methods: Rely primarily on hard-coded templates that are time-consuming to develop and require domain-specific programming
Direct LLM Application: Difficult to control difficulty and cognitive complexity; often defaults to factual recall questions
Numerical Computation Issues: LLMs tend to hallucinate on numerical problems, producing incorrect answers
Diagram Generation Difficulties: Existing LLMs have limited ability to precisely control visual elements

Core Contributions

Proposed a prompt-chaining and tool-use based method for isomorphic problem generation, achieving precise control over structural variations and contextual diversity
Developed a seven-step generation workflow that systematically separates construct-relevant and construct-irrelevant variations
Implemented automated solution verification and diagram generation through Python code interpreter, addressing key LLM limitations
Constructed two exemplary problem repositories with systematic comparison, demonstrating method effectiveness
Demonstrated the feasibility of GenAI services for quality verification, establishing a complete generation-verification feedback loop

Methodology Details

Task Definition

Input: Template problem or problem type
Output: Large quantities of isomorphic physics problems, including problem statements, solutions, and (optionally) diagrams
Constraints:

Maintain identical cognitive difficulty and physics concepts
Precisely control structural variations (numerical values, spatial relationships, etc.)
Support diverse contextual variations

Core Method Architecture

Seven-Step Generation Workflow

Template Identification: Identify the template problem or problem type
Component Decomposition: Identify various components of the problem
Variation Definition: Define structural and contextual variations and their constraints
Prompt-Chain Design: Design prompt chains for generating component variations
Execution and Refinement: Execute the prompt chains and iteratively improve them
Output Combination: Combine components into complete problems and format
Quality Verification: Use GenAI to verify correctness of generated results
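
The chaining idea in the workflow above can be sketched as a loop in which each prompt receives the outputs of the earlier ones. This is a minimal illustration and not the author's implementation: `run_chain`, `fake_llm`, and the prompt templates are hypothetical names, and `fake_llm` merely stands in for a call to a real GenAI service.

```python
from typing import Callable, Dict, List

def run_chain(steps: List[Dict[str, str]], llm: Callable[[str], str]) -> Dict[str, str]:
    """Run prompts in order; each template may reference earlier outputs by name."""
    state: Dict[str, str] = {}
    for step in steps:
        # Fill the template with the results produced so far.
        prompt = step["template"].format(**state)
        state[step["name"]] = llm(prompt)
    return state

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GenAI service: echoes the prompt it received."""
    return f"<reply to: {prompt}>"

# Hypothetical three-step chain: context, then values, then the problem text.
steps = [
    {"name": "context", "template": "Propose a scenario for a box pushed on a rough floor."},
    {"name": "values", "template": "Pick numeric values consistent with: {context}"},
    {"name": "problem", "template": "Write the problem using {context} and {values}"},
]

result = run_chain(steps, fake_llm)
```

Because every step is a separate call, a failing step can be revised and rerun on its own, which is what makes the iterative improvement in the workflow practical.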

Key Concept Distinctions

Structural Variations:

Changes to the construct-relevant structure of the problem
Must remain within precisely defined user-specified ranges
Include numerical values, spatial arrangements, object quantities, etc.
Implemented through combination of LLM generation and Python interpreter tools

Contextual Variations:

Changes in surface characteristics of problems
Less tightly constrained, but dependent on LLM creativity
Can take into account student reading level, language proficiency, cultural background, etc.
Primarily implemented through LLM generative capabilities

Technical Innovations

Prompt-Chaining Technology: Decompose complex tasks into multiple subtasks, executed through chained prompts, overcoming limitations of single prompts
Tool Use Integration: Leverage Python code interpreter for numerical computation, constraint checking, and diagram generation
Variation Type Separation: Systematically distinguish and independently handle structural and contextual variations
Data Table Transmission: Use tabular format to store and transmit information within prompt chains, improving reliability
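
One way to picture the table-based hand-off between prompts is to serialize intermediate results as CSV text that is embedded in the next prompt and parsed back afterwards. The sketch below is an assumption about how this could be done, not the paper's actual format; the column names and the helpers `table_to_text` and `text_to_table` are hypothetical.

```python
import csv
import io
from typing import Dict, List

def table_to_text(rows: List[Dict[str, str]]) -> str:
    """Serialize rows as CSV so they can be pasted into the next prompt."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def text_to_table(text: str) -> List[Dict[str, str]]:
    """Parse CSV text (for example, a model reply) back into rows."""
    return list(csv.DictReader(io.StringIO(text)))

# Hypothetical intermediate table: one row per problem variant.
rows = [
    {"id": "1", "context": "worker pushes a crate", "angle_deg": "25"},
    {"id": "2", "context": "child pulls a sled", "angle_deg": "40"},
]

text = table_to_text(rows)
next_prompt = "For each row below, write one problem statement.\n" + text
```

Keeping one row per variant gives every later prompt the same explicit record to work from, rather than relying on the model to recall values from free-form text.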

Experimental Setup

Problem Repository Design

Problem Repository 1: Numerical Computation Problems

Template: Object pushed or pulled by a force applied at an angle on a rough surface, moving at constant velocity
Structural Variations: Force direction and nature, variable numerical values, unknown variable selection
Constraints: Angles 10-60 degrees, horizontal force component balances kinetic friction
Prompt Chain: 5 prompts generating context → numerical values → problem statement → solution → formatting
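
The structural constraints for this repository lend themselves to a short numerical sketch of the kind a code interpreter could run. The example assumes a horizontal surface, g = 9.8 m/s^2, and a push directed below the horizontal versus a pull directed above it; the mass and friction ranges, and the names `required_force`, `generate_variant`, and `is_balanced`, are illustrative assumptions rather than values from the paper. Only the 10 to 60 degree angle range and the balance between the horizontal force component and kinetic friction come from the summary above, and the sketch solves only for the applied force, not for other choices of unknown.

```python
import math
import random

G = 9.8  # gravitational acceleration in m/s^2 (assumed value)

def required_force(mass_kg, mu_k, angle_deg, mode):
    """Force magnitude that keeps the object moving at constant velocity.

    mode "push": force angled below the horizontal, which increases the normal force.
    mode "pull": force angled above the horizontal, which decreases the normal force.
    """
    theta = math.radians(angle_deg)
    sign = -1.0 if mode == "push" else 1.0
    denom = math.cos(theta) + sign * mu_k * math.sin(theta)
    if denom <= 0:
        raise ValueError("no constant-velocity solution for these values")
    return mu_k * mass_kg * G / denom

def generate_variant(rng):
    """Draw one set of values; only the angle range comes from the source."""
    mode = rng.choice(["push", "pull"])
    angle_deg = rng.randint(10, 60)
    mass_kg = rng.randint(20, 150)            # assumed range
    mu_k = round(rng.uniform(0.10, 0.50), 2)  # assumed range
    force_n = required_force(mass_kg, mu_k, angle_deg, mode)
    return {"mode": mode, "angle_deg": angle_deg, "mass_kg": mass_kg,
            "mu_k": mu_k, "force_n": force_n}

def is_balanced(v):
    """Check that the horizontal force component equals kinetic friction."""
    theta = math.radians(v["angle_deg"])
    vertical = v["force_n"] * math.sin(theta)
    weight = v["mass_kg"] * G
    normal = weight + vertical if v["mode"] == "push" else weight - vertical
    return math.isclose(v["force_n"] * math.cos(theta), v["mu_k"] * normal, rel_tol=1e-9)

variants = [generate_variant(random.Random(seed)) for seed in range(5)]
```

Computing the answer in code and then re-checking the force balance independently is the kind of validation that a purely text-based generation step cannot provide.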

Problem Repository 2: Conceptual Multiple-Choice Questions (with Diagrams)

Template: Projectile motion trajectory comparison, same starting point with different heights and ranges
Structural Variations: Answer relationships, trajectory parameters, distractor design
Constraints: No visual overlap between trajectories, unambiguous relationships between compared quantities, sufficient visual differentiation
Prompt Chain: 9 prompts handling more complex structural variations and diagram generation
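
The physics behind the trajectory comparison can be sketched in a few lines. The example assumes ideal projectile motion with launch and landing at the same level, no air resistance, and g = 9.8 m/s^2; the function names and the sample heights and ranges are hypothetical. Under these assumptions the time of flight depends only on the peak height, and the curve y(x) = 4Hx(R - x)/R^2 never drops below the ground between launch and landing.

```python
import math

G = 9.8  # gravitational acceleration in m/s^2 (assumed value)

def trajectory(height_m, range_m):
    """Flight time and launch speed for a given peak height and range."""
    t_flight = 2.0 * math.sqrt(2.0 * height_m / G)
    v0y = math.sqrt(2.0 * G * height_m)
    v0x = range_m / t_flight
    return {"time": t_flight, "speed": math.hypot(v0x, v0y)}

def y_at(x, height_m, range_m):
    """Height of the parabola at horizontal position x, for 0 <= x <= range_m."""
    return 4.0 * height_m * x * (range_m - x) / range_m ** 2

def relation(a, b, rel_tol=1e-9):
    """Return '<', '=' or '>' describing how a compares to b."""
    if math.isclose(a, b, rel_tol=rel_tol):
        return "="
    return ">" if a > b else "<"

# Two hypothetical trajectories with equal peak height but different range.
a = trajectory(20.0, 40.0)
b = trajectory(20.0, 80.0)
time_relation = relation(a["time"], b["time"])     # equal heights give equal flight times
speed_relation = relation(a["speed"], b["speed"])  # longer range needs a larger launch speed
```

Sampling `y_at` over the interval from 0 to the range yields points that a plotting library could draw, and the computed relations can serve as the answer key for the comparison question.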

Comparison Methods

Single Prompt Method: Consolidate prompt chain into one or two prompts
Simple Prompt Method: Simplified prompts based on single examples (for Problem Repository 1 only)

Evaluation Metrics

Output Quality: Problem completeness, numerical accuracy, format consistency
Structural Control: Degree of constraint compliance
Contextual Diversity: Variation in scenarios and descriptions
Answer Correctness: Accuracy verified through GenAI

Experimental Results

Main Results

Problem Repository 1 Generation Performance

Successful Generation: 20 isomorphic problems (10 with GPT-4o + 10 with Gemini 2.5 Pro)
Quality Control: Each problem features unique background story, appropriate random values, correct answers
Example Problem: Worker pushing wooden box problem, including complete physical parameters and solution

Problem Repository 2 Generation Performance

Systematic Generation: 26 variations (13 possible relationships × 2 main distractors)
Diagram Quality: Parabolic trajectory diagrams automatically generated by Python, clearly distinguishable
Problem Completeness: Each problem includes situational description, diagram, and four choice options
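
The count of 26 follows from pairing each of the 13 relationships with each of the 2 main distractors. The sketch below enumerates that grid and reports which combinations a generated set fails to cover. The labels are placeholders, since the summary gives only the counts, and `missing_combinations` is a hypothetical helper.

```python
from itertools import product

# Placeholder labels: the source states only the counts (13 and 2).
relationships = [f"relationship_{i}" for i in range(1, 14)]
distractors = ["distractor_A", "distractor_B"]

# Full grid of expected variants: 13 x 2 = 26.
expected = list(product(relationships, distractors))

def missing_combinations(generated):
    """Return the expected (relationship, distractor) pairs absent from `generated`."""
    seen = set(generated)
    return [combo for combo in expected if combo not in seen]

# Example: a run that covers only distractor_A for every relationship.
partial_run = [(r, "distractor_A") for r in relationships]
gaps = missing_combinations(partial_run)
```

A coverage check of this kind makes it easy to spot a run that returns fewer variants than the design calls for.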

Comparative Experimental Results

Single Prompt vs. Prompt-Chain

Problem Repository 1:

Single Prompt Defects: Completely ignored numerical generation instructions; all 10 versions lacked numerical values
Prompt-Chain Advantages: Precisely followed all constraints, generated complete problems

Problem Repository 2:

Single Prompt Issues: Trajectories appeared underground or invisible
Insufficient Generation: Only 7 scenarios and 13 combinations generated, versus expected 10 scenarios and 26 combinations

Simple Prompt vs. Prompt-Chain (Problem Repository 1)

Answer Accuracy: Answers generated by the simple prompt were mostly incorrect (e.g., 140 kg vs. the correct answer of 148.6 kg)
Tool Usage: Simple prompt did not activate Python tools, directly hallucinating answers
Text Quality: Text generated by simple prompt noticeably shorter with reduced quality
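
A discrepancy such as 140 kg versus 148.6 kg is easy to catch once the answer has been computed independently in code. The following is a minimal sketch of such a check; the 1% tolerance and the name `answer_matches` are assumptions, not details from the paper.

```python
import math

def answer_matches(claimed, computed, rel_tol=0.01):
    """True when the claimed answer is within rel_tol of the computed one."""
    return math.isclose(claimed, computed, rel_tol=rel_tol)

# Values quoted in the comparison above.
hallucinated_ok = answer_matches(140.0, 148.6)  # differs by about 5.8%, so it fails
exact_ok = answer_matches(148.6, 148.6)
```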

Quality Verification Results

Problem Repository 1: GenAI identified and corrected 6 formula derivation errors (out of 20 problems)
Problem Repository 2: Identified 3 problems where distractors were equivalent to correct answers
Student Verification: Problem repository used in midterm exam; students reported no additional errors

Related Work

Development of Automatic Question Generation (AQG)

Early Methods: Hard-coded templates with high development costs
LLM Applications: Dijkstra et al. trained GPT-3 to generate multiple-choice questions; Chan et al. used GPT-3.5/4 to generate STEM problems
Isomorphic Problems: Arendasy and Sommer generated algebra problems through templates; Norberg et al. used GPT-4 to rewrite mathematical problem explanations

Technical Method Comparison

Traditional AIG: Precise control but lacking creativity
Direct LLM Application: Strong creativity but difficult control
This Paper's Method: Combines advantages of both, achieving balance between precise control and creativity

Conclusions and Discussion

Main Conclusions

Prompt-Chaining Significantly Outperforms Single Prompts: Demonstrates superior performance in quality consistency and constraint compliance
Tool Use is Critical: Python interpreter resolves key issues in numerical computation and diagram generation
GenAI Quality Verification is Effective: Successfully identifies and corrects errors in the generation process
Method is Scalable: Can generate nearly unlimited quantities of isomorphic problems

Limitations

Single-Rater Quality Assessment: Output quality was evaluated only by the author, without systematic independent review
Unknown Psychometric Properties: Lacks student test data to evaluate psychometric properties of isomorphic problems
Limited Contextual Control: Primarily focuses on structural variations with limited control over contextual variations
Diagram Complexity Constraints: Supports only simple diagram generation

Future Directions

Systematic Quality Assessment: Conduct more comprehensive quality review and student testing
Fine-Grained Contextual Control: Explore control of contextual variations such as different writing styles
Complex Diagram Generation: Extend to more complex diagram types
Automated Prompt-Chain Design: Use GenAI to assist in prompt-chain design
Real-Time Generation System: Implement instant problem generation for fully personalized assessment

In-Depth Evaluation

Strengths

Strong Method Innovation: First systematic combination of prompt-chaining and tool use for isomorphic problem generation
High Practical Value: Provides accessible and efficient problem creation method for ordinary teachers
Well-Designed Experiments: Two different problem types validate method generalizability
Detailed Technical Implementation: Provides complete prompt chains and implementation details with strong reproducibility
Complete Quality Control: Establishes complete generation-verification feedback loop

Weaknesses

Limited Evaluation Scope: Validated only on two problem types in physics discipline
Relatively Small Scale: Generated problem quantities are limited (20+26 problems)
Missing Cost Analysis: Lacks cost-benefit comparison with traditional methods
Insufficient User Research: Lacks teacher and student experience studies

Impact

Disciplinary Contribution: Provides new problem generation paradigm for educational technology field
Practical Value: Directly applicable to personalized learning and adaptive testing
Technical Demonstration: Showcases possibilities for precise LLM control in educational applications
Method Generalizability: Technical framework extensible to other disciplines and problem types

Applicable Scenarios

Personalized Learning Platforms: Provide unlimited practice problems for students
Adaptive Testing Systems: Generate equivalent alternative problems of comparable difficulty
Teacher Assistance Tools: Help teachers rapidly create high-quality problem repositories
Online Education Platforms: Support large-scale personalized content generation

References

The paper cites 14 relevant references covering key works in automatic question generation, isomorphic problem creation, and LLM applications, providing solid theoretical foundation for the research.

Overall Assessment: This is a high-quality applied research paper making important contributions at the intersection of educational technology and AI applications. The method is novel and practical, experimental design is sound, and results are convincing. While there is room for improvement in evaluation scale and disciplinary coverage, the work points to important directions for field development.