Automatically Generating Questions About Scratch Programs

Obermüller, Fraser
When learning to program, students are usually assessed based on the code they wrote. However, the mere completion of a programming task does not guarantee actual comprehension of the underlying concepts. Asking learners questions about the code they wrote has therefore been proposed as a means to assess program comprehension. As creating targeted questions for individual student programs can be tedious and challenging, prior work has proposed to generate such questions automatically. In this paper we generalize this idea to the block-based programming language Scratch. We propose a set of 30 different questions for Scratch code covering an established program comprehension model, and extend the LitterBox static analysis tool to automatically generate corresponding questions for a given Scratch program. On a dataset of 600,913 projects we generated 54,118,694 questions automatically. Our initial experiments with 34 ninth graders demonstrate that this approach can indeed generate meaningful questions for Scratch programs, and we find that the ability of students to answer these questions on their programs relates to their overall performance.

Basic Information

  • Paper ID: 2510.11658
  • Title: Automatically Generating Questions About Scratch Programs
  • Authors: Florian Obermüller, Gordon Fraser
  • Classification: cs.SE (Software Engineering)
  • Publication Venue: CompEd 2025 (ACM Global Computing Education Conference 2025)
  • Paper Link: https://arxiv.org/abs/2510.11658

Abstract

In programming education, students are typically assessed based on the code they write. However, successfully completing a programming task does not guarantee genuine understanding of the underlying concepts. Prior research has therefore proposed assessing program comprehension by asking learners questions about their own code. Since creating targeted questions for each student program is both tedious and challenging, previous work has proposed generating such questions automatically. This paper extends the idea to Scratch, a block-based programming language. The authors propose 30 different question types for Scratch code that cover an established program comprehension model, and extend the LitterBox static analysis tool to automatically generate the corresponding questions for a given Scratch program. On a dataset of 600,913 projects, 54,118,694 questions were generated automatically. Initial experiments with 34 ninth-grade students indicate that the approach can generate meaningful questions for Scratch programs, and that students' ability to answer these questions about their own programs relates to their overall performance.

Research Background and Motivation

Core Problem

The core problem this research addresses is: How can we effectively assess students' understanding of their own Scratch programs, rather than merely checking whether programs run correctly?

Problem Significance

  1. Gap Between Understanding and Implementation: Students may complete programming tasks through trial-and-error, copying, or AI assistance without truly understanding underlying programming concepts
  2. Limitations of Assessment Methods: Traditional assessment methods focus primarily on code correctness rather than students' program comprehension abilities
  3. Scalability Challenges: In large-scale teaching scenarios, teachers struggle to manually create personalized comprehension assessment questions for each student's program

Limitations of Existing Methods

  1. Text Language Limitations: Existing question generation methods primarily target text-based programming languages like Java and are not applicable to block-based languages like Scratch
  2. Language Feature Differences: In Scratch, variables are created through the user interface rather than declaration statements, and blocks cannot be referenced by line numbers
  3. Lack of Systematicity: Absence of systematic question design methods based on theoretical frameworks

Research Motivation

The motivation for this research is to extend the existing concept of "Questions about Learners' Code" (QLCs) to the Scratch environment, providing an automated program comprehension assessment tool for block-based programming education.

Core Contributions

  1. Systematic Question Design: Based on the Block Model program comprehension framework, systematically designed 30 types of questions targeting Scratch code
  2. Tool Extension: Extended the open-source static analysis tool LitterBox to automatically generate comprehension questions for Scratch programs
  3. Large-Scale Validation: Validated the method's applicability on a dataset containing 600,913 public Scratch projects
  4. Empirical Study: Verified question effectiveness and the correlation between student answer performance and programming ability through classroom experiments with 34 ninth-grade students

Methodology Details

Task Definition

  • Input: A Scratch program project
  • Output: A set of automatically generated comprehension questions about the program, including question text, answer options, and correct answers
  • Constraints: Questions must be based on code constructs actually present in the program and conform to the Block Model theoretical framework

Method Architecture

1. Theoretical Foundation: Block Model Adaptation

The Block Model contains four levels of scope and three program dimensions:

| Level | Text Dimension | Execution Dimension | Purpose Dimension |
|---|---|---|---|
| Atomic | Language elements | Element operations | Element purpose |
| Block | Syntactically/semantically related regions | Code block operations | Code block functionality |
| Relational | References between code blocks | Control flow between blocks | Goal and sub-goal relationships |
| Macro | Overall program structure | Algorithms or program behavior | Program goals or objectives |

2. Question Type Design

Based on the Block Model, 30 question types were designed across 5 answer formats:

  • Numeric (🔢): Answer is a single number
  • String (📝): Answer is one or more strings
  • Yes/No (✓/✗): Answer is yes or no
  • Multiple Choice (☑️): Select correct answers from options
  • Free Text (📄): Open-ended questions requiring explanatory responses
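
One way to picture how a question type ties a Block Model cell to an answer format is the small data model below. It is only a sketch: the class and field names are hypothetical, and the placement of "Purpose of Script" in the Block/Purpose cell with a free-text answer is an assumption made for illustration, not something the summary states.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    ATOMIC = "atomic"
    BLOCK = "block"
    RELATIONAL = "relational"
    MACRO = "macro"


class Dimension(Enum):
    TEXT = "text"
    EXECUTION = "execution"
    PURPOSE = "purpose"


class AnswerFormat(Enum):
    NUMERIC = "numeric"
    STRING = "string"
    YES_NO = "yes/no"
    MULTIPLE_CHOICE = "multiple choice"
    FREE_TEXT = "free text"


@dataclass(frozen=True)
class QuestionType:
    """Hypothetical record: one question type placed in one Block Model cell."""
    name: str
    level: Level
    dimension: Dimension
    answer_format: AnswerFormat

    def auto_gradable(self) -> bool:
        # Free-text answers still need a human grader (see Limitations).
        return self.answer_format is not AnswerFormat.FREE_TEXT


# Assumed placement, for illustration only.
purpose_of_script = QuestionType(
    "Purpose of Script", Level.BLOCK, Dimension.PURPOSE, AnswerFormat.FREE_TEXT
)
cells = len(Level) * len(Dimension)  # the Block Model grid has 4 x 3 cells
```

Keeping the cell and the answer format as separate fields makes it easy to ask which cells are covered by auto-gradable question types only.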

3. Automatic Generation Implementation

Implemented through extending the LitterBox tool:

  1. AST Parsing: Convert Scratch programs to abstract syntax trees
  2. Visitor Pattern: Implement a question finder for each question type
  3. Code Traversal: Traverse the AST to identify code patterns that can generate questions
  4. Option Generation: Automatically generate distractors for multiple-choice questions
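
The four steps above can be sketched with a toy AST and a visitor. LitterBox itself is a Java tool and its real classes are not shown in this summary, so everything below (the `Node` class, the `QuestionFinder` base class, the `RepeatTimes` question and the naive distractor rule) is a hypothetical illustration of the pattern rather than the tool's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy AST node; `kind` names the block, e.g. 'repeat' or 'if'."""
    kind: str
    value: object = None
    children: list = field(default_factory=list)


class QuestionFinder:
    """Base visitor: walks the tree and dispatches on the node kind."""

    def __init__(self):
        self.questions = []

    def visit(self, node):
        handler = getattr(self, f"visit_{node.kind}", None)
        if handler is not None:
            handler(node)
        for child in node.children:
            self.visit(child)


class RepeatTimesFinder(QuestionFinder):
    """Hypothetical finder for a numeric question about `repeat (n)` blocks."""

    def visit_repeat(self, node):
        correct = node.value
        # Naive distractors: nearby non-negative numbers that differ from the answer.
        candidates = [correct - 1, correct + 1, correct * 2]
        distractors = sorted({c for c in candidates if c != correct and c >= 0})
        self.questions.append({
            "type": "RepeatTimes",
            "text": "How many times does this repeat block run its body?",
            "answer": correct,
            "distractors": distractors,
        })


# A script with one `repeat (10)` loop and one `if` block.
script = Node("script", children=[
    Node("repeat", 10, [Node("move", 5)]),
    Node("if", children=[Node("say", "hi")]),
])
finder = RepeatTimesFinder()
finder.visit(script)
```

Because each finder only reacts to the node kinds it knows, a question is generated only when the corresponding construct is actually present in the program, which matches the constraint stated in the task definition.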

Technical Innovation Points

  1. Block Programming Adaptation: First systematic application of the QLC concept to block-based programming languages
  2. Theory-Driven Design: Question type design based on an established program comprehension framework
  3. Automated Generation: Implements a fully automated question generation workflow
  4. Multi-Dimensional Coverage: Questions span from basic language elements to overall program objectives

Experimental Setup

Dataset

  1. Large-Scale Dataset: 600,913 public Scratch projects, excluding empty and remixed projects
  2. Classroom Experiment Data: 34 German ninth-grade students with Scratch programming experience
  3. Scaffolding Projects: Used Boat Race game as the foundation project for classroom experiments

Evaluation Metrics

  1. Question Generation Frequency: Total generation count and project coverage for each question type
  2. Correlation Analysis: Pearson correlation coefficient between student answer performance and programming task completion
  3. Coverage Analysis: Project coverage percentage for each Block Model dimension
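
The first metric can be read as two numbers per question type: how many questions were generated in total, and in what share of projects at least one was generated. A minimal sketch, assuming the generator's output has already been reduced to one list of question-type names per project (the function name and the toy data are hypothetical):

```python
from collections import Counter


def coverage_stats(questions_per_project):
    """Return {type: (total_count, percent_of_projects_with_at_least_one)}."""
    total = Counter()     # number of questions of each type overall
    projects = Counter()  # number of projects containing each type at least once
    for types in questions_per_project:
        total.update(types)
        projects.update(set(types))
    n = len(questions_per_project)
    return {t: (total[t], 100.0 * projects[t] / n) for t in total}


# Toy data: three projects, not taken from the paper's dataset.
toy = [
    ["Purpose of Script", "Purpose of Script", "Purpose of If Condition"],
    ["Purpose of Script"],
    ["Purpose of Script", "Scripts for Actor"],
]
stats = coverage_stats(toy)
```

In the toy data, "Purpose of Script" is counted four times and appears in every project, while "Purpose of If Condition" appears once, in one of three projects.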

Comparison Methods

Since this is the first QLC study for Scratch, no direct baseline exists. Validation was therefore performed through:

  1. Conceptual comparison with existing text-based language QLCs
  2. Systematic verification based on theoretical frameworks
  3. Application validation in actual teaching scenarios

Implementation Details

  1. Tool Extension: Based on LitterBox static analysis tool
  2. Output Format: JSON format containing code snippets in ScratchBlocks syntax
  3. Question Presentation: Highlights the target code portion within the displayed script (Figure 1a in the paper)
  4. Scoring Mechanism: 0.2 points per correct choice in multiple-choice questions, 1 point for single-choice questions
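
The scoring rule can be sketched as follows. The summary does not say exactly what counts as a "correct choice", so this sketch assumes that every option of a multiple-choice question is judged on its own (ticked when it is correct, left unticked when it is not) and earns 0.2 points when judged correctly. The function names are hypothetical.

```python
POINTS_PER_CHOICE = 0.2  # value reported in the summary


def score_single_choice(given, correct):
    """1 point for the right answer, 0 otherwise."""
    return 1.0 if given == correct else 0.0


def score_multiple_choice(selected, correct, options):
    """0.2 points per option the student handled correctly (assumed reading)."""
    selected, correct = set(selected), set(correct)
    right_decisions = sum((opt in selected) == (opt in correct) for opt in options)
    return round(POINTS_PER_CHOICE * right_decisions, 1)


options = ["a", "b", "c", "d", "e"]
# "a" is ticked correctly, "b" is ticked wrongly, "c" is missed,
# "d" and "e" are correctly left unticked: three right decisions.
example = score_multiple_choice({"a", "b"}, {"a", "c"}, options)
```

Under this reading a question with five options is worth at most 1 point, the same weight as a single-choice question.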

Experimental Results

Main Results

RQ1: Question Generation Frequency

  • Overall Statistics: Generated 54,118,694 questions across 600,913 projects
  • Most Frequent Questions:
    • Purpose of Script: 9,748,844 times (100% project coverage)
    • Purpose of If Condition: 5,103,322 times (41.1% project coverage)
    • Scripts for Actor: 3,524,268 times (100% project coverage)
  • Least Frequent Questions:
    • My Block Definition: 368,712 times (11.3% project coverage)
    • Purpose of Loop Condition: 486,902 times (15.2% project coverage)
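
To put these counts in proportion, the short calculation below derives averages from the reported figures. Only the totals and per-type counts come from the results above; the variable names are illustrative.

```python
TOTAL_PROJECTS = 600_913
TOTAL_QUESTIONS = 54_118_694

# Per-type counts as reported above.
counts = {
    "Purpose of Script": 9_748_844,
    "Purpose of If Condition": 5_103_322,
    "Scripts for Actor": 3_524_268,
    "Purpose of Loop Condition": 486_902,
    "My Block Definition": 368_712,
}

# Average number of generated questions per project, over all types.
avg_per_project = TOTAL_QUESTIONS / TOTAL_PROJECTS

# Average per project and share of all questions, for each listed type.
per_project = {name: n / TOTAL_PROJECTS for name, n in counts.items()}
share_percent = {name: 100 * n / TOTAL_QUESTIONS for name, n in counts.items()}

for name in counts:
    print(f"{name}: {per_project[name]:.2f} per project, "
          f"{share_percent[name]:.1f}% of all questions")
```

With the reported totals this works out to roughly 90 questions per project on average, of which about 16 are Purpose of Script questions.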

Block Model Coverage Analysis

| Dimension | Atomic | Block | Relational | Macro |
|---|---|---|---|---|
| Text | 64.5% | 61.2% | 46.5% | 100.0% |
| Execution | 30.4% | 58.4% | 99.0% | 71.1% |
| Purpose | 49.0% | 100.0% | 31.2% | 100.0% |

RQ2: Correlation Between Answer Performance and Programming Ability

  • Correlation Coefficient: r = 0.467 (p = 0.005)
  • Correlation Strength: Moderate positive correlation
  • Statistical Significance: p < 0.01, statistically significant
  • Practical Significance: Students' ability to answer QLCs significantly correlates with their programming task completion
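
The reported significance level can be sanity-checked with the usual t-test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below assumes that all 34 students entered the correlation, which the summary does not state explicitly, and the helper names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def t_statistic(r, n):
    """Test statistic for H0: no correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Reported r, with n = 34 as an assumption.
t = t_statistic(0.467, 34)
print(f"t = {t:.2f} with {34 - 2} degrees of freedom")
```

The resulting value of about 2.99 lies above the two-sided 1% critical value for 32 degrees of freedom (roughly 2.74 in a standard t table), which is consistent with the reported p = 0.005.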

Experimental Findings

  1. Generalizability Verification: All 30 question types are generated in real projects; even the least frequent types listed above appear in more than 11% of projects
  2. Hierarchical Characteristics: High-level questions (such as program purpose) can be generated in almost all projects, while low-level questions depend on specific programming constructs
  3. Evidence of Effectiveness: The moderate correlation (r = 0.467) suggests that QLCs can serve as indicators of program comprehension ability
  4. Teaching Value: Can be used to detect students' knowledge gaps

Main Research Directions

  1. Program Comprehension Assessment: Traditional methods primarily focus on code tracing, explanation, and writing skills
  2. Automatic Question Generation: Existing automatic question generation tools for text-based languages like Java
  3. Block Programming Education: Widespread application of Scratch as an introductory programming language

Relationship to Prior Work

  1. Theoretical Inheritance: Adopts the Block Model, a mature program comprehension theoretical framework
  2. Technical Extension: First application of the existing QLC concept to block-based programming languages
  3. Tool Innovation: Implemented automated question generation for Scratch based on the LitterBox tool

Advantages Over Prior Work

  1. Language Adaptability: Specifically designed for characteristics of block-based programming languages
  2. Systematic Completeness: Systematic question design based on theoretical frameworks
  3. Practicality: Large-scale data validation and actual classroom application

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Can automatically generate large quantities of meaningful comprehension questions for Scratch programs
  2. Educational Effectiveness: Generated questions can effectively assess students' program comprehension abilities
  3. Practical Value: Provides a scalable automated assessment tool for Scratch education

Limitations

  1. Free Text Assessment: Automatic evaluation of open-ended questions still requires human involvement
  2. Question Coverage: Insufficient coverage at lower hierarchical levels for certain Scratch-specific constructs
  3. Experimental Scale: Relatively small sample size in classroom experiments (34 students)
  4. Time Constraints: Time limitations in classroom experiments may affect results

Future Directions

  1. LLM Integration: Utilize large language models to automatically evaluate free-text answers
  2. Question Expansion: Add more question types targeting Scratch-specific constructs
  3. User Interface: Develop question generation and management interfaces suitable for classroom use
  4. Long-Term Effect Research: Evaluate the long-term impact of QLCs on learning outcomes

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic application of QLCs to block-based programming languages, filling a research gap
  2. Solid Theoretical Foundation: Systematic design based on Block Model ensures theoretical completeness of questions
  3. Comprehensive Experiments: Combines large-scale data analysis with classroom experiments, validating method feasibility and effectiveness
  4. High Practical Value: Release of open-source tools enables research outcomes to be directly applied to teaching practice
  5. Clear Writing: Well-structured paper with accurate technical detail descriptions

Weaknesses

  1. Assessment Limitations: Free-text question evaluation still requires manual involvement, limiting full automation
  2. Sample Limitations: Relatively small sample size in classroom experiments requires larger-scale validation
  3. Insufficient In-Depth Analysis: Lacks fine-grained analysis of effectiveness across different question types
  4. Adaptability Issues: Insufficient discussion of how to adapt to students of different ages and skill levels

Impact

  1. Academic Contribution: Provides new research directions and tools for programming education assessment
  2. Practical Value: Provides Scratch teachers with practical automated assessment tools
  3. Reproducibility: Open-source code and detailed experimental setup ensure research reproducibility
  4. Promotion Potential: Method is extensible to other block-based programming languages and platforms

Applicable Scenarios

  1. K-12 Programming Education: Particularly suitable for classrooms using Scratch for programming introduction
  2. Online Learning Platforms: Can be integrated into online programming learning systems for automatic feedback
  3. Teacher Training: Helps teachers better understand students' program comprehension levels
  4. Research Tools: Provides standardized assessment tools for programming education research

References

The paper cites 23 important references covering program comprehension theory, programming education assessment, Scratch analysis tools, and other related research areas. Particularly noteworthy are the original Block Model papers, related work on the LitterBox tool, and empirical research on the relationship between program comprehension and programming ability.