
Automatically Generating Questions About Scratch Programs

Obermüller, Fraser

When learning to program, students are usually assessed based on the code they wrote. However, the mere completion of a programming task does not guarantee actual comprehension of the underlying concepts. Asking learners questions about the code they wrote has therefore been proposed as a means to assess program comprehension. As creating targeted questions for individual student programs can be tedious and challenging, prior work has proposed to generate such questions automatically. In this paper we generalize this idea to the block-based programming language Scratch. We propose a set of 30 different questions for Scratch code covering an established program comprehension model, and extend the LitterBox static analysis tool to automatically generate corresponding questions for a given Scratch program. On a dataset of 600,913 projects we generated 54,118,694 questions automatically. Our initial experiments with 34 ninth graders demonstrate that this approach can indeed generate meaningful questions for Scratch programs, and we find that the ability of students to answer these questions on their programs relates to their overall performance.

Basic Information

Paper ID: 2510.11658
Title: Automatically Generating Questions About Scratch Programs
Authors: Florian Obermüller, Gordon Fraser
Classification: cs.SE (Software Engineering)
Publication Time/Conference: CompEd 2025 (ACM Global Computing Education Conference 2025)
Paper Link: https://arxiv.org/abs/2510.11658

Abstract

In programming education, students are typically assessed based on the code they write. However, successfully completing a programming task does not guarantee genuine understanding of the underlying concepts. Prior research has therefore proposed assessing program comprehension by asking learners questions about their own code. Since creating targeted questions for each student program is both tedious and challenging, previous work has proposed generating such questions automatically. This paper extends this idea to Scratch, a block-based programming language. The authors propose 30 different types of questions about Scratch code that cover an established program comprehension model, and extend the LitterBox static analysis tool to automatically generate the corresponding questions for a given Scratch program. On a dataset containing 600,913 projects, 54,118,694 questions were generated automatically. Preliminary experiments with 34 ninth-grade students demonstrate that this approach can generate meaningful questions for Scratch programs and show that students' ability to answer these questions about their own programs correlates with their overall performance.

Research Background and Motivation

Core Problem

The core problem this research addresses is: How can we effectively assess students' understanding of their own Scratch programs, rather than merely checking whether programs run correctly?

Problem Significance

Gap Between Understanding and Implementation: Students may complete programming tasks through trial-and-error, copying, or AI assistance without truly understanding underlying programming concepts
Limitations of Assessment Methods: Traditional assessment methods focus primarily on code correctness rather than students' program comprehension abilities
Scalability Challenges: In large-scale teaching scenarios, teachers struggle to manually create personalized comprehension assessment questions for each student's program

Limitations of Existing Methods

Text-Based Language Limitations: Existing question generation methods primarily target text-based programming languages like Java and cannot be applied directly to block-based languages like Scratch
Language Feature Differences: In Scratch, variables are created through the user interface rather than declaration statements, and blocks cannot be referenced by line numbers
Lack of Systematicity: Absence of systematic question design methods based on theoretical frameworks

Research Motivation

The motivation for this research is to extend the existing concept of "Questions about Learner's Code" (QLCs) to the Scratch environment, providing an automated program comprehension assessment tool for block-based programming education.

Core Contributions

Systematic Question Design: Based on the Block Model program comprehension framework, systematically designed 30 types of questions targeting Scratch code
Tool Extension: Extended the open-source static analysis tool LitterBox to automatically generate comprehension questions for Scratch programs
Large-Scale Validation: Validated the method's applicability on a dataset containing 600,913 public Scratch projects
Empirical Study: Verified question effectiveness and the correlation between student answer performance and programming ability through classroom experiments with 34 ninth-grade students

Methodology Details

Task Definition

Input: A Scratch program project
Output: A set of automatically generated comprehension questions about the program, including question text, answer options, and correct answers
Constraints: Questions must be based on code constructs actually present in the program and conform to the Block Model theoretical framework

Method Architecture

1. Theoretical Foundation: Block Model Adaptation

The Block Model contains four levels of scope and three program dimensions:

Level	Text Dimension	Execution Dimension	Purpose Dimension
Atomic	Language elements	Element operations	Element purpose
Block	Syntactically/semantically related regions	Code block operations	Code block functionality
Relational	References between code blocks	Control flow between blocks	Target and sub-target relationships
Macro	Overall program structure	Algorithms or program behavior	Program goals or objectives

2. Question Type Design

Based on the Block Model, 30 question types were designed across 5 answer formats:

Numeric (🔢): Answer is a single number
String (📝): Answer is one or more strings
Yes/No (✓/✗): Answer is yes or no
Multiple Choice (☑️): Select correct answers from options
Free Text (📄): Open-ended questions requiring explanatory responses
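The five answer formats above can be captured in a small data model. The following is a minimal sketch only: the class names, field names, and the example question wording are hypothetical and are not taken from the paper or from LitterBox. The one behavior it encodes, that free-text answers cannot be graded automatically, follows from the limitation the paper itself reports.

```python
from dataclasses import dataclass, field
from enum import Enum


class AnswerFormat(Enum):
    """The five answer formats listed above (identifiers are assumptions)."""
    NUMERIC = "numeric"
    STRING = "string"
    YES_NO = "yes_no"
    MULTIPLE_CHOICE = "multiple_choice"
    FREE_TEXT = "free_text"


@dataclass
class Question:
    """Hypothetical container for one generated question."""
    question_type: str                 # e.g. "Purpose of Script"
    text: str                          # wording shown to the learner
    answer_format: AnswerFormat
    correct_answers: list = field(default_factory=list)
    distractors: list = field(default_factory=list)

    def is_auto_gradable(self) -> bool:
        # Free-text answers still need a human grader.
        return self.answer_format is not AnswerFormat.FREE_TEXT


q = Question("Purpose of Script", "What does this script do?", AnswerFormat.FREE_TEXT)
print(q.is_auto_gradable())
```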

3. Automatic Generation Implementation

Question generation is implemented by extending the LitterBox tool:

AST Parsing: Convert Scratch programs to abstract syntax trees
Visitor Pattern: Implement a question finder for each question type
Code Traversal: Traverse the AST to identify code patterns that can generate questions
Option Generation: Automatically generate distractors for multiple-choice questions
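The four steps above can be illustrated with a toy example. This is a sketch under stated assumptions, not LitterBox code: LitterBox is a separate tool with its own AST classes, and the `Node`, `QuestionFinder`, `ScriptsForActorFinder`, and `build_options` names below are hypothetical, as is the question wording. The distractor strategy shown (sampling other names from the same program) is likewise an assumption about one plausible approach.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy AST node; real LitterBox node classes are not modeled here."""
    kind: str                      # e.g. "program", "actor", "script"
    name: str = ""
    children: list = field(default_factory=list)


class QuestionFinder:
    """Base visitor: dispatches on node kind, then descends into children."""

    def __init__(self):
        self.questions = []

    def visit(self, node):
        handler = getattr(self, "visit_" + node.kind, None)
        if handler is not None:
            handler(node)
        for child in node.children:
            self.visit(child)


class ScriptsForActorFinder(QuestionFinder):
    """Emits one numeric question per actor that has at least one script."""

    def visit_actor(self, node):
        count = sum(1 for child in node.children if child.kind == "script")
        if count > 0:
            self.questions.append({
                "type": "Scripts for Actor",
                "text": f"How many scripts does {node.name} have?",  # hypothetical wording
                "answer": count,
            })


def build_options(correct, candidates, n_distractors, seed=0):
    """Mix the correct answers with distractors drawn from other program names."""
    pool = sorted(set(candidates) - set(correct))
    rng = random.Random(seed)      # seeded so the option order is reproducible
    distractors = rng.sample(pool, min(n_distractors, len(pool)))
    options = list(correct) + distractors
    rng.shuffle(options)
    return options


program = Node("program", children=[
    Node("actor", "Boat", [Node("script"), Node("script")]),
    Node("actor", "Stage"),        # no scripts, so no question is generated
])
finder = ScriptsForActorFinder()
finder.visit(program)
print(finder.questions)
```

One finder per question type keeps each pattern isolated, so adding a new question type means adding a new subclass rather than changing the traversal.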

Technical Innovation Points

Block Programming Adaptation: First systematic application of the QLC concept to block-based programming languages
Theory-Driven Design: Question type design based on mature program comprehension theoretical frameworks
Automated Generation: Implemented fully automated question generation workflow
Multi-Dimensional Coverage: Questions span from basic language elements to overall program objectives

Experimental Setup

Dataset

Large-Scale Dataset: 600,913 public Scratch projects, excluding empty and remixed projects
Classroom Experiment Data: 34 German ninth-grade students with Scratch programming experience
Scaffolding Projects: Used Boat Race game as the foundation project for classroom experiments

Evaluation Metrics

Question Generation Frequency: Total generation count and project coverage for each question type
Correlation Analysis: Pearson correlation coefficient between student answer performance and programming task completion
Coverage Analysis: Project coverage percentage for each Block Model dimension
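The first metric combines two numbers per question type: how often it is generated in total, and in what share of projects it appears at least once. A minimal sketch of that computation follows; the function name and the input shape (a mapping from project ID to the list of question types generated for it) are assumptions made for illustration, and the sample data is made up.

```python
from collections import Counter


def frequency_and_coverage(questions_by_project):
    """Return {question_type: (total_count, share_of_projects)}."""
    totals = Counter()             # every generated question counts
    projects_with = Counter()      # each project counts at most once per type
    for types in questions_by_project.values():
        totals.update(types)
        projects_with.update(set(types))
    n_projects = len(questions_by_project)
    return {
        qtype: (totals[qtype], projects_with[qtype] / n_projects)
        for qtype in totals
    }


# Made-up toy input, not data from the paper.
sample = {
    "project-1": ["Purpose of Script", "Purpose of Script", "My Block Definition"],
    "project-2": ["Purpose of Script"],
}
print(frequency_and_coverage(sample))
```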

Comparison Methods

Since this is the first work on QLCs for Scratch, no direct baseline exists; validation was instead performed through:

Conceptual comparison with existing text-based language QLCs
Systematic verification based on theoretical frameworks
Application validation in actual teaching scenarios

Implementation Details

Tool Extension: Based on LitterBox static analysis tool
Output Format: JSON format containing code snippets in ScratchBlocks syntax
Question Presentation: Highlights target code portions (as shown in Figure 1a)
Scoring Mechanism: 0.2 points per correct choice in multiple-choice questions, 1 point for single-choice questions
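The scoring rule can be sketched as follows. The summary does not say exactly what counts as a "correct choice", so this sketch assumes that each option whose selected or unselected state matches the answer key earns 0.2 points, and that every other question earns 1 point for a matching answer. Both function names and the case-insensitive comparison are hypothetical.

```python
POINTS_PER_CHOICE = 0.2  # from the scoring rule above


def score_multiple_choice(options, correct, selected):
    """Award 0.2 per option judged correctly (assumed interpretation)."""
    correct, selected = set(correct), set(selected)
    matches = sum(1 for opt in options if (opt in correct) == (opt in selected))
    # Round to one decimal to hide floating-point noise such as 0.6000000000000001.
    return round(POINTS_PER_CHOICE * matches, 1)


def score_single_answer(expected, given):
    """Award 1 point for a matching answer, ignoring case and outer whitespace."""
    same = str(given).strip().lower() == str(expected).strip().lower()
    return 1.0 if same else 0.0


# Five options, two of them correct; the learner ticks one right and one wrong option.
print(score_multiple_choice(list("abcde"), {"a", "b"}, {"a", "c"}))
print(score_single_answer("Yes", " yes "))
```

Under this reading, a five-option multiple-choice question is worth at most 1 point, the same as a single-answer question.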

Experimental Results

Main Results

RQ1: Question Generation Frequency

Overall Statistics: Generated 54,118,694 questions across 600,913 projects, an average of roughly 90 questions per project
Most Frequent Questions:
- Purpose of Script: 9,748,844 times (100% project coverage)
- Purpose of If Condition: 5,103,322 times (41.1% project coverage)
- Scripts for Actor: 3,524,268 times (100% project coverage)
Least Frequent Questions:
- My Block Definition: 368,712 times (11.3% project coverage)
- Purpose of Loop Condition: 486,902 times (15.2% project coverage)

Block Model Coverage Analysis

Dimension	Atomic	Block	Relational	Macro
Text	64.5%	61.2%	46.5%	100.0%
Execution	30.4%	58.4%	99.0%	71.1%
Purpose	49.0%	100.0%	31.2%	100.0%

RQ2: Correlation Between Answer Performance and Programming Ability

Correlation Coefficient: r = 0.467 (p = 0.005)
Correlation Strength: Moderate positive correlation
Statistical Significance: p < 0.01, statistically significant
Practical Significance: Students' ability to answer QLCs significantly correlates with their programming task completion
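For readers who want to reproduce this kind of analysis on their own class data, the Pearson coefficient is the covariance of the two score lists divided by the product of their standard deviations. The sketch below implements that definition in plain Python; the function name is hypothetical and the two lists are made-up toy values, not the study's data. It does not compute a p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Made-up toy scores: QLC points and task completion points per student.
qlc_scores = [2.0, 3.5, 1.0, 4.0, 2.5]
task_scores = [5, 7, 4, 6, 6]
print(round(pearson_r(qlc_scores, task_scores), 3))
```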

Experimental Findings

Generalizability Verification: All 30 question types can be frequently generated in actual projects
Hierarchical Characteristics: High-level questions (such as program purpose) can be generated in almost all projects, while low-level questions depend on specific programming constructs
Evidence of Effectiveness: The moderate correlation suggests that QLCs can serve as an indicator of program comprehension ability
Teaching Value: Can be used to detect students' knowledge gaps

Related Work

Main Research Directions

Program Comprehension Assessment: Traditional methods primarily focus on code tracing, explanation, and writing skills
Automatic Question Generation: Existing automatic question generation tools target text-based languages like Java
Block Programming Education: Scratch is widely used as an introductory programming language

Relationship to Prior Work

Theoretical Inheritance: Adopts the Block Model, a mature program comprehension theoretical framework
Technical Extension: First application of the existing QLC concept to block-based programming languages
Tool Innovation: Implemented automated question generation for Scratch based on the LitterBox tool

Advantages Over Prior Work

Language Adaptability: Specifically designed for the characteristics of block-based programming languages
Systematic Completeness: Systematic question design based on a theoretical framework
Practicality: Large-scale data validation and actual classroom application

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Can automatically generate large quantities of meaningful comprehension questions for Scratch programs
Educational Effectiveness: Generated questions can effectively assess students' program comprehension abilities
Practical Value: Provides a scalable automated assessment tool for Scratch education

Limitations

Free Text Assessment: Automatic evaluation of open-ended questions still requires human involvement
Question Coverage: Insufficient coverage at lower hierarchical levels for certain Scratch-specific constructs
Experimental Scale: Relatively small sample size in classroom experiments (34 students)
Time Constraints: Time limitations in classroom experiments may affect results

Future Directions

LLM Integration: Utilize large language models to automatically evaluate free-text answers
Question Expansion: Add more question types targeting Scratch-specific constructs
User Interface: Develop question generation and management interfaces suitable for classroom use
Long-Term Effect Research: Evaluate the long-term impact of QLCs on learning outcomes

In-Depth Evaluation

Strengths

Strong Innovation: First systematic application of QLCs to block-based programming languages, filling a research gap
Solid Theoretical Foundation: Systematic design based on Block Model ensures theoretical completeness of questions
Comprehensive Experiments: Combines large-scale data analysis with classroom experiments, validating method feasibility and effectiveness
High Practical Value: Release of open-source tools enables research outcomes to be directly applied to teaching practice
Clear Writing: Well-structured paper with accurate technical detail descriptions

Weaknesses

Assessment Limitations: Free-text question evaluation still requires manual involvement, limiting full automation
Sample Limitations: Relatively small sample size in classroom experiments requires larger-scale validation
Insufficient In-Depth Analysis: Lacks fine-grained analysis of effectiveness across different question types
Adaptability Issues: Insufficient discussion of how to adapt to students of different ages and skill levels

Impact

Academic Contribution: Provides new research directions and tools for programming education assessment
Practical Value: Provides Scratch teachers with practical automated assessment tools
Reproducibility: Open-source code and detailed experimental setup ensure research reproducibility
Promotion Potential: Method is extensible to other block-based programming languages and platforms

Applicable Scenarios

K-12 Programming Education: Particularly suitable for classrooms using Scratch for programming introduction
Online Learning Platforms: Can be integrated into online programming learning systems for automatic feedback
Teacher Training: Helps teachers better understand students' program comprehension levels
Research Tools: Provides standardized assessment tools for programming education research

References

The paper cites 23 important references covering program comprehension theory, programming education assessment, Scratch analysis tools, and other related research areas. Particularly noteworthy are the original Block Model papers, related work on the LitterBox tool, and empirical research on the relationship between program comprehension and programming ability.