2025-11-23T19:01:17.127547

Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)

Khan, Yaqoob, Tasadduq et al.

The evolving pedagogy paradigms are leading toward educational transformations. One fundamental aspect of effective learning is relevant, immediate, and constructive feedback to students. Providing constructive feedback to large cohorts in academia is an ongoing challenge. Therefore, academics are moving towards automated assessment to provide immediate feedback. However, current approaches are often limited in scope, offering simplistic responses that do not provide students with personalized feedback to guide them toward improvements. This paper addresses this limitation by investigating the performance of Large Language Models (LLMs) in processing students assessments with predefined rubrics and marking criteria to generate personalized feedback for in-depth learning. We aim to leverage the power of existing LLMs for Marking Assessments, Tracking, and Evaluation (LLM-MATE) with personalized feedback to enhance students learning. To evaluate the performance of LLM-MATE, we consider the Software Architecture (SA) module as a case study. The LLM-MATE approach can help module leaders overcome assessment challenges with large cohorts. Also, it helps students improve their learning by obtaining personalized feedback in a timely manner. Additionally, the proposed approach will facilitate the establishment of ground truth for automating the generation of students assessment feedback using the ChatGPT API, thereby reducing the overhead associated with large cohort assessments.

academic

Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)

Basic Information

Paper ID: 2510.11556
Title: Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)
Authors: Javed Ali Khan, Muhammad Yaqoob, Mamoona Tasadduq, Hafsa Shareef Dar, Aitezaz Ahsan
Classification: cs.CY (Computers and Society)
Publication Date/Venue: 2024 (Preprint)
Paper Link: https://arxiv.org/abs/2510.11556

Abstract

The evolution of educational paradigms is driving transformative change in education. A fundamental aspect of effective learning is providing students with relevant, timely, and constructive feedback. Delivering constructive feedback to large student populations remains an ongoing challenge for academia. Consequently, scholars are turning to automated assessment to provide immediate feedback. However, current approaches often have limited scope and provide simplistic responses that cannot offer personalized feedback to guide student improvement. This paper addresses this limitation by investigating the performance of Large Language Models (LLMs) in processing student assessments using predefined rubrics and generating personalized feedback. The authors aim to leverage the power of existing LLMs for assessment marking, tracking, and evaluation (LLM-MATE), enhancing student learning through personalized feedback.

Research Background and Motivation

1. Core Problems

This research addresses the following key issues:

Scalable Feedback Challenge: Difficulty in providing timely, personalized, and constructive feedback to large student populations
Limitations of Traditional Automated Assessment: Existing automated assessment methods have limited scope and provide only simplistic responses, lacking personalized guidance
Teacher Workload: Manual assessment of numerous student assignments is time-consuming and labor-intensive, making it difficult to ensure feedback quality and consistency

2. Problem Significance

Educational Quality Enhancement: Timely, personalized feedback is fundamental to effective learning
Intelligent Education Development: Post-COVID-19, demand for online education and intelligent educational platforms has surged
Educational Equity: Automated assessment can provide consistent quality feedback to all students

3. Limitations of Existing Approaches

Most research focuses on formative assessment with insufficient attention to summative assessment
Feedback provided by existing AI assessment tools is overly simplistic, lacking detailed improvement suggestions
Assessment criteria are inconsistent, with different instructors potentially providing significantly different evaluations

4. Research Motivation

Leveraging the powerful text comprehension and generation capabilities of Large Language Models, combined with predefined rubrics, to provide personalized and constructive feedback for multimodal assessments (text, images, programming) of computer science students.

Core Contributions

Proposed LLM-MATE Framework: A Large Language Model-based marking, tracking, and evaluation system capable of handling multimodal student assessments
Zero-Shot Prompt Engineering Method: Developed specialized ChatGPT prompting strategies for student assessment that generate high-quality feedback without requiring training data
Multimodal Assessment Capability: Validated the effectiveness of LLMs in processing software architecture assessments containing text and diagrams
Teacher Validation Study: Demonstrated the reliability of AI-generated feedback through comparative validation with human experts
Practical Application Value: Provided a feasible solution for automated assessment in large-scale courses

Methodology Details

Task Definition

Input: Student-submitted assessment work (including text descriptions, software architecture diagrams, etc.) + assessment rubrics and scoring guidelines Output: Structured personalized feedback including:

Analysis of assignment strengths
Identification of weaknesses
Specific improvement recommendations
Quantitative scoring with justification

Constraints:

Must be based on predefined assessment rubrics
Feedback must be constructive and personalized
Applicable to large student populations

Model Architecture

Overall Framework: LLM-MATE Four-Step Approach

Data Collection
- Collect anonymized student assessment data
- Cover multiple assessment types for software architecture modules (use case diagrams, class diagrams, three-tier architecture diagrams)
- Obtain student consent and ensure data security
Prompt Engineering
- Domain Constraint: Use structured prompts to constrain ChatGPT analysis within specific parameter ranges
- Personalized Feedback Generation: Customize prompts to analyze strengths, weaknesses, and improvement suggestions for each submission
- Iterative Testing and Optimization: Ensure consistent output quality through extensive testing
- Error Identification: Design prompts to identify student errors and provide constructive explanations
ChatGPT Assessment Execution
- Input: Student assessment + task requirements + evaluation rubrics
- Processing: Analysis based on provided scoring guidelines
- Output: Constructive feedback + overall score
Evaluation and Negotiation Process
- Cross-validation of AI-generated feedback by human experts
- Comparison with manual assessment results
- Identification and resolution of potential "hallucination" issues

Key Technical Details

Zero-Shot Learning Strategy:

System Prompt + Assessment Introduction + Scoring Rubrics + Student Response + Output Format Requirements

Prompt Structure Design:

Clear role definition (as software architecture assessment expert)
Detailed scoring rubric explanation
Structured output format requirements
Specific requirements for constructive feedback

Technical Innovation Points

Multimodal Processing Capability: Utilizing GPT-4o to simultaneously process text and image content, suitable for software engineering assessment
Zero-Shot Adaptability: Adapts to different assessment tasks through prompt engineering alone without requiring task-specific training data
Structured Feedback Generation: Generates comprehensive feedback including strengths, weaknesses, improvement suggestions, and scoring justification
Human-AI Collaborative Verification: Establishes a negotiation mechanism between AI and human experts to ensure feedback quality

Experimental Setup

Dataset

Source: Software Architecture (SA) module at the University of Hertfordshire, UK
Scale: Obtained consent from 23 students out of 290
Content: Assessment assignments containing use case diagrams, class diagrams, and three-tier architecture diagrams
Weight Distribution: Use case diagrams 30%, class diagrams 30%, three-tier architecture diagrams 40%
Sample Selection: Selected high-scoring, medium-scoring, and low-scoring assignments based on diversity principles

Evaluation Metrics

Confidence Score: Instructor confidence in AI feedback (1-5 scale)
- 1-2: Low confidence
- 3: Medium confidence
- 4-5: High confidence
Feedback Quality Assessment: Comparison of detail level and constructiveness between AI and human feedback

Comparison Methods

Manual Assessment: Hand-graded results from 4 module team members as baseline
Traditional Feedback: Brief summative comments (as shown in Figure 4)
AI Feedback: Detailed structured feedback (as shown in Figure 3)

Implementation Details

Model: GPT-4o (supports text and image analysis)
Interface: ChatGPT web interface
Prompting Strategy: Zero-shot learning
Assessment Focus: Primarily use case diagram assessment (30 points maximum)

Experimental Results

Main Findings

RQ1: ChatGPT Performance in Assessment

Finding: ChatGPT performs well in generating personalized constructive feedback

Capable of elaborating on assignment strengths in detail
Accurately identifies weaknesses
Provides specific improvement recommendations
Offers reasonable scores with justification

Comparative Analysis:

AI Feedback (Figure 3): Detailed, structured, personalized with specific technical suggestions
Human Feedback (Figure 4): Brief summaries lacking detailed improvement guidance

RQ2: Reliability of AI Feedback

Teacher Validation Results:

Confidence scores from 4 teachers: 4, 5, 4, 3
Average Confidence: 4.0 (high confidence range)
Consistency: All teachers acknowledged high-quality AI feedback

Case Analysis

Typical AI Feedback Characteristics:

Strength Identification: Accurately identifies correct implementations in student work
Problem Diagnosis: Specifically points out technical errors and conceptual misunderstandings
Improvement Suggestions: Provides actionable specific improvement plans
Scoring Justification: Explains scoring rationale in detail

Experimental Findings

Consistency Advantage: AI assessment provides more consistent feedback standards compared to manual assessment
Detail Level: AI-generated feedback is more detailed and specific than traditional human feedback
Timeliness: Capable of generating immediate feedback, meeting large-scale teaching needs
Personalization: Provides customized suggestions tailored to each student's specific situation

Main Research Directions

Intelligent Feedback Systems:
- Biswas et al.'s machine learning real-time feedback system
- Gutierrez and Atkinson's adaptive feedback approach
- Van der Merwe et al.'s LMS-integrated feedback mechanism
Automated Assessment:
- Fu et al.'s AI automatic grading tool
- Lu and Cutumisu's deep learning paper scoring
- González-Calatayud et al.'s AI assessment survey
Personalized Learning:
- Maier et al.'s personalized feedback classification framework
- Bimba et al.'s adaptive feedback survey

Innovation Comparison with Existing Work

Aspect	Existing Work	This Paper's Contribution
Assessment Type	Primarily formative assessment	Focuses on summative assessment
Feedback Detail	Simple scoring or classification	Detailed structured feedback
Multimodal Processing	Mostly text-only	Simultaneously processes text and images
Validation Method	Student satisfaction surveys	Expert confidence assessment

Conclusions and Discussion

Main Conclusions

Technical Feasibility: ChatGPT can effectively process multimodal assessments of computer science students and generate high-quality personalized feedback
Educational Value: AI-generated feedback is more detailed and constructive than traditional human feedback, facilitating student learning improvement
Practicality: The LLM-MATE approach can help address assessment challenges in large-scale courses and improve teaching efficiency
Consistency: AI assessment provides more consistent evaluation standards compared to multiple human assessors

Limitations

Data Scale Limitation: Only 23 students consented to participate, resulting in relatively small sample size
Assessment Scope: Primarily validated use case diagram assessment with insufficient validation of class diagrams and architecture diagrams
Hallucination Risk: LLMs may generate content that appears authoritative but is actually incorrect
Domain Dependency: Requires carefully designed rubrics to achieve optimal performance
Lack of Student Perspective: Did not directly assess student acceptance and learning effectiveness regarding AI feedback

Future Directions

Expanded Experiments:
- Increase dataset scale
- Validate other types of software engineering diagrams
- Test applicability across different academic disciplines
Technical Improvements:
- Explore few-shot learning and chain-of-thought prompting methods
- Develop automated ChatGPT API solutions
- Establish more comprehensive human-AI collaboration mechanisms
Educational Effectiveness Evaluation:
- Investigate actual impact of AI feedback on student learning outcomes
- Assess student acceptance and trust in AI feedback

In-Depth Evaluation

Strengths

Problem-Oriented Approach: Addresses real pain points in education with clear application value
Methodological Innovation: Novel application of LLMs to multimodal educational assessment
Sufficient Validation: Expert verification ensures credibility of research results
Strong Practicality: Proposed framework can be directly applied to actual teaching environments

Weaknesses

Limited Experimental Scale: Small sample size may affect generalizability of results
Single Evaluation Dimension: Primarily focuses on feedback quality without direct measurement of learning outcomes
Insufficient Technical Depth: Mainly uses existing APIs with limited deep technical innovation
Missing Cost-Benefit Analysis: Does not discuss costs and sustainability of large-scale deployment

Impact

Academic Contribution: Provides new perspectives on LLM applications in educational technology
Practical Value: Directly applicable to large-scale course assessment in higher education
Reproducibility: Clear methodology description facilitates reproduction and improvement by other researchers
Scalability Potential: Framework demonstrates good generalizability and extensibility to other disciplines

Applicable Scenarios

Large-Scale Courses: Particularly suitable for computer science courses with large student populations
Standardized Assessment: Applicable to technical courses with clear scoring rubrics
Multimodal Assignments: Suitable for comprehensive assessments containing diagrams, code, and text
Online Education: Provides automated assessment solutions for remote education platforms

References

This paper cites 38 relevant references, primarily including:

Core References:

González-Calatayud et al. (2021) - Survey of AI student assessment systems
Maier & Klotz (2022) - Personalized feedback in digital learning environments
Biswas & Bhattacharya (2024) - ML-based intelligent real-time feedback system
Liu et al. (2023) - Systematic review of prompt engineering methods

Technical Support References:

White et al. (2024) - ChatGPT prompting patterns
Wei et al. (2022) - Chain-of-thought prompting method
Chen et al. (2023) - LLM applications in software engineering

Overall Assessment: This is a research paper with practical application value. Although it has certain limitations in technical innovation and experimental scale, it provides valuable exploration and practical experience for the educational technology field. The research methodology is sound, results are credible, and it has positive significance for promoting AI applications in educational assessment.