2025-11-23T19:01:17.127547

Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)

Khan, Yaqoob, Tasadduq et al.
The evolving pedagogy paradigms are leading toward educational transformations. One fundamental aspect of effective learning is relevant, immediate, and constructive feedback to students. Providing constructive feedback to large cohorts in academia is an ongoing challenge. Therefore, academics are moving towards automated assessment to provide immediate feedback. However, current approaches are often limited in scope, offering simplistic responses that do not provide students with personalized feedback to guide them toward improvements. This paper addresses this limitation by investigating the performance of Large Language Models (LLMs) in processing students assessments with predefined rubrics and marking criteria to generate personalized feedback for in-depth learning. We aim to leverage the power of existing LLMs for Marking Assessments, Tracking, and Evaluation (LLM-MATE) with personalized feedback to enhance students learning. To evaluate the performance of LLM-MATE, we consider the Software Architecture (SA) module as a case study. The LLM-MATE approach can help module leaders overcome assessment challenges with large cohorts. Also, it helps students improve their learning by obtaining personalized feedback in a timely manner. Additionally, the proposed approach will facilitate the establishment of ground truth for automating the generation of students assessment feedback using the ChatGPT API, thereby reducing the overhead associated with large cohort assessments.
academic

Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)

Basic Information

  • Paper ID: 2510.11556
  • Title: Personalized and Constructive Feedback for Computer Science Students Using the Large Language Model (LLM)
  • Authors: Javed Ali Khan, Muhammad Yaqoob, Mamoona Tasadduq, Hafsa Shareef Dar, Aitezaz Ahsan
  • Classification: cs.CY (Computers and Society)
  • Publication Date/Venue: 2024 (Preprint)
  • Paper Link: https://arxiv.org/abs/2510.11556

Abstract

The evolution of educational paradigms is driving transformative change in education. A fundamental aspect of effective learning is providing students with relevant, timely, and constructive feedback. Delivering constructive feedback to large student populations remains an ongoing challenge for academia. Consequently, scholars are turning to automated assessment to provide immediate feedback. However, current approaches often have limited scope and provide simplistic responses that cannot offer personalized feedback to guide student improvement. This paper addresses this limitation by investigating the performance of Large Language Models (LLMs) in processing student assessments using predefined rubrics and generating personalized feedback. The authors aim to leverage the power of existing LLMs for assessment marking, tracking, and evaluation (LLM-MATE), enhancing student learning through personalized feedback.

Research Background and Motivation

1. Core Problems

This research addresses the following key issues:

  • Scalable Feedback Challenge: Difficulty in providing timely, personalized, and constructive feedback to large student populations
  • Limitations of Traditional Automated Assessment: Existing automated assessment methods have limited scope and provide only simplistic responses, lacking personalized guidance
  • Teacher Workload: Manual assessment of numerous student assignments is time-consuming and labor-intensive, making it difficult to ensure feedback quality and consistency

2. Problem Significance

  • Educational Quality Enhancement: Timely, personalized feedback is fundamental to effective learning
  • Intelligent Education Development: Post-COVID-19, demand for online education and intelligent educational platforms has surged
  • Educational Equity: Automated assessment can provide consistent quality feedback to all students

3. Limitations of Existing Approaches

  • Most research focuses on formative assessment with insufficient attention to summative assessment
  • Feedback provided by existing AI assessment tools is overly simplistic, lacking detailed improvement suggestions
  • Assessment criteria are inconsistent, with different instructors potentially providing significantly different evaluations

4. Research Motivation

Leveraging the powerful text comprehension and generation capabilities of Large Language Models, combined with predefined rubrics, to provide personalized and constructive feedback for multimodal assessments (text, images, programming) of computer science students.

Core Contributions

  1. Proposed LLM-MATE Framework: A Large Language Model-based marking, tracking, and evaluation system capable of handling multimodal student assessments
  2. Zero-Shot Prompt Engineering Method: Developed specialized ChatGPT prompting strategies for student assessment that generate high-quality feedback without requiring training data
  3. Multimodal Assessment Capability: Validated the effectiveness of LLMs in processing software architecture assessments containing text and diagrams
  4. Teacher Validation Study: Demonstrated the reliability of AI-generated feedback through comparative validation with human experts
  5. Practical Application Value: Provided a feasible solution for automated assessment in large-scale courses

Methodology Details

Task Definition

Input: Student-submitted assessment work (including text descriptions, software architecture diagrams, etc.) + assessment rubrics and scoring guidelines Output: Structured personalized feedback including:

  • Analysis of assignment strengths
  • Identification of weaknesses
  • Specific improvement recommendations
  • Quantitative scoring with justification

Constraints:

  • Must be based on predefined assessment rubrics
  • Feedback must be constructive and personalized
  • Applicable to large student populations

Model Architecture

Overall Framework: LLM-MATE Four-Step Approach

  1. Data Collection
    • Collect anonymized student assessment data
    • Cover multiple assessment types for software architecture modules (use case diagrams, class diagrams, three-tier architecture diagrams)
    • Obtain student consent and ensure data security
  2. Prompt Engineering
    • Domain Constraint: Use structured prompts to constrain ChatGPT analysis within specific parameter ranges
    • Personalized Feedback Generation: Customize prompts to analyze strengths, weaknesses, and improvement suggestions for each submission
    • Iterative Testing and Optimization: Ensure consistent output quality through extensive testing
    • Error Identification: Design prompts to identify student errors and provide constructive explanations
  3. ChatGPT Assessment Execution
    • Input: Student assessment + task requirements + evaluation rubrics
    • Processing: Analysis based on provided scoring guidelines
    • Output: Constructive feedback + overall score
  4. Evaluation and Negotiation Process
    • Cross-validation of AI-generated feedback by human experts
    • Comparison with manual assessment results
    • Identification and resolution of potential "hallucination" issues

Key Technical Details

Zero-Shot Learning Strategy:

System Prompt + Assessment Introduction + Scoring Rubrics + Student Response + Output Format Requirements

Prompt Structure Design:

  • Clear role definition (as software architecture assessment expert)
  • Detailed scoring rubric explanation
  • Structured output format requirements
  • Specific requirements for constructive feedback

Technical Innovation Points

  1. Multimodal Processing Capability: Utilizing GPT-4o to simultaneously process text and image content, suitable for software engineering assessment
  2. Zero-Shot Adaptability: Adapts to different assessment tasks through prompt engineering alone without requiring task-specific training data
  3. Structured Feedback Generation: Generates comprehensive feedback including strengths, weaknesses, improvement suggestions, and scoring justification
  4. Human-AI Collaborative Verification: Establishes a negotiation mechanism between AI and human experts to ensure feedback quality

Experimental Setup

Dataset

  • Source: Software Architecture (SA) module at the University of Hertfordshire, UK
  • Scale: Obtained consent from 23 students out of 290
  • Content: Assessment assignments containing use case diagrams, class diagrams, and three-tier architecture diagrams
  • Weight Distribution: Use case diagrams 30%, class diagrams 30%, three-tier architecture diagrams 40%
  • Sample Selection: Selected high-scoring, medium-scoring, and low-scoring assignments based on diversity principles

Evaluation Metrics

  • Confidence Score: Instructor confidence in AI feedback (1-5 scale)
    • 1-2: Low confidence
    • 3: Medium confidence
    • 4-5: High confidence
  • Feedback Quality Assessment: Comparison of detail level and constructiveness between AI and human feedback

Comparison Methods

  • Manual Assessment: Hand-graded results from 4 module team members as baseline
  • Traditional Feedback: Brief summative comments (as shown in Figure 4)
  • AI Feedback: Detailed structured feedback (as shown in Figure 3)

Implementation Details

  • Model: GPT-4o (supports text and image analysis)
  • Interface: ChatGPT web interface
  • Prompting Strategy: Zero-shot learning
  • Assessment Focus: Primarily use case diagram assessment (30 points maximum)

Experimental Results

Main Findings

RQ1: ChatGPT Performance in Assessment

Finding: ChatGPT performs well in generating personalized constructive feedback

  • Capable of elaborating on assignment strengths in detail
  • Accurately identifies weaknesses
  • Provides specific improvement recommendations
  • Offers reasonable scores with justification

Comparative Analysis:

  • AI Feedback (Figure 3): Detailed, structured, personalized with specific technical suggestions
  • Human Feedback (Figure 4): Brief summaries lacking detailed improvement guidance

RQ2: Reliability of AI Feedback

Teacher Validation Results:

  • Confidence scores from 4 teachers: 4, 5, 4, 3
  • Average Confidence: 4.0 (high confidence range)
  • Consistency: All teachers acknowledged high-quality AI feedback

Case Analysis

Typical AI Feedback Characteristics:

  1. Strength Identification: Accurately identifies correct implementations in student work
  2. Problem Diagnosis: Specifically points out technical errors and conceptual misunderstandings
  3. Improvement Suggestions: Provides actionable specific improvement plans
  4. Scoring Justification: Explains scoring rationale in detail

Experimental Findings

  1. Consistency Advantage: AI assessment provides more consistent feedback standards compared to manual assessment
  2. Detail Level: AI-generated feedback is more detailed and specific than traditional human feedback
  3. Timeliness: Capable of generating immediate feedback, meeting large-scale teaching needs
  4. Personalization: Provides customized suggestions tailored to each student's specific situation

Main Research Directions

  1. Intelligent Feedback Systems:
    • Biswas et al.'s machine learning real-time feedback system
    • Gutierrez and Atkinson's adaptive feedback approach
    • Van der Merwe et al.'s LMS-integrated feedback mechanism
  2. Automated Assessment:
    • Fu et al.'s AI automatic grading tool
    • Lu and Cutumisu's deep learning paper scoring
    • González-Calatayud et al.'s AI assessment survey
  3. Personalized Learning:
    • Maier et al.'s personalized feedback classification framework
    • Bimba et al.'s adaptive feedback survey

Innovation Comparison with Existing Work

AspectExisting WorkThis Paper's Contribution
Assessment TypePrimarily formative assessmentFocuses on summative assessment
Feedback DetailSimple scoring or classificationDetailed structured feedback
Multimodal ProcessingMostly text-onlySimultaneously processes text and images
Validation MethodStudent satisfaction surveysExpert confidence assessment

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: ChatGPT can effectively process multimodal assessments of computer science students and generate high-quality personalized feedback
  2. Educational Value: AI-generated feedback is more detailed and constructive than traditional human feedback, facilitating student learning improvement
  3. Practicality: The LLM-MATE approach can help address assessment challenges in large-scale courses and improve teaching efficiency
  4. Consistency: AI assessment provides more consistent evaluation standards compared to multiple human assessors

Limitations

  1. Data Scale Limitation: Only 23 students consented to participate, resulting in relatively small sample size
  2. Assessment Scope: Primarily validated use case diagram assessment with insufficient validation of class diagrams and architecture diagrams
  3. Hallucination Risk: LLMs may generate content that appears authoritative but is actually incorrect
  4. Domain Dependency: Requires carefully designed rubrics to achieve optimal performance
  5. Lack of Student Perspective: Did not directly assess student acceptance and learning effectiveness regarding AI feedback

Future Directions

  1. Expanded Experiments:
    • Increase dataset scale
    • Validate other types of software engineering diagrams
    • Test applicability across different academic disciplines
  2. Technical Improvements:
    • Explore few-shot learning and chain-of-thought prompting methods
    • Develop automated ChatGPT API solutions
    • Establish more comprehensive human-AI collaboration mechanisms
  3. Educational Effectiveness Evaluation:
    • Investigate actual impact of AI feedback on student learning outcomes
    • Assess student acceptance and trust in AI feedback

In-Depth Evaluation

Strengths

  1. Problem-Oriented Approach: Addresses real pain points in education with clear application value
  2. Methodological Innovation: Novel application of LLMs to multimodal educational assessment
  3. Sufficient Validation: Expert verification ensures credibility of research results
  4. Strong Practicality: Proposed framework can be directly applied to actual teaching environments

Weaknesses

  1. Limited Experimental Scale: Small sample size may affect generalizability of results
  2. Single Evaluation Dimension: Primarily focuses on feedback quality without direct measurement of learning outcomes
  3. Insufficient Technical Depth: Mainly uses existing APIs with limited deep technical innovation
  4. Missing Cost-Benefit Analysis: Does not discuss costs and sustainability of large-scale deployment

Impact

  1. Academic Contribution: Provides new perspectives on LLM applications in educational technology
  2. Practical Value: Directly applicable to large-scale course assessment in higher education
  3. Reproducibility: Clear methodology description facilitates reproduction and improvement by other researchers
  4. Scalability Potential: Framework demonstrates good generalizability and extensibility to other disciplines

Applicable Scenarios

  1. Large-Scale Courses: Particularly suitable for computer science courses with large student populations
  2. Standardized Assessment: Applicable to technical courses with clear scoring rubrics
  3. Multimodal Assignments: Suitable for comprehensive assessments containing diagrams, code, and text
  4. Online Education: Provides automated assessment solutions for remote education platforms

References

This paper cites 38 relevant references, primarily including:

Core References:

  1. González-Calatayud et al. (2021) - Survey of AI student assessment systems
  2. Maier & Klotz (2022) - Personalized feedback in digital learning environments
  3. Biswas & Bhattacharya (2024) - ML-based intelligent real-time feedback system
  4. Liu et al. (2023) - Systematic review of prompt engineering methods

Technical Support References:

  • White et al. (2024) - ChatGPT prompting patterns
  • Wei et al. (2022) - Chain-of-thought prompting method
  • Chen et al. (2023) - LLM applications in software engineering

Overall Assessment: This is a research paper with practical application value. Although it has certain limitations in technical innovation and experimental scale, it provides valuable exploration and practical experience for the educational technology field. The research methodology is sound, results are credible, and it has positive significance for promoting AI applications in educational assessment.