Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

Harada, Yoshida, Kojima et al.
The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models' own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.

Basic Information

  • Paper ID: 2510.09030
  • Title: Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
  • Authors: Keno Harada, Lui Yoshida, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo (The University of Tokyo)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09030

Abstract

The performance of large language models (LLMs) is highly sensitive to given prompts. Inspired by the prompt optimization literature, this research explores the potential of enhancing automated essay scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, the method prompts models to iteratively improve scoring rubrics by reflecting on their own scoring rationales and discrepancies with human scores. Experiments using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct on the TOEFL11 and ASAP datasets demonstrate improvements in Quadratic Weighted Kappa (QWK) of up to 0.19 and 0.47, respectively. Notably, even with simple initial rubrics, the method achieves QWK comparable to or better than using detailed human-written rubrics. The findings highlight the importance of iterative rubric refinement in LLM-based AES for enhancing consistency with human assessment.

Research Background and Motivation

Problem Definition

  1. Core Issue: Traditional LLM-based automated essay scoring systems use static, predefined scoring rubrics designed for human raters, which may not be optimal for LLMs.
  2. Significance: With the widespread application of LLMs in education, there is a need for AES systems capable of providing real-time, scalable feedback to alleviate teacher grading burden.
  3. Existing Limitations:
    • Current LLM-based AES overlooks the collaborative calibration process of human raters
    • Human raters typically score sample essays, discuss judgment discrepancies, and refine their shared understanding of standards
    • This iterative reflective practice is neglected in current LLM-based AES, limiting consistency with human scoring patterns

Research Motivation

Inspired by prompt optimization techniques and the human rater calibration process, the authors propose an iterative refinement method that enables LLMs to reflect on and improve scoring rubrics based on their own performance on sample essays.

Core Contributions

  1. Proposes an iterative rubric refinement method: Based on a reflect-and-revise mechanism, enabling LLMs to automatically improve scoring rubrics based on discrepancies with human scores
  2. Validates method effectiveness: Demonstrates significant performance improvements using three different LLMs on two standard datasets
  3. Offers new insight into rubric design: Shows that rubrics refined from the simplest seed rubric can match or surpass carefully crafted human-written rubrics
  4. Provides a practical algorithmic framework: Offers a complete iterative refinement algorithm with good reproducibility

Methodology Details

Task Definition

  • Input: Essay text x and scoring rubric R
  • Output: Predicted score ŷ and scoring rationale z
  • Objective: Maximize Quadratic Weighted Kappa (QWK) between LLM scores and human scores
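
To make the QWK objective concrete, the following is a minimal pure-Python sketch of the standard definition: one minus the ratio of quadratically weighted observed disagreement to the disagreement expected by chance. The function name, the argument layout, and the choice to return 1.0 when the denominator is zero are assumptions made for illustration; the paper's own evaluation code is not shown in this summary.

```python
def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """QWK between two equal-length lists of integer scores (illustrative sketch)."""
    if len(human) != len(predicted) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("need at least two score categories")
    n = len(human)

    # observed[i][j]: number of essays the human scored i and the model scored j.
    observed = [[0] * n_cat for _ in range(n_cat)]
    for h, p in zip(human, predicted):
        observed[h - min_score][p - min_score] += 1

    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_p[j] / n

    if den == 0:
        # Assumed convention: both raters used one identical category.
        return 1.0
    return 1.0 - num / den
```

Perfect agreement gives 1.0, chance-level agreement gives about 0, and systematically reversed scores give negative values. For real experiments, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` computes the same quantity.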

Model Architecture

Algorithm Flow

The method comprises the following core components:

  1. Scoring Function: Model M receives the rubric and essay, generating predicted scores and textual rationales
  2. Refinement Function: M generates improved rubrics based on previous rubrics, generated rationales, and scoring discrepancies
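
One way to picture the input to the refinement function is as a prompt assembled from the current rubric and the per-essay feedback records. The sketch below is hypothetical: the prompt wording, the function name `build_refinement_prompt`, and the explicit signed gap are illustrative assumptions, not the prompt used in the paper. The record keys mirror the fields named in Algorithm 1 (rationale, pred_score, true_score).

```python
def build_refinement_prompt(rubric, feedback):
    """Assemble a reflect-and-revise prompt (hypothetical wording)."""
    lines = [
        "You are revising an essay scoring rubric.",
        "Current rubric:",
        rubric,
        "",
        "Scored sample essays:",
    ]
    for i, item in enumerate(feedback, start=1):
        # Signed gap: positive means the model scored higher than the human.
        gap = item["pred_score"] - item["true_score"]
        lines.append(
            f"{i}. predicted={item['pred_score']} "
            f"human={item['true_score']} gap={gap:+d}"
        )
        lines.append(f"   rationale: {item['rationale']}")
    lines.append("")
    lines.append("Reflect on the gaps above, then output a revised rubric.")
    return "\n".join(lines)
```

Showing the signed gap next to each rationale is a design choice for this sketch: it lets the model see at a glance whether it tends to over-score or under-score.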

Iterative Refinement Algorithm (Algorithm 1)

Input: Training set Dtrain, validation set Dval, language model M, seed rubric Rseed
Parameters: Number of iterations T, batch size b

Rbest ← Rseed
QWKbest ← EVALUATE(M, Rbest, Dval)
for t = 1 to T do
  B ← SAMPLEMINIBATCH(Dtrain, b)
  FbData ← ∅
  for each (x, y) ∈ B do
    (ŷ, z) ← SCORE(M, Rbest, x)
    Add (rationale=z, pred_score=ŷ, true_score=y) to FbData
  end for
  Rnew ← REFINE(M, Rbest, FbData)
  QWKnew ← EVALUATE(M, Rnew, Dval)
  if QWKnew > QWKbest then
    Rbest ← Rnew
    QWKbest ← QWKnew
  end if
end for
return Rbest
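
The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the model calls are abstracted into three caller-supplied callables (`score`, `refine`, `evaluate`), whose names and signatures are assumptions, and a seeded `random.Random` stands in for SAMPLEMINIBATCH.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Essay = Tuple[str, int]  # (essay text, human score)


def refine_rubric(
    seed_rubric: str,
    train: Sequence[Essay],
    score: Callable[[str, str], Tuple[int, str]],  # (rubric, essay) -> (score, rationale)
    refine: Callable[[str, List[dict]], str],      # (rubric, feedback) -> new rubric
    evaluate: Callable[[str], float],              # rubric -> QWK on the validation set
    iterations: int = 10,
    batch_size: int = 10,
    rng: Optional[random.Random] = None,
) -> Tuple[str, float]:
    """Greedy reflect-and-revise loop following Algorithm 1 (illustrative)."""
    rng = rng or random.Random(0)
    best_rubric = seed_rubric
    best_qwk = evaluate(best_rubric)

    for _ in range(iterations):
        # Sample a minibatch of training essays without replacement.
        batch = rng.sample(list(train), min(batch_size, len(train)))

        # Score each essay with the current best rubric and record the outcome.
        feedback = []
        for essay, human_score in batch:
            predicted, rationale = score(best_rubric, essay)
            feedback.append(
                {"rationale": rationale, "pred_score": predicted, "true_score": human_score}
            )

        # Propose a revised rubric; keep it only if validation QWK strictly improves.
        candidate = refine(best_rubric, feedback)
        candidate_qwk = evaluate(candidate)
        if candidate_qwk > best_qwk:
            best_rubric, best_qwk = candidate, candidate_qwk

    return best_rubric, best_qwk
```

Because a candidate is accepted only on a strict improvement, the validation QWK of the returned rubric is never lower than that of the seed rubric. The defaults of 10 iterations and a batch size of 10 match the settings reported under Implementation Details.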

Technical Innovations

  1. Self-Reflection Mechanism: The model can analyze its own scoring rationales and discrepancies with human scores
  2. Iterative Optimization: Progressively improves rubric quality through multiple refinement rounds
  3. Minimal Initial Requirements: Can start from extremely simple rubrics (e.g., "Score responses based on content on a scale of 1-6")
  4. Performance-Driven Updates: Only updates when new rubrics perform better on the validation set

Experimental Setup

Datasets

TOEFL11 Dataset

  • Scale: 12,100 essays, 8 essay prompts
  • Scoring: 3 proficiency levels (high, medium, low), converted from original 5-point scale
  • Split: Training set 100 essays, validation set 100 essays, test set 1,100 essays

ASAP Dataset

  • Subset Used: Prompt 1 (P1), 6-point scale
  • Split: Test set 179 essays (10%), training and validation sets 100 essays each
  • Characteristics: Includes annotations from two human raters

Evaluation Metrics

  • Primary Metric: Quadratic Weighted Kappa (QWK), widely used in AES evaluation
  • Statistical Method: Each experiment run 3 times, reporting mean and standard deviation
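
The reporting convention above amounts to a small aggregation step, sketched below. The summary does not say whether the reported standard deviation is the sample or the population form, so the use of the sample standard deviation (`statistics.stdev`) is an assumption, as is the function name.

```python
import statistics


def summarize_runs(qwk_runs):
    """Return (mean, sample standard deviation) of QWK over repeated runs."""
    if len(qwk_runs) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(qwk_runs), statistics.stdev(qwk_runs)
```

With only three runs per configuration, the standard deviation is a rough indicator of run-to-run variance rather than a basis for significance claims.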

Comparison Methods

  • Baseline Method: Using manually written detailed scoring rubrics
  • Seed Rubric Types:
    • simplest_rubric: Minimal one-line instruction (e.g., "Score responses based on content on a scale of 1-6")
    • human_rubric: Official detailed scoring guidelines
    • simplified_human_rubric: Simplified human rubric

Implementation Details

  • Number of Iterations: T = 10
  • Batch Size: b = 10
  • Models: GPT-4.1, GPT-5-mini, Gemini-2.5-Flash, Gemini-2.5-Pro, Qwen3-Next-80B-A3B-Instruct
  • Temperature Settings: Adjusted according to different models (0.7-1.0)

Experimental Results

Main Results

QWK Improvement Magnitude

  • ASAP Dataset: Maximum improvement of 0.47 QWK
  • TOEFL11 Dataset: Maximum improvement of 0.19 QWK
  • Model Performance: 4 out of 5 models showed improvement on ASAP, 2 on TOEFL11

Performance with Different Initial Rubrics (Table 1)

Initial Rubric                   ASAP    TOEFL11
Refined - Human Rubric           0.46    0.56
Refined - Simplified Rubric      0.41    0.58
Refined - Simplest Rubric        0.48    0.64
Unrefined - Human Rubric         0.26    0.58
Unrefined - Simplified Rubric    0.33    0.59
Unrefined - Simplest Rubric      0.17    0.57
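
The per-seed gains implied by Table 1 can be computed as refined minus unrefined QWK. The snippet below is a small worked calculation using only the values in the table; the dictionary layout and function name are illustrative choices.

```python
# (unrefined QWK, refined QWK) per seed rubric and dataset, copied from Table 1.
TABLE_1 = {
    "human": {"ASAP": (0.26, 0.46), "TOEFL11": (0.58, 0.56)},
    "simplified": {"ASAP": (0.33, 0.41), "TOEFL11": (0.59, 0.58)},
    "simplest": {"ASAP": (0.17, 0.48), "TOEFL11": (0.57, 0.64)},
}


def qwk_gains(table):
    """Map (seed rubric, dataset) to refined minus unrefined QWK, rounded to 2 places."""
    return {
        (seed, dataset): round(refined - unrefined, 2)
        for seed, by_dataset in table.items()
        for dataset, (unrefined, refined) in by_dataset.items()
    }


gains = qwk_gains(TABLE_1)
```

On these rows the largest gain is 0.31 (simplest seed on ASAP), while on TOEFL11 refinement from the human and simplified rubrics slightly lowers QWK (by 0.02 and 0.01) and only the simplest seed improves (by 0.07). The headline maxima of 0.47 and 0.19 reported above are larger than any gain in this table, so they cannot be derived from these rows.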

Key Findings

  1. Potential of Simplest Rubrics: Starting from the simplest rubric "Score responses based on content on a scale of 1-6," the refined rubric can surpass carefully crafted human-written rubrics
  2. Characteristics of Improved Rubrics:
    • Addition of visual emphasis (e.g., bold text) highlighting key evidence
    • Brief summary tables added at the end of rubrics
    • Explicit conditional rules: "If X is observed, assign score s"
  3. Dataset Differences: TOEFL11 uses coarse-grained three-level scoring (low/medium/high) and has higher QWK values overall, which may leave less room for improvement

Case Analysis

Figure 3 presents the ASAP P1 rubric refined from the simplest rubric, including:

  • Detailed scoring guidance principles
  • Specific explanations of the distinction between scores of 4 and 5
  • Structured scoring summary table
  • Explicit conditional judgment rules

Related Work

Main Research Directions

  1. LLM-based Automated Assessment: Using checklists and rubrics to evaluate non-verifiable tasks
  2. AES Technology Development: Various automated essay scoring techniques
  3. Rubric Design Research:
    • Furuhashi et al. discovered the "negative item" phenomenon
    • Yoshida found that more detailed rubrics do not always improve performance

Advantages of This Work

Compared to existing research, this paper is the first to propose having LLMs reflect on their own outputs to iteratively improve rubrics, simulating the calibration process of human raters.

Conclusions and Discussion

Main Conclusions

  1. Iterative Rubric Refinement is Effective: Method effectiveness validated across multiple datasets and models
  2. Simple Seed Rubrics Suffice: Strong performance is achievable even when starting from extremely simple rubrics
  3. Automation Feasibility: LLMs can autonomously identify relevant assessment standards

Limitations

  1. Limited Dataset Scope: Experiments conducted only on TOEFL11 and ASAP Prompt 1
  2. Annotated Data Requirements: Refinement process requires 200 annotated samples
  3. Single Evaluation Metric: Only QWK is used as the optimization target, so other aspects of scoring quality may be missed
  4. High Baseline Constraints: Limited room for improvement on datasets whose baseline scores are already high

Future Directions

  1. Extend to more essay types and domains
  2. Explore methods to reduce annotated data requirements
  3. Investigate multi-metric optimization strategies
  4. Deepen understanding of characteristics of LLM-suitable rubrics

In-Depth Evaluation

Strengths

  1. Strong Method Innovation:
    • First application of prompt optimization ideas to AES rubric refinement
    • Simulates human rater calibration process with strong intuitive validity
    • Simple and effective algorithm design
  2. Comprehensive Experimental Design:
    • Validation using multiple models and datasets
    • Comparisons with different initial rubrics
    • Results reported as mean and standard deviation over three runs
  3. Convincing Results:
    • Significant performance improvements (maximum 0.47 QWK)
    • Important finding that simplest rubrics surpass human-written rubrics
    • Concrete examples of improved rubrics provided
  4. High Practical Value:
    • Algorithm easy to implement and reproduce
    • Can reduce costs of manual rubric writing
    • Provides new insights for AES system optimization

Weaknesses

  1. Limited Experimental Scope:
    • Only two datasets tested, generalizability needs verification
    • Lacks validation across different languages and cultural backgrounds
    • Does not consider differences in essay types
  2. Insufficient Theoretical Analysis:
    • Lacks deep theoretical analysis of why the method works
    • Describes the characteristics of improved rubrics only at a surface level, without analyzing the underlying patterns
    • Lacks theoretical guarantees for convergence and stability
  3. Missing Cost Analysis:
    • Lacks detailed analysis of computational costs and time overhead
    • No cost-benefit comparison with traditional methods
    • Insufficient feasibility analysis for practical deployment

Impact

  1. Academic Contribution:
    • Provides new research direction for AES field
    • Demonstrates potential of LLM self-improvement capability in assessment tasks
    • May inspire research on more adaptive assessment systems
  2. Practical Value:
    • Directly applicable to existing LLM-based AES systems
    • Helps educational technology companies improve products
    • Provides new tools for educational assessment standardization
  3. Reproducibility:
    • Provides complete algorithm description
    • Includes detailed experimental settings
    • Good availability of code and data

Applicable Scenarios

  1. Educational Assessment: Essay scoring for various standardized tests
  2. Online Education: Automatic assignment grading for MOOC platforms
  3. Language Learning: Second language writing ability assessment
  4. Corporate Training: Employee writing skill evaluation

References

The paper cites multiple important related works, including:

  • Prompt optimization: Khattab et al. (2023), Agrawal et al. (2025)
  • AES-related: Mizumoto and Eguchi (2023), Lee et al. (2024)
  • Human rater calibration: Trace et al. (2016), Ouyang et al. (2022)
  • LLM self-improvement: Madaan et al. (2023), Kamoi et al. (2024)

Overall Assessment: This is a high-quality research paper that proposes an innovative method and achieves significant experimental results. While there is room for improvement in experimental scope and theoretical analysis, its core ideas have strong practical value and academic significance, making important contributions to the development of the AES field.