
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise

Harada, Yoshida, Kojima et al.

The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models' own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.


Basic Information

Paper ID: 2510.09030
Title: Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Authors: Keno Harada, Lui Yoshida, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo (The University of Tokyo)
Classification: cs.CL (Computation and Language)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09030

Abstract

The performance of large language models (LLMs) is highly sensitive to the prompts they are given. Inspired by the prompt optimization literature, this research explores whether automated essay scoring (AES) can be improved by refining the scoring rubrics used by LLMs. Specifically, the method prompts models to iteratively improve scoring rubrics by reflecting on their own scoring rationales and on discrepancies with human scores on sample essays. Experiments using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct on the TOEFL11 and ASAP datasets show improvements in Quadratic Weighted Kappa (QWK) of up to 0.19 and 0.47, respectively. Notably, even when starting from a simple initial rubric, the method achieves QWK comparable to or better than that obtained with detailed human-written rubrics. The findings highlight the importance of iterative rubric refinement in LLM-based AES for improving alignment with human assessment.

Research Background and Motivation

Problem Definition

Core Issue: Traditional LLM-based automated essay scoring systems use static, predefined scoring rubrics designed for human raters, which may not be optimal for LLMs.
Significance: With the widespread application of LLMs in education, there is a need for AES systems capable of providing real-time, scalable feedback to alleviate teacher grading burden.
Existing Limitations:
- Human raters typically calibrate collaboratively: they score sample essays, discuss discrepancies in their judgments, and refine a shared understanding of the standards
- Current LLM-based AES has no counterpart to this iterative, reflective practice, which limits its consistency with human scoring patterns

Research Motivation

Inspired by prompt optimization techniques and the human rater calibration process, the authors propose an iterative refinement method that enables LLMs to reflect on and improve scoring rubrics based on their own performance on sample essays.

Core Contributions

Proposes an iterative rubric refinement method: Based on a reflect-and-revise mechanism, enabling LLMs to automatically improve scoring rubrics based on discrepancies with human scores
Validates method effectiveness: Demonstrates significant performance improvements using three different LLMs on two standard datasets
Offers new insights on rubric design: Shows that rubrics refined from the simplest seed rubric can match or surpass carefully crafted human-written rubrics
Provides a practical algorithmic framework: Offers a complete iterative refinement algorithm with good reproducibility

Methodology Details

Task Definition

Input: Essay text x and scoring rubric R
Output: Predicted score ŷ and scoring rationale z
Objective: Maximize Quadratic Weighted Kappa (QWK) between LLM scores and human scores
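
Since QWK is the optimization target, it may help to see how it is computed. The following is a minimal standard-library sketch of the usual QWK definition (quadratic disagreement weights, expected counts from the outer product of the two score histograms). The function name and the convention of returning 1.0 when both raters give the same constant score are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """Quadratic Weighted Kappa between two equal-length lists of integer scores."""
    if len(human) != len(predicted) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    n_cat = max_score - min_score + 1
    if n_cat < 2:
        raise ValueError("the score scale needs at least two categories")
    n = len(human)

    observed = Counter(zip(human, predicted))  # joint counts O[i, j]
    hist_h = Counter(human)                    # marginal counts for human scores
    hist_p = Counter(predicted)                # marginal counts for predicted scores

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2
            num += weight * observed[(i, j)]
            den += weight * hist_h[i] * hist_p[j] / n

    if den == 0.0:
        # Both raters gave the same single score to every essay.
        # Treating this as perfect agreement is a convention chosen here.
        return 1.0
    return 1.0 - num / den
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean systematic disagreement. Because the weights grow quadratically, a prediction that is off by two score points is penalized four times as much as one that is off by one.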

Model Architecture

Algorithm Flow

The method comprises the following core components:

Scoring Function: Model M receives the rubric and essay, generating predicted scores and textual rationales
Refinement Function: M generates improved rubrics based on previous rubrics, generated rationales, and scoring discrepancies
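
To make the refinement function's inputs concrete, here is a hypothetical sketch of how the previous rubric, the generated rationales, and the score discrepancies could be assembled into a single prompt. The `Feedback` record, the function name `build_refine_prompt`, and all of the prompt wording are invented for illustration; this summary does not give the paper's actual prompts.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    """One scored sample essay (field names are illustrative)."""
    rationale: str
    pred_score: int
    true_score: int


def build_refine_prompt(rubric: str, feedback: list[Feedback]) -> str:
    """Render the rubric and per-essay feedback as plain text for the model."""
    lines = ["Current rubric:", rubric, "", "Scored sample essays:"]
    for k, fb in enumerate(feedback, start=1):
        gap = fb.pred_score - fb.true_score  # positive means the model over-scored
        lines.append(f"[{k}] predicted={fb.pred_score} human={fb.true_score} gap={gap:+d}")
        lines.append(f"    rationale: {fb.rationale}")
    # Hypothetical instruction text, not the paper's wording.
    lines += ["", "Reflect on the discrepancies above and write an improved rubric."]
    return "\n".join(lines)
```

Showing the signed gap next to each rationale is one way to let the model see both what it argued and in which direction it missed.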

Iterative Refinement Algorithm (Algorithm 1)

Input: Dataset D (split into Dtrain and Dval), Language Model M, Initial Rubric Rseed
Parameters: Number of iterations T, Batch size b

Rbest ← Rseed
QWKbest ← EVALUATE(M, Rbest, Dval)
for t = 1 to T do
  B ← SAMPLEMINIBATCH(Dtrain, b)
  FbData ← ∅
  for each (x, y) ∈ B do
    (ŷ, z) ← SCORE(M, Rbest, x)
    Add (rationale=z, pred_score=ŷ, true_score=y) to FbData
  end for
  Rnew ← REFINE(M, Rbest, FbData)
  QWKnew ← EVALUATE(M, Rnew, Dval)
  if QWKnew > QWKbest then
    Rbest ← Rnew
    QWKbest ← QWKnew
  end if
end for
return Rbest
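
The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative skeleton rather than the authors' implementation: the model calls are passed in as plain callables (`score`, `refine`, `evaluate`), whose names and signatures are assumptions, so the loop can be exercised with stubs before wiring in a real LLM client. Unlike the listing above, it returns the best validation QWK alongside the rubric.

```python
import random


def refine_rubric(train, val, seed_rubric, score, refine, evaluate,
                  iterations=10, batch_size=10, rng=None):
    """Reflect-and-revise loop following the structure of Algorithm 1.

    train / val : sequences of (essay_text, human_score) pairs
    score(rubric, essay_text) -> (predicted_score, rationale)
    refine(rubric, feedback)  -> new rubric text
    evaluate(rubric, val)     -> QWK on the validation set
    """
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    train = list(train)
    best_rubric = seed_rubric
    best_qwk = evaluate(best_rubric, val)

    for _ in range(iterations):
        # Sample a mini-batch of training essays without replacement.
        batch = rng.sample(train, min(batch_size, len(train)))

        # Score each essay with the current best rubric and collect feedback.
        feedback = []
        for essay_text, human_score in batch:
            predicted, rationale = score(best_rubric, essay_text)
            feedback.append({
                "rationale": rationale,
                "pred_score": predicted,
                "true_score": human_score,
            })

        # Ask the model for a revised rubric, then keep it only if it
        # strictly improves validation QWK.
        candidate = refine(best_rubric, feedback)
        candidate_qwk = evaluate(candidate, val)
        if candidate_qwk > best_qwk:
            best_rubric, best_qwk = candidate, candidate_qwk

    return best_rubric, best_qwk
```

Because each candidate is always derived from the current best rubric and accepted only on strict improvement, the validation QWK of the returned rubric can never fall below that of the seed rubric. The loop offers no such guarantee for held-out test performance.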

Technical Innovations

Self-Reflection Mechanism: The model can analyze its own scoring rationales and discrepancies with human scores
Iterative Optimization: Progressively improves rubric quality through multiple refinement rounds
Minimal Initial Requirements: Can start from extremely simple rubrics (e.g., "Score responses based on content on a scale of 1-6")
Performance-Driven Updates: Only updates when new rubrics perform better on the validation set

Experimental Setup

Datasets

TOEFL11 Dataset

Scale: 12,100 essays, 8 essay prompts
Scoring: 3 proficiency levels (high, medium, low), converted from original 5-point scale
Split: Training set 100 essays, validation set 100 essays, test set 1,100 essays

ASAP Dataset

Subset Used: Prompt 1 (P1), 6-point scale
Split: Test set 179 essays (10%), training and validation sets 100 essays each
Characteristics: Includes annotations from two human raters
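
The split sizes reported for both datasets can be reproduced with a simple shuffle-and-slice routine. The sketch below is only one plausible way to produce disjoint training, validation, and test sets of the stated sizes. The function name, the fixed seed, and the order in which the sets are drawn are assumptions, not details from the paper.

```python
import random


def split_essays(essays, n_train=100, n_val=100, n_test=179, seed=0):
    """Shuffle once, then carve out disjoint test, train, and validation sets."""
    items = list(essays)
    if n_train + n_val + n_test > len(items):
        raise ValueError("not enough essays for the requested split sizes")
    random.Random(seed).shuffle(items)  # seeded shuffle (assumed, for repeatability)
    test = items[:n_test]
    train = items[n_test:n_test + n_train]
    val = items[n_test + n_train:n_test + n_train + n_val]
    return train, val, test
```

Calling it with `n_test=179` gives the ASAP Prompt 1 sizes listed above, and `n_test=1100` gives the TOEFL11 sizes.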

Evaluation Metrics

Primary Metric: Quadratic Weighted Kappa (QWK), widely used in AES evaluation
Statistical Method: Each experiment run 3 times, reporting mean and standard deviation
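
Aggregating three runs into a mean and standard deviation takes a few lines of standard-library Python. The QWK values in the usage note are made-up placeholders, and the choice of the sample standard deviation (`statistics.stdev`) over the population version is an assumption, since the summary does not say which one the paper reports.

```python
from statistics import mean, stdev


def summarize_runs(qwk_per_run):
    """Return (mean, sample standard deviation) over repeated runs."""
    if len(qwk_per_run) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    return mean(qwk_per_run), stdev(qwk_per_run)


def format_summary(qwk_per_run):
    """Format as 'mean ± std' with two decimals."""
    m, s = summarize_runs(qwk_per_run)
    return f"{m:.2f} ± {s:.2f}"
```

For example, `format_summary([0.46, 0.48, 0.50])` on these placeholder values yields `0.48 ± 0.02`.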

Comparison Methods

Baseline Method: Using manually written detailed scoring rubrics
Seed Rubric Types:
- simplest_rubric: Simplest rubric
- human_rubric: Official detailed scoring guidelines
- simplified_human_rubric: Simplified human rubric

Implementation Details

Number of Iterations: T = 10
Batch Size: b = 10
Models: GPT-4.1, GPT-5-mini, Gemini-2.5-Flash, Gemini-2.5-Pro, Qwen3-Next-80B-A3B-Instruct
Temperature Settings: Adjusted according to different models (0.7-1.0)

Experimental Results

Main Results

QWK Improvement Magnitude

ASAP Dataset: Maximum improvement of 0.47 QWK
TOEFL11 Dataset: Maximum improvement of 0.19 QWK
Model Performance: 4 out of 5 models showed improvement on ASAP, 2 on TOEFL11

Performance with Different Initial Rubrics (Table 1)

Condition (seed rubric)	ASAP P1 QWK	TOEFL11 QWK
Refined - Human Rubric	0.46	0.56
Refined - Simplified Rubric	0.41	0.58
Refined - Simplest Rubric	0.48	0.64
Unrefined - Human Rubric	0.26	0.58
Unrefined - Simplified Rubric	0.33	0.59
Unrefined - Simplest Rubric	0.17	0.57
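
The per-condition gains implied by Table 1 are easier to read as differences. The short script below computes refined minus unrefined QWK for each seed rubric, using only the numbers in the table. The dictionary layout and key names are illustrative choices.

```python
# QWK values copied from Table 1 as (unrefined, refined) pairs.
TABLE_1 = {
    "human":      {"ASAP": (0.26, 0.46), "TOEFL11": (0.58, 0.56)},
    "simplified": {"ASAP": (0.33, 0.41), "TOEFL11": (0.59, 0.58)},
    "simplest":   {"ASAP": (0.17, 0.48), "TOEFL11": (0.57, 0.64)},
}


def qwk_gains(table):
    """Refined minus unrefined QWK, rounded to the table's two-decimal precision."""
    return {
        seed: {dataset: round(refined - unrefined, 2)
               for dataset, (unrefined, refined) in by_dataset.items()}
        for seed, by_dataset in table.items()
    }


gains = qwk_gains(TABLE_1)
for seed, by_dataset in gains.items():
    print(seed, by_dataset)
```

In this table, refinement helps in all three ASAP conditions (+0.20, +0.08, +0.31) but only from the simplest seed on TOEFL11 (+0.07, versus -0.02 and -0.01). The refined simplest rubric (0.48 and 0.64) also exceeds the unrefined human rubric (0.26 and 0.58) on both datasets.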

Key Findings

Potential of Simplest Rubrics: Starting from the simplest rubric "Score responses based on content on a scale of 1-6," the refined rubric can surpass carefully crafted human-written rubrics
Characteristics of Improved Rubrics:
- Addition of visual emphasis (e.g., bold text) highlighting key evidence
- Brief summary tables added at the end of rubrics
- Explicit conditional rules: "If X is observed, assign score s"
Dataset Differences: TOEFL11 uses coarse-grained three-level scoring (low/medium/high) and has higher QWK values overall, which may leave less room for improvement

Case Analysis

Figure 3 presents the ASAP P1 rubric refined from the simplest rubric, including:

Detailed scoring guidance principles
Specific explanations of the distinction between scores of 4 and 5
Structured scoring summary table
Explicit conditional judgment rules

Main Research Directions

LLM-based Automated Assessment: Using checklists and rubrics to evaluate tasks whose outputs cannot be automatically verified
AES Technology Development: Various automated essay scoring techniques
Rubric Design Research:
- Furuhashi et al. discovered the "negative item" phenomenon
- Yoshida found that more detailed rubrics do not always improve performance

Advantages of This Work

Compared to existing research, this paper is the first to propose having LLMs reflect on their own outputs to iteratively improve rubrics, simulating the calibration process of human raters.

Conclusions and Discussion

Main Conclusions

Iterative Rubric Refinement is Effective: Method effectiveness validated across multiple datasets and models
Low Dependence on the Initial Rubric: Comparable or better QWK than detailed human-written rubrics is achievable even when starting from an extremely simple rubric
Automation Feasibility: LLMs can autonomously identify relevant assessment standards

Limitations

Limited Dataset Scope: Experiments conducted only on TOEFL11 and ASAP Prompt 1
Annotated Data Requirements: Refinement process requires 200 annotated samples (100 training and 100 validation essays)
Single Evaluation Metric: Only QWK used as optimization target, potentially missing other aspects of scoring quality
High Baseline Constraints: Limited room for improvement on datasets with already high baseline scores

Future Directions

Extend to more essay types and domains
Explore methods to reduce annotated data requirements
Investigate multi-metric optimization strategies
Deepen understanding of characteristics of LLM-suitable rubrics

In-Depth Evaluation

Strengths

Strong Method Innovation:
- First application of prompt optimization ideas to AES rubric refinement
- Simulates human rater calibration process with strong intuitive validity
- Simple and effective algorithm design
Comprehensive Experimental Design:
- Validation using multiple models and datasets
- Comparisons with different initial rubrics
- Results reported as mean and standard deviation over three runs
Convincing Results:
- Significant performance improvements (maximum 0.47 QWK)
- Important finding that simplest rubrics surpass human-written rubrics
- Concrete examples of improved rubrics provided
High Practical Value:
- Algorithm easy to implement and reproduce
- Can reduce costs of manual rubric writing
- Provides new insights for AES system optimization

Weaknesses

Limited Experimental Scope:
- Only two datasets tested, generalizability needs verification
- Lacks validation across different languages and cultural backgrounds
- Does not consider differences in essay types
Insufficient Theoretical Analysis:
- Lacks deep theoretical analysis of why the method works
- Explores the characteristics and patterns of improved rubrics only descriptively, without systematic analysis
- Lacks theoretical guarantees for convergence and stability
Missing Cost Analysis:
- Lacks detailed analysis of computational costs and time overhead
- Absent cost-benefit comparison with traditional methods
- Insufficient feasibility analysis for practical deployment

Impact

Academic Contribution:
- Provides new research direction for AES field
- Demonstrates potential of LLM self-improvement capability in assessment tasks
- May inspire research on more adaptive assessment systems
Practical Value:
- Directly applicable to existing LLM-based AES systems
- Helps educational technology companies improve products
- Provides new tools for educational assessment standardization
Reproducibility:
- Provides complete algorithm description
- Includes detailed experimental settings
- Good availability of code and data

Applicable Scenarios

Educational Assessment: Essay scoring for various standardized tests
Online Education: Automatic assignment grading for MOOC platforms
Language Learning: Second language writing ability assessment
Corporate Training: Employee writing skill evaluation

References

The paper cites multiple important related works, including:

Prompt optimization: Khattab et al. (2023), Agrawal et al. (2025)
AES-related: Mizumoto and Eguchi (2023), Lee et al. (2024)
Human rater calibration: Trace et al. (2016), Ouyang et al. (2022)
LLM self-improvement: Madaan et al. (2023), Kamoi et al. (2024)

Overall Assessment: This is a high-quality research paper that proposes an innovative method and achieves significant experimental results. While there is room for improvement in experimental scope and theoretical analysis, its core ideas have strong practical value and academic significance, making important contributions to the development of the AES field.