Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Harada, Yoshida, Kojima et al.
The performance of Large Language Models (LLMs) is highly sensitive to the prompts they are given. Drawing inspiration from the field of prompt optimization, this study investigates the potential for enhancing Automated Essay Scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, our approach prompts models to iteratively refine rubrics by reflecting on models' own scoring rationales and observed discrepancies with human scores on sample essays. Experiments on the TOEFL11 and ASAP datasets using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct show Quadratic Weighted Kappa (QWK) improvements of up to 0.19 and 0.47, respectively. Notably, even with a simple initial rubric, our approach achieves comparable or better QWK than using detailed human-authored rubrics. Our findings highlight the importance of iterative rubric refinement in LLM-based AES to enhance alignment with human evaluations.
The performance of large language models (LLMs) is highly sensitive to given prompts. Inspired by the prompt optimization literature, this research explores the potential of enhancing automated essay scoring (AES) by refining the scoring rubrics used by LLMs. Specifically, the method prompts models to iteratively improve scoring rubrics by reflecting on their own scoring rationales and discrepancies with human scores. Experiments using GPT-4.1, Gemini-2.5-Pro, and Qwen-3-Next-80B-A3B-Instruct on the TOEFL11 and ASAP datasets demonstrate improvements in Quadratic Weighted Kappa (QWK) of up to 0.19 and 0.47, respectively. Notably, even with simple initial rubrics, the method achieves QWK comparable to or better than using detailed human-written rubrics. The findings highlight the importance of iterative rubric refinement in LLM-based AES for enhancing consistency with human assessment.
Core Issue: Traditional LLM-based automated essay scoring systems use static, predefined scoring rubrics designed for human raters, which may not be optimal for LLMs.
Significance: With the widespread application of LLMs in education, there is a need for AES systems capable of providing real-time, scalable feedback to alleviate teacher grading burden.
Existing Limitations:
Current LLM-based AES overlooks the collaborative calibration process of human raters
Human raters typically score sample essays, discuss judgment discrepancies, and refine their shared understanding of standards
This iterative reflective practice is neglected in current LLM-based AES, limiting consistency with human scoring patterns
Inspired by prompt optimization techniques and the human rater calibration process, the authors propose an iterative refinement method that enables LLMs to reflect on and improve scoring rubrics based on their own performance on sample essays.
Proposes an iterative rubric refinement method: Based on a reflect-and-revise mechanism, enabling LLMs to automatically improve scoring rubrics based on discrepancies with human scores
Validates method effectiveness: Reports QWK improvements of up to 0.19 on TOEFL11 and up to 0.47 on ASAP, using three different LLMs
Offers new insight into rubric design: Shows that rubrics refined from the simplest starting rubric can match or surpass carefully crafted human-written rubrics
Provides a practical algorithmic framework: Offers a complete iterative refinement algorithm with good reproducibility
Input: Dataset D (with splits Dtrain and Dval), Language Model M, Initial Rubric Rseed
Parameters: Number of iterations T, Batch size b
Rbest ← Rseed
QWKbest ← EVALUATE(M, Rbest, Dval)
for t = 1 to T do
    B ← SAMPLEMINIBATCH(Dtrain, b)
    FbData ← ∅
    for each (x, y) ∈ B do
        (ŷ, z) ← SCORE(M, Rbest, x)
        Add (rationale=z, pred_score=ŷ, true_score=y) to FbData
    end for
    Rnew ← REFINE(M, Rbest, FbData)
    QWKnew ← EVALUATE(M, Rnew, Dval)
    if QWKnew > QWKbest then
        Rbest ← Rnew
        QWKbest ← QWKnew
    end if
end for
return Rbest
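The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the LLM calls are abstracted as caller-supplied functions (score_fn and refine_fn are hypothetical names), the validation metric is passed in as metric_fn (QWK in the paper), and the fixed random seed is an added assumption for reproducibility.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Essay = Tuple[str, int]                             # (essay text, human score)
ScoreFn = Callable[[str, str], Tuple[int, str]]     # (rubric, essay) -> (score, rationale)
RefineFn = Callable[[str, List[Dict]], str]         # (rubric, feedback) -> new rubric
MetricFn = Callable[[Sequence[int], Sequence[int]], float]  # (gold, pred) -> agreement


def evaluate(score_fn: ScoreFn, metric_fn: MetricFn,
             rubric: str, data: Sequence[Essay]) -> float:
    """Score every essay with the rubric and compare against human scores."""
    preds = [score_fn(rubric, text)[0] for text, _ in data]
    gold = [human for _, human in data]
    return metric_fn(gold, preds)


def refine_rubric(train: Sequence[Essay], val: Sequence[Essay], seed_rubric: str,
                  score_fn: ScoreFn, refine_fn: RefineFn, metric_fn: MetricFn,
                  iterations: int = 10, batch_size: int = 4,
                  seed: int = 0) -> Tuple[str, float]:
    """Reflect-and-revise loop: keep a candidate rubric only if it
    strictly improves the validation metric."""
    rng = random.Random(seed)  # assumption: fixed seed, not stated in the source
    best_rubric = seed_rubric
    best_metric = evaluate(score_fn, metric_fn, best_rubric, val)

    for _ in range(iterations):
        # Sample a mini-batch of training essays without replacement.
        batch = rng.sample(list(train), min(batch_size, len(train)))

        # Collect rationale, predicted score, and human score per essay.
        feedback = []
        for text, human in batch:
            pred, rationale = score_fn(best_rubric, text)
            feedback.append({"rationale": rationale,
                             "pred_score": pred,
                             "true_score": human})

        # Ask the model for a revised rubric, then validate it.
        candidate = refine_fn(best_rubric, feedback)
        candidate_metric = evaluate(score_fn, metric_fn, candidate, val)
        if candidate_metric > best_metric:
            best_rubric, best_metric = candidate, candidate_metric

    return best_rubric, best_metric
```

Because acceptance requires a strict improvement on the validation split, a revision that merely ties the current best is discarded, which mirrors the "QWKnew > QWKbest" test in the pseudocode.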
Potential of Simplest Rubrics: Starting from the simplest rubric, "Score responses based on content on a scale of 1-6," the refined rubric can match or surpass carefully crafted human-written rubrics
Characteristics of Improved Rubrics:
Addition of visual emphasis (e.g., bold text) highlighting key evidence
Brief summary tables added at the end of rubrics
Explicit conditional rules: "If X is observed, assign score s"
Dataset Differences: TOEFL11 uses coarse-grained three-level scoring (low/medium/high), and its QWK values are higher overall, which may leave less room for improvement
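Since every result above is reported in QWK, a small pure-Python sketch of the metric may help. It uses the standard definition, kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij) with w_ij = (i - j)^2 / (K - 1)^2, where O is the observed confusion matrix and E is the matrix expected from the two score histograms. The function name and the mapping of low/medium/high to 1/2/3 are assumptions made for illustration; the source does not say how the scores were encoded or which implementation was used.

```python
from typing import Sequence


def quadratic_weighted_kappa(gold: Sequence[int], pred: Sequence[int],
                             min_score: int, max_score: int) -> float:
    """Quadratic Weighted Kappa between two integer score sequences."""
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score levels")
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)

    # Observed confusion matrix: rows are human scores, columns are predictions.
    observed = [[0] * k for _ in range(k)]
    for g, p in zip(gold, pred):
        observed[g - min_score][p - min_score] += 1

    gold_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * gold_hist[i] * pred_hist[j] / n

    # The denominator is zero only when both raters use one identical level.
    if denominator == 0:
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical three-level example (low=1, medium=2, high=3).
human = [1, 2, 3, 1, 2, 3]
model = [1, 2, 3, 1, 2, 3]
kappa = quadratic_weighted_kappa(human, model, 1, 3)  # perfect agreement
```

With only three levels, a single-step disagreement already carries a weight of 0.25 and a two-step disagreement the maximum weight of 1, so the metric is quite sensitive to each individual miss on a coarse scale.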
Compared to existing research, this paper is the first to propose having LLMs reflect on their own outputs to iteratively improve rubrics, simulating the calibration process of human raters.
The paper cites multiple important related works, including:
Prompt optimization: Khattab et al. (2023), Agrawal et al. (2025)
AES-related: Mizumoto and Eguchi (2023), Lee et al. (2024)
Human rater calibration: Trace et al. (2016), Ouyang et al. (2022)
LLM self-improvement: Madaan et al. (2023), Kamoi et al. (2024)
Overall Assessment: This is a high-quality research paper that proposes an innovative method and achieves significant experimental results. While there is room for improvement in experimental scope and theoretical analysis, its core ideas have strong practical value and academic significance, making important contributions to the development of the AES field.