Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Lai, Zheng, Cheng et al.
The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using LLMs, a paradigm known as "LLM-as-a-judge". However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. Previous studies mainly optimize based on shallow outputs, overlooking rich cross-layer representations. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a post-hoc, plug-and-play framework for improving the alignment of LLM-as-a-Judge point-wise evaluations with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, while keeping the LLM backbone frozen and ensuring no impact on the inference process. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the generalization of LAGER.
As the scale of evaluation tasks continues to expand, the "LLM-as-a-Judge" paradigm for automated evaluation using large language models has been widely adopted. However, improving alignment with human preferences without employing complex prompting or fine-tuning remains challenging. Previous research has primarily optimized based on shallow outputs, overlooking rich cross-layer representations. Motivated by preliminary findings that semantic and task-relevant representations encoded in middle-to-upper layers often align better with human judgments than final layers, this work proposes LAGER, a post-hoc plug-and-play framework that improves the alignment of LLM-as-a-Judge point-wise evaluation with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating scoring token logits across layers and computing expected scores from softmax-based distributions, while keeping the LLM backbone frozen and ensuring no impact on the inference process.
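The aggregation step described above can be illustrated with a minimal sketch. It assumes that per-layer logits for the candidate score tokens have already been extracted (for example, by projecting each layer's hidden state through the model's output head, which is an assumption here rather than something this summary specifies). The function names, layer weights, and logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lager_score(layer_logits, layer_weights, score_values):
    """Aggregate per-layer score-token logits, then take the expected score.

    layer_logits:  one list per layer, holding the logits of the candidate
                   score tokens (e.g. "1".."5") read out at that layer.
    layer_weights: one weight per layer (hypothetical values in this sketch).
    score_values:  the numeric score that each candidate token stands for.
    """
    num_scores = len(score_values)
    # Weighted sum of logits across layers, separately for each score token.
    aggregated = [
        sum(w * logits[k] for w, logits in zip(layer_weights, layer_logits))
        for k in range(num_scores)
    ]
    # Softmax over score tokens gives a distribution; its mean is the score.
    probs = softmax(aggregated)
    return sum(p * v for p, v in zip(probs, score_values))


# Toy example: 3 layers, score tokens 1..5, made-up logits.
layer_logits = [
    [0.1, 0.3, 1.2, 0.9, 0.2],
    [0.0, 0.5, 1.5, 1.4, 0.1],
    [0.2, 0.4, 2.0, 1.0, 0.0],
]
layer_weights = [0.2, 0.5, 0.3]
score = lager_score(layer_logits, layer_weights, [1, 2, 3, 4, 5])
print(round(score, 3))
```

Because the result is an expectation over the score distribution rather than the single most likely token, it is a continuous value between the lowest and highest score, which is what makes the judgment fine-grained.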
Core Problem: Existing LLM-as-a-Judge methods primarily rely on final layer outputs for evaluation, neglecting rich cross-layer representation information within the model, resulting in suboptimal alignment with human judgments.
Significance:
LLM-as-a-Judge has broad applications in model evaluation, data synthesis, and model enhancement scenarios
Improving evaluation accuracy and consistency with human preferences is critical for AI system reliability
Large-scale evaluation tasks require efficient and accurate automated assessment methods
Fine-tuned judge models generalize poorly and adapt to new tasks only to a limited extent
Traditional methods rely solely on final layer outputs, overlooking semantic information in intermediate layers
Research Motivation:
Preliminary studies reveal that scores read from middle-to-upper layers (approximately layers 20-30) often correlate more strongly with human scores than scores read from the final layer
Different layers encode different types of information: lower layers focus on lexical information, middle-to-upper layers focus on semantic and global information
A lightweight, plug-and-play method is needed to leverage these internal representations
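The layer-wise comparison behind this motivation can be sketched as follows: given the scores each layer would assign to a set of responses, compute the Spearman correlation with human scores for every layer and compare. This is a minimal pure-Python sketch, not the authors' analysis code; the layer names and all score values are made up for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five responses; not results from the paper.
human_scores = [1, 2, 3, 4, 5]
per_layer_scores = {
    "layer_24": [1.2, 2.1, 2.9, 4.2, 4.8],
    "final_layer": [2.0, 1.8, 3.1, 3.0, 4.5],
}
correlations = {
    name: spearman(scores, human_scores)
    for name, scores in per_layer_scores.items()
}
best_layer = max(correlations, key=correlations.get)
print(correlations, best_layer)
```

Running the same comparison over all layers of a real model is what would show whether intermediate layers track human judgments more closely than the final layer for a given backbone.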
Main Contributions:
Proposes LAGER Framework: A post-hoc, plug-and-play framework that improves LLM-as-a-Judge alignment with human scores by aggregating cross-layer internal representations
Discovers Advantages of Middle Layers: Empirically demonstrates that middle-to-upper layer representations align better with human judgments than final layers
Achieves Significant Performance Improvements: Improves Spearman correlation with human scores by up to 7.5% over the best baseline on three standard alignment benchmarks: Flask, HelpSteer, and BIGGen
Demonstrates Generalization Capability: Generalizes to downstream applications such as instruction data selection and emotional understanding
Provides Lightweight Solution: Requires training only L+1 weight parameters while keeping the model backbone frozen
Problem Setup:
Input: Evaluation task description, user instruction, response to be evaluated, scoring criteria
Output: Fine-grained continuous scores (rather than discrete integer scores)
Constraints: Keep LLM backbone parameters frozen without affecting the original inference process
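To make the frozen-backbone constraint concrete, the sketch below fits only the layer weights while treating the per-layer score-token logits as fixed inputs. It rests on several assumptions that this summary does not confirm: L+1 is read as one weight per transformer layer plus one for the embedding output, the loss is plain squared error against human scores, and optimization is gradient descent with step halving. All data values and function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def predict(weights, sample_logits, values):
    """Expected score for one sample from weighted per-layer logits."""
    agg = [sum(w * layer[k] for w, layer in zip(weights, sample_logits))
           for k in range(len(values))]
    probs = softmax(agg)
    return sum(p * v for p, v in zip(probs, values)), probs


def mse(weights, data, targets, values):
    return sum((predict(weights, x, values)[0] - y) ** 2
               for x, y in zip(data, targets)) / len(data)


def gradient(weights, data, targets, values):
    grad = [0.0] * len(weights)
    for x, y in zip(data, targets):
        score, probs = predict(weights, x, values)
        for l, layer in enumerate(x):
            # d(score)/d(w_l) = sum_k p_k * (v_k - score) * logit[l][k]
            dscore = sum(p * (v - score) * z
                         for p, v, z in zip(probs, values, layer))
            grad[l] += 2 * (score - y) * dscore / len(data)
    return grad


def fit_layer_weights(data, targets, values, steps=200, lr=0.5):
    """Fit one weight per layer; the logits (the backbone) stay fixed."""
    num_layers = len(data[0])
    weights = [1.0 / num_layers] * num_layers
    loss = mse(weights, data, targets, values)
    for _ in range(steps):
        grad = gradient(weights, data, targets, values)
        step = lr
        while step > 1e-8:
            trial = [w - step * g for w, g in zip(weights, grad)]
            trial_loss = mse(trial, data, targets, values)
            if trial_loss < loss:  # accept only steps that reduce the loss
                weights, loss = trial, trial_loss
                break
            step /= 2
        else:
            break  # no improving step found; stop early
    return weights, loss


# Made-up logits: 4 samples x 3 "layers" x 3 score tokens (scores 1..3).
values = [1, 2, 3]
data = [
    [[0.5, 0.2, 0.1], [1.5, 0.3, 0.0], [2.0, 0.1, 0.0]],
    [[0.2, 0.6, 0.3], [0.1, 1.4, 0.2], [0.0, 1.8, 0.3]],
    [[0.1, 0.3, 0.5], [0.0, 0.4, 1.6], [0.1, 0.2, 2.1]],
    [[0.3, 0.4, 0.4], [0.2, 1.0, 1.1], [0.0, 0.9, 1.5]],
]
targets = [1.0, 2.0, 3.0, 2.5]  # hypothetical human scores
initial_loss = mse([1.0 / 3] * 3, data, targets, values)
weights, final_loss = fit_layer_weights(data, targets, values)
print(weights, initial_loss, final_loss)
```

Since the only trainable quantities are the layer weights, the model's own forward pass is untouched, which is consistent with the plug-and-play claim; the actual training objective used by LAGER may differ from the squared-error loss assumed here.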
This paper cites extensive related work, including:
LLM-as-a-Judge research (Lin & Chen, 2023; Liu et al., 2023, etc.)
Internal representation studies (Wang et al., 2020; Yang et al., 2022, etc.)
Evaluation benchmarks and methods (Ye et al., 2024; Kim et al., 2024, etc.)
Overall Assessment: This is a high-quality research paper. It proposes the LAGER framework, which improves the human alignment of automated evaluation by leveraging LLM internal representations. The method is simple and effective, and the experiments are comprehensive, giving the work both academic value and practical significance. The main limitation is that it applies only to open-source models, since it requires access to intermediate-layer representations; given the rapid development of open-source LLMs, this work still has broad application prospects.