Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Lai, Zheng, Cheng et al.

The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using LLMs, a paradigm known as "LLM-as-a-judge". However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. Previous studies mainly optimize based on shallow outputs, overlooking rich cross-layer representations. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a post-hoc, plug-and-play framework for improving the alignment of LLM-as-a-Judge point-wise evaluations with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, while keeping the LLM backbone frozen and ensuring no impact on the inference process. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the generalization of LAGER.

Basic Information

Paper ID: 2508.03550
Title: Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
Authors: Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Category: cs.CL (Computation and Language)
Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2508.03550

Abstract

As the scale of evaluation tasks continues to expand, the "LLM-as-a-Judge" paradigm for automated evaluation using large language models has been widely adopted. However, improving alignment with human preferences without employing complex prompting or fine-tuning remains challenging. Previous research has primarily optimized based on shallow outputs, overlooking rich cross-layer representations. Motivated by preliminary findings that semantic and task-relevant representations encoded in middle-to-upper layers often align better with human judgments than final layers, this work proposes LAGER, a post-hoc plug-and-play framework that improves the alignment of LLM-as-a-Judge point-wise evaluation with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating scoring token logits across layers and computing expected scores from softmax-based distributions, while keeping the LLM backbone frozen and ensuring no impact on the inference process.

Research Background and Motivation

Problem Definition

Core Problem: Existing LLM-as-a-Judge methods primarily rely on final layer outputs for evaluation, neglecting rich cross-layer representation information within the model, resulting in suboptimal alignment with human judgments.
Significance:
- LLM-as-a-Judge has broad applications in model evaluation, data synthesis, and model enhancement scenarios
- Improving evaluation accuracy and consistency with human preferences is critical for AI system reliability
- Large-scale evaluation tasks require efficient and accurate automated assessment methods
Limitations of Existing Methods:
- Prompt-based methods require complex reasoning steps, increasing computational costs
- Fine-tuning methods face generalization issues with limited adaptability
- Traditional methods rely solely on final layer outputs, overlooking semantic information in intermediate layers
Research Motivation:
- Preliminary studies reveal that middle-to-upper layers (approximately layers 20-30) often correlate more strongly with human scores than final layers
- Different layers encode different types of information: lower layers focus on lexical information, middle-to-upper layers focus on semantic and global information
- A lightweight, plug-and-play method is needed to leverage these internal representations

Core Contributions

Proposes LAGER Framework: A post-hoc, plug-and-play framework that improves LLM-as-a-Judge alignment with human scores by aggregating cross-layer internal representations
Discovers Advantages of Middle Layers: Empirically demonstrates that middle-to-upper layer representations align better with human judgments than final layers
Achieves Significant Performance Improvements: Improves Spearman correlation by up to 7.5% over the best baseline on three standard alignment benchmarks: Flask, HelpSteer, and BIGGen
Demonstrates Generalization Capability: Shows good generalization performance in downstream applications such as instruction data selection and emotional understanding
Provides Lightweight Solution: Requires training only L+1 weight parameters while keeping the model backbone frozen

Methodology Details

Task Definition

Input: Evaluation task description, user instruction, response to be evaluated, scoring criteria
Output: Fine-grained continuous scores (rather than discrete integer scores)
Constraints: Keep LLM backbone parameters frozen without affecting the original inference process

Model Architecture

1. Basic Framework

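For decoder models, traditional methods use only the final-layer hidden state at position n:

h^(L)_n = f^(L)_decoder ∘ ··· ∘ f^(1)_decoder ∘ f_embd(x_<n)

where x_<n denotes the tokens before position n, f_embd is the embedding layer, and f^(i)_decoder is the i-th of the L decoder layers.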

2. LAGER Core Mechanism

Cross-layer Logits Aggregation:

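ẑ = Σ(i=0 to L) w_i * ẑ_i = Σ(i=0 to L) w_i * h^(i)_n * W_unembd

where ẑ_i is the logit vector obtained from layer i (i = 0 being the embedding output), w_i is the weight of layer i, and W_unembd is the unembedding matrix.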

Candidate Score Extraction:

ẑ[M] = Σ(i=0 to L) w_i * [h^(i)_n * W_unembd]_M

where S is the set of candidate scores and M = {Tokenize(s)|s ∈ S} is the corresponding set of candidate score tokens
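The aggregation and candidate-extraction steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy sizes, the random hidden states, and the token ids in score_token_ids are all hypothetical, and the function name aggregate_score_logits is introduced here only for the example.

```python
import numpy as np

def aggregate_score_logits(hidden_states, w_unembd, layer_weights, score_token_ids):
    """Return aggregated logits restricted to the candidate score tokens.

    hidden_states:   (L+1, d) hidden state at the scoring position for the
                     embedding output (row 0) and each decoder layer.
    w_unembd:        (d, V) unembedding matrix.
    layer_weights:   (L+1,) weights w_i.
    score_token_ids: token ids of the candidate scores (the set M).
    """
    per_layer_logits = hidden_states @ w_unembd           # (L+1, V)
    score_logits = per_layer_logits[:, score_token_ids]   # (L+1, |M|)
    return layer_weights @ score_logits                   # (|M|,)

# Toy inputs; every value below is made up for illustration.
rng = np.random.default_rng(0)
num_layers, d_model, vocab_size = 4, 8, 50
hidden_states = rng.normal(size=(num_layers + 1, d_model))
w_unembd = rng.normal(size=(d_model, vocab_size))
score_token_ids = np.array([11, 12, 13, 14, 15])  # hypothetical ids for "1".."5"
uniform_weights = np.full(num_layers + 1, 1.0 / (num_layers + 1))

z_hat_M = aggregate_score_logits(hidden_states, w_unembd, uniform_weights, score_token_ids)
print(z_hat_M)
```

Because the unembedding is linear, weighting the per-layer logits gives the same result as weighting the hidden states first and projecting once, which is one way to keep the extra cost small.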

Probability Distribution Computation:

P(s) = exp(ẑ[s]) / Σ(s'∈S) exp(ẑ[s'])

Expected Score:

s* = E_s~P(s)[s] = Σ(s∈S) s × P(s)
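As a rough sketch of the last two formulas, the snippet below applies a softmax over the candidate-score logits and takes the expectation of the score, next to the argmax score for comparison. The 1 to 5 scale and the logit values are assumptions chosen for illustration.

```python
import numpy as np

def expected_score(score_logits, scores):
    """Softmax over candidate-score logits, then the expectation of the score."""
    shifted = score_logits - np.max(score_logits)   # subtract max for numerical stability
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return float(np.dot(scores, probs)), probs

scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # candidate set S (assumed 1-5 scale)
score_logits = np.array([0.1, 0.5, 2.0, 2.2, 0.3])  # made-up aggregated logits

s_star, probs = expected_score(score_logits, scores)
argmax_score = scores[np.argmax(score_logits)]
print(f"expected score: {s_star:.3f}, argmax score: {argmax_score:.0f}")
```

With these logits the argmax picks 4, while the expectation lands between 3 and 4 because the probability mass is split between the two neighbouring scores. That difference is what makes the output fine-grained.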

3. Weight Training Strategy

Two weight settings are provided:

Non-tuned Version: Uniform aggregation with w_i = 1/(L+1) for every layer
Tuned Version: The L+1 weights are learned with a combined loss function

Loss Function:

L_Final = α·L_CE + (1-α)·L_MAE

where the cross-entropy loss L_CE handles discrete labels, the MAE loss L_MAE handles continuous scores, and α balances the two terms
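A minimal sketch of the combined loss for a single example is shown below. The summary does not spell out how each term is computed, so two assumptions are made here: the cross-entropy term uses the candidate score nearest to the human score as the discrete label, and the MAE term compares the expected score with the human score. The function name combined_loss and the default alpha=0.5 are illustrative only.

```python
import numpy as np

def combined_loss(score_logits, scores, human_score, alpha=0.5):
    """alpha * cross-entropy + (1 - alpha) * absolute error for one example."""
    shifted = score_logits - np.max(score_logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))        # log-softmax
    # Assumption: the discrete label is the candidate score nearest the human score.
    label_index = int(np.argmin(np.abs(scores - human_score)))
    ce = -log_probs[label_index]
    expected = float(np.dot(scores, np.exp(log_probs)))          # expected score s*
    mae = abs(expected - human_score)
    return float(alpha * ce + (1.0 - alpha) * mae)

scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
good_logits = np.array([0.0, 0.0, 0.0, 5.0, 0.0])   # mass on the correct score
bad_logits = np.array([5.0, 0.0, 0.0, 0.0, 0.0])    # mass on the wrong score

loss_good = combined_loss(good_logits, scores, human_score=4.0)
loss_bad = combined_loss(bad_logits, scores, human_score=4.0)
print(loss_good, loss_bad)
```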

Technical Innovations

Cross-layer Information Fusion: First systematic utilization of internal representations from all Transformer layers for evaluation
Expected Score Mechanism: Computing continuous scores through probability distributions rather than simple argmax operations
Plug-and-Play Design: No modification of original model parameters and inference process, directly applicable to existing models
Lightweight Training: Requires training only L+1 weight parameters with minimal training cost

Experimental Setup

Datasets

Flask: 2,001 entries with 12 scoring dimensions (conciseness, insightfulness, readability, etc.)
HelpSteer: 8.95k data points evaluated on 5 criteria (helpfulness, correctness, coherence, etc.)
BiGGen Bench: Comprehensive evaluation benchmark covering 77 tasks assessing 9 generative capabilities

Evaluation Metrics

Primary Metric: Spearman correlation coefficient (suitable for ordinal data, robust to outliers)
Auxiliary Metric: Pearson correlation coefficient
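For readers who want to see how the two metrics differ, here is a small NumPy sketch: Spearman correlation is the Pearson correlation computed on ranks (with ties sharing their average rank). The human and judge scores below are invented for the example and are not taken from the paper.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

human = [1, 2, 3, 4, 5]              # made-up human scores
judge = [1.2, 1.9, 3.5, 3.4, 4.8]    # made-up continuous judge scores

print("Spearman:", spearman(human, judge))
print("Pearson: ", pearson(human, judge))
```

Spearman depends only on the ordering of the scores, so any monotonic rescaling of the judge's output leaves it unchanged, whereas Pearson is sensitive to the actual values.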

Comparison Methods

Non-training Baselines: GPTScore, Vanilla Score (VScore), Expectation Score (E-Score)
API Models: GPT-4o-mini
Fine-tuned Models: TIGERScore-7B, Prometheus2-7B (for reference only)

Implementation Details

Models: 6 backbone models of different scales (7B-70B)
Decoding Strategy: Greedy decoding for stability
Evaluation Settings: Both direct evaluation and reasoning evaluation settings
Weight Training: Using 1000 HelpSteer samples, Adam optimizer, learning rate 0.01
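The weight-training setup can be illustrated with a toy NumPy loop. This is a sketch under several assumptions rather than the paper's training code: the data are synthetic (one layer is made artificially informative), the gradient is derived by hand for the combined loss with an assumed alpha of 0.5, full-batch updates are used, and the Adam hyperparameters other than the learning rate of 0.01 are the common defaults.

```python
import numpy as np

def loss_and_grad(weights, layer_logits, scores, labels, alpha=0.5):
    """Combined loss and its gradient with respect to the layer weights.

    layer_logits: (N, L+1, K) score-token logits per example and layer.
    labels:       (N,) index of the human score within `scores`.
    """
    n = len(labels)
    z = np.einsum("l,nlk->nk", weights, layer_logits)           # aggregated logits
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    probs = np.exp(log_probs)
    expected = probs @ scores                                    # expected score per example
    targets = scores[labels]

    ce = -log_probs[np.arange(n), labels].mean()
    mae = np.abs(expected - targets).mean()

    grad_z_ce = probs.copy()
    grad_z_ce[np.arange(n), labels] -= 1.0                       # softmax minus one-hot
    grad_z_mae = (np.sign(expected - targets)[:, None]
                  * probs * (scores[None, :] - expected[:, None]))
    grad_z = (alpha * grad_z_ce + (1.0 - alpha) * grad_z_mae) / n
    grad_w = np.einsum("nk,nlk->l", grad_z, layer_logits)
    return float(alpha * ce + (1.0 - alpha) * mae), grad_w

def train_layer_weights(layer_logits, scores, labels, steps=200, lr=0.01,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Full-batch Adam on the L+1 layer weights, starting from uniform weights."""
    num_layers = layer_logits.shape[1]
    w = np.full(num_layers, 1.0 / num_layers)
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        _, g = loss_and_grad(w, layer_logits, scores, labels)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        w = w - lr * (m / (1.0 - beta1 ** t)) / (np.sqrt(v / (1.0 - beta2 ** t)) + eps)
    return w

# Synthetic data: 64 examples, 4 "layers", 5 candidate scores (all hypothetical).
rng = np.random.default_rng(0)
n, num_layers, k = 64, 4, 5
scores = np.arange(1.0, k + 1)
labels = rng.integers(0, k, size=n)
layer_logits = rng.normal(size=(n, num_layers, k))
layer_logits[np.arange(n), 2, labels] += 3.0    # make layer index 2 informative

w_uniform = np.full(num_layers, 1.0 / num_layers)
loss_before, _ = loss_and_grad(w_uniform, layer_logits, scores, labels)
w_trained = train_layer_weights(layer_logits, scores, labels)
loss_after, _ = loss_and_grad(w_trained, layer_logits, scores, labels)
print(loss_before, loss_after, w_trained)
```

Only the weight vector is updated; the logits themselves are treated as fixed inputs, which mirrors the frozen-backbone constraint described earlier.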

Experimental Results

Main Results

Significant Performance Improvements:

LAGER outperforms all non-training baselines on all benchmarks
Average Spearman correlation improvement: 4.5% for non-tuned version, higher for tuned version
Maximum improvement of 7.5% on certain models

Key Findings:

Cross-model Consistency: Improvements achieved across 6 models of different scales
Competitive with API Models: Enables open-source models to reach GPT-4o-mini level
Surpasses Fine-tuning Methods: With LAGER, InternLM3-8B and LLaMA3.1-8B outperform the similarly sized fine-tuned judge Prometheus2-7B

Ablation Studies

Component Importance Ranking:

Expected Score > Maximum Score (+0.17 improvement)
Logits Aggregation > Probability Aggregation (+0.07 improvement)
Weight Tuning provides +0.10 improvement
Multi-layer integration effects vary across models
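The difference between the two aggregation orders in the ablation can be made concrete with a small sketch: logits aggregation averages the per-layer logits and applies one softmax, while probability aggregation applies a softmax per layer and averages the resulting distributions. The per-layer logits and uniform weights below are invented for illustration, and the snippet only shows that the two orders give different distributions, not which one aligns better with humans.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

scores = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Made-up score-token logits for three layers (rows) and five scores (columns).
layer_score_logits = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 6.0, 0.0],
])
weights = np.full(3, 1.0 / 3.0)

# Logits aggregation: weight the logits, then a single softmax.
probs_logit_agg = softmax(weights @ layer_score_logits)
# Probability aggregation: softmax per layer, then weight the distributions.
probs_prob_agg = weights @ softmax(layer_score_logits, axis=1)

print("logits aggregation:     ", probs_logit_agg, probs_logit_agg @ scores)
print("probability aggregation:", probs_prob_agg, probs_prob_agg @ scores)
```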

Cross-scale Analysis

Scale Effects:

Validation on Qwen2.5 series (0.5B-72B)
LAGER improvements amplify with model scale
Best performance achieved on 72B model (Flask: 0.658 Spearman)

Case Analysis

Distribution Alignment:

LAGER-generated score distributions align better with human annotations
KL divergence reduced from 0.312 to 0.087
MSE reduced from 0.112 to 0.060
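The two distribution-alignment measures can be computed as in the sketch below. The summary does not state the KL direction, the binning, or any smoothing, so this example assumes KL(human || judge) over unit-width score bins with a small pseudo-count; the score lists are invented and do not reproduce the reported numbers.

```python
import numpy as np

def score_histogram(values, bin_edges, pseudo_count=0.5):
    """Smoothed, normalised histogram of scores over the given bins."""
    counts, _ = np.histogram(values, bins=bin_edges)
    probs = counts.astype(float) + pseudo_count   # assumed smoothing to avoid log(0)
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return float(np.sum(p * np.log(p / q)))

def mse(pred, target):
    """Mean squared error between predicted and target scores."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

bin_edges = np.arange(0.5, 6.0, 1.0)   # bins centred on the scores 1..5
human = [1, 2, 3, 3, 4, 4, 4, 5]                           # made-up human scores
discrete_judge = [3, 3, 3, 4, 4, 4, 4, 4]                  # made-up integer judge scores
continuous_judge = [1.4, 2.2, 2.9, 3.3, 3.8, 4.1, 4.4, 4.7]  # made-up expected scores

p_human = score_histogram(human, bin_edges)
kl_discrete = kl_divergence(p_human, score_histogram(discrete_judge, bin_edges))
kl_continuous = kl_divergence(p_human, score_histogram(continuous_judge, bin_edges))
print("KL :", kl_discrete, kl_continuous)
print("MSE:", mse(discrete_judge, human), mse(continuous_judge, human))
```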

Related Work

Text Generation Evaluation

Traditional Metrics: BLEU, ROUGE and other statistical methods have obvious limitations
Embedding Methods: BERTScore, BARTScore require reference answers
GPTScore: Based on generation probability but ignores semantic quality

LLM-as-a-Judge

Point-wise Evaluation: Independent evaluation of individual responses
Pairwise Comparison: Direct comparison of two responses
Listwise Ranking: Ranking multiple responses

Method Classification:

Prompt-based: Improving judgments through reasoning steps
Fine-tuning-based: Training specialized evaluation models

Conclusions and Discussion

Main Conclusions

Middle Layer Advantages: Middle-to-upper layer representations indeed align better with human judgments than final layers
Lightweight Effectiveness: Training minimal parameters achieves significant performance improvements
Reasoning Not Necessary: Matches or exceeds reasoning-based methods without explicit reasoning steps
Good Generalization: Excellent performance on multiple downstream tasks

Limitations

Open-source Model Constraints: Requires access to model internal states, cannot be applied to closed-source API models
Computational Overhead: Requires extracting the hidden state of every layer and projecting it through the unembedding matrix, on top of the standard forward pass
Weight Universality: Different model families may require retraining weights

Future Directions

Theoretical Analysis: Deeper understanding of semantic properties of different layer representations
Efficiency Optimization: Methods to reduce computational overhead
Adaptive Weights: Mechanisms for adaptively adjusting weights across different layers

In-depth Evaluation

Strengths

Strong Novelty: First systematic utilization of Transformer internal representations for evaluation
High Practical Value: Plug-and-play design, easy to deploy
Comprehensive Experiments: Full evaluation across multiple benchmarks and model scales
Theoretical Support: Provides theoretical insights through inter-layer similarity analysis

Weaknesses

Limited Applicability: Only applicable to open-source models
Insufficient Mechanism Explanation: Lacks deep theoretical explanation for why middle layers perform better
Computational Cost: Although the trainable parameters are minimal, inference requires an extra unembedding projection for every layer

Impact

Academic Contribution: Provides new perspective for LLM internal representation research
Practical Value: Provides effective tools for open-source model evaluation
Reproducibility: Code is publicly available, experiments are reproducible

Applicable Scenarios

Model Evaluation: Improving existing evaluation pipelines
Data Filtering: High-quality training data selection
Quality Control: Automatic quality assessment of generated content
Research Tool: LLM internal mechanism investigation

Overall Assessment: This is a high-quality research paper. The proposed LAGER framework improves the human alignment of automated evaluation by leveraging LLM internal representations; the method is simple and effective, and the experiments are comprehensive. The main limitation is that it applies only to open-source models, but given the rapid development of open-source LLMs, this work still has broad application prospects.