Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

Lai, Zheng, Cheng et al.
The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using LLMs, a paradigm known as "LLM-as-a-judge". However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. Previous studies mainly optimize based on shallow outputs, overlooking rich cross-layer representations. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a post-hoc, plug-and-play framework for improving the alignment of LLM-as-a-Judge point-wise evaluations with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, while keeping the LLM backbone frozen and ensuring no impact on the inference process. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the generalization of LAGER.

Basic Information

  • Paper ID: 2508.03550
  • Title: Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations
  • Authors: Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
  • Category: cs.CL (Computation and Language)
  • Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2508.03550

Abstract

As the scale of evaluation tasks continues to expand, the "LLM-as-a-Judge" paradigm for automated evaluation using large language models has been widely adopted. However, improving alignment with human preferences without employing complex prompting or fine-tuning remains challenging. Previous research has primarily optimized based on shallow outputs, overlooking rich cross-layer representations. Motivated by preliminary findings that the semantic and task-relevant representations encoded in middle-to-upper layers often align better with human judgments than those of the final layer, this work proposes LAGER, a post-hoc, plug-and-play framework that improves the alignment of LLM-as-a-Judge point-wise evaluation with human scores by leveraging internal representations. LAGER produces fine-grained judgment scores by aggregating score-token logits across layers and computing the expected score from a softmax-based distribution, while keeping the LLM backbone frozen and leaving the inference process unchanged.

Research Background and Motivation

Problem Definition

  1. Core Problem: Existing LLM-as-a-Judge methods primarily rely on final layer outputs for evaluation, neglecting rich cross-layer representation information within the model, resulting in suboptimal alignment with human judgments.
  2. Significance:
    • LLM-as-a-Judge has broad applications in model evaluation, data synthesis, and model enhancement scenarios
    • Improving evaluation accuracy and consistency with human preferences is critical for AI system reliability
    • Large-scale evaluation tasks require efficient and accurate automated assessment methods
  3. Limitations of Existing Methods:
    • Prompt-based methods require complex reasoning steps, increasing computational costs
    • Fine-tuning methods face generalization issues with limited adaptability
    • Traditional methods rely solely on final layer outputs, overlooking semantic information in intermediate layers
  4. Research Motivation:
    • Preliminary studies reveal that middle-to-upper layers (approximately layers 20-30) often correlate more strongly with human scores than final layers
    • Different layers encode different types of information: lower layers focus on lexical information, middle-to-upper layers focus on semantic and global information
    • A lightweight, plug-and-play method is needed to leverage these internal representations

Core Contributions

  1. Proposes LAGER Framework: A post-hoc, plug-and-play framework that improves LLM-as-a-Judge alignment with human scores by aggregating cross-layer internal representations
  2. Discovers Advantages of Middle Layers: Empirically demonstrates that middle-to-upper layer representations align better with human judgments than final layers
  3. Achieves Significant Performance Improvements: Improves Spearman correlation by up to 7.5% over the best baseline across three standard alignment benchmarks: Flask, HelpSteer, and BIGGen
  4. Demonstrates Generalization Capability: Shows good generalization in downstream applications such as instruction data selection and emotional understanding
  5. Provides Lightweight Solution: Requires training only L+1 weight parameters (one per layer, counting the embedding output as layer 0) while keeping the model backbone frozen

Methodology Details

Task Definition

  • Input: Evaluation task description, user instruction, response to be evaluated, scoring criteria
  • Output: Fine-grained continuous scores (rather than discrete integer scores)
  • Constraints: Keep LLM backbone parameters frozen without affecting the original inference process

Model Architecture

1. Basic Framework

For decoder models, traditional methods use only the final layer hidden states:

h^(L)_n = f^(L)_decoder ∘ ··· ∘ f^(1)_decoder ∘ f_embd(x_{<n})

2. LAGER Core Mechanism

Cross-layer Logits Aggregation:

ẑ = Σ(i=0 to L) w_i * ẑ_i = Σ(i=0 to L) w_i * h^(i)_n * W_unembd

Candidate Score Extraction:

ẑ[M] = Σ(i=0 to L) w_i * [h^(i)_n * W_unembd]_M

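where S is the set of candidate scores, M = {Tokenize(s) | s ∈ S} is the set of candidate score tokens, h^(i)_n is the layer-i hidden state at the scoring position (i = 0 being the embedding output), W_unembd is the unembedding matrix, and w_i is the weight of layer i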
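
The aggregation and candidate-extraction steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `unembed` and `aggregate_score_logits`, the toy hidden states, the toy unembedding matrix, and the token ids are all hypothetical, and any normalization a real model applies before unembedding is omitted.

```python
def unembed(hidden, w_unembd):
    """Project one hidden state (length d) to vocabulary logits (length V)."""
    vocab_size = len(w_unembd[0])
    return [sum(h * w_unembd[k][v] for k, h in enumerate(hidden))
            for v in range(vocab_size)]


def aggregate_score_logits(hidden_states, w_unembd, weights, score_token_ids):
    """Weighted sum over layers of the logits at the candidate score tokens.

    hidden_states:   L+1 vectors, one per layer (index 0 = embedding output),
                     all taken at the position where the score is predicted.
    weights:         L+1 layer weights w_i.
    score_token_ids: hypothetical vocabulary ids of the score tokens (the set M).
    """
    agg = [0.0] * len(score_token_ids)
    for w_i, h_i in zip(weights, hidden_states):
        logits_i = unembed(h_i, w_unembd)
        for j, tok in enumerate(score_token_ids):
            agg[j] += w_i * logits_i[tok]
    return agg


# Toy setup (made up): 3 layers (L = 2), hidden size 2, vocabulary of 4 tokens.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_unembd = [[1.0, 2.0, 3.0, 4.0],
            [0.5, 0.5, 0.5, 0.5]]
uniform = [1.0 / 3] * 3          # non-tuned setting: w_i = 1 / (L + 1)
score_logits = aggregate_score_logits(hidden_states, w_unembd, uniform, [1, 3])
```

Only the columns of the unembedding matrix that correspond to score tokens are needed for the final score, so a practical variant could slice those columns first instead of computing full vocabulary logits.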

Probability Distribution Computation:

P(s) = exp(ẑ[s]) / Σ(s'∈S) exp(ẑ[s'])

Expected Score:

s* = E_s~P(s)[s] = Σ(s∈S) s × P(s)
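
The last two formulas can be illustrated with a small worked example. The sketch below assumes a 1-5 scoring scale and uses made-up logits; `expected_score` is a hypothetical helper name. It contrasts the expected score with the argmax score that a discrete judge would output.

```python
import math


def expected_score(score_logits, scores):
    """Softmax over candidate-score logits, then the probability-weighted mean."""
    m = max(score_logits)                      # subtract the max for stability
    exps = [math.exp(z - m) for z in score_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(s * p for s, p in zip(scores, probs)), probs


scores = [1, 2, 3, 4, 5]                       # assumed candidate score set S
logits = [0.0, 0.0, 1.0, 2.0, 2.0]             # made-up aggregated logits
s_star, probs = expected_score(logits, scores)

# For comparison: the discrete score a plain argmax would return
# (ties resolve to the first maximum here).
argmax_score = scores[probs.index(max(probs))]
```

In this toy case scores 4 and 5 are tied, so argmax has to pick one of them arbitrarily, while the expected score is a continuous value that reflects the whole distribution.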

3. Weight Training Strategy

Two weight settings are provided:

  • Non-tuned Version: Uniform aggregation, w_i = 1/(L+1) for every layer i
  • Tuned Version: Layer weights w_i learned with a combined loss function

Loss Function:

L_Final = α·L_CE + (1-α)·L_MAE

where the cross-entropy term L_CE treats the human scores as discrete labels, the MAE term L_MAE handles them as continuous scores, and α weights the two terms
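
A per-example version of this combined loss might look like the sketch below. It assumes that the human score is an integer from the candidate set, that the MAE term compares the expected score with the human score, and that α = 0.5; none of these details are confirmed by the summary above, and `combined_loss` is a hypothetical name.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(score_logits, scores, human_score, alpha=0.5):
    """alpha * cross-entropy + (1 - alpha) * absolute error for one example."""
    probs = softmax(score_logits)
    # Cross-entropy with the human score treated as a discrete class label.
    ce = -math.log(probs[scores.index(human_score)])
    # Absolute error between the expected score and the human score.
    expected = sum(s * p for s, p in zip(scores, probs))
    mae = abs(expected - human_score)
    return alpha * ce + (1 - alpha) * mae


scores = [1, 2, 3, 4, 5]
# Made-up logits: one judge is confident in the right score, one in a wrong one.
confident_right = combined_loss([0.0, 0.0, 0.0, 0.0, 5.0], scores, human_score=5)
confident_wrong = combined_loss([5.0, 0.0, 0.0, 0.0, 0.0], scores, human_score=5)
```

Both terms penalize the confidently wrong prediction, so `confident_wrong` is much larger than `confident_right`. In training, only the L+1 layer weights that produce `score_logits` would receive gradients.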

Technical Innovations

  1. Cross-layer Information Fusion: First systematic utilization of internal representations from all Transformer layers for evaluation
  2. Expected Score Mechanism: Computing continuous scores through probability distributions rather than simple argmax operations
  3. Plug-and-Play Design: No modification of original model parameters and inference process, directly applicable to existing models
  4. Lightweight Training: Requires training only L+1 weight parameters with minimal training cost

Experimental Setup

Datasets

  1. Flask: 2,001 entries with 12 scoring dimensions (conciseness, insightfulness, readability, etc.)
  2. HelpSteer: 8.95k data points evaluated on 5 criteria (helpfulness, correctness, coherence, etc.)
  3. BiGGen Bench: Comprehensive evaluation benchmark covering 77 tasks assessing 9 generative capabilities

Evaluation Metrics

  • Primary Metric: Spearman correlation coefficient (suitable for ordinal data, robust to outliers)
  • Auxiliary Metric: Pearson correlation coefficient
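
To make the difference between the two metrics concrete, here is a small standard-library sketch. Spearman correlation is computed as the Pearson correlation of rank-transformed values, with tied values sharing their average rank. The data are made up for illustration.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


human = [1, 2, 3, 4, 5]
judge = [1.2, 1.9, 2.5, 3.1, 9.0]   # same ordering as `human`, one outlier
rho = spearman(human, judge)
r = pearson(human, judge)
```

Because the judge scores preserve the human ordering, `rho` is 1 even though the outlier pulls `r` below 1, which is the robustness property noted above.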

Comparison Methods

  1. Non-training Baselines: GPTScore, Vanilla Score (VScore), Expectation Score (E-Score)
  2. API Models: GPT-4o-mini
  3. Fine-tuned Models: TIGERScore-7B, Prometheus2-7B (for reference only)

Implementation Details

  • Models: 6 backbone models of different scales (7B-70B)
  • Decoding Strategy: Greedy decoding for stability
  • Evaluation Settings: Both direct evaluation and reasoning evaluation settings
  • Weight Training: Using 1000 HelpSteer samples, Adam optimizer, learning rate 0.01

Experimental Results

Main Results

Significant Performance Improvements:

  • LAGER outperforms all non-training baselines on all benchmarks
  • Average Spearman correlation improvement: 4.5% for non-tuned version, higher for tuned version
  • Maximum improvement of 7.5% on certain models

Key Findings:

  1. Cross-model Consistency: Improvements achieved across 6 models of different scales
  2. Competitive with API Models: Enables open-source models to reach GPT-4o-mini level
  3. Surpasses Fine-tuning Methods: InternLM3-8B and LLaMA3.1-8B outperform same-scale Prometheus2-7B

Ablation Studies

Component Importance Ranking:

  1. Expected Score > Maximum Score (+0.17 improvement)
  2. Logits Aggregation > Probability Aggregation (+0.07 improvement)
  3. Weight Tuning provides +0.10 improvement
  4. Multi-layer integration effects vary across models
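
The distinction in item 2 can be sketched as follows, assuming that "probability aggregation" means applying softmax per layer and then averaging the resulting distributions, while "logits aggregation" averages the logits first and applies a single softmax. The function names and the numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected(scores, probs):
    return sum(s * p for s, p in zip(scores, probs))


def logits_aggregation(layer_logits, weights, scores):
    """Average the logits across layers first, then apply one softmax."""
    agg = [sum(w * layer[j] for w, layer in zip(weights, layer_logits))
           for j in range(len(scores))]
    return expected(scores, softmax(agg))


def probability_aggregation(layer_logits, weights, scores):
    """Apply softmax per layer, then average the resulting distributions."""
    per_layer = [softmax(layer) for layer in layer_logits]
    agg = [sum(w * p[j] for w, p in zip(weights, per_layer))
           for j in range(len(scores))]
    return expected(scores, agg)


scores = [1, 2, 3]
layer_logits = [[4.0, 0.0, 0.0],    # one layer strongly prefers score 1
                [0.0, 0.0, 1.0]]    # another mildly prefers score 3
weights = [0.5, 0.5]
via_logits = logits_aggregation(layer_logits, weights, scores)
via_probs = probability_aggregation(layer_logits, weights, scores)
```

The two orders of operation give different expected scores on the same inputs (roughly 1.43 versus 1.71 here), because averaging logits lets a confident layer dominate more than averaging probabilities does.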

Cross-scale Analysis

Scale Effects:

  • Validation on Qwen2.5 series (0.5B-72B)
  • LAGER improvements amplify with model scale
  • Best performance achieved on 72B model (Flask: 0.658 Spearman)

Case Analysis

Distribution Alignment:

  • LAGER-generated score distributions align better with human annotations
  • KL divergence reduced from 0.312 to 0.087
  • MSE reduced from 0.112 to 0.060
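
For reference, the two distribution-level measures can be computed as in the sketch below. The summary does not state the direction of the KL divergence or exactly which quantities the MSE is taken over, so KL(human || judge) over score histograms and MSE over paired scores are assumptions here, and all numbers are made up.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same score bins."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def mse(pred, target):
    """Mean squared error between paired predicted and reference scores."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


# Made-up score histograms over the bins 1..5, each normalized to sum to 1.
human_dist = [0.10, 0.20, 0.30, 0.25, 0.15]
judge_dist = [0.05, 0.10, 0.20, 0.35, 0.30]

kl_same = kl_divergence(human_dist, human_dist)   # identical distributions
kl_diff = kl_divergence(human_dist, judge_dist)   # mismatched distributions
err = mse([3.2, 4.1], [3.0, 4.0])
```

A lower KL divergence means the judge's score histogram has a shape closer to the human one; it is zero only when the two distributions coincide.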

Related Work

Text Generation Evaluation

  • Traditional Metrics: BLEU, ROUGE and other statistical methods have obvious limitations
  • Embedding Methods: BERTScore, BARTScore require reference answers
  • GPTScore: Based on generation probability but ignores semantic quality

LLM-as-a-Judge

  1. Point-wise Evaluation: Independent evaluation of individual responses
  2. Pairwise Comparison: Direct comparison of two responses
  3. Listwise Ranking: Ranking multiple responses

Method Classification:

  • Prompt-based: Improving judgments through reasoning steps
  • Fine-tuning-based: Training specialized evaluation models

Conclusions and Discussion

Main Conclusions

  1. Middle Layer Advantages: Middle-to-upper layer representations align better with human judgments than the final layer
  2. Lightweight Effectiveness: Training minimal parameters achieves significant performance improvements
  3. Reasoning Not Necessary: Matches or outperforms reasoning-based methods without explicit reasoning steps
  4. Good Generalization: Transfers to downstream applications such as data selection and emotional understanding

Limitations

  1. Open-source Model Constraints: Requires access to model internal states, cannot be applied to closed-source API models
  2. Computational Overhead: Requires reading out the hidden states of all layers and projecting each through the unembedding matrix
  3. Weight Universality: Different model families may require retraining weights

Future Directions

  1. Theoretical Analysis: Deeper understanding of semantic properties of different layer representations
  2. Efficiency Optimization: Methods to reduce computational overhead
  3. Adaptive Weights: Mechanisms for adaptively adjusting weights across different layers

In-depth Evaluation

Strengths

  1. Strong Novelty: First systematic utilization of Transformer internal representations for evaluation
  2. High Practical Value: Plug-and-play design, easy to deploy
  3. Comprehensive Experiments: Full evaluation across multiple benchmarks and model scales
  4. Theoretical Support: Provides theoretical insights through inter-layer similarity analysis

Weaknesses

  1. Limited Applicability: Only applicable to open-source models
  2. Insufficient Mechanism Explanation: Lacks deep theoretical explanation for why middle layers perform better
  3. Computational Cost: Although the trainable parameters are minimal, inference must read out and unembed the hidden states of every layer

Impact

  1. Academic Contribution: Provides new perspective for LLM internal representation research
  2. Practical Value: Provides effective tools for open-source model evaluation
  3. Reproducibility: Code is publicly available, experiments are reproducible

Applicable Scenarios

  1. Model Evaluation: Improving existing evaluation pipelines
  2. Data Filtering: High-quality training data selection
  3. Quality Control: Automatic quality assessment of generated content
  4. Research Tool: LLM internal mechanism investigation

References

This paper cites extensive related work, including:

  • LLM-as-a-Judge research (Lin & Chen, 2023; Liu et al., 2023, etc.)
  • Internal representation studies (Wang et al., 2020; Yang et al., 2022, etc.)
  • Evaluation benchmarks and methods (Ye et al., 2024; Kim et al., 2024, etc.)

Overall Assessment: This is a high-quality research paper. The innovative LAGER framework significantly improves the human alignment of automated evaluation by leveraging LLM internal representations. The method is simple and effective, the experiments are comprehensive, and the work has both academic and practical value. The main limitation is that it applies only to open-source models, but given the rapid development of open-source LLMs, it still has broad application prospects.