Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Peng, Kumar, Wu et al.

Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.

Basic Information

Paper ID: 2510.14915
Title: Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation
Authors: Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu (Capital One AI Foundations)
Classification: cs.CL (Computational Linguistics)
Publication Date: October 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14915

Abstract

Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses based on retrieved context. However, LLMs frequently produce inconsistent outputs when faced with semantically equivalent inputs, a problem exacerbated by the scarcity of consistency-oriented training data and the limitations of current fine-tuning techniques in enhancing output consistency. This paper proposes a method combining systematic synthetic data generation, triplet loss, and a novel layer-wise model merging approach. By employing consistency-aware weights derived from intermediate layer activations, the method effectively integrates knowledge from specialized models. Experimental results demonstrate that the merged model significantly improves output consistency, achieving a 47.5% improvement in response similarity compared to the baseline.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is the output consistency issue in RAG system generation models, manifested as:

Semantically equivalent queries producing different responses: As shown in the paper's Figure 1, the mere presence or absence of a question mark can lead a RAG system to give an entirely different answer
Practical challenges in industrial deployment: In production environments, the wide range of query variants threatens system reliability

Problem Significance

Reliability requirements: In high-risk domains such as finance and healthcare, inconsistent responses severely impact user trust
Practical impact: The paper provides empirical evidence that generators are more sensitive to query variations than retrievers
System stability: Output inconsistency directly affects RAG system adoption in industrial environments

Limitations of Existing Approaches

Scarcity of training data: Lack of training data specifically targeting consistency
Fine-tuning technique limitations: Traditional fine-tuning methods show limited effectiveness in improving output consistency
Missing evaluation benchmarks: Absence of specialized consistency evaluation benchmarks and datasets

Core Contributions

Query variant classification: Systematically identifies and categorizes query variant types causing response inconsistencies in industrial RAG systems
Consistency measurement framework: Establishes consistency evaluation metrics including Exact Match (EM), Response Similarity (RS), and BERT Similarity (BS)
Layer-wise model merging method: Proposes a novel layer-wise model merging strategy based on consistency-aware weights
Comprehensive solution: Integrates synthetic data generation, triplet loss training, and model merging into a complete methodology

Methodology Details

Task Definition

Given an original query Q and its semantically equivalent variant Q', the objective is to enable the RAG system's generator to produce consistent responses S and S' for both, i.e., maximize semantic similarity between S and S' while maintaining response accuracy.

Model Architecture

1. Synthetic Data Generation Strategy

Based on analysis of production queries, three main variant categories are identified:

How to/do variants:

Reformulations of procedural questions
Systematically generated using regular expression rules

Singular/plural and article variants:

Noun number variations (e.g., "apple" vs "apples")
Article usage variations (e.g., "a", "an", "the")
Random swapping of singular/plural forms and article modifications

Semantic variants:

Variations maintaining core meaning while using different vocabulary
Generated using Llama-3.1-70B-Instruct for paraphrasing
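The rule-based variant categories above can be illustrated with a small sketch. The paper does not publish its regular expressions or swapping rules, so the patterns, the `how_variant` and `article_variant` helpers, and the choice to drop an article (rather than replace it) are all assumptions made for illustration. The LLM-based paraphrasing step is not shown.

```python
import random
import re
from typing import Optional

# Hypothetical rewrite rules; the paper's actual regular expressions are not given.
HOW_RULES = [
    (re.compile(r"^how to\s+(.+?)\??$", re.IGNORECASE), r"how do I \1?"),
    (re.compile(r"^how do i\s+(.+?)\??$", re.IGNORECASE), r"how to \1"),
]

ARTICLES = {"a", "an", "the"}


def how_variant(query: str) -> Optional[str]:
    """Return a 'how to' <-> 'how do I' reformulation, or None if no rule applies."""
    text = query.strip()
    for pattern, template in HOW_RULES:
        if pattern.match(text):
            return pattern.sub(template, text)
    return None


def article_variant(query: str, rng: random.Random) -> str:
    """Drop one randomly chosen article; return the query unchanged if it has none."""
    tokens = query.split()
    positions = [i for i, tok in enumerate(tokens) if tok.lower() in ARTICLES]
    if not positions:
        return query
    drop = rng.choice(positions)
    return " ".join(tok for i, tok in enumerate(tokens) if i != drop)


print(how_variant("How to reset my password"))
print(article_variant("open a savings account", random.Random(0)))
```

Each generated variant would be paired with the original query and its reference answer, so that the model sees several surface forms mapped to the same target response.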

2. Triplet Loss Training

Introduces triplet loss to enhance the model's semantic representation capability:

L(A,P,N) = max(0, d(f(A), f(P)) - d(f(A), f(N)) + α)

Where:

A is the anchor query
P is the positive sample (semantically similar)
N is the negative sample (semantically dissimilar)
f(·) maps a query to its embedding
d(·,·) is a distance between embeddings
α is the margin parameter

The final loss function combines cross-entropy loss and triplet loss:

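L = L_CE + α · L_Triplet

Here α acts as a scalar weight on the triplet term, a different role from the margin α in L(A,P,N).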
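As a worked illustration of the two formulas, the sketch below evaluates the triplet loss on toy 2-D embeddings and then combines it with a cross-entropy value. Euclidean distance, the margin of 1.0, the weight of 0.5, and the function names are assumptions chosen for the example; the summary does not state which distance or hyperparameter values the authors use.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(A, P) - d(A, N) + margin) on precomputed embeddings."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(ce_loss, trip_loss, weight=0.5):
    """Cross-entropy plus a weighted triplet term (weight value is an assumption)."""
    return ce_loss + weight * trip_loss


anchor, positive = [0.0, 0.0], [0.0, 1.0]
far_negative = [3.0, 4.0]    # distance 5 from the anchor: margin already satisfied
near_negative = [0.0, 1.5]   # distance 1.5 from the anchor: margin violated

print(triplet_loss(anchor, positive, far_negative))    # 0.0
print(triplet_loss(anchor, positive, near_negative))   # 0.5
print(combined_loss(2.0, triplet_loss(anchor, positive, near_negative)))  # 2.25
```

The loss is zero once the negative is at least one margin farther from the anchor than the positive, so only "hard" triplets contribute to the gradient.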

3. Layer-wise Model Merging Algorithm

Core concept: Dynamically assigns merging weights based on each layer's contribution to consistency.

Weight computation workflow:

Activation extraction: Extract activations α_k^(l) from each model k at each layer l from development set S_dev
Similarity matrix computation: Calculate similarity matrices Σ_k^(l) of activations
Reference matrix construction: Build reference similarity matrix Σ_r using sentence encoders
Distance calculation: d_k^(l) = |Σ_k^(l) - Σ_r|
Weight normalization: Obtain final weights w_k^(l) through inverse non-linear normalization, so that a smaller distance to the reference yields a larger weight

Merging formula:

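θ_merged^(l) = θ_P^(l) + Σ_k w_k^(l) · Δθ_k^(l)

Here θ_P^(l) denotes the base-model parameters at layer l and Δθ_k^(l) the parameter difference between specialized model k and the base model at that layer. The leading Σ_k is a sum over the specialized models, not a similarity matrix.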
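The workflow for a single layer can be sketched in plain Python. Several details are assumptions rather than the authors' stated choices: cosine similarity for the similarity matrices, the Frobenius norm for the distance, and a softmax over negated distances as the "inverse non-linear normalization". Parameters are represented as flat lists, and all function names are hypothetical.

```python
import math


def cosine_similarity_matrix(rows):
    """Pairwise cosine similarity between row vectors (rows must be non-zero)."""
    norms = [math.sqrt(sum(x * x for x in r)) for r in rows]
    return [
        [sum(a * b for a, b in zip(ri, rj)) / (ni * nj) for rj, nj in zip(rows, norms)]
        for ri, ni in zip(rows, norms)
    ]


def frobenius_distance(m1, m2):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2)))


def layer_weights(distances, temperature=1.0):
    """Assumed normalization: softmax over -distance, so closer models weigh more."""
    scores = [math.exp(-d / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


def merge_layer(theta_base, thetas, weights):
    """theta_base + sum_k w_k * (theta_k - theta_base), elementwise."""
    merged = list(theta_base)
    for theta_k, w in zip(thetas, weights):
        for i, (p, b) in enumerate(zip(theta_k, theta_base)):
            merged[i] += w * (p - b)
    return merged


# Toy layer with two development samples and two specialized models.
reference = cosine_similarity_matrix([[1.0, 0.0], [0.0, 1.0]])  # stand-in for sentence-encoder similarities
acts_a = [[1.0, 0.0], [0.0, 1.0]]  # model A keeps the two samples apart, like the reference
acts_b = [[1.0, 0.0], [1.0, 0.0]]  # model B collapses them together
distances = [frobenius_distance(cosine_similarity_matrix(a), reference) for a in (acts_a, acts_b)]
weights = layer_weights(distances)
merged = merge_layer([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], weights)
print(distances, weights, merged)
```

In this toy case model A reproduces the reference similarity structure exactly, so it receives the larger weight and its parameter difference dominates the merged layer. In a full model the same computation would be repeated independently for every layer l.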

Technical Innovations

Consistency-oriented weight design: Computes layer-wise merging weights from activation similarity, which the authors present as the first approach of its kind
Diverse synthetic data strategy: Query variant generation methods tailored to industrial scenarios
Triplet loss integration: Incorporates triplet loss from metric learning into LLM fine-tuning to enhance semantic representation quality

Experimental Setup

Datasets

Base data: 2,738 representative queries with retrieved contexts, annotated by domain experts
Data split: 1,421 training samples, 1,317 test samples
Synthetic data:
- 150 "how to/do" variant queries
- 1,421 paraphrased queries
- 952 singular/plural and article variant queries
Consistency test set: 1,579 variants (176 "how to/do", 912 paraphrases, 491 singular/plural/article variations)

Evaluation Metrics

Accuracy metrics:

ROUGE-L: Text overlap measurement
BLEU (up to 4-gram): Lexical alignment measurement

Consistency metrics:

Exact Match (EM): Complete string matching
Response Similarity (RS): Semantic equivalence judgment based on ROUGE threshold
BERT Similarity (BS): BERT-based semantic similarity
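The first two consistency metrics can be sketched as follows. The summary only says that RS is a judgment based on a ROUGE threshold, so the use of ROUGE-L F1 over whitespace tokens, the threshold of 0.7, and the function names are assumptions for illustration. BERT Similarity is omitted because it requires a pretrained encoder.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (a simplified tokenizer)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def consistency_scores(pairs, rs_threshold=0.7):
    """pairs: (response to original query, response to variant query) tuples."""
    em = sum(s.strip() == s_var.strip() for s, s_var in pairs) / len(pairs)
    rs = sum(rouge_l_f1(s, s_var) >= rs_threshold for s, s_var in pairs) / len(pairs)
    return {"EM": em, "RS": rs}


pairs = [
    ("Your limit is 500 dollars", "Your limit is 500 dollars"),
    ("Your limit is 500 dollars", "Your limit is 500 dollars today"),
    ("Call support", "Visit a branch"),
]
print(consistency_scores(pairs))
```

On these three toy pairs only the first is an exact match, while the first two clear the ROUGE-L threshold, which shows why RS is the more forgiving of the two metrics.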

Comparison Methods

Baseline models (Llama-3.1-8B-Instruct, Gemma-3-12B-Instruct)
Standard Supervised Fine-Tuning (SFT)
SFT + Triplet Loss
Specialized models for single variant types
Joint training on all data

Implementation Details

Base models: Llama-3.1-8B-Instruct and Gemma-3-12B-Instruct
Training epochs: 2
Triplet construction: Sampling from top-10 and bottom-10 neighbors in semantic feature space
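One way to read the triplet construction step is sketched below: for each anchor, rank the other queries by similarity in an embedding space, then draw the positive from the k nearest and the negative from the k farthest. Cosine similarity, uniform sampling within each group, and the `sample_triplet` helper are assumptions; the summary states only that top-10 and bottom-10 neighbors are used.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def sample_triplet(anchor_idx, embeddings, rng, k=10):
    """Return (anchor, positive, negative) indices for one anchor query.

    The positive is drawn from the k most similar queries and the negative
    from the k least similar. With fewer than 2*k other queries the two
    groups overlap, so k should be reduced for small pools.
    """
    others = [i for i in range(len(embeddings)) if i != anchor_idx]
    ranked = sorted(
        others,
        key=lambda i: cosine(embeddings[anchor_idx], embeddings[i]),
        reverse=True,
    )
    positive = rng.choice(ranked[:k])
    negative = rng.choice(ranked[-k:])
    return anchor_idx, positive, negative


# Toy pool of four query embeddings; k=1 keeps the example deterministic.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(sample_triplet(0, embeddings, random.Random(0), k=1))
```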

Experimental Results

Main Results

Llama-3.1-8B-Instruct Model Results:

Method	ROUGE	BLEU	EM	RS	BS
Baseline	0.5123	0.2928	0.1051	0.2799	0.9246
Merged Model	0.5379	0.3380	0.2521	0.4129	0.9292

Key findings:

Significant consistency improvement: EM improved by 139.87%, RS improved by 47.52%
Maintained accuracy: ROUGE and BLEU remain competitive
Optimal balance: Merged model achieves best performance on all consistency metrics
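The reported percentages are relative improvements over the baseline, which can be checked directly from the table values. The short script below is only an arithmetic check; the dictionary layout and function name are illustrative.

```python
baseline = {"ROUGE": 0.5123, "BLEU": 0.2928, "EM": 0.1051, "RS": 0.2799, "BS": 0.9246}
merged_model = {"ROUGE": 0.5379, "BLEU": 0.3380, "EM": 0.2521, "RS": 0.4129, "BS": 0.9292}


def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100


gains = {m: round(relative_improvement(baseline[m], merged_model[m]), 2) for m in baseline}
print(gains["EM"], gains["RS"])  # 139.87 47.52
```

Both figures match the key findings, and the RS figure is the source of the 47.5% headline number.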

Gemma-3-12B-Instruct Model Results:

Similar improvement trends, validating method generalizability
Larger models show slight advantages in accuracy, but the consistency gains follow the same pattern

Ablation Studies

Component contribution analysis:

Triplet loss effectiveness: Compared to standard SFT, EM improved by 73.4%, RS improved by 26.1%
Specialized model advantages: Single-variant trained models outperform baseline in both accuracy and consistency
Merging strategy effectiveness: Merged model surpasses all single models on consistency metrics

Experimental Findings

Generator vs. retriever: Validates the hypothesis that generators are more sensitive to query variations than retrievers
Specialization vs. generalization: Specialized models outperform joint training in accuracy, but joint training achieves better consistency
Model scale impact: Larger models do not automatically guarantee better consistency

Related Work

Consistency Definition and Evaluation

Theoretical foundation: Based on semantic equivalence definitions by Patwardhan et al.
Evaluation methods: Draws from semantic consistency measurement frameworks by Raj et al.
Automated evaluation: References consistency evaluation tools by Zhao et al.

LLM Consistency Improvement

Prompt engineering: Self-consistency methods by Wang et al.
Synthetic data: Multi-step prompting and synthetic data methods by Raj et al.
Ensemble methods: Logit-based ensemble methods by Wu et al.

Model Merging Techniques

Foundational methods: DARE-TIES merging algorithm
Weight averaging: Limitations of traditional model merging techniques
Parameter space operations: Operations on parameter differences rather than absolute weights

Conclusions and Discussion

Main Conclusions

Problem characterization: Successfully identifies and quantifies consistency issues in industrial RAG systems
Method effectiveness: The proposed layer-wise merging method significantly improves output consistency (47.5% improvement)
Practical value: Provides actionable reliability enhancement solutions for industrial RAG systems

Limitations

Data scope constraints: Experiments are based primarily on industrial data and lack public benchmark testing
Retriever assumptions: Assumes stable retriever results and does not address retrieval inconsistency
Model scope: Validated on only two LLMs; hyperparameter configurations require further exploration

Future Directions

Public benchmark construction: Plans to construct and publicly release consistency evaluation benchmarks
Retrieval consistency: Extension to retriever inconsistency problems
Adaptive merging: Exploration of methods for dynamically adjusting merging strategies
Cross-domain validation: Validation of method effectiveness on more public datasets

In-Depth Evaluation

Strengths

Strong problem targeting: Directly addresses practical pain points in industrial RAG systems
Method novelty: Layer-wise consistency-aware weight design demonstrates innovation
Comprehensive experimentation: Systematic evaluation across multiple models and metrics
High practical value: 47.5% consistency improvement has significant practical implications

Weaknesses

Insufficient theoretical analysis: Lacks deep theoretical explanation for why layer-wise merging improves consistency
Missing computational overhead analysis: No analysis of computational complexity for layer weight calculation and merging
Limited generalization validation: Validated primarily in specific industrial scenarios; cross-domain generalization remains unproven
Benchmark data limitations: Lacks validation on standard public datasets

Impact

Academic contribution: Provides new technical pathways for LLM consistency research
Industrial value: Directly solves critical problems in RAG system deployment
Method reproducibility: Algorithm descriptions are relatively clear with good reproducibility
Research inspiration: Opens new directions for model merging and consistency optimization

Applicable Scenarios

High-reliability requirement scenarios: Domains such as finance, healthcare, and law with extreme consistency requirements
Industrial RAG deployment: Question-answering systems in large-scale production environments
Multi-model integration scenarios: Applications requiring integration of multiple specialized model knowledge
User experience-sensitive applications: Interactive systems with strict response consistency requirements

References

The paper cites multiple important related works, including:

Lewis et al. (2020): Foundational work on RAG framework
Yu et al. (2024), Yadav et al. (2023): DARE-TIES model merging methods
Schroff et al. (2015): Original triplet loss work
Patwardhan et al. (2024): LLM consistency definitions and analysis

Overall Assessment: This is a high-quality applied research paper addressing practical industrial problems, with significant contributions in both methodological innovation and practical value. While there remains room for improvement in theoretical depth and generalization validation, the problem it addresses has important practical significance, and the proposed method demonstrates good operability and effectiveness.