Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Peng, Kumar, Wu et al.
Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.

Basic Information

  • Paper ID: 2510.14915
  • Title: Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation
  • Authors: Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu (Capital One AI Foundations)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14915

Abstract

Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses based on retrieved context. However, LLMs frequently produce inconsistent outputs when faced with semantically equivalent inputs, a problem exacerbated by the scarcity of consistency-oriented training data and the limitations of current fine-tuning techniques in enhancing output consistency. This paper proposes a method combining systematic synthetic data generation, triplet loss, and a novel layer-wise model merging approach. By employing consistency-aware weights derived from intermediate layer activations, the method effectively integrates knowledge from specialized models. Experimental results demonstrate that the merged model significantly improves output consistency, achieving a 47.5% improvement in response similarity compared to the baseline.

Research Background and Motivation

Problem Definition

The core problem addressed in this research is the output consistency issue in RAG system generation models, manifested as:

  1. Semantically equivalent queries producing different responses: As Figure 1 of the paper shows, merely adding or removing a question mark can cause a RAG system to return an entirely different answer
  2. Practical challenges in industrial deployment: In production environments, diverse query variants pose threats to system reliability

Problem Significance

  1. Reliability requirements: In high-risk domains such as finance and healthcare, inconsistent responses severely impact user trust
  2. Practical impact: The paper provides empirical evidence that generators are more sensitive to query variations than retrievers
  3. System stability: Output inconsistency directly affects RAG system adoption in industrial environments

Limitations of Existing Approaches

  1. Scarcity of training data: Lack of training data specifically targeting consistency
  2. Fine-tuning technique limitations: Traditional fine-tuning methods show limited effectiveness in improving output consistency
  3. Missing evaluation benchmarks: Absence of specialized consistency evaluation benchmarks and datasets

Core Contributions

  1. Query variant classification: Systematically identifies and categorizes query variant types causing response inconsistencies in industrial RAG systems
  2. Consistency measurement framework: Establishes consistency evaluation metrics including Exact Match (EM), Response Similarity (RS), and BERT Similarity (BS)
  3. Layer-wise model merging method: Proposes a novel layer-wise model merging strategy based on consistency-aware weights
  4. Comprehensive solution: Integrates synthetic data generation, triplet loss training, and model merging into a complete methodology

Methodology Details

Task Definition

Given an original query Q and its semantically equivalent variant Q', the objective is to enable the RAG system's generator to produce consistent responses S and S' for both, i.e., maximize semantic similarity between S and S' while maintaining response accuracy.

Model Architecture

1. Synthetic Data Generation Strategy

Based on analysis of production queries, three main variant categories are identified:

How to/do variants:

  • Reformulations of procedural questions
  • Systematically generated using regular expression rules
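A minimal sketch of rule-based "how to/do" rewriting. The summary does not give the paper's actual regular expressions, so the two patterns and the function name `how_variant` below are hypothetical and only illustrate the idea of swapping one procedural phrasing for another.

```python
import re
from typing import Optional

# Hypothetical rewrite rules; the paper's real patterns are not listed here.
HOW_TO = re.compile(r"^how to\b", re.IGNORECASE)
HOW_DO_I = re.compile(r"^how do i\b", re.IGNORECASE)


def how_variant(query: str) -> Optional[str]:
    """Return a "how to" <-> "how do I" reformulation, or None if no rule applies."""
    stripped = query.strip()
    if HOW_TO.match(stripped):
        return HOW_TO.sub("How do I", stripped, count=1)
    if HOW_DO_I.match(stripped):
        return HOW_DO_I.sub("How to", stripped, count=1)
    return None


print(how_variant("how to reset my password"))     # How do I reset my password
print(how_variant("How do I reset my password?"))  # How to reset my password?
```

Queries that match neither pattern return `None`, so they can simply be skipped when building the variant set.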

Singular/plural and article variants:

  • Noun number variations (e.g., "apple" vs "apples")
  • Article usage variations (e.g., "a", "an", "the")
  • Random swapping of singular/plural forms and article modifications
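The random swapping could look roughly like the sketch below. The tiny noun lexicon, the 50% swap probability, and the choice to modify articles only by dropping them are all assumptions made for illustration; a real pipeline would need a proper inflection resource and would handle punctuation attached to tokens.

```python
import random

# Hypothetical singular/plural lexicon, for illustration only.
PLURALS = {"card": "cards", "account": "accounts", "fee": "fees"}
SINGULARS = {plural: singular for singular, plural in PLURALS.items()}
ARTICLES = ("a", "an", "the")


def number_article_variant(query: str, rng: random.Random) -> str:
    """Randomly flip noun number and drop articles, token by token."""
    out = []
    for token in query.split():
        low = token.lower()
        if low in PLURALS and rng.random() < 0.5:
            out.append(PLURALS[low])       # singular -> plural
        elif low in SINGULARS and rng.random() < 0.5:
            out.append(SINGULARS[low])     # plural -> singular
        elif low in ARTICLES and rng.random() < 0.5:
            continue                       # drop the article
        else:
            out.append(token)              # keep the token unchanged
    return " ".join(out)


variant = number_article_variant("how do I close the account", random.Random(0))
```

Passing an explicit `random.Random` instance keeps the generated variants reproducible across runs.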

Semantic variants:

  • Variations maintaining core meaning while using different vocabulary
  • Generated using Llama-3.1-70B-Instruct for paraphrasing

2. Triplet Loss Training

Introduces triplet loss to enhance the model's semantic representation capability:

L(A,P,N) = max(0, d(f(A), f(P)) - d(f(A), f(N)) + α)

Where:

  • A is the anchor query
  • P is the positive sample (semantically similar)
  • N is the negative sample (semantically dissimilar)
  • α is the margin parameter
  • f(·) is the model's embedding function and d(·,·) is a distance between embeddings

The final loss function combines cross-entropy loss and triplet loss:

L = L_CE + λ · L_Triplet

Where λ is the weight on the triplet term, a scalar separate from the margin α above.
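The two formulas can be sketched in plain Python as follows. Euclidean distance for d, the default margin of 1.0, and the default weighting of 0.5 are assumptions chosen for illustration; the summary does not state the paper's distance function or hyperparameter values.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Distance d between two embedding vectors (Euclidean is an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """max(0, d(A, P) - d(A, N) + margin), computed on already-embedded inputs."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def combined_loss(ce_loss: float, trip_loss: float, weight: float = 0.5) -> float:
    """Cross-entropy plus the weighted triplet term."""
    return ce_loss + weight * trip_loss


# The negative is far from the anchor, so the margin is already satisfied.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # 1 - 5 + 1 < 0 -> 0.0
# The negative is nearly as close as the positive, so the loss is positive.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])   # 1 - 1.5 + 1 = 0.5
total = combined_loss(ce_loss=2.0, trip_loss=hard)        # 2.0 + 0.5 * 0.5 = 2.25
```

The loss is zero once the negative is at least one margin farther from the anchor than the positive, so only "hard" triplets contribute gradient.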

3. Layer-wise Model Merging Algorithm

Core concept: Dynamically assigns merging weights based on each layer's contribution to consistency.

Weight computation workflow:

  1. Activation extraction: Extract activations α_k^(l) from each model k at each layer l from development set S_dev
  2. Similarity matrix computation: Calculate similarity matrices Σ_k^(l) of activations
  3. Reference matrix construction: Build reference similarity matrix Σ_r using sentence encoders
  4. Distance calculation: d_k^(l) = |Σ_k^(l) - Σ_r|
  5. Weight normalization: Obtain final weights w_k^(l) through inverse non-linear normalization
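The workflow above can be sketched for a single layer as follows. Several details are assumptions, since the summary does not specify them: cosine similarity for the similarity matrices, the Frobenius norm for the distance, and a softmax over negative distances as one possible "inverse non-linear normalization". The toy vectors stand in for per-sample activations and sentence-encoder embeddings.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(rows: Sequence[Vector]) -> Matrix:
    """Pairwise cosine similarity between per-sample vectors (steps 2 and 3)."""
    return [[cosine(u, v) for v in rows] for u in rows]


def matrix_distance(a: Matrix, b: Matrix) -> float:
    """Frobenius norm of a - b, one possible reading of step 4."""
    return math.sqrt(sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))


def layer_weights(distances: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over negative distances: smaller distance -> larger weight (assumed form of step 5)."""
    scores = [math.exp(-d / temperature) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


# Toy development set of three samples, two hypothetical models, one layer.
reference = similarity_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
model_a = similarity_matrix([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])
model_b = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
weights = layer_weights([matrix_distance(model_a, reference),
                         matrix_distance(model_b, reference)])
```

In this toy case the first model's similarity structure is closer to the reference, so it receives the larger weight at this layer. Repeating the computation per layer yields one weight vector per layer.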

Merging formula:

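θ_merged^(l) = θ_P^(l) + Σ_k w_k^(l) · Δθ_k^(l)

Where θ_P^(l) denotes the layer-l parameters of the base model, Δθ_k^(l) is the parameter difference between specialized model k and the base model at that layer, and w_k^(l) is the consistency-aware weight of model k at layer l.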
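A minimal sketch of the merging formula on flattened per-layer parameter lists. It assumes that Δθ_k^(l) is the difference between specialized model k and the base model at layer l, and it omits any DARE-TIES-style pruning or sign resolution. The layer name and the weights in the usage lines are made up.

```python
from typing import Dict, List

Layer = List[float]  # flattened parameters of one layer


def merge_layers(base: Dict[str, Layer],
                 experts: List[Dict[str, Layer]],
                 weights: Dict[str, List[float]]) -> Dict[str, Layer]:
    """Per layer: base + sum_k w_k * (expert_k - base)."""
    merged = {}
    for name, base_params in base.items():
        layer = list(base_params)  # start from the base parameters
        for k, expert in enumerate(experts):
            w = weights[name][k]   # weight of expert k at this layer
            for i, (p_expert, p_base) in enumerate(zip(expert[name], base_params)):
                layer[i] += w * (p_expert - p_base)
        merged[name] = layer
    return merged


base = {"layer0": [1.0, 2.0]}
experts = [{"layer0": [2.0, 2.0]}, {"layer0": [1.0, 4.0]}]
merged = merge_layers(base, experts, {"layer0": [0.5, 0.25]})
print(merged)  # {'layer0': [1.5, 2.5]}
```

Because the weights are indexed by layer, each layer can lean on a different specialized model, which is what distinguishes this scheme from a single global merging coefficient.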

Technical Innovations

  1. Consistency-oriented weight design: First to propose layer-wise model merging weight computation based on activation similarity
  2. Diverse synthetic data strategy: Query variant generation methods tailored to industrial scenarios
  3. Triplet loss integration: Incorporates triplet loss from metric learning into LLM fine-tuning to enhance semantic representation quality

Experimental Setup

Datasets

  • Base data: 2,738 representative queries with retrieved contexts, annotated by domain experts
  • Data split: 1,421 training samples, 1,317 test samples
  • Synthetic data:
    • 150 "how to/do" variant queries
    • 1,421 paraphrased queries
    • 952 singular/plural and article variant queries
  • Consistency test set: 1,579 variants (176 "how to/do", 912 paraphrases, 491 singular/plural/article variations)

Evaluation Metrics

Accuracy metrics:

  • ROUGE-L: Text overlap measurement
  • BLEU (up to 4-gram): Lexical alignment measurement

Consistency metrics:

  • Exact Match (EM): Complete string matching
  • Response Similarity (RS): Semantic equivalence judgment based on ROUGE threshold
  • BERT Similarity (BS): BERT-based semantic similarity
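A rough sketch of how EM and RS could be computed over pairs of responses (one for the original query, one for its variant). The ROUGE-L F1 implementation on whitespace tokens and the 0.7 threshold are assumptions; the summary only says that RS uses a ROUGE threshold. BS is omitted because it requires a BERT model.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(a: str, b: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (a simplification)."""
    ta, tb = a.lower().split(), b.lower().split()
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)


def consistency_scores(pairs: List[Tuple[str, str]], threshold: float = 0.7) -> Tuple[float, float]:
    """Return (EM rate, RS rate) over (response, variant response) pairs."""
    em = sum(s == s_var for s, s_var in pairs) / len(pairs)
    rs = sum(rouge_l_f1(s, s_var) >= threshold for s, s_var in pairs) / len(pairs)
    return em, rs


pairs = [
    ("pay online today", "pay online today"),                 # identical
    ("you can pay online today", "you can pay online now"),   # near-identical
    ("call support", "visit a branch"),                       # unrelated
]
em, rs = consistency_scores(pairs)  # EM = 1/3, RS = 2/3
```

EM is the stricter of the two: the second pair differs by one word, so it fails EM but still counts as similar under the assumed threshold.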

Comparison Methods

  • Baseline models (Llama-3.1-8B-Instruct, Gemma-3-12B-Instruct)
  • Standard Supervised Fine-Tuning (SFT)
  • SFT + Triplet Loss
  • Specialized models for single variant types
  • Joint training on all data

Implementation Details

  • Base models: Llama-3.1-8B-Instruct and Gemma-3-12B-Instruct
  • Training epochs: 2
  • Triplet construction: Sampling from top-10 and bottom-10 neighbors in semantic feature space
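One way to read the triplet construction step: for each anchor, rank the other samples by similarity in the embedding space, draw the positive from the k nearest and the negative from the k farthest. The sketch below assumes cosine similarity and uniform sampling within each group, neither of which is confirmed by the summary; `sample_triplet` is a hypothetical helper.

```python
import math
import random
from typing import Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_triplet(anchor_idx: int,
                   embeddings: Sequence[Sequence[float]],
                   rng: random.Random,
                   k: int = 10) -> Tuple[int, int, int]:
    """Return (anchor, positive, negative) indices using top-k / bottom-k neighbors."""
    others = [i for i in range(len(embeddings)) if i != anchor_idx]
    ranked = sorted(others,
                    key=lambda i: cosine(embeddings[anchor_idx], embeddings[i]),
                    reverse=True)              # most similar first
    positive = rng.choice(ranked[:k])          # one of the k nearest neighbors
    negative = rng.choice(ranked[-k:])         # one of the k farthest neighbors
    return anchor_idx, positive, negative


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
triplet = sample_triplet(0, embeddings, random.Random(0), k=1)
print(triplet)  # (0, 1, 3)
```

With fewer than 2k candidates the top and bottom groups overlap, so a real implementation would need to guard against sampling the same item as both positive and negative.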

Experimental Results

Main Results

Llama-3.1-8B-Instruct Model Results:

Method       | ROUGE  | BLEU   | EM     | RS     | BS
Baseline     | 0.5123 | 0.2928 | 0.1051 | 0.2799 | 0.9246
Merged Model | 0.5379 | 0.3380 | 0.2521 | 0.4129 | 0.9292

Key findings:

  • Significant consistency improvement: EM improved by 139.87% and RS by 47.52% relative to the baseline
  • Accuracy preserved: ROUGE (0.5123 to 0.5379) and BLEU (0.2928 to 0.3380) both rise relative to the baseline
  • Optimal balance: Merged model achieves best performance on all consistency metrics
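The reported percentages are relative improvements over the baseline and can be reproduced from the Baseline and Merged Model rows above. The helper name below is only illustrative.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


em_gain = relative_improvement(0.1051, 0.2521)
rs_gain = relative_improvement(0.2799, 0.4129)
print(f"EM: {em_gain:.2f}%  RS: {rs_gain:.2f}%")  # EM: 139.87%  RS: 47.52%
```

The absolute gains are smaller than the percentages suggest: EM rises by about 0.147 and RS by about 0.133, which is worth keeping in mind when reading the headline 47.5% figure.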

Gemma-3-12B-Instruct Model Results:

  • Similar improvement trends, validating method generalizability
  • The larger model shows a slight advantage in accuracy, but the pattern of consistency gains is the same

Ablation Studies

Component contribution analysis:

  1. Triplet loss effectiveness: Compared to standard SFT, adding triplet loss improves EM by 73.4% and RS by 26.1%
  2. Specialized model advantages: Single-variant trained models outperform baseline in both accuracy and consistency
  3. Merging strategy effectiveness: Merged model surpasses all single models on consistency metrics

Experimental Findings

  1. Generator vs. retriever: Validates the hypothesis that generators are more sensitive to query variations than retrievers
  2. Specialization vs. generalization: Specialized models outperform joint training in accuracy, but joint training achieves better consistency
  3. Model scale impact: Larger models do not automatically guarantee better consistency

Related Work

Consistency Definition and Evaluation

  • Theoretical foundation: Based on semantic equivalence definitions by Patwardhan et al.
  • Evaluation methods: Draws from semantic consistency measurement frameworks by Raj et al.
  • Automated evaluation: References consistency evaluation tools by Zhao et al.

LLM Consistency Improvement

  • Prompt engineering: Self-consistency methods by Wang et al.
  • Synthetic data: Multi-step prompting and synthetic data methods by Raj et al.
  • Ensemble methods: Logit-based ensemble methods by Wu et al.

Model Merging Techniques

  • Foundational methods: DARE-TIES merging algorithm
  • Weight averaging: Limitations of traditional model merging techniques
  • Parameter space operations: Operations on parameter differences rather than absolute weights

Conclusions and Discussion

Main Conclusions

  1. Problem characterization: Successfully identifies and quantifies consistency issues in industrial RAG systems
  2. Method effectiveness: The proposed layer-wise merging method significantly improves output consistency (47.5% improvement in response similarity over the baseline)
  3. Practical value: Provides actionable reliability enhancement solutions for industrial RAG systems

Limitations

  1. Data scope constraints: Experiments are based primarily on industrial data and lack public benchmark testing
  2. Retriever assumptions: The method assumes stable retriever results and does not address retrieval inconsistency
  3. Model scope: Validated on only two LLMs; hyperparameter configurations require further exploration

Future Directions

  1. Public benchmark construction: Plans to construct and publicly release consistency evaluation benchmarks
  2. Retrieval consistency: Extension to retriever inconsistency problems
  3. Adaptive merging: Exploration of methods for dynamically adjusting merging strategies
  4. Cross-domain validation: Validation of method effectiveness on more public datasets

In-Depth Evaluation

Strengths

  1. Strong problem targeting: Directly addresses practical pain points in industrial RAG systems
  2. Method novelty: Layer-wise consistency-aware weight design demonstrates innovation
  3. Comprehensive experimentation: Systematic evaluation across multiple models and metrics
  4. High practical value: The 47.5% gain in response similarity has significant practical implications

Weaknesses

  1. Insufficient theoretical analysis: Lacks deep theoretical explanation for why layer-wise merging improves consistency
  2. Missing computational overhead analysis: No analysis of computational complexity for layer weight calculation and merging
  3. Limited generalization validation: Validated primarily in specific industrial scenarios; cross-domain generalization remains unproven
  4. Benchmark data limitations: Lacks validation on standard public datasets

Impact

  1. Academic contribution: Provides new technical pathways for LLM consistency research
  2. Industrial value: Directly solves critical problems in RAG system deployment
  3. Method reproducibility: Algorithm descriptions are relatively clear with good reproducibility
  4. Research inspiration: Opens new directions for model merging and consistency optimization

Applicable Scenarios

  1. High-reliability scenarios: Domains such as finance, healthcare, and law with stringent consistency requirements
  2. Industrial RAG deployment: Question-answering systems in large-scale production environments
  3. Multi-model integration scenarios: Applications that need to combine knowledge from multiple specialized models
  4. User experience-sensitive applications: Interactive systems with strict response consistency requirements

References

The paper cites multiple important related works, including:

  • Lewis et al. (2020): Foundational work on RAG framework
  • Yu et al. (2024), Yadav et al. (2023): DARE-TIES model merging methods
  • Schroff et al. (2015): Original triplet loss work
  • Patwardhan et al. (2024): LLM consistency definitions and analysis

Overall Assessment: This is a high-quality applied research paper addressing practical industrial problems, with significant contributions in both methodological innovation and practical value. While there remains room for improvement in theoretical depth and generalization validation, the problem it addresses has important practical significance, and the proposed method demonstrates good operability and effectiveness.