Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.

Basic Information

  • Paper ID: 2510.09423
  • Title: Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
  • Author: Yankun Han (University of Florida)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09423

Abstract

Weight initialization controls signal propagation and gradient flow at the beginning of training. This paper provides a theoretically grounded and empirically validated study covering two domains: compact ReLU multilayer perceptrons and GPT-2-style Transformers. First, through logarithmic sweeps over initial standard deviations, the paper maps regions of vanishing and exploding gradients, identifying a broad stability band with standard deviations between 1e-2 and 1e-1. Second, controlled comparisons demonstrate that under ReLU activation, Kaiming (fan-in) initialization converges faster and more stably than Xavier initialization, consistent with variance preservation theory. Third, in a 12-layer GPT-2-style model built from scratch, the paper tracks variance changes in Q/K/V weight matrices across layers during pretraining, observing depth-dependent equilibrium phenomena: shallow layers expand rapidly while deeper layers change more gradually.

Research Background and Motivation

Problem Definition

The core problems addressed by this research concern the impact of weight initialization on training stability and convergence in deep neural networks and large language models. Specifically:

  1. Initialization Scale Sensitivity: How different initial standard deviations affect training stability
  2. Activation Function Specificity: Whether activation functions like ReLU and GELU require specific initialization strategies
  3. Variance Dynamics in Modern Transformers: Whether variance stabilization persists in large Transformer models

Significance

Weight initialization is a critical factor for successful deep learning training. Improper initialization leads to:

  • Vanishing Gradients: Signal attenuation across layers in deep networks
  • Exploding Gradients: Exponential signal growth during propagation
  • Training Instability: Oscillations and divergence in the optimization process

Limitations of Existing Methods

Although classical initialization methods (LeCun, Xavier/Glorot, He/Kaiming) have intuitive variance preservation foundations, they suffer from the following issues in practical applications:

  1. Sensitivity to deviations from ideal scales has not been adequately quantified
  2. The mechanisms of specific activation functions (e.g., ReLU, GELU) remain unclear
  3. Systematic studies of performance in large Transformers are lacking

Core Contributions

  1. Unified Variance Analysis Framework: Derives forward and backward variance propagation conditions for common activation functions (ReLU, GELU), explaining how fan-in scaling preserves signal amplitude and where the factor of 2 in He/Kaiming initialization for ReLU comes from
  2. Quantification of Scale Sensitivity: Through logarithmic sweeps over 25 standard deviation values, maps vanishing/exploding gradient regions and identifies a stable training band σ ∈ [10⁻², 10⁻¹]
  3. Activation-Aware Initialization Verification: In controlled ReLU MLP training, confirms that Kaiming normal (fan-in) converges faster with smaller loss variance compared to Xavier normal
  4. Transformer Variance Dynamics Analysis: In a 12-layer GPT-2-style model built from scratch, discovers clear depth-dependent patterns: shallow layer weight standard deviations expand rapidly while deeper layers change more gradually, ultimately stabilizing in a narrow variance band

Methodology Details

Theoretical Framework

Forward Propagation Variance Analysis

For linear mapping:

Var[z_l] = n_in σ²_W Var[x_{l-1}]

After nonlinear activation:

Var[x_l] ≈ c_φ n_in σ²_W Var[x_{l-1}]

where c_φ = E[φ(z)²]/Var[z] is an activation function-dependent constant.

To prevent activation values from vanishing or exploding, choose σ²_W ≈ 1/(c_φ n_in):

  • ReLU: c_φ ≈ 1/2, thus σ²_W ≈ 2/n_in (He/Kaiming)
  • GELU: c_φ ≈ 0.45-0.5, slightly smaller than ReLU
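The constant c_φ can be checked numerically. The following is a minimal sketch, assuming pre-activations z ~ N(0, 1) and the exact (erf-based) GELU; the helper names `estimate_c_phi` and `variance_preserving_std` are illustrative and do not come from the paper. The GELU estimate depends on these assumptions and on the GELU variant used, so it need not match the quoted range exactly.

```python
import math
import random


def relu(z: float) -> float:
    return max(0.0, z)


def gelu(z: float) -> float:
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def estimate_c_phi(phi, n_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of c_phi = E[phi(z)^2] / Var[z], assuming z ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += phi(rng.gauss(0.0, 1.0)) ** 2
    return total / n_samples  # Var[z] = 1 by construction


def variance_preserving_std(c_phi: float, n_in: int) -> float:
    """Weight std that solves c_phi * n_in * sigma_W^2 = 1."""
    return math.sqrt(1.0 / (c_phi * n_in))


c_relu = estimate_c_phi(relu)
c_gelu = estimate_c_phi(gelu)
print(f"c_relu ~ {c_relu:.3f}, c_gelu ~ {c_gelu:.3f}")
print(f"ReLU std for n_in=784: {variance_preserving_std(c_relu, 784):.4f}")
```

For ReLU the estimate should land close to 1/2, which recovers the He/Kaiming rule σ²_W ≈ 2/n_in.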

Backward Propagation Variance Analysis

Backpropagation yields:

Var[δ_{l-1}] ≈ n_out σ²_W d_φ Var[δ_l]

where d_φ = E[φ'(z)²]. For ReLU, d_φ = 1/2, and balancing gradient variance requires σ²_W ≈ 2/n_out.
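The backward relation can be sanity-checked with a single simulated ReLU layer. This is a rough sketch under the same independence assumptions as the derivation: upstream gradients and pre-activations are drawn as independent N(0, 1) values, and weights as i.i.d. N(0, σ²_W). The function name `backward_gain` and the layer sizes are illustrative choices, not values from the paper.

```python
import math
import random


def backward_gain(n_in: int, n_out: int, sigma_w: float, seed: int = 0) -> float:
    """Empirical E[delta_{l-1}^2] / E[delta_l^2] for one ReLU layer."""
    rng = random.Random(seed)
    # Upstream gradient and pre-activations, assumed independent N(0, 1).
    delta_l = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    z_l = [rng.gauss(0.0, 1.0) for _ in range(n_out)]
    # ReLU derivative is 1 where z > 0 and 0 elsewhere.
    masked = [d if z > 0.0 else 0.0 for d, z in zip(delta_l, z_l)]
    # delta_{l-1}[j] = sum_i W[i][j] * phi'(z_i) * delta_l[i], with W[i][j] ~ N(0, sigma_w^2).
    delta_prev = [
        sum(rng.gauss(0.0, sigma_w) * m for m in masked)
        for _ in range(n_in)
    ]
    ms_prev = sum(d * d for d in delta_prev) / n_in
    ms_l = sum(d * d for d in delta_l) / n_out
    return ms_prev / ms_l


n_in, n_out = 300, 1000
gain_he = backward_gain(n_in, n_out, math.sqrt(2.0 / n_out))    # sigma_W^2 = 2 / n_out
gain_half = backward_gain(n_in, n_out, math.sqrt(1.0 / n_out))  # sigma_W^2 = 1 / n_out
print(f"gain with 2/n_out: {gain_he:.2f}, gain with 1/n_out: {gain_half:.2f}")
```

With σ²_W = 2/n_out the gain should be near 1, while σ²_W = 1/n_out roughly halves the gradient second moment per layer.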

Trade-offs and Practical Choices

Forward and backward preservation conditions typically cannot be satisfied simultaneously unless n_in ≈ n_out and c_φ ≈ d_φ. In practice, maintaining forward signal stability is usually more important, which the paper offers as the explanation for why fan-in He/Kaiming converges faster than Xavier under ReLU.
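The trade-off is easy to see by computing both per-layer gains for each fan choice. The sketch below assumes ReLU (c_φ = d_φ = 1/2) and uses the 784→64 layer shape as an example; `init_std` and `relu_gains` are hypothetical helper names.

```python
import math


def init_std(n_in: int, n_out: int, mode: str = "fan_in", gain_sq: float = 2.0) -> float:
    """Std for sigma_W^2 = gain_sq / fan; gain_sq = 2 is the ReLU (He/Kaiming) factor."""
    fan = {"fan_in": n_in, "fan_out": n_out, "fan_avg": (n_in + n_out) / 2}[mode]
    return math.sqrt(gain_sq / fan)


def relu_gains(n_in: int, n_out: int, std: float) -> tuple:
    """Per-layer (forward, backward) variance gains for ReLU, with c_phi = d_phi = 1/2."""
    return 0.5 * n_in * std ** 2, 0.5 * n_out * std ** 2


for mode in ("fan_in", "fan_out", "fan_avg"):
    std = init_std(784, 64, mode)
    fwd, bwd = relu_gains(784, 64, std)
    print(f"{mode:8s} std={std:.4f} forward gain={fwd:.3f} backward gain={bwd:.3f}")
```

Only one of the two gains can equal 1 when n_in ≠ n_out. Setting `gain_sq=1.0` with `mode="fan_avg"` gives the Xavier normal standard deviation √(2/(n_in + n_out)).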

Experimental Design

Experiment E1: Standard Deviation Sweep

  • Network Architecture: 784→64→32→32→10 ReLU MLP
  • Dataset: MNIST
  • Sweep Range: 25 standard deviation values from 10⁻⁴ to 10, logarithmically spaced
  • Evaluation Metrics: Loss trajectories, classification accuracy
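A back-of-the-envelope calculation shows why a single shared σ produces vanishing and exploding regimes in this architecture. The sketch below builds the 25-point logarithmic grid and applies the forward variance rule layer by layer, assuming zero-mean i.i.d. weights, no biases, c_φ = 1/2 for the three hidden ReLU layers, and a linear output layer. It is a heuristic about signal scale at initialization only, not a reproduction of the paper's training runs; the function names are illustrative.

```python
import math

WIDTHS = (784, 64, 32, 32, 10)  # layer sizes from the E1 architecture


def sweep_values(lo_exp: float = -4.0, hi_exp: float = 1.0, n: int = 25) -> list:
    """n standard deviations, logarithmically spaced from 10^lo_exp to 10^hi_exp."""
    step = (hi_exp - lo_exp) / (n - 1)
    return [10.0 ** (lo_exp + i * step) for i in range(n)]


def logit_rms_gain(sigma: float, widths=WIDTHS, c_relu: float = 0.5) -> float:
    """Predicted RMS of the logits relative to the RMS of the inputs at initialization."""
    gain = 1.0
    n_layers = len(widths) - 1
    for layer, n_in in enumerate(widths[:-1]):
        gain *= n_in * sigma ** 2      # linear map: second moment scales by n_in * sigma^2
        if layer < n_layers - 1:
            gain *= c_relu             # ReLU halves the second moment on hidden layers
    return math.sqrt(gain)


for sigma in sweep_values():
    print(f"sigma={sigma:9.2e}  predicted logit RMS gain={logit_rms_gain(sigma):9.2e}")
```

Because the gain scales as σ⁴ for this four-layer network, each decade of σ changes the predicted logit scale by four orders of magnitude, which is consistent with sharp vanishing and exploding regimes outside a moderate band.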

Experiment E2: Xavier vs Kaiming Comparison

  • Network Architecture: 11→16→32→32→1 ReLU network
  • Dataset: UCI Wine binary classification task
  • Comparison Schemes: Xavier normal vs Kaiming uniform
  • Statistical Validation: 10 random runs with paired t-tests
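For readers unfamiliar with the paired test, the statistic can be computed in a few lines. This is a minimal sketch: the loss values are synthetic numbers generated purely to exercise the function and are not results from the paper, and `paired_t_statistic` is an illustrative name. With 10 paired runs there are 9 degrees of freedom, for which the two-sided 5% critical value is about 2.262.

```python
import math
import random
import statistics

T_CRIT_DF9 = 2.262  # two-sided 5% critical value of Student's t with 9 degrees of freedom


def paired_t_statistic(a, b) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Synthetic placeholder data (NOT from the paper): one final loss per seed and method.
rng = random.Random(0)
xavier_loss = [0.40 + rng.gauss(0.0, 0.02) for _ in range(10)]
kaiming_loss = [x - 0.03 + rng.gauss(0.0, 0.01) for x in xavier_loss]

t = paired_t_statistic(xavier_loss, kaiming_loss)
print(f"t = {t:.2f}, significant at the 5% level: {abs(t) > T_CRIT_DF9}")
```

Pairing by seed removes run-to-run variation shared by both methods, which is why a paired test is a natural fit when each seed is trained once per initialization scheme.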

Experiment E3: GPT-2 Variance Dynamics

  • Model Scale: 12-layer GPT-2-style Transformer
  • Initialization: Standard configuration (most modules std=0.02, embedding layer Xavier normal)
  • Optimizer: AdamW, learning rate 1×10⁻⁴, batch size 16
  • Tracking Targets: Standard deviations of Q/K/V projection weights across all layers
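Tracking of this kind can be implemented as a small logging helper called at regular intervals during training. The sketch below is framework-agnostic and assumes a hypothetical parameter naming scheme (`layers.<i>.attn.{q,k,v}_proj.weight`); real models name and sometimes fuse these projections differently, so the pattern would need adapting.

```python
import re
import statistics
from collections import defaultdict

# Hypothetical naming scheme; adjust the pattern to the model actually used.
QKV_PATTERN = re.compile(r"layers\.(\d+)\.attn\.(q|k|v)_proj\.weight$")


def qkv_std_by_layer(named_weights) -> dict:
    """Map layer index -> {'q': std, 'k': std, 'v': std} from (name, flat values) pairs."""
    stats = defaultdict(dict)
    for name, values in named_weights:
        match = QKV_PATTERN.search(name)
        if match:
            layer, proj = int(match.group(1)), match.group(2)
            stats[layer][proj] = statistics.pstdev(values)
    return dict(stats)


# Toy input standing in for a model's parameters.
snapshot = qkv_std_by_layer([
    ("layers.0.attn.q_proj.weight", [1.0, -1.0]),
    ("layers.0.mlp.fc.weight", [5.0, 5.0]),        # ignored: not a Q/K/V projection
    ("layers.3.attn.v_proj.weight", [0.0, 2.0]),
])
print(snapshot)
```

In PyTorch, the pairs could be produced with `(name, p.detach().flatten().tolist())` for each `name, p` in `model.named_parameters()`; storing one snapshot per logging step gives the per-layer curves described above.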

Experimental Results

E1: Standard Deviation Sweep Results

  • Stable Interval: Training is smooth within σ ∈ [10⁻², 10⁻¹]; gradients are well behaved and accuracy peaks within this range
  • Vanishing Gradients: Extremely small scales (σ ≲ 10⁻³) cause updates to vanish and accuracy to drop
  • Exploding Gradients: Extremely large scales (σ ≳ 1) produce unstable losses and occasional divergence

E2: Initialization Method Comparison

Kaiming initialization consistently outperforms Xavier across multiple dimensions:

  • Convergence Speed: Lower median number of iterations to reach the target, with a steeper early loss descent
  • Accuracy: Final validation accuracy matches or slightly exceeds Xavier
  • Statistical Significance: Paired t-tests show significant differences in loss and training accuracy (p < 0.05)

E3: Transformer Variance Dynamics Findings

  • Depth-Dependent Patterns: Shallow layers display rapid and significant weight standard deviation expansion in early training, while deeper layers expand more slowly and smoothly
  • Variance Equilibrium: All layers eventually stabilize in a narrow variance band
  • Distribution Sparsification: Post-training weight distributions become sparser, with many entries near zero remaining unchanged while a few large-magnitude weights dominate

Theoretical Insights and Practical Implications

Depth-Dependent Variance Equilibrium Mechanisms

The paper reveals progressive equilibrium patterns in Transformers:

  1. Rapid Shallow Layer Adaptation: Layers near the input have high signal-to-noise ratio gradients, encouraging early aggressive scaling
  2. Progressive Deep Layer Adjustment: Residual path length and pre-normalization limit effective step sizes in deeper layers
  3. Implicit Constraints: Attention softmax saturation and weight decay in AdamW prevent large parameter scales

Practical Guidance Principles

  1. ReLU/GELU MLPs: Start with fan-in He/Kaiming; if highly imbalanced layers cause gradient drift, slightly shift toward fan-average choices
  2. Deep Residual Stacks: Residual scaling (e.g., 1/√L) or normalization helps prevent depth-related variance drift
  3. Transformer Projections: Use small standard deviation initialization (e.g., 0.02), monitor per-layer standard deviations and gradient norms
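The effect of residual scaling in item 2 can be illustrated with simple arithmetic. The sketch below assumes an idealized stack x ← x + α·f(x) in which each branch output is uncorrelated with its input and has the same variance, so every block multiplies the variance by (1 + α²). The function name and the depth of 12 are illustrative.

```python
import math


def residual_variance_growth(depth: int, branch_scale: float) -> float:
    """Variance multiplier after `depth` blocks of x <- x + branch_scale * f(x)."""
    return (1.0 + branch_scale ** 2) ** depth


L = 12
unscaled = residual_variance_growth(L, 1.0)               # (1 + 1)^12
scaled = residual_variance_growth(L, 1.0 / math.sqrt(L))  # (1 + 1/12)^12
print(f"depth {L}: unscaled x{unscaled:.0f}, 1/sqrt(L) scaling x{scaled:.2f}")
```

Under these assumptions the unscaled stack grows exponentially with depth, while the 1/√L scaling keeps the total growth below e regardless of depth.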

Foundational Initialization Strategies

  • LeCun Method: Variance preservation rules for linear activations
  • Glorot/Xavier: Fan-based scaling for tanh/sigmoid
  • He/Kaiming: Activation-aware scaling compensating for halved second moments under ReLU

Modern Developments

  • Fixup Initialization: Removes normalization requirements in extremely deep networks through carefully chosen initialization and residual scaling
  • DeepNet: Proposes principled depth-scaling rules enabling training of thousand-layer networks
  • Pre-normalization Advantages: Improves optimization stability compared to post-normalization by smoothing gradient flow

Conclusions and Discussion

Main Conclusions

  1. Stability Band Existence: A broad but sensitive stability band exists within σ ∈ [10⁻², 10⁻¹]
  2. Activation Function Specificity Matters: Kaiming initialization indeed outperforms Xavier in ReLU networks
  3. Depth-Dependent Dynamics: Transformers exhibit depth-dependent variance equilibrium with rapid shallow layer adaptation and progressive deep layer adjustment

Limitations

  1. Experimental Scale: GPT-2 experiments are relatively small (12 layers); behavior in larger models may differ
  2. Activation Function Coverage: Primarily focuses on ReLU and GELU; analysis of other activation functions is limited
  3. Optimizer Dependence: Results may be sensitive to specific optimizers (AdamW) and hyperparameter settings

Future Directions

  1. Adaptive Depth-Aware Initialization: Learn per-layer or per-head scales to bring shallow layers closer to final variance levels
  2. Optimizer and Schedule Coupling: Joint optimization of warmup length, weight decay, and gradient clipping
  3. Depth and Width Scaling: Evaluate persistence of depth-dependent equilibrium in larger models

In-Depth Evaluation

Strengths

  1. Theory-Practice Integration: Organically combines classical variance propagation theory with modern Transformer behavior
  2. Systematic Experimental Design: Progressive verification from simple MLPs to complex Transformers
  3. High Practical Value: Provides concrete initialization recommendations and diagnostic methods
  4. Statistical Rigor: Employs paired t-tests and other statistical methods to validate result significance

Weaknesses

  1. Limited Theoretical Depth: Lacks deeper theoretical explanations for depth-dependent phenomena
  2. Experimental Scale Constraints: Limited by computational resources, unable to validate on truly large-scale models
  3. Generalization Issues: Results primarily based on specific architectures and tasks; generalization capability requires further verification

Impact Assessment

  1. Academic Contribution: Provides modern perspective on initialization theory, connecting classical theory with current practice
  2. Practical Value: Offers practitioners explicit initialization strategies and diagnostic tools
  3. Reproducibility: Clear experimental design with detailed code and parameter settings facilitates reproduction

Applicable Scenarios

  1. Deep Network Training: Particularly suitable for deep networks with ReLU/GELU activations
  2. Transformer Optimization: Provides initialization guidance for large language model training
  3. Research Tool: Offers researchers a methodological framework for analyzing weight dynamics

References

The paper cites key works in the initialization field, including foundational research by LeCun, Glorot, He and others, as well as recent advances in Transformer optimization, providing a solid theoretical foundation for this research.