Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Yankun Han

Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, the paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.

Basic Information

Paper ID: 2510.09423
Title: Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
Author: Yankun Han (University of Florida)
Classification: cs.LG (Machine Learning)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09423

Abstract

Weight initialization controls signal propagation and gradient flow at the beginning of training. This paper provides a theoretically grounded and empirically validated study covering two domains: compact ReLU multilayer perceptrons and GPT-2-style Transformers. First, through logarithmic sweeps over initial standard deviations, the paper maps regions of vanishing and exploding gradients, identifying a broad stability band with standard deviations between 1e-2 and 1e-1. Second, controlled comparisons demonstrate that under ReLU activation, Kaiming (fan-in) initialization converges faster and more stably than Xavier initialization, consistent with variance preservation theory. Third, in a 12-layer GPT-2-style model built from scratch, the paper tracks variance changes in Q/K/V weight matrices across layers during pretraining, observing depth-dependent equilibrium phenomena: shallow layers expand rapidly while deeper layers change more gradually.

Research Background and Motivation

Problem Definition

The core problems addressed by this research concern the impact of weight initialization on training stability and convergence in deep neural networks and large language models. Specifically:

Initialization Scale Sensitivity: How different initial standard deviations affect training stability
Activation Function Specificity: Whether activation functions like ReLU and GELU require specific initialization strategies
Variance Dynamics in Modern Transformers: Whether variance stabilization persists in large Transformer models

Significance

Weight initialization is a critical factor for successful deep learning training. Improper initialization leads to:

Vanishing Gradients: Signal attenuation across layers in deep networks
Exploding Gradients: Exponential signal growth during propagation
Training Instability: Oscillations and divergence in the optimization process

Limitations of Existing Methods

Although classical initialization methods (LeCun, Xavier/Glorot, He/Kaiming) have intuitive variance preservation foundations, they suffer from the following issues in practical applications:

Sensitivity to deviations from ideal scales has not been adequately quantified
The mechanisms of specific activation functions (e.g., ReLU, GELU) remain unclear
Systematic studies of performance in large Transformers are lacking

Core Contributions

Unified Variance Analysis Framework: Derives forward and backward variance propagation conditions for common activation functions (ReLU, GELU), explaining how fan-in scaling preserves signal amplitude and the origin of the factor of 2 in ReLU
Quantification of Scale Sensitivity: Through logarithmic sweeps over 25 standard deviation values, maps vanishing/exploding gradient regions and identifies a stable training band σ ∈ [10⁻², 10⁻¹]
Activation-Aware Initialization Verification: In controlled ReLU MLP training, confirms that Kaiming normal (fan-in) converges faster with smaller loss variance compared to Xavier normal
Transformer Variance Dynamics Analysis: In a 12-layer GPT-2-style model built from scratch, discovers clear depth-dependent patterns: shallow layer weight standard deviations expand rapidly while deeper layers change more gradually, ultimately stabilizing in a narrow variance band

Methodology Details

Theoretical Framework

Forward Propagation Variance Analysis

For linear mapping:

Var[z_l] = n_in σ²_W Var[x_{l-1}]

After nonlinear activation:

Var[x_l] ≈ c_φ n_in σ²_W Var[x_{l-1}]

where c_φ = E[φ(z)²]/Var[z] is an activation function-dependent constant.

To prevent activation values from vanishing or exploding, choose σ²_W ≈ 1/(c_φ n_in):

ReLU: c_φ ≈ 1/2, thus σ²_W ≈ 2/n_in (He/Kaiming)
GELU: c_φ ≈ 0.45-0.5, slightly smaller than ReLU
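The constant c_φ can be checked numerically. The following is a minimal Monte Carlo sketch, not code from the paper: it assumes zero-mean Gaussian pre-activations, and the function name `estimate_c_phi`, the sample count, and the `input_std` parameter are illustrative choices. ReLU is positively homogeneous, so its constant does not depend on `input_std`; GELU is not, so its estimate depends on the assumed input scale and need not land exactly in the range quoted above.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def estimate_c_phi(phi, input_std=1.0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of c_phi = E[phi(z)^2] / Var[z] for z ~ N(0, input_std^2).

    input_std is an assumption of this sketch, not a value taken from the paper.
    """
    rng = random.Random(seed)
    second_moment = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, input_std)
        second_moment += phi(z) ** 2
    return (second_moment / n_samples) / input_std ** 2


c_relu = estimate_c_phi(relu)
c_gelu = estimate_c_phi(gelu)
print(f"c_phi (ReLU) ~ {c_relu:.3f}, implied sigma_W^2 ~ {1.0 / c_relu:.2f} / n_in")
print(f"c_phi (GELU) ~ {c_gelu:.3f}, implied sigma_W^2 ~ {1.0 / c_gelu:.2f} / n_in")
```

Re-running `estimate_c_phi(gelu, input_std=s)` for a few values of `s` shows how the GELU constant moves with the pre-activation scale, which is one reason a single number for GELU is only approximate.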

Backward Propagation Variance Analysis

Backpropagation yields:

Var[δ_{l-1}] ≈ n_out σ²_W d_φ Var[δ_l]

where d_φ = E[φ'(z)²]. For ReLU, d_φ = 1/2, and balancing gradient variance requires σ²_W ≈ 2/n_out.
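The ReLU constants and the resulting fan-out condition follow from a short calculation. This worked derivation is a sketch that assumes z is zero-mean and symmetric about zero (for example Gaussian); that assumption is added here for the illustration.

```latex
% Assumption: z is zero-mean and symmetric about zero, so E[z^2] = Var[z].
\begin{align}
\phi(z) &= \max(0, z), \qquad \phi'(z) = \mathbf{1}[z > 0], \\
d_\phi &= \mathbb{E}\!\left[\phi'(z)^2\right] = \Pr(z > 0) = \tfrac{1}{2}, \\
c_\phi &= \frac{\mathbb{E}\!\left[\phi(z)^2\right]}{\operatorname{Var}[z]}
        = \frac{\tfrac{1}{2}\,\mathbb{E}[z^2]}{\operatorname{Var}[z]} = \tfrac{1}{2}, \\
n_{\text{out}}\,\sigma_W^2\,d_\phi &= 1
  \;\Longrightarrow\; \sigma_W^2 = \frac{2}{n_{\text{out}}}.
\end{align}
```

The same factor of 1/2 appears in both directions for ReLU, which is why the forward and backward rules differ only in whether n_in or n_out sits in the denominator.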

Trade-offs and Practical Choices

Forward and backward preservation conditions typically cannot be satisfied simultaneously unless n_in ≈ n_out and c_φ ≈ d_φ. In practice, maintaining forward signal stability is usually more important, explaining why fan-in He/Kaiming converges faster than Xavier.
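The trade-off can be made concrete by evaluating the per-layer gain factors from the two formulas above. The sketch below uses the standard normal-variant formulas (Kaiming fan-in std = sqrt(2/n_in), Xavier std = sqrt(2/(n_in + n_out))) with the ReLU values c_φ = d_φ = 1/2. The layer shapes are borrowed from the hidden layers of the 11→16→32→32→1 network used in Experiment E2, and the helper names are illustrative, not taken from the paper.

```python
import math


def kaiming_normal_std(fan_in):
    # He/Kaiming fan-in rule for ReLU: sigma_W^2 = 2 / n_in.
    return math.sqrt(2.0 / fan_in)


def xavier_normal_std(fan_in, fan_out):
    # Glorot/Xavier rule with gain 1: sigma_W^2 = 2 / (n_in + n_out).
    return math.sqrt(2.0 / (fan_in + fan_out))


def relu_forward_gain(fan_in, std):
    # c_phi * n_in * sigma_W^2 with c_phi = 1/2.
    return 0.5 * fan_in * std ** 2


def relu_backward_gain(fan_out, std):
    # n_out * sigma_W^2 * d_phi with d_phi = 1/2.
    return 0.5 * fan_out * std ** 2


# Hidden-layer shapes (fan_in, fan_out); illustrative choice based on E2.
hidden_layers = [(11, 16), (16, 32), (32, 32)]

for fan_in, fan_out in hidden_layers:
    for name, std in (
        ("kaiming", kaiming_normal_std(fan_in)),
        ("xavier", xavier_normal_std(fan_in, fan_out)),
    ):
        fwd = relu_forward_gain(fan_in, std)
        bwd = relu_backward_gain(fan_out, std)
        print(f"{fan_in:>3}->{fan_out:<3} {name:8s} std={std:.4f} "
              f"forward gain={fwd:.3f} backward gain={bwd:.3f}")
```

Under these assumptions the fan-in rule keeps the forward gain at 1 for every layer while the backward gain drifts with n_out/n_in, whereas Xavier shrinks both gains below 1 under ReLU.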

Experimental Design

Experiment E1: Standard Deviation Sweep

Network Architecture: 784→64→32→32→10 ReLU MLP
Dataset: MNIST
Sweep Range: 25 standard deviation values from 10⁻⁴ to 10, logarithmically spaced
Evaluation Metrics: Loss trajectories, classification accuracy
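The sweep grid for E1 is easy to reproduce. The sketch below generates 25 log-spaced standard deviations between 10⁻⁴ and 10 and, as an idealized companion, evaluates the forward variance ratio predicted by the formula Var[x_l] ≈ c_φ n_in σ²_W Var[x_{l-1}] for the 784→64→32→32→10 network. It assumes ReLU after the three hidden layers and a linear output layer, and it says nothing about trainability: the stability band reported in the paper comes from actual training runs, not from this estimate. All names are illustrative.

```python
# Idealized sketch: log-spaced std grid plus a mean-field forward-gain estimate.
FAN_INS = [784, 64, 32, 32]   # fan-in of each linear layer in 784->64->32->32->10
C_PHI = [0.5, 0.5, 0.5, 1.0]  # assumed: ReLU after hidden layers, linear output


def log_sweep(low_exp=-4.0, high_exp=1.0, num=25):
    """Return `num` values spaced evenly in log10 between 10^low_exp and 10^high_exp."""
    step = (high_exp - low_exp) / (num - 1)
    return [10.0 ** (low_exp + i * step) for i in range(num)]


def predicted_forward_gain(std):
    """Product of per-layer factors c_phi * n_in * std^2 (output/input variance ratio)."""
    gain = 1.0
    for fan_in, c_phi in zip(FAN_INS, C_PHI):
        gain *= c_phi * fan_in * std ** 2
    return gain


stds = log_sweep()
gains = [predicted_forward_gain(s) for s in stds]
for s, g in zip(stds, gains):
    print(f"std={s:.2e}  predicted variance ratio={g:.3e}")
```

Because the same std is used for every layer, the predicted ratio scales as σ⁸ for this four-layer stack, which illustrates why a sweep over several orders of magnitude moves so quickly from vanishing to exploding signals.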

Experiment E2: Xavier vs Kaiming Comparison

Network Architecture: 11→16→32→32→1 ReLU network
Dataset: UCI Wine binary classification task
Comparison Schemes: Xavier normal vs Kaiming normal (fan-in)
Statistical Validation: 10 random runs with paired t-tests
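A paired t-test compares the two initializations run by run, using the per-seed difference in a metric. The following stdlib-only sketch computes the paired t statistic; the function name and the toy numbers are placeholders for illustration and are not results from the paper.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-pair differences.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy placeholder values (NOT data from the paper): one metric per paired run.
toy_xavier = [1.0, 2.0, 3.0, 4.0]
toy_kaiming = [0.5, 1.0, 2.5, 3.0]

t_stat, dof = paired_t_statistic(toy_xavier, toy_kaiming)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

With 10 paired runs there are 9 degrees of freedom, so a two-sided test at the 0.05 level rejects when |t| exceeds roughly 2.262. Where SciPy is available, `scipy.stats.ttest_rel` returns the statistic together with a p-value.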

Experiment E3: GPT-2 Variance Dynamics

Model Scale: 12-layer GPT-2-style Transformer
Initialization: Standard configuration (std=0.02 for most modules, Xavier normal for the embedding layer)
Optimizer: AdamW, learning rate 1×10⁻⁴, batch size 16
Tracking Targets: Standard deviations of Q/K/V projection weights across all layers
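Tracking the Q/K/V standard deviations amounts to slicing each layer's projection weights and logging one number per block at each checkpoint. The sketch below is a framework-free illustration under loud assumptions: it supposes a fused projection matrix of shape (d_model, 3·d_model) whose column blocks are ordered Q, K, V, as in some GPT-2 implementations, and it uses plain nested lists in place of real model tensors. The names `init_fused` and `split_qkv_std`, and the small `d_model`, are hypothetical; the paper's own logging code may differ.

```python
import random
import statistics


def init_fused(d_model, std, rng):
    """Stand-in for a fused Q/K/V weight: d_model rows, 3 * d_model columns."""
    return [[rng.gauss(0.0, std) for _ in range(3 * d_model)]
            for _ in range(d_model)]


def split_qkv_std(fused, d_model):
    """Population std of the Q, K and V column blocks of a fused matrix."""
    stats = {}
    for idx, name in enumerate(("q", "k", "v")):
        lo, hi = idx * d_model, (idx + 1) * d_model
        block = [w for row in fused for w in row[lo:hi]]
        stats[name] = statistics.pstdev(block)
    return stats


rng = random.Random(0)
d_model, n_layers = 64, 12  # d_model kept small for the sketch
layers = [init_fused(d_model, 0.02, rng) for _ in range(n_layers)]

# One snapshot per logged training step; only the step-0 snapshot is taken here.
history = [{i: split_qkv_std(w, d_model) for i, w in enumerate(layers)}]

for layer_idx, stats in history[0].items():
    print(layer_idx, {k: round(v, 4) for k, v in stats.items()})
```

In a real run the snapshot line would be repeated every few hundred optimizer steps, and plotting `history` per layer gives the depth-resolved curves the experiment describes.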

Experimental Results

E1: Standard Deviation Sweep Results

Stable Interval: Training is smooth for σ ∈ [10⁻², 10⁻¹]: gradients are well behaved and accuracy peaks within this range
Vanishing Gradients: Extremely small scales (σ ≲ 10⁻³) cause updates to vanish and accuracy to decline
Exploding Gradients: Extremely large scales (σ ≳ 1) produce unstable losses and occasional divergence

E2: Initialization Method Comparison

Kaiming initialization consistently outperforms Xavier across multiple dimensions:

Convergence Speed: Fewer median iterations to reach targets, steeper early loss descent
Accuracy: Final validation accuracy matches or slightly exceeds Xavier
Statistical Significance: Paired t-tests show significant differences in loss and training accuracy (p < 0.05)

E3: Transformer Variance Dynamics Findings

Depth-Dependent Patterns: Shallow layers display rapid and significant weight standard deviation expansion in early training, while deeper layers expand more slowly and smoothly
Variance Equilibrium: All layers eventually stabilize in a narrow variance band
Distribution Sparsification: Post-training weight distributions become sparser, with many entries near zero remaining unchanged while a few large-magnitude weights dominate

Theoretical Insights and Practical Implications

Depth-Dependent Variance Equilibrium Mechanisms

The paper reveals progressive equilibrium patterns in Transformers:

Rapid Shallow Layer Adaptation: Layers near the input have high signal-to-noise ratio gradients, encouraging early aggressive scaling
Progressive Deep Layer Adjustment: Residual path length and pre-normalization limit effective step sizes in deeper layers
Implicit Constraints: Attention softmax saturation and weight decay in AdamW prevent large parameter scales

Practical Guidance Principles

ReLU/GELU MLPs: Start with fan-in He/Kaiming; if highly imbalanced layers cause gradient drift, slightly shift toward fan-average choices
Deep Residual Stacks: Residual scaling (e.g., 1/√L) or normalization helps prevent depth-related variance drift
Transformer Projections: Use small standard deviation initialization (e.g., 0.02), monitor per-layer standard deviations and gradient norms
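The effect of residual scaling can be illustrated with a deliberately simplified variance recursion. The sketch assumes each residual block computes x + α·f(x), that the branch f preserves variance, and that the branch output is uncorrelated with the skip path, so the variance grows by a factor (1 + α²) per block. These assumptions and the function name are added here for illustration and do not come from the paper.

```python
import math


def residual_variance(depth, branch_scale, input_var=1.0):
    """Variance after `depth` blocks of x + branch_scale * f(x).

    Simplifying assumptions: f preserves variance and is uncorrelated
    with the skip path, so each block multiplies variance by 1 + scale^2.
    """
    var = input_var
    for _ in range(depth):
        var = var + branch_scale ** 2 * var
    return var


L = 12  # depth, matching the 12-layer model discussed above
unscaled = residual_variance(L, 1.0)               # grows like 2^L
scaled = residual_variance(L, 1.0 / math.sqrt(L))  # (1 + 1/L)^L, bounded by e

print(f"no scaling:      variance x{unscaled:.0f}")
print(f"1/sqrt(L) scale: variance x{scaled:.3f}")
```

Under this toy model the unscaled stack multiplies variance exponentially in depth, while the 1/√L scaling keeps the total growth below e regardless of L, which is the intuition behind depth-aware residual scaling.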

Foundational Initialization Strategies

LeCun Method: Variance preservation rules for linear activations
Glorot/Xavier: Fan-based scaling for tanh/sigmoid
He/Kaiming: Activation-aware scaling compensating for halved second moments under ReLU

Modern Developments

Fixup Initialization: Removes normalization requirements in extremely deep networks through carefully chosen initialization and residual scaling
DeepNet: Proposes principled depth-scaling rules enabling training of thousand-layer networks
Pre-normalization Advantages: Improves optimization stability compared to post-normalization by smoothing gradient flow

Conclusions and Discussion

Main Conclusions

Stability Band Existence: A broad but sensitive stability band exists within σ ∈ [10⁻², 10⁻¹]
Activation Function Specificity Matters: Kaiming initialization indeed outperforms Xavier in ReLU networks
Depth-Dependent Dynamics: Transformers exhibit depth-dependent variance equilibrium with rapid shallow layer adaptation and progressive deep layer adjustment

Limitations

Experimental Scale: GPT-2 experiments are relatively small (12 layers); behavior in larger models may differ
Activation Function Coverage: Primarily focuses on ReLU and GELU; analysis of other activation functions is limited
Optimizer Dependence: Results may be sensitive to specific optimizers (AdamW) and hyperparameter settings

Future Directions

Adaptive Depth-Aware Initialization: Learn per-layer or per-head scales to bring shallow layers closer to final variance levels
Optimizer and Schedule Coupling: Joint optimization of warmup length, weight decay, and gradient clipping
Depth and Width Scaling: Evaluate persistence of depth-dependent equilibrium in larger models

In-Depth Evaluation

Strengths

Theory-Practice Integration: Organically combines classical variance propagation theory with modern Transformer behavior
Systematic Experimental Design: Progressive verification from simple MLPs to complex Transformers
High Practical Value: Provides concrete initialization recommendations and diagnostic methods
Statistical Rigor: Employs paired t-tests and other statistical methods to validate result significance

Weaknesses

Limited Theoretical Depth: Lacks deeper theoretical explanations for depth-dependent phenomena
Experimental Scale Constraints: Limited by computational resources, unable to validate on truly large-scale models
Generalization Issues: Results primarily based on specific architectures and tasks; generalization capability requires further verification

Impact Assessment

Academic Contribution: Provides modern perspective on initialization theory, connecting classical theory with current practice
Practical Value: Offers practitioners explicit initialization strategies and diagnostic tools
Reproducibility: Clear experimental design with detailed code and parameter settings facilitates reproduction

Applicable Scenarios

Deep Network Training: Particularly suitable for deep networks with ReLU/GELU activations
Transformer Optimization: Provides initialization guidance for large language model training
Research Tool: Offers researchers a methodological framework for analyzing weight dynamics

References

The paper cites key works in the initialization field, including foundational research by LeCun, Glorot, He and others, as well as recent advances in Transformer optimization, providing a solid theoretical foundation for this research.