Stability of Transformers under Layer Normalization

Kan, Li, Zhang et al.
Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.

Basic Information

  • Paper ID: 2510.09904
  • Title: Stability of Transformers under Layer Normalization
  • Authors: Kelvin Kan (UCLA), Xingjian Li (UT Austin), Benjamin J. Zhang (UNC Chapel Hill), Tuhin Sahai (SRI International), Stanley Osher (UCLA), Krishna Kumar (UT Austin), Markos A. Katsoulakis (UMass Amherst)
  • Classification: cs.LG, cs.AI, math.OC
  • Publication Date: October 10, 2025
  • Paper Link: https://arxiv.org/abs/2510.09904

Abstract

Although Transformers are widely used, training deep Transformers can be unstable. Layer Normalization (LN), a standard component, improves training stability, but its placement is often ad hoc. This paper provides a principled investigation of the forward stability (hidden states) and backward stability (gradients) of Transformers under different layer normalization placements. The theoretical analysis addresses a key question about training dynamics: whether training drives the Transformer toward regular solutions or toward pathological behavior. For forward stability, the paper derives explicit bounds on hidden state growth in trained Transformers. For backward stability, it analyzes how layer normalization affects gradient backpropagation, which explains the training dynamics of each placement. The analysis also guides the scaling of residual step sizes in Transformer blocks, where appropriate choices further improve stability and performance.

Research Background and Motivation

Problem Definition

The core question of this research is how different layer normalization placements affect Transformer training stability. Specifically, this includes:

  1. Forward Stability Problem: Controlling hidden state growth in deep networks
  2. Backward Stability Problem: Gradient stability during backpropagation
  3. Architecture Design Guidance: Providing theoretical guidance for new Transformer variants

Importance Analysis

  1. Practical Value: Transformers are fundamental to modern deep learning, and their training stability directly impacts model performance and training efficiency
  2. Theoretical Gap: Existing layer normalization placement choices are primarily empirical, lacking theoretical justification
  3. Industrial Demand: As model scale increases, training stability issues become increasingly prominent

Limitations of Existing Methods

  1. Post-LN: Requires fine-grained optimization schedules and often yields suboptimal performance
  2. Pre-LN: Improves early training stability but produces excessively large hidden states, leading to numerical instability
  3. Peri-LN: Performs well in practice but lacks theoretical understanding

Research Motivation

The authors adopt a perspective based on continuous-time dynamics and optimal control theory, modeling Transformer training as a mean-field control problem. This makes it possible to analyze model properties after training has converged, rather than focusing solely on behavior at initialization.

Core Contributions

  1. Theoretical Framework Innovation: Proposes a novel framework based on optimal control theory to systematically analyze Transformer stability under different layer normalization positions
  2. Forward Stability Analysis: Derives explicit bounds on hidden state growth, proving that Pre-LN leads to unbounded growth while Peri-LN maintains controlled growth
  3. Backward Stability Analysis: Reveals the mechanism by which layer normalization affects gradient backpropagation
  4. Residual Step Size Scaling: Proposes residual step size scaling methods to improve stability and performance
  5. Experimental Validation: Validates theoretical findings on GPT-2 series models

Methodology Details

Task Definition

Investigating Transformer stability under different layer normalization positions, including:

  • Input: Sequence after embedding and positional encoding, $X_0 \in \mathbb{R}^{d \times n}$
  • Output: Hidden states after $D$ Transformer blocks, $X_D$
  • Objective: Analyzing forward and backward propagation stability

Continuous-Time Modeling

Continuous-Time Representation of Transformers

The skip connection structure of standard Transformer blocks is interpreted as an Euler discretization of continuous-time dynamics:

$$\frac{dX(t)}{dt} = \begin{cases} f_{attn}(X(t), t; \theta_{attn}(t)), & t \in [t_i, t_i + \Delta t) \\ f_{ffn}(X(t), t; \theta_{ffn}(t)), & t \in [t_i + \Delta t, t_{i+1}) \end{cases}$$

where $\Delta t = \frac{T}{2D}$ and $t_i = 2i\Delta t$.
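
Written out, one forward Euler step of size $\Delta t$ on each sub-interval gives the familiar two-sublayer residual update. The half-index notation $X_{i+1/2}$ is introduced here only for illustration and is not taken from the paper:

```latex
\begin{aligned}
X_{i+1/2} &= X_i + \Delta t \, f_{attn}(X_i, t_i; \theta_{attn}(t_i)), \\
X_{i+1}   &= X_{i+1/2} + \Delta t \, f_{ffn}(X_{i+1/2}, t_i + \Delta t; \theta_{ffn}(t_i + \Delta t)).
\end{aligned}
```

With $\Delta t = 1$ (that is, $T = 2D$) this is the standard residual block. Choosing $\Delta t < 1$ shrinks every residual update, which is the residual step size scaling evaluated in the experiments.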

Mean-Field Control Problem Formulation

The training problem is formulated as a continuous-time mean-field control problem:

$$\min_\theta \mathbb{E}_{(X_0,y)} G(X(T), y) \quad \text{s.t. } \frac{dX(t)}{dt} = f(X(t), t; \theta(t))$$

where $f \in \{f_{Pre}, f_{Peri}\}$ corresponds to the layer normalization placement.
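
A minimal NumPy sketch of how the two placements differ as residual updates. The update forms x + Δt·F(LN(x)) for Pre-LN and x + Δt·LN(F(LN(x))) for Peri-LN follow the usual definitions of these placements and are an assumption about the exact f_Pre and f_Peri used in the paper. The helper names and the toy linear sublayer are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the feature axis (axis 0) of a (d, n) array."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)  # biased variance, as in standard LN
    return gamma[:, None] * (x - mu) / np.sqrt(var + eps) + beta[:, None]

def pre_ln_step(x, sublayer, gamma, beta, dt=1.0):
    """Assumed Pre-LN residual update: x + dt * F(LN(x))."""
    return x + dt * sublayer(layer_norm(x, gamma, beta))

def peri_ln_step(x, sublayer, gamma_in, beta_in, gamma_out, beta_out, dt=1.0):
    """Assumed Peri-LN residual update: x + dt * LN_out(F(LN_in(x)))."""
    h = sublayer(layer_norm(x, gamma_in, beta_in))
    return x + dt * layer_norm(h, gamma_out, beta_out)

rng = np.random.default_rng(0)
d, n = 8, 4
x = rng.normal(size=(d, n))
W = 100.0 * rng.normal(size=(d, d))   # deliberately large weights

def linear(h):
    """Toy stand-in for an attention or feed-forward sublayer."""
    return W @ h

ones, zeros = np.ones(d), np.zeros(d)

pre_update = pre_ln_step(x, linear, ones, zeros, dt=0.1) - x
peri_update = peri_ln_step(x, linear, ones, zeros, ones, zeros, dt=0.1) - x

# The Pre-LN update scales with the sublayer weights; the Peri-LN update
# is capped by the output LayerNorm regardless of how large W is.
print("Pre-LN update norm: ", np.linalg.norm(pre_update))
print("Peri-LN update norm:", np.linalg.norm(peri_update))
```

In this sketch the size of the Peri-LN update depends only on the output LayerNorm parameters and on dt, whereas the Pre-LN update inherits the scale of the sublayer weights.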

Geometric Properties of Layer Normalization

Key Lemma 1: Layer normalization outputs lie on the surface of an ellipsoid: $\mathcal{E} = \{z \in \mathbb{R}^d : (z - \beta)^T \Gamma^{-2} (z - \beta) = d\}$, where $\Gamma = \text{diag}(\gamma)$.
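
The lemma can be checked numerically. The sketch below assumes the standard LayerNorm definition with the biased variance estimator and no epsilon term, and assumes every entry of γ is nonzero so that Γ is invertible. The function names are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta):
    """LayerNorm of a single vector x in R^d (biased variance, eps = 0)."""
    x_hat = (x - x.mean()) / x.std()   # np.std defaults to the biased estimator
    return gamma * x_hat + beta

def ellipsoid_value(z, gamma, beta):
    """Evaluate (z - beta)^T Gamma^{-2} (z - beta) with Gamma = diag(gamma)."""
    return float(np.sum(((z - beta) / gamma) ** 2))

rng = np.random.default_rng(0)
d = 16
gamma = rng.uniform(0.5, 2.0, size=d)   # nonzero gains, so Gamma is invertible
beta = rng.normal(size=d)

for scale in (1e-2, 1.0, 1e3):
    x = scale * rng.normal(size=d)
    z = layer_norm(x, gamma, beta)
    # The value stays at d (up to rounding) whatever the input scale.
    print(scale, ellipsoid_value(z, gamma, beta))
```

The reason is that the normalized vector has zero mean and unit biased variance, so its squared entries sum to d. Undoing the affine map recovers exactly that vector.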

Forward Stability Analysis

Unboundedness of Pre-LN

Theorem 2: The optimal solution of the Pre-LN training problem is unbounded in magnitude.

Proof Strategy: By analyzing the Hamilton-Jacobi-Bellman (HJB) partial differential equation, it is proven that the corresponding Hamiltonian does not exist, causing the training problem to degenerate.

Theorem 3: Even with weight decay, the bound on Pre-LN Transformer hidden states grows exponentially with depth: $MA(X_D) \leq (1 + C(\lambda))^D \frac{\|X_0\|_F}{\sqrt{nd}} = O(e^D)$

Controlled Growth of Peri-LN

Theorem 4: Peri-LN Transformer hidden states grow at most linearly with depth: $MA(X_D) \leq \frac{\|X_0\|_F}{\sqrt{nd}} + 2D(\gamma_{max} + \beta_{max}) = O(D)$

The variance grows at most quadratically: $\text{Var}(X_D) \leq \frac{(\|X_0\|_F + 2D\sqrt{nd}(\gamma_{max} + \beta_{max}))^2}{nd - 1} = O(D^2)$
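
One way to see where the linear rate comes from is sketched below. It rests on several assumptions that are not stated in this summary: each Peri-LN sublayer update has the form X ← X + LN(F(LN(X))) with $\Delta t = 1$; $\gamma_{max}$ and $\beta_{max}$ are the largest absolute entries of the output LayerNorm parameters; MA is the mean absolute value of the entries; and Var is the unbiased sample variance over all $nd$ entries. The paper's own proof may proceed differently.

```latex
% A LayerNorm output column z = \gamma \odot \hat{x} + \beta has \|\hat{x}\|_2 = \sqrt{d}, so
\|z\|_2 \le \gamma_{max}\|\hat{x}\|_2 + \|\beta\|_2 \le \sqrt{d}\,(\gamma_{max} + \beta_{max}).

% Stacking n columns, every sublayer update U satisfies
\|U\|_F \le \sqrt{nd}\,(\gamma_{max} + \beta_{max}).

% Two sublayers per block and D blocks, by the triangle inequality:
\|X_D\|_F \le \|X_0\|_F + 2D\sqrt{nd}\,(\gamma_{max} + \beta_{max}).

% Finally, by Cauchy-Schwarz and \sum (x - \bar{x})^2 \le \sum x^2:
MA(X_D) \le \frac{\|X_D\|_F}{\sqrt{nd}}, \qquad
\text{Var}(X_D) \le \frac{\|X_D\|_F^2}{nd - 1}.
```

Substituting the Frobenius bound into the last line reproduces both stated bounds. Under Pre-LN there is no output normalization to cap the size of each update, so this argument does not carry over.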

Backward Stability Analysis

Gradient computation formula: $\nabla_{\theta_i} G(X_D) = \nabla_{\theta_i} X_{i+1} \cdot J_{i:D} \cdot \nabla_{X_D} G(X_D)$

where the Jacobian matrix is: $J_{i:D} = \prod_{j=i+1}^D (I + \nabla_{X_{j-1}} f(X_{j-1}; \theta_{j-1}))$
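
To illustrate why the per-layer sensitivity matters for this product, the following sketch multiplies factors of the form I + Δt·A_j for random matrices A_j. The matrices are hypothetical stand-ins for the sensitivities, not values from a trained model, and the explicit dt factor is an assumption that reduces to the formula above when dt = 1.

```python
import numpy as np

def jacobian_product(sensitivities, dt=1.0):
    """Return (I + dt*A_D) ... (I + dt*A_{i+1}) for square matrices A_j.

    Later layers are multiplied on the left; this ordering is a convention
    chosen for the sketch.
    """
    d = sensitivities[0].shape[0]
    J = np.eye(d)
    for A in sensitivities:
        J = (np.eye(d) + dt * A) @ J
    return J

def norm_bound(sensitivities, dt=1.0):
    """Submultiplicative bound: prod_j (1 + dt * ||A_j||_2)."""
    return float(np.prod([1.0 + dt * np.linalg.norm(A, 2) for A in sensitivities]))

rng = np.random.default_rng(0)
depth, d = 12, 6
for scale in (0.1, 1.0, 10.0):
    # Hypothetical per-layer sensitivities whose size is set by `scale`.
    mats = [scale * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]
    for dt in (1.0, 0.1):
        J = jacobian_product(mats, dt)
        print(f"scale={scale}, dt={dt}: ||J||_2={np.linalg.norm(J, 2):.3e}, "
              f"bound={norm_bound(mats, dt):.3e}")
```

The bound grows geometrically in depth at a rate set by dt times the sensitivity norm, so both a smaller residual step and a sensitivity that does not grow with the activations keep the backpropagated signal under control.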

Proposition 7: Under Pre-LN, the sensitivity $\nabla_{X_{j-1}} f_{Pre}$ grows proportionally with activation values.

Proposition 8: Under Peri-LN, the sensitivity $\nabla_{X_{j-1}} f_{Peri}$ is invariant to activation value magnitude.

Experimental Setup

Datasets

  • OpenWebText Dataset: Approximately 9 billion training tokens, 4 million validation tokens
  • Pre-training using GPT-2 series architecture

Model Configuration

  • GPT-2 (124M parameters)
  • GPT-2 Large (774M parameters)
  • GPT-2 XL (1.5B parameters)

Evaluation Metrics

  • Perplexity
  • ROUGE Scores (Rouge1, Rouge2, RougeL)
  • BERT Scores (BertP, BertR, BertF1)
  • Training Stability: Count of divergent runs

Implementation Details

  • Hyperparameters tuned for Pre-LN, not separately optimized for Peri-LN
  • Residual step size scaling: $\Delta t \in \{0.1, 1\}$
  • Hardware: NVIDIA H200 GPU

Experimental Results

Training Stability Comparison

Layer Normalization Setup | Weight Decay Enabled | Weight Decay Disabled
Pre-LN | 1/5 diverged | 3/5 diverged
Peri-LN | 0/5 diverged | 0/5 diverged
No LN | 5/5 diverged

Performance Comparison Results

GPT-2 (124M) Model Results:

  • Pre-LN ($\Delta t = 1$): Validation loss 5.43, perplexity 247.52
  • Pre-LN ($\Delta t = 0.1$): Validation loss 3.13, perplexity 24.43
  • Peri-LN ($\Delta t = 1$): Validation loss 3.12, perplexity 24.17
  • Peri-LN ($\Delta t = 0.1$): Validation loss 3.10, perplexity 23.63

Hidden State Growth Analysis

Experiments validate theoretical predictions:

  • Pre-LN exhibits rapid growth at larger $\Delta t$
  • Peri-LN maintains more regular linear growth
  • Residual step size scaling effectively controls growth rate

Residual Step Size Scaling Effects

  1. Performance Improvement: Peri-LN with $\Delta t = 0.1$ shows the best performance across all metrics
  2. Stability Improvement: Pre-LN transitions from unstable to stable at $\Delta t = 0.1$
  3. Growth Control: Effectively reduces mean and variance growth rates of hidden states

Related Work

Layer Normalization Research

  • Post-LN: Earliest Transformer design, requires fine-grained scheduling
  • Pre-LN: Improves training stability but produces large activation values
  • Peri-LN: Recently adopted in large-scale models such as Gemma2, OLMo2

Theoretical Analysis Methods

  • Existing work primarily focuses on initialization behavior or relies on empirical observations
  • This paper instead analyzes model properties after training convergence
  • Continuous-time perspective provides new tools for architecture analysis

Conclusions and Discussion

Main Conclusions

  1. Pre-LN Theoretical Deficiency: The training problem is inherently ill-posed, leading to unbounded solutions
  2. Peri-LN Advantages: Provides well-defined optimization problems and controlled hidden state growth
  3. Residual Scaling Value: Simple and effective method for stability improvement

Limitations

  1. Simplified Assumptions: Theoretical analysis based on continuous-time approximation
  2. Hyperparameter Dependency: Experiments use hyperparameters tuned for Pre-LN
  3. Scale Limitations: Experiments primarily conducted on medium-scale models

Future Directions

  1. Architecture Screening Framework: Provide theoretical screening criteria for new architecture modifications
  2. Larger-Scale Validation: Validate theoretical findings on larger models
  3. Other Normalization Methods: Extend analysis to variants such as RMSNorm

In-Depth Evaluation

Strengths

  1. Strong Theoretical Innovation: First to analyze layer normalization placement using optimal control theory
  2. Mathematical Rigor: Provides complete theoretical derivations and proofs
  3. High Practical Value: Residual step size scaling method is simple and effective
  4. Reasonable Experimental Design: Validates theory across multiple model scales

Weaknesses

  1. Theory-Practice Gap: Continuous-time assumptions differ from actual discrete implementations
  2. Limited Experimental Scope: Primarily validated on GPT-2 series, lacks validation across more architectures
  3. Hyperparameter Fairness: Peri-LN not separately optimized for hyperparameters

Impact Assessment

  1. Academic Contribution: Provides new theoretical framework for Transformer stability analysis
  2. Practical Value: Guides actual model design and training strategies
  3. Reproducibility: The authors commit to releasing code and models

Applicable Scenarios

  1. Deep Transformer Training: Particularly suitable for large-scale deep models
  2. Architecture Design Guidance: Provides theoretical basis for new architecture modifications
  3. Training Stability Improvement: Enhances training stability through residual scaling

References

The paper cites multiple important works, including:

  • Ba et al. (2016): Original Layer Normalization paper
  • Xiong et al. (2020): Pre-LN vs Post-LN comparative study
  • Kim et al. (2025): Empirical study of Peri-LN
  • He et al. (2016): Pioneering work on residual connections

Overall Assessment: This is a high-quality paper that effectively combines theory and practice, providing a new mathematical framework for Transformer stability analysis with significant academic value and practical implications. The theoretical analysis is rigorous and in-depth, experimental validation is comprehensive, and it provides valuable guidance for deep learning architecture design.