
Stability of Transformers under Layer Normalization

Kan, Li, Zhang et al.

Despite their widespread use, training deep Transformers can be unstable. Layer normalization, a standard component, improves training stability, but its placement has often been ad-hoc. In this paper, we conduct a principled study on the forward (hidden states) and backward (gradient) stability of Transformers under different layer normalization placements. Our theory provides key insights into the training dynamics: whether training drives Transformers toward regular solutions or pathological behaviors. For forward stability, we derive explicit bounds on the growth of hidden states in trained Transformers. For backward stability, we analyze how layer normalization affects the backpropagation of gradients, thereby explaining the training dynamics of each layer normalization placement. Our analysis also guides the scaling of residual steps in Transformer blocks, where appropriate choices can further improve stability and performance. Our numerical results corroborate our theoretical findings. Beyond these results, our framework provides a principled way to sanity-check the stability of Transformers under new architectural modifications, offering guidance for future designs.


Basic Information

Paper ID: 2510.09904
Title: Stability of Transformers under Layer Normalization
Authors: Kelvin Kan (UCLA), Xingjian Li (UT Austin), Benjamin J. Zhang (UNC Chapel Hill), Tuhin Sahai (SRI International), Stanley Osher (UCLA), Krishna Kumar (UT Austin), Markos A. Katsoulakis (UMass Amherst)
Classification: cs.LG, cs.AI, math.OC
Publication Date: October 10, 2025
Paper Link: https://arxiv.org/abs/2510.09904

Abstract

Although Transformers are widely used, training deep Transformers can be unstable. Layer normalization (LN), a standard component, improves training stability, but its placement is often ad hoc. This paper provides a principled investigation of the forward stability (hidden states) and backward stability (gradients) of Transformers under different LN placements. The theoretical analysis yields key insights into training dynamics, namely whether training drives the Transformer toward regular solutions or pathological behavior. For forward stability, the paper derives explicit bounds on hidden state growth in trained Transformers. For backward stability, it analyzes how layer normalization affects gradient backpropagation, which explains the training dynamics of each LN placement. The analysis also guides the scaling of residual step sizes in Transformer blocks, where appropriate choices further improve stability and performance.

Research Background and Motivation

Problem Definition

The core question is how the placement of layer normalization affects Transformer training stability. Specifically, this includes:

Forward Stability Problem: Controlling hidden state growth in deep networks
Backward Stability Problem: Gradient stability during backpropagation
Architecture Design Guidance: Providing theoretical guidance for new Transformer variants

Importance Analysis

Practical Value: Transformers are fundamental to modern deep learning, and their training stability directly impacts model performance and training efficiency
Theoretical Gap: Existing layer normalization placement choices are primarily empirical, lacking theoretical justification
Industrial Demand: As model scale increases, training stability issues become increasingly prominent

Limitations of Existing Methods

Post-LN: Requires finely tuned optimization schedules and often yields suboptimal performance
Pre-LN: Improves early training stability but produces excessively large hidden states, leading to numerical instability
Peri-LN: Performs well in practice but lacks theoretical understanding

Research Motivation

The authors adopt a novel perspective using continuous-time dynamics and optimal control theory, modeling Transformer training as a mean-field control problem, enabling analysis of model properties after training convergence rather than focusing solely on initialization behavior.

Core Contributions

Theoretical Framework Innovation: Proposes a novel framework based on optimal control theory to systematically analyze Transformer stability under different layer normalization positions
Forward Stability Analysis: Derives explicit bounds on hidden state growth, proving that Pre-LN leads to unbounded growth while Peri-LN maintains controlled growth
Backward Stability Analysis: Reveals the mechanism by which layer normalization affects gradient backpropagation
Residual Step Size Scaling: Proposes residual step size scaling methods to improve stability and performance
Experimental Validation: Validates theoretical findings on GPT-2 series models

Methodology Details

Task Definition

The paper studies Transformer stability under different layer normalization placements in the following setting:

Input: Sequence after embedding and positional encoding, $X_0 \in \mathbb{R}^{d \times n}$
Output: Hidden states after $D$ Transformer blocks, $X_D$
Objective: Analyze the stability of forward and backward propagation

Continuous-Time Modeling

Continuous-Time Representation of Transformers

The skip connection structure of standard Transformer blocks is interpreted as Euler discretization of continuous-time dynamics:

$\frac{dX(t)}{dt} = \begin{cases} f_{attn}(X(t), t; \theta_{attn}(t)), & t \in [t_i, t_i + \Delta t) \\ f_{ffn}(X(t), t; \theta_{ffn}(t)), & t \in [t_i + \Delta t, t_{i+1}) \end{cases}$

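where $\Delta t = \frac{T}{2D}$ and $t_i = 2i\Delta t$, so each of the $D$ blocks spans two sub-steps of length $\Delta t$ (attention, then feed-forward) on the interval $[0, T]$.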
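The Euler-discretization reading above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `euler_transformer`, `f_attn`, and `f_ffn` are hypothetical, and both sublayers are replaced by the toy dynamics $dx/dt = -x$ so that the result can be compared against a known exact solution.

```python
import numpy as np

def euler_transformer(x0, f_attn, f_ffn, depth, T):
    """Forward pass read as explicit Euler steps on [0, T].

    Each of the `depth` blocks takes two sub-steps of size dt = T / (2 * depth):
    one for the attention dynamics and one for the feed-forward dynamics.
    """
    dt = T / (2 * depth)
    x = x0
    for i in range(depth):
        t_i = 2 * i * dt
        x = x + dt * f_attn(x, t_i)        # residual update on [t_i, t_i + dt)
        x = x + dt * f_ffn(x, t_i + dt)    # residual update on [t_i + dt, t_{i+1})
    return x

# Toy stand-in for both sublayers (an assumption for illustration only):
# dx/dt = -x has the exact solution x(T) = exp(-T) * x(0).
decay = lambda x, t: -x

x0 = np.ones(4)
xT = euler_transformer(x0, decay, decay, depth=1000, T=1.0)
print(xT[0], np.exp(-1.0))
```

In this reading, $\Delta t$ is simply the multiplier on each residual branch, which is the quantity the residual step size scaling in the experiments varies.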

Mean-Field Control Problem Formulation

The training problem is formulated as a continuous-time mean-field control problem:

$\min_\theta \mathbb{E}_{(X_0,y)} G(X(T), y) \quad \text{s.t.} \quad \frac{dX(t)}{dt} = f(X(t), t; \theta(t))$

where $f \in \{f_{Pre}, f_{Peri}\}$ is the dynamics corresponding to the chosen layer normalization placement.
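The summary does not spell out the placements themselves. As a reading aid, they are commonly written as below for a single residual update, where $F$ stands for an attention or feed-forward sublayer. This notation is an assumption based on the usual definitions in the literature, not a formula taken from the paper.

```latex
\begin{aligned}
\text{Post-LN:}\quad & X_{j+1} = \mathrm{LN}\big(X_j + F(X_j;\theta_j)\big) \\
\text{Pre-LN:}\quad  & X_{j+1} = X_j + F\big(\mathrm{LN}(X_j);\theta_j\big) \\
\text{Peri-LN:}\quad & X_{j+1} = X_j + \mathrm{LN}\big(F(\mathrm{LN}(X_j);\theta_j)\big)
\end{aligned}
```

Under this reading, the Peri-LN update added to the residual stream is always a layer normalization output, while the Pre-LN update is the raw sublayer output.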

Geometric Properties of Layer Normalization

Key Lemma 1: Layer normalization outputs lie on the surface of an ellipsoid, $\mathcal{E} = \{z \in \mathbb{R}^d : (z - \beta)^T\Gamma^{-2}(z - \beta) = d\}$, where $\Gamma = \text{diag}(\gamma)$ and $\gamma$, $\beta$ are the LN scale and shift parameters.
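The lemma can be checked numerically. The sketch below assumes the standard LN definition with population variance and $\epsilon = 0$ (with $\epsilon > 0$ the outputs sit slightly inside the ellipsoid), and it stores tokens as rows, unlike the $d \times n$ column convention used in the text. The helper name `layer_norm` is illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=0.0):
    """LayerNorm over the last axis, using the population variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 16
gamma = rng.uniform(0.5, 2.0, size=d)   # LN scale
beta = rng.normal(size=d)               # LN shift
x = 10.0 * rng.normal(size=(8, d))      # 8 tokens with arbitrary magnitude

z = layer_norm(x, gamma, beta)

# Quadratic form (z - beta)^T Gamma^{-2} (z - beta), one value per token.
quad = np.sum(((z - beta) / gamma) ** 2, axis=-1)
print(quad)  # each entry should equal d up to rounding
```

The identity holds because the normalized vector has zero mean and unit population variance, so its squared entries sum to exactly $d$ regardless of the input scale.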

Forward Stability Analysis

Unboundedness of Pre-LN

Theorem 2: The optimal solution of the Pre-LN training problem is unbounded in magnitude.

Proof Strategy: By analyzing the Hamilton-Jacobi-Bellman (HJB) partial differential equation, it is proven that the corresponding Hamiltonian does not exist, causing the training problem to degenerate.

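Theorem 3: Even with weight decay, the bound on Pre-LN Transformer hidden states grows exponentially with depth: $MA(X_D) \leq (1 + C(\lambda))^D \frac{\|X_0\|_F}{\sqrt{nd}} = O(e^D)$, where $C(\lambda)$ is a constant that depends on the weight decay coefficient $\lambda$, and $MA(\cdot)$ presumably denotes the mean absolute value of the hidden-state entries.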

Controlled Growth of Peri-LN

Theorem 4: Peri-LN Transformer hidden states grow at most linearly with depth: $MA(X_D) \leq \frac{\|X_0\|_F}{\sqrt{nd}} + 2D(\gamma_{max} + \beta_{max}) = O(D)$

The variance grows at most quadratically: $\text{Var}(X_D) \leq \frac{(\|X_0\|_F + 2D\sqrt{nd}(\gamma_{max} + \beta_{max}))^2}{nd - 1} = O(D^2)$
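The linear bound can be made plausible with a short argument and a numerical check. Each Peri-LN update is an LN output, so by the ellipsoid property every token of the update has Euclidean norm at most $\sqrt{d}(\gamma_{max} + \beta_{max})$; summing over $n$ tokens and two sublayers per block adds at most $2\sqrt{nd}(\gamma_{max} + \beta_{max})$ to $\|X\|_F$ per block. The sketch below assumes unit residual step size, reads $MA$ as the mean absolute entry, takes $\gamma_{max}$ and $\beta_{max}$ as the largest absolute entries of $\gamma$ and $\beta$, and uses random toy sublayers in place of attention and feed-forward modules. All names are hypothetical.

```python
import numpy as np

def layer_norm(x, gamma, beta):
    """LayerNorm over the last axis (population variance, eps = 0)."""
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return gamma * (x - mu) / sd + beta

def peri_ln_forward(x0, depth, gamma, beta, rng):
    """Run `depth` blocks of two Peri-LN sublayers; return mean |x| after each block."""
    n, d = x0.shape
    x = x0
    mean_abs = [np.abs(x).mean()]
    for _ in range(depth):
        for _ in range(2):  # stand-ins for the attention and feed-forward sublayers
            w = rng.normal(size=(d, d)) / np.sqrt(d)      # toy random weights
            h = np.tanh(layer_norm(x, gamma, beta) @ w)   # input LN, then module
            x = x + layer_norm(h, gamma, beta)            # output LN, then residual add
        mean_abs.append(np.abs(x).mean())
    return np.array(mean_abs)

rng = np.random.default_rng(0)
n, d, depth = 4, 16, 12
gamma = rng.uniform(0.5, 1.5, size=d)
beta = rng.normal(scale=0.1, size=d)
x0 = rng.normal(size=(n, d))

mean_abs = peri_ln_forward(x0, depth, gamma, beta, rng)

# Right-hand side of the linear bound, evaluated after k = 0, ..., depth blocks.
k = np.arange(depth + 1)
bound = np.linalg.norm(x0) / np.sqrt(n * d) + 2 * k * (np.abs(gamma).max() + np.abs(beta).max())
print(np.all(mean_abs <= bound))
```

The bound is a worst case: it assumes every update aligns with the current hidden state, so observed growth in a toy run like this is typically well below it.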

Backward Stability Analysis

By the chain rule, the gradient with respect to the parameters $\theta_i$ of block $i$ is: $\nabla_{\theta_i} G(X_D) = \nabla_{\theta_i} X_{i+1} \cdot J_{i:D} \cdot \nabla_{X_D} G(X_D)$

where the Jacobian matrix is: $J_{i:D} = \prod_{j=i+1}^D (I + \nabla_{X_{j-1}} f(X_{j-1}; \theta_{j-1}))$
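To see why the per-layer sensitivity matters, note that the norm of this product is controlled by the norms of its factors. The bound below follows from submultiplicativity and the triangle inequality for any induced matrix norm. It is stated here as a reading aid, not as a result from the paper.

```latex
\|J_{i:D}\| \;\le\; \prod_{j=i+1}^{D} \Big( 1 + \big\| \nabla_{X_{j-1}} f(X_{j-1}; \theta_{j-1}) \big\| \Big)
```

If every sensitivity term is bounded by a constant $a$, the back-propagated factor is at most $(1 + a)^{D-i}$. If the sensitivities instead grow with the activations, this guarantee weakens with every additional layer.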

Proposition 7: Under Pre-LN, the sensitivity $\nabla_{X_{j-1}} f_{Pre}$ grows proportionally with activation values.

Proposition 8: Under Peri-LN, the sensitivity $\nabla_{X_{j-1}} f_{Peri}$ is invariant to activation value magnitude.
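One way to build intuition for the contrast between the two propositions is the positive-scale invariance of layer normalization: $\mathrm{LN}(c\,h) = \mathrm{LN}(h)$ for $c > 0$ when $\epsilon = 0$. The sketch below reads "activation magnitude" as the scale of a sublayer's output, which is an interpretation and not necessarily the exact setting of the paper's proofs. The toy sublayer, the finite-difference Jacobian, and all names are hypothetical.

```python
import numpy as np

def layer_norm(x):
    """Parameter-free LayerNorm on a vector (gamma = 1, beta = 0, eps = 0)."""
    return (x - x.mean()) / x.std()

def jacobian(f, x, h=1e-5):
    """Central finite-difference Jacobian of f at x."""
    cols = []
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        cols.append((f(x + e) - f(x - e)) / (2 * h))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
d = 8
w1 = rng.normal(size=(d, d)) / np.sqrt(d)
w2 = rng.normal(size=(d, d)) / np.sqrt(d)
x = rng.normal(size=d)

def module(u, scale):
    # Toy sublayer; `scale` multiplies its output activations.
    return scale * (w2 @ np.tanh(w1 @ u))

def f_pre(v, scale):
    return module(layer_norm(v), scale)               # input LN only

def f_peri(v, scale):
    return layer_norm(module(layer_norm(v), scale))   # input and output LN

scales = (1.0, 10.0, 100.0)
pre_norms = [np.linalg.norm(jacobian(lambda v: f_pre(v, s), x)) for s in scales]
peri_norms = [np.linalg.norm(jacobian(lambda v: f_peri(v, s), x)) for s in scales]

for s, a, b in zip(scales, pre_norms, peri_norms):
    print(f"scale={s:6.1f}  |grad f_pre|={a:10.4f}  |grad f_peri|={b:10.4f}")
```

In this toy setting the Pre-LN Jacobian norm scales linearly with the activation scale, while the Peri-LN Jacobian norm does not change, because the output LN removes the scale before the residual add.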

Experimental Setup

Datasets

OpenWebText Dataset: Approximately 9 billion training tokens, 4 million validation tokens
Pre-training using GPT-2 series architecture

Model Configuration

GPT-2 (124M parameters)
GPT-2 Large (774M parameters)
GPT-2 XL (1.5B parameters)

Evaluation Metrics

Perplexity
ROUGE Scores (Rouge1, Rouge2, RougeL)
BERT Scores (BertP, BertR, BertF1)
Training Stability: Count of divergent runs

Implementation Details

Hyperparameters tuned for Pre-LN, not separately optimized for Peri-LN
Residual step size scaling: $\Delta t \in \{0.1, 1\}$
Hardware: NVIDIA H200 GPU

Experimental Results

Training Stability Comparison

Layer Normalization Setup	Weight Decay Enabled	Weight Decay Disabled
Pre-LN	1/5 diverged	3/5 diverged
Peri-LN	0/5 diverged	0/5 diverged
No LN	5/5 diverged	—

Performance Comparison Results

GPT-2 (124M) Model Results:

Pre-LN ($\Delta t=1$): Validation loss 5.43, perplexity 247.52
Pre-LN ($\Delta t=0.1$): Validation loss 3.13, perplexity 24.43
Peri-LN ($\Delta t=1$): Validation loss 3.12, perplexity 24.17
Peri-LN ($\Delta t=0.1$): Validation loss 3.10, perplexity 23.63

Hidden State Growth Analysis

Experiments validate theoretical predictions:

Pre-LN exhibits rapid growth at larger $\Delta t$
Peri-LN maintains more regular linear growth
Residual step size scaling effectively controls growth rate

Residual Step Size Scaling Effects

Performance Improvement: Peri-LN + $\Delta t=0.1$ shows best performance across all metrics
Stability Improvement: Pre-LN transitions from unstable to stable at $\Delta t=0.1$
Growth Control: Effectively reduces mean and variance growth rates of hidden states

Layer Normalization Research

Post-LN: Earliest Transformer design, requires fine-grained scheduling
Pre-LN: Improves training stability but produces large activation values
Peri-LN: Recently adopted in large-scale models such as Gemma2, OLMo2

Theoretical Analysis Methods

Existing work primarily focuses on initialization behavior or relies on empirical observations
In contrast, this paper analyzes model properties after training has converged
Continuous-time perspective provides new tools for architecture analysis

Conclusions and Discussion

Main Conclusions

Pre-LN Theoretical Deficiency: The training problem is inherently ill-posed, leading to unbounded solutions
Peri-LN Advantages: Provides well-defined optimization problems and controlled hidden state growth
Residual Scaling Value: Simple and effective method for stability improvement

Limitations

Simplified Assumptions: Theoretical analysis based on continuous-time approximation
Hyperparameter Dependency: Experiments use hyperparameters tuned for Pre-LN
Scale Limitations: Experiments primarily conducted on medium-scale models

Future Directions

Architecture Screening Framework: Provide theoretical screening criteria for new architecture modifications
Larger-Scale Validation: Validate theoretical findings on larger models
Other Normalization Methods: Extend analysis to variants such as RMSNorm

In-Depth Evaluation

Strengths

Strong Theoretical Innovation: First to analyze layer normalization placement using optimal control theory
Mathematical Rigor: Provides complete theoretical derivations and proofs
High Practical Value: Residual step size scaling method is simple and effective
Reasonable Experimental Design: Validates theory across multiple model scales

Weaknesses

Theory-Practice Gap: Continuous-time assumptions differ from actual discrete implementations
Limited Experimental Scope: Primarily validated on GPT-2 series, lacks validation across more architectures
Hyperparameter Fairness: Peri-LN not separately optimized for hyperparameters

Impact Assessment

Academic Contribution: Provides new theoretical framework for Transformer stability analysis
Practical Value: Guides actual model design and training strategies
Reproducibility: Commits to releasing code and models

Applicable Scenarios

Deep Transformer Training: Particularly suitable for large-scale deep models
Architecture Design Guidance: Provides theoretical basis for new architecture modifications
Training Stability Improvement: Enhances training stability through residual scaling

References

The paper cites multiple important works, including:

Ba et al. (2016): Original Layer Normalization paper
Xiong et al. (2020): Pre-LN vs Post-LN comparative study
Kim et al. (2025): Empirical study of Peri-LN
He et al. (2016): Pioneering work on residual connections

Overall Assessment: This paper combines theory and practice, providing a new mathematical framework for Transformer stability analysis. The theoretical analysis is rigorous, the experiments on GPT-2 series models are consistent with its predictions, and the framework offers useful guidance for deep learning architecture design.