Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay

Many deployed learning systems must update models on streaming data under memory constraints. The default strategy, sequential fine-tuning on each new phase, is architecture-agnostic but often suffers catastrophic forgetting when later phases correspond to different sub-populations or tasks. Replay with a finite buffer is a simple alternative, yet its behaviour across generative and predictive objectives is not well understood. We present a unified study of stateful replay for streaming autoencoding, time series forecasting, and classification. We view both sequential fine-tuning and replay as stochastic gradient methods for an ideal joint objective, and use a gradient alignment analysis to show when mixing current and historical samples should reduce forgetting. We then evaluate a single replay mechanism on six streaming scenarios built from Rotated MNIST, ElectricityLoadDiagrams 2011-2014, and Airlines delay data, using matched training budgets and three seeds. On heterogeneous multi-task streams, replay reduces average forgetting by a factor of two to three, while on benign time-based streams both methods perform similarly. These results position stateful replay as a strong and simple baseline for continual learning in streaming environments.

Basic Information

Paper ID: 2511.17936
Title: Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay
Author: Du Wenzhang (Mahanakorn University of Technology)
Classification: cs.LG (Machine Learning), stat.ML (Statistics: Machine Learning)
Submission Date: November 22, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.17936

Abstract

This paper studies catastrophic forgetting in streaming learning through a unified study of stateful replay. In memory-constrained streaming scenarios, sequential fine-tuning (SeqFT) is architecture-agnostic but suffers severe catastrophic forgetting when later stages correspond to different subpopulations or tasks. The authors cast reconstruction, prediction, and classification as negative log-likelihood minimization and use a gradient alignment analysis to show when mixing current and historical samples reduces forgetting. Experiments on six streaming scenarios built from three public datasets (Rotated MNIST, ElectricityLoadDiagrams, Airlines) show that on heterogeneous multi-task streams replay reduces average forgetting by a factor of two to three, while on benign time-based streams both methods perform similarly.

Research Background and Motivation

1. Core Problem

Real-world deployed learning systems often need to update models on streaming data while facing strict memory constraints. Typical applications include:

Power suppliers recording long-term load curves
Airlines recording individual flight data
Perception pipelines observing continuous image and signal streams

These systems typically employ Sequential Fine-Tuning (SeqFT): training sequentially on data from each stage. While simple and architecture-agnostic, this approach suffers from catastrophic forgetting—when subsequent stages correspond to different subpopulations, label subsets, or tasks, gradients from new stages overwrite parameters useful for early stages.

2. Problem Importance

Specificity of generative tasks: Once an autoencoder or forecaster can no longer reconstruct historical patterns, its outputs no longer reflect the system's history
Practical deployment requirements: Streaming systems must continuously learn under limited memory without re-accessing complete historical data
Insufficient theoretical understanding: While replay with limited buffers is a simple continual learning mechanism, its behavior across different objective functions and stream types remains insufficiently understood

3. Limitations of Existing Methods

Complex continual learning methods: While approaches based on parameter importance regularization, knowledge distillation, and generative replay exist, they introduce additional complexity and hyperparameter tuning costs
Inconsistent empirical reports: Replay provides large benefits on some benchmarks but appears unnecessary on others
Lack of unified framework: Behavioral differences between generative vs. predictive tasks and heterogeneous vs. stationary streams have not been systematically studied

4. Research Motivation

This paper deliberately focuses on the simplest mechanism—stateful replay with fixed-capacity buffers—to systematically answer two fundamental questions:

(i) When is replay memory theoretically justified and practically necessary in streaming learning?
(ii) How does its effectiveness differ between generative vs. predictive tasks and heterogeneous vs. near-stationary streams?

Core Contributions

Unified streaming learning formalization: Casts autoencoding, forecasting, and classification as negative log-likelihood minimization over stage-wise data distributions, and defines a stage-wise forgetting measure that applies across metrics
Gradient alignment theory for replay: Interprets SeqFT and Replay as stochastic gradient methods for an ideal joint objective, and shows that, to first order, when the current-stage gradient conflicts with a past stage while the historical mixture does not, replay turns "forgetting steps" into benign updates by mixing current and historical gradients
Mixed benchmarks and transparent logging: Constructs 6 streaming scenarios (spanning 3 datasets) with recorded initial and final metrics for all stages, supporting reproducible analysis
Empirical characterization: Under matched training budgets, Replay significantly reduces catastrophic forgetting on truly interfering streams (digit pairs, airline groups), while behaving similarly to SeqFT on mild temporal streams

Methodology Details

Task Definition

Streaming generative formalization:

Observe T stages t = 1, ..., T
Each stage associated with distribution P_t and finite samples D_t = {(x_i^(t), y_i^(t))}
Loss for model f_θ: ℓ(f_θ(x), y) = -log q_θ(y|x)

Unified representation of three task types:

Reconstruction (RotMNIST): y = x, q_θ is Gaussian with mean f_θ(x), evaluated with MSE
Prediction (Electricity): x is historical window, y is next timestep, evaluated with MSE
Classification (RotMNIST, Airlines): y ∈ {1,...,C}, q_θ is softmax, evaluated with accuracy but trained with cross-entropy
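
The link between the Gaussian likelihood and the MSE metric in the reconstruction and prediction cases can be made explicit. The identity below is standard; it assumes a fixed isotropic variance σ² and output dimension d, neither of which is stated in this summary:

```latex
-\log q_\theta(y \mid x)
  = -\log \mathcal{N}\!\left(y;\, f_\theta(x),\, \sigma^2 I_d\right)
  = \frac{1}{2\sigma^2}\,\lVert y - f_\theta(x)\rVert_2^2
    + \frac{d}{2}\,\log\!\left(2\pi\sigma^2\right)
```

With σ held fixed, the second term does not depend on θ, so minimizing this negative log-likelihood is equivalent to minimizing squared error.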

Risk definition:

Stage t population risk: R_t(θ) = E_{(x,y)~P_t}ℓ(f_θ(x), y)
Ideal joint risk: R_joint(θ) = (1/T) ∑_{t=1}^{T} R_t(θ)

Stage-wise Forgetting Metric

For each stage k, distinguish:

Initial performance: Risk on validation set after training stage k: R̂_k(θ_k)
Final performance: Risk after training all T stages: R̂_k(θ_T)

Forgetting definition:

F_k = R̂_k(θ_T) - R̂_k(θ_k)  (loss metrics)
F_k = s_k^init - s_k^final   (accuracy metrics)

F_k > 0 indicates forgetting, F_k < 0 indicates positive backward transfer.
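
The sign convention above is easy to get wrong when mixing loss and accuracy metrics, so a small helper can be useful. This is a minimal sketch; the function name and the `higher_is_better` flag are illustrative and do not come from the paper, and the numbers are toy values:

```python
def forgetting(initial: float, final: float, higher_is_better: bool) -> float:
    """Stage-wise forgetting F_k; positive values mean the stage got worse."""
    if higher_is_better:
        # Accuracy-style metric: F_k = s_init - s_final
        return initial - final
    # Loss-style metric (e.g. MSE): F_k = R_final - R_init
    return final - initial


# Toy accuracy values (percent): accuracy dropped, so F_k is positive.
f_acc = forgetting(initial=90.0, final=60.0, higher_is_better=True)

# Toy MSE values: the loss went down, so F_k is negative (backward transfer).
f_mse = forgetting(initial=0.20, final=0.15, higher_is_better=False)
```

With this convention, a positive value indicates forgetting regardless of the metric type.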

Comparison of Two Methods

1. Sequential Fine-Tuning (SeqFT)

Process stages sequentially
Run mini-batch SGD at stage t on the empirical risk R̂_t(θ) = (1/n_t) ∑_{(x,y) ∈ D_t} ℓ(f_θ(x), y), where n_t = |D_t|
Start from θ_{t-1}, produce θ_t
Update: θ ← θ - η_t g̃_t(θ), where g̃_t is the mini-batch gradient estimate

2. Stateful Replay

Maintain an episodic buffer B with capacity C storing historical samples
After training stage t, insert a subset of D_t into B, evicting the oldest entries when the buffer is full (described as reservoir-sampling style)
For stage t > 1, each update uses a mixed mini-batch:
- Draw b samples from D_t
- Draw b samples from buffer B
Expected gradient: g_t^rep(θ) = (1-λ)∇R_t(θ) + λ∇R_B^(t)(θ)
λ ≈ 0.5 is the fraction of each mini-batch drawn from the buffer (equal counts b from each source)
Stage t begins with state (θ_{t-1}, B_{t-1}), hence "stateful"
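
The buffer and mixed mini-batch described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: the class and function names are hypothetical, eviction is plain oldest-first via `deque(maxlen=...)`, and the batch is parameterized by a total size and a replay ratio rather than by two equal draws:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity episodic buffer with oldest-first (FIFO) eviction.

    Assumption: FIFO eviction is one reading of the description above.
    """

    def __init__(self, capacity: int, seed: int = 0):
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def insert_stage(self, stage_data, n_keep: int):
        """After finishing a stage, store a random subset of its samples."""
        n_keep = min(n_keep, len(stage_data))
        for sample in self.rng.sample(list(stage_data), n_keep):
            self.items.append(sample)

    def sample(self, n: int):
        """Draw up to n stored samples uniformly without replacement."""
        n = min(n, len(self.items))
        return self.rng.sample(list(self.items), n)


def mixed_batch(stage_data, buffer, batch_size, replay_ratio, rng):
    """Build one mini-batch with about replay_ratio of it from the buffer."""
    n_replay = int(round(replay_ratio * batch_size)) if len(buffer.items) else 0
    replayed = buffer.sample(n_replay)
    n_current = min(batch_size - len(replayed), len(stage_data))
    current = rng.sample(list(stage_data), n_current)
    return current + replayed


# Toy usage: stage 1 fills the buffer, stage 2 trains on mixed batches.
rng = random.Random(0)
stage1 = [("s1_%d" % i, 0) for i in range(50)]
stage2 = [("s2_%d" % i, 1) for i in range(50)]
buf = ReplayBuffer(capacity=1000)
buf.insert_stage(stage1, n_keep=20)
batch = mixed_batch(stage2, buf, batch_size=8, replay_ratio=0.5, rng=rng)
```

During the first stage the buffer is empty, so `mixed_batch` falls back to a batch drawn entirely from the current stage, which matches the SeqFT update.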

Gradient Alignment Theory Analysis

One-step forgetting and alignment: For a past stage k < t and a parameter update θ' = θ - ηd with step size η and direction d, a first-order expansion gives:

R_k(θ') ≈ R_k(θ) - η⟨∇R_k(θ), d⟩

Key observations:

In SeqFT: d ≈ ∇R_t(θ)
Define cosine similarity: cos φ_{k,t}(θ) = ⟨∇R_k, ∇R_t⟩/(||∇R_k|| ||∇R_t||)
cos φ_{k,t} > 0: Stage t step also reduces R_k (positive backward transfer)
cos φ_{k,t} < 0: Gradient conflict, training stage t increases R_k (local forgetting)
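
The two cases above can be checked numerically on toy gradient vectors. The sketch below uses hand-picked 2-D vectors as stand-ins for ∇R_k and ∇R_t; the helper names are illustrative only:

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def first_order_change(grad_k, direction, lr):
    """Predicted change in R_k after the step theta' = theta - lr * direction."""
    return -lr * dot(grad_k, direction)


grad_k = [1.0, 0.0]            # toy gradient of a past stage k
grad_t_aligned = [1.0, 1.0]    # current-stage gradient, positive cosine
grad_t_conflict = [-1.0, 1.0]  # current-stage gradient, negative cosine

# Aligned case: the SeqFT step also lowers R_k to first order.
change_aligned = first_order_change(grad_k, grad_t_aligned, lr=0.1)

# Conflicting case: the SeqFT step raises R_k to first order (local forgetting).
change_conflict = first_order_change(grad_k, grad_t_conflict, lr=0.1)
```

The sign of the predicted change is the opposite of the sign of the cosine, which is the observation the analysis builds on.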

Gradient mixing in Replay: Assume the buffer approximates the historical mixture: ∇R_B^(t)(θ) ≈ ḡ_{<t}(θ) = (1/(t-1)) ∑_{j=1}^{t-1} ∇R_j(θ)

Define mixed direction: d^rep = (1-λ)∇R_t(θ) + λḡ_{<t}(θ)

Proposition 1 (Alignment condition): Assume:

(i) Conflict with current stage: ⟨∇R_k, ∇R_t⟩ < 0
(ii) Historical mixture benign: ⟨∇R_k, ḡ_{<t}⟩ ≥ 0

Then there exists λ* ∈ (0, 1] such that for all λ ∈ [λ*, 1]:

⟨∇R_k, d^rep⟩ ≥ 0

i.e., the first-order change in R_k under a Replay step is non-positive.

Proof sketch: Let h(λ) = ⟨∇R_k, (1-λ)∇R_t + λḡ_{<t}⟩

By (i): h(0) < 0
By (ii): h(1) ≥ 0
h is affine in λ, so it has a root λ* ∈ (0, 1] (with λ* = 1 only in the boundary case h(1) = 0)
For λ ≥ λ*, h(λ) ≥ 0

Intuitive explanation: When current stage gradient conflicts with past stages while historical mixture is benign for that stage, Replay can flip forgetting steps into non-forgetting steps. This precisely characterizes RotMNIST digit pair and airline group streams.
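
Because h is affine, the threshold in Proposition 1 has a closed form: solving h(λ*) = 0 gives λ* = -h(0) / (h(1) - h(0)). The sketch below evaluates it on toy vectors; the function names and the vectors are illustrative assumptions, not values from the paper:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def replay_threshold(grad_k, grad_t, grad_hist):
    """Smallest replay ratio at which the mixed step stops hurting stage k."""
    h0 = dot(grad_k, grad_t)     # assumption (i) requires h0 < 0
    h1 = dot(grad_k, grad_hist)  # assumption (ii) requires h1 >= 0
    if not (h0 < 0 <= h1):
        raise ValueError("assumptions (i) and (ii) do not hold")
    return -h0 / (h1 - h0)


def h(lam, grad_k, grad_t, grad_hist):
    """Inner product of grad_k with the mixed direction d^rep."""
    mixed = [(1 - lam) * a + lam * b for a, b in zip(grad_t, grad_hist)]
    return dot(grad_k, mixed)


grad_k = [1.0, 0.0]      # toy gradient of past stage k
grad_t = [-1.0, 1.0]     # conflicts with stage k: h(0) = -1
grad_hist = [3.0, 0.5]   # benign historical mixture: h(1) = 3

lam_star = replay_threshold(grad_k, grad_t, grad_hist)  # -(-1) / (3 - (-1)) = 0.25
```

In this toy case any replay ratio of at least 0.25 yields a non-forgetting step, so a ratio of 0.5 would be sufficient; whether that holds on a real stream depends on the actual gradients.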

Finite buffer approximation:

Single loss gradient bound: ||∇_θ ℓ(f_θ(x), y)|| ≤ G
Standard concentration bounds show that, with high probability, the buffer gradient deviates from ḡ_{<t} by O(G/√C)
In experiments C ~ 10³, approximation error small, Replay robust
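
To get a feel for the O(G/√C) rate, one can tabulate 1/√C for a few capacities. This only shows the scaling relative to G; the constant hidden in the O(·) is not given in this summary, so the values are not actual error bounds:

```python
import math


def relative_buffer_error(capacity: int) -> float:
    """Scale of the buffer-gradient deviation relative to G (constants dropped)."""
    return 1.0 / math.sqrt(capacity)


# Capacities chosen for illustration; C ~ 10^3 is the setting reported above.
scales = {c: relative_buffer_error(c) for c in (100, 1000, 10000)}
# scales[1000] is about 0.032, i.e. roughly 3% of G up to constants.
```

Growing the buffer tenfold shrinks this scale only by a factor of about 3.2, which is one reason a modest buffer can already be adequate.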

Experimental Setup

Datasets

1. Rotated MNIST (RotMNIST)

Source: MNIST rotated variant, 28×28 grayscale digits
Stage division: 5 stages, digit pair grouping: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}
Tasks:
- Reconstruction: Convolutional autoencoder
- Classification: Shared encoder + linear classification head (always predict all 10 digits, creating strong stage interference)
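
The digit-pair stage division can be expressed as a simple grouping by label. A minimal sketch, assuming samples are (input, label) pairs; the function name and the toy samples are hypothetical:

```python
# The five digit-pair stages listed above.
DIGIT_PAIR_STAGES = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9)]


def split_by_label_groups(samples, groups):
    """Assign each (x, label) sample to the stage whose group holds its label."""
    label_to_stage = {
        label: stage for stage, group in enumerate(groups) for label in group
    }
    stages = [[] for _ in groups]
    for x, y in samples:
        stages[label_to_stage[y]].append((x, y))
    return stages


# Toy stand-in for the dataset: 20 placeholder inputs with labels 0..9.
samples = [("img%d" % i, i % 10) for i in range(20)]
stages = split_by_label_groups(samples, DIGIT_PAIR_STAGES)
```

Since the classifier always predicts over all 10 digits, each stage sees only two of the ten labels, which is what creates the strong interference noted above.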

2. Electricity

Source: ElectricityLoadDiagrams2011-2014, hourly loads from 370 customers
Preprocessing: Normalized, sliding windows of length 96, predict next step
Stage division:
- time: 5 consecutive time periods
- meters: 5 disjoint customer groups (each group spans complete time range)
Task: One-step prediction with MSE
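
The windowing step and the time-based split can be sketched as follows. This is an illustration for a single series under assumptions: the function names are hypothetical, normalization is omitted, and the paper's exact handling of stage boundaries is not specified here:

```python
def make_windows(series, window=96):
    """Turn one series into (history window, next value) training pairs."""
    pairs = []
    for start in range(len(series) - window):
        x = series[start:start + window]  # the `window` most recent values
        y = series[start + window]        # the value right after the window
        pairs.append((x, y))
    return pairs


def split_consecutive(items, n_stages=5):
    """Cut an ordered list into n_stages consecutive chunks (time split)."""
    size = len(items) // n_stages
    stages = [items[i * size:(i + 1) * size] for i in range(n_stages - 1)]
    stages.append(items[(n_stages - 1) * size:])  # last stage takes the remainder
    return stages


# Toy series of 200 points standing in for one customer's hourly load.
series = [float(i) for i in range(200)]
pairs = make_windows(series, window=96)
time_stages = split_consecutive(pairs, n_stages=5)
```

For the meters scenario, the same windowing would be applied per customer, with customers rather than time ranges partitioned into five groups.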

3. Airlines

Source: Over 500,000 flights with features including carrier ID, origin/destination airports, day of week, scheduled departure time, duration
Label: Binary delay indicator
Stage division:
- time: 5 time slices
- airline_group: 5 carrier groups (with different delay patterns)
Task: Delay prediction (binary classification)

Model Architecture

RotMNIST: CNN encoder-decoder (reconstruction) + linear classification head (classification)
Electricity: Small 1D CNN/GRU predictor
Airlines: 3-layer MLP with normalized tabular feature inputs
Implementation: PyTorch, Adam optimizer, batch size 128-256

Training Protocol

Number of stages: 5 stages for all scenarios
Hyperparameters: Fixed epochs per stage and learning rate per dataset-scenario (based on preliminary tuning)
Fair comparison: SeqFT and Replay use identical training budget (same epochs and learning rate)
Replay configuration:
- Buffer size: C ~ 10³
- Replay ratio: λ ≈ 0.5
Random seeds: {13, 21, 42}, each method and scenario run 3 times

Evaluation Metrics

Classification tasks: Accuracy, trained with cross-entropy
Reconstruction/Prediction tasks: Mean Squared Error (MSE)
Forgetting metric: F_k = initial - final for accuracy, and F_k = final - initial for MSE, so F_k > 0 always indicates forgetting

Logging

For each method, seed, and stage k record:

Initial metric (on validation set after training stage k)
Final metric (on same validation set after training all stages)
Dataset, scenario, method identifiers

All logs stored in single structured file for generating all tables and figures.
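
One possible shape for such a log is a JSON-lines file with one record per method, seed, and stage. The sketch below is an assumption about the format, not the paper's actual schema; the field names and toy values are hypothetical, and an in-memory stream stands in for the file:

```python
import io
import json
from dataclasses import dataclass, asdict
from statistics import mean


@dataclass
class StageLog:
    dataset: str
    scenario: str
    method: str
    seed: int
    stage: int
    initial: float          # metric right after training this stage
    final: float            # metric after training all stages
    higher_is_better: bool  # True for accuracy, False for MSE


def write_logs(records, fh):
    for rec in records:
        fh.write(json.dumps(asdict(rec)) + "\n")


def read_logs(fh):
    return [StageLog(**json.loads(line)) for line in fh if line.strip()]


def average_forgetting(records, method):
    """Mean stage-wise forgetting for one method; positive means forgetting."""
    vals = [
        (r.initial - r.final) if r.higher_is_better else (r.final - r.initial)
        for r in records
        if r.method == method
    ]
    return mean(vals)


# Toy records, not results from the paper.
records = [
    StageLog("toy", "label_groups", "seqft", 13, 1, 90.0, 50.0, True),
    StageLog("toy", "label_groups", "seqft", 13, 2, 80.0, 80.0, True),
    StageLog("toy", "label_groups", "replay", 13, 1, 90.0, 80.0, True),
    StageLog("toy", "label_groups", "replay", 13, 2, 80.0, 80.0, True),
]
stream = io.StringIO()
write_logs(records, stream)
stream.seek(0)
loaded = read_logs(stream)
```

Keeping initial and final metrics in the same record means every table and figure can be regenerated from the one file without rerunning training.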

Experimental Results

Main Results

1. RotMNIST Digit Pair Classification

Figure 1 and Table 2 show:

SeqFT severe forgetting:
- Stage 1: Initial 99.4%, Final 41.3%, Forgetting 58.0 percentage points
- Stage 3: Initial 89.8%, Final 21.5%, Forgetting 68.3 percentage points
- Average forgetting: F̄ = 35.2 ± 28.2
Replay significant improvement:
- Stage 1: Initial 99.4%, Final 95.2%, Forgetting only 4.2 percentage points
- Stage 3: Initial 83.6%, Final 51.2%, Forgetting 32.4 percentage points
- Average forgetting: F̄ = 11.7 ± 13.2
- Average forgetting reduced by a factor of ~3 (35.2 vs. 11.7)
Final stage (Stage 5) shows no forgetting for both methods (trained last)

2. Airlines Airline Group Classification

Figure 2 and Table 3 show:

SeqFT forgetting pattern:
- Stage 1: Initial 71.6%, Final 35.3%, Forgetting 36.4 percentage points
- Stage 4: Initial 63.7%, Final 54.0%, Forgetting 9.7 percentage points
- Average forgetting: F̄ = 10.0 ± 15.2
Replay improvement:
- Stage 1: Initial 71.7%, Final 53.6%, Forgetting 18.0 percentage points (halved)
- Stage 4: Initial 63.0%, Final 62.1%, Forgetting 0.8 percentage points
- Average forgetting: F̄ = 3.8 ± 8.0
- Average forgetting reduced by a factor of ~2.6 (10.0 vs. 3.8)
Stages 2 and 3 even show negative forgetting (positive transfer)

3. Airlines Time-Split Classification

Both methods perform similarly:
- SeqFT average forgetting: F̄ = -1.5 ± 3.4
- Replay average forgetting: F̄ = -1.0 ± 2.0
- Both slightly negative, suggesting a mild regularization effect from subsequent stages rather than forgetting

4. Electricity Prediction

Figure 3 shows:

Both time division and customer group division show:
- SeqFT and Replay initial/final MSE curves nearly overlap
- Many cases show final MSE slightly lower than initial (positive transfer)
- Forgetting negligible or slightly negative
Explanation: These streams resemble non-stationary single-task training, cross-stage gradients essentially aligned

5. RotMNIST Reconstruction

Digit pair reconstruction shows SeqFT and Replay often exhibit negative forgetting
Reason: Strong shared structure between digit pairs, subsequent stages act as additional regularization rather than conflicting tasks

Aggregated Forgetting Analysis

Table 4 and Figure 4 summarize classification tasks:

Dataset	Division	Method	Average Forgetting F̄ (percentage points)
RotMNIST	digits_pairs	SeqFT	35.2 ± 28.2
RotMNIST	digits_pairs	Replay	11.7 ± 13.2
Airlines	time	SeqFT	-1.5 ± 3.4
Airlines	time	Replay	-1.0 ± 2.0
Airlines	airline_group	SeqFT	10.0 ± 15.2
Airlines	airline_group	Replay	3.8 ± 8.0

Key findings:

Heterogeneous multi-task streams (digit pairs, airline groups): SeqFT shows large positive forgetting; Replay reduces average forgetting F̄ by a factor of roughly 2.6 to 3
Mild temporal streams: Average forgetting near zero, both methods behave similarly, Replay acts only as mild regularizer
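
The reduction factors quoted for the two heterogeneous streams follow directly from the reported averages in the table above. A short check, using only those reported means:

```python
# Reported average forgetting (SeqFT, Replay) in percentage points.
reported = {
    "RotMNIST digits_pairs": (35.2, 11.7),
    "Airlines airline_group": (10.0, 3.8),
}

# Ratio of SeqFT to Replay average forgetting for each stream.
factors = {name: seqft / replay for name, (seqft, replay) in reported.items()}
# RotMNIST comes out near 3.0 and Airlines near 2.6.
```

A ratio is not meaningful for the Airlines time split, where both averages are slightly negative and close to zero.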

Ablation and Case Analysis

While the paper does not explicitly conduct ablation experiments, cross-scenario comparisons implicitly validate:

Implicit verification of buffer size:

Buffer size C ~ 10³ effective across all scenarios
Section 3.3 theory gives an O(G/√C) approximation error; for C = 1000, 1/√C ≈ 0.03, i.e. roughly 3% of the gradient bound G up to constants

Choice of replay ratio λ:

Paper uses λ ≈ 0.5
Proposition 1 shows need for λ ≥ λ*, λ=0.5 sufficient in practice

Natural ablation of stream types:

Heterogeneous streams (strong task interference) vs. temporal streams (mild drift)
Clearly demonstrates when Replay is necessary vs. optional

Related Work

1. Catastrophic Forgetting Research

Classical work: McCloskey & Cohen (1989) first identified sequential learning problem in connectionist networks
Deep learning era: Goodfellow et al. (2014) empirical study of gradient-based neural networks
Surveys: Parisi et al. (2019) comprehensive review of continual lifelong learning

2. Continual Learning Methods Classification

Parameter importance regularization:

EWC (Kirkpatrick et al., 2017): Weight regularization based on Fisher information matrix
SI (Zenke et al., 2017): Continual learning via synaptic intelligence

Knowledge distillation:

LwF (Li & Hoiem, 2018): Learning without forgetting

Generative replay:

DGR (Shin et al., 2017): Deep generative replay

Episodic memory/replay:

Lin (1992): Experience replay in reinforcement learning
GEM (Lopez-Paz & Ranzato, 2017): Gradient episodic memory
Selective experience replay (Isele & Cosgun, 2018)

3. Streaming Data Mining

Gama et al. (2014): Survey on concept drift adaptation
MOA framework (Bifet et al., 2010): Massive online analysis

4. Paper Positioning

Compared to complex methods: Paper focuses on simplest replay mechanism as strong baseline
Unified perspective: First to uniformly handle generative (reconstruction, prediction) and discriminative (classification) tasks
Theoretical contribution: Gradient alignment analysis provides concise theoretical explanation
Empirical systematicity: Consistent evaluation across multiple task types and stream types

Conclusions and Discussion

Main Conclusions

Theoretical insight: Through gradient alignment analysis, stateful replay transforms forgetting steps into benign updates by mixing historical and current gradients when gradient conflicts exist
Empirical dichotomy:
- Heterogeneous multi-task streams: Replay significantly reduces catastrophic forgetting (by a factor of 2-3)
- Mild temporal streams: Replay behaves similarly to SeqFT, forgetting negligible
Method positioning: Stateful replay is a powerful, interpretable, well-documented baseline for streaming continual learning
Practical recommendations:
- For truly interfering task streams (different subpopulations, label subsets), replay is necessary
- For mild temporal drift, SeqFT may suffice
- Simple fixed-capacity buffer (C ~ 10³) and balanced mixing (λ ~ 0.5) suffice for effectiveness

Limitations

Model scale: Experiments use relatively small models (CNN, small MLP)
- Effectiveness on large-scale Transformers not verified
- Relationship between buffer size and model scale not explored
Buffer strategy:
- Uses simple reservoir sampling and FIFO eviction
- More sophisticated sampling strategies (e.g., gradient importance-based) not explored
Theoretical analysis:
- Gradient alignment analysis based on first-order approximation
- No complete non-asymptotic theory or convergence guarantees
- Non-convexity of deep networks insufficiently addressed
Stream type coverage:
- Primarily considers 5-stage streams
- Longer sequences or continuous drift scenarios not tested
- Intra-stage distribution changes not addressed
Computational cost:
- Training time and memory overhead not reported
- Additional storage and sampling costs of Replay not quantified
Hyperparameter sensitivity:
- λ and C selection based on empirical experience
- Sensitivity not systematically studied

Future Directions

Paper explicitly proposes:

More principled buffer construction and sampling strategies:
- Gradient diversity-based sampling
- Adaptive buffer sizing
Combination with parameter regularization methods:
- Replay + EWC
- Replay + knowledge distillation
Extension to larger architectures and multimodal streams:
- Vision Transformers
- Multimodal streaming learning
Real-world resource constraints:
- Edge device deployment
- Communication-limited scenarios

In-Depth Evaluation

Strengths

1. Clear theoretical contribution

Gradient alignment perspective is concise and elegant, providing intuitive explanation
Proposition 1 formalizes conditions for replay effectiveness
Connects optimization theory with continual learning practice

2. Rigorous experimental design

Fair comparison: Matched training budgets, identical hyperparameters
Diverse scenarios: 6 scenarios across 3 datasets, covering generative and discriminative tasks
Sufficient repetition: 3 random seeds, reporting means and standard deviations
Transparent logging: Commits to releasing complete logs and code

3. Practical problem setting

Addresses real deployment scenarios (memory-constrained, streaming data)
Unified framework handling multiple task types
Simple mechanism easy to implement and deploy

4. In-depth result interpretation

Clearly distinguishes different behaviors between heterogeneous and temporal streams
Connects experimental observations with theoretical predictions
Per-stage analysis provides fine-grained insights

5. Clear writing

Well-organized structure, clear motivation
Consistent mathematical notation, clear definitions
Effective figure design conveying information

Weaknesses

1. Limited theoretical analysis

Only first-order approximation, not considering higher-order terms and non-convexity
Lacks quantitative bounds on convergence rate or sample complexity
Condition (ii) in Proposition 1 ("historical mixture benign") is assumed rather than verified; how to ensure it in practice is not discussed

2. Limited experimental scale

Relatively simple models (small CNN, MLP)
Classic but not large-scale datasets
Current popular large models or Transformers not addressed

3. Insufficient buffer design exploration

Fixed C ~ 10³ lacks systematic tuning
Different sampling strategies (uniform vs. importance sampling) not compared
Buffer update strategies (FIFO vs. others) not ablated

4. Computational cost not reported

Training time, memory consumption not quantified
Additional overhead of Replay not weighed against benefits
Feasibility analysis for practical deployment insufficient

5. Missing comparison with complex methods

Only compared with SeqFT, not with EWC, GEM, etc.
Cannot assess cost-effectiveness of simple replay relative to complex methods
Paper claims "strong baseline" but lacks direct comparison with other baselines

6. Limited stream type coverage

Only 5-stage streams, longer sequences not tested
Clear stage boundaries, gradual drift not simulated
Intra-stage distribution changes not considered

Impact

Contribution to the field:

Theory: Gradient alignment perspective provides new analytical tool for continual learning
Empirics: Systematic benchmark provides reference point for subsequent research
Practice: Simple effective method lowers deployment threshold

Practical value:

Streaming systems (power, transportation, finance) can directly apply
Lightweight solution for continual learning on edge devices
No architecture modification needed, easy integration into existing systems

Reproducibility:

Uses public datasets
Commits to releasing code and logs
Detailed experimental setup description
Explicit random seeds

Potential impact:

Establishes simple strong baseline for streaming learning
Inspires gradient analysis-based continual learning methods
Advances research on continual learning for generative tasks

Applicable Scenarios

Strongly recommended scenarios:

Heterogeneous multi-task streams:
- Recommendation systems for different customer groups
- Quality inspection systems for multi-brand products
- Multi-language NLP tasks
Memory-constrained environments:
- Edge devices (IoT, mobile)
- Embedded systems
- Real-time processing pipelines
Need to retain historical capability:
- Generative models (need to reconstruct historical patterns)
- Multi-task services (need to support multiple request types simultaneously)
- Long-term deployment systems

Use-with-caution scenarios:

Mild temporal drift:
- Near-stationary time series prediction
- Slowly evolving distributions
- SeqFT may suffice here
Extreme resource constraints:
- Cannot maintain buffer (C < 100)
- Sampling overhead unacceptable
Requiring theoretical guarantees:
- Safety-critical applications
- Paper's first-order analysis may be insufficient

Extension directions:

Combine with parameter regularization for improved performance
Adaptive buffer management
Combination with knowledge distillation
Extension to continual fine-tuning of pre-trained large models

Selected References

Goodfellow et al. (2014): An empirical investigation of catastrophic forgetting - Pioneering empirical study of catastrophic forgetting
Kirkpatrick et al. (2017): Elastic Weight Consolidation (EWC) - Representative work on parameter importance regularization
Lopez-Paz & Ranzato (2017): Gradient Episodic Memory (GEM) - Gradient constraint-based continual learning
Parisi et al. (2019): Continual lifelong learning with neural networks - Continual learning survey
Gama et al. (2014): A survey on concept drift adaptation - Concept drift adaptation survey

Overall Assessment: This is a solid continual learning paper that addresses catastrophic forgetting in streaming scenarios through concise theoretical analysis and systematic experimental evaluation. Its main value lies in (1) a unified task formalization, (2) a clear gradient alignment analysis, and (3) a systematic evaluation across task types and stream types. Despite limitations in model scale, theoretical depth, and method comparison, its positioning as a "strong baseline" is reasonable. For researchers and engineers deploying continual learning systems in resource-constrained environments, the paper offers useful guidance and a reference implementation.