Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay

Du Wenzhang
Many deployed learning systems must update models on streaming data under memory constraints. The default strategy, sequential fine-tuning on each new phase, is architecture-agnostic but often suffers catastrophic forgetting when later phases correspond to different sub-populations or tasks. Replay with a finite buffer is a simple alternative, yet its behaviour across generative and predictive objectives is not well understood. We present a unified study of stateful replay for streaming autoencoding, time series forecasting, and classification. We view both sequential fine-tuning and replay as stochastic gradient methods for an ideal joint objective, and use a gradient alignment analysis to show when mixing current and historical samples should reduce forgetting. We then evaluate a single replay mechanism on six streaming scenarios built from Rotated MNIST, ElectricityLoadDiagrams 2011-2014, and Airlines delay data, using matched training budgets and three seeds. On heterogeneous multi task streams, replay reduces average forgetting by a factor of two to three, while on benign time based streams both methods perform similarly. These results position stateful replay as a strong and simple baseline for continual learning in streaming environments.

Basic Information

  • Paper ID: 2511.17936
  • Title: Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay
  • Author: Du Wenzhang (Mahanakorn University of Technology)
  • Classification: cs.LG (Machine Learning), stat.ML (Statistics: Machine Learning)
  • Submission Date: November 22, 2025 (arXiv)
  • Paper Link: https://arxiv.org/abs/2511.17936

Abstract

This paper addresses catastrophic forgetting in streaming learning through a unified study of stateful replay. Under memory-constrained streaming, sequential fine-tuning (SeqFT) is architecture-agnostic but suffers severe catastrophic forgetting when later stages correspond to different subpopulations or tasks. The authors cast reconstruction, prediction, and classification as negative log-likelihood minimization and use a gradient alignment analysis to show when mixing current and historical samples reduces forgetting. Experiments on six streaming scenarios across three public datasets (Rotated MNIST, ElectricityLoadDiagrams, Airlines) show that on heterogeneous multi-task streams replay reduces average forgetting by a factor of 2-3, while on mild temporal streams the two methods perform similarly.

Research Background and Motivation

1. Core Problem

Real-world deployed learning systems often need to update models on streaming data while facing strict memory constraints. Typical applications include:

  • Power suppliers recording long-term load curves
  • Airlines recording individual flight data
  • Perception pipelines observing continuous image and signal streams

These systems typically employ Sequential Fine-Tuning (SeqFT): training sequentially on data from each stage. While simple and architecture-agnostic, this approach suffers from catastrophic forgetting—when subsequent stages correspond to different subpopulations, label subsets, or tasks, gradients from new stages overwrite parameters useful for early stages.

2. Problem Importance

  • Specificity of generative tasks: Once an autoencoder or predictor can no longer reconstruct historical patterns, its outputs stop reflecting the system's history
  • Practical deployment requirements: Streaming systems must continuously learn under limited memory without re-accessing complete historical data
  • Insufficient theoretical understanding: While replay with limited buffers is a simple continual learning mechanism, its behavior across different objective functions and stream types remains insufficiently understood

3. Limitations of Existing Methods

  • Complex continual learning methods: While approaches based on parameter importance regularization, knowledge distillation, and generative replay exist, they introduce additional complexity and hyperparameter tuning costs
  • Inconsistent empirical reports: Replay provides huge benefits on some benchmarks but appears unnecessary on others
  • Lack of unified framework: Behavioral differences between generative vs. predictive tasks and heterogeneous vs. stationary streams have not been systematically studied

4. Research Motivation

This paper deliberately focuses on the simplest mechanism—stateful replay with fixed-capacity buffers—to systematically answer two fundamental questions:

  • (i) When is replay memory theoretically justified and practically necessary in streaming learning?
  • (ii) How does its effectiveness differ between generative vs. predictive tasks and heterogeneous vs. near-stationary streams?

Core Contributions

  1. Unified streaming learning formalization: Represents autoencoders, prediction, and classification as negative log-likelihood minimization over stage-wise data distributions, defining stage-wise forgetting functions applicable across metrics
  2. Gradient alignment theory for replay: Interprets SeqFT and Replay as stochastic gradient methods for an ideal joint objective, proving that when gradient conflicts exist, replay transforms "forgetting steps" into benign updates by mixing current and historical gradients
  3. Mixed benchmarks and transparent logging: Constructs 6 streaming scenarios (spanning 3 datasets) with recorded initial and final metrics for all stages, supporting reproducible analysis
  4. Empirical characterization: Under matched training budgets, Replay significantly reduces catastrophic forgetting on truly interfering streams (digit pairs, airline groups), while behaving similarly to SeqFT on mild temporal streams

Methodology Details

Task Definition

Streaming generative formalization:

  • Observe T stages t = 1, ..., T
  • Each stage associated with distribution P_t and finite samples D_t = {(x_i^(t), y_i^(t))}
  • Model f_θ with per-sample loss ℓ(f_θ(x), y) = -log q_θ(y|x)

Unified representation of three task types:

  1. Reconstruction (RotMNIST): y = x, q_θ is Gaussian with mean f_θ(x), evaluated with MSE
  2. Prediction (Electricity): x is historical window, y is next timestep, evaluated with MSE
  3. Classification (RotMNIST, Airlines): y ∈ {1,...,C}, q_θ is softmax, evaluated with accuracy but trained with cross-entropy
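To make the unified negative log-likelihood view concrete, the sketch below evaluates ℓ = -log q_θ(y|x) for the two likelihood families listed above. It assumes a unit-variance Gaussian (the variance is not stated here), and the helper names are illustrative, not taken from the paper. Under that assumption the Gaussian loss differs from a scaled squared error only by a constant, which is why MSE is a natural evaluation metric for the reconstruction and prediction tasks.

```python
import math

def gaussian_nll(pred, target, sigma=1.0):
    """-log N(target | pred, sigma^2 I) for equal-length lists of floats."""
    d = len(pred)
    sq_err = sum((p - t) ** 2 for p, t in zip(pred, target))
    return sq_err / (2 * sigma ** 2) + 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def softmax_nll(logits, label):
    """Cross-entropy: -log softmax(logits)[label], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

def mse(pred, target):
    """Mean squared error over the output dimensions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy values (made up): with sigma = 1, NLL = (d / 2) * MSE + constant.
pred, target = [0.0, 1.0], [1.0, 3.0]
const = 0.5 * len(pred) * math.log(2 * math.pi)
gap = gaussian_nll(pred, target) - (0.5 * len(pred) * mse(pred, target) + const)
```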

Risk definition:

  • Stage t population risk: R_t(θ) = E_{(x,y)~P_t}ℓ(f_θ(x), y)
  • Ideal joint risk: R_joint(θ) = (1/T) ∑_{t=1}^{T} R_t(θ)

Stage-wise Forgetting Metric

For each stage k, distinguish:

  • Initial performance: Risk on validation set after training stage k: R̂_k(θ_k)
  • Final performance: Risk after training all T stages: R̂_k(θ_T)

Forgetting definition:

F_k = R̂_k(θ_T) - R̂_k(θ_k)  (loss metrics)
F_k = s_k^init - s_k^final   (accuracy metrics)

F_k > 0 indicates forgetting, F_k < 0 indicates positive backward transfer.

Comparison of Two Methods

1. Sequential Fine-Tuning (SeqFT)

  • Process stages sequentially
  • Run mini-batch SGD at stage t on the empirical risk R̂_t(θ) = (1/n_t) ∑_{(x,y)∈D_t} ℓ(f_θ(x), y), where n_t = |D_t|
  • Start from θ_{t-1}, produce θ_t
  • Update: θ ← θ - η_t g̃_t(θ), where g̃_t is mini-batch gradient estimate

2. Stateful Replay

  • Maintain an episodic buffer B with capacity C that stores historical samples
  • After training stage t, insert subset of D_t into B, evict oldest entries (reservoir sampling style)
  • For stage t > 1, each update uses mixed mini-batch:
    • Draw b samples from D_t (b is the per-source mini-batch size, distinct from the buffer B)
    • Draw b samples from buffer B
  • Expected gradient: g_t^rep(θ) = (1-λ)∇R_t(θ) + λ∇R_B^(t)(θ)
  • λ ≈ 0.5 is buffer sample ratio
  • Stage t begins with state (θ_{t-1}, B_{t-1}), hence "stateful"
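A minimal sketch of the buffer and the mixed mini-batch described above. The class and function names are hypothetical. Eviction here is plain oldest-first via a bounded deque, which is one reading of "evict oldest entries"; the paper's exact insertion and eviction rule may differ.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer; appending past capacity drops the oldest item."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def add_stage(self, stage_data, n_insert, rng):
        # Store a random subset of the stage that just finished training.
        for sample in rng.sample(stage_data, min(n_insert, len(stage_data))):
            self.items.append(sample)

    def sample(self, n, rng):
        return rng.sample(list(self.items), min(n, len(self.items)))

def mixed_batch(stage_data, buffer, b, rng):
    """b current samples plus up to b buffered ones (replay ratio near 0.5)."""
    current = rng.sample(stage_data, min(b, len(stage_data)))
    return current + buffer.sample(b, rng)

# Toy stream: each sample is (input placeholder, stage tag).
rng = random.Random(0)
buf = ReplayBuffer(capacity=4)
stage1 = [("x%d" % i, 1) for i in range(10)]
stage2 = [("z%d" % i, 2) for i in range(10)]
buf.add_stage(stage1, n_insert=3, rng=rng)       # state carried into stage 2
batch = mixed_batch(stage2, buf, b=3, rng=rng)   # 3 current + 3 replayed
buf.add_stage(stage2, n_insert=3, rng=rng)       # oldest stage-1 items evicted
```

The pair (model parameters, buffer contents) is what gets handed from one stage to the next, which is the sense in which the procedure is stateful.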

Gradient Alignment Theory Analysis

One-step forgetting and alignment: For past stage k < t, parameter update θ' = θ - ηd, first-order expansion:

R_k(θ') ≈ R_k(θ) - η⟨∇R_k(θ), d⟩

Key observations:

  • In SeqFT: d ≈ ∇R_t(θ)
  • Define cosine similarity: cos φ_{k,t}(θ) = ⟨∇R_k, ∇R_t⟩/(||∇R_k|| ||∇R_t||)
  • cos φ_{k,t} > 0: Stage t step also reduces R_k (positive backward transfer)
  • cos φ_{k,t} < 0: Gradient conflict, training stage t increases R_k (local forgetting)
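The two quantities above are easy to compute once gradients are available as flat vectors. The sketch below uses made-up two-dimensional gradients and hypothetical helper names; it only illustrates the sign relationship between the cosine and the first-order change in R_k.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity between two non-zero gradient vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def first_order_change(grad_k, direction, lr):
    """Approximate R_k(theta - lr * direction) - R_k(theta)."""
    return -lr * dot(grad_k, direction)

# Toy gradients (made up): the stage-t gradient conflicts with past stage k.
grad_k = [1.0, 0.0]
grad_t = [-1.0, 1.0]
cos_kt = cosine(grad_k, grad_t)                     # negative: conflict
delta = first_order_change(grad_k, grad_t, lr=0.1)  # positive: R_k increases
```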

Gradient mixing in Replay: Assume buffer approximates historical mixture: ∇R_B^(t)(θ) ≈ ḡ_{<t}(θ) = (1/(t-1)) ∑_{j=1}^{t-1} ∇R_j(θ)

Define mixed direction: d^rep = (1-λ)∇R_t(θ) + λḡ_{<t}(θ)

Proposition 1 (Alignment condition): Assume:

  • (i) Conflict with current stage: ⟨∇R_k, ∇R_t⟩ < 0
  • (ii) Historical mixture benign: ⟨∇R_k, ḡ_{<t}⟩ ≥ 0

Then there exists λ* ∈ (0,1] such that for all λ ∈ [λ*, 1]:

⟨∇R_k, d^rep⟩ ≥ 0

i.e., first-order change in R_k under Replay step is non-positive.

Proof sketch: Let h(λ) = ⟨∇R_k, (1-λ)∇R_t + λḡ_{<t}⟩

  • By (i): h(0) < 0
  • By (ii): h(1) ≥ 0
  • h is affine in λ, so it has a unique root λ* = -h(0)/(h(1) - h(0)) ∈ (0,1]
  • For λ ≥ λ*, h(λ) ≥ 0
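Because h is affine, the threshold can be written in closed form. Writing a = ⟨∇R_k, ∇R_t⟩ < 0 and b = ⟨∇R_k, ḡ_{<t}⟩ ≥ 0 (shorthand introduced here, not the paper's notation), h(λ) = (1-λ)a + λb vanishes at λ* = -a/(b - a), which equals 1 exactly when b = 0. A small numerical check with made-up inner products:

```python
def lambda_star(a, b):
    """Root of h(lam) = (1 - lam) * a + lam * b, given a < 0 <= b."""
    if not (a < 0 <= b):
        raise ValueError("requires a < 0 <= b")
    return -a / (b - a)

def h(lam, a, b):
    """Inner product of grad R_k with the mixed replay direction."""
    return (1 - lam) * a + lam * b

# Made-up values: a = <grad R_k, grad R_t>, b = <grad R_k, g_bar>.
a, b = -1.0, 3.0
lam = lambda_star(a, b)
at_root = h(lam, a, b)       # zero at the threshold
at_half = h(0.5, a, b)       # non-negative, since 0.5 >= lam here
edge = lambda_star(-1.0, 0.0)  # b = 0 pushes the threshold to 1
```

The stronger the conflict relative to the benefit from the historical mixture (large |a|, small b), the closer the threshold moves to 1, so a fixed λ near 0.5 is not guaranteed to clear it in every configuration.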

Intuitive explanation: When current stage gradient conflicts with past stages while historical mixture is benign for that stage, Replay can flip forgetting steps into non-forgetting steps. This precisely characterizes RotMNIST digit pair and airline group streams.
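A one-parameter toy makes the mechanism visible. The sketch below is not one of the paper's experiments: it assumes two quadratic stage risks with opposite minimisers, exact gradients instead of mini-batches, and a buffer that reproduces the stage-1 gradient perfectly. SeqFT ends at the stage-2 minimiser, while replay with λ = 0.5 settles at the minimiser of the joint risk and forgets less.

```python
def grad(theta, target):
    """Gradient of the quadratic risk 0.5 * (theta - target)**2."""
    return theta - target

def risk(theta, target):
    return 0.5 * (theta - target) ** 2

def train(theta, cur_target, past_target=None, lam=0.0, lr=0.1, steps=200):
    """Gradient descent on the current stage, optionally mixed with replay."""
    for _ in range(steps):
        d = grad(theta, cur_target)
        if past_target is not None:
            d = (1 - lam) * d + lam * grad(theta, past_target)
        theta -= lr * d
    return theta

theta_1 = train(0.0, cur_target=1.0)              # stage 1, minimiser at +1
risk_1_init = risk(theta_1, 1.0)

theta_seq = train(theta_1, cur_target=-1.0)       # SeqFT on stage 2
theta_rep = train(theta_1, cur_target=-1.0, past_target=1.0, lam=0.5)

forget_seq = risk(theta_seq, 1.0) - risk_1_init   # stage-1 forgetting, SeqFT
forget_rep = risk(theta_rep, 1.0) - risk_1_init   # stage-1 forgetting, Replay
```

In this toy, replay does not remove forgetting entirely: stage-1 risk still rises because the joint optimum sits between the two stage optima, but the increase is a quarter of the SeqFT value.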

Finite buffer approximation:

  • Single loss gradient bound: ||∇_θ ℓ(f_θ(x), y)|| ≤ G
  • Standard concentration bounds show: buffer gradient deviates from ḡ_{<t} by at most O(G/√C)
  • In experiments C ~ 10³, approximation error small, Replay robust
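The 1/√C scaling can be checked with a small simulation. The sketch below is illustrative only: it replaces per-sample gradients with bounded scalars (G = 1), draws buffers uniformly without replacement, and uses hypothetical names. Increasing the capacity 16-fold should shrink the average deviation by roughly a factor of 4.

```python
import random
import statistics

def mean_abs_deviation(population, capacity, trials, rng):
    """Average |buffer mean - population mean| over random buffers."""
    target = statistics.fmean(population)
    devs = []
    for _ in range(trials):
        buf = rng.sample(population, capacity)
        devs.append(abs(statistics.fmean(buf) - target))
    return statistics.fmean(devs)

rng = random.Random(0)
# Scalar stand-ins for per-sample gradients, bounded by G = 1.
population = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
dev_small = mean_abs_deviation(population, 100, 200, rng)
dev_large = mean_abs_deviation(population, 1600, 200, rng)
ratio = dev_small / dev_large  # expected to be in the neighbourhood of 4
```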

Experimental Setup

Datasets

1. Rotated MNIST (RotMNIST)

  • Source: MNIST rotated variant, 28×28 grayscale digits
  • Stage division: 5 stages, digit pair grouping: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}
  • Tasks:
    • Reconstruction: Convolutional autoencoder
    • Classification: Shared encoder + linear classification head (always predict all 10 digits, creating strong stage interference)
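The digit-pair staging above amounts to integer division of the label by two. A minimal sketch with placeholder inputs instead of real images (function names are hypothetical):

```python
def digit_pair_stage(label):
    """Map a digit label 0..9 to its stage index 1..5."""
    if not 0 <= label <= 9:
        raise ValueError("label must be a digit 0..9")
    return label // 2 + 1

def split_into_stages(samples):
    """Group (x, label) pairs by stage; returns {stage: [samples]}."""
    stages = {s: [] for s in range(1, 6)}
    for x, label in samples:
        stages[digit_pair_stage(label)].append((x, label))
    return stages

# Placeholder inputs standing in for rotated 28x28 images.
toy = [("img%d" % d, d) for d in range(10)]
stages = split_into_stages(toy)
```

Since the classification head always scores all 10 digits, a stage that only ever sees two labels pushes the other eight logits down, which is one plausible reason this split interferes so strongly.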

2. Electricity

  • Source: ElectricityLoadDiagrams2011-2014, hourly loads from 370 customers
  • Preprocessing: Normalized, sliding windows of length 96, predict next step
  • Stage division:
    • time: 5 consecutive time periods
    • meters: 5 disjoint customer groups (each group spans complete time range)
  • Task: One-step prediction with MSE
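The windowing step can be sketched as follows. The summary only says the data are "normalized", so the z-score used here is an assumption, as are the function names and the synthetic series.

```python
def make_windows(series, window=96):
    """Return (history, next_value) pairs from a 1-D list of loads."""
    pairs = []
    for start in range(len(series) - window):
        history = series[start:start + window]
        pairs.append((history, series[start + window]))
    return pairs

def zscore(series):
    """Standardise one series (assumed normalisation scheme)."""
    mean = sum(series) / len(series)
    var = sum((v - mean) ** 2 for v in series) / len(series)
    std = var ** 0.5 or 1.0  # fall back to 1.0 on a constant series
    return [(v - mean) / std for v in series]

loads = [float(i % 24) for i in range(200)]  # synthetic daily pattern
normalised = zscore(loads)
pairs = make_windows(normalised)
```

For the time split, the windows would be assigned to stages by their position in time; for the meters split, by which customer group the series belongs to.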

3. Airlines

  • Source: Over 500,000 flights with features including carrier ID, origin/destination airports, day of week, scheduled departure time, duration
  • Label: Binary delay indicator
  • Stage division:
    • time: 5 time slices
    • airline_group: 5 carrier groups (with different delay patterns)
  • Task: Delay prediction (binary classification)

Model Architecture

  • RotMNIST: CNN encoder-decoder (reconstruction) + linear classification head (classification)
  • Electricity: Small 1D CNN/GRU predictor
  • Airlines: 3-layer MLP with normalized tabular feature inputs
  • Implementation: PyTorch, Adam optimizer, batch size 128-256

Training Protocol

  • Number of stages: 5 stages for all scenarios
  • Hyperparameters: Fixed epochs per stage and learning rate per dataset-scenario (based on preliminary tuning)
  • Fair comparison: SeqFT and Replay use identical training budget (same epochs and learning rate)
  • Replay configuration:
    • Buffer size: C ~ 10³
    • Replay ratio: λ ≈ 0.5
  • Random seeds: {13, 21, 42}, each method and scenario run 3 times
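The protocol above implies a small, fully enumerable run grid. The sketch below lists it under one reading of this summary (two RotMNIST tasks on the digit-pair split, two Electricity splits, two Airlines splits); the dictionary keys and task labels are hypothetical, and C = 1000 stands in for "C ~ 10³".

```python
from itertools import product

SCENARIOS = [
    ("RotMNIST", "digits_pairs", "reconstruction"),
    ("RotMNIST", "digits_pairs", "classification"),
    ("Electricity", "time", "forecasting"),
    ("Electricity", "meters", "forecasting"),
    ("Airlines", "time", "classification"),
    ("Airlines", "airline_group", "classification"),
]
METHODS = ["SeqFT", "Replay"]
SEEDS = [13, 21, 42]

runs = [
    {
        "dataset": d, "split": s, "task": t, "method": m, "seed": seed,
        "num_stages": 5,
        # Replay-only settings; SeqFT keeps no buffer.
        "buffer_capacity": 1000 if m == "Replay" else 0,
        "replay_ratio": 0.5 if m == "Replay" else 0.0,
    }
    for (d, s, t), m, seed in product(SCENARIOS, METHODS, SEEDS)
]
```

Epochs and learning rate are deliberately left out: they are fixed per dataset-scenario and shared by both methods, but their values are not given here.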

Evaluation Metrics

  • Classification tasks: Accuracy, trained with cross-entropy
  • Reconstruction/Prediction tasks: Mean Squared Error (MSE)
  • Forgetting metric: F_k = initial metric - final metric for accuracy; F_k = final metric - initial metric for MSE, so F_k > 0 always means forgetting

Logging

For each method, seed, and stage k record:

  • Initial metric (on validation set after training stage k)
  • Final metric (on same validation set after training all stages)
  • Dataset, scenario, method identifiers

All logs stored in single structured file for generating all tables and figures.
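One way such a log could be turned into per-scenario forgetting summaries is sketched below. The record fields are hypothetical, the records are invented, and the choice of sample standard deviation over all stage and seed entries is an assumption; how the paper computes its ± values is not stated here.

```python
import statistics

def record_forgetting(rec):
    """F_k from one log record; the sign flips for loss-type metrics."""
    diff = rec["initial"] - rec["final"]
    return diff if rec["higher_is_better"] else -diff

def summarize(records):
    """Mean and sample std of forgetting per (dataset, scenario, method)."""
    groups = {}
    for rec in records:
        key = (rec["dataset"], rec["scenario"], rec["method"])
        groups.setdefault(key, []).append(record_forgetting(rec))
    return {
        k: (statistics.fmean(v), statistics.stdev(v) if len(v) > 1 else 0.0)
        for k, v in groups.items()
    }

# Invented records, not values from the paper.
logs = [
    {"dataset": "toy", "scenario": "s", "method": "SeqFT", "seed": 13,
     "stage": 1, "initial": 0.9, "final": 0.5, "higher_is_better": True},
    {"dataset": "toy", "scenario": "s", "method": "SeqFT", "seed": 13,
     "stage": 2, "initial": 0.8, "final": 0.8, "higher_is_better": True},
]
summary = summarize(logs)
mean_f, std_f = summary[("toy", "s", "SeqFT")]
```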

Experimental Results

Main Results

1. RotMNIST Digit Pair Classification

Figure 1 and Table 2 show:

  • SeqFT severe forgetting:
    • Stage 1: Initial 99.4%, Final 41.3%, Forgetting 58.0 percentage points
    • Stage 3: Initial 89.8%, Final 21.5%, Forgetting 68.3 percentage points
    • Average forgetting: F̄ = 35.2 ± 28.2
  • Replay significant improvement:
    • Stage 1: Initial 99.4%, Final 95.2%, Forgetting only 4.2 percentage points
    • Stage 3: Initial 83.6%, Final 51.2%, Forgetting 32.4 percentage points
    • Average forgetting: F̄ = 11.7 ± 13.2
    • Average forgetting reduced by a factor of ~3 (35.2 → 11.7)
  • Final stage (Stage 5) shows no forgetting for both methods (trained last)

2. Airlines Airline Group Classification

Figure 2 and Table 3 show:

  • SeqFT forgetting pattern:
    • Stage 1: Initial 71.6%, Final 35.3%, Forgetting 36.4 percentage points
    • Stage 4: Initial 63.7%, Final 54.0%, Forgetting 9.7 percentage points
    • Average forgetting: F̄ = 10.0 ± 15.2
  • Replay improvement:
    • Stage 1: Initial 71.7%, Final 53.6%, Forgetting 18.0 percentage points (halved)
    • Stage 4: Initial 63.0%, Final 62.1%, Forgetting 0.8 percentage points
    • Average forgetting: F̄ = 3.8 ± 8.0
    • Average forgetting reduced by a factor of ~2.6 (10.0 → 3.8)
  • Stages 2 and 3 even show negative forgetting (positive transfer)

3. Airlines Time-Split Classification

  • Both methods perform similarly:
    • SeqFT average forgetting: F̄ = -1.5 ± 3.4
    • Replay average forgetting: F̄ = -1.0 ± 2.0
    • Both slightly negative, indicating regularization effect from subsequent stages

4. Electricity Prediction

Figure 3 shows:

  • Both time division and customer group division show:
    • SeqFT and Replay initial/final MSE curves nearly overlap
    • Many cases show final MSE slightly lower than initial (positive transfer)
    • Forgetting negligible or slightly negative
  • Explanation: These streams resemble non-stationary single-task training, cross-stage gradients essentially aligned

5. RotMNIST Reconstruction

  • Digit pair reconstruction shows SeqFT and Replay often exhibit negative forgetting
  • Reason: Strong shared structure between digit pairs, subsequent stages act as additional regularization rather than conflicting tasks

Aggregated Forgetting Analysis

Table 4 and Figure 4 summarize classification tasks:

Dataset  | Division      | Method | Average Forgetting F̄ (percentage points)
RotMNIST | digits_pairs  | SeqFT  | 35.2 ± 28.2
RotMNIST | digits_pairs  | Replay | 11.7 ± 13.2
Airlines | time          | SeqFT  | -1.5 ± 3.4
Airlines | time          | Replay | -1.0 ± 2.0
Airlines | airline_group | SeqFT  | 10.0 ± 15.2
Airlines | airline_group | Replay | 3.8 ± 8.0

Key findings:

  1. Heterogeneous multi-task streams (digit pairs, airline groups): SeqFT shows large positive forgetting, Replay reduces F̄ by a factor of ~2-3
  2. Mild temporal streams: Average forgetting near zero, both methods behave similarly, Replay acts only as mild regularizer
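The "factor of ~2-3" follows directly from the averages in the table above. A quick arithmetic check, using only those reported means (the dictionary layout is just for illustration):

```python
# Average forgetting in percentage points, copied from the table above.
avg_forgetting = {
    ("RotMNIST", "digits_pairs"): {"SeqFT": 35.2, "Replay": 11.7},
    ("Airlines", "airline_group"): {"SeqFT": 10.0, "Replay": 3.8},
}

# Ratio of SeqFT to Replay forgetting, rounded to one decimal place.
reduction = {
    key: round(vals["SeqFT"] / vals["Replay"], 1)
    for key, vals in avg_forgetting.items()
}
```

This gives about 3.0 for the RotMNIST digit-pair stream and about 2.6 for the airline-group stream. The ratio is not meaningful for the Airlines time split, where both averages are slightly negative.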

Ablation and Case Analysis

While the paper does not explicitly conduct ablation experiments, cross-scenario comparisons implicitly validate:

Implicit verification of buffer size:

  • Buffer size C ~ 10³ effective across all scenarios
  • Section 3.3 theory shows O(G/√C) approximation error; for C = 1000, 1/√C ≈ 0.03, i.e. roughly 3% of G up to constant factors

Choice of replay ratio λ:

  • Paper uses λ ≈ 0.5
  • Proposition 1 shows need for λ ≥ λ*, λ=0.5 sufficient in practice

Natural ablation of stream types:

  • Heterogeneous streams (strong task interference) vs. temporal streams (mild drift)
  • Clearly demonstrates when Replay is necessary vs. optional

Related Work

1. Catastrophic Forgetting Research

  • Classical work: McCloskey & Cohen (1989) first identified sequential learning problem in connectionist networks
  • Deep learning era: Goodfellow et al. (2014) empirical study of gradient-based neural networks
  • Surveys: Parisi et al. (2019) comprehensive review of continual lifelong learning

2. Continual Learning Methods Classification

Parameter importance regularization:

  • EWC (Kirkpatrick et al., 2017): Weight regularization based on Fisher information matrix
  • SI (Zenke et al., 2017): Continual learning via synaptic intelligence

Knowledge distillation:

  • LwF (Li & Hoiem, 2018): Learning without forgetting

Generative replay:

  • DGR (Shin et al., 2017): Deep generative replay

Episodic memory/replay:

  • Lin (1992): Experience replay in reinforcement learning
  • GEM (Lopez-Paz & Ranzato, 2017): Gradient episodic memory
  • Selective experience replay (Isele & Cosgun, 2018)

3. Streaming Data Mining

  • Gama et al. (2014): Survey on concept drift adaptation
  • MOA framework (Bifet et al., 2010): Massive online analysis

4. Paper Positioning

  • Compared to complex methods: Paper focuses on simplest replay mechanism as strong baseline
  • Unified perspective: First to uniformly handle generative (reconstruction, prediction) and discriminative (classification) tasks
  • Theoretical contribution: Gradient alignment analysis provides concise theoretical explanation
  • Empirical systematicity: Consistent evaluation across multiple task types and stream types

Conclusions and Discussion

Main Conclusions

  1. Theoretical insight: Through gradient alignment analysis, stateful replay transforms forgetting steps into benign updates by mixing historical and current gradients when gradient conflicts exist
  2. Empirical dichotomy:
    • Heterogeneous multi-task streams: Replay significantly reduces catastrophic forgetting (by a factor of 2-3)
    • Mild temporal streams: Replay behaves similarly to SeqFT, forgetting negligible
  3. Method positioning: Stateful replay is a powerful, interpretable, well-documented baseline for streaming continual learning
  4. Practical recommendations:
    • For truly interfering task streams (different subpopulations, label subsets), replay is necessary
    • For mild temporal drift, SeqFT may suffice
    • Simple fixed-capacity buffer (C ~ 10³) and balanced mixing (λ ~ 0.5) suffice for effectiveness

Limitations

  1. Model scale: Experiments use relatively small models (CNN, small MLP)
    • Effectiveness on large-scale Transformers not verified
    • Relationship between buffer size and model scale not explored
  2. Buffer strategy:
    • Uses simple reservoir sampling and FIFO eviction
    • More sophisticated sampling strategies (e.g., gradient importance-based) not explored
  3. Theoretical analysis:
    • Gradient alignment analysis based on first-order approximation
    • No complete non-asymptotic theory or convergence guarantees
    • Non-convexity of deep networks insufficiently addressed
  4. Stream type coverage:
    • Primarily considers 5-stage streams
    • Longer sequences or continuous drift scenarios not tested
    • Intra-stage distribution changes not addressed
  5. Computational cost:
    • Training time and memory overhead not reported
    • Additional storage and sampling costs of Replay not quantified
  6. Hyperparameter sensitivity:
    • λ and C selection based on empirical experience
    • Sensitivity not systematically studied

Future Directions

Paper explicitly proposes:

  1. More principled buffer construction and sampling strategies:
    • Gradient diversity-based sampling
    • Adaptive buffer sizing
  2. Combination with parameter regularization methods:
    • Replay + EWC
    • Replay + knowledge distillation
  3. Extension to larger architectures and multimodal streams:
    • Vision Transformers
    • Multimodal streaming learning
  4. Real-world resource constraints:
    • Edge device deployment
    • Communication-limited scenarios

In-Depth Evaluation

Strengths

1. Clear theoretical contribution

  • Gradient alignment perspective is concise and elegant, providing intuitive explanation
  • Proposition 1 formalizes conditions for replay effectiveness
  • Connects optimization theory with continual learning practice

2. Rigorous experimental design

  • Fair comparison: Matched training budgets, identical hyperparameters
  • Diverse scenarios: 6 scenarios across 3 datasets, covering generative and discriminative tasks
  • Sufficient repetition: 3 random seeds, reporting means and standard deviations
  • Transparent logging: Commits to releasing complete logs and code

3. Practical problem setting

  • Addresses real deployment scenarios (memory-constrained, streaming data)
  • Unified framework handling multiple task types
  • Simple mechanism easy to implement and deploy

4. In-depth result interpretation

  • Clearly distinguishes different behaviors between heterogeneous and temporal streams
  • Connects experimental observations with theoretical predictions
  • Per-stage analysis provides fine-grained insights

5. Clear writing

  • Well-organized structure, clear motivation
  • Consistent mathematical notation, clear definitions
  • Effective figure design conveying information

Weaknesses

1. Limited theoretical analysis

  • Only first-order approximation, not considering higher-order terms and non-convexity
  • Lacks quantitative bounds on convergence rate or sample complexity
  • Condition (ii) in Proposition 1 "historical mixture benign" and how to ensure it in practice not discussed

2. Limited experimental scale

  • Relatively simple models (small CNN, MLP)
  • Classic but not large-scale datasets
  • Current popular large models or Transformers not addressed

3. Insufficient buffer design exploration

  • Fixed C ~ 10³ lacks systematic tuning
  • Different sampling strategies (uniform vs. importance sampling) not compared
  • Buffer update strategies (FIFO vs. others) not ablated

4. Computational cost not reported

  • Training time, memory consumption not quantified
  • Additional overhead of Replay not weighed against benefits
  • Feasibility analysis for practical deployment insufficient

5. Missing comparison with complex methods

  • Only compared with SeqFT, not with EWC, GEM, etc.
  • Cannot assess cost-effectiveness of simple replay relative to complex methods
  • Paper claims "strong baseline" but lacks direct comparison with other baselines

6. Limited stream type coverage

  • Only 5-stage streams, longer sequences not tested
  • Clear stage boundaries, gradual drift not simulated
  • Intra-stage distribution changes not considered

Impact

Contribution to the field:

  • Theory: Gradient alignment perspective provides new analytical tool for continual learning
  • Empirics: Systematic benchmark provides reference point for subsequent research
  • Practice: Simple effective method lowers deployment threshold

Practical value:

  • Streaming systems (power, transportation, finance) can directly apply
  • Lightweight solution for continual learning on edge devices
  • No architecture modification needed, easy integration into existing systems

Reproducibility:

  • Uses public datasets
  • Commits to releasing code and logs
  • Detailed experimental setup description
  • Explicit random seeds

Potential impact:

  • Establishes simple strong baseline for streaming learning
  • Inspires gradient analysis-based continual learning methods
  • Advances research on continual learning for generative tasks

Applicable Scenarios

Strongly recommended scenarios:

  1. Heterogeneous multi-task streams:
    • Recommendation systems for different customer groups
    • Quality inspection systems for multi-brand products
    • Multi-language NLP tasks
  2. Memory-constrained environments:
    • Edge devices (IoT, mobile)
    • Embedded systems
    • Real-time processing pipelines
  3. Need to retain historical capability:
    • Generative models (need reconstruct historical patterns)
    • Multi-task services (need support multiple request types simultaneously)
    • Long-term deployment systems

Use with caution scenarios:

  1. Mild temporal drift:
    • Stationary time series prediction
    • Slowly evolving distributions
    • SeqFT may suffice here
  2. Extreme resource constraints:
    • Cannot maintain buffer (C < 100)
    • Sampling overhead unacceptable
  3. Requiring theoretical guarantees:
    • Safety-critical applications
    • Paper's first-order analysis may be insufficient

Extension directions:

  • Combine with parameter regularization for improved performance
  • Adaptive buffer management
  • Combination with knowledge distillation
  • Extension to continual fine-tuning of pre-trained large models

Selected References

  1. Goodfellow et al. (2014): An empirical investigation of catastrophic forgetting - Pioneering empirical study of catastrophic forgetting
  2. Kirkpatrick et al. (2017): Elastic Weight Consolidation (EWC) - Representative work on parameter importance regularization
  3. Lopez-Paz & Ranzato (2017): Gradient Episodic Memory (GEM) - Gradient constraint-based continual learning
  4. Parisi et al. (2019): Continual lifelong learning with neural networks - Continual learning survey
  5. Gama et al. (2014): A survey on concept drift adaptation - Concept drift adaptation survey

Overall Assessment: This is a solid continual learning research paper that provides a practical solution to catastrophic forgetting in streaming learning scenarios through concise theoretical analysis and systematic experimental evaluation. The paper's main value lies in: (1) unified task formalization framework; (2) clear gradient alignment theory; (3) systematic evaluation across task types and stream types. While having limitations in model scale, theoretical depth, and method comparison, the positioning as a "strong baseline" is reasonable. For researchers and engineers needing to deploy continual learning systems in resource-constrained environments, this paper provides valuable guidance and reference implementation.