Mitigating Catastrophic Forgetting in Streaming Generative and Predictive Learning via Stateful Replay
Du
Many deployed learning systems must update models on streaming data under memory constraints. The default strategy, sequential fine-tuning on each new phase, is architecture-agnostic but often suffers catastrophic forgetting when later phases correspond to different sub-populations or tasks. Replay with a finite buffer is a simple alternative, yet its behaviour across generative and predictive objectives is not well understood. We present a unified study of stateful replay for streaming autoencoding, time series forecasting, and classification. We view both sequential fine-tuning and replay as stochastic gradient methods for an ideal joint objective, and use a gradient alignment analysis to show when mixing current and historical samples should reduce forgetting. We then evaluate a single replay mechanism on six streaming scenarios built from Rotated MNIST, ElectricityLoadDiagrams 2011-2014, and Airlines delay data, using matched training budgets and three seeds. On heterogeneous multi-task streams, replay reduces average forgetting by a factor of two to three, while on benign time-based streams both methods perform similarly. These results position stateful replay as a strong and simple baseline for continual learning in streaming environments.
academic
This paper addresses catastrophic forgetting in streaming learning through a unified study of stateful replay. Under memory constraints, sequential fine-tuning (SeqFT) is architecture-agnostic but often suffers catastrophic forgetting when later stages correspond to different sub-populations or tasks. The authors cast reconstruction, forecasting, and classification as negative log-likelihood minimization and use a gradient alignment analysis to show when mixing current and historical samples should reduce forgetting. Experiments on six streaming scenarios built from three public datasets (Rotated MNIST, ElectricityLoadDiagrams 2011-2014, Airlines delay data), run with matched training budgets and three seeds, show that replay reduces average forgetting by a factor of two to three on heterogeneous multi-task streams, while both methods perform similarly on benign time-based streams.
Real-world deployed learning systems often need to update models on streaming data while facing strict memory constraints. Typical applications include:
Power suppliers recording long-term load curves
Airlines recording individual flight data
Perception pipelines observing continuous image and signal streams
These systems typically employ sequential fine-tuning (SeqFT): training on each stage's data in turn. While simple and architecture-agnostic, this approach suffers from catastrophic forgetting: when later stages correspond to different sub-populations, label subsets, or tasks, gradients from the new stages overwrite parameters that were useful for earlier stages.
Specificity of generative tasks: Once an autoencoder or forecaster can no longer reconstruct historical patterns, its outputs no longer reflect the system's history
Practical deployment requirements: Streaming systems must continuously learn under limited memory without re-accessing complete historical data
Insufficient theoretical understanding: While replay with limited buffers is a simple continual learning mechanism, its behavior across different objective functions and stream types remains insufficiently understood
Complex continual learning methods: While approaches based on parameter importance regularization, knowledge distillation, and generative replay exist, they introduce additional complexity and hyperparameter tuning costs
Inconsistent empirical reports: Replay provides huge benefits on some benchmarks but appears unnecessary on others
Lack of unified framework: Behavioral differences between generative vs. predictive tasks and heterogeneous vs. stationary streams have not been systematically studied
This paper deliberately focuses on the simplest mechanism—stateful replay with fixed-capacity buffers—to systematically answer two fundamental questions:
(i) When is replay memory theoretically justified and practically necessary in streaming learning?
(ii) How does its effectiveness differ between generative vs. predictive tasks and heterogeneous vs. near-stationary streams?
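The mechanism in focus, a fixed-capacity buffer whose state persists across stages, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `ReplayBuffer` class, the `mixed_batch` helper, the reservoir-sampling insertion rule, and the `replay_fraction` parameter are all assumptions introduced here.

```python
import random


class ReplayBuffer:
    """Fixed-capacity buffer that persists across stages (hypothetical helper)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # total number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling (an assumption, not confirmed by the paper):
        # every sample offered so far is retained with equal probability.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def sample(self, n):
        # Draw up to n stored samples without replacement.
        return self.rng.sample(self.items, min(n, len(self.items)))


def mixed_batch(current_batch, buffer, replay_fraction=0.5):
    """Append replayed samples to the current batch, then store the new samples."""
    n_replay = int(len(current_batch) * replay_fraction)
    replayed = buffer.sample(n_replay)
    for s in current_batch:
        buffer.add(s)
    return current_batch + replayed


buf = ReplayBuffer(capacity=4)
stage1 = [("stage1", i) for i in range(10)]
stage2 = [("stage2", i) for i in range(10)]
batch1 = mixed_batch(stage1, buf)  # buffer is empty: current samples only
batch2 = mixed_batch(stage2, buf)  # now also contains replayed stage-1 samples
```

SeqFT corresponds to `replay_fraction=0`, where each batch contains only current-stage samples.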
Unified streaming learning formalization: Casts autoencoding, forecasting, and classification as negative log-likelihood minimization over stage-wise data distributions, and defines stage-wise forgetting functions that apply across metrics
Gradient alignment theory for replay: Interprets SeqFT and Replay as stochastic gradient methods for an ideal joint objective, proving that when gradient conflicts exist, replay transforms "forgetting steps" into benign updates by mixing current and historical gradients
Mixed benchmarks and transparent logging: Constructs 6 streaming scenarios (spanning 3 datasets) with recorded initial and final metrics for all stages, supporting reproducible analysis
Empirical characterization: Under matched training budgets, Replay significantly reduces catastrophic forgetting on truly interfering streams (digit pairs, airline groups), while behaving similarly to SeqFT on benign time-based streams
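Given logged initial and final metrics per stage, a stage-wise forgetting score can be computed along the following lines. The exact definition used in the paper is not reproduced in this summary, so the formula below (final metric minus initial metric, sign-adjusted, averaged over all but the last stage) is an assumption, and the numbers are made up for illustration.

```python
def stage_forgetting(initial, final, lower_is_better=True):
    """Per-stage degradation between the end of a stage and the end of the stream.

    initial[k]: metric on stage k right after training on stage k.
    final[k]:   metric on stage k after training on the whole stream.
    Positive values mean the stage got worse (assumed convention).
    """
    if len(initial) != len(final):
        raise ValueError("initial and final need one entry per stage")
    sign = 1.0 if lower_is_better else -1.0
    return [sign * (f - i) for i, f in zip(initial, final)]


def average_forgetting(initial, final, lower_is_better=True):
    per_stage = stage_forgetting(initial, final, lower_is_better)
    # Assumption: the last stage is excluded, since nothing is trained after it.
    earlier = per_stage[:-1]
    return sum(earlier) / len(earlier) if earlier else 0.0


# Illustrative (made-up) reconstruction errors for a three-stage stream.
mse_initial = [0.5, 0.25, 0.125]
mse_final = [1.0, 0.5, 0.125]
per_stage = stage_forgetting(mse_initial, mse_final)
avg = average_forgetting(mse_initial, mse_final)

# Illustrative (made-up) accuracies, where higher is better.
acc_avg = average_forgetting([0.75, 0.5], [0.5, 0.5], lower_is_better=False)
```

The `lower_is_better` flag is what lets one function cover both loss-type metrics (reconstruction or forecasting error) and accuracy-type metrics.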
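Suppose that for a past stage k < t:
(i) Conflict with the current stage: ⟨∇R_k, ∇R_t⟩ < 0
(ii) Historical mixture benign: ⟨∇R_k, ḡ_{<t}⟩ ≥ 0
Then there exists λ* ∈ (0,1) such that for all λ ∈ [λ*, 1]:
⟨∇R_k, d^rep⟩ ≥ 0, where d^rep = (1-λ)∇R_t + λḡ_{<t}
i.e., the first-order change in R_k under a Replay step along -d^rep is non-positive.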
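Proof sketch:
Let h(λ) = ⟨∇R_k, (1-λ)∇R_t + λḡ_{<t}⟩ = (1-λ)⟨∇R_k, ∇R_t⟩ + λ⟨∇R_k, ḡ_{<t}⟩
By (i): h(0) < 0
By (ii): h(1) ≥ 0
h is affine in λ with h(1) > h(0), so it is increasing and has a unique root λ* = -h(0) / (h(1) - h(0)); this root lies in (0,1) whenever h(1) > 0
For λ ≥ λ*, h(λ) ≥ 0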
Intuitive explanation: When the current-stage gradient conflicts with a past stage while the historical mixture is benign for that stage, Replay can flip forgetting steps into non-forgetting steps. This describes the RotMNIST digit-pair and airline-group streams.
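A small worked example makes the threshold concrete. The vectors below are made up for illustration, and the helper names (`dot`, `replay_direction`, `critical_lambda`) are hypothetical; the only thing taken from the argument above is that the alignment is affine in λ and changes sign at its root.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def replay_direction(g_cur, g_hist, lam):
    """Mixed direction (1 - lam) * g_cur + lam * g_hist."""
    return [(1 - lam) * c + lam * h for c, h in zip(g_cur, g_hist)]


def critical_lambda(grad_k, g_cur, g_hist):
    """Root of h(lam) = <grad_k, replay_direction(lam)> under conditions (i) and (ii)."""
    h0 = dot(grad_k, g_cur)   # alignment with the current-stage gradient
    h1 = dot(grad_k, g_hist)  # alignment with the historical mixture
    if not (h0 < 0 <= h1):
        raise ValueError("requires h(0) < 0 and h(1) >= 0")
    return -h0 / (h1 - h0)


# Toy 2-D gradients: the current stage conflicts with stage k, history does not.
grad_k = [1.0, 0.0]
g_cur = [-1.0, 1.0]
g_hist = [3.0, 0.0]

lam_star = critical_lambda(grad_k, g_cur, g_hist)
below = dot(grad_k, replay_direction(g_cur, g_hist, 0.1))       # negative: forgetting step
at_root = dot(grad_k, replay_direction(g_cur, g_hist, lam_star))
above = dot(grad_k, replay_direction(g_cur, g_hist, 0.5))       # non-negative: benign step
```

Here h(0) = -1 and h(1) = 3, so λ* = 1/4: any mixture that gives at least a quarter of the weight to the historical gradient no longer increases R_k to first order.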
Finite buffer approximation:
Per-sample gradient bound: ||∇_θ ℓ(f_θ(x), y)|| ≤ G
Standard concentration bounds show that, with high probability, the gradient estimated from a buffer of capacity C deviates from ḡ_{<t} by at most O(G/√C)
In the experiments C ~ 10³, so the approximation error is small and Replay is robust
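The 1/√C scaling can be checked numerically with a toy Monte Carlo experiment. This sketch uses bounded scalar values as stand-ins for per-sample gradients and a uniformly sampled buffer; the pool size, trial count, and the `buffer_error` helper are assumptions made for illustration, not details from the paper.

```python
import math
import random


def buffer_error(pool, capacity, trials, rng):
    """Mean absolute gap between a buffer average and the full-history average."""
    full_mean = sum(pool) / len(pool)
    total = 0.0
    for _ in range(trials):
        buf = rng.sample(pool, capacity)  # uniform sample without replacement
        total += abs(sum(buf) / capacity - full_mean)
    return total / trials


rng = random.Random(0)
G = 1.0
# Toy history: 20,000 scalar "gradients", each bounded by G in absolute value.
pool = [rng.uniform(-G, G) for _ in range(20_000)]

errors = {C: buffer_error(pool, C, trials=200, rng=rng) for C in (10, 100, 1000)}
for C, err in errors.items():
    print(f"C={C:5d}  mean error={err:.4f}  G/sqrt(C)={G / math.sqrt(C):.4f}")
```

Each tenfold increase in capacity should shrink the average error by roughly √10, which is why a buffer of about a thousand samples already tracks the historical gradient closely in this toy setting.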
Theoretical insight: Through gradient alignment analysis, stateful replay transforms forgetting steps into benign updates by mixing historical and current gradients when gradient conflicts exist
Overall Assessment: This is a solid continual learning paper that offers a practical answer to catastrophic forgetting in streaming scenarios through concise theoretical analysis and systematic experimental evaluation. Its main value lies in: (1) a unified task formalization; (2) a clear gradient alignment theory; (3) systematic evaluation across task types and stream types. Although it has limitations in model scale, theoretical depth, and breadth of method comparison, its positioning as a "strong baseline" is reasonable. For researchers and engineers who need to deploy continual learning systems in resource-constrained environments, the paper provides valuable guidance and a reference implementation.