MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification
Park, Ji, Park et al.
Continual Learning (CL) for malware classification tackles the rapidly evolving nature of malware threats and the frequent emergence of new types. Generative Replay (GR)-based CL systems utilize a generative model to produce synthetic versions of past data, which are then combined with new data to retrain the primary model. Traditional machine learning techniques in this domain often struggle with catastrophic forgetting, where a model's performance on old data degrades over time.
In this paper, we introduce a GR-based CL system that employs Generative Adversarial Networks (GANs) with feature matching loss to generate high-quality malware samples. Additionally, we implement innovative selection schemes for replay samples based on the model's hidden representations.
Our comprehensive evaluation across Windows and Android malware datasets in a class-incremental learning scenario (where new classes are introduced continuously over multiple tasks) demonstrates substantial performance improvements over previous methods. For example, our system achieves an average accuracy of 55% on Windows malware samples, outperforming other GR-based models by 28%. This study provides practical insights for advancing GR-based malware classification systems. The implementation is available at https://github.com/MalwareReplayGAN/MalCL (the code will be made public upon the presentation of the paper).
This paper proposes MalCL, a system that addresses continual learning for malware classification. It uses a Generative Adversarial Network (GAN)-based generative replay approach: a generator trained with a feature matching loss produces high-quality synthetic malware samples, and a selection mechanism based on the classifier's hidden representations chooses which of them to replay. In class-incremental learning scenarios on Windows and Android malware datasets, the system improves substantially on prior work, reaching 55% mean accuracy on Windows malware samples, a 28% improvement over other generative replay-based models.
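To make the generative replay idea concrete, the following is a minimal sketch of how one training set might be assembled for a new task. It is not the authors' implementation: the function `build_replay_training_set`, the callables `generate` and `label_with_old_model`, and the choice to label synthetic samples with the previous classifier are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = List[float]
Labeled = Tuple[Sample, int]


def build_replay_training_set(
    new_data: Sequence[Labeled],
    generate: Callable[[], Sample],
    label_with_old_model: Callable[[Sample], int],
    n_replay: int,
    seed: int = 0,
) -> List[Labeled]:
    """Mix real samples of the new classes with synthetic samples of old ones."""
    # Draw synthetic stand-ins for past data from the (hypothetical) generator.
    synthetic = [generate() for _ in range(n_replay)]
    # Assumption: the classifier from the previous task labels the synthetic data.
    replayed = [(x, label_with_old_model(x)) for x in synthetic]
    # Combine with the real data of the new task and shuffle for training.
    combined = list(new_data) + replayed
    random.Random(seed).shuffle(combined)
    return combined


# Toy stand-ins for a trained generator and the previous classifier.
_rng = random.Random(1)


def toy_generate() -> Sample:
    return [_rng.gauss(0.0, 1.0), _rng.gauss(0.0, 1.0)]


def toy_old_model(x: Sample) -> int:
    return 0 if x[0] < 0 else 1  # the two previously learned classes


new_task = [([5.0, 5.0], 2), ([5.5, 4.5], 2)]  # real samples of new class 2
train_set = build_replay_training_set(new_task, toy_generate, toy_old_model, n_replay=4)
print(len(train_set))  # 2 real + 4 replayed samples
```

The classifier would then be retrained on `train_set`, so that old classes are represented by synthetic samples instead of stored originals.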
The primary challenge for continually updated malware classifiers is catastrophic forgetting: when a machine learning model is trained continuously on new data, its performance on previously learned data degrades significantly. This is particularly severe in the malware domain because:
Rapid malware evolution: The AV-TEST Institute records 450,000 new malware samples and potentially unwanted applications (PUAs) daily
High submission volume: VirusTotal processes over 1 million software submissions daily
Dilemma for antivirus companies: Either remove old samples (risking resurgence of legacy malware) or ignore new samples (missing emerging threats)
The paper defines a concrete threat scenario where attackers exploit legacy malware to evade machine learning systems updated only with new data. As the time gap between original training and attack increases, the likelihood of successful evasion increases.
Existing approaches have the following limitations:
Traditional machine learning methods: Fail to effectively address catastrophic forgetting
Continual learning methods from computer vision: Perform poorly when directly applied to malware classification, sometimes underperforming the "None" baseline (a model retrained on new data only, with no continual learning technique)
Storage constraints: Storage of historical data is limited by privacy regulations
The paper's main contributions are:
Malware-domain-specific continual learning model: MalCL achieves 55% mean accuracy across 11 continual learning tasks on 100 malware families, a 28% improvement over existing methods
Improved feature matching generative replay: A GAN generator is trained with a feature matching loss to reduce feature discrepancies between original and synthetic samples
Innovative replay sample selection mechanism: Multiple selection strategies based on intermediate-layer features of the classifier improve alignment between generated and original data
Strategic task set construction: Assigning larger classes to the initial tasks is explored and found to mitigate catastrophic forgetting effectively
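The replay sample selection idea can be sketched as follows. This is an illustrative assumption, not the paper's exact scheme: it supposes that a mean hidden-feature vector per past class is computed while the original data is still available, and that synthetic samples are ranked by Euclidean distance to that mean. The names `class_mean`, `l2_distance`, and `select_replay_samples` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def class_mean(features: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_replay_samples(
    real_features_by_class: Dict[int, List[Vector]],
    synthetic_features: List[Vector],
    k_per_class: int,
) -> Dict[int, List[int]]:
    """For each past class, return the indices of the k synthetic samples whose
    hidden features lie closest to that class's mean original feature vector."""
    selected: Dict[int, List[int]] = {}
    for label, feats in real_features_by_class.items():
        center = class_mean(feats)
        ranked = sorted(
            range(len(synthetic_features)),
            key=lambda i: l2_distance(synthetic_features[i], center),
        )
        selected[label] = ranked[:k_per_class]
    return selected


# Toy hidden features (2-dimensional) for two past classes.
real = {0: [[0.0, 0.0], [0.0, 2.0]], 1: [[10.0, 10.0], [12.0, 10.0]]}
synthetic = [[0.0, 1.5], [11.0, 9.0], [5.0, 5.0], [0.2, 1.0]]
print(select_replay_samples(real, synthetic, k_per_class=2))
# → {0: [3, 0], 1: [1, 2]}
```

Only the per-class mean vectors need to be retained between tasks in this sketch, which is compatible with the storage constraints noted earlier.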
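The generator is trained with a feature matching loss:
$$L_G = \frac{1}{m} \sum_{i=1}^{m} \left\| \mathbb{E}_{x \sim p_{\text{data}}}\left[D^{(f)}(x)\right] - \mathbb{E}_{z \sim p_z}\left[D^{(f)}(G(z))\right] \right\|$$
where $D^{(f)}(\cdot)$ denotes the intermediate-layer output of the discriminator and $G(z)$ is the sample generated from noise $z$. This loss matches the richer intermediate features of real and generated samples rather than the discriminator's final outputs.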
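As a rough illustration of the feature matching loss, the sketch below estimates the term inside the norm from one batch of real samples and one batch of generated samples. It is not the authors' code: the expectations are approximated by batch means, the norm is assumed to be Euclidean, and `disc_features` is a hypothetical stand-in for the discriminator's intermediate layer.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def feature_matching_loss(
    disc_features: Callable[[Vector], Vector],
    real_batch: Sequence[Vector],
    fake_batch: Sequence[Vector],
) -> float:
    """Norm of the gap between mean intermediate features of real and fake data."""
    # Batch estimate of the expected intermediate features of real samples.
    real_mean = mean_vector([disc_features(x) for x in real_batch])
    # Batch estimate of the expected intermediate features of generated samples.
    fake_mean = mean_vector([disc_features(g) for g in fake_batch])
    # Assumption: Euclidean norm of the difference.
    return math.sqrt(sum((r - f) ** 2 for r, f in zip(real_mean, fake_mean)))


def toy_disc_features(x: Vector) -> Vector:
    """Hypothetical intermediate layer: a fixed linear map of a 2-d input."""
    return [x[0] + x[1], x[0] - x[1]]


real_batch = [[1.0, 1.0], [3.0, 1.0]]   # mean features: [3, 1]
fake_batch = [[0.0, 0.0], [0.0, 2.0]]   # mean features: [1, -1]
loss = feature_matching_loss(toy_disc_features, real_batch, fake_batch)
print(round(loss, 4))  # → 2.8284
```

The loss is zero when the generated batch has the same mean intermediate features as the real batch, which is the condition the generator is pushed toward.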
This paper cites important works from continual learning, malware detection, and generative adversarial networks, including:
Shin et al. (2017): Continual learning with deep generative replay
Rahman, Coull, and Wright (2022): First exploration of continual learning in malware classification
Anderson and Roth (2018): EMBER dataset
Arp et al. (2014): Drebin feature extraction methodology
Overall Assessment: This paper proposes an innovative solution to catastrophic forgetting in malware classification, supported by adequate technical methodology and experimental validation. Although there is still room to improve performance, the work makes important contributions to both research and practical applications in this domain.