
DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Ying, Zhu, Lv et al.

As the scope and impact of cyber threats have expanded, analysts use audit logs to hunt threats and investigate attacks. Provenance graphs constructed from kernel logs are increasingly regarded as an ideal data source because of their strong semantic expressiveness and their ability to correlate attack history. However, storing provenance graphs in traditional databases incurs high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains a deep neural network to support batch querying. We have conducted evaluations on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times more efficient than Neo4j, and 16.17 times more efficient than Leonard (the work most closely related to Dehydrator, published at USENIX Security '23).



Basic Information

Paper ID: 2501.00446
Title: DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
Authors: Jie Ying, Tiantian Zhu*, Mingqi Lv, Tieming Chen (Zhejiang University of Technology)
Category: cs.CR (Cryptography and Security)
Published in: IEEE Transactions on Information Forensics and Security
Paper link: https://arxiv.org/abs/2501.00446

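Abstract

As cyber threats grow in scope and impact, analysts rely on audit logs to hunt threats and investigate attacks. Provenance graphs built from kernel logs are increasingly seen as an ideal data source, thanks to their strong semantic expressiveness and their ability to correlate attack history. However, because kernel events are so frequent and attacks are so persistent, storing provenance graphs in traditional databases incurs high storage overhead. To solve this problem, the paper proposes DEHYDRATOR, an efficient provenance graph storage system. For logs generated by auditing frameworks, DEHYDRATOR uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains a deep neural network to support batch queries. In an evaluation on seven datasets totaling more than one billion log entries, DEHYDRATOR reduces storage space by 84.55% and is 7.36 times more efficient than PostgreSQL, 7.16 times more efficient than Neo4j, and 16.17 times more efficient than Leonard.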

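Research Background and Motivation

Problem Background

Surge in cyber threats: As of May 2024, there had been 9,478 data breach incidents; the MOAB incident in January 2024 alone leaked 26 billion records
Importance of provenance graphs: A provenance graph is a directed graph whose nodes represent system entities (processes, files, sockets) and whose edges represent system events; it offers strong semantic expressiveness and the ability to correlate attack history
Storage challenges: Four phenomena make storage difficult:
- Irreversible growth: to preserve data integrity, data is only appended, never deleted
- Rapid expansion: each machine produces gigabytes of logs per day
- Long duration: an intrusion lasts 188 days on average before it is discovered
- Query demand: large-scale queries for threat hunting and causal analysis must be supported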
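
To make the provenance graph definition above concrete, here is a minimal Python sketch. It is an illustration only and not the paper's implementation: the names Entity, ProvenanceGraph, add_event, and backtrack are hypothetical, and the backward traversal ignores timestamps, which a real causal analysis would have to respect.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A system entity (hypothetical type for this sketch)."""
    kind: str  # "process", "file", or "socket"
    name: str


@dataclass
class ProvenanceGraph:
    """Directed graph: nodes are entities, edges are system events."""
    # incoming[dst] holds (src, operation, timestamp) for every edge src -> dst
    incoming: dict = field(default_factory=lambda: defaultdict(list))

    def add_event(self, src, dst, operation, timestamp):
        self.incoming[dst].append((src, operation, timestamp))

    def backtrack(self, target):
        """Collect every entity with a path to `target` (timestamps ignored)."""
        seen, stack = set(), [target]
        while stack:
            node = stack.pop()
            for src, _op, _ts in self.incoming.get(node, []):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


# Toy example with made-up entities and events.
g = ProvenanceGraph()
bash = Entity("process", "bash")
curl = Entity("process", "curl")
sock = Entity("socket", "10.0.0.5:443")
payload = Entity("file", "/tmp/payload")
g.add_event(bash, curl, "fork", 1)
g.add_event(sock, curl, "recv", 2)
g.add_event(curl, payload, "write", 3)
ancestors = g.backtrack(payload)
```

Indexing edges by destination keeps backward (causal) queries cheap, which is the access pattern that threat hunting and causal analysis tend to need. Because events are only ever appended, the structure also reflects the irreversible growth listed above.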

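Limitations of Existing Methods

Existing efficient storage systems for provenance graphs (ESSPGs) fall into two categories:

Pruning-based methods (e.g., LogGC, CPR, NodeMerge, DPR): lossy compression, which can cause false negatives in upper-layer components
Encoding-based methods (e.g., SEAL, SLEUTH, ELISE, Leonard): either they cannot support queries, or their auxiliary components take up a large amount of storage space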

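Research Motivation

Existing methods cannot satisfy three key requirements at the same time:

Lossless content: retain all data to avoid false negatives
Storage efficiency: minimize storage overhead
Query support: handle large-scale query demand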

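Core Contributions

The DEHYDRATOR system: an efficient provenance graph storage system that overcomes the limitations of existing methods, using field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and a deep neural network to support batch queries
Prototype and large-scale evaluation: evaluated on seven datasets (more than one billion log entries in total), DEHYDRATOR reduces storage space by 84.55% and is 7.36, 7.16, and 16.17 times more efficient than PostgreSQL, Neo4j, and Leonard, respectively
Comprehensive evaluation and analysis: explores the impact of each component, the applicable scenarios, and the performance lower bound, and defines the latency-to-storage ratio (LSR) metric to balance storage overhead against latency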
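
The two encoding steps named in the first contribution can be illustrated with a small Python sketch. This is a simplified guess at the general idea and not DEHYDRATOR's actual algorithm: field mapping is shown as per-field dictionary encoding, the hierarchical step is approximated by grouping edges under their destination node, and the neural-network stage is left out. The function names and the record fields (src, op, dst, ts) are assumptions made for this example.

```python
from collections import defaultdict


def field_mapping_encode(records):
    """Replace each field value with a small integer id, one table per field."""
    tables = defaultdict(dict)  # field name -> {value: id}
    encoded = []
    for record in records:
        row = {}
        for name, value in record.items():
            table = tables[name]
            # A repeated value reuses its id; a new value gets the next id.
            row[name] = table.setdefault(value, len(table))
        encoded.append(row)
    return encoded, {name: dict(table) for name, table in tables.items()}


def group_by_destination(encoded):
    """Store each destination id once, with its incoming (src, op, ts) edges."""
    grouped = defaultdict(list)
    for row in encoded:
        grouped[row["dst"]].append((row["src"], row["op"], row["ts"]))
    return dict(grouped)


def decode(grouped, tables):
    """Invert both steps, showing that the encoding loses no information."""
    reverse = {
        name: {idx: value for value, idx in table.items()}
        for name, table in tables.items()
    }
    records = []
    for dst, edges in grouped.items():
        for src, op, ts in edges:
            records.append({
                "src": reverse["src"][src],
                "op": reverse["op"][op],
                "dst": reverse["dst"][dst],
                "ts": reverse["ts"][ts],
            })
    return records


# Made-up log records for illustration.
logs = [
    {"src": "bash", "op": "fork", "dst": "curl", "ts": 1},
    {"src": "curl", "op": "write", "dst": "/tmp/payload", "ts": 2},
    {"src": "bash", "op": "write", "dst": "/tmp/payload", "ts": 3},
]
encoded, tables = field_mapping_encode(logs)
grouped = group_by_destination(encoded)
restored = sorted(decode(grouped, tables), key=lambda r: r["ts"])
```

The round trip through decode is the point of the sketch: unlike the pruning-based methods, an encoding of this kind removes redundancy without discarding any event, which is the lossless-content requirement stated earlier.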