As the scope and impact of cyber threats have expanded, analysts utilize audit logs to hunt threats and investigate attacks. The provenance graphs constructed from kernel logs are increasingly considered as an ideal data source due to their powerful semantic expression and attack historic correlation ability. However, storing provenance graphs with traditional databases faces the challenge of high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally learns a deep neural network to support batch querying. We have conducted evaluations on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces the storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times than Neo4j, and 16.17 times than Leonard (the work most closely related to Dehydrator, published at Usenix Security'23).
DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
- Paper ID: 2501.00446
- Title: DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
- Authors: Jie Ying, Tiantian Zhu*, Mingqi Lv, Tieming Chen (Zhejiang University of Technology)
- Classification: cs.CR (Cryptography and Security)
- Published Journal: IEEE Transactions on Information Forensics and Security
- Paper Link: https://arxiv.org/abs/2501.00446
As the scope and impact of cyber threats expand, analysts leverage audit logs to track threats and investigate attacks. Provenance graphs constructed from kernel logs are increasingly viewed as ideal data sources due to their powerful semantic expressiveness and ability to correlate attack histories. However, due to the high frequency of kernel events and the persistence of attacks, storing provenance graphs using traditional databases faces challenges of high storage overhead. To address this issue, this paper proposes DEHYDRATOR, an efficient provenance graph storage system. For logs generated by audit frameworks, DEHYDRATOR uses field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains deep neural networks to support batch queries. Evaluated on seven datasets comprising over one billion log entries in total, experimental results demonstrate that DEHYDRATOR reduces storage space by 84.55%, achieving 7.36× efficiency over PostgreSQL, 7.16× over Neo4j, and 16.17× over Leonard.
- Escalating Cyber Threats: As of May 2024, there have been 9,478 data breach incidents, with the MOAB event in January 2024 leaking 26 billion records
- Importance of Provenance Graphs: Provenance graphs, as directed graph structures with nodes representing system entities (processes, files, sockets) and edges representing system events, possess powerful semantic expressiveness and attack history correlation capabilities
- Storage Challenges: Four phenomena create storage difficulties:
- Irreversible Growth: To maintain data integrity, data is only added, never deleted
- Rapid Expansion: Each machine generates GB-level logs daily
- Extended Duration: Intrusions are discovered on average after 188 days
- Query Demands: Need to support large-scale queries for threat hunting and causal analysis
Existing efficient provenance graph storage systems (ESSPGs) fall into two categories:
- Pruning-based Methods (e.g., LogGC, CPR, NodeMerge, DPR): Lossy compression that may cause false negatives in upper-layer components
- Encoding-based Methods (e.g., SEAL, SLEUTH, ELISE, Leonard): Either unable to support queries or auxiliary components consume substantial storage space
Existing methods cannot simultaneously satisfy three critical requirements:
- Content Losslessness: Retain all data to avoid false negatives
- Storage Efficiency: Minimize storage overhead
- Query Support: Handle large-scale query demands
- Proposes DEHYDRATOR System: An efficient provenance graph storage system that overcomes limitations of existing methods, using field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and deep neural networks to support batch queries
- Constructs Prototype System and Conducts Large-Scale Evaluation: Evaluated on seven datasets (totaling over one billion log entries), achieving 84.55% storage space reduction, with 7.36×, 7.16×, and 16.17× efficiency improvements over PostgreSQL, Neo4j, and Leonard respectively
- Comprehensive Evaluation Analysis: Explores component impact, applicable scenarios, and performance lower bounds; defines Latency-Storage Ratio (LSR) metric to balance storage overhead and latency
Input: Raw kernel logs collected by audit frameworks
Output: Efficiently stored provenance graphs supporting queries from upper-layer components
Constraints: Content losslessness, storage efficiency, query support
DEHYDRATOR adopts a three-stage framework:
- Log Parsing: Extracts key fields from raw logs using regular expressions
- Provenance Graph Construction: Constructs node table NT (IdentiID, Name, Type) and edge table ET (SrcID, DstID, TimeStamp, Operation)
- Field-Mapping Encoding: Handles three types of field-level redundancy
- Unique values: Replaced with shorter numeric characters
- Repeated values: Replaced with indices
- Incremental values: Replaced with offsets
Hierarchical Encoding:
- Models provenance graphs as hierarchical directed graphs
- For each node v, records all source nodes and incoming edge information
- Constructs merged mapping table MMT and hierarchical edge table EThi
- Nested list structure: Operation: timeOffset: nodeOffset
Model Training:
- Selects single-layer decoder-only Transformer
- Models storage task as sequence generation task
- Uses char2vec encoding with autoregressive generation
- Constructs error correction table ECT to handle model prediction errors
- Node Information: Obtains indices through mapping table MT, retrieves node information
- Edge Information: Inputs indices to DNN model, generates sequences, applies ECT error correction, performs hierarchical decoding to obtain readable information
- Hierarchical Encoding Design:
- Based on backward query characteristics of causal analysis
- Compresses multiple parallel edges into compact encoding form
- Increases information density and accelerates model training
- DNN Model Selection:
- Single-layer decoder-only Transformer replaces multi-layer LSTM
- Better parallelization and feature extraction capabilities
- Suitable for identifying low-level repetitive patterns in storage tasks
- Error Correction Mechanism:
- ECT table records positions and correct characters
- Ensures content losslessness while supporting DNN compression
Seven datasets totaling over one billion log entries:
- G1-G4: DARPA TC E3 CADETS, THEIA, TRACE groups
- G5-G6: DARPA TC E4 TRACE group
- G7: DEPIMACT dataset subset
- Average edge count: 17,754,566 (9.6× larger than Leonard)
- Storage Overhead: BPpre (pretreatment) and BPpost (post-treatment) bytes
- Storage Latency: Ts time cost
- Latency-Storage Ratio: LSR = (BPpre - BPpost)/Ts
- PostgreSQL: Relational database
- Neo4j: Graph database
- Leonard: DNN-based storage system (Usenix Security'23)
- Environment: Python 3.9, PyTorch 1.13.1, AMD EPYC 7513 processor, RTX A6000 GPU
- Hyperparameters: Batch size 4096, Adam optimizer, learning rate 0.001, maximum training epochs 5
| System | Average Storage Overhead (MB) | Average Latency (s) | Improvement over DEHYDRATOR |
|---|
| PostgreSQL | 1,818 | 45 | 7.36× |
| Neo4j | 1,770 | 21 | 7.16× |
| Leonard | 3,991 | 30,233 | 16.17× |
| DEHYDRATOR | 247 | 3,205 | - |
In BFS query tests at different depths:
- Neo4j performs best (~4.92s)
- DEHYDRATOR ranks second (~32.02s)
- PostgreSQL performs worst (~32.08s)
Component Contribution Analysis:
- Original graph: 1598.69MB
- After field-mapping encoding: 405.2MB (25.3%)
- After hierarchical encoding: 75.98MB (4.7%)
- After model training: 192.42MB (12%)
Hierarchical Encoding Impact:
- With hierarchical encoding: EThi 20.19M, training time 660.69s, ECT 50.79M
- Without hierarchical encoding: EThi 268.31M, training time 5814.42s, ECT 1064.25M
- Hierarchical encoding reduces training time by 8.8× and ECT size by 20.95×
Theoretical derivation proves hierarchical encoding is effective when average degree davg ≥ 3
Experimental validation: Hierarchical encoding is effective on datasets with degrees 3, 4, and 5
- Heuristic Methods: HOLMES, SLEUTH, Poirot and others construct matching rules based on MITRE ATT&CK
- Anomaly Detection: Streamspot, Unicorn, KAIROS and others detect intrusions by identifying deviations from normal behavior
- RapSheet, HERCULE, NODOZE and other systems perform threat scoring and causal analysis
- DEPIMPACT, ATLAS and others perform dependency analysis and attack pattern recognition
- Lossy Methods: LogGC, CPR, NodeMerge, DPR and other pruning techniques
- Lossless Methods: SEAL, ELISE, Leonard and other encoding techniques
- DEHYDRATOR successfully addresses three major challenges in provenance graph storage: content losslessness, storage efficiency, and query support
- Hierarchical encoding is the key innovation, effectively handling structure-level redundancy
- Single-layer Transformer is more suitable for storage tasks than multi-layer LSTM
- Significantly outperforms existing methods on large-scale datasets
- High Storage Latency: Average 3205 seconds, accounting for 13.29% of dataset time span
- Query Efficiency: Autoregressive generation leads to high latency for long sequence queries
- Model Capacity Selection: Lacks theoretical guidance for determining optimal model capacity η
- Applicable Scope: Primarily suitable for cold storage scenarios, does not support ACID properties
- Leverage AI acceleration techniques to improve training and inference efficiency
- Provide theoretical analysis for optimal model capacity selection
- Extend to general-purpose graph database applications
- Optimize query algorithms to reduce latency
- Problem Importance: Addresses practical pain points in cybersecurity domain
- Methodological Innovation: Hierarchical encoding cleverly combines domain characteristics with DNN advantages
- Experimental Sufficiency: Large-scale dataset validation with comprehensive ablation studies and comparative analysis
- Engineering Value: Significant storage efficiency improvements with strong practical applicability
- Latency Issues: Storage and query latencies remain high, limiting real-time applications
- Theoretical Analysis: Lacks theoretical guidance for model capacity selection
- Applicable Scope: Primarily targets specific provenance graph scenarios with limited generalizability
- Baseline Comparison: Leonard implementation may involve unfair comparison
- Academic Contribution: Provides new technical pathways for provenance graph storage
- Practical Value: Significant importance for cybersecurity infrastructure
- Reproducibility: Commits to open-sourcing code and data
- Extensibility: Methods can be extended to other graph storage scenarios
- Cybersecurity: EDR systems, threat hunting, attack investigation
- Cold Storage: Historical data archival and analysis
- Large-Scale Graph Data: Storage of high-degree, high-redundancy graph structures
- Batch Queries: Applications requiring large-scale parallel queries
The paper cites 93 relevant references covering multiple domains including cybersecurity, graph compression, and deep learning, providing a solid theoretical foundation for the research.