2025-11-15T15:52:10.939408

DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Ying, Zhu, Lv et al.

As the scope and impact of cyber threats have expanded, analysts utilize audit logs to hunt threats and investigate attacks. The provenance graphs constructed from kernel logs are increasingly considered as an ideal data source due to their powerful semantic expression and attack historic correlation ability. However, storing provenance graphs with traditional databases faces the challenge of high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally learns a deep neural network to support batch querying. We have conducted evaluations on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces the storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times than Neo4j, and 16.17 times than Leonard (the work most closely related to Dehydrator, published at Usenix Security'23).

academic

DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Basic Information

Paper ID: 2501.00446
Title: DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
Authors: Jie Ying, Tiantian Zhu*, Mingqi Lv, Tieming Chen (Zhejiang University of Technology)
Classification: cs.CR (Cryptography and Security)
Published Journal: IEEE Transactions on Information Forensics and Security
Paper Link: https://arxiv.org/abs/2501.00446

Abstract

As the scope and impact of cyber threats expand, analysts leverage audit logs to track threats and investigate attacks. Provenance graphs constructed from kernel logs are increasingly viewed as ideal data sources due to their powerful semantic expressiveness and ability to correlate attack histories. However, due to the high frequency of kernel events and the persistence of attacks, storing provenance graphs using traditional databases faces challenges of high storage overhead. To address this issue, this paper proposes DEHYDRATOR, an efficient provenance graph storage system. For logs generated by audit frameworks, DEHYDRATOR uses field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains deep neural networks to support batch queries. Evaluated on seven datasets comprising over one billion log entries in total, experimental results demonstrate that DEHYDRATOR reduces storage space by 84.55%, achieving 7.36× efficiency over PostgreSQL, 7.16× over Neo4j, and 16.17× over Leonard.

Research Background and Motivation

Problem Background

Escalating Cyber Threats: As of May 2024, there have been 9,478 data breach incidents, with the MOAB event in January 2024 leaking 26 billion records
Importance of Provenance Graphs: Provenance graphs, as directed graph structures with nodes representing system entities (processes, files, sockets) and edges representing system events, possess powerful semantic expressiveness and attack history correlation capabilities
Storage Challenges: Four phenomena create storage difficulties:
- Irreversible Growth: To maintain data integrity, data is only added, never deleted
- Rapid Expansion: Each machine generates GB-level logs daily
- Extended Duration: Intrusions are discovered on average after 188 days
- Query Demands: Need to support large-scale queries for threat hunting and causal analysis

Limitations of Existing Methods

Existing efficient provenance graph storage systems (ESSPGs) fall into two categories:

Pruning-based Methods (e.g., LogGC, CPR, NodeMerge, DPR): Lossy compression that may cause false negatives in upper-layer components
Encoding-based Methods (e.g., SEAL, SLEUTH, ELISE, Leonard): Either unable to support queries or auxiliary components consume substantial storage space

Research Motivation

Existing methods cannot simultaneously satisfy three critical requirements:

Content Losslessness: Retain all data to avoid false negatives
Storage Efficiency: Minimize storage overhead
Query Support: Handle large-scale query demands

Core Contributions

Proposes DEHYDRATOR System: An efficient provenance graph storage system that overcomes limitations of existing methods, using field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and deep neural networks to support batch queries
Constructs Prototype System and Conducts Large-Scale Evaluation: Evaluated on seven datasets (totaling over one billion log entries), achieving 84.55% storage space reduction, with 7.36×, 7.16×, and 16.17× efficiency improvements over PostgreSQL, Neo4j, and Leonard respectively
Comprehensive Evaluation Analysis: Explores component impact, applicable scenarios, and performance lower bounds; defines Latency-Storage Ratio (LSR) metric to balance storage overhead and latency

Methodology Details

Task Definition

Input: Raw kernel logs collected by audit frameworks Output: Efficiently stored provenance graphs supporting queries from upper-layer components Constraints: Content losslessness, storage efficiency, query support

System Architecture

DEHYDRATOR adopts a three-stage framework:

1. Pretreatment Stage

Log Parsing: Extracts key fields from raw logs using regular expressions
Provenance Graph Construction: Constructs node table NT (IdentiID, Name, Type) and edge table ET (SrcID, DstID, TimeStamp, Operation)
Field-Mapping Encoding: Handles three types of field-level redundancy
- Unique values: Replaced with shorter numeric characters
- Repeated values: Replaced with indices
- Incremental values: Replaced with offsets

2. Storage Stage

Hierarchical Encoding:

Models provenance graphs as hierarchical directed graphs
For each node v, records all source nodes and incoming edge information
Constructs merged mapping table MMT and hierarchical edge table EThi
Nested list structure: Operation: timeOffset: nodeOffset

Model Training:

Selects single-layer decoder-only Transformer
Models storage task as sequence generation task
Uses char2vec encoding with autoregressive generation
Constructs error correction table ECT to handle model prediction errors

3. Query Stage

Node Information: Obtains indices through mapping table MT, retrieves node information
Edge Information: Inputs indices to DNN model, generates sequences, applies ECT error correction, performs hierarchical decoding to obtain readable information

Technical Innovations

Hierarchical Encoding Design:
- Based on backward query characteristics of causal analysis
- Compresses multiple parallel edges into compact encoding form
- Increases information density and accelerates model training
DNN Model Selection:
- Single-layer decoder-only Transformer replaces multi-layer LSTM
- Better parallelization and feature extraction capabilities
- Suitable for identifying low-level repetitive patterns in storage tasks
Error Correction Mechanism:
- ECT table records positions and correct characters
- Ensures content losslessness while supporting DNN compression

Experimental Setup

Datasets

Seven datasets totaling over one billion log entries:

G1-G4: DARPA TC E3 CADETS, THEIA, TRACE groups
G5-G6: DARPA TC E4 TRACE group
G7: DEPIMACT dataset subset
Average edge count: 17,754,566 (9.6× larger than Leonard)

Evaluation Metrics

Storage Overhead: BPpre (pretreatment) and BPpost (post-treatment) bytes
Storage Latency: Ts time cost
Latency-Storage Ratio: LSR = (BPpre - BPpost)/Ts

Comparison Methods

PostgreSQL: Relational database
Neo4j: Graph database
Leonard: DNN-based storage system (Usenix Security'23)

Implementation Details

Environment: Python 3.9, PyTorch 1.13.1, AMD EPYC 7513 processor, RTX A6000 GPU
Hyperparameters: Batch size 4096, Adam optimizer, learning rate 0.001, maximum training epochs 5

Experimental Results

Main Results

System	Average Storage Overhead (MB)	Average Latency (s)	Improvement over DEHYDRATOR
PostgreSQL	1,818	45	7.36×
Neo4j	1,770	21	7.16×
Leonard	3,991	30,233	16.17×
DEHYDRATOR	247	3,205	-

Query Performance

In BFS query tests at different depths:

Neo4j performs best (~4.92s)
DEHYDRATOR ranks second (~32.02s)
PostgreSQL performs worst (~32.08s)

Ablation Studies

Component Contribution Analysis:

Original graph: 1598.69MB
After field-mapping encoding: 405.2MB (25.3%)
After hierarchical encoding: 75.98MB (4.7%)
After model training: 192.42MB (12%)

Hierarchical Encoding Impact:

With hierarchical encoding: EThi 20.19M, training time 660.69s, ECT 50.79M
Without hierarchical encoding: EThi 268.31M, training time 5814.42s, ECT 1064.25M
Hierarchical encoding reduces training time by 8.8× and ECT size by 20.95×

Applicable Scenario Analysis

Theoretical derivation proves hierarchical encoding is effective when average degree davg ≥ 3 Experimental validation: Hierarchical encoding is effective on datasets with degrees 3, 4, and 5

Intrusion Detection

Heuristic Methods: HOLMES, SLEUTH, Poirot and others construct matching rules based on MITRE ATT&CK
Anomaly Detection: Streamspot, Unicorn, KAIROS and others detect intrusions by identifying deviations from normal behavior

Attack Investigation

RapSheet, HERCULE, NODOZE and other systems perform threat scoring and causal analysis
DEPIMPACT, ATLAS and others perform dependency analysis and attack pattern recognition

Graph Compression

Lossy Methods: LogGC, CPR, NodeMerge, DPR and other pruning techniques
Lossless Methods: SEAL, ELISE, Leonard and other encoding techniques

Conclusions and Discussion

Main Conclusions

DEHYDRATOR successfully addresses three major challenges in provenance graph storage: content losslessness, storage efficiency, and query support
Hierarchical encoding is the key innovation, effectively handling structure-level redundancy
Single-layer Transformer is more suitable for storage tasks than multi-layer LSTM
Significantly outperforms existing methods on large-scale datasets

Limitations

High Storage Latency: Average 3205 seconds, accounting for 13.29% of dataset time span
Query Efficiency: Autoregressive generation leads to high latency for long sequence queries
Model Capacity Selection: Lacks theoretical guidance for determining optimal model capacity η
Applicable Scope: Primarily suitable for cold storage scenarios, does not support ACID properties

Future Directions

Leverage AI acceleration techniques to improve training and inference efficiency
Provide theoretical analysis for optimal model capacity selection
Extend to general-purpose graph database applications
Optimize query algorithms to reduce latency

In-Depth Evaluation

Strengths

Problem Importance: Addresses practical pain points in cybersecurity domain
Methodological Innovation: Hierarchical encoding cleverly combines domain characteristics with DNN advantages
Experimental Sufficiency: Large-scale dataset validation with comprehensive ablation studies and comparative analysis
Engineering Value: Significant storage efficiency improvements with strong practical applicability

Weaknesses

Latency Issues: Storage and query latencies remain high, limiting real-time applications
Theoretical Analysis: Lacks theoretical guidance for model capacity selection
Applicable Scope: Primarily targets specific provenance graph scenarios with limited generalizability
Baseline Comparison: Leonard implementation may involve unfair comparison

Impact

Academic Contribution: Provides new technical pathways for provenance graph storage
Practical Value: Significant importance for cybersecurity infrastructure
Reproducibility: Commits to open-sourcing code and data
Extensibility: Methods can be extended to other graph storage scenarios

Applicable Scenarios

Cybersecurity: EDR systems, threat hunting, attack investigation
Cold Storage: Historical data archival and analysis
Large-Scale Graph Data: Storage of high-degree, high-redundancy graph structures
Batch Queries: Applications requiring large-scale parallel queries

References

The paper cites 93 relevant references covering multiple domains including cybersecurity, graph compression, and deep learning, providing a solid theoretical foundation for the research.