2025-11-15T15:52:10.939408

DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Ying, Zhu, Lv et al.
As the scope and impact of cyber threats have expanded, analysts utilize audit logs to hunt threats and investigate attacks. The provenance graphs constructed from kernel logs are increasingly considered as an ideal data source due to their powerful semantic expression and attack historic correlation ability. However, storing provenance graphs with traditional databases faces the challenge of high storage overhead, given the high frequency of kernel events and the persistence of attacks. To address this, we propose Dehydrator, an efficient provenance graph storage system. For the logs generated by auditing frameworks, Dehydrator uses field mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally learns a deep neural network to support batch querying. We have conducted evaluations on seven datasets totaling over one billion log entries. Experimental results show that Dehydrator reduces the storage space by 84.55%. Dehydrator is 7.36 times more efficient than PostgreSQL, 7.16 times than Neo4j, and 16.17 times than Leonard (the work most closely related to Dehydrator, published at Usenix Security'23).
academic

DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation

Basic Information

  • Paper ID: 2501.00446
  • Title: DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
  • Authors: Jie Ying, Tiantian Zhu*, Mingqi Lv, Tieming Chen (Zhejiang University of Technology)
  • Classification: cs.CR (Cryptography and Security)
  • Published Journal: IEEE Transactions on Information Forensics and Security
  • Paper Link: https://arxiv.org/abs/2501.00446

Abstract

As the scope and impact of cyber threats expand, analysts leverage audit logs to track threats and investigate attacks. Provenance graphs constructed from kernel logs are increasingly viewed as ideal data sources due to their powerful semantic expressiveness and ability to correlate attack histories. However, due to the high frequency of kernel events and the persistence of attacks, storing provenance graphs using traditional databases faces challenges of high storage overhead. To address this issue, this paper proposes DEHYDRATOR, an efficient provenance graph storage system. For logs generated by audit frameworks, DEHYDRATOR uses field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and finally trains deep neural networks to support batch queries. Evaluated on seven datasets comprising over one billion log entries in total, experimental results demonstrate that DEHYDRATOR reduces storage space by 84.55%, achieving 7.36× efficiency over PostgreSQL, 7.16× over Neo4j, and 16.17× over Leonard.

Research Background and Motivation

Problem Background

  1. Escalating Cyber Threats: As of May 2024, there have been 9,478 data breach incidents, with the MOAB event in January 2024 leaking 26 billion records
  2. Importance of Provenance Graphs: Provenance graphs, as directed graph structures with nodes representing system entities (processes, files, sockets) and edges representing system events, possess powerful semantic expressiveness and attack history correlation capabilities
  3. Storage Challenges: Four phenomena create storage difficulties:
    • Irreversible Growth: To maintain data integrity, data is only added, never deleted
    • Rapid Expansion: Each machine generates GB-level logs daily
    • Extended Duration: Intrusions are discovered on average after 188 days
    • Query Demands: Need to support large-scale queries for threat hunting and causal analysis

Limitations of Existing Methods

Existing efficient provenance graph storage systems (ESSPGs) fall into two categories:

  1. Pruning-based Methods (e.g., LogGC, CPR, NodeMerge, DPR): Lossy compression that may cause false negatives in upper-layer components
  2. Encoding-based Methods (e.g., SEAL, SLEUTH, ELISE, Leonard): Either unable to support queries or auxiliary components consume substantial storage space

Research Motivation

Existing methods cannot simultaneously satisfy three critical requirements:

  • Content Losslessness: Retain all data to avoid false negatives
  • Storage Efficiency: Minimize storage overhead
  • Query Support: Handle large-scale query demands

Core Contributions

  1. Proposes DEHYDRATOR System: An efficient provenance graph storage system that overcomes limitations of existing methods, using field-mapping encoding to filter field-level redundancy, hierarchical encoding to filter structure-level redundancy, and deep neural networks to support batch queries
  2. Constructs Prototype System and Conducts Large-Scale Evaluation: Evaluated on seven datasets (totaling over one billion log entries), achieving 84.55% storage space reduction, with 7.36×, 7.16×, and 16.17× efficiency improvements over PostgreSQL, Neo4j, and Leonard respectively
  3. Comprehensive Evaluation Analysis: Explores component impact, applicable scenarios, and performance lower bounds; defines Latency-Storage Ratio (LSR) metric to balance storage overhead and latency

Methodology Details

Task Definition

Input: Raw kernel logs collected by audit frameworks Output: Efficiently stored provenance graphs supporting queries from upper-layer components Constraints: Content losslessness, storage efficiency, query support

System Architecture

DEHYDRATOR adopts a three-stage framework:

1. Pretreatment Stage

  • Log Parsing: Extracts key fields from raw logs using regular expressions
  • Provenance Graph Construction: Constructs node table NT (IdentiID, Name, Type) and edge table ET (SrcID, DstID, TimeStamp, Operation)
  • Field-Mapping Encoding: Handles three types of field-level redundancy
    • Unique values: Replaced with shorter numeric characters
    • Repeated values: Replaced with indices
    • Incremental values: Replaced with offsets

2. Storage Stage

Hierarchical Encoding:

  • Models provenance graphs as hierarchical directed graphs
  • For each node v, records all source nodes and incoming edge information
  • Constructs merged mapping table MMT and hierarchical edge table EThi
  • Nested list structure: Operation: timeOffset: nodeOffset

Model Training:

  • Selects single-layer decoder-only Transformer
  • Models storage task as sequence generation task
  • Uses char2vec encoding with autoregressive generation
  • Constructs error correction table ECT to handle model prediction errors

3. Query Stage

  • Node Information: Obtains indices through mapping table MT, retrieves node information
  • Edge Information: Inputs indices to DNN model, generates sequences, applies ECT error correction, performs hierarchical decoding to obtain readable information

Technical Innovations

  1. Hierarchical Encoding Design:
    • Based on backward query characteristics of causal analysis
    • Compresses multiple parallel edges into compact encoding form
    • Increases information density and accelerates model training
  2. DNN Model Selection:
    • Single-layer decoder-only Transformer replaces multi-layer LSTM
    • Better parallelization and feature extraction capabilities
    • Suitable for identifying low-level repetitive patterns in storage tasks
  3. Error Correction Mechanism:
    • ECT table records positions and correct characters
    • Ensures content losslessness while supporting DNN compression

Experimental Setup

Datasets

Seven datasets totaling over one billion log entries:

  • G1-G4: DARPA TC E3 CADETS, THEIA, TRACE groups
  • G5-G6: DARPA TC E4 TRACE group
  • G7: DEPIMACT dataset subset
  • Average edge count: 17,754,566 (9.6× larger than Leonard)

Evaluation Metrics

  • Storage Overhead: BPpre (pretreatment) and BPpost (post-treatment) bytes
  • Storage Latency: Ts time cost
  • Latency-Storage Ratio: LSR = (BPpre - BPpost)/Ts

Comparison Methods

  • PostgreSQL: Relational database
  • Neo4j: Graph database
  • Leonard: DNN-based storage system (Usenix Security'23)

Implementation Details

  • Environment: Python 3.9, PyTorch 1.13.1, AMD EPYC 7513 processor, RTX A6000 GPU
  • Hyperparameters: Batch size 4096, Adam optimizer, learning rate 0.001, maximum training epochs 5

Experimental Results

Main Results

SystemAverage Storage Overhead (MB)Average Latency (s)Improvement over DEHYDRATOR
PostgreSQL1,818457.36×
Neo4j1,770217.16×
Leonard3,99130,23316.17×
DEHYDRATOR2473,205-

Query Performance

In BFS query tests at different depths:

  • Neo4j performs best (~4.92s)
  • DEHYDRATOR ranks second (~32.02s)
  • PostgreSQL performs worst (~32.08s)

Ablation Studies

Component Contribution Analysis:

  • Original graph: 1598.69MB
  • After field-mapping encoding: 405.2MB (25.3%)
  • After hierarchical encoding: 75.98MB (4.7%)
  • After model training: 192.42MB (12%)

Hierarchical Encoding Impact:

  • With hierarchical encoding: EThi 20.19M, training time 660.69s, ECT 50.79M
  • Without hierarchical encoding: EThi 268.31M, training time 5814.42s, ECT 1064.25M
  • Hierarchical encoding reduces training time by 8.8× and ECT size by 20.95×

Applicable Scenario Analysis

Theoretical derivation proves hierarchical encoding is effective when average degree davg ≥ 3 Experimental validation: Hierarchical encoding is effective on datasets with degrees 3, 4, and 5

Intrusion Detection

  • Heuristic Methods: HOLMES, SLEUTH, Poirot and others construct matching rules based on MITRE ATT&CK
  • Anomaly Detection: Streamspot, Unicorn, KAIROS and others detect intrusions by identifying deviations from normal behavior

Attack Investigation

  • RapSheet, HERCULE, NODOZE and other systems perform threat scoring and causal analysis
  • DEPIMPACT, ATLAS and others perform dependency analysis and attack pattern recognition

Graph Compression

  • Lossy Methods: LogGC, CPR, NodeMerge, DPR and other pruning techniques
  • Lossless Methods: SEAL, ELISE, Leonard and other encoding techniques

Conclusions and Discussion

Main Conclusions

  1. DEHYDRATOR successfully addresses three major challenges in provenance graph storage: content losslessness, storage efficiency, and query support
  2. Hierarchical encoding is the key innovation, effectively handling structure-level redundancy
  3. Single-layer Transformer is more suitable for storage tasks than multi-layer LSTM
  4. Significantly outperforms existing methods on large-scale datasets

Limitations

  1. High Storage Latency: Average 3205 seconds, accounting for 13.29% of dataset time span
  2. Query Efficiency: Autoregressive generation leads to high latency for long sequence queries
  3. Model Capacity Selection: Lacks theoretical guidance for determining optimal model capacity η
  4. Applicable Scope: Primarily suitable for cold storage scenarios, does not support ACID properties

Future Directions

  1. Leverage AI acceleration techniques to improve training and inference efficiency
  2. Provide theoretical analysis for optimal model capacity selection
  3. Extend to general-purpose graph database applications
  4. Optimize query algorithms to reduce latency

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses practical pain points in cybersecurity domain
  2. Methodological Innovation: Hierarchical encoding cleverly combines domain characteristics with DNN advantages
  3. Experimental Sufficiency: Large-scale dataset validation with comprehensive ablation studies and comparative analysis
  4. Engineering Value: Significant storage efficiency improvements with strong practical applicability

Weaknesses

  1. Latency Issues: Storage and query latencies remain high, limiting real-time applications
  2. Theoretical Analysis: Lacks theoretical guidance for model capacity selection
  3. Applicable Scope: Primarily targets specific provenance graph scenarios with limited generalizability
  4. Baseline Comparison: Leonard implementation may involve unfair comparison

Impact

  1. Academic Contribution: Provides new technical pathways for provenance graph storage
  2. Practical Value: Significant importance for cybersecurity infrastructure
  3. Reproducibility: Commits to open-sourcing code and data
  4. Extensibility: Methods can be extended to other graph storage scenarios

Applicable Scenarios

  1. Cybersecurity: EDR systems, threat hunting, attack investigation
  2. Cold Storage: Historical data archival and analysis
  3. Large-Scale Graph Data: Storage of high-degree, high-redundancy graph structures
  4. Batch Queries: Applications requiring large-scale parallel queries

References

The paper cites 93 relevant references covering multiple domains including cybersecurity, graph compression, and deep learning, providing a solid theoretical foundation for the research.