2025-11-17T12:40:13.303016

Over-Threshold Multiparty Private Set Intersection for Collaborative Network Intrusion Detection

Arpaci, Boutaba, Kerschbaum

An important function of collaborative network intrusion detection is to analyze the network logs of the collaborators for joint IP addresses. However, sharing IP addresses in plain is sensitive and may be even subject to privacy legislation as it is personally identifiable information. In this paper, we present the privacy-preserving collection of IP addresses. We propose a single collector, over-threshold private set intersection protocol. In this protocol $N$ participants identify the IP addresses that appear in at least $t$ participant's sets without revealing any information about other IP addresses. Using a novel hashing scheme, we reduce the computational complexity of the previous state-of-the-art solution from $O(M(N \log{M}/t)^{2t})$ to $O(t^2M\binom{N}{t})$, where $M$ denotes the dataset size. This reduction makes it practically feasible to apply our protocol to real network logs. We test our protocol using joint networks logs of multiple institutions. Additionally, we present two deployment options: a collusion-safe deployment, which provides stronger security guarantees at the cost of increased communication overhead, and a non-interactive deployment, which assumes a non-colluding collector but offers significantly lower communication costs and applicable to many use cases of collaborative network intrusion detection similar to ours.

academic

Over-Threshold Multiparty Private Set Intersection for Collaborative Network Intrusion Detection

Basic Information

Paper ID: 2510.12045
Title: Over-Threshold Multiparty Private Set Intersection for Collaborative Network Intrusion Detection
Authors: Onur Eren Arpaci, Raouf Boutaba, Florian Kerschbaum (University of Waterloo)
Classification: cs.CR (Cryptography and Security), cs.NI (Networking and Internet Architecture)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12045

Abstract

A critical function in collaborative network intrusion detection is analyzing network logs from collaborators to identify common IP addresses. However, sharing IP addresses in plaintext is sensitive and may be constrained by privacy legislation, as it constitutes personally identifiable information. This paper proposes a privacy-preserving collection method for IP addresses through a single-aggregator, over-threshold private set intersection protocol. In this protocol, N participants identify IP addresses appearing in at least t participants' sets without revealing any information about other IP addresses. Through a novel hashing scheme, the computational complexity of the previous state-of-the-art solution is reduced from $O(M(N \log{M}/t)^{2t})$ to $O(t^2M\binom{N}{t})$ , where M denotes the dataset size. This reduction makes applying the protocol to real network logs practically feasible.

Research Background and Motivation

Problem Definition

The core challenge in collaborative network intrusion detection is identifying multi-institutional attacks while preserving privacy. Research demonstrates that 75% of institutional attacks spread to a second institution within one day, with over 40% spreading within one hour. Attackers typically exploit a small number of external IP addresses to simultaneously attack multiple institutions. If an external IP connects to at least t institutions within a specific time window, it can be classified as malicious with 95% recall.

Privacy Challenges

Traditional approaches require institutions to share network logs in plaintext, introducing severe privacy risks:

Legal Compliance: IP addresses are classified as personally identifiable information under GDPR, PIPEDA, CCPA, and other regulations
Data Sensitivity: Raw network data is more sensitive than security alerts, containing extensive irrelevant sensitive information
Data Scale: Raw data is several orders of magnitude larger than security alerts, rendering existing solutions computationally infeasible

Limitations of Existing Methods

Kissner and Song 26: Requires O(N) communication rounds with computational complexity O(N³M³)
Mahdavi et al. 34: While improving communication complexity, computational cost remains prohibitive at O(M(N logM/t)²ᵗ)

Core Contributions

Novel Hashing Scheme: Proposes an innovative hashing algorithm reducing computational complexity from O(M(N logM/t)²ᵗ) to O(t²M(N choose t)), achieving linear complexity in M
Practical Enhancement: Enables the protocol to handle real-scale network logs, completing detection within 170 seconds for 33 participating institutions with up to 144,045 IPs
Dual Deployment Options:
- Collusion-resistant deployment: Provides stronger security guarantees but higher communication overhead
- Non-interactive deployment: Assumes non-colluding aggregator, significantly reducing communication costs
Security Proof: Proves protocol security under the semi-honest multiparty computation model
Practical Validation: Evaluation using real network logs from the CANARIE IDS project

Methodology Details

Task Definition

Over-Threshold Multiparty Private Set Intersection (OT-MP-PSI):

Input: N participants, each holding at most M elements in set Si
Output: Identifies elements appearing in at least t sets without revealing information about below-threshold elements
Constraints: Semi-honest security model where participants follow the protocol but may attempt to learn additional information

Core Technical Components

Uses (t,n) threshold scheme where any t parties can reconstruct secret V, while fewer than t parties gain no information:

P(x) = c_{t-1}x^{t-1} + c_{t-2}x^{t-2} + ... + c_1x + V

2. Oblivious Pseudorandom Function (OPRF)

Participants learn H_K(x) without knowing key K, while the key holder remains unaware of input x or output values.

Combines secret sharing and OPRF security properties, allowing participants to obtain unique secret shares from the key holder.

Protocol Architecture

Non-Interactive Deployment

Share Creation: Each participant creates secret shares for every element in their set
Hash Mapping: Uses the novel hashing scheme to map shares into size-1 buckets
Reconstruction: Aggregator attempts Lagrange interpolation for all combinations of t participants
Result Notification: Aggregator notifies participants of successfully reconstructed indices

Collusion-Resistant Deployment

Replaces shared keys with OPR-SS protocol, computing hash functions through multi-key OPRF protocol for enhanced collusion resistance.

Technical Innovation

Novel Hashing Scheme

Core Concept: Uses size-1 buckets to avoid hash collisions rather than accommodate them:

Collision Handling: When multiple elements map to the same bucket, uses pseudorandom ordering to select the smallest
Multi-Table Strategy: Each participant creates multiple tables, ensuring lost values appear in other tables
Failure Probability Control: Controls failure probability below 2⁻⁴⁰ using 20 tables

Key Advantages:

Aggregator only attempts participant combinations, not share combinations
Complexity reduces from exponential to linear (in M)
Avoids large constant factors of traditional bucketing methods

Optimization Techniques

Sorting Inversion: Consecutive two tables use the same sorting function with even tables reversing the order
Empty Bucket Utilization: Second insertion leverages empty buckets from first insertion

Experimental Setup

Dataset

Real Data: Network connection logs from 54 institutions in the CANARIE IDS project
Time Range: November 1-8, 2023, approximately 8 billion log entries daily
Data Scale: Approximately 700GB per day (gzip compressed)
Processing: Batch processing by hour, extracting external-to-internal IP connections

Evaluation Metrics

Reconstruction Time: Time for aggregator to complete reconstruction
Share Generation Time: Time for individual participant to create shares
Communication Complexity: Total communication overhead of the protocol
Correctness: Failure probability verification

Comparison Methods

Mahdavi et al. 34: Current state-of-the-art OT-MP-PSI solution
Theoretical Bounds: Comparison with computed failure probability upper bounds

Implementation Details

Programming Language: Julia, 430 lines of code
Cryptographic Libraries: SHA.jl, Nettle.jl
Finite Field: 61-bit Mersenne prime
Hardware Environment: 8×Intel Xeon E7-8870 processors, 80 physical cores, 1TB memory

Experimental Results

Main Results

Performance Comparison

Compared to Mahdavi et al. 34:

Speed Improvement: At least two orders of magnitude, up to 23,066× faster
Scalability: At M=10⁵, this method requires hundreds of seconds versus days for the comparison method

Real Data Performance

Performance on CANARIE data:

Average Reconstruction Time: 170 seconds
Maximum Reconstruction Time: 438 seconds (40 institutions, 220,011 IPs)
Average Participating Institutions: 33
Average Maximum Set Size: 144,045 IPs

Complexity Analysis

Computational Complexity

Aggregator: O(t²M(N choose t))
Participants (Non-interactive): O(tM)
Special Cases: O(M) when N=t=2, O(N²M) when N=t

Communication Complexity

Non-interactive Deployment: O(tMN)
Collusion-resistant Deployment: O(tkMN), 5 rounds of communication

Correctness Verification

Theoretical Failure Probability: 2⁻⁴⁰
Experimental Verification: In 10⁷ trials, actual failures far below theoretical upper bound
Table Count Optimization: Optimized from theoretical 28 tables to practical 20 tables

OT-MP-PSI Solutions

Kissner and Song 26: First solution using polynomial rings, O(N³M³) complexity
Mahdavi et al. 34: Constant rounds, O(M(N logM/t)²ᵗ) complexity
Ma et al. 33: Suitable for small input sets and small domains, O(N|S|) complexity

Threshold Private Set Intersection (TPSI): Learning intersection if and only if intersection size exceeds threshold
Quorum-PSI: Special case of OT-MP-PSI with output restricted to specific participants

Hashing Techniques

Cuckoo Hashing: Used to avoid hash collisions, widely applied in PSI protocols
2D Cuckoo Hashing: Designed for two-party PSI, achieving O(M) complexity

Conclusions and Discussion

Main Conclusions

Practical Breakthrough: First to make OT-MP-PSI practical for real network log scales
Efficiency Gains: Achieves orders of magnitude performance improvement through novel hashing scheme
Flexible Deployment: Provides two deployment options accommodating different security requirements
Practical Validation: Verifies protocol practicality in real multi-institutional network environments

Limitations

Semi-Honest Model: While extensible to malicious model, remains vulnerable to input substitution attacks
Set Size Leakage: Core protocol does not treat participant set sizes as private information
Aggregator Trust: Non-interactive deployment requires trusting the aggregator not to collude with participants
Threshold Constraints: Complexity advantages may diminish when t approaches N/2 with large N

Future Directions

Malicious Security: Extend protocol to resist active attackers
Dynamic Thresholds: Support multi-threshold computation without additional client cost
Large-Scale Optimization: Further optimize handling of participant combinations
Privacy Enhancement: Explore differential privacy techniques to protect set size information

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: Novel hashing scheme represents important breakthrough, reducing exponential to linear complexity
High Practical Value: Addresses critical privacy challenges in real-world collaborative intrusion detection
Comprehensive Experiments: Combines theoretical analysis with real data validation, well-designed experimental setup
Complete Engineering Implementation: Provides open-source implementation, enhancing reproducibility
Rigorous Security: Offers formal security proofs and dual deployment options

Weaknesses

Security Model Limitations: Semi-honest model may be insufficient for certain practical scenarios
Parameter Selection: Choice of 20 tables based on empirical optimization lacks sufficient theoretical guidance
Scalability Boundaries: Applicability to extreme scales (e.g., global internet-level) insufficiently explored
Limited Comparison Baselines: Primarily compares against one baseline method, lacking more comprehensive performance comparison

Impact

Academic Value: Provides new technical pathways for multiparty secure computation research
Practical Significance: Directly addresses real-world network security requirements
Technology Dissemination: Hashing scheme potentially applicable to other multiparty computation problems
Standardization Potential: Likely to become standard protocol for collaborative intrusion detection

Applicable Scenarios

Multi-Institutional Network Security: Collaborative protection among similar institutions like universities and hospitals
Cloud Security Services: Privacy-preserving log analysis in security operations centers
Threat Intelligence Sharing: Threat indicator exchange under privacy constraints
Compliance Requirements: Data collaboration satisfying GDPR and other privacy regulations

References

This paper cites 53 relevant references spanning cryptography, network security, and multiparty computation, providing solid theoretical foundation and comprehensive technical background.

Overall Assessment: This is a high-quality applied cryptography paper achieving excellent balance between theoretical innovation and practical application. The proposed hashing scheme represents significant theoretical breakthrough while demonstrating substantial practical value. The paper provides comprehensive experimental validation and rigorous security analysis, making important technical contributions to collaborative network security.