
The Future of Fully Homomorphic Encryption System: from a Storage I/O Perspective

Chen, Xu, Sun et al.

Fully Homomorphic Encryption (FHE) allows computations to be performed on encrypted data, significantly enhancing user privacy. However, the I/O challenges associated with deploying FHE applications remain understudied. We analyze the impact of storage I/O on the performance of FHE applications and summarize key lessons from the status quo. Key results include that storage I/O can degrade the performance of ASICs by as much as 357× and reduce GPU performance by up to 22×.



Basic Information

Paper ID: 2511.04946
Title: The Future of Fully Homomorphic Encryption System: from a Storage I/O Perspective
Authors: Lei Chen, Erci Xu, Yiming Sun, Shengyu Fan, Xianglong Deng, Guiming Shi, Guang Fan, Liang Kong, Yilan Zhu, Shoumeng Yan, Mingzhe Zhang (from Ant Group, Shanghai Jiao Tong University, University of Chinese Academy of Sciences, Tsinghua University)
Classification: cs.CR (Cryptography and Security), cs.DC (Distributed, Parallel, and Cluster Computing)
Submission Date: November 7, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.04946

Abstract

Fully Homomorphic Encryption (FHE) enables direct computation on encrypted data, significantly enhancing user privacy protection. However, the I/O challenges encountered when deploying FHE applications remain insufficiently studied. This paper analyzes the impact of storage I/O on FHE application performance and summarizes key lessons from the current state. Core results demonstrate that storage I/O can reduce ASIC performance by up to 357×, and GPU performance by up to 22×.

Research Background and Motivation

Problem Statement

This paper focuses on the storage I/O bottleneck, which has been largely overlooked in FHE system deployment. Existing research has made significant progress in computational acceleration, narrowing the slowdown relative to plaintext computation from about 10^5× on CPUs to only about 3× on recent accelerators, but the impact of storage I/O has received little attention.

Problem Significance

Real-world requirements in cloud computing: In multi-user cloud environments, each user possesses independent ciphertexts and evaluation keys, which may exceed device memory capacity
Data scale explosion: FHE workflows significantly amplify data scale (e.g., 3KB image → 8MB plaintext polynomial → 16MB ciphertext → 5GB evaluation keys)
Multi-user concurrency: Servers must simultaneously serve multiple users, making it impossible to maintain all user data in high-bandwidth memory (HBM)

Limitations of Existing Approaches

Existing FHE accelerator research relies on two unrealistic assumptions:

Assumption 1: All data is stored in HBM
Assumption 2: Data fetching overhead from HBM to on-chip cache can be completely eliminated through static optimal prefetching strategies, data reuse algorithms, and large on-chip caches (200-500 MiB)

These assumptions are difficult to satisfy in practical cloud computing deployments because:

HBM capacity is limited (approximately tens of GB)
Multi-user environments cannot reserve space for all user data
Large models (e.g., 13B parameter LLM requiring 26GB weights + 1.6GB KV cache) consume substantial HBM
Static prefetching strategies have limited effectiveness when multiple applications compete for resources

Research Motivation

This paper systematically quantifies the actual impact of I/O on FHE performance through comprehensive experiments, providing guidance for practical FHE system deployment.

Core Contributions

First systematic study: First in-depth analysis of storage I/O impact on FHE accelerator performance, filling a research gap in this field
Comprehensive experimental evaluation: Using SimGrid simulator, testing representative FHE applications across multiple storage devices (HBM, DDR5, PCIe, RDMA) and network configurations
Three key findings:
- I/O access significantly reduces FHE application performance (ASIC up to 357×, GPU up to 22×)
- Distributed computing does not always solve the problem; in some cases it reduces performance
- I/O overhead impact varies across applications and FHE parameter settings
Future research directions: Proposing locality-first scheduling, near-data processing, and I/O-friendly application implementations as solutions
Open resource commitment: Commitment to publicly release traces and software to facilitate future research

Methodology Details

Task Definition

This research aims to quantify the impact of storage I/O on end-to-end FHE application performance, specifically including:

Input: Different storage hierarchies (HBM, DDR, PCIe, RDMA), different network configurations (Ethernet, FastFabric), different applications (ResNet-20, HELR)
Output: Normalized performance metrics, execution time decomposition (computation/I/O/communication)
Constraints: Simulating real cloud environment cold starts and multi-user scenarios

FHE Workflow Details

1. Encode

Encodes input (e.g., vector of length n) into a polynomial with N coefficients (N/2 ≥ n)
Uses Chinese Remainder Theorem (CRT) to decompose large integers into multiple small integers (called limbs)
Modulus Q typically exceeds 1000 bits
Data expansion: 3KB image → 8MB polynomial (N=2^16 coefficients)
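
The limb decomposition can be illustrated with a toy example. This is a sketch rather than anything taken from the paper: the three small moduli are hypothetical stand-ins for the machine-word-sized primes a real CKKS library would use, and the value being decomposed is arbitrary.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; real FHE libraries use many word-sized primes.
MODULI = [97, 101, 103]
Q = prod(MODULI)  # the large modulus is the product of the small ones


def to_limbs(x: int) -> list[int]:
    """Decompose x (mod Q) into one small residue (limb) per modulus."""
    return [x % q for q in MODULI]


def from_limbs(limbs: list[int]) -> int:
    """Reconstruct x (mod Q) from its limbs via the Chinese Remainder Theorem."""
    x = 0
    for r, q in zip(limbs, MODULI):
        m = Q // q                    # product of all the other moduli
        x += r * m * pow(m, -1, q)    # pow(m, -1, q) is the inverse of m modulo q
    return x % Q


value = 123456
limbs = to_limbs(value)
assert from_limbs(limbs) == value
```

Each limb fits in a machine word, so arithmetic on a 1000-bit coefficient turns into independent arithmetic on many small integers.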

2. Encrypt

Encrypts plaintext polynomial using public key into ciphertext (containing two polynomials)
Introduces random error polynomial to guarantee RLWE security
Data expansion: 8MB plaintext → 16MB ciphertext
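
The expansion figures for encoding and encryption can be checked with a few lines of arithmetic. The sketch assumes a 1024-bit coefficient width as a stand-in for "modulus Q typically exceeds 1000 bits" and one byte per channel value of a 32×32×3 CIFAR-10 image; both are assumptions made for illustration.

```python
IMAGE_BYTES = 32 * 32 * 3     # one CIFAR-10 image, assuming 1 byte per channel value
N = 2 ** 16                   # number of polynomial coefficients
COEFF_BITS = 1024             # assumed coefficient width (Q "exceeds 1000 bits")

plaintext_bytes = N * COEFF_BITS // 8     # one encoded polynomial
ciphertext_bytes = 2 * plaintext_bytes    # a ciphertext holds two polynomials

KIB, MIB = 2 ** 10, 2 ** 20
print(f"image:      {IMAGE_BYTES / KIB:.0f} KiB")
print(f"plaintext:  {plaintext_bytes / MIB:.0f} MiB")
print(f"ciphertext: {ciphertext_bytes / MIB:.0f} MiB")
print(f"expansion:  {ciphertext_bytes // IMAGE_BYTES}x")
```

Under these assumptions a single image grows by more than three orders of magnitude before any evaluation key is loaded.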

3. Compute

Supports five basic operations (Table 1 in the paper):

PAdd/HAdd: Plaintext-ciphertext/ciphertext-ciphertext addition, complexity O(N)
PMult/HMult: Plaintext-ciphertext/ciphertext-ciphertext multiplication, accelerated to O(N log N) using NTT
HRot: Rotation operation, used to implement accumulation operations
Key characteristic: HMult and HRot require access to evaluation keys (ResNet-20 requires over 100 different evaluation keys, totaling >5GB)
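
To see why rotation drives evaluation-key traffic, consider how accumulation is usually built from HRot and HAdd. The sketch below works on a plaintext list standing in for ciphertext slots; `rotate` and `rotate_and_sum` are hypothetical helper names, not functions from any FHE library. In CKKS each distinct rotation amount typically needs its own evaluation key, so the list of steps approximates the set of keys touched.

```python
def rotate(slots: list[float], k: int) -> list[float]:
    """Plaintext stand-in for HRot: cyclically rotate the slot vector left by k."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def rotate_and_sum(slots: list[float]) -> tuple[list[float], list[int]]:
    """Sum all slots using log2(n) rotations; n must be a power of two.

    Returns the accumulated vector (every slot holds the total) and the
    rotation amounts that were used.
    """
    n = len(slots)
    assert n > 0 and n & (n - 1) == 0, "slot count must be a power of two"
    acc, steps = list(slots), []
    k = 1
    while k < n:
        rotated = rotate(acc, k)                      # HRot in a real FHE library
        acc = [a + b for a, b in zip(acc, rotated)]   # HAdd
        steps.append(k)
        k *= 2
    return acc, steps


total, steps = rotate_and_sum([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(total[0], steps)
```

An inner product over n slots therefore needs about log2(n) rotations with different amounts, which is consistent with rotation-heavy workloads touching many evaluation keys.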

4. Decrypt & Decode

Inverse operations of encryption and encoding

Experimental Architecture Design

Accelerator Selection

Sharp: State-of-the-art ASIC accelerator (ISCA 2023)
- Uses simulator from original paper
- Baseline: Ideal performance (assuming sufficient HBM, all optimizations enabled)
TensorFHE: State-of-the-art GPU acceleration approach (HPCA 2023)
- Uses publicly available code running on NVIDIA A100 40GB GPU
- Baseline: Optimal performance with all data in GPU memory

Storage Hierarchy

HBM: 1 TiB/s bandwidth
DDR5-5600: 358.4 GiB/s (8 channels)
PCIe5 ×16: 64 GiB/s
RDMA Disk: 12.5 GiB/s
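
A rough feel for these bandwidths comes from asking how long one pass over a set of evaluation keys takes on each tier. This is a back-of-the-envelope sketch, not the paper's simulator: it ignores latency, contention, and overlap with computation, and it assumes a 5 GiB key set based on the ResNet-20 figure quoted earlier.

```python
GIB = 2 ** 30

# Bandwidths from the storage hierarchy above, in bytes per second.
BANDWIDTH = {
    "HBM": 1024 * GIB,         # 1 TiB/s
    "DDR5-5600": 358.4 * GIB,
    "PCIe5 x16": 64 * GIB,
    "RDMA disk": 12.5 * GIB,
}


def transfer_seconds(num_bytes: float, tier: str) -> float:
    """Time to stream num_bytes from a tier, ignoring latency and contention."""
    return num_bytes / BANDWIDTH[tier]


EVAL_KEYS = 5 * GIB  # assumed size of one user's evaluation keys, read once

for tier in BANDWIDTH:
    print(f"{tier:10s} {transfer_seconds(EVAL_KEYS, tier) * 1e3:8.1f} ms")
```

Under these assumptions a single pass takes a few milliseconds from HBM but about 0.4 s from the RDMA disk, before any key is read more than once.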

Experimental Configuration

Cold start: Bypasses device cache, simulating multi-user cloud environment
Throughput-only evaluation: Each FHE data access typically moves tens to hundreds of MB, so transfer time is dominated by bandwidth rather than latency and only throughput is modeled
Distributed simulation: Uses SimGrid simulator with star topology, supporting Ethernet (400Gb/s) and FastFabric (300GiB/s)

Application Workloads

HELR: Logistic regression training (MNIST dataset, 1024 images/batch, 32 training iterations)
ResNet-20: CNN inference (CIFAR-10 dataset, implemented using CKKS)

Parallelism Model

Adopts residue-polynomial-level parallelism (rPLP) model:

Represents large coefficient polynomials as a series of small coefficient residue polynomials
Each server computes independent residue polynomials
Most operations can be computed locally, reducing communication
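
The locality property of rPLP can be sketched for the simplest case, polynomial addition. The moduli, the round-robin limb assignment, and the helper names are all hypothetical; operations that mix limbs (such as the basis conversions inside key switching) would still need communication and are not modeled here.

```python
from math import prod

MODULI = [97, 101, 103, 107]   # hypothetical small moduli, one per limb
NUM_SERVERS = 2


def to_residue_polys(poly: list[int]) -> list[list[int]]:
    """Split a big-coefficient polynomial into one residue polynomial per modulus."""
    return [[c % m for c in poly] for m in MODULI]


def server_add(limb_ids: list[int], a: list[list[int]], b: list[list[int]]) -> dict:
    """One server adds only the limbs it owns; no data from other servers is needed."""
    return {
        i: [(x + y) % MODULI[i] for x, y in zip(a[i], b[i])]
        for i in limb_ids
    }


# Round-robin assignment of limbs to servers.
owned = [list(range(s, len(MODULI), NUM_SERVERS)) for s in range(NUM_SERVERS)]

f = [123456, 7, 99999999]
g = [654321, 8, 11111111]
rf, rg = to_residue_polys(f), to_residue_polys(g)

result = {}
for limb_ids in owned:          # each iteration models one server working locally
    result.update(server_add(limb_ids, rf, rg))

# The merged limbs equal the residues of the directly computed sum modulo Q.
Q = prod(MODULI)
direct = to_residue_polys([(x + y) % Q for x, y in zip(f, g)])
assert [result[i] for i in range(len(MODULI))] == direct
```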

Technical Innovations

First quantification of I/O impact: Breaks the limitation of existing research ignoring I/O, systematically evaluating real deployment scenarios
Multi-dimensional evaluation framework: Combines comprehensive analysis of storage hierarchies, network configurations, accelerator types, and application characteristics
Cache hit rate analysis: Reveals required cache hit rates to achieve target performance at different storage bandwidths (e.g., 80% performance requires 90.2%-99.9% hit rate)
Distributed computing paradox: Discovers that distributed computing can reduce performance in certain configurations, challenging conventional wisdom

Experimental Setup

Datasets

MNIST: Used for HELR logistic regression training
- Batch size: 1024 images
- Training iterations: 32
CIFAR-10: Used for ResNet-20 inference
- Single image inference
- Image size: 32×32×3

Evaluation Metrics

Normalized performance: Performance ratio relative to ideal baseline
Execution time: Absolute execution time (seconds)
Time decomposition: Proportion of computation/I/O/communication overhead
Speedup: Performance improvement of distributed computing relative to single machine
I/O pressure: Average bytes accessed per cycle

Comparison Methods

Baseline 1 (Sharp): Assumes unlimited HBM capacity, enables prefetching, scheduling, and data reuse optimizations
Baseline 2 (TensorFHE): Optimal configuration with all data in GPU memory
Comparison dimensions: Different storage hierarchies, different networks, different server counts (1/2/4/8/16/32)

Implementation Details

Sharp simulator:
- Polynomial coefficients: 1555-bit integers
- On-chip cache: Hundreds of MB
- I/O pressure: Average 3381 bytes/cycle
TensorFHE configuration:
- ResNet-20: 840-bit integers
- HELR: 1092-bit integers
- I/O pressure: Average 101 bytes/cycle
- Evaluation key size: 5.5× that of Sharp
SimGrid configuration:
- Topology: Star network
- Offline profiling of all GPU kernels
- Imports profiling results to simulate distributed execution

Experimental Results

Main Results

Observation 1: Storage I/O Significantly Reduces Performance (Figure 4)

ASIC (Sharp) Performance Degradation:

HBM: ResNet-20 reduced 2.63×, HELR reduced 5.5× (average 4.0×)
DDR5: ResNet-20 reduced 5.56×, HELR reduced 13.4×
PCIe: ResNet-20 reduced 26.5×, HELR reduced 70.6×
RDMA: ResNet-20 reduced 131.7×, HELR reduced 357.2× (maximum degradation)

GPU (TensorFHE) Performance Degradation:

HBM: Slight reduction 1.2×
DDR5: Reduction 1.5×
PCIe: Reduction 3.8×
RDMA: ResNet-20 reduced 15.2×, HELR reduced 22×

Root Causes:

Sharp's I/O pressure is extremely high (3381 bytes/cycle) vs. TensorFHE (101 bytes/cycle)
The GPU computes more slowly than the ASIC, so it consumes data at a lower rate and the same storage bandwidth stalls it far less

Observation 2: Cache Hit Rate Requirements (Figure 5)

To achieve 80% baseline performance, required cache hit rates:

ResNet-20: HBM 90.2%, DDR 96.2%, PCIe 99.3%, RDMA 99.9%
HELR: Higher requirements, RDMA requires near 100% hit rate

Implications: Low-bandwidth storage requires extremely high cache hit rates, which is practically difficult to achieve
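
The shape of this requirement can be approximated with a simple linear model. This is an illustrative assumption, not the method used in the paper: a cache hit is taken to run at baseline speed and a miss at the fully I/O-bound speed, using the Sharp ResNet-20 slowdowns from Observation 1 as inputs. The model yields somewhat lower thresholds than the reported values but follows the same trend.

```python
def required_hit_rate(miss_slowdown: float, target_perf: float = 0.8) -> float:
    """Hit rate needed to reach target_perf (a fraction of baseline performance).

    Assumes normalized time = h * 1 + (1 - h) * miss_slowdown, i.e. a hit costs
    baseline time and a miss costs the fully I/O-bound time.
    """
    allowed_time = 1.0 / target_perf
    return 1.0 - (allowed_time - 1.0) / (miss_slowdown - 1.0)


# Sharp ResNet-20 slowdowns per storage tier, from Observation 1.
for tier, slowdown in [("HBM", 2.63), ("DDR5", 5.56), ("PCIe", 26.5), ("RDMA", 131.7)]:
    print(f"{tier:5s} {required_hit_rate(slowdown):.1%}")
```

The allowed miss rate shrinks roughly in proportion to 1/(slowdown - 1), which is why the slowest tiers leave almost no room for misses.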

Distributed Computing Results

Observation 3: Dual Nature of Distributed Computing (Figure 6)

TensorFHE Performance:

32-server speedup:
- Ethernet: 6.6× (effective)
- FastFabric: 9.7× (more effective)

Sharp Performance (Complex Situation): Using Ethernet with 32 servers:

HBM: 6.08× slower than a single server (distribution hurts)
DDR: 2.74× slower than a single server (distribution hurts)
PCIe: Speedup 1.72×
RDMA: Speedup 5.78×

Using FastFabric with 32 servers:

HBM: Slightly slower than a single server (0.94×)
DDR: Speedup 1.99×
PCIe: Speedup 6.42×
RDMA: Speedup 11.96×

Root Causes (Figure 7 Time Decomposition): Sharp using 32 servers (PCIe+Ethernet):

Computation overhead: 3.8%→0.3% (significantly reduced)
I/O overhead: 96.2%→7.2% (significantly reduced)
Communication overhead: 0%→92.5% (becomes new bottleneck!)

TensorFHE using 32 servers:

Computation overhead: 40.1% (still significant due to GPU batching characteristics)
I/O overhead: 18.1%
Communication overhead: 41.8%

Application Difference Analysis

Observation 4: I/O Sensitivity of Different Applications

HELR vs. ResNet-20:

HELR contains numerous rotation operations (implementing vector inner products), requiring frequent evaluation key access
Sharp's I/O requirement for HELR: 5130 bytes/cycle vs. ResNet-20's 1633 bytes/cycle (3.1×)
HELR performance degradation is more severe (e.g., 357× on RDMA)
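
The link between I/O pressure and slowdown can be sketched with a crude bandwidth bound: if the compute units consume a given number of bytes per cycle, storage must supply that rate or the accelerator stalls. The 1 GHz clock is an assumption (this summary does not state the clock frequency), and the model ignores caching and overlap, so it lands in the same order of magnitude as the reported slowdowns without reproducing them.

```python
GIB = 2 ** 30
CLOCK_HZ = 1e9  # assumption: 1 GHz accelerator clock, not stated in this summary

SUPPLY = {"HBM": 1024 * GIB, "DDR5": 358.4 * GIB, "PCIe": 64 * GIB, "RDMA": 12.5 * GIB}
DEMAND_BYTES_PER_CYCLE = {"ResNet-20": 1633, "HELR": 5130}


def bandwidth_bound_slowdown(bytes_per_cycle: float, supply: float) -> float:
    """Lower bound on slowdown when the accelerator stalls waiting for data."""
    demand = bytes_per_cycle * CLOCK_HZ   # bytes per second the compute units consume
    return max(1.0, demand / supply)


for app, bpc in DEMAND_BYTES_PER_CYCLE.items():
    row = ", ".join(
        f"{tier} {bandwidth_bound_slowdown(bpc, bw):6.1f}x" for tier, bw in SUPPLY.items()
    )
    print(f"{app:10s} {row}")
```

Whatever the clock, the ratio between the two applications on a bandwidth-bound tier equals the ratio of their bytes-per-cycle demand, about 3.1×.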

Impact of Different FHE Parameters:

Sharp's polynomials are larger than TensorFHE's: 1.85× for ResNet-20 and 1.43× for HELR
However, TensorFHE's evaluation keys are 5.5× the size of Sharp's
As a result, TensorFHE moves more total I/O data than Sharp: 2.8× for ResNet-20 and 4.5× for HELR

Ablation Studies

Although the paper does not conduct traditional ablation studies, it achieves similar effects through multi-dimensional comparisons:

Storage hierarchy ablation: HBM→DDR→PCIe→RDMA, progressively reducing bandwidth, observing performance changes
Network configuration ablation: Ethernet vs. FastFabric, verifying communication bandwidth impact
Server count ablation: 1/2/4/8/16/32 servers, analyzing scalability
Accelerator type comparison: ASIC vs. GPU, revealing I/O sensitivity differences across architectures

Case Studies

ResNet-20 on Sharp in typical scenario (PCIe storage + Ethernet network):

Single machine: Execution time approximately 3.8 seconds, I/O accounts for 96.2%
32 servers: Execution time approximately 2.2 seconds, communication accounts for 92.5%
Limited performance improvement: Only 1.72× speedup, far below theoretical 32×
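
Converting the percentages in this case study into absolute seconds makes the bottleneck shift concrete. The sketch uses only the approximate figures quoted above, so the computed speedup differs from the reported 1.72× by rounding.

```python
# Figures quoted above for ResNet-20 on Sharp with PCIe storage and Ethernet.
single = {"total_s": 3.8, "compute": 0.038, "io": 0.962, "comm": 0.0}
cluster = {"total_s": 2.2, "compute": 0.003, "io": 0.072, "comm": 0.925}


def breakdown(run: dict) -> dict:
    """Convert fractional shares of execution time into absolute seconds."""
    return {k: run["total_s"] * run[k] for k in ("compute", "io", "comm")}


one, many = breakdown(single), breakdown(cluster)
speedup = single["total_s"] / cluster["total_s"]

print(f"I/O time:           {one['io']:.2f} s -> {many['io']:.2f} s")
print(f"communication time: {one['comm']:.2f} s -> {many['comm']:.2f} s")
print(f"speedup:            {speedup:.2f}x of an ideal 32x")
```

I/O time drops from roughly 3.7 s to under 0.2 s, but about 2 s of new communication time absorbs most of the gain.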

HELR on RDMA storage in extreme case:

Sharp performance reduced 357×, nearly unusable
Root cause: Low bandwidth (12.5 GiB/s) + high I/O requirement (5130 bytes/cycle)

Experimental Findings

I/O bottleneck is universal: On the ASIC, even HBM causes an average 4× performance degradation
ASIC is more sensitive: Its compute throughput is so high that I/O becomes a severe bottleneck
Distributed computing is not a panacea: With high-bandwidth storage and a low-bandwidth network, distributed computing reduces performance
Application characteristics are critical: Rotation-intensive applications (e.g., HELR) are more affected by I/O
Parameter selection is important: Different FHE parameters lead to different I/O patterns and performance

FHE Computational Acceleration

The paper reviews the development history of FHE accelerators (Figure 1):

CPU baseline: 10^5× slower than plaintext computation
Early accelerators (2021-2022):
- F1+ (MICRO'21)
- BTS (ISCA'22)
- CraterLake (ISCA'22)
- ARK (MICRO'22)
Recent progress (2023-2024):
- Sharp (ISCA'23): Only 3× difference
- TensorFHE (HPCA'23)
- Trinity (MICRO'24)
- HEAP (HPCA'24)

Assumptions in Existing Work

Most accelerator research assumes:

Data location: All data in HBM
Optimization techniques:
- Static optimal prefetching strategies
- Data reuse algorithm optimization (e.g., ARK's rotation optimization)
- Large on-chip caches (200-500 MiB)

ARK [30]: Algorithm optimization only applies to specific computation patterns (e.g., same-stride rotations in ResNet-20), not applicable to HELR and sorting
Sharp [29]: Reports ideal performance without considering practical I/O constraints
TensorFHE [21]: GPU implementation with relatively lower I/O pressure but still affected

Advantages of This Paper

Fills gap: First systematic study of I/O impact
Real scenarios: Considers multi-user cloud environments
Quantitative analysis: Provides concrete performance data
Comprehensive evaluation: Covers multiple configurations and applications

Conclusions and Discussion

Main Conclusions

I/O is a critical bottleneck for FHE deployment: Storage I/O can reduce ASIC performance by up to 357× and GPU performance by up to 22×, far exceeding the benefits from computational optimization
Existing assumptions are unrealistic: The assumptions that all data fits in HBM and that fetch overhead can be eliminated are difficult to satisfy in cloud environments
Distributed computing is not a silver bullet: In certain configurations (high-bandwidth storage + low-bandwidth network), distributed computing reduces performance
Application and parameter sensitive: Different applications and FHE parameter choices lead to significantly different I/O behaviors

Limitations

Simulation-based experiments: Uses SimGrid simulator rather than real hardware, may have accuracy differences
Limited application coverage: Only tests ResNet-20 and HELR applications
Single FHE scheme: Only evaluates CKKS scheme, does not cover BGV, BFV, TFHE, etc.
Static workloads: Does not consider dynamic multi-user load variations
Simplified network model: Uses star topology, does not consider more complex network topologies
Lacks real deployment verification: Findings not verified in actual cloud environments

Future Directions

The paper proposes three research directions:

1. Locality-first Scheduling

Problem: Distributed computing is not always beneficial
Approach:
- Allocate dedicated servers to users to reduce I/O access
- Study user access patterns
- Pipeline access to hide context switching overhead
Challenge: Balance resource efficiency with performance
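
The intuition behind locality-first scheduling can be shown with a toy simulation. Everything here is hypothetical: the `Server` class, the one-user cache capacity, the request sequence, and the 5 GB key footprint per user (borrowed from the ResNet-20 figure) are illustrative assumptions, not a design from the paper.

```python
from collections import OrderedDict

KEY_GB = 5  # assumed evaluation-key footprint per user


class Server:
    """Holds the evaluation keys of at most `capacity` users, evicting LRU."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cached = OrderedDict()
        self.loads = 0                       # key sets fetched from storage

    def run(self, user: str) -> None:
        if user in self.cached:
            self.cached.move_to_end(user)    # hit: keys already resident
            return
        self.loads += 1                      # miss: fetch this user's keys
        self.cached[user] = True
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)  # evict the least recently used user


def simulate(requests: list[str], n_servers: int, locality: bool) -> int:
    """Return the total number of key loads for a dispatch policy."""
    servers = [Server(capacity=1) for _ in range(n_servers)]
    users = sorted(set(requests))
    for i, user in enumerate(requests):
        # Locality-first pins each user to one server; the baseline is round-robin.
        target = users.index(user) % n_servers if locality else i % n_servers
        servers[target].run(user)
    return sum(s.loads for s in servers)


requests = ["a", "a", "b", "b", "a", "a", "b", "b"]
rr = simulate(requests, n_servers=2, locality=False)
loc = simulate(requests, n_servers=2, locality=True)
print(f"round-robin: {rr} loads ({rr * KEY_GB} GB), locality-first: {loc} loads ({loc * KEY_GB} GB)")
```

In this toy trace, pinning users to servers cuts key loads from 8 to 2. The cost is the resource-efficiency trade-off noted above: a pinned server may sit idle while another is overloaded.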

2. Near-Data Processing (Most Promising)

Motivation: Evaluation keys are only accessed in specific operations (HRot, HMult)
Approach:
- Integrate FHE computation components into storage devices
- Design specialized compute units for specific operations
- Execute I/O-intensive computations at storage end
Advantage: Significantly reduces I/O overhead between host and storage

3. I/O-Friendly Application Implementation

Observation: FHE addition does not require evaluation key access
Approach:
- Restructure programs to leverage I/O characteristics
- May increase computation overhead but reduce I/O
- Combine with rapidly growing FHE accelerator processing capability
Example: Replace some multiplication/rotation operations with multiple additions

In-Depth Evaluation

Strengths

1. Unique and Important Research Perspective

Fills critical gap: First systematic study of FHE I/O bottleneck, breaking single perspective of computational acceleration research
Significant practical value: Addresses real cloud deployment scenarios rather than idealized laboratory environments
Timely: Identifies the next critical challenge just as FHE computational acceleration has matured

2. Comprehensive and Rigorous Experimental Design

Multi-dimensional evaluation: Storage hierarchy × network configuration × accelerator type × application × server count
Realistic configurations: Cold starts, cache bypass, simulating multi-user cloud environment
Sufficient comparisons: Covers complete storage hierarchy from HBM to RDMA
Precise quantification: Provides concrete performance data (e.g., 357×, 22×) rather than vague descriptions

3. Insightful Findings

Counter-intuitive conclusions: Distributed computing may reduce performance, challenging conventional wisdom
Cache hit rate analysis: Reveals impracticality of 99.9% hit rate requirements
Time decomposition: Clearly shows bottleneck shift from I/O to communication
Application differences: Deep analysis of impact mechanisms across applications and parameters

4. Clear Writing and Complete Structure

Sufficient background: Detailed explanation of FHE workflow and data expansion
Rich figures: 11 figures effectively support arguments
Rigorous logic: Clear progression from problem → experiment → findings → directions
Reproducibility commitment: Commits to releasing traces and software

Weaknesses

1. Experimental Limitations

Simulation rather than measurement: SimGrid simulation may not fully capture real hardware behavior (e.g., cache coherence, scheduling latency)
Narrow application coverage: Only two applications, difficult to fully represent FHE application ecosystem
Single FHE scheme: CKKS targets floating-point numbers, does not evaluate integer schemes (BGV, BFV) or binary schemes (TFHE, FHEW)
Static load: Does not consider dynamic user request arrivals, load fluctuations, priorities, etc.

2. Analysis Depth Could Be Improved

Lacks theoretical model: No mathematical model established for I/O overhead vs. system parameters
Prefetching strategy not deeply analyzed: Limited detailed analysis of different prefetching strategy effects
Simplified cache management: Does not consider complex cache replacement strategies and multi-level caches
Missing power analysis: Impact of I/O overhead on energy consumption not addressed

3. Preliminary Solutions

Future directions lack detail: Three directions only conceptually described, lacking specific designs
No prototype verification: Solutions like near-data processing lack prototype implementation to verify feasibility
Insufficient trade-off analysis: Costs, complexity, and applicable scenarios of each solution not thoroughly discussed

4. Experimental Setup Issues

Sharp simulator dependency: Depends on original paper's simulator, cannot verify its accuracy
Simplified network model: Star topology does not represent real data center networks (e.g., Clos, Fat-tree)
Security not considered: Multi-user isolation, side-channel attacks, etc. not addressed

Impact

Contribution to the Field

Paradigm shift: Extends FHE research focus from pure computation to system level
Warning effect: Alerts researchers to I/O bottleneck, avoiding excessive computational optimization
Benchmark data: Provides performance data across different configurations as reference for future research
Research inspiration: Three future directions may catalyze series of follow-up work

Practical Value

Deployment guidance: Provides quantitative evidence for cloud service providers deploying FHE
Architecture design: Guides next-generation FHE accelerator I/O subsystem design
Parameter selection: Helps application developers choose FHE parameters based on I/O characteristics
Cost assessment: Provides performance prediction for FHE cloud service pricing

Reproducibility

Open source commitment: Traces and software will be publicly released, facilitating verification and extension
Detailed configuration: Experimental setup sufficiently described for reproduction
Public code dependencies: TensorFHE uses publicly available implementation
But challenges exist: Sharp simulator not publicly available, complete reproduction difficult

Applicable Scenarios

Suitable Scenarios

Cloud FHE service planning: Cloud service providers evaluating FHE service feasibility and resource requirements
FHE accelerator design: Hardware designers balancing computational capability with I/O subsystem
Application optimization: FHE application developers optimizing algorithms based on I/O characteristics
System research: Storage system researchers exploring FHE's special I/O patterns

Less Suitable Scenarios

Single-user scenarios: Paper focuses on multi-user cloud environments; single-user may not face I/O constraints
Small-scale data: When data completely fits in HBM, I/O impact is minimal
Non-CKKS schemes: Other FHE schemes may have different I/O characteristics
Edge computing: Edge devices' resource constraints and usage patterns differ from cloud

Potential Extension Directions

Real hardware verification: Deployment and measurement in real cloud environments
More FHE schemes: Extension to BGV, BFV, TFHE, etc.
More applications: Database queries, genomic analysis, financial computation, etc.
Dynamic load: Simulating realistic user request arrival patterns
Security analysis: Impact of I/O optimization on side-channel attacks
Prototype implementation: Implement near-data processing FHE storage device prototype
Theoretical modeling: Establish performance models for I/O overhead
Scheduling algorithms: Design locality-aware FHE task schedulers

References

The paper cites 46 references. Key references include:

FHE Accelerators

[29] Sharp (ISCA'23): State-of-the-art ASIC accelerator, main comparison target in this paper
[21] TensorFHE (HPCA'23): GPU acceleration approach
[30] ARK (MICRO'22): Proposes data reuse optimization
[40] CraterLake (ISCA'22): Early ASIC design

FHE Schemes

[15] CKKS: FHE scheme supporting floating-point numbers, adopted in this paper
[12] BGV: Integer FHE scheme
[11, 20] BFV: Another integer scheme
[16] TFHE: Binary FHE scheme

Applications

[24] HELR: Logistic regression training
[34] ResNet-20: CNN inference

System Tools

[13] SimGrid: Distributed system simulator

Overall Assessment: This is a systems research paper with a unique perspective, solid experiments, and important findings. It fills a critical gap in FHE research regarding I/O bottlenecks and provides warnings and guidance for practical FHE deployment. Despite limitations such as simulation-based experiments and limited application coverage, its core contribution, revealing the severity of I/O bottlenecks, has significant academic and practical value. The three proposed future directions, particularly near-data processing, may lead FHE systems research in new directions. It is worthwhile reading for cloud service providers, hardware designers, and FHE application developers.