Fully Homomorphic Encryption (FHE) allows computations to be performed on encrypted data, significantly enhancing user privacy. However, the I/O challenges associated with deploying FHE applications remain understudied. We analyze the impact of storage I/O on the performance of FHE applications and summarize key lessons from the status quo. Key results include that storage I/O can degrade the performance of ASICs by as much as 357× and reduce GPU performance by up to 22×.
- Paper ID: 2511.04946
- Title: The Future of Fully Homomorphic Encryption System: from a Storage I/O Perspective
- Authors: Lei Chen, Erci Xu, Yiming Sun, Shengyu Fan, Xianglong Deng, Guiming Shi, Guang Fan, Liang Kong, Yilan Zhu, Shoumeng Yan, Mingzhe Zhang (from Ant Group, Shanghai Jiao Tong University, University of Chinese Academy of Sciences, Tsinghua University)
- Classification: cs.CR (Cryptography and Security), cs.DC (Distributed, Parallel, and Cluster Computing)
- Submission Date: November 7, 2025 to arXiv
- Paper Link: https://arxiv.org/abs/2511.04946
Fully Homomorphic Encryption (FHE) enables direct computation on encrypted data, significantly enhancing user privacy protection. However, the I/O challenges encountered when deploying FHE applications remain insufficiently studied. This paper analyzes the impact of storage I/O on FHE application performance and summarizes key lessons from the current state. Core results demonstrate that storage I/O can reduce ASIC performance by up to 357×, and GPU performance by up to 22×.
This paper focuses on the storage I/O bottleneck, which has been largely overlooked in FHE system deployment. Existing research has made significant progress in computational acceleration, reducing the slowdown relative to plaintext computation from 10^5× on CPUs to about 3× on recent accelerators, but the impact of storage I/O has received little attention.
- Real-world requirements in cloud computing: In multi-user cloud environments, each user possesses independent ciphertexts and evaluation keys, which may exceed device memory capacity
- Data scale explosion: FHE workflows significantly amplify data scale (e.g., 3KB image → 8MB plaintext polynomial → 16MB ciphertext → 5GB evaluation keys)
- Multi-user concurrency: Servers must simultaneously serve multiple users, making it impossible to maintain all user data in high-bandwidth memory (HBM)
Existing FHE accelerator research relies on two unrealistic assumptions:
- Assumption 1: All data is stored in HBM
- Assumption 2: Data fetching overhead from HBM to on-chip cache can be completely eliminated through static optimal prefetching strategies, data reuse algorithms, and large on-chip caches (200-500 MiB)
These assumptions are difficult to satisfy in practical cloud computing deployments because:
- HBM capacity is limited (approximately tens of GB)
- Multi-user environments cannot reserve space for all user data
- Large models (e.g., 13B parameter LLM requiring 26GB weights + 1.6GB KV cache) consume substantial HBM
- Static prefetching strategies have limited effectiveness when multiple applications compete for resources
This paper systematically quantifies the actual impact of I/O on FHE performance through comprehensive experiments, providing guidance for practical FHE system deployment.
- First systematic study: First in-depth analysis of storage I/O impact on FHE accelerator performance, filling a research gap in this field
- Comprehensive experimental evaluation: Using SimGrid simulator, testing representative FHE applications across multiple storage devices (HBM, DDR5, PCIe, RDMA) and network configurations
- Three key findings:
  - I/O access significantly reduces FHE application performance (ASIC up to 357×, GPU up to 22×)
  - Distributed computing does not always solve the problem; in some cases it reduces performance
  - I/O overhead impact varies across applications and FHE parameter settings
- Future research directions: Proposing locality-first scheduling, near-data processing, and I/O-friendly application implementations as solutions
- Open resource commitment: Commitment to publicly release traces and software to facilitate future research
This research aims to quantify the impact of storage I/O on end-to-end FHE application performance, specifically including:
- Input: Different storage hierarchies (HBM, DDR, PCIe, RDMA), different network configurations (Ethernet, FastFabric), different applications (ResNet-20, HELR)
- Output: Normalized performance metrics, execution time decomposition (computation/I/O/communication)
- Constraints: Simulating real cloud environment cold starts and multi-user scenarios
- Encodes input (e.g., vector of length n) into a polynomial with N coefficients (N/2 ≥ n)
- Uses Chinese Remainder Theorem (CRT) to decompose large integers into multiple small integers (called limbs)
- Modulus Q typically exceeds 1000 bits
- Data expansion: 3KB image → 8MB polynomial (N=2^16 coefficients)
- Encrypts plaintext polynomial using public key into ciphertext (containing two polynomials)
- Introduces random error polynomial to guarantee RLWE security
- Data expansion: 8MB plaintext → 16MB ciphertext
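The data expansion figures above can be checked with simple arithmetic. The sketch below assumes a modulus of exactly 1024 bits (the text only says it "typically exceeds 1000 bits") and ignores any metadata or packing overhead; the function and constant names are illustrative.

```python
# Back-of-the-envelope sizes for one CKKS plaintext polynomial and ciphertext.
N = 2**16            # number of polynomial coefficients (from the text)
LOG_Q_BITS = 1024    # ASSUMPTION: modulus rounded to 1024 bits ("exceeds 1000 bits")
MIB = 2**20


def polynomial_bytes(n_coeffs: int, coeff_bits: int) -> int:
    """Bytes needed to store n_coeffs coefficients of coeff_bits bits each."""
    return n_coeffs * coeff_bits // 8


plaintext_bytes = polynomial_bytes(N, LOG_Q_BITS)
ciphertext_bytes = 2 * plaintext_bytes   # a ciphertext contains two polynomials

print(f"plaintext : {plaintext_bytes / MIB:.0f} MiB")
print(f"ciphertext: {ciphertext_bytes / MIB:.0f} MiB")
```

Under these assumptions a single polynomial occupies 8 MiB and a ciphertext 16 MiB, which is consistent with the expansion chain quoted above.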
Supports five basic operations (see Table 1):
- PAdd/HAdd: Plaintext-ciphertext/ciphertext-ciphertext addition, complexity O(N)
- PMult/HMult: Plaintext-ciphertext/ciphertext-ciphertext multiplication, accelerated to O(N logN) using NTT
- HRot: Rotation operation, used to implement accumulation operations
- Key characteristic: HMult and HRot require access to evaluation keys (ResNet-20 requires over 100 different evaluation keys, totaling >5GB)
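To illustrate why rotation-based accumulation touches several evaluation keys, here is a plaintext stand-in for the rotate-and-add pattern. It is only a sketch: `rotate` and `rotate_and_sum` are hypothetical helpers operating on ordinary Python lists, not on ciphertexts, and the assumption that each distinct rotation amount needs its own key follows the description above rather than any specific library.

```python
def rotate(slots, k):
    """Plaintext stand-in for HRot: cyclic left rotation by k slots."""
    k %= len(slots)
    return slots[k:] + slots[:k]


def rotate_and_sum(slots):
    """Sum all slots using log2(n) rotations.

    Returns the accumulated vector and the list of rotation amounts used.
    """
    n = len(slots)
    assert n > 0 and n & (n - 1) == 0, "slot count must be a power of two"
    acc = list(slots)
    rotations_used = []
    step = 1
    while step < n:
        rotated = rotate(acc, step)                    # HRot: needs the key for `step`
        acc = [a + b for a, b in zip(acc, rotated)]    # HAdd: no evaluation key
        rotations_used.append(step)
        step *= 2
    return acc, rotations_used


result, keys_needed = rotate_and_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(result[0], keys_needed)
```

After the loop every slot holds the full sum, and the accumulation has used one rotation amount per doubling step. In an encrypted setting each of those amounts would correspond to a separate evaluation key that must be fetched.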
Decryption and decoding: the inverse operations of encryption and encoding
- Sharp: State-of-the-art ASIC accelerator (ISCA 2023)
- Uses simulator from original paper
- Baseline: Ideal performance (assuming sufficient HBM, all optimizations enabled)
- TensorFHE: State-of-the-art GPU acceleration approach (HPCA 2023)
- Uses publicly available code running on NVIDIA A100 40GB GPU
- Baseline: Optimal performance with all data in GPU memory
- HBM: 1 TiB/s bandwidth
- DDR5-5600: 358.4 GiB/s (8 channels)
- PCIe5 ×16: 64 GiB/s
- RDMA Disk: 12.5 GiB/s
- Cold start: Bypasses device cache, simulating multi-user cloud environment
- Throughput-only evaluation: FHE data accesses typically range from tens to hundreds of MB, so transfer time is dominated by bandwidth rather than latency
- Distributed simulation: Uses SimGrid simulator with star topology, supporting Ethernet (400Gb/s) and FastFabric (300GiB/s)
- HELR: Logistic regression training (MNIST dataset, 1024 images/batch, 32 training iterations)
- ResNet-20: CNN inference (CIFAR-10 dataset, implemented using CKKS)
Adopts residue-polynomial-level parallelism (rPLP) model:
- Represents large coefficient polynomials as a series of small coefficient residue polynomials
- Each server computes independent residue polynomials
- Most operations can be computed locally, reducing communication
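The residue decomposition behind rPLP can be sketched with the Chinese Remainder Theorem on a single coefficient. The moduli below are tiny toy values chosen for readability (an assumption; real limbs use much larger primes), and the helper names are illustrative.

```python
from math import prod

# ASSUMPTION: three small pairwise-coprime toy moduli standing in for real limbs.
MODULI = [97, 101, 103]
Q = prod(MODULI)


def to_limbs(x):
    """Decompose an integer modulo Q into one small residue per modulus."""
    return [x % q for q in MODULI]


def from_limbs(limbs):
    """Recombine residues into the unique value modulo Q (CRT reconstruction)."""
    total = 0
    for residue, q in zip(limbs, MODULI):
        q_hat = Q // q
        total += residue * q_hat * pow(q_hat, -1, q)   # modular inverse, Python 3.8+
    return total % Q


def limb_mul(a_limbs, b_limbs):
    """Multiply limb by limb; each limb could live on a different server."""
    return [(x * y) % q for x, y, q in zip(a_limbs, b_limbs, MODULI)]


a, b = 123456, 654321
product = from_limbs(limb_mul(to_limbs(a), to_limbs(b)))
print(product == (a * b) % Q)
```

Because each limb is processed independently, additions and multiplications need no cross-server traffic; communication only arises for operations that mix limbs.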
- First quantification of I/O impact: Breaks the limitation of existing research ignoring I/O, systematically evaluating real deployment scenarios
- Multi-dimensional evaluation framework: Combines comprehensive analysis of storage hierarchies, network configurations, accelerator types, and application characteristics
- Cache hit rate analysis: Reveals required cache hit rates to achieve target performance at different storage bandwidths (e.g., 80% performance requires 90.2%-99.9% hit rate)
- Distributed computing paradox: Discovers that distributed computing can reduce performance in certain configurations, challenging conventional wisdom
- MNIST: Used for HELR logistic regression training
- Batch size: 1024 images
- Training iterations: 32
- CIFAR-10: Used for ResNet-20 inference
- Single image inference
- Image size: 32×32×3
- Normalized performance: Performance ratio relative to ideal baseline
- Execution time: Absolute execution time (seconds)
- Time decomposition: Proportion of computation/I/O/communication overhead
- Speedup: Performance improvement of distributed computing relative to single machine
- I/O pressure: Average bytes accessed per cycle
- Baseline 1 (Sharp): Assumes unlimited HBM capacity, enables prefetching, scheduling, and data reuse optimizations
- Baseline 2 (TensorFHE): Optimal configuration with all data in GPU memory
- Comparison dimensions: Different storage hierarchies, different networks, different server counts (1/2/4/8/16/32)
- Sharp simulator:
  - Polynomial coefficients: 1555-bit integers
  - On-chip cache: Hundreds of MB
  - I/O pressure: Average 3381 bytes/cycle
- TensorFHE configuration:
  - ResNet-20: 840-bit integers
  - HELR: 1092-bit integers
  - I/O pressure: Average 101 bytes/cycle
  - Evaluation key size: 5.5× of Sharp
- SimGrid configuration:
  - Topology: Star network
  - Offline profiling of all GPU kernels
  - Imports profiling results to simulate distributed execution
ASIC (Sharp) Performance Degradation:
- HBM: ResNet-20 slows down by 2.63×, HELR by 5.5× (average 4.0×)
- DDR5: ResNet-20 slows down by 5.56×, HELR by 13.4×
- PCIe: ResNet-20 slows down by 26.5×, HELR by 70.6×
- RDMA: ResNet-20 slows down by 131.7×, HELR by 357.2× (maximum degradation)
GPU (TensorFHE) Performance Degradation:
- HBM: Slight slowdown of 1.2×
- DDR5: Slowdown of 1.5×
- PCIe: Slowdown of 3.8×
- RDMA: ResNet-20 slows down by 15.2×, HELR by 22×
Root Causes:
- Sharp's I/O pressure is extremely high (3381 bytes/cycle) vs. TensorFHE (101 bytes/cycle)
- The GPU's lower processing capability means it consumes data more slowly, so its I/O pressure is comparatively moderate
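The link between I/O pressure and slowdown can be illustrated with a rough bandwidth-bound estimate. This is not the paper's model: it assumes a 1 GHz accelerator clock (a hypothetical value, not stated in this summary) and that performance is limited purely by how fast data arrives, ignoring caching, prefetching, and per-application differences.

```python
GIB = 2**30

# Bandwidths from the storage configuration in this summary, in bytes per second.
BANDWIDTH = {
    "HBM": 1024 * GIB,         # 1 TiB/s
    "DDR5-5600": 358.4 * GIB,
    "PCIe5 x16": 64 * GIB,
    "RDMA disk": 12.5 * GIB,
}

CLOCK_HZ = 1e9  # ASSUMPTION: 1 GHz accelerator clock; not given in this summary


def bandwidth_bound_slowdown(bytes_per_cycle: float, bandwidth: float) -> float:
    """Slowdown if the accelerator can only run as fast as data arrives."""
    demand = bytes_per_cycle * CLOCK_HZ      # bytes/s needed to stay fully busy
    return max(1.0, demand / bandwidth)


SHARP_BYTES_PER_CYCLE = 3381  # average I/O pressure reported for Sharp

estimates = {
    name: bandwidth_bound_slowdown(SHARP_BYTES_PER_CYCLE, bw)
    for name, bw in BANDWIDTH.items()
}
for name, slowdown in estimates.items():
    print(f"{name:10s} ~{slowdown:6.1f}x")
```

Under the stated clock assumption, the estimates for Sharp land in the same order of magnitude as the degradations listed above, which supports the reading that the accelerator is bandwidth-bound on every tier. With TensorFHE's 101 bytes/cycle the same formula predicts no slowdown on HBM.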
To achieve 80% baseline performance, required cache hit rates:
- ResNet-20: HBM 90.2%, DDR 96.2%, PCIe 99.3%, RDMA 99.9%
- HELR: Higher requirements, RDMA requires near 100% hit rate
Implications: Low-bandwidth storage requires extremely high cache hit rates, which is practically difficult to achieve
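A simplified way to see why the required hit rates climb so steeply is a linear mix of hit and miss cost. The sketch below assumes that a cache hit runs at baseline speed and a miss runs at the full slowdown of the backing storage tier; this is an illustrative model, not the paper's methodology, and it yields somewhat lower thresholds than the reported ones.

```python
def required_hit_rate(miss_slowdown: float, target_fraction: float = 0.8) -> float:
    """Smallest hit rate h such that h * 1 + (1 - h) * miss_slowdown <= 1 / target_fraction.

    miss_slowdown: normalized execution time when every access misses.
    target_fraction: fraction of baseline performance to retain (0.8 = 80%).
    """
    budget = 1.0 / target_fraction          # allowed normalized execution time
    if miss_slowdown <= budget:
        return 0.0                          # target is met even with no hits
    return (miss_slowdown - budget) / (miss_slowdown - 1.0)


# ResNet-20 slowdowns on Sharp, taken from the results above.
for tier, slowdown in [("HBM", 2.63), ("DDR5", 5.56), ("PCIe", 26.5), ("RDMA", 131.7)]:
    print(f"{tier:5s} needs a hit rate of at least {required_hit_rate(slowdown):.1%}")
```

Even in this optimistic model, the slowest tiers require hit rates of 99% or more, which matches the qualitative conclusion that low-bandwidth storage leaves almost no room for misses.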
TensorFHE Performance:
- 32-server speedup:
  - Ethernet: 6.6× (effective)
  - FastFabric: 9.7× (more effective)
Sharp Performance (Complex Situation):
Using Ethernet with 32 servers:
- HBM: 6.08× slowdown (distribution hurts performance)
- DDR: 2.74× slowdown (distribution hurts performance)
- PCIe: Speedup 1.72×
- RDMA: Speedup 5.78×
Using FastFabric with 32 servers:
- HBM: No improvement (0.94×, a slight slowdown)
- DDR: Speedup 1.99×
- PCIe: Speedup 6.42×
- RDMA: Speedup 11.96×
Root Causes (Figure 7 Time Decomposition):
Sharp using 32 servers (PCIe+Ethernet):
- Computation overhead: 3.8%→0.3% (significantly reduced)
- I/O overhead: 96.2%→7.2% (significantly reduced)
- Communication overhead: 0%→92.5% (becomes new bottleneck!)
TensorFHE using 32 servers:
- Computation overhead: 40.1% (still significant due to GPU batching characteristics)
- I/O overhead: 18.1%
- Communication overhead: 41.8%
HELR vs. ResNet-20:
- HELR contains numerous rotation operations (implementing vector inner products), requiring frequent evaluation key access
- Sharp's I/O requirement for HELR: 5130 bytes/cycle vs. ResNet-20's 1633 bytes/cycle (3.1×)
- HELR performance degradation is more severe (e.g., 357× on RDMA)
Impact of Different FHE Parameters:
- Sharp's polynomial size is 1.85× that of TensorFHE for ResNet-20 and 1.43× for HELR
- However, TensorFHE's evaluation keys are 5.5× the size of Sharp's
- TensorFHE's total I/O data is 2.8× that of Sharp for ResNet-20 and 4.5× for HELR
Although the paper does not conduct traditional ablation studies, it achieves similar effects through multi-dimensional comparisons:
- Storage hierarchy ablation: HBM→DDR→PCIe→RDMA, progressively reducing bandwidth, observing performance changes
- Network configuration ablation: Ethernet vs. FastFabric, verifying communication bandwidth impact
- Server count ablation: 1/2/4/8/16/32 servers, analyzing scalability
- Accelerator type comparison: ASIC vs. GPU, revealing I/O sensitivity differences across architectures
ResNet-20 on Sharp in typical scenario (PCIe storage + Ethernet network):
- Single machine: Execution time approximately 3.8 seconds, I/O accounts for 96.2%
- 32 servers: Execution time approximately 2.2 seconds, communication accounts for 92.5%
- Limited performance improvement: Only 1.72× speedup, far below theoretical 32×
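The bottleneck shift in this case study becomes clearer when the percentages are converted into absolute times. The sketch below uses only the approximate totals and shares quoted above, so the results carry the same rounding error; the dictionary layout is illustrative.

```python
# Totals (seconds) and time shares for ResNet-20 on Sharp, PCIe storage + Ethernet.
single = {"total_s": 3.8, "compute": 0.038, "io": 0.962, "comm": 0.0}
dist32 = {"total_s": 2.2, "compute": 0.003, "io": 0.072, "comm": 0.925}


def absolute_times(breakdown):
    """Convert fractional shares into seconds per component."""
    total = breakdown["total_s"]
    return {part: total * share for part, share in breakdown.items() if part != "total_s"}


t1, t32 = absolute_times(single), absolute_times(dist32)
overall = single["total_s"] / dist32["total_s"]

print(f"overall speedup: {overall:.2f}x")
for part in ("compute", "io"):
    print(f"{part:8s} speedup: {t1[part] / t32[part]:.1f}x")
print(f"communication time added: {t32['comm']:.2f} s")
```

Computation and I/O each shrink by more than 20× when spread over 32 servers, but roughly two seconds of new communication time absorb almost all of that gain, which is why the end-to-end speedup stays below 2×.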
HELR on RDMA storage in extreme case:
- Sharp performance reduced 357×, nearly unusable
- Root cause: Low bandwidth (12.5 GiB/s) + high I/O requirement (5130 bytes/cycle)
- The I/O bottleneck is universal: Even HBM causes a 4× performance degradation
- ASICs are more sensitive: Because of their extremely high processing capability, I/O becomes a severe bottleneck
- Distributed computing is not a panacea: With high-bandwidth storage and a low-bandwidth network, distributed computing reduces performance
- Application characteristics are critical: Rotation-intensive applications (e.g., HELR) are more affected by I/O
- Parameter selection is important: Different FHE parameters lead to different I/O patterns and performance
The paper reviews the development history of FHE accelerators (Figure 1):
- CPU baseline: 10^5× slower than plaintext computation
- Early accelerators (2021-2022):
  - F1+ (MICRO'21)
  - BTS (ISCA'22)
  - CraterLake (ISCA'22)
  - ARK (MICRO'22)
- Recent progress (2023-2024):
  - Sharp (ISCA'23): Only about 3× slower than plaintext computation
  - TensorFHE (HPCA'23)
  - Trinity (MICRO'24)
  - HEAP (HPCA'24)
Most accelerator research assumes:
- Data location: All data in HBM
- Optimization techniques:
  - Static optimal prefetching strategies
  - Data reuse algorithm optimization (e.g., ARK's rotation optimization)
  - Large on-chip caches (200-500 MiB)
- ARK [30]: Algorithm optimization only applies to specific computation patterns (e.g., same-stride rotations in ResNet-20), not applicable to HELR and sorting
- Sharp [29]: Reports ideal performance without considering practical I/O constraints
- TensorFHE [21]: GPU implementation with relatively lower I/O pressure but still affected
- Fills gap: First systematic study of I/O impact
- Real scenarios: Considers multi-user cloud environments
- Quantitative analysis: Provides concrete performance data
- Comprehensive evaluation: Covers multiple configurations and applications
- I/O is a critical bottleneck for FHE deployment: Storage I/O can reduce ASIC performance by up to 357× and GPU performance by up to 22×, far exceeding the benefits of computational optimization
- Existing assumptions are unrealistic: The assumptions that all data fits in HBM and that fetch overhead can be eliminated are difficult to satisfy in cloud environments
- Distributed computing is not a silver bullet: In certain configurations (high-bandwidth storage + low-bandwidth network), distributed computing reduces performance
- Application and parameter sensitive: Different applications and FHE parameter choices lead to significantly different I/O behaviors
- Simulation-based experiments: Uses SimGrid simulator rather than real hardware, may have accuracy differences
- Limited application coverage: Only tests ResNet-20 and HELR applications
- Single FHE scheme: Only evaluates CKKS scheme, does not cover BGV, BFV, TFHE, etc.
- Static workloads: Does not consider dynamic multi-user load variations
- Simplified network model: Uses star topology, does not consider more complex network topologies
- Lacks real deployment verification: Findings not verified in actual cloud environments
The paper proposes three research directions:
- Locality-first scheduling
  - Problem: Distributed computing is not always beneficial
  - Approach:
    - Allocate dedicated servers to users to reduce I/O access
    - Study user access patterns
    - Pipeline access to hide context switching overhead
  - Challenge: Balance resource efficiency with performance
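The idea of keeping a user's requests on the server that already holds that user's data can be sketched as a toy scheduler. This is a hypothetical illustration, not a design from the paper: the class name, the one-server-per-user policy, and the least-loaded fallback are all assumptions.

```python
class LocalityFirstScheduler:
    """Toy scheduler: prefer the server that already holds a user's evaluation keys."""

    def __init__(self, num_servers: int):
        self.key_location = {}            # user -> server currently holding that user's keys
        self.load = [0] * num_servers     # number of requests assigned to each server

    def assign(self, user: str) -> int:
        """Return the server index chosen for this user's next request."""
        if user in self.key_location:
            # Locality hit: the keys are already resident, so no reload is needed.
            server = self.key_location[user]
        else:
            # Cold user: fall back to the least-loaded server and pin the keys there.
            server = min(range(len(self.load)), key=self.load.__getitem__)
            self.key_location[user] = server
        self.load[server] += 1
        return server


scheduler = LocalityFirstScheduler(num_servers=2)
first = scheduler.assign("alice")
other = scheduler.assign("bob")
again = scheduler.assign("alice")
print(first, other, again)
```

Pinning users to servers trades load balance for fewer key transfers, which is the resource-efficiency tension noted in the challenge above. A practical scheduler would also need eviction and rebalancing, which this sketch omits.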
- Near-data processing
  - Motivation: Evaluation keys are only accessed in specific operations (HRot, HMult)
  - Approach:
    - Integrate FHE computation components into storage devices
    - Design specialized compute units for specific operations
    - Execute I/O-intensive computations at storage end
  - Advantage: Significantly reduces I/O overhead between host and storage
- I/O-friendly application implementations
  - Observation: FHE addition does not require evaluation key access
  - Approach:
    - Restructure programs to leverage I/O characteristics
    - May increase computation overhead but reduce I/O
    - Combine with rapidly growing FHE accelerator processing capability
  - Example: Replace some multiplication/rotation operations with multiple additions
- Fills critical gap: First systematic study of FHE I/O bottleneck, breaking single perspective of computational acceleration research
- Significant practical value: Addresses real cloud deployment scenarios rather than idealized laboratory environments
- Timely: After significant progress in FHE computational acceleration, timely identifies next critical challenge
- Multi-dimensional evaluation: Storage hierarchy × network configuration × accelerator type × application × server count
- Realistic configurations: Cold starts, cache bypass, simulating multi-user cloud environment
- Sufficient comparisons: Covers complete storage hierarchy from HBM to RDMA
- Precise quantification: Provides concrete performance data (e.g., 357×, 22×) rather than vague descriptions
- Counter-intuitive conclusions: Distributed computing may reduce performance, challenging conventional wisdom
- Cache hit rate analysis: Reveals impracticality of 99.9% hit rate requirements
- Time decomposition: Clearly shows bottleneck shift from I/O to communication
- Application differences: Deep analysis of impact mechanisms across applications and parameters
- Sufficient background: Detailed explanation of FHE workflow and data expansion
- Rich figures: 11 figures effectively support arguments
- Rigorous logic: Clear progression from problem → experiment → findings → directions
- Reproducibility commitment: Commits to releasing traces and software
- Simulation rather than measurement: SimGrid simulation may not fully capture real hardware behavior (e.g., cache coherence, scheduling latency)
- Narrow application coverage: Only two applications, difficult to fully represent FHE application ecosystem
- Single FHE scheme: CKKS targets floating-point numbers, does not evaluate integer schemes (BGV, BFV) or binary schemes (TFHE, FHEW)
- Static load: Does not consider dynamic user request arrivals, load fluctuations, priorities, etc.
- Lacks theoretical model: No mathematical model established for I/O overhead vs. system parameters
- Prefetching strategy not deeply analyzed: Limited detailed analysis of different prefetching strategy effects
- Simplified cache management: Does not consider complex cache replacement strategies and multi-level caches
- Missing power analysis: Impact of I/O overhead on energy consumption not addressed
- Future directions lack detail: Three directions only conceptually described, lacking specific designs
- No prototype verification: Solutions like near-data processing lack prototype implementation to verify feasibility
- Insufficient trade-off analysis: Costs, complexity, and applicable scenarios of each solution not thoroughly discussed
- Sharp simulator dependency: Depends on original paper's simulator, cannot verify its accuracy
- Simplified network model: Star topology does not represent real data center networks (e.g., Clos, Fat-tree)
- Security not considered: Multi-user isolation, side-channel attacks, etc. not addressed
- Paradigm shift: Extends FHE research focus from pure computation to system level
- Warning effect: Alerts researchers to I/O bottleneck, avoiding excessive computational optimization
- Benchmark data: Provides performance data across different configurations as reference for future research
- Research inspiration: Three future directions may catalyze series of follow-up work
- Deployment guidance: Provides quantitative evidence for cloud service providers deploying FHE
- Architecture design: Guides next-generation FHE accelerator I/O subsystem design
- Parameter selection: Helps application developers choose FHE parameters based on I/O characteristics
- Cost assessment: Provides performance prediction for FHE cloud service pricing
- Open source commitment: Traces and software will be publicly released, facilitating verification and extension
- Detailed configuration: Experimental setup sufficiently described for reproduction
- Public code dependencies: TensorFHE uses publicly available implementation
- But challenges exist: Sharp simulator not publicly available, complete reproduction difficult
- Cloud FHE service planning: Cloud service providers evaluating FHE service feasibility and resource requirements
- FHE accelerator design: Hardware designers balancing computational capability with I/O subsystem
- Application optimization: FHE application developers optimizing algorithms based on I/O characteristics
- System research: Storage system researchers exploring FHE's special I/O patterns
- Single-user scenarios: Paper focuses on multi-user cloud environments; single-user may not face I/O constraints
- Small-scale data: When data completely fits in HBM, I/O impact is minimal
- Non-CKKS schemes: Other FHE schemes may have different I/O characteristics
- Edge computing: Edge devices' resource constraints and usage patterns differ from cloud
- Real hardware verification: Deployment and measurement in real cloud environments
- More FHE schemes: Extension to BGV, BFV, TFHE, etc.
- More applications: Database queries, genomic analysis, financial computation, etc.
- Dynamic load: Simulating realistic user request arrival patterns
- Security analysis: Impact of I/O optimization on side-channel attacks
- Prototype implementation: Implement near-data processing FHE storage device prototype
- Theoretical modeling: Establish performance models for I/O overhead
- Scheduling algorithms: Design locality-aware FHE task schedulers
The paper cites 46 references. Key references include:
- [29] Sharp (ISCA'23): State-of-the-art ASIC accelerator, main comparison target in this paper
- [21] TensorFHE (HPCA'23): GPU acceleration approach
- [30] ARK (MICRO'22): Proposes data reuse optimization
- [40] CraterLake (ISCA'22): Early ASIC design
- [15] CKKS: FHE scheme supporting floating-point numbers, adopted in this paper
- [12] BGV: Integer FHE scheme
- [11, 20] BFV: Another integer scheme
- [16] TFHE: Binary FHE scheme
- [24] HELR: Logistic regression training
- [34] ResNet-20: CNN inference
- [13] SimGrid: Distributed system simulator
Overall Assessment: This is a systems research paper with a unique perspective, solid experiments, and important findings. It fills a critical gap in FHE research regarding I/O bottlenecks and provides important warnings and guidance for practical FHE deployment. Despite limitations such as simulation-based experiments and limited application coverage, its core contribution of revealing the severity of I/O bottlenecks has significant academic and practical value. The three proposed future directions, particularly near-data processing, may steer FHE systems research in new directions. This is essential reading for cloud service providers, hardware designers, and FHE application developers.