Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.
- Paper ID: 2410.12794
- Title: A Disaggregation Approach to Embedding Recommendation Systems
- Authors: Yibo Huang, Zhenning Yang, Jiarong Xing, Yi Dai, Yiming Qiu, Dingming Wu, Fan Lai, Ang Chen
- Classification: cs.IR cs.AI
- Publication Date/Venue: arXiv 2024 (Work in Progress)
- Paper Link: https://arxiv.org/abs/2410.12794
- Massive Memory Requirements: Production-level EMR models' embedding tables can reach terabyte scale (e.g., Meta's 50TB DLRM model), accounting for over 99% of model parameters
- Rigid Resource Configuration: Existing monolithic servers provision GPU, CPU, and DRAM in fixed ratios, unable to adapt to varying resource demands across different models and time periods
- Low Cost Efficiency: Fixed resource configuration leads to resource waste, with studies showing up to 23.1% cost overhead
- EMR models dominate the AI inference lifecycle in production data centers (e.g., Meta's data centers)
- Widely deployed in e-commerce, search engines, short-video services, and other core internet businesses
- Memory bottleneck has become the primary limiting factor for EMR model deployment
- Monolithic Server Architecture: Resources provisioned in fixed ratios, difficult to scale independently
- GPU Memory Contention: Embedding caches compete with neural network computation for limited GPU memory
- Insufficient Network Optimization: Existing RDMA systems lack optimization for EMR disaggregation scenarios
- Proposes FlexEMR Disaggregation Architecture: Completely separates embedding storage and neural network computation onto independent servers
- Designs Locality-Enhancement Optimization: Leverages temporal and spatial locality to reduce network data transfer
- Develops Multi-threaded RDMA Engine: Concurrent lookup engine optimized for EMR scenarios
- Implements Adaptive Caching Strategy: Dynamically adjusts cache size to avoid GPU memory contention
- Proposes Hierarchical Pooling Mechanism: Pushes partial pooling operations down to embedding servers
Input: User queries containing categorical (sparse) and continuous (dense) features
Output: Top-K ranked results of candidate items
Constraints: Minimize total cost of ownership (TCO) while meeting service level objectives (SLO)
FlexEMR employs a disaggregated architecture comprising:
- Ranker Nodes: Equipped with GPUs, responsible for neural network inference computation
- Embedding Servers: Equipped with CPUs and large memory, store embedding tables and process lookup requests
- High-Speed Network: Connects the two types of nodes via RDMA and similar technologies
1. Adaptive Embedding Caching (§3.1.1)
- Dynamic Load Monitoring: Uses sliding window algorithm to monitor task queue size
- Memory Allocation Strategy: Dynamically adjusts cache size based on NN computation requirements
- Asynchronous Data Exchange: Transparently performs swap-in/swap-out operations for hot embeddings
2. Hierarchical Embedding Pooling (§3.1.2)
- Spatial Locality Exploitation: Identifies multiple vectors on the same embedding server
- Distributed Pooling: Embedding servers perform local pooling, Rankers perform global pooling
- Routing Table Optimization: Range-based routing tables reduce memory footprint
3. Multi-threaded RDMA Engine (§3.2)
- Mapping-Aware Design: Eliminates contention between RNIC parallel units
- Dynamic Connection Migration: Load balancing in response to skewed access patterns
- Credit Flow Control: Fast credit control channel based on QoS
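As a rough illustration of hierarchical embedding pooling with a range-based routing table, the sketch below groups lookup ids by owning embedding server, lets each server sum its own rows, and has the ranker sum the per-server partial results. The names `RANGE_STARTS`, `route`, `local_pool`, and `hierarchical_sum_pool` are hypothetical, the id ranges are made up, and the code assumes sum pooling (which can be split into partial sums); it is not the paper's implementation.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical range-based routing table: server i owns ids in
# [RANGE_STARTS[i], RANGE_STARTS[i + 1]). One integer per server replaces
# a per-id map, which is where the memory saving comes from.
RANGE_STARTS = [0, 1000, 2000]


def route(embedding_id):
    """Return the index of the embedding server that owns this id."""
    return bisect_right(RANGE_STARTS, embedding_id) - 1


def local_pool(table, ids):
    """Runs on an embedding server: sum the requested rows into one vector."""
    partial = [0.0] * len(table[ids[0]])
    for i in ids:
        for d, value in enumerate(table[i]):
            partial[d] += value
    return partial


def hierarchical_sum_pool(tables, ids):
    """Runs on the ranker: group ids by server, then sum the partial vectors."""
    by_server = defaultdict(list)
    for i in ids:
        by_server[route(i)].append(i)
    partials = [local_pool(tables[s], group) for s, group in by_server.items()]
    return [sum(column) for column in zip(*partials)]


tables = [
    {1: [1.0, 2.0], 2: [3.0, 4.0]},  # rows held by server 0
    {1500: [10.0, 20.0]},            # rows held by server 1
    {},                              # server 2 holds nothing in this toy setup
]
print(hierarchical_sum_pool(tables, [1, 2, 1500]))  # [14.0, 26.0]
```

In this toy run, two partial vectors cross the network instead of three raw embedding rows. Mean pooling would additionally need each server to return its row count.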
- Traditional Approach: Fixed-size GPU cache competing with NN computation for memory
- FlexEMR: Dynamically adjusts cache size, balancing latency and throughput
- Traditional Approach: All embedding vectors transferred to Ranker for pooling
- FlexEMR: Leverages embedding server CPU resources for pre-aggregation
- Traditional Approach: Multi-threaded contention for RNIC resources, 62% performance degradation
- FlexEMR: One-to-one mapping eliminates contention, 2.3x performance improvement
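The one-to-one mapping idea, combined with connection migration under skewed access, might look roughly like the sketch below. It assumes each connection is owned by exactly one engine thread (so engines never share a connection) and that a recent request count per connection is available. The function `rebalance`, the `threshold` parameter, and the gap-based selection rule are hypothetical choices for illustration; the summary does not describe the paper's actual migration policy.

```python
def rebalance(engine_load, threshold=1.5):
    """Move one connection from the busiest engine to the idlest one.

    engine_load maps engine id -> {connection id: recent request count}.
    Every connection appears under exactly one engine, before and after.
    Returns (connection, source, destination), or None if nothing moved.
    """
    totals = {e: sum(conns.values()) for e, conns in engine_load.items()}
    src = max(totals, key=totals.get)
    dst = min(totals, key=totals.get)
    # Skip if the skew is small or the busy engine has only one connection.
    if totals[src] <= threshold * max(totals[dst], 1):
        return None
    if len(engine_load[src]) < 2:
        return None
    # Pick the connection whose load is closest to half the gap.
    gap = (totals[src] - totals[dst]) / 2
    conn = min(engine_load[src], key=lambda c: abs(engine_load[src][c] - gap))
    engine_load[dst][conn] = engine_load[src].pop(conn)
    return conn, src, dst


load = {0: {"a": 90, "b": 30}, 1: {"c": 20}}
print(rebalance(load))  # ('b', 0, 1)
print(load)             # {0: {'a': 90}, 1: {'c': 20, 'b': 30}}
```

Moving whole connections, rather than splitting one connection across engines, keeps the exclusive ownership that the one-to-one mapping relies on.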
- MLPerf Framework: Standardized recommendation system benchmark tests
- Meta Production Traces: Production-level embedding lookup traces from Meta
- RMC2 Model: Representative recommendation model for performance evaluation
- Throughput: Requests processed per second (rps)
- Latency: Including median and P99 latency
- GPU Memory Utilization: Maximum supported batch size
- Network Transfer Efficiency: Data transfer volume and bandwidth utilization
- Hardware Configuration: Intel Xeon servers (32 cores, 128GB memory), Nvidia A100 GPU (80GB)
- Network: 100Gbps Mellanox RDMA NIC
- Comparison Methods: Single-threaded RDMA baseline, fixed caching strategy
- Uses the RDMA resource domain feature to make the engine aware of how threads map to RNIC resources
- Sliding window size dynamically adjusted based on workload
- Credit flow control is implemented using connection-level QoS
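The sliding-window monitoring noted above could drive cache sizing along the following lines. This is a minimal sketch under stated assumptions: the class `AdaptiveCacheSizer`, the linear `bytes_per_task` memory model, and the fixed window length are all hypothetical (the summary says the window size is itself adjusted dynamically, which is not modeled here).

```python
from collections import deque


class AdaptiveCacheSizer:
    """Hypothetical controller: shrink the cache when the NN task queue grows."""

    def __init__(self, total_gpu_bytes, min_cache_bytes, bytes_per_task, window=8):
        self.total_gpu_bytes = total_gpu_bytes
        self.min_cache_bytes = min_cache_bytes
        self.bytes_per_task = bytes_per_task  # assumed NN memory per queued task
        self.samples = deque(maxlen=window)   # sliding window of queue lengths

    def observe(self, queue_len):
        """Record one sample of the NN task queue length."""
        self.samples.append(queue_len)

    def target_cache_bytes(self):
        """Reserve memory for the windowed average load; the cache gets the rest."""
        avg_queue = sum(self.samples) / len(self.samples) if self.samples else 0.0
        nn_bytes = int(avg_queue * self.bytes_per_task)
        return max(self.min_cache_bytes, self.total_gpu_bytes - nn_bytes)


sizer = AdaptiveCacheSizer(total_gpu_bytes=80, min_cache_bytes=5,
                           bytes_per_task=1, window=4)
for _ in range(4):
    sizer.observe(10)
print(sizer.target_cache_bytes())  # 70
```

Averaging over a window rather than reacting to each sample is meant to avoid swapping hot embeddings in and out on every short burst.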
1. GPU Memory Contention Analysis (Figure 7)
- No cache: Maximum batch size approximately 2000
- Large cache (75GB): Maximum batch size reduced to approximately 500
- FlexEMR adaptive cache: Maintains high throughput while preserving latency advantages
2. Multi-threaded RDMA Performance (Figure 8 Left)
- Baseline method: Performance degrades with increasing thread count
- FlexEMR: 2.3x throughput improvement with 8 RDMA engines, reaching 15M rps
3. Credit Flow Control Effect (Figure 8 Right)
- Median latency: FlexEMR reduces median latency by approximately 35% compared to the baseline
- P99 latency: Significant improvement in tail latency performance
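For readers unfamiliar with credit-based flow control, the sender side can be sketched generically as below: a sender may only post as many subrequests as it holds credits, and credits are replenished by updates from the receiver. The class `CreditedSender` and its methods are hypothetical, and the `sent` list merely stands in for posting work to an RDMA connection. How FlexEMR carries credit updates over its QoS-based control channel is not detailed in this summary.

```python
from collections import deque


class CreditedSender:
    """Hypothetical sender side of credit-based flow control."""

    def __init__(self, initial_credits):
        self.credits = initial_credits  # receiver buffer slots we may still use
        self.pending = deque()          # subrequests waiting for a credit
        self.sent = []                  # stand-in for posting to the connection

    def submit(self, subrequest):
        """Queue a lookup subrequest and send it if a credit is available."""
        self.pending.append(subrequest)
        self._drain()

    def on_credit(self, n):
        """Handle a credit update, assumed to arrive on a separate control channel."""
        self.credits += n
        self._drain()

    def _drain(self):
        # Send in FIFO order while both credits and queued work remain.
        while self.credits > 0 and self.pending:
            self.sent.append(self.pending.popleft())
            self.credits -= 1


sender = CreditedSender(initial_credits=2)
for req in ["a", "b", "c"]:
    sender.submit(req)
print(sender.sent)   # ['a', 'b']
sender.on_credit(1)
print(sender.sent)   # ['a', 'b', 'c']
```

Because queued subrequests wait on credit updates, delays in delivering those updates show up directly as lookup latency, which is one plausible reason to give the credit channel priority.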
The paper demonstrates independent contributions of each component:
- Mapping-aware multi-threading: Addresses RNIC resource contention
- Adaptive caching: Balances memory usage and performance
- Hierarchical pooling: Reduces network transfer overhead
- Memory Contention Is a Critical Bottleneck: Contention for GPU memory between the embedding cache and NN computation significantly impacts performance
- Network Optimization Highly Effective: Optimized RDMA engine substantially improves concurrent lookup performance
- Locality Exploitation Effective: Temporal and spatial locality exploitation effectively reduces network overhead
- GPU-Centric Approaches: Treating EMR as general deep learning models, primarily using GPU resources
- Caching Optimization: Various embedding caching mechanisms to accelerate lookup operations
- Specialized Hardware: FPGA and other specialized hardware accelerators for recommendation systems
- Compression and Sharding: Embedding table compression and sharding optimization techniques
- Comprehensive Disaggregation Solution: First systematic EMR disaggregation architecture design
- Network Optimization Focus: Addresses in depth the networking challenges introduced by disaggregation
- Dynamic Adaptation Capability: Adapts resources dynamically, in contrast to DisaggRec's static resource allocation
- EMR disaggregation architecture significantly improves resource utilization and cost efficiency
- Locality-aware optimization effectively reduces network overhead
- Targeted RDMA optimization is critical for disaggregation architecture performance
- Adaptive strategies better suit dynamic workloads than static configuration
- Prototype Stage: Currently an early prototype lacking large-scale deployment validation
- Network Dependency: Performance highly dependent on high-speed networks, increasing infrastructure costs
- Increased Complexity: Disaggregation architecture increases system complexity and operational overhead
- Latency Overhead: Network communication inevitably introduces additional latency
- Extension to Other Models: Application to LLMs, multimodal models, MoE, etc.
- Smarter Scheduling: Development of more sophisticated resource scheduling algorithms
- Hardware Co-design: Collaboration with network hardware vendors for optimization
- Fault Tolerance Mechanisms: Enhanced system robustness and failure recovery capabilities
- Accurate Problem Identification: Precisely identifies core challenges and bottlenecks in EMR services
- Reasonable Solution Design: Disaggregation architecture aligns with data center disaggregation trends
- Effective Technical Innovations: Multiple technical innovations supported by experimental validation
- High Practical Value: Addresses important problems in production environments
- Limited Evaluation Scope: Tested only in small-scale environments, lacking large-scale validation
- Insufficient Cost Analysis: Lacks detailed cost-benefit analysis
- Missing Fault Handling: Insufficient discussion of fault handling in disaggregated architecture
- Integration with Existing Systems: Lacks discussion on integration with existing recommendation systems
- Academic Contribution: Provides comprehensive technical framework for EMR system disaggregation
- Industrial Value: Significant guidance for large-scale recommendation system deployment
- Technology Advancement: Promotes application of disaggregated architecture in AI services
- Standardization Potential: Could become reference standard for EMR disaggregation deployment
- Large-Scale Recommendation Systems: Suitable for Meta, Alibaba, and other major internet companies
- Resource-Constrained Environments: Data centers requiring resource utilization optimization
- Dynamic Workload Scenarios: Recommendation services with significant workload variations
- Cost-Sensitive Applications: Commercial scenarios with strict TCO requirements
The paper cites 61 related references, primarily including:
- EMR system optimization work (e.g., AdaEmbed, RecSSD)
- Disaggregated system architecture research (e.g., LegoOS, DxPU)
- RDMA network optimization techniques (e.g., FaRM, Aeolus)
- Recommendation system benchmarks (MLPerf, Meta DLRM datasets)
Overall Assessment: This is a high-quality systems research paper that proposes an innovative disaggregation architecture for practical challenges in EMR serving. Although the work is still at the prototype stage, its technical approach shows strong practical value and adoption potential, with significant implications for recommendation system infrastructure.