2025-11-15T16:01:12.014757

Disaggregating Embedding Recommendation Systems with FlexEMR

Huang, Yang, Xing et al.

Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their increasingly large memory requirements. Today's practice splits the model across many monolithic servers, where a mix of GPUs, CPUs, and DRAM is provisioned in fixed proportions. This approach leads to suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference is a promising solution but raises novel networking challenges. In this paper, we discuss the design of FlexEMR for optimized EMR disaggregation. FlexEMR proposes two sets of techniques to tackle the networking challenges: Leveraging the temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests. We outline the design space for each technique and present initial results from our early prototype.

academic

Disaggregating Embedding Recommendation Systems with FlexEMR

Basic Information

Paper ID: 2410.12794
Title: A Disaggregation Approach to Embedding Recommendation Systems
Authors: Yibo Huang, Zhenning Yang, Jiarong Xing, Yi Dai, Yiming Qiu, Dingming Wu, Fan Lai, Ang Chen
Classification: cs.IR cs.AI
Publication Date/Venue: arXiv 2024 (Work in Progress)
Paper Link: https://arxiv.org/abs/2410.12794

Abstract

Efficiently serving embedding-based recommendation (EMR) models remains a significant challenge due to their ever-growing memory requirements. Current practice involves distributing models across multiple monolithic servers where GPUs, CPUs, and DRAM are provisioned in fixed ratios. This approach results in suboptimal resource utilization and increased costs. Disaggregating embedding operations from neural network inference presents a promising solution but introduces new networking challenges. This paper discusses the design of FlexEMR for optimizing EMR disaggregation. FlexEMR proposes two sets of techniques to address networking challenges: leveraging temporal and spatial locality of embedding lookups to reduce data movement over the network, and designing an optimized multi-threaded RDMA engine for concurrent lookup subrequests.

Research Background and Motivation

Problem Description

Massive Memory Requirements: Production-level EMR models' embedding tables can reach terabyte scale (e.g., Meta's 50TB DLRM model), accounting for over 99% of model parameters
Rigid Resource Configuration: Existing monolithic servers provision GPU, CPU, and DRAM in fixed ratios, unable to adapt to varying resource demands across different models and time periods
Low Cost Efficiency: Fixed resource configuration leads to resource waste, with studies showing up to 23.1% cost overhead

Significance

EMR models dominate the AI inference lifecycle in production data centers (e.g., Meta's data centers)
Widely deployed in e-commerce, search engines, short-video services, and other core internet businesses
Memory bottleneck has become the primary limiting factor for EMR model deployment

Limitations of Existing Approaches

Monolithic Server Architecture: Resources provisioned in fixed ratios, difficult to scale independently
GPU Memory Contention: Embedding caches compete with neural network computation for limited GPU memory
Insufficient Network Optimization: Existing RDMA systems lack optimization for EMR disaggregation scenarios

Core Contributions

Proposes FlexEMR Disaggregation Architecture: Completely separates embedding storage and neural network computation onto independent servers
Designs Locality-Enhancement Optimization: Leverages temporal and spatial locality to reduce network data transfer
Develops Multi-threaded RDMA Engine: Concurrent lookup engine optimized for EMR scenarios
Implements Adaptive Caching Strategy: Dynamically adjusts cache size to avoid GPU memory contention
Proposes Hierarchical Pooling Mechanism: Pushes partial pooling operations down to embedding servers

Methodology Details

Task Definition

Input: User queries containing categorical (sparse) and continuous (dense) features Output: Top-K ranked results of candidate items Constraints: Minimize total cost of ownership (TCO) while meeting service level objectives (SLO)

Model Architecture

Overall Architecture Design

FlexEMR employs a disaggregated architecture comprising:

Ranker Nodes: Equipped with GPUs, responsible for neural network inference computation
Embedding Servers: Equipped with CPUs and large memory, store embedding tables and process lookup requests
High-Speed Network: Connects the two types of nodes via RDMA and similar technologies

Core Module Functions

1. Adaptive Embedding Caching (§3.1.1)

Dynamic Load Monitoring: Uses sliding window algorithm to monitor task queue size
Memory Allocation Strategy: Dynamically adjusts cache size based on NN computation requirements
Asynchronous Data Exchange: Transparently performs swap-in/swap-out operations for hot embeddings

2. Hierarchical Embedding Pooling (§3.1.2)

Spatial Locality Exploitation: Identifies multiple vectors on the same embedding server
Distributed Pooling: Embedding servers perform local pooling, Rankers perform global pooling
Routing Table Optimization: Range-based routing tables reduce memory footprint

3. Multi-threaded RDMA Engine (§3.2)

Mapping-Aware Design: Eliminates contention between RNIC parallel units
Dynamic Connection Migration: Load balancing in response to skewed access patterns
Credit Flow Control: Fast credit control channel based on QoS

Technical Innovations

1. Adaptive Caching vs. Traditional Caching

Traditional Approach: Fixed-size GPU cache competing with NN computation for memory
FlexEMR: Dynamically adjusts cache size, balancing latency and throughput

2. Hierarchical Pooling vs. Centralized Pooling

Traditional Approach: All embedding vectors transferred to Ranker for pooling
FlexEMR: Leverages embedding server CPU resources for pre-aggregation

3. Mapping-Aware RDMA vs. Traditional Multi-threaded RDMA

Traditional Approach: Multi-threaded contention for RNIC resources, 62% performance degradation
FlexEMR: One-to-one mapping eliminates contention, 2.3x performance improvement

Experimental Setup

Datasets

MLPerf Framework: Standardized recommendation system benchmark tests
Meta Production Traces: Production-level embedding lookup traces from Meta
RMC2 Model: Representative recommendation model for performance evaluation

Evaluation Metrics

Throughput: Requests processed per second (rps)
Latency: Including median and P99 latency
GPU Memory Utilization: Maximum supported batch size
Network Transfer Efficiency: Data transfer volume and bandwidth utilization

Experimental Environment

Hardware Configuration: Intel Xeon servers (32 cores, 128GB memory), Nvidia A100 GPU (80GB)
Network: 100Gbps Mellanox RDMA NIC
Comparison Methods: Single-threaded RDMA baseline, fixed caching strategy

Implementation Details

Uses resource domain feature for RDMA mapping awareness
Sliding window size dynamically adjusted based on workload
Credit flow control implemented at connection-level QoS

Experimental Results

Main Results

1. GPU Memory Contention Analysis (Figure 7)

No cache: Maximum batch size approximately 2000
Large cache (75GB): Maximum batch size reduced to approximately 500
FlexEMR adaptive cache: Maintains high throughput while preserving latency advantages

2. Multi-threaded RDMA Performance (Figure 8 Left)

Baseline method: Performance degrades with increasing thread count
FlexEMR: 2.3x throughput improvement with 8 RDMA engines, reaching 15M rps

3. Credit Flow Control Effect (Figure 8 Right)

Median latency: FlexEMR reduces approximately 35% compared to baseline
P99 latency: Significant improvement in tail latency performance

Ablation Studies

The paper demonstrates independent contributions of each component:

Mapping-aware multi-threading: Addresses RNIC resource contention
Adaptive caching: Balances memory usage and performance
Hierarchical pooling: Reduces network transfer overhead

Key Findings

Memory Contention is Critical Bottleneck: GPU cache and NN computation memory contention significantly impacts performance
Network Optimization Highly Effective: Optimized RDMA engine substantially improves concurrent lookup performance
Locality Exploitation Effective: Temporal and spatial locality exploitation effectively reduces network overhead

Main Research Directions

GPU-Centric Approaches: Treating EMR as general deep learning models, primarily using GPU resources
Caching Optimization: Various embedding caching mechanisms to accelerate lookup operations
Specialized Hardware: FPGA and other specialized hardware accelerators for recommendation systems
Compression and Sharding: Embedding table compression and sharding optimization techniques

Advantages of This Work

Comprehensive Disaggregation Solution: First systematic EMR disaggregation architecture design
Network Optimization Focus: In-depth addressing of networking challenges introduced by disaggregation
Dynamic Adaptation Capability: Provides dynamic optimization compared to DisaggRec's static resource allocation

Conclusions and Discussion

Main Conclusions

EMR disaggregation architecture significantly improves resource utilization and cost efficiency
Locality-aware optimization effectively reduces network overhead
Targeted RDMA optimization is critical for disaggregation architecture performance
Adaptive strategies better suit dynamic workloads than static configuration

Limitations

Prototype Stage: Currently an early prototype lacking large-scale deployment validation
Network Dependency: Performance highly dependent on high-speed networks, increasing infrastructure costs
Increased Complexity: Disaggregation architecture increases system complexity and operational overhead
Latency Overhead: Network communication inevitably introduces additional latency

Future Directions

Extension to Other Models: Application to LLMs, multimodal models, MoE, etc.
Smarter Scheduling: Development of more sophisticated resource scheduling algorithms
Hardware Co-design: Collaboration with network hardware vendors for optimization
Fault Tolerance Mechanisms: Enhanced system robustness and failure recovery capabilities

In-Depth Evaluation

Strengths

Accurate Problem Identification: Precisely identifies core challenges and bottlenecks in EMR services
Reasonable Solution Design: Disaggregation architecture aligns with data center disaggregation trends
Effective Technical Innovations: Multiple technical innovations supported by experimental validation
High Practical Value: Addresses important problems in production environments

Weaknesses

Limited Evaluation Scope: Tested only in small-scale environments, lacking large-scale validation
Insufficient Cost Analysis: Lacks detailed cost-benefit analysis
Missing Fault Handling: Insufficient discussion of fault handling in disaggregated architecture
Integration with Existing Systems: Lacks discussion on integration with existing recommendation systems

Impact

Academic Contribution: Provides comprehensive technical framework for EMR system disaggregation
Industrial Value: Significant guidance for large-scale recommendation system deployment
Technology Advancement: Promotes application of disaggregated architecture in AI services
Standardization Potential: Could become reference standard for EMR disaggregation deployment

Applicable Scenarios

Large-Scale Recommendation Systems: Suitable for Meta, Alibaba, and other major internet companies
Resource-Constrained Environments: Data centers requiring resource utilization optimization
Dynamic Workload Scenarios: Recommendation services with significant workload variations
Cost-Sensitive Applications: Commercial scenarios with strict TCO requirements

References

The paper cites 61 related references, primarily including:

EMR system optimization work (e.g., AdaEmbed, RecSSD)
Disaggregated system architecture research (e.g., LegoOS, DxPU)
RDMA network optimization techniques (e.g., FaRM, Aeolus)
Recommendation system benchmarks (MLPerf, Meta DLRM datasets)

Overall Assessment: This is a high-quality systems research paper proposing an innovative disaggregation architecture solution addressing practical challenges in EMR services. While currently at the prototype stage, its technical approach demonstrates strong practical value and promotion potential, with significant implications for the development of recommendation system infrastructure.