Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems

Lakew, Svärd, Elmroth et al.
Disaggregated systems have a novel architecture motivated by the requirements of resource-intensive applications such as social networking, search, and in-memory databases. The total amount of resources such as memory and CPU cores is very large in such systems. However, the distributed topology of disaggregated server systems results in non-uniform access latency and performance, with NUMA effects inside each box as well as additional access latency for remote resources. In this work, we study the effects of complex NUMA topologies on application performance and propose a method for improved, NUMA-aware mapping for virtualized environments running on disaggregated systems. Our mapping algorithm is based on pinning of virtual cores and/or migration of memory across a disaggregated system and takes into account application performance, resource contention, and utilization. The proposed method is evaluated on a system with 288 cores and around 1TB of memory, composed of six disaggregated commodity servers, through a combination of benchmarks and real applications such as memory-intensive graph databases. Our evaluation demonstrates significant improvement over the vanilla resource mapping methods. Overall, the mapping algorithm improves performance by a significant margin compared to the default Linux scheduler used in the system.

Basic Information

  • Paper ID: 2501.01356
  • Title: Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems
  • Authors: Ewnetu Bayuh Lakew, Petter Svärd, Erik Elmroth, Johan Tordsson (Umeå University, Sweden)
  • Classification: cs.DC (Distributed, Parallel and Cluster Computing)
  • Publication Date: January 2, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.01356

Abstract

This paper investigates the impact of complex NUMA topologies on application performance in disaggregated systems and proposes an improved NUMA-aware mapping method. The approach is based on virtual core binding and memory migration, comprehensively considering application performance, resource contention, and utilization. Evaluation on a disaggregated system comprising 6 commodity servers with 288 cores and approximately 1TB memory demonstrates significant performance improvements compared to the default Linux scheduler.

Research Background and Motivation

Problem Definition

  1. Disaggregated System Architecture Challenges: Disaggregated systems support resource-intensive applications (such as social networks, search engines, and in-memory databases) by aggregating resources from multiple physical servers, but distributed topologies introduce non-uniform access latencies and performance issues
  2. Multi-Level NUMA Complexity: The system simultaneously exhibits intra-machine NUMA characteristics and cross-machine remote resource access latencies, forming a complex multi-level NUMA topology
  3. Virtualization Environment Optimization: Existing Linux schedulers cannot effectively handle such complex resource mapping scenarios

Research Significance

  • Modern applications' computational resource demands exceed single-machine capabilities, making disaggregated systems an important development direction
  • Resource mapping strategies directly impact application performance; improper mapping can cause severe performance degradation
  • Comprehensive optimization considering resource contention, locality, and interference degree is necessary

Limitations of Existing Methods

  • Traditional NUMA optimization work primarily targets small-scale systems or relies on simulation-based evaluation
  • Lacks empirical measurement studies on large-scale disaggregated systems using actual hardware
  • Insufficiently considers the combined impact of resource contention, locality, and interference degree

Core Contributions

  1. First In-Depth Empirical Study of Disaggregated Systems: Conducts comprehensive measurements on real disaggregated hardware, considering resource contention, locality, and interference degree
  2. Application Classification and Performance Metrics Framework: Classifies applications using the Animal Classes scheme and uses IPC (instructions per cycle) and MPI (misses per instruction) as performance indicators
  3. NUMA-Aware Mapping Algorithm: Proposes an online mapping algorithm considering application classification, resource proximity, and runtime hardware performance counters
  4. Significant Performance Improvements: Achieves average 50× performance improvement on actual systems

Methodology Details

Task Definition

  • Input: Virtual machine requests (including CPU core count and memory requirements), application classification, system resource state
  • Output: Optimal mapping scheme from virtual CPUs to physical CPUs
  • Constraints: Avoid resource oversubscription, minimize NUMA distance, reduce inter-application interference

Application Classification Framework

Based on Animal Classes classification, applications are categorized into three types:

  • Sheep (Benign): Applications not easily affected by cache sharing
  • Rabbit (Sensitive): Fast-performing applications that degrade quickly when cache allocation is insufficient or shared
  • Devil (Destructive): Applications with frequent cache accesses and high miss rates, impacting other applications' performance

Additionally, applications are further classified as sensitive or insensitive based on remote memory sensitivity.

Mapping Algorithm Architecture

Two-Stage Mapping Strategy

Stage 1: Remoteness Handling (Upon Application Arrival)

if VMi is a new arrival then
    if Free slot is suitable for VMi given ci, ai then
        Map VMi directly
    else
        Reshuffle existing VMs to create suitable slot
        Map VMi to new slot
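
The arrival-time stage above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Slot` and `VM` types, `find_free_slot`, `map_on_arrival`, and the toy `consolidate` reshuffle are all hypothetical names. The pseudocode's ci and ai are not defined in this summary; the sketch assumes a core count and an application class, and it judges suitability by free cores only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Slot:
    """Free capacity on one NUMA node (hypothetical model)."""
    node: int
    free_cores: int


@dataclass
class VM:
    name: str
    cores: int        # requested virtual cores (assumed meaning)
    app_class: str    # "sheep", "rabbit" or "devil" (assumed meaning)


def find_free_slot(slots: List[Slot], vm: VM) -> Optional[Slot]:
    """Return the first slot with enough free cores, or None."""
    for slot in slots:
        if slot.free_cores >= vm.cores:
            return slot
    return None


def map_on_arrival(slots: List[Slot], vm: VM,
                   reshuffle: Callable[[List[Slot], VM], None]) -> int:
    """Map a newly arrived VM, reshuffling existing VMs only if needed."""
    slot = find_free_slot(slots, vm)
    if slot is None:
        # No suitable slot: ask the caller-supplied policy to make room.
        reshuffle(slots, vm)
        slot = find_free_slot(slots, vm)
    if slot is None:
        raise RuntimeError(f"no slot can host {vm.name}")
    slot.free_cores -= vm.cores
    return slot.node


def consolidate(slots: List[Slot], vm: VM) -> None:
    """Toy reshuffle: pretend migrations pool all free cores on the first node."""
    total = sum(s.free_cores for s in slots)
    for s in slots:
        s.free_cores = 0
    slots[0].free_cores = total
```

A real reshuffle policy would also have to weigh NUMA distance and class compatibility; the callback keeps that decision out of the arrival logic.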

Stage 2: Interference Minimization (Runtime Optimization)

for each VMi do
    if (expected_perf - measured_perf)/expected_perf ≥ Threshold then
        Add VMi to affected list
        
for each affected VM do
    Build potential neighbor list based on class compatibility
    Compute new configuration with minimal reshuffle
    Remap if beneficial
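
The degradation test in the first loop above is a relative shortfall against expected performance. A minimal sketch of that check follows; the function names, the dictionary layout, and the threshold of 0.2 in the usage note are illustrative assumptions, since this summary does not state the threshold used.

```python
from typing import Dict, List, Tuple


def relative_degradation(expected_perf: float, measured_perf: float) -> float:
    """Return (expected - measured) / expected, as in the pseudocode."""
    if expected_perf <= 0:
        raise ValueError("expected_perf must be positive")
    return (expected_perf - measured_perf) / expected_perf


def affected_vms(perf: Dict[str, Tuple[float, float]],
                 threshold: float) -> List[str]:
    """List VMs whose relative degradation reaches the threshold.

    perf maps a VM name to an (expected, measured) pair, e.g. IPC values.
    """
    return [name for name, (expected, measured) in perf.items()
            if relative_degradation(expected, measured) >= threshold]
```

For example, with `perf = {"vm1": (2.0, 1.9), "vm2": (2.0, 1.0)}` and a hypothetical threshold of 0.2, only `vm2` (a 50% shortfall) lands on the affected list.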

Application Compatibility Matrix

| Application Type | Sheep | Rabbit | Devil |
|------------------|-------|--------|-------|
| Sheep            |       |        |       |
| Rabbit           |       |        |       |
| Devil            |       |        |       |

Benefit Assessment Matrix

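| Application Type | Socket-Level | NUMA Node-Level | Server-Level |
|------------------|--------------|-----------------|--------------|
| Sheep            | 1            | 5               | 8            |
| Rabbit           | 4            | 7               | 9            |
| Devil            | 1            | 6               | 9            |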

Performance Monitoring Mechanism

  • IPC (Instructions Per Cycle): Indicates relative application performance; higher values indicate better performance
  • MPI (Misses Per Instruction): Measures cache miss rate; lower values indicate better performance
  • Utilizes Linux Perf tool to collect hardware performance counters in real-time
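
Both metrics are simple ratios of hardware counter values. The sketch below assumes the raw counts have already been collected, for example with `perf stat -e instructions,cycles,cache-misses`; the function names are illustrative and which cache-miss event the authors actually used is not stated in this summary.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle; higher is better."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def mpi(cache_misses: int, instructions: int) -> float:
    """Cache misses per instruction; lower is better."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cache_misses / instructions
```

For instance, 2,000,000 instructions retired over 1,000,000 cycles with 5,000 cache misses gives an IPC of 2.0 and an MPI of 0.005.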

Experimental Setup

Hardware Platform

  • System Configuration: 6 IBM x3755 M3 servers
  • Processors: 2× AMD 6380 per server (48 cores per server)
  • Memory: 192GB RAM per server, 1176GB total
  • Network: NumaConnect N323 adapter, 2D ring topology
  • Total Resources: 288 cores, approximately 1TB memory

NumaConnect Technology Characteristics

  • Cache-coherent shared memory system
  • Unified programming model, transparent to applications
  • NUMA distances: Local 10, Neighbors 16/22, Remote 160/200
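
On Linux, per-node NUMA distances are exposed as one line of space-separated integers in `/sys/devices/system/node/node<N>/distance`. The sketch below parses such a line and buckets each value into the three tiers listed above. The function names are hypothetical, and the tier boundaries (10, up to 22, anything larger) are an assumption taken from the values in this summary rather than from NumaConnect documentation.

```python
from typing import List


def parse_distance_row(text: str) -> List[int]:
    """Parse one sysfs distance line, e.g. '10 16 22 160 200'."""
    return [int(field) for field in text.split()]


def classify_distance(distance: int) -> str:
    """Bucket a NUMA distance into local, neighbor or remote (assumed tiers)."""
    if distance <= 10:
        return "local"
    if distance <= 22:
        return "neighbor"
    return "remote"
```

A placement routine could use these tiers to prefer local nodes first, then neighbors, and fall back to remote servers only when nothing closer is free.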

Experimental Workloads

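| Application | Type             | Classification | Characteristics                   |
|-------------|------------------|----------------|-----------------------------------|
| Neo4j       | Graph Database   | Sheep          | CPU and memory intensive          |
| Sockshop    | Microservices    | Sheep          | Cloud application representative  |
| Derby       | Benchmark        | Sheep          | Database benchmark                |
| SPECjvm2008 | Benchmark        | Rabbit/Devil   | Java runtime performance          |
| Stream      | Memory Bandwidth | -              | Memory bandwidth test             |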

VM Type Configuration

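| VM Type | CPU Cores | Memory (GB) | Quantity |
|---------|-----------|-------------|----------|
| Small   | 4         | 16          | 12       |
| Medium  | 8         | 32          | 4        |
| Large   | 16        | 64          | 2        |
| Huge    | 72        | 288         | 2        |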
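
As a quick arithmetic check on the VM mix, the sketch below totals the demand if every VM in the table were running at once. It reads the rows as Small (4 cores, 16 GB, 12 VMs), Medium (8, 32, 4), Large (16, 64, 2), and Huge (72, 288, 2); whether the experiments actually ran all VMs concurrently is not stated in this summary, and the names `VM_TYPES` and `total_demand` are illustrative.

```python
from typing import List, Tuple

# (name, cores, memory_gb, quantity), as read from the VM table
VM_TYPES: List[Tuple[str, int, int, int]] = [
    ("Small", 4, 16, 12),
    ("Medium", 8, 32, 4),
    ("Large", 16, 64, 2),
    ("Huge", 72, 288, 2),
]


def total_demand(vm_types: List[Tuple[str, int, int, int]]) -> Tuple[int, int]:
    """Return (total cores, total memory in GB) across all VMs."""
    cores = sum(c * q for _, c, _, q in vm_types)
    memory_gb = sum(m * q for _, _, m, q in vm_types)
    return cores, memory_gb
```

Under that reading the mix asks for 256 cores and 1024 GB, which fits within the 288 cores of the platform without oversubscribing cores.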

Experimental Results

Primary Performance Improvements

Compared to the default Linux scheduler (Vanilla), the proposed algorithm achieves significant performance improvements:

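| Application | SM-IPC Improvement | SM-MPI Improvement |
|-------------|--------------------|--------------------|
| Derby       | 215×               | 241×               |
| FFT         | 33×                | 37×                |
| Sockshop    | 25×                | 23×                |
| Sunflow     | 34×                | 34×                |
| Mpegaudio   |                    |                    |
| SOR         | 17×                | 23×                |
| Neo4j       |                    |                    |
| Stream      | 105×               | 105×               |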

Performance Stability Analysis

  • Vanilla Algorithm: Standard deviation to mean performance ratio >0.4, indicating unpredictable performance
  • SM-IPC/SM-MPI: This ratio <0.04, indicating stable and predictable performance
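
The ratio of standard deviation to mean used above is the coefficient of variation. A minimal sketch of computing it over repeated runs follows; the function name is illustrative, and the use of the sample standard deviation (rather than the population form) is an assumption, since this summary does not say which one was used.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(samples: Sequence[float]) -> float:
    """Return sample standard deviation divided by the mean.

    Needs at least two samples and a non-zero mean.
    """
    mean = statistics.mean(samples)
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return statistics.stdev(samples) / mean
```

Values near zero mean run-to-run performance is tightly clustered; values above 0.4, as reported for the vanilla scheduler, mean the spread is a large fraction of the mean.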

VM Scale Impact Analysis

Taking Stream application as an example, performance improvements across different VM scales:

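| VM Type | SM-IPC Improvement | SM-MPI Improvement |
|---------|--------------------|--------------------|
| Small   | 48×                | 47×                |
| Medium  | 105×               | 105×               |
| Large   | 41×                | 39×                |
| Huge    |                    |                    |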

Key Findings:

  • Huge VMs show relatively smaller performance improvements due to inherently better locality
  • Small to medium-scale VMs benefit most, as they are more susceptible to improper mapping

NUMA Distance Impact

Mpegaudio application performance across different NUMA distances:

  • Local access: Baseline performance (1.0)
  • Neighbor access (distance 16/22): Performance degradation approximately 5-10%
  • Remote access (distance 160/200): Maximum performance degradation 17%

Related Work

Traditional NUMA Optimization Research

  • Panagouirgious: Demonstrated the impact of memory location on NUMA system performance
  • Lepers et al.: Investigated asymmetric interconnect effects on x86 systems
  • Majo and Gross: Proposed thread placement algorithms to reduce data locality issues

Virtualization Environment Optimization

  • Rao et al.: Proposed biased random vCPU migration algorithms
  • Tang et al.: Investigated NUMA impact in Google's large-scale production environments

Innovations of This Work

  • First in-depth empirical study of disaggregated systems using actual hardware
  • Comprehensively considers resource contention, locality, and interference degree
  • Provides a complete application classification and mapping algorithm framework

Conclusions and Discussion

Main Conclusions

  1. Significant Performance Improvements: The proposed NUMA-aware mapping algorithm achieves average 50× performance improvement compared to default schedulers
  2. Improved Stability: Substantially reduces performance variability, providing predictable performance
  3. Effectiveness of Application Classification: The Animal Classes-based classification method effectively guides resource mapping decisions

Limitations

  1. Static Classification Assumption: Current application classification is static and does not account for dynamic application behavior changes
  2. Limited Workload Types: Evaluation primarily focuses on specific application types
  3. Platform-Specific: Experiments conducted only on NumaConnect platform

Future Directions

  1. Linux Scheduler Tuning: Investigate effects of Linux scheduler optimization to reduce randomness
  2. Memory Migration Techniques: Employ "memory follows cores" memory migration technology in libvirt
  3. Dynamic Application Classification: Develop runtime application behavior analysis and dynamic reclassification mechanisms

In-Depth Evaluation

Strengths

  1. High Practical Value: Evaluation on real hardware provides results with strong practical applicability
  2. Complete Methodology: Forms a comprehensive system from problem analysis to solution design to experimental validation
  3. Significant Performance Improvements: Experimental results demonstrate substantial performance improvement potential
  4. Systematic Research: Comprehensively considers multiple influencing factors and provides holistic solutions

Weaknesses

  1. Platform Dependency: Research primarily based on NumaConnect platform; applicability to other disaggregated systems requires verification
  2. Limited Workload Coverage: Evaluated application types are relatively limited; more diverse workloads needed for validation
  3. Insufficient Dynamism: Algorithm's adaptability to dynamic system load changes requires further investigation
  4. Lack of Theoretical Analysis: Lacks theoretical analysis of algorithm complexity and convergence

Impact

  1. Domain Contribution: Provides important theoretical foundation and practical guidance for resource management in disaggregated systems
  2. Practical Value: Algorithm can be directly applied to cloud computing and high-performance computing environments
  3. Reproducibility: Authors commit to providing source code, facilitating dissemination and verification of research results

Applicable Scenarios

  1. Large-Scale Cloud Computing Environments: Particularly suitable for resource pooling cloud infrastructure
  2. High-Performance Computing Clusters: Can optimize HPC application resource allocation
  3. Memory-Intensive Applications: Particularly effective for graph databases and in-memory computing applications
  4. Virtualized Data Centers: Can be integrated into existing virtualization management systems

References

This paper cites 26 relevant references covering important research achievements in multiple domains including disaggregated systems, NUMA optimization, and virtualization technologies, providing a solid theoretical foundation for the research work.


Overall Assessment: This is a paper with significant contributions to the field of resource management in disaggregated systems. Through rigorous experimental design and comprehensive performance evaluation, it demonstrates the effectiveness of NUMA-aware mapping algorithms. Despite certain limitations, its practical value and academic contributions are substantial, laying a solid foundation for further development in this field.