
Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems

Lakew, Svärd, Elmroth et al.

Disaggregated systems have a novel architecture motivated by the requirements of resource-intensive applications such as social networking, search, and in-memory databases. The total amount of resources such as memory and CPU cores is very large in such systems. However, the distributed topology of disaggregated server systems results in non-uniform access latency and performance, with NUMA effects inside each box as well as additional access latency for remote resources. In this work, we study the effects of complex NUMA topologies on application performance and propose a method for improved, NUMA-aware mapping for virtualized environments running on disaggregated systems. Our mapping algorithm is based on pinning of virtual cores and/or migration of memory across a disaggregated system and takes into account application performance, resource contention, and utilization. The proposed method is evaluated on a system with 288 cores and around 1 TB of memory, composed of six disaggregated commodity servers, through a combination of benchmarks and real applications such as memory-intensive graph databases. Our evaluation demonstrates significant improvement over the vanilla resource mapping methods. Overall, the mapping algorithm improves performance by a significant margin compared to the default Linux scheduler used in the system.



Basic Information

Paper ID: 2501.01356
Title: Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems
Authors: Ewnetu Bayuh Lakew, Petter Svärd, Erik Elmroth, Johan Tordsson (Umeå University, Sweden)
Classification: cs.DC (Distributed, Parallel and Cluster Computing)
Publication Date: January 2, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.01356

Abstract

This paper investigates the impact of complex NUMA topologies on application performance in disaggregated systems and proposes an improved NUMA-aware mapping method. The approach is based on virtual core binding and memory migration, comprehensively considering application performance, resource contention, and utilization. Evaluation on a disaggregated system comprising 6 commodity servers with 288 cores and approximately 1TB memory demonstrates significant performance improvements compared to the default Linux scheduler.

Research Background and Motivation

Problem Definition

Disaggregated System Architecture Challenges: Disaggregated systems support resource-intensive applications (such as social networks, search engines, and in-memory databases) by aggregating resources from multiple physical servers, but distributed topologies introduce non-uniform access latencies and performance issues
Multi-Level NUMA Complexity: The system simultaneously exhibits intra-machine NUMA characteristics and cross-machine remote resource access latencies, forming a complex multi-level NUMA topology
Virtualization Environment Optimization: Existing Linux schedulers cannot effectively handle such complex resource mapping scenarios

Research Significance

Modern applications' computational resource demands exceed single-machine capabilities, making disaggregated systems an important development direction
Resource mapping strategies directly impact application performance; improper mapping can cause severe performance degradation
Comprehensive optimization considering resource contention, locality, and interference degree is necessary

Limitations of Existing Methods

Traditional NUMA optimization work primarily targets small-scale systems or relies on simulation-based evaluation
Existing work lacks empirical measurement studies of large-scale disaggregated systems on actual hardware
Existing work does not sufficiently consider the combined impact of resource contention, locality, and degree of interference

Core Contributions

First In-Depth Empirical Study of Disaggregated Systems: Conducts comprehensive measurements on real disaggregated hardware, considering resource contention, locality, and interference degree
Application Classification and Performance Metrics Framework: Classifies applications using the Animal Classes scheme and uses IPC (instructions per cycle) and MPI (misses per instruction) as performance indicators
NUMA-Aware Mapping Algorithm: Proposes an online mapping algorithm considering application classification, resource proximity, and runtime hardware performance counters
Significant Performance Improvements: Achieves an average 50× performance improvement over the default Linux scheduler on real hardware

Methodology Details

Task Definition

Input: Virtual machine requests (including CPU core count and memory requirements), application classification, system resource state
Output: Optimal mapping scheme from virtual CPUs to physical CPUs
Constraints: Avoid resource oversubscription, minimize NUMA distance, reduce inter-application interference

Application Classification Framework

Based on Animal Classes classification, applications are categorized into three types:

Sheep (Benign): Applications not easily affected by cache sharing
Rabbit (Sensitive): Fast-performing applications that degrade quickly when cache allocation is insufficient or shared
Devil (Destructive): Applications with frequent cache accesses and high miss rates, impacting other applications' performance

In addition, each application is classified as sensitive or insensitive to remote memory access.

Mapping Algorithm Architecture

Two-Stage Mapping Strategy

Stage 1: Remoteness Handling (Upon Application Arrival)

if VMi is a new arrival then
    if Free slot is suitable for VMi given ci, ai then
        Map VMi directly
    else
        Reshuffle existing VMs to create suitable slot
        Map VMi to new slot
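
The Stage 1 decision above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `VM` and `Server` types, the `place_vm` function, and the first-fit policy are hypothetical, and "suitable" is reduced here to a free-core check only (the application class a_i is carried but not yet used). The reshuffle branch is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VM:
    name: str
    cores: int       # c_i in the pseudocode
    app_class: str   # a_i: "sheep", "rabbit" or "devil"


@dataclass
class Server:
    name: str
    total_cores: int
    vms: list = field(default_factory=list)

    def free_cores(self) -> int:
        # Cores not yet claimed by VMs mapped to this server.
        return self.total_cores - sum(vm.cores for vm in self.vms)


def place_vm(vm: VM, servers: list) -> Optional[Server]:
    """First-fit placement (an assumed policy).

    Returns the server that received the VM, or None when no free
    slot is large enough and a reshuffle would be required.
    """
    for server in servers:
        if server.free_cores() >= vm.cores:
            server.vms.append(vm)
            return server
    return None  # caller would reshuffle existing VMs here
```

A `None` result corresponds to the `else` branch of the pseudocode, where existing VMs are reshuffled to create a suitable slot.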

Stage 2: Interference Minimization (Runtime Optimization)

for each VMi do
    if (expected_perf - measured_perf)/expected_perf ≥ Threshold then
        Add VMi to affected list
        
for each affected VM do
    Build potential neighbor list based on class compatibility
    Compute new configuration with minimal reshuffle
    Remap if beneficial
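
The detection step of Stage 2 can be sketched as below. The function names and the dictionary layout are hypothetical, and the threshold of 0.2 used in the usage note is an assumed value for illustration only; the summary does not state the threshold the authors use.

```python
def relative_degradation(expected_perf: float, measured_perf: float) -> float:
    # (expected_perf - measured_perf) / expected_perf, as in the pseudocode.
    return (expected_perf - measured_perf) / expected_perf


def affected_vms(perf: dict, threshold: float) -> list:
    """Return the names of VMs whose degradation meets the threshold.

    perf maps a VM name to a tuple (expected_perf, measured_perf).
    """
    return [
        name
        for name, (expected, measured) in perf.items()
        if relative_degradation(expected, measured) >= threshold
    ]
```

For example, with `perf = {"vm1": (2.0, 1.9), "vm2": (2.0, 1.0)}` and an assumed threshold of 0.2, only `vm2` (50% below expectation) lands on the affected list.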

Application Compatibility Matrix

Application Type	Sheep	Rabbit	Devil
Sheep	✓	✓	✓
Rabbit	✓	✗	✗
Devil	✓	✗	✓
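
The compatibility matrix can be encoded as a small lookup, sketched below. The names `can_share` and `compatible_neighbors` are hypothetical; only the allowed pairs come from the table above. Because the matrix is symmetric, unordered pairs are enough.

```python
# Unordered class pairs marked compatible in the matrix above.
_COMPATIBLE_PAIRS = {
    frozenset(["sheep"]),            # sheep with sheep
    frozenset(["sheep", "rabbit"]),
    frozenset(["sheep", "devil"]),
    frozenset(["devil"]),            # devil with devil
}


def can_share(class_a: str, class_b: str) -> bool:
    """True if the two application classes may be placed as neighbors."""
    return frozenset([class_a, class_b]) in _COMPATIBLE_PAIRS


def compatible_neighbors(vm_class: str, candidate_classes: list) -> list:
    """Filter candidate neighbor classes down to the compatible ones."""
    return [c for c in candidate_classes if can_share(vm_class, c)]
```

A filter like this is one way to build the "potential neighbor list based on class compatibility" that Stage 2 refers to.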

Benefit Assessment Matrix

Application Type	Socket-Level	NUMA Node-Level	Server-Level
Sheep	1	5	8
Rabbit	4	7	9
Devil	1	6	9

Performance Monitoring Mechanism

IPC (Instructions Per Cycle): Indicates relative application performance; higher values indicate better performance
MPI (Misses Per Instruction): Measures cache miss rate; lower values indicate better performance
Uses the Linux perf tool to collect hardware performance counters at runtime
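
Both metrics are simple ratios of hardware counter values, as the sketch below shows. The function names and the sample counts are hypothetical, and treating "misses" as last-level cache misses is an assumption; the summary does not say which cache level the authors count.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: higher is better."""
    return instructions / cycles


def mpi(cache_misses: int, instructions: int) -> float:
    """Misses per instruction: lower is better.

    cache_misses is assumed here to be a last-level cache miss count.
    """
    return cache_misses / instructions


# Illustrative counter values for one sampling interval (not measured data).
sample = {"instructions": 3_000_000, "cycles": 2_000_000, "cache_misses": 30_000}
sample_ipc = ipc(sample["instructions"], sample["cycles"])
sample_mpi = mpi(sample["cache_misses"], sample["instructions"])
```

With these illustrative counts, the IPC is 1.5 and the MPI is 0.01.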

Experimental Setup

Hardware Platform

System Configuration: 6 IBM x3755 M3 servers
Processors: 2× AMD 6380 per server, 48 cores per server
Memory: 192GB RAM per server, 1176GB total
Network: NumaConnect N323 adapter, 2D ring topology
Total Resources: 288 cores, approximately 1TB memory

NumaConnect Technology Characteristics

Cache-coherent shared memory system
Unified programming model, transparent to applications
NUMA distances: Local 10, Neighbors 16/22, Remote 160/200
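
The distance values above can be mapped to the three tiers with a short helper, sketched below. The function names are hypothetical and the tier labels simply follow the list above. On Linux, each NUMA node's distance row is exposed as whitespace-separated integers in `/sys/devices/system/node/node<N>/distance`; the parser accepts text in that form.

```python
def parse_distance_row(text: str) -> list:
    """Parse one whitespace-separated row of NUMA distances."""
    return [int(token) for token in text.split()]


def distance_tier(distance: int) -> str:
    """Map a NUMA distance reported on this platform to its tier."""
    if distance == 10:
        return "local"
    if distance in (16, 22):
        return "neighbor"
    if distance in (160, 200):
        return "remote"
    raise ValueError(f"unexpected NUMA distance: {distance}")


# Illustrative row containing every value listed above.
tiers = [distance_tier(d) for d in parse_distance_row("10 16 22 160 200")]
```

A mapper could use such tiers to prefer local placements first, then neighbor, then remote.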

Experimental Workloads

Application	Type	Classification	Characteristics
Neo4j	Graph Database	Sheep	CPU and memory intensive
Sockshop	Microservices	Sheep	Cloud application representative
Derby	Benchmark	Sheep	Database benchmark
SPECjvm2008	Benchmark	Rabbit/Devil	Java runtime performance
Stream	Memory Bandwidth	-	Memory bandwidth test

VM Type Configuration

VM Type	CPU Cores	Memory (GB)	Quantity
Small	4	16	12
Medium	8	32	4
Large	16	64	2
Huge	72	288	2
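
As a quick check against the oversubscription constraint, the aggregate demand of the VM mix can be computed as below. This sketch assumes that every VM in the table runs at the same time, which the summary does not state; the names are hypothetical.

```python
# VM type -> (CPU cores, memory in GB, quantity), taken from the table above.
VM_TYPES = {
    "small": (4, 16, 12),
    "medium": (8, 32, 4),
    "large": (16, 64, 2),
    "huge": (72, 288, 2),
}


def total_demand(vm_types: dict) -> tuple:
    """Return (total cores, total memory in GB) across all VM instances."""
    cores = sum(c * qty for c, _, qty in vm_types.values())
    memory_gb = sum(m * qty for _, m, qty in vm_types.values())
    return cores, memory_gb


demand_cores, demand_memory_gb = total_demand(VM_TYPES)
```

Under that assumption the mix requests 256 cores and 1024 GB of memory, which fits within the platform's 288 cores and roughly 1 TB of memory without oversubscription.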

Experimental Results

Primary Performance Improvements

Compared to the default Linux scheduler (Vanilla), the proposed algorithm achieves improvements ranging from 5× (Mpegaudio) to 241× (Derby):

Application	SM-IPC Improvement	SM-MPI Improvement
Derby	215×	241×
FFT	33×	37×
Sockshop	25×	23×
Sunflow	34×	34×
Mpegaudio	5×	5×
SOR	17×	23×
Neo4j	8×	8×
Stream	105×	105×

Performance Stability Analysis

Vanilla Algorithm: Ratio of standard deviation to mean performance (coefficient of variation) above 0.4, indicating unpredictable performance
SM-IPC/SM-MPI: Ratio of standard deviation to mean performance below 0.04, indicating stable and predictable performance
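
The stability metric is the ratio of standard deviation to mean, which can be computed as sketched below. Using the population standard deviation is an assumption (the summary does not say whether the authors use the sample or population form), and the sample values are illustrative, not measured data.

```python
import statistics


def coefficient_of_variation(samples: list) -> float:
    """Ratio of (population) standard deviation to mean."""
    return statistics.pstdev(samples) / statistics.mean(samples)


# Illustrative run-to-run performance samples (not from the paper).
noisy_runs = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std dev 2
steady_runs = [10, 10, 10, 10]

noisy_cv = coefficient_of_variation(noisy_runs)
steady_cv = coefficient_of_variation(steady_runs)
```

The first set gives a ratio of 0.4, the level of variability reported for the vanilla scheduler, while identical runs give a ratio of 0.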

VM Scale Impact Analysis

Taking the Stream application as an example, the performance improvements across VM sizes are:

VM Type	SM-IPC Improvement	SM-MPI Improvement
Small	48×	47×
Medium	105×	105×
Large	41×	39×
Huge	2×	2×

Key Findings:

Huge VMs show relatively smaller performance improvements due to inherently better locality
Small to medium-scale VMs benefit most, as they are more susceptible to improper mapping

NUMA Distance Impact

Mpegaudio application performance across different NUMA distances:

Local access: Baseline performance (1.0)
Neighbor access (distance 16/22): Performance degradation approximately 5-10%
Remote access (distance 160/200): Maximum performance degradation 17%

Related Work

Traditional NUMA Optimization Research

Panagouirgious: Demonstrated the impact of memory location on NUMA system performance
Lepers et al.: Investigated asymmetric interconnect effects on x86 systems
Majo and Gross: Proposed thread placement algorithms to reduce data locality issues

Virtualization Environment Optimization

Rao et al.: Proposed biased random vCPU migration algorithms
Tang et al.: Investigated NUMA impact in Google's large-scale production environments

Innovations of This Work

First in-depth empirical study of disaggregated systems using actual hardware
Comprehensively considers resource contention, locality, and interference degree
Provides a complete application classification and mapping algorithm framework

Conclusions and Discussion

Main Conclusions

Significant Performance Improvements: The proposed NUMA-aware mapping algorithm achieves an average 50× performance improvement compared to the default Linux scheduler
Improved Stability: Substantially reduces performance variability, providing predictable performance
Effectiveness of Application Classification: The Animal Classes-based classification method effectively guides resource mapping decisions

Limitations

Static Classification Assumption: Current application classification is static and does not account for dynamic application behavior changes
Limited Workload Types: Evaluation primarily focuses on specific application types
Platform-Specific: Experiments conducted only on NumaConnect platform

Future Directions

Linux Scheduler Tuning: Investigate effects of Linux scheduler optimization to reduce randomness
Memory Migration Techniques: Employ "memory follows cores" memory migration technology in libvirt
Dynamic Application Classification: Develop runtime application behavior analysis and dynamic reclassification mechanisms

In-Depth Evaluation

Strengths

High Practical Value: Evaluation on real hardware provides results with strong practical applicability
Complete Methodology: Forms a comprehensive system from problem analysis to solution design to experimental validation
Significant Performance Improvements: Experimental results demonstrate substantial performance improvement potential
Systematic Research: Comprehensively considers multiple influencing factors and provides holistic solutions

Weaknesses

Platform Dependency: Research primarily based on NumaConnect platform; applicability to other disaggregated systems requires verification
Limited Workload Coverage: Evaluated application types are relatively limited; more diverse workloads needed for validation
Insufficient Dynamism: Algorithm's adaptability to dynamic system load changes requires further investigation
Lack of Theoretical Analysis: Lacks theoretical analysis of algorithm complexity and convergence

Impact

Domain Contribution: Provides important theoretical foundation and practical guidance for resource management in disaggregated systems
Practical Value: Algorithm can be directly applied to cloud computing and high-performance computing environments
Reproducibility: Authors commit to providing source code, facilitating dissemination and verification of research results

Applicable Scenarios

Large-Scale Cloud Computing Environments: Particularly suitable for resource pooling cloud infrastructure
High-Performance Computing Clusters: Can optimize HPC application resource allocation
Memory-Intensive Applications: Particularly effective for graph databases and in-memory computing applications
Virtualized Data Centers: Can be integrated into existing virtualization management systems

References

This paper cites 26 relevant references covering important research achievements in multiple domains including disaggregated systems, NUMA optimization, and virtualization technologies, providing a solid theoretical foundation for the research work.

Overall Assessment: This is a paper with significant contributions to the field of resource management in disaggregated systems. Through rigorous experimental design and comprehensive performance evaluation, it demonstrates the effectiveness of NUMA-aware mapping algorithms. Despite certain limitations, its practical value and academic contributions are substantial, laying a solid foundation for further development in this field.