Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems

Lakew, Svärd, Elmroth et al.

Disaggregated systems have a novel architecture motivated by the requirements of resource-intensive applications such as social networking, search, and in-memory databases. The total amount of resources such as memory and CPU cores is very large in such systems. However, the distributed topology of disaggregated server systems results in non-uniform access latency and performance, with both NUMA aspects inside each box and additional access latency for remote resources. In this work, we study the effects of complex NUMA topologies on application performance and propose a method for improved, NUMA-aware mapping for virtualized environments running on disaggregated systems. Our mapping algorithm is based on pinning of virtual cores and/or migration of memory across a disaggregated system and takes into account application performance, resource contention, and utilization. The proposed method is evaluated on a system with 288 cores and around 1 TB of memory, composed of six disaggregated commodity servers, through a combination of benchmarks and real applications such as memory-intensive graph databases. Our evaluation demonstrates significant improvement over the vanilla resource mapping methods. Overall, the mapping algorithm improves performance by a significant margin compared to the default Linux scheduler used in the system.

Basic Information

Paper ID: 2501.01356
Title: Optimising Virtual Resource Mapping in Multi-Level NUMA Disaggregated Systems
Authors: Ewnetu Bayuh Lakew, Petter Svärd, Erik Elmroth, Johan Tordsson (Umeå University, Sweden)
Category: cs.DC (Distributed, Parallel, and Cluster Computing)
Published: January 2, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2501.01356

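Summary

This paper studies how complex NUMA topologies affect application performance in disaggregated systems and proposes an improved NUMA-aware mapping method. The method is based on pinning virtual cores and migrating memory, and it takes application performance, resource contention, and utilization into account. Evaluated on a disaggregated system of six commodity servers with 288 cores and about 1 TB of memory, it shows a significant performance improvement over the default Linux scheduler.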

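Research Background and Motivation

Problem Definition

Architectural challenges of disaggregated systems: disaggregated systems aggregate the resources of multiple physical servers to support resource-intensive applications (such as social networking, search, and in-memory databases), but the distributed topology leads to non-uniform access latency and performance.
Multi-level NUMA complexity: NUMA effects inside each server coexist with the added latency of accessing remote resources across servers, forming a complex multi-level NUMA topology.
Optimization for virtualized environments: the existing Linux scheduler cannot handle this kind of complex resource-mapping scenario effectively.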

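Significance

Modern applications demand more compute resources than a single machine can provide, which makes disaggregated systems an important direction.
The resource mapping strategy directly affects application performance; a poor mapping can cause severe degradation.
Resource contention, locality, and degree of interference need to be optimized jointly.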

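Limitations of Existing Approaches

Prior NUMA optimization work mainly targets small-scale systems or relies on simulation for evaluation.
Measurement studies on real hardware for large-scale disaggregated systems are lacking.
The combined effect of resource contention, locality, and degree of interference is not fully considered.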

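Core Contributions

First in-depth measurement study of a disaggregated system: in-depth measurements on real disaggregated hardware that account for resource contention, locality, and degree of interference.
Application classification and performance metrics: applications are classified using the Animal Classes taxonomy, with IPC and MPI as performance metrics.
NUMA-aware mapping algorithm: an online mapping algorithm that considers application class, resource proximity, and runtime hardware performance counters.
Significant performance improvement: an average 50x performance improvement on a real system.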

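Method Details

Task Definition

Input: virtual machine requests (number of CPU cores, memory demand), application class, and system resource state
Output: the best mapping of virtual CPUs to physical CPUs
Constraints: avoid resource oversubscription, minimize NUMA distance, and reduce interference between applications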

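Application Classification

Applications are divided into three classes based on the Animal Classes taxonomy:

Sheep (gentle): applications that are not easily affected by cache sharing
Rabbit (sensitive): applications that run fast but degrade when they receive too little cache or have to share it
Devil (destructive): applications that access the cache frequently with a high miss rate and hurt the performance of other applications

Applications are further classified as sensitive or insensitive according to their sensitivity to remote memory.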

Mapping Algorithm Architecture

Two-Phase Mapping Strategy

Phase 1: Handling remoteness (on application arrival)

if VMi is a new arrival then
    if Free slot is suitable for VMi given ci, ai then
        Map VMi directly
    else
        Reshuffle existing VMs to create suitable slot
        Map VMi to new slot
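
The arrival-time pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation: the `VM` and `Slot` types, the `place_on_arrival` function, and the `reshuffle` callback are hypothetical names, and "suitable" is assumed to mean only that the slot has enough free cores and memory. The application class is carried along but not used in the suitability test here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class VM:
    name: str
    cores: int       # c_i: requested virtual cores
    memory_gb: int   # requested memory
    app_class: str   # a_i: "sheep", "rabbit" or "devil"


@dataclass
class Slot:
    node: str
    free_cores: int
    free_memory_gb: int


def is_suitable(slot: Slot, vm: VM) -> bool:
    # Assumption: "suitable" only means no oversubscription of cores or memory.
    return slot.free_cores >= vm.cores and slot.free_memory_gb >= vm.memory_gb


def place_on_arrival(
    vm: VM,
    slots: List[Slot],
    reshuffle: Callable[[VM, List[Slot]], Optional[Slot]],
) -> Optional[Slot]:
    # Try a free slot first; fall back to reshuffling existing VMs.
    slot = next((s for s in slots if is_suitable(s, vm)), None)
    if slot is None:
        slot = reshuffle(vm, slots)  # hypothetical hook, may return None
    if slot is None or not is_suitable(slot, vm):
        return None  # no placement possible in this sketch
    slot.free_cores -= vm.cores
    slot.free_memory_gb -= vm.memory_gb
    return slot
```

A real reshuffle step would have to decide which running VMs to move and at what cost; the callback leaves that policy open.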

Phase 2: Interference minimization (runtime optimization)

for each VMi do
    if (expected_perf - measured_perf)/expected_perf ≥ Threshold then
        Add VMi to affected list
        
for each affected VM do
    Build potential neighbor list based on class compatibility
    Compute new configuration with minimal reshuffle
    Remap if beneficial
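
The first loop of the runtime pseudocode above, which flags VMs whose relative performance loss reaches a threshold, might look like this in Python. The function names and the dictionary-based inputs are assumptions made for illustration, and the threshold value is left to the caller because the summary does not state one.

```python
from typing import Dict, List


def relative_degradation(expected_perf: float, measured_perf: float) -> float:
    """Return (expected - measured) / expected, as in the pseudocode."""
    if expected_perf <= 0:
        raise ValueError("expected_perf must be positive")
    return (expected_perf - measured_perf) / expected_perf


def find_affected(
    expected: Dict[str, float],
    measured: Dict[str, float],
    threshold: float,
) -> List[str]:
    # A VM is "affected" when its relative loss is at least the threshold.
    return [
        vm
        for vm in expected
        if relative_degradation(expected[vm], measured[vm]) >= threshold
    ]
```

For instance, a VM expected to reach an IPC of 2.0 that measures 1.0 has a relative degradation of 0.5 and would be flagged under any threshold up to 0.5.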

Application Compatibility Matrix

Application class	Sheep	Rabbit	Devil
Sheep	✓	✓	✓
Rabbit	✓	✗	✗
Devil	✓	✗	✓
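
The compatibility matrix can be encoded as a small lookup, which is one way the "potential neighbor list" of Phase 2 might be built. The sketch below is only an illustration under that assumption; `can_colocate` and `potential_neighbors` are hypothetical names, and the entries are copied from the table above.

```python
from typing import Dict, List

# Rows of the compatibility matrix: class -> classes it may share resources with.
_COMPATIBLE = {
    "sheep": {"sheep", "rabbit", "devil"},
    "rabbit": {"sheep"},
    "devil": {"sheep", "devil"},
}


def can_colocate(a: str, b: str) -> bool:
    # True where the matrix has a check mark for the pair (a, b).
    return b in _COMPATIBLE[a]


def potential_neighbors(app_class: str, running: Dict[str, str]) -> List[str]:
    # running maps VM name -> application class; keep the compatible ones.
    return [vm for vm, cls in running.items() if can_colocate(app_class, cls)]
```

With this encoding a Rabbit is only ever placed next to Sheep, while a Devil may share with Sheep or other Devils.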

Benefit Evaluation Matrix

Application class	Socket level	NUMA node level	Server level
Sheep	1	5	8
Rabbit	4	7	9
Devil	1	6	9

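Performance Monitoring

IPC (Instructions Per Cycle): indicates the relative performance of an application; higher is better
MPI (Misses Per Instruction): measures the cache miss rate; lower is better
Hardware performance counters are collected in real time with the Linux perf tool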
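
Both metrics are simple ratios of hardware counter values. The following Python sketch shows the arithmetic, assuming the raw counts (for example the `instructions`, `cycles`, and `cache-misses` events reported by perf) have already been collected; the function names are illustrative and not taken from the paper.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: higher is better."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def mpi(cache_misses: int, instructions: int) -> float:
    """Cache misses per instruction: lower is better."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cache_misses / instructions
```

As a worked example, 2,000,000 instructions retired in 1,000,000 cycles with 5,000 cache misses give an IPC of 2.0 and an MPI of 0.005.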

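Experimental Setup

Hardware Platform

System configuration: 6 IBM x3755 M3 servers
Processors: 2×AMD 6380 per server (48 cores)
Memory: 192 GB RAM per server, 1176 GB in total
Network: NumaConnect N323 adapters, 2D torus topology
Total resources: 288 cores, about 1 TB of memory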

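NumaConnect Characteristics

Cache-coherent shared-memory system
Unified programming model, transparent to applications
NUMA distances: local 10, neighbor 16/22, remote 160/200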
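
To show how the distance values listed above separate the levels of the topology, here is a small Python sketch that parses one row of a NUMA distance table and labels each entry. On Linux such a row is exposed per node under `/sys/devices/system/node/nodeN/distance`; the tier names and the exact-match rule are assumptions for illustration, using only the values from this summary.

```python
from typing import List

# Distance values reported for the evaluated system, grouped by tier.
_TIERS = {
    "local": {10},
    "neighbor": {16, 22},
    "remote": {160, 200},
}


def parse_distance_row(text: str) -> List[int]:
    # A distance row is a whitespace-separated list of integers.
    return [int(field) for field in text.split()]


def distance_tier(distance: int) -> str:
    for tier, values in _TIERS.items():
        if distance in values:
            return tier
    raise ValueError(f"unknown NUMA distance: {distance}")
```

A mapper could use such labels to prefer local placements, then neighbor nodes, and only then remote servers.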

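Experimental Workloads

Application	Type	Class	Characteristics
Neo4j	Graph database	Sheep	CPU and memory intensive
Sockshop	Microservices	Sheep	Representative cloud application
Derby	Benchmark	Sheep	Database benchmark
SPECjvm2008	Benchmark	Rabbit/Devil	Java runtime performance
Stream	Memory bandwidth	-	Memory bandwidth test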