
Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry

Chen, Chien, Qian et al.

Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users' workloads. This hinders resource optimizations by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals - fully accessible by operators. Using low-level signals collected from the system, Reveal detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using Reveal, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.

Basic Information

Paper ID: 2510.26008
Title: Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
Authors: Ziji Chen, Steven W. D. Chien, Peng Qian, Noa Zilberman (University of Oxford)
Classification: cs.PF (Performance), cs.AR (Computer Architecture), cs.DC (Distributed Computing), cs.LG (Machine Learning)
Publication Date: October 31, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2510.26008v2

Abstract

Modern machine learning has evolved into a tightly coupled, full-stack ecosystem combining hardware, software, networking, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. However, these platform-as-a-service offerings rely on virtualization, so operators lack visibility into user workloads. This impedes resource optimization by operators, which is critical for ensuring cost efficiency and minimizing execution time. In this paper, we argue that system-level optimization does not require workload knowledge. We present Reveal, which adopts a hardware-centric approach and relies solely on hardware signals that are fully accessible to operators. By analyzing more than 30 popular ML models across various hardware platforms, we developed an unsupervised learning pipeline that detects anomalies from low-level signals. Using Reveal, we successfully identified network and system configuration issues, accelerating the DeepSeek model by 5.97%.

Research Background and Motivation

Core Problems

Missing Observability: Virtualization in cloud platforms hides user workloads from operators, so operators cannot obtain the high-level workload information that system-level optimization usually relies on
Performance Bottleneck Detection Challenges: ML workloads exhibit tight hardware-software coupling, where minor inefficiencies can cascade into system-level performance degradation
Limitations of Existing Tools: Existing tools require application-level integration, impose high runtime overhead (up to 90.2%), and offer limited coverage

Problem Significance

Specialized accelerators like GPUs are expensive (tens of thousands of dollars per GPU)
Cloud AI resource demand projected to grow 30% annually through 2030
Even minor configuration errors can cause 1.5x performance degradation
Distributed training heavily depends on collective communication, vulnerable to network issues

Limitations of Existing Approaches

High-level Observability Dependency: Most tools require application-level information, unavailable in virtualized environments
High Overhead: Plumber adds 21% overhead, and RL-Scope increases GPU kernel launch time by 90.2%
Rule-driven Detection: Requires workload-specific threshold tuning, poor portability
Limited Coverage: Framework analyzers typically cover only applications and framework runtime

Core Contributions

Propose Reveal Framework: A hardware-centric analysis and anomaly detection framework with high portability, deployability, and analytical accuracy
Identify Key Performance Indicators: Determine a set of low-level performance metrics representing ML workload behavior on hardware, with all collected datasets open-sourced
Develop Unsupervised Detection Pipeline: Successfully detect performance issues in containerized ML workloads, identify system bottlenecks, and accelerate DeepSeek by 5.97%

Methodology Details

Task Definition

Input: Host-level hardware telemetry data (CPU, GPU, memory, network, storage metrics)
Output: Anomalous window detection, subsystem attribution, root cause analysis reports
Constraints: Use only hardware-level signals accessible to operators, without requiring high-level workload knowledge

Model Architecture

1. Telemetry Collector

Collects approximately 150 unique metric types using perf, procfs, nvidia-smi, and standard Linux tools
Scales to 700+ time series channels when replicated across CPU cores and GPUs
CPU overhead maintained below 1.5%
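
As an illustration of the kind of host-level signal such a collector can read from procfs, the following is a minimal sketch that derives per-CPU busy percentages from two snapshots of /proc/stat. It is not Reveal's collector; the function names `parse_cpu_times` and `busy_percent` are hypothetical, and only the standard /proc/stat field order is assumed.

```python
from typing import Dict, Tuple


def parse_cpu_times(stat_text: str) -> Dict[str, Tuple[int, int]]:
    """Return {cpu_name: (busy_jiffies, total_jiffies)} from /proc/stat text."""
    result = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue
        values = [int(v) for v in fields[1:]]
        # Field order: user nice system idle iowait irq softirq steal guest guest_nice
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # Guest time is already included in user/nice, so sum only the first 8 fields.
        total = sum(values[:8])
        result[fields[0]] = (total - idle, total)
    return result


def busy_percent(before: Dict[str, Tuple[int, int]],
                 after: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Busy percentage per CPU between two snapshots."""
    out = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        out[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return out


# Two hand-written snapshots standing in for reads of /proc/stat.
t0 = parse_cpu_times("cpu0 100 0 50 800 50 0 0 0 0 0\n")
t1 = parse_cpu_times("cpu0 180 0 70 890 60 0 0 0 0 0\n")
print(busy_percent(t0, t1))
```

On a Linux host, the two snapshots would come from reading /proc/stat twice, one sampling interval apart.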

2. Metric Reanalysis and Feature Extraction

Metric Filtering: Correlation-driven pruning, retaining approximately 60% of metrics at |r|=0.5 threshold
Derived Metrics: Compute IPC (execution throughput), branch misprediction rate, cache miss rate, etc.
Sliding Window: 3-second window with 1-second stride, extracting statistical and temporal features
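
The three steps above can be sketched as follows. This is a hedged illustration, not the paper's pipeline: the greedy pruning rule (drop a metric when its |r| with an already kept metric exceeds the threshold) is an assumption about what "correlation-driven pruning" means, the feature set is illustrative, and the 0.1 s sampling interval is taken from the 100ms setting mentioned in the overhead results.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def prune_correlated(series, threshold=0.5):
    """Assumed rule: keep a metric only if |r| <= threshold against every kept metric."""
    kept = []
    for name in series:
        if all(abs(pearson(series[name], series[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def ipc(instructions, cycles):
    """Derived metric: instructions per cycle."""
    return instructions / cycles if cycles else 0.0


def window_features(samples, interval_s=0.1, window_s=3.0, stride_s=1.0):
    """Statistical and simple temporal features over 3 s windows with a 1 s stride."""
    size = round(window_s / interval_s)
    step = round(stride_s / interval_s)
    features = []
    for start in range(0, len(samples) - size + 1, step):
        w = samples[start:start + size]
        features.append({
            "mean": mean(w),
            "std": pstdev(w),
            "min": min(w),
            "max": max(w),
            "delta": w[-1] - w[0],  # crude temporal feature: change across the window
        })
    return features
```

With 100 ms sampling, each window holds 30 samples and consecutive windows overlap by 20 samples.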

3. Anomaly Detection Engine

Employs three complementary unsupervised methods:

Z-score: Standardized deviation detection, flagging windows whose score exceeds the 99th percentile
Mahalanobis Distance in PCA Subspace: Accounts for metric correlations and scale differences
Isolation Forest: Tree-based ensemble method with 1% contamination rate
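
To make the first detector concrete, here is a minimal sketch of z-score detection with a 99th-percentile cutoff. It assumes that a window's score is the largest absolute z-score across its features and that the percentile is computed by nearest rank; neither detail is stated in the summary. The Mahalanobis and Isolation Forest detectors are omitted because they need a linear algebra or ML library.

```python
import math
from statistics import mean, pstdev


def zscore_flags(feature_rows, percentile=99.0):
    """Flag windows whose max |z| across features exceeds the given percentile.

    feature_rows: one list of floats per window, all of the same length.
    Returns (flags, scores).
    """
    columns = list(zip(*feature_rows))
    mus = [mean(c) for c in columns]
    # Constant features get sigma = 1.0 so they contribute a zero score.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    scores = [
        max(abs(v - m) / s for v, m, s in zip(row, mus, sigmas))
        for row in feature_rows
    ]
    # Nearest-rank percentile of the window scores (assumed convention).
    ranked = sorted(scores)
    rank = math.ceil(percentile * len(ranked) / 100)
    threshold = ranked[max(rank - 1, 0)]
    return [s > threshold for s in scores], scores


# 199 ordinary windows plus one extreme window at the end.
rows = [[float(i % 5)] for i in range(199)] + [[100.0]]
flags, scores = zscore_flags(rows)
print(flags.index(True), sum(flags))
```

Because the cutoff is a percentile of the observed scores, no workload-specific threshold has to be tuned by hand.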

Technical Innovations

Hardware-centric Approach: Entirely based on hardware signals, avoiding dependence on high-level observability
Multi-detector Fusion: Reduces false positives and improves detection accuracy through inter-detector consistency
Subsystem Attribution: Maps anomalies to specific hardware subsystems (CPU, GPU, memory, network, storage)
Cross-layer Analysis: A single anomalous window may involve multiple correlated signals across layers, which provides stronger evidence of an anomaly
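
One way to picture detector fusion and subsystem attribution is the sketch below. The majority-vote rule, the `min_votes` parameter, and the metric-name prefixes in `SUBSYSTEM_PREFIXES` are all hypothetical choices made for illustration; the summary only states that inter-detector consistency reduces false positives and that anomalies are mapped to subsystems.

```python
from collections import Counter

# Hypothetical naming scheme: the prefix of a metric name identifies its subsystem.
SUBSYSTEM_PREFIXES = {
    "cpu": "CPU", "irq": "CPU",
    "gpu": "GPU",
    "mem": "memory", "numa": "memory",
    "net": "network", "tcp": "network", "ib": "network",
    "disk": "storage",
}


def fuse(flags_by_detector, min_votes=2):
    """Mark a window anomalous when at least min_votes detectors flag it."""
    detectors = list(flags_by_detector.values())
    n_windows = len(detectors[0])
    return [
        sum(flags[i] for flags in detectors) >= min_votes
        for i in range(n_windows)
    ]


def attribute(deviating_metrics):
    """Rank subsystems by how many deviating metrics belong to each."""
    counts = Counter()
    for metric in deviating_metrics:
        prefix = metric.split("_", 1)[0].lower()
        counts[SUBSYSTEM_PREFIXES.get(prefix, "unknown")] += 1
    return counts.most_common()


votes = {
    "zscore": [True, False, True],
    "mahalanobis": [True, False, False],
    "iforest": [False, False, True],
}
print(fuse(votes))
print(attribute(["cpu_busy_pct", "irq_total", "tcp_retrans"]))
```

Requiring agreement between detectors trades some sensitivity for fewer false alarms, while the attribution step turns a flagged window into a pointer at the subsystem an operator should inspect first.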

Experimental Setup

Dataset

ML Applications: 30+ popular models including BERT, BART, ResNet, ViT, VGG, DeepSeek, LLaMA, Mistral
Task Types: Text classification, table question-answering, image classification, semantic segmentation
Datasets: GLUE/SST2, WikiSQL, PASCAL VOC, CIFAR, MNIST
Runs: 10 executions per workload to ensure statistical reliability

Experimental Environment

HPC Cluster:
- Dual-node with NVIDIA Tesla V100 GPUs (32GB), Intel Xeon Platinum 8628 CPUs
- Single-node with four NVIDIA H100 GPUs (96GB HBM3), Intel Sapphire Rapids CPUs
Local Cluster:
- 9 servers with AMD EPYC 7443P CPUs (24 cores), 256GB memory
- 99 containers in distributed training setup

Evaluation Metrics

Detection Accuracy: Accuracy of anomalous window identification
Subsystem Attribution: Ability to correctly map to hardware subsystems
Performance Improvement: End-to-end runtime improvement
Overhead Assessment: CPU usage, storage requirements, detector runtime

Experimental Results

Main Results

Performance Overhead

CPU Overhead: 1.2-1.4% at 100ms sampling interval, drops below 0.6% at 600ms
Storage Requirements: 42-43 KB/s/host before filtering, 14-22 KB/s after filtering
Detection Latency: Feature extraction 1.46±0.02s, end-to-end 2.26±0.17s

Anomaly Detection Performance

Metric Stability: 99.75% of workload-metric pairs show statistically significant similarity (p<0.05)
Cross-configuration Consistency: Median intersection-over-union (IoU) of 0.50 between the anomalous windows found under default and fine-grained settings, with a hit rate of 0.92
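
The two consistency measures can be illustrated on sets of anomalous window indices. IoU is the standard set ratio; the hit-rate definition used here (a reference window counts as hit if the other configuration flags a window within `tolerance` indices of it) is an assumption, since the summary does not define it.

```python
def iou(a, b):
    """Intersection-over-union of two collections of window indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hit_rate(reference, candidate, tolerance=1):
    """Assumed definition: share of reference windows matched within +/- tolerance."""
    candidate = set(candidate)
    reference = list(reference)
    if not reference:
        return 1.0
    hits = sum(
        any((w + d) in candidate for d in range(-tolerance, tolerance + 1))
        for w in reference
    )
    return hits / len(reference)


# Hypothetical window indices flagged under two configurations.
default_cfg = [118, 119, 120, 121]
fine_cfg = [120, 121, 122, 123]
print(round(iou(default_cfg, fine_cfg), 3))
print(hit_rate(default_cfg, fine_cfg))
```

A moderate IoU together with a high hit rate is what one would expect when both configurations find the same incidents but place the window boundaries slightly differently.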

Case Studies

Case 1: NUMA Anomaly (Memory Subsystem)

Detection: Windows 118-123 show IPC decline and increased L3 miss cycles
Analysis: Cross-socket memory accesses and PCIe traffic increase latency
Fix: NUMA-aware binding, pinning processes to single NUMA node
Result: DeepSeek-7B fine-tuning improved from 1823.4±46.1s to 1714.6±70.0s (5.97% improvement)
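
A NUMA-aware CPU binding of the kind described in the fix can be sketched with the Linux-only `os.sched_setaffinity` call and the sysfs `cpulist` file of a node. This is an assumption about how such pinning could be done, not the procedure used in the paper; it restricts CPU placement only, and memory placement would need a separate policy (for example via `numactl --cpunodebind=0 --membind=0`). The helper names are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a sysfs CPU list such as '0-3,8,10-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node, pid=0):
    """Pin a process (0 = the caller) to the CPUs of one NUMA node. Linux only."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as handle:
        cpus = parse_cpulist(handle.read())
    os.sched_setaffinity(pid, cpus)
    return cpus


print(sorted(parse_cpulist("0-3,8,10-11\n")))
```

Calling `pin_to_node(0)` at process start keeps the scheduler from migrating the process across sockets, which is the behavior the fix relies on.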

Case 2: NCCL-QP Configuration Error (Network Subsystem)

Detection: Increased CPU Busy%, burst in ib0 TX/RX traffic, decreased GPU power
Analysis: A single queue pair (QP) configuration causes a bottleneck in completion processing
Fix: Increase the NCCL configuration from one QP to two QPs
Result: Runtime improved from 1825.4±46.1s to 1769.3±16.7s (3.1% improvement)
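
The improvement percentages reported for Cases 1 and 2 follow from the mean runtimes as relative reductions, which a short calculation confirms. The function name `improvement_pct` is illustrative.

```python
def improvement_pct(before_s, after_s):
    """Relative runtime reduction in percent."""
    return 100.0 * (before_s - after_s) / before_s


# Case 1 (NUMA binding): 1823.4 s -> 1714.6 s
print(round(improvement_pct(1823.4, 1714.6), 2))  # 5.97
# Case 2 (1 QP -> 2 QP): 1825.4 s -> 1769.3 s
print(round(improvement_pct(1825.4, 1769.3), 1))  # 3.1
```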

Case 3: IRQ Imbalance (CPU Subsystem)

Detection: CPU Busy% variance and IRQ counter anomalies
Fix: Enable irqbalance service for automatic interrupt load distribution
Result: TCP retransmission anomalies reduced from 6.07% to 3.51%
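
IRQ imbalance of this kind is visible in /proc/interrupts, where each row lists per-CPU interrupt counts. The sketch below sums those counts per CPU and reports the ratio of the busiest CPU to the average; the ratio is an illustrative indicator chosen here, not a metric defined in the paper, and the sample text is hand-written.

```python
def per_cpu_irq_totals(interrupts_text):
    """Sum interrupt counts per CPU from /proc/interrupts text."""
    lines = interrupts_text.strip("\n").splitlines()
    n_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * n_cpus
    for line in lines[1:]:
        fields = line.split()
        counts = fields[1:1 + n_cpus]
        # Skip rows without one count per CPU (for example ERR/MIS on multi-CPU hosts).
        if len(counts) < n_cpus or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals


def imbalance_ratio(totals):
    """Busiest CPU's interrupt count divided by the per-CPU average."""
    average = sum(totals) / len(totals)
    return max(totals) / average if average else 0.0


SAMPLE = """\
           CPU0       CPU1
  0:         20          0   IO-APIC    2-edge      timer
 24:       1000        200   PCI-MSI 524288-edge      eth0
NMI:          0          0   Non-maskable interrupts
ERR:          0
"""
totals = per_cpu_irq_totals(SAMPLE)
print(totals, round(imbalance_ratio(totals), 2))
```

A ratio close to 1.0 indicates evenly spread interrupts, which is the state the irqbalance service aims for.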

Case 4: HugePages Configuration Error (Memory Subsystem)

Detection: Cross-node memory usage anomalies
Analysis: Pre-allocated 1GiB HugePages reported as "used" memory
Fix: Revert to the default 2 MiB HugePages allocation
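
The effect described in the analysis can be checked from /proc/meminfo, which reports the HugePages pool separately from ordinary memory. The following sketch computes how much memory a pre-allocated pool reserves and how much of it is actually in use; the function name and the sample values are hypothetical, while the field names are the standard /proc/meminfo ones.

```python
def hugepage_summary(meminfo_text):
    """Summarize the HugePages pool from /proc/meminfo text (sizes in KiB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    page_kib = fields.get("Hugepagesize", 0)
    total = fields.get("HugePages_Total", 0)
    free = fields.get("HugePages_Free", 0)
    return {
        "page_size_kib": page_kib,
        "reserved_kib": total * page_kib,        # counted as unavailable by the OS
        "in_use_kib": (total - free) * page_kib,  # pages actually handed out
    }


# Hypothetical host with 64 pre-allocated 1 GiB pages, none of them in use.
SAMPLE = """\
MemTotal:       263856000 kB
HugePages_Total:      64
HugePages_Free:       64
Hugepagesize:    1048576 kB
"""
print(hugepage_summary(SAMPLE))
```

A large `reserved_kib` alongside a near-zero `in_use_kib` matches the pattern in this case: memory that looks used to host-level telemetry but is not serving the workload.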

Case 5: Packet Loss Injection Test (Network Subsystem)

Detection Capability: Distinguish workload-inherent retransmissions from failure-induced ones
Analysis Depth: Provide cross-layer context from transport layer counters to CPU IRQ spikes and GPU stalls

Anomaly Pattern Analysis

HPC Cluster: CPU-side signals (Bzy_MHz, IRQ) dominate, contributing >50% of anomaly features
Local Cluster: Anomalies concentrated in memory and I/O subsystems, with writeback surges and dirty page accumulation
Cross-environment: TCP retransmissions appear in both environments, typically related to NCCL imbalance

Comparison of Existing Monitoring Methods

According to Table 1 in the paper, existing methods fall into three categories:

Application-level Analyzers: TensorFlow Profiler, PyTorch Profiler - require code instrumentation
System Tools: AWS SageMaker, Prometheus - rule-based detection
Low-level Tracing: BCC/eBPF tools, RL-Scope - high overhead or limited coverage

Reveal's Advantages

No Instrumentation Required: Entirely based on host-level telemetry
Full Subsystem Coverage: CPU, GPU, memory, network, storage
Automatic Anomaly Detection: Unsupervised ML methods
Hardware Attribution: Map anomalies to specific hardware components

Conclusions and Discussion

Main Conclusions

Hardware-centric Approach is Feasible: Effective ML workload anomaly detection using only hardware signals
Unsupervised Detection is Effective: Combination of three detectors accurately identifies multiple anomaly types
Practical Performance Gains: Successfully identifies and fixes configuration issues, achieving significant performance improvements
High Portability: 91% of the code is reusable across platforms

Limitations

Static Configuration: Currently uses fixed sampling rates and window sizes, cannot adapt to workload dynamics
Passive Detection: Can only detect anomalies, cannot automatically resolve issues
Manual Remediation: Requires operator intervention for problem fixing

Future Directions

Adaptive Sampling: Adjust sampling frequency based on heuristics
Automatic Remediation: Explore lightweight runtime interventions, such as automatic IRQ rebalancing triggers
Extended Detectors: Explore additional unsupervised anomaly detection methods

In-depth Evaluation

Strengths

Strong Innovation: First to propose pure hardware signal-based ML anomaly detection, addressing cloud environment observability challenges
Comprehensive Experiments: Testing 30+ models across multiple hardware platforms with rich datasets
High Practical Value: Low overhead (<2% CPU), high portability (91% code reuse)
Convincing Results: 5.97% actual performance improvement validates method effectiveness
Open-source Contribution: Provides complete datasets and toolkits

Weaknesses

Detection Latency: 2.26-second end-to-end latency may be unsuitable for real-time applications
Feature Engineering: Metric selection and feature extraction processes are relatively complex, requiring domain expertise
Evaluation Scope: Primarily tested in academic environments; production environment complexity may present new challenges
Root Cause Analysis Depth: While capable of subsystem attribution, specific root cause analysis still requires manual intervention

Impact

Academic Contribution: Provides new research direction for ML system performance monitoring
Practical Value: Offers cloud service providers non-intrusive monitoring solutions without accessing user workloads
Reproducibility: Open-source code and datasets support research reproduction and extension

Applicable Scenarios

Cloud Service Providers: Need performance optimization without accessing user workloads
HPC Centers: Need to monitor and diagnose ML workload performance issues
Edge Computing: Lightweight monitoring in resource-constrained environments
Research Institutions: ML system performance analysis and optimization research

References

The paper cites 77 related references covering:

ML performance analysis tools: Hotline, RL-Scope, Plumber, etc.
Anomaly detection methods: Isolation Forest, PCA, Mahalanobis distance, etc.
System monitoring: Prometheus, AWS CloudWatch, etc.
ML frameworks: PyTorch, TensorFlow, etc.

Overall Assessment: This is a high-quality systems research paper proposing an innovative hardware-centric anomaly detection method addressing practical ML workload monitoring challenges in cloud environments. The experimental design is comprehensive, results are convincing, and the work holds significant value for both academia and industry.