Should I Run My Cloud Benchmark on Black Friday?

Henning, Vogel, Perez-Wohlfeil et al.

Benchmarks and performance experiments are frequently conducted in cloud environments. However, their results are often treated with caution, as the presumed high variability of performance in the cloud raises concerns about reproducibility and credibility. In a recent study, we empirically quantified the impact of this variability on benchmarking results by repeatedly executing a stream processing application benchmark at different times of the day over several months. Our analysis confirms that performance variability is indeed observable at the application level, although it is less pronounced than often assumed. The larger scale of our study compared to related work allowed us to identify subtle daily and weekly performance patterns. We now extend this investigation by examining whether a major global event, such as Black Friday, affects the outcomes of performance benchmarks.

Basic Information

Paper ID: 2510.12397
Title: Should I Run My Cloud Benchmark on Black Friday?
Authors: Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser
Institutions: Dynatrace Research, Linz, Austria; LIT CPS Lab, Johannes Kepler University Linz, Austria
Classification: cs.SE (Software Engineering), cs.DC (Distributed, Parallel, and Cluster Computing), cs.PF (Performance)
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12397

Abstract

Benchmarks and performance experiments are frequently run in cloud environments, yet their results are often questioned because the presumed high variability of cloud performance threatens reproducibility and credibility. This study empirically quantifies the impact of that variability on benchmarking results by repeatedly executing a stream processing application benchmark at different times of day over several months. The analysis confirms that performance variability is observable at the application level, but it is less pronounced than commonly assumed. The larger scale of the study compared to related work makes it possible to identify subtle daily and weekly performance patterns. The work then extends the investigation to ask whether a major global event, such as Black Friday, affects benchmark results.

Research Background and Motivation

Problem Definition

As organizations continue their digital transformation toward cloud deployment, benchmarking and performance experimentation in cloud environments have become common practices in research and engineering. However, performance measurement in cloud environments faces the following challenges:

Multi-tenant Resource Sharing: Cloud workloads share underlying infrastructure with other tenants
Hardware Abstraction: High levels of hardware abstraction introduce variability
Reproducibility Issues: Performance measurements may fluctuate, affecting meaningful comparisons across studies

Research Significance

The credibility of cloud benchmarking directly impacts the accuracy of performance assessment
Understanding performance variability patterns has practical implications for optimizing cloud resource configuration
Provides empirical evidence for best practices in cloud benchmarking

Limitations of Existing Approaches

Lack of large-scale, long-term empirical studies
Insufficient quantitative analysis of application-level performance variability
Insufficient consideration of global events' impact on cloud performance

Core Contributions

Large-Scale Longitudinal Study: Collected a dataset of over 1,000 benchmark executions through repeated experiments spanning several months
Performance Pattern Identification: Identified subtle but statistically significant daily and weekly performance patterns in cloud environments
Global Event Impact Analysis: First quantitative analysis of the impact of major events such as Black Friday on cloud benchmark performance
Application-Level Variability Quantification: Provided precise measurement of performance variability for distributed stream processing applications in cloud environments

Methodology Details

Experimental Design

Test Subject

Application Type: Distributed stream processing applications (representing data-intensive, performance-critical distributed systems)
Benchmarking Tool: Open-source cloud-native stream processing benchmark ShuffleBench and its Kafka Streams implementation
Performance Metric: Throughput, measured using ShuffleBench's instantaneous measurement method

Execution Environment

Cloud Platform: Amazon Web Services (AWS)
Service: Elastic Kubernetes Service (EKS)
Cluster Configuration: 10 nodes using different sizes of m6i instances
Geographic Region: us-east-1 (primary), eu-central-1 (validation)

Automated Benchmark Execution

Automated through scheduled tasks in AWS Elastic Container Service (ECS):

Cluster Provisioning: Create new EKS clusters
Infrastructure Installation: Deploy Apache Kafka, monitoring tools, and Theodolite benchmarking framework
Benchmark Execution: Launch stream processing applications and load generators through Theodolite, running for 15 minutes
Repeated Testing: Each execution repeated 3 times
Data Collection: Store benchmark results, tear down the infrastructure, and delete the cluster

Time Span Design

Primary Experiment Period: May to July 2024, plus one week in September 2024
Execution Frequency: Once every 6 hours (covering complete daily cycles)
High-Frequency Period: Every 3 hours over 3 weeks (capturing finer-grained daily patterns)
Black Friday Experiment: Additional experiments one week before and after Black Friday 2024

Experimental Setup

Performance Measurement Method

Warm-up Period: Discard measurements from the first 3 minutes
Measurement Window: Calculate average throughput for the remaining time
Output: Each benchmark execution produces one average throughput value
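
The measurement steps above can be sketched as follows. This is a minimal illustration, not ShuffleBench's or Theodolite's actual code: the function name `average_throughput`, the `(seconds_since_start, records_per_second)` sample format, and the sample values are assumptions made for this example.

```python
from statistics import fmean

WARMUP_SECONDS = 3 * 60  # the first 3 minutes are discarded, as described above


def average_throughput(samples, warmup_seconds=WARMUP_SECONDS):
    """Reduce one benchmark run to a single average throughput value.

    samples: iterable of (seconds_since_start, records_per_second) pairs.
    This sample format is an assumption made for illustration.
    """
    steady = [rate for t, rate in samples if t >= warmup_seconds]
    if not steady:
        raise ValueError("no samples after the warm-up period")
    return fmean(steady)


# Hypothetical 15-minute run with one sample per minute:
# lower throughput while warming up, stable throughput afterwards.
samples = [(t, 500.0 if t < WARMUP_SECONDS else 1000.0) for t in range(0, 900, 60)]
result = average_throughput(samples)
print(result)
```

Because the warm-up samples are dropped before averaging, the slow start in the hypothetical run does not pull the reported value down.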

Evaluation Metrics

Primary Metric: Throughput (records/second)
Variability Measure: Coefficient of Variation (CV)
Statistical Analysis: Confidence intervals (obtained via bootstrap method), statistical significance testing
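
As a rough sketch of these metrics, the snippet below computes a coefficient of variation and a percentile bootstrap confidence interval for the mean using only the Python standard library. The summary does not state which bootstrap variant, resample count, or standard deviation estimator the authors used, so the percentile method, the 2,000 resamples, and the sample standard deviation are all assumptions, as are the example values.

```python
import random
from statistics import fmean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (assumed estimator)."""
    return stdev(values) / fmean(values)


def bootstrap_ci_mean(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    The percentile variant and the resample count are assumptions;
    the source only says that intervals were obtained via bootstrap.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical throughput values in records/second.
throughputs = [100.0, 102.0, 98.0, 101.0, 99.0]
cv = coefficient_of_variation(throughputs)
lower, upper = bootstrap_ci_mean(throughputs)
print(f"CV = {cv:.2%}, 95% CI for the mean = [{lower:.1f}, {upper:.1f}]")
```

The bootstrap makes no normality assumption, which fits the non-normal throughput distribution reported in the results.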

Data Processing

Temporal Grouping: Analysis grouped by hour, day of week, and week
Reference Pattern: Establish baseline daily and weekly patterns
Anomaly Detection: Identify performance deviations during Black Friday
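
The temporal grouping can be illustrated with a short sketch. The `(start_time, throughput)` record format, the helper `mean_by`, and the four example runs are hypothetical. The paper's actual analysis pipeline is not described in this summary.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_by(runs, key):
    """Group (start_time, throughput) pairs by key(start_time) and average them."""
    groups = defaultdict(list)
    for started_at, throughput in runs:
        groups[key(started_at)].append(throughput)
    return {group: fmean(values) for group, values in groups.items()}


# Hypothetical runs: two on a Saturday, two on a Wednesday.
runs = [
    (datetime(2024, 5, 4, 0, 0), 1030.0),
    (datetime(2024, 5, 4, 12, 0), 1010.0),
    (datetime(2024, 5, 8, 0, 0), 1000.0),
    (datetime(2024, 5, 8, 12, 0), 980.0),
]

by_hour = mean_by(runs, lambda ts: ts.hour)
by_weekday = mean_by(runs, lambda ts: WEEKDAYS[ts.weekday()])
by_week = mean_by(runs, lambda ts: ts.isocalendar()[1])  # ISO week number

print(by_hour)
print(by_weekday)
print(by_week)
```

The per-group means from the reference period can then serve as the baseline against which runs around an event such as Black Friday are compared.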

Experimental Results

Overall Performance Variability

Data Scale: Over 1,000 benchmark executions
Distribution Characteristics: Throughput distribution shows clear central tendency, nearly symmetric within interquartile range, but non-normal due to slight skew toward lower throughput results
Coefficient of Variation: 3.69%, at the lower end of the range reported in the literature for micro-level and system-level benchmarks
Interquartile Range: 50% of measurements fall within -2.4% to +2.3% of the median
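
An interquartile range expressed relative to the median, as in the last item, could be computed along these lines. The quartile interpolation method (`"inclusive"`) and the sample values are assumptions. The summary does not say how the authors computed quartiles.

```python
from statistics import median, quantiles


def iqr_relative_to_median(values):
    """Return the first and third quartile as percent deviations from the median."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return (q1 - med) / med * 100, (q3 - med) / med * 100


# Hypothetical throughput values in records/second.
low, high = iqr_relative_to_median([96.0, 98.0, 100.0, 102.0, 104.0])
print(f"50% of runs fall within {low:+.1f}% to {high:+.1f}% of the median")
```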

Daily Performance Patterns

Analysis by execution hour reveals:

Midday Trough: Benchmarks executed at noon show slightly lower performance
Nighttime Peak: Highest performance achieved during late night and early morning hours
Performance Difference: Average variation of 2.15%
Statistical Significance: Pattern is statistically significant

Weekly Performance Patterns

Analysis grouped by day of week shows:

Weekend Advantage: Benchmarks executed on weekends show slightly higher performance than weekdays
Wednesday Minimum: Wednesday exhibits the lowest performance
Maximum Variation: Average throughput difference from Saturday to Wednesday is 2.52%
Statistical Significance: Pattern is statistically significant

Long-Term Patterns

Week-to-Week Variation: Decomposition by execution week shows minor performance fluctuations
Trend Analysis: No obvious long-term patterns or trends observed
Seasonal Limitations: Because the experiments cover only part of the year, seasonal differences in other periods cannot be ruled out

Black Friday Impact Analysis

Observed Phenomena

Performance Decline: Noticeable performance drop on Black Friday morning relative to the elevated throughput of the preceding days
Quick Recovery: Performance recovered on Saturday morning
Pre-Event Boost: Three days before Black Friday showed statistically significant throughput increase (2.3% to 3.3%)
Day-of Performance: Black Friday performance showed no significant difference from typical Friday performance
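
One common way to read results like these is to compare confidence intervals for an event day against those of the reference pattern and treat non-overlapping intervals as evidence of a difference. The sketch below shows that heuristic only. Whether the authors applied exactly this criterion is an assumption, and all interval bounds and throughput values are hypothetical.

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) overlap."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


def relative_change_percent(value, reference):
    """Percent change of value relative to reference."""
    return (value - reference) / reference * 100


# Hypothetical 95% confidence intervals for mean throughput (records/second).
typical_friday = (990.0, 1010.0)
black_friday = (985.0, 1008.0)       # overlaps: no evidence of a difference
pre_event_day = (1023.0, 1043.0)     # no overlap: throughput looks elevated

print(intervals_overlap(typical_friday, black_friday))
print(intervals_overlap(typical_friday, pre_event_day))
print(round(relative_change_percent(1023.0, 1000.0), 1))
```

Overlapping intervals do not prove that two days perform the same. They only mean the data do not show a difference at the chosen confidence level.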

Possible Explanations

Seasonal Variation: Overall performance improvement in November 2024 compared to summer months, with temporary decline on Black Friday
Proactive Resource Provisioning: Cloud providers may have proactively provisioned additional computing resources in anticipation of Black Friday, improving performance in preceding days

Related Work

Cloud Performance Variability Research

Foundational Research: Leitner and Cito (2016) on patterns of performance variability and predictability in public IaaS clouds
Experimental Methodology: Abedi and Brecht (2017) on methods for conducting repeatable experiments in highly variable cloud environments
Methodological Principles: Papadopoulos et al. (2021) on methodological principles for reproducible performance evaluation in cloud computing

Contribution Comparison

Scale Advantage: Larger scale than related work enables identification of more subtle performance patterns
Application Level: Focuses on application-level performance analysis rather than system or micro-level only
Time Span: Provides updated characterization over longer time periods

Conclusions and Discussion

Main Conclusions

Variability Confirmation: Application-level benchmark performance in cloud environments does exhibit observable variability
Moderate Magnitude: The variability is relatively small and becomes relevant mainly when the performance differences to be detected are below 5%
Pattern Existence: Clear influences of time, day of week, and global events identified
Practical Impact: Black Friday introduces a small but noticeable source of cloud performance variability

Limitations

Geographic Scope: Primary experiments concentrated in us-east-1 region
Application Type: Focused on stream processing applications, may not apply to other application types
Time Constraints: Experiments span only part of the year, potentially missing seasonal variations
Statistical Power: For some effects the confidence intervals overlap, so they do not reach statistical significance

Future Directions

Extended Application Types: Study performance variability of other types of cloud-native applications
Multi-Region Analysis: Conduct similar studies in more geographic regions
Long-Term Trends: Conduct long-term performance trend analysis spanning multiple years
Event Impact: Study the impact of other major global events on cloud performance

In-Depth Evaluation

Strengths

Rigorous Methodology: Employs large-scale, long-term empirical research methods with comprehensive data collection
Practical Significance: Research results provide direct guidance for cloud benchmarking practices
Technical Innovation: First quantitative analysis of global events' impact on cloud benchmarking
Statistical Rigor: Uses appropriate statistical methods, including bootstrap and confidence interval analysis
Reproducibility: Detailed description of experimental setup and automation processes

Weaknesses

Limited Application Scope: Focuses only on stream processing applications with limited generalizability
Causal Analysis: Lacks in-depth causal analysis of observed performance patterns
Cost Considerations: Does not discuss cost-benefit analysis of large-scale experiments
Practical Guidance: Lacks specific operational recommendations for practitioners

Impact

Academic Contribution: Provides important empirical data and methodological reference for cloud performance research
Engineering Practice: Provides scientific evidence for timing decisions in cloud benchmarking
Standard Setting: May influence development of cloud performance benchmarking standards and best practices

Applicable Scenarios

Performance Engineering: Cloud environment performance optimization and capacity planning
Benchmarking: Timing decisions for cloud-native application performance assessment
Resource Management: Cloud resource scheduling and load balancing strategy development
Academic Research: Cloud computing performance analysis and modeling research

References

The paper cites eight references covering cloud performance variability, experimental methodology, and benchmarking tools:

Leitner & Cito (2016) - Public IaaS cloud performance variability pattern research
Abedi & Brecht (2017) - Cloud environment repeatable experimentation methods
Papadopoulos et al. (2021) - Cloud computing performance evaluation methodology
Henning & Hasselbring (2022) - Cloud-native application scalability benchmarking methods
Horwitz (2022) - Black Friday traffic impact on observability strategies
Vogel et al. (2023) - Systematic mapping of distributed stream processing system performance
Henning et al. (2024) - ShuffleBench benchmarking tool
Henning et al. (2025) - Cloud performance variability research for stream processing applications

Summary: This is a high-quality empirical research paper that provides important insights for cloud benchmarking through large-scale experiments. The research methodology is rigorous, and the results have practical guidance value, making it an important contribution to the field of cloud performance engineering and benchmarking.