Benchmarks and performance experiments are frequently conducted in cloud environments. However, their results are often treated with caution, as the presumed high variability of performance in the cloud raises concerns about reproducibility and credibility. In a recent study, we empirically quantified the impact of this variability on benchmarking results by repeatedly executing a stream processing application benchmark at different times of the day over several months. Our analysis confirms that performance variability is indeed observable at the application level, although it is less pronounced than often assumed. The larger scale of our study compared to related work allowed us to identify subtle daily and weekly performance patterns. We now extend this investigation by examining whether a major global event, such as Black Friday, affects the outcomes of performance benchmarks.
- Paper ID: 2510.12397
- Title: Should I Run My Cloud Benchmark on Black Friday?
- Authors: Sören Henning, Adriano Vogel, Esteban Perez-Wohlfeil, Otmar Ertl, Rick Rabiser
- Institutions: Dynatrace Research, Linz, Austria; LIT CPS Lab, Johannes Kepler University Linz, Austria
- Classification: cs.SE (Software Engineering), cs.DC (Distributed, Parallel, and Cluster Computing), cs.PF (Performance)
- Publication Date: October 14, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.12397
Benchmarks and performance experiments are increasingly run in cloud environments, yet their results are often questioned because the presumed high variability of cloud performance threatens reproducibility and credibility. This study empirically quantifies the impact of that variability by repeatedly executing a stream processing application benchmark at different times of day over several months. The analysis confirms that performance variability exists at the application level, but to a lesser extent than commonly assumed. Because the study is larger in scale than related work, it can identify subtle daily and weekly performance patterns. The research is then extended to examine whether a major global event, such as Black Friday, affects benchmarking results.
As organizations continue their digital transformation toward cloud deployment, benchmarking and performance experimentation in cloud environments have become common practices in research and engineering. However, performance measurement in cloud environments faces the following challenges:
- Multi-tenant Resource Sharing: Cloud workloads share underlying infrastructure with other tenants
- Hardware Abstraction: High levels of hardware abstraction introduce variability
- Reproducibility Issues: Performance measurements may fluctuate, affecting meaningful comparisons across studies
- The credibility of cloud benchmarking directly impacts the accuracy of performance assessment
- Understanding performance variability patterns has practical implications for optimizing cloud resource configuration
- The study provides empirical evidence for best practices in cloud benchmarking
- Lack of large-scale, long-term empirical studies
- Insufficient quantitative analysis of application-level performance variability
- Insufficient consideration of global events' impact on cloud performance
- Large-Scale Longitudinal Study: Collected a dataset of over 1,000 benchmark executions through repeated experiments spanning several months
- Performance Pattern Identification: Discovered subtle but statistically significant daily and weekly performance patterns in cloud environments
- Global Event Impact Analysis: First quantitative analysis of the impact of major events such as Black Friday on cloud benchmark performance
- Application-Level Variability Quantification: Provided precise measurement of performance variability for distributed stream processing applications in cloud environments
- Application Type: Distributed stream processing applications (representing data-intensive, performance-critical distributed systems)
- Benchmarking Tool: Open-source cloud-native stream processing benchmark ShuffleBench and its Kafka Streams implementation
- Performance Metric: Throughput, measured using ShuffleBench's instantaneous measurement method
- Cloud Platform: Amazon Web Services (AWS)
- Service: Elastic Kubernetes Service (EKS)
- Cluster Configuration: 10 nodes using different sizes of m6i instances
- Geographic Region: us-east-1 (primary), eu-central-1 (validation)
Automated through scheduled tasks in AWS Elastic Container Service (ECS):
- Cluster Provisioning: Create new EKS clusters
- Infrastructure Installation: Deploy Apache Kafka, monitoring tools, and Theodolite benchmarking framework
- Benchmark Execution: Launch stream processing applications and load generators through Theodolite, running for 15 minutes
- Repeated Testing: Each execution repeated 3 times
- Data Collection: Store benchmark results, uninstall the infrastructure, and shut down the clusters
- Primary Experiment Period: May to July 2024, one week in September 2024
- Execution Frequency: Once every 6 hours (covering complete daily cycles)
- High-Frequency Period: Every 3 hours over 3 weeks (capturing finer-grained daily patterns)
- Black Friday Experiment: Additional experiments one week before and after Black Friday 2024
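The execution frequencies above can be illustrated with a minimal sketch that generates evenly spaced start times. The function name `execution_times` and the example date are hypothetical and chosen only for illustration; the study itself used scheduled tasks in AWS ECS, whose configuration is not shown here.

```python
from datetime import datetime, timedelta


def execution_times(start, end, interval_hours):
    """Return evenly spaced start times in the half-open window [start, end)."""
    if interval_hours <= 0:
        raise ValueError("interval_hours must be positive")
    step = timedelta(hours=interval_hours)
    times = []
    current = start
    while current < end:
        times.append(current)
        current += step
    return times


# Illustrative one-day window (the date is an assumption, not taken from the study).
day_start = datetime(2024, 5, 6, 0, 0)
day_end = day_start + timedelta(days=1)

regular = execution_times(day_start, day_end, 6)  # one run every 6 hours
dense = execution_times(day_start, day_end, 3)    # one run every 3 hours
```

With a 6-hour interval each day is sampled four times, and with a 3-hour interval eight times, which is what allows the finer-grained view of daily patterns.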
- Warm-up Period: Discard measurements from the first 3 minutes
- Measurement Window: Calculate average throughput for the remaining time
- Output: Each benchmark execution produces one average throughput value
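The reduction of one benchmark run to a single value can be sketched as follows. This is a minimal illustration that assumes throughput is available as timestamped samples of the form `(seconds_since_start, records_per_second)`; that input format and the function name `average_throughput` are assumptions, not the interface of ShuffleBench or Theodolite.

```python
from statistics import fmean

WARMUP_SECONDS = 3 * 60  # the first 3 minutes are discarded as warm-up


def average_throughput(samples, warmup_seconds=WARMUP_SECONDS):
    """Average the throughput samples recorded after the warm-up period.

    samples: iterable of (seconds_since_start, records_per_second) pairs.
    """
    steady = [rate for t, rate in samples if t >= warmup_seconds]
    if not steady:
        raise ValueError("no samples remain after the warm-up period")
    return fmean(steady)


# Synthetic 15-minute run sampled every 10 seconds: lower throughput
# during warm-up, then a constant 1000 records/second.
samples = [(t, 500.0 if t < 180 else 1000.0) for t in range(0, 900, 10)]
result = average_throughput(samples)
```

Because the warm-up samples are excluded, the synthetic ramp-up phase does not pull the average down.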
- Primary Metric: Throughput (records/second)
- Variability Measure: Coefficient of Variation (CV)
- Statistical Analysis: Confidence intervals (obtained via bootstrap method), statistical significance testing
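As a rough sketch of these two measures: the coefficient of variation is the standard deviation divided by the mean, and a bootstrap confidence interval is obtained by resampling the measurements with replacement. The code below assumes the sample standard deviation and a percentile bootstrap; the summary does not state which variants the authors used, so both choices are assumptions, as are the function names and the synthetic data.

```python
import random
from statistics import fmean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / fmean(values)


def bootstrap_ci(values, stat=fmean, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(values)
    # Recompute the statistic on n_resamples resamples drawn with replacement.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    alpha = (1.0 - confidence) / 2.0
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return lower, upper


# Synthetic throughput values (records/second), not data from the study.
throughputs = [100.0, 102.0, 98.0, 101.0, 99.0, 103.0, 97.0, 100.0]
cv = coefficient_of_variation(throughputs)
ci_low, ci_high = bootstrap_ci(throughputs)
```

Two groups of runs whose intervals for the mean do not overlap can be treated as differing; overlapping intervals leave the comparison inconclusive.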
- Temporal Grouping: Analysis grouped by hour, day of week, and week
- Reference Pattern: Establish baseline daily and weekly patterns
- Anomaly Detection: Identify performance deviations during Black Friday
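The temporal grouping can be sketched as grouping per-run throughput values by a key derived from the start time and comparing group means. The helper names and the definition of the relative spread (difference between the highest and lowest group mean, relative to the lowest) are assumptions made for illustration; the summary does not specify how the paper normalizes such differences.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def group_means(runs, key):
    """Mean throughput per group.

    runs: iterable of (start_time, throughput) pairs.
    key: function mapping a datetime to a group label (e.g. hour or weekday).
    """
    groups = defaultdict(list)
    for start_time, throughput in runs:
        groups[key(start_time)].append(throughput)
    return {label: fmean(values) for label, values in groups.items()}


def relative_spread(means):
    """Highest minus lowest group mean, relative to the lowest (assumed definition)."""
    low, high = min(means.values()), max(means.values())
    return (high - low) / low


# Synthetic runs: a Wednesday and a Saturday, at midnight and at noon.
runs = [
    (datetime(2024, 5, 8, 0), 1010.0),
    (datetime(2024, 5, 8, 12), 990.0),
    (datetime(2024, 5, 11, 0), 1030.0),
    (datetime(2024, 5, 11, 12), 1010.0),
]
by_hour = group_means(runs, lambda t: t.hour)
by_weekday = group_means(runs, lambda t: t.weekday())  # Monday is 0
spread = relative_spread(by_weekday)
```

The same grouping yields the reference pattern against which runs from an event such as Black Friday can be compared.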
- Data Scale: Over 1,000 benchmark executions
- Distribution Characteristics: Throughput distribution shows clear central tendency, nearly symmetric within interquartile range, but non-normal due to slight skew toward lower throughput results
- Coefficient of Variation: 3.69%, at the lower end of the range reported in the literature for micro- and system-level benchmarks
- Interquartile Range: 50% of measurements fall within -2.4% to +2.3% of the median
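A figure such as "-2.4% to +2.3% of the median" can be derived by expressing the first and third quartiles as relative deviations from the median. The sketch below assumes the inclusive quartile method of Python's `statistics.quantiles`; the quartile definition used in the paper is not stated in this summary, and the data is synthetic.

```python
from statistics import median, quantiles


def iqr_relative_to_median(values):
    """Return (Q1, Q3) as relative deviations from the median."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return (q1 - med) / med, (q3 - med) / med


# Synthetic throughput values, not data from the study.
values = [96.0, 98.0, 100.0, 102.0, 104.0]
low, high = iqr_relative_to_median(values)
```

For this synthetic sample the middle 50% of the values lies within 2% below and 2% above the median.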
Analysis by execution hour reveals:
- Midday Trough: Benchmarks executed at noon show slightly lower performance
- Nighttime Peak: Highest performance achieved during late night and early morning hours
- Performance Difference: Average variation of 2.15%
- Statistical Significance: Pattern is statistically significant
Analysis grouped by day of week shows:
- Weekend Advantage: Benchmarks executed on weekends show slightly higher performance than weekdays
- Wednesday Minimum: Wednesday exhibits the lowest performance
- Maximum Variation: Average throughput difference from Saturday to Wednesday is 2.52%
- Statistical Significance: Pattern is statistically significant
- Week-to-Week Variation: Decomposition by execution week shows minor performance fluctuations
- Trend Analysis: No obvious long-term patterns or trends observed
- Seasonal Limitations: Due to experiments spanning only part of the year, other periods' differences cannot be ruled out
- Performance Decline: A visible performance drop was observed on Black Friday morning
- Quick Recovery: Performance recovered on Saturday morning
- Pre-Event Boost: The three days before Black Friday showed a statistically significant throughput increase (2.3% to 3.3%)
- Day-of Performance: Black Friday performance showed no significant difference from typical Friday performance
- Seasonal Variation: Overall performance improvement in November 2024 compared to summer months, with temporary decline on Black Friday
- Proactive Resource Provisioning: Cloud providers may have proactively provisioned additional computing resources in anticipation of Black Friday, improving performance in preceding days
- Foundational Research: Leitner and Cito (2016) on patterns of performance variability and predictability in public IaaS clouds
- Experimental Methodology: Abedi and Brecht (2017) on methods for conducting repeatable experiments in highly variable cloud environments
- Methodological Principles: Papadopoulos et al. (2021) on methodological principles for reproducible performance evaluation in cloud computing
- Scale Advantage: Larger scale than related work enables identification of more subtle performance patterns
- Application Level: Focuses on application-level performance analysis rather than system or micro-level only
- Time Span: Provides updated characterization over longer time periods
- Variability Confirmation: Application-level benchmark performance in cloud environments does exhibit observable variability
- Moderate Magnitude: Variability magnitude is relatively small, becoming relevant only when target performance differences are less than 5%
- Pattern Existence: Clear influences of time, day of week, and global events identified
- Practical Impact: Black Friday introduces a small but noticeable source of cloud performance variability
- Geographic Scope: Primary experiments concentrated in us-east-1 region
- Application Type: Focused on stream processing applications; findings may not apply to other application types
- Time Constraints: Experiments span only part of the year, potentially missing seasonal variations
- Statistical Power: Some effects did not reach statistical significance due to overlapping confidence intervals
- Extended Application Types: Study performance variability of other types of cloud-native applications
- Multi-Region Analysis: Conduct similar studies in more geographic regions
- Long-Term Trends: Conduct long-term performance trend analysis spanning multiple years
- Event Impact: Study the impact of other major global events on cloud performance
- Rigorous Methodology: Employs large-scale, long-term empirical research methods with comprehensive data collection
- Practical Significance: Research results provide direct guidance for cloud benchmarking practices
- Technical Innovation: First quantitative analysis of global events' impact on cloud benchmarking
- Statistical Rigor: Uses appropriate statistical methods, including bootstrap and confidence interval analysis
- Reproducibility: Detailed description of experimental setup and automation processes
- Limited Application Scope: Focuses only on stream processing applications with limited generalizability
- Causal Analysis: Lacks in-depth causal analysis of observed performance patterns
- Cost Considerations: Does not discuss cost-benefit analysis of large-scale experiments
- Practical Guidance: Lacks specific operational recommendations for practitioners
- Academic Contribution: Provides important empirical data and methodological reference for cloud performance research
- Engineering Practice: Provides scientific evidence for timing decisions in cloud benchmarking
- Standard Setting: May influence development of cloud performance benchmarking standards and best practices
- Performance Engineering: Cloud environment performance optimization and capacity planning
- Benchmarking: Timing decisions for cloud-native application performance assessment
- Resource Management: Cloud resource scheduling and load balancing strategy development
- Academic Research: Cloud computing performance analysis and modeling research
This paper cites 8 important references covering key areas of cloud performance variability, experimental methodology, and benchmarking tools:
- Leitner & Cito (2016) - Public IaaS cloud performance variability pattern research
- Abedi & Brecht (2017) - Cloud environment repeatable experimentation methods
- Papadopoulos et al. (2021) - Cloud computing performance evaluation methodology
- Henning & Hasselbring (2022) - Cloud-native application scalability benchmarking methods
- Horwitz (2022) - Black Friday traffic impact on observability strategies
- Vogel et al. (2023) - Systematic mapping of distributed stream processing system performance
- Henning et al. (2024) - ShuffleBench benchmarking tool
- Henning et al. (2025) - Cloud performance variability research for stream processing applications
Summary: This is a high-quality empirical research paper that provides important insights for cloud benchmarking through large-scale experiments. The research methodology is rigorous, and the results have practical guidance value, making it an important contribution to the field of cloud performance engineering and benchmarking.