Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis
- Paper ID: 2403.09445
- Title: How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis
- Authors: Bekir Turkkan (IBM Research), Elvis Rodrigues (University at Buffalo), Tevfik Kosar (University at Buffalo), Aleksey Charapko (University of New Hampshire), Ailidani Ailijiang (Microsoft), Murat Demirbas (MongoDB)
- Category: cs.DC (Distributed Computing)
- Publication Date: arXiv preprint, last updated October 27, 2025
- Paper Link: https://arxiv.org/abs/2403.09445
Distributed coordination services and protocols are critical components of distributed systems, essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standardized benchmarking and evaluation tools, developers and researchers of distributed coordination services either use NoSQL standard benchmarks while ignoring consistency, distribution, and fault tolerance evaluation, or create their own ad-hoc micro-benchmarks that cannot be compared with other services. This research analyzes and compares the evaluation mechanisms of well-known and widely-used consensus algorithms, distributed coordination services, and distributed applications built on these services. The authors identify the most important requirements for benchmarking distributed coordination services, such as metrics and parameters for performance, scalability, availability, and consistency evaluation. Finally, the paper discusses why existing benchmarks fail to meet the complex evaluation requirements of distributed coordination systems.
Distributed coordination systems (including consensus algorithms, coordination services, and distributed applications) lack standardized evaluation benchmarks, resulting in:
- Incomplete Evaluation: Developers either use NoSQL benchmarks (such as YCSB) while ignoring consistency, distribution, and fault tolerance, or create custom micro-benchmarks
- Poor Comparability: Each system uses customized micro-benchmarks with different metrics and techniques, making fair comparison impossible
- Fragmented Evaluation: No unified framework exists for comprehensive evaluation of performance, scalability, availability, and consistency
- Practical Needs: Cloud computing and big data applications (search engines, social networks, video streaming, IoT) all depend on distributed coordination
- Technical Evolution: From Paxos to Raft, and then to WAN-optimized variants like WPaxos and SwiftPaxos continue to emerge
- Wide Application: Critical systems like Google Spanner, Apache Kafka, and Twitter Manhattan all depend on coordination services
- Evaluation Complexity: Distributed coordination systems require simultaneous consideration of multiple dimensions including performance, consistency, fault tolerance, and geographic distribution
Inadequacies of Existing Benchmarking Tools:
- YCSB: Single client process, does not support data access overlap, access locality, and other critical parameters
- TPC-C: Primarily designed for transaction processing, unsuitable for coordination service-specific requirements
- Jepsen: Requires deep understanding of tool internals, not a black-box test, difficult to adopt
- Lack of WAN Support: Most tools do not support evaluation in geographically distributed scenarios
Through systematic investigation of evaluation practices across 30+ distributed coordination systems, this paper aims to:
- Identify commonalities and differences in current evaluation practices
- Extract core requirements for evaluating distributed coordination systems
- Analyze defects in existing benchmarking tools
- Provide guidance for future development of standardized benchmarking tools
- Systematic Investigation: Analyzed evaluation practices of 30+ distributed coordination systems (including 13 consensus algorithms, 10 coordination services, 7 distributed applications)
- Topology Classification: Identified and defined 6 experimental topology structures (flat, star, multi-star, hierarchical, grid, centralized log), providing a framework for understanding system architecture
- Metrics and Parameter Framework:
- Systematically organized 4 major evaluation metrics: performance, scalability, availability, consistency
- Identified critical workload parameters: read-write ratio, data access overlap, access locality, object count and size, etc.
- Benchmark Requirements: Proposed 7 core requirements for distributed coordination system benchmarks:
- Flexibility and complexity
- WAN system support
- Benchmark scalability
- Ease of adoption
- Black-box testing capability
- Consistency verification capability
- Fault injection capability
- Gap Analysis: Systematically analyzed capabilities and limitations of 10+ existing benchmarking tools (YCSB, TPC-C, Jepsen, Elle, etc.)
- Practical Guidance: Provided best practices and considerations for researchers and engineers evaluating distributed coordination systems
This paper does not propose new technical methods but conducts systematic investigation and analysis, with tasks including:
- Input: Papers and evaluation materials from 30+ distributed coordination systems
- Processing: Extract evaluation topologies, metrics, parameters, tools, and other information
- Output: Systematic summary of evaluation practices, requirements analysis, and tool capability comparison
The authors selected three categories of systems based on relevance, timeliness, and impact:
Category I: Consensus Algorithms (13)
- Paxos variants: Mencius, FPaxos, Multi-Paxos, Hybrid-Paxos, E-Paxos, M2 Paxos, WPaxos, SwiftPaxos, Omni-Paxos
- Other protocols: Raft, Bizur, ZAB, Hydra
Category II: Coordination Services (10)
- ZooKeeper, Tango, Calvin, WanKeeper, ZooNet, Boki, FlexLog, SplitFT, Fabric, Narwhal
Category III: Distributed Applications (7)
- Spanner, DistributedLog, PNUTS, COPS, CockroachDB, OceanBase, ScalarDB
The authors defined 6 topologies based on quorum creation method and request processing method:
| Topology Type | Characteristics | Representative Systems |
|---|
| Flat Topology | Multi-leader or leaderless, allows concurrent updates | Mencius, E-Paxos |
| Star Topology | Single-leader protocol | ZooKeeper, Raft, Hybrid-Paxos |
| Multi-Star Topology | Multiple quorums, each star-shaped, flat communication between leaders | ZooNet, M2 Paxos, Spanner |
| Hierarchical Topology | Multi-star with hierarchy among leaders | WanKeeper |
| Grid Topology | Uses grid quorum to optimize performance | FPaxos, WPaxos |
| Centralized Log Topology | Shared persistent log records execution order | Tango, Boki, Calvin |
From each system's paper, extracted:
- Experimental Setup: Number of regions, servers, clients, test platform, benchmarking tools
- Evaluation Metrics: Throughput, latency, scalability, availability, consistency
- Workload Parameters: Read-write ratio, object count/size, data access overlap, access locality
The authors analyzed experimental settings of 30 systems with major findings:
Geographic Distribution:
- Single-Region Deployment: Most systems (e.g., Raft, Multi-Paxos, ZooKeeper)
- Multi-Region Deployment: WAN-optimized systems (e.g., WPaxos 5 regions 15 servers, SwiftPaxos 13 regions)
- Real Cloud Environments: Amazon EC2, Google Compute Engine, Alibaba ECS
- Controlled Environments: Emulab, DETER (network latency controllable)
Cluster Scale:
- Small-scale: 3-13 servers (most consensus algorithms)
- Medium-scale: 15-100 servers (coordination services)
- Large-scale: OceanBase reaches 1,557 servers, 360,000 clients/servers
Client Configuration:
- Single client: Bizur, Omni-Paxos
- Multi-threaded clients: Multi-Paxos (1-20 threads)
- Distributed clients: E-Paxos (50 clients), PNUTS (300 clients)
According to Table 2 statistics:
| Metric Category | Systems Evaluated | Coverage Rate |
|---|
| Performance-Throughput | 28/30 | 93% |
| Performance-Latency | 27/30 | 90% |
| Scalability-Servers | 14/30 | 47% |
| Scalability-Clients | 8/30 | 27% |
| Availability-Failures | 14/30 | 47% |
| Availability-Partitions | 5/30 | 17% |
| Consistency | 8/30 | 27% |
Key Findings:
- Performance evaluation is nearly universal, but consistency evaluation is severely insufficient
- Network partition testing is far less common than node failure testing
- Scalability evaluation typically focuses only on server count, ignoring region scaling
Based on Table 3 analysis:
- 100% Write Operations: Multi-Paxos, E-Paxos, Hybrid-Paxos (focus on conflicting commands)
- 0-100% Variation: ZooKeeper, WanKeeper (demonstrate different scenarios)
- Fixed Ratio: COPS (50% write), PNUTS (10% write)
- Unspecified: Raft, FPaxos, and multiple other systems
Problem: Performance varies dramatically under different read-write ratios, but many systems test only single configurations
- 100% Overlap: Mencius, E-Paxos, Hybrid-Paxos (worst case)
- 0-100% Variation: WanKeeper, Boki, FlexLog
- Not Evaluated: Most single-leader systems (minimal performance impact)
Key Insight: Multi-leader system performance heavily depends on access overlap, but evaluation is often overlooked
- Evaluated Systems: M2 Paxos (0-100%), WPaxos (70-90%), COPS (0-100%)
- Not Evaluated: Most systems
- Importance: Massive impact on systems using ownership mechanisms
- Specified Systems: Mencius (16-1024), M2 Paxos (1-1000), Omni-Paxos (500-50K)
- Most Unspecified: Limits understanding of conflict rates
- Small Objects: 6B-1KB (CPU-intensive workloads)
- Large Objects: 1KB-8KB (network-intensive workloads)
- Variation Range: Mencius (6B-4KB), SplitFT (128B-8KB)
Workload Scalability:
- Hybrid-Paxos, E-Paxos: Increase concurrent client count
- WPaxos: Adjust client rate limiting
- Most systems: Test until saturation point
System Scalability:
- Horizontal Scaling: ZooKeeper (3-13 replicas), Calvin (4-100 replicas)
- Region Scaling: E-Paxos and Mencius (3-7 regions)
- Vertical Scaling: M2 Paxos (vary CPU performance)
Problem: Lack of unified scalability testing methodology makes comparison difficult
Current Practices:
- Testing Tools: Bizur uses Serialla, Multi-Paxos uses checksum verification
- Jepsen Testing: ZooKeeper, CockroachDB (linearizability verification)
- Elle Testing: ScalarDB (strict serializability verification)
- Staleness Measurement: ZooNet, PNUTS, BG (but cannot prove strong consistency)
Core Issues:
- Most systems claim "strong consistency" with vague definitions
- Lack of systematic consistency verification methods
- Staleness measurement insufficient to verify linearizability or serializability
According to Table 4:
Failure Types:
- Node Crashes: Most common (14/30 systems)
- Network Partitions: Less common (5/30 systems)
- Other Failures: Clock drift, memory corruption, etc., almost untested
Failure Count:
- Single node failure: Most systems
- Multiple node failures: ZooKeeper (2 followers), Omni-Paxos (1-2 nodes)
Testing Methods:
- Measure throughput degradation during failures
- Spanner: Crash entire region but Paxos group remains available
- Hybrid-Paxos: Increase replica count to test availability improvement
NoSQL Database Benchmarks:
- YCSB (2010): Most popular NoSQL benchmark, but lacks distributed client and WAN scenario support
- YCSB+T (2014): Adds transaction support, but still single-process
- YCSB++ (2011): Supports distributed clients, but relies on ZooKeeper synchronization, unsuitable for WAN
Application-Specific Benchmarks:
- BG (2013): Social network workload, but uses locks to avoid conflicts
- TPC-C (1992): Transaction processing standard, but not designed for coordination services
- HiBench (2010): Hadoop benchmark, unsuitable for coordination systems
Big Data Benchmarks:
- BigDataBench (2014): Covers multiple big data workloads
- But all unsuitable for evaluating coordination service-specific requirements (consistency, fault tolerance, geographic distribution)
Jepsen (2013-present):
- Powerful consistency testing framework
- Can detect linearizability violations
- Requires deep tool understanding, not black-box testing
Elle (2020):
- Based on Jepsen, more efficient isolation level detection
- Builds transaction dependency graphs to identify violation cycles
- Still requires customized workloads
Other Testing Tools:
- Serialla: Strict serializability testing used by Bizur
- UPB (2013): Availability benchmark, but based on YCSB
Cloud Service Evaluation:
- Elasticity evaluation, computing capability, cost-benefit analysis
- But not specific to coordination services
File Systems and Data Warehouses:
- Distributed file system benchmarking
- Data warehouse query performance evaluation
- Different from coordination system requirements
Coordination Service Surveys:
- Algorithm comparison (Paxos variants)
- Service characteristics analysis
- Unique Contribution of This Paper: First systematic analysis of evaluation practices and benchmarking requirements
- Fragmented Evaluation Practices: Among 30 systems, only 7 use standard benchmarks (YCSB, TPC-C, Jepsen); most use custom micro-benchmarks
- Uneven Metric Coverage:
- Performance evaluation universal (93% of systems)
- Consistency evaluation insufficient (27% of systems)
- Network partition testing rare (17% of systems)
- Inconsistent Parameter Usage:
- Critical parameters (access locality, data access overlap) often overlooked
- Lack of standardized parameter configurations
- Difficult to fairly compare different systems
- Existing Benchmarks Inadequate:
- YCSB: Does not support distributed clients, WAN scenarios, access locality
- TPC-C: Not designed for coordination services
- Jepsen: Non-black-box, difficult to adopt
- No tool satisfies all requirements
- 7 Major Benchmark Requirements:
- Flexibility and complexity (support multi-dimensional parameter tuning)
- WAN system support (geographic distribution, uneven latency)
- Scalability (distributed load generation)
- Ease of adoption (black-box testing, language-agnostic)
- Performance benchmarking (throughput, latency, tail latency)
- Consistency verification (linearizability, serializability)
- Fault injection (crashes, partitions, clock drift)
- Sample Coverage: While covering 30 systems, may miss some emerging systems or domain-specific coordination services
- Timeliness: Rapid evolution of distributed systems means new evaluation practices and tools constantly emerge
- Depth of Analysis: Analysis of each system's evaluation practices based on public papers may not capture all implementation details
- Benchmark Tool Implementation: Paper identifies requirements but does not implement a complete benchmarking tool
- Consistency Models: Different systems define "strong consistency" differently, making unified evaluation standards difficult
- Develop Standardized Benchmark Tools:
- Support distributed clients and WAN scenarios
- Provide flexible parameter configuration
- Integrate consistency verification capability
- Support multiple fault injection types
- Establish Evaluation Standards:
- Define minimum required evaluation metric sets
- Standardize workload parameter configurations
- Establish consistency verification protocols
- Expand Survey Scope:
- Include more emerging coordination protocols (e.g., DAG-based consensus)
- Analyze blockchain consensus algorithm evaluation practices
- Study coordination requirements in edge computing scenarios
- Empirical Research:
- Re-evaluate existing systems using standard benchmarks
- Quantify parameter impact on performance
- Verify claimed consistency guarantees
- Automated Testing:
- Develop automated consistency verification tools
- Integrate with continuous integration/continuous deployment (CI/CD)
- Support regression testing
- Breadth: Covers 30 systems spanning 20 years of research history (Paxos 1998 - latest systems 2024)
- Depth: Detailed analysis of experimental setup, topologies, metrics, parameters
- Clear Classification: Three-layer classification (algorithm-service-application) + six topology types
- Guidance: Provides evaluation best practices for developers
- Clear Requirements: 7 benchmark requirements are actionable
- Problem-Oriented: Clearly identifies specific shortcomings of existing tools
- 3 Comprehensive Tables: Table 1 (experimental setup), Table 2 (metric usage), Table 3 (workload parameters)
- Quantitative Analysis: Metric coverage rates, parameter usage frequencies
- Visualization: Clear diagrams of 6 topology types
- Does not favor specific systems or benchmarking tools
- Fair analysis of each tool's strengths and weaknesses
- Fact-based assessment rather than subjective judgment
- Cites 85 references
- Clear methodology (selection criteria, analysis framework)
- Conclusions well-supported by data
- No performance difference data for different evaluation methods
- No quantification of parameter choice impact on results
- Missing statistical analysis (correlation, significance testing)
- No actual development of benchmark tool meeting requirements
- No experimental verification of requirement feasibility
- Lack of prototype system evaluation
- Insufficient discussion of differences between consistency models
- No specific consistency verification methodology provided
- Missing complexity analysis of consistency testing
- While emphasizing WAN importance, specific analysis insufficient
- Insufficient discussion of different geographic distribution patterns
- Missing challenges of cross-cloud, cross-continent deployment
- Blockchain consensus algorithm evaluation practices not covered
- Edge computing scenario coordination requirements not discussed
- Machine learning system coordination evaluation not addressed
- No detailed experimental reproduction guidelines
- Lack of open-source datasets or evaluation scripts
- No discussion of ensuring evaluation reproducibility
- Fills Gap: First systematic investigation of distributed coordination system evaluation practices
- Theoretical Value: Establishes evaluation framework and requirements system
- Citation Potential: May become reference literature for evaluation methods in the field
- Engineering Guidance: Helps developers choose appropriate evaluation methods
- Benchmark Development: Provides requirements specification for new benchmark tools
- Standardization Promotion: May drive establishment of evaluation standards
- Implementation Missing: Does not provide directly usable tools
- Verification Insufficient: Requirement feasibility unverified through empirical study
- Update Needed: Rapidly evolving field requires continuous updates
- Direct Application: Researchers and engineers in distributed coordination systems
- Indirect Application: Developers of distributed databases, blockchain, cloud computing systems
- Educational Value: Can serve as reference material for distributed systems courses
- New Protocol Development: Reference requirements checklist when designing evaluation plans
- System Comparison: Select appropriate metrics and parameters for fair comparison
- Paper Writing: Cite standard evaluation practices to enhance credibility
- System Selection: Understand evaluation results and limitations of different systems
- Performance Tuning: Identify critical parameters affecting performance
- Failure Testing: Design comprehensive availability testing plans
- Course Teaching: Introduce distributed system evaluation methodology
- Project Practice: Guide students in designing experiments and evaluation plans
- Literature Review: Understand current research status in the field
- Benchmark Development: Serve as requirements specification document
- Industry Standards: Drive establishment of evaluation standards
- Compliance Testing: Design coordination service compliance tests
- Lamport, L. (1998). The part-time parliament. ACM TOCS - Original Paxos paper
- Ongaro, D. & Ousterhout, J. (2014). In search of an understandable consensus algorithm. USENIX ATC - Raft algorithm
- Hunt, P. et al. (2010). ZooKeeper: Wait-free coordination for internet-scale systems. USENIX ATC
- Balakrishnan, M. et al. (2013). Tango: Distributed data structures over a shared log. SOSP
- Cooper, B.F. et al. (2010). Benchmarking cloud serving systems with YCSB. SoCC
- Kingsbury, K. (2024). Jepsen tests - Consistency testing framework
- Kingsbury, K. & Alvaro, P. (2020). Elle: Inferring Isolation Anomalies from Experimental Observations
- Ailijiang, A. et al. (2017). Multileader WAN Paxos: Ruling the archipelago with fast consensus - WPaxos
- Mao, Y. et al. (2008). Mencius: Building efficient replicated state machines for WANs. OSDI
- Corbett, J.C. et al. (2013). Spanner: Google's globally distributed database. ACM TOCS
Summary: This paper is an important survey work in the field of distributed coordination system evaluation, systematically revealing the fragmentation problem in current evaluation practices and proposing requirements for standardized benchmarking. While lacking actual tool implementation, it provides clear direction for future research and engineering practice. For distributed systems researchers and engineers, this is essential reading for understanding evaluation methodology in the field.