How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis

Turkkan, Rodrigues, Kosar et al.
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.

Basic Information

  • Paper ID: 2403.09445
  • Title: How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis
  • Authors: Bekir Turkkan (IBM Research), Elvis Rodrigues (University at Buffalo), Tevfik Kosar (University at Buffalo), Aleksey Charapko (University of New Hampshire), Ailidani Ailijiang (Microsoft), Murat Demirbas (MongoDB)
  • Category: cs.DC (Distributed Computing)
  • Publication Date: arXiv preprint, last updated October 27, 2025
  • Paper Link: https://arxiv.org/abs/2403.09445

Abstract

Distributed coordination services and protocols are critical components of distributed systems, essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standardized benchmarking and evaluation tools, developers and researchers of distributed coordination services either use NoSQL standard benchmarks while ignoring consistency, distribution, and fault tolerance evaluation, or create their own ad-hoc micro-benchmarks that cannot be compared with other services. This research analyzes and compares the evaluation mechanisms of well-known and widely-used consensus algorithms, distributed coordination services, and distributed applications built on these services. The authors identify the most important requirements for benchmarking distributed coordination services, such as metrics and parameters for performance, scalability, availability, and consistency evaluation. Finally, the paper discusses why existing benchmarks fail to meet the complex evaluation requirements of distributed coordination systems.

Research Background and Motivation

1. Core Problem to Be Addressed

Distributed coordination systems (including consensus algorithms, coordination services, and distributed applications) lack standardized evaluation benchmarks, resulting in:

  • Incomplete Evaluation: Developers either use NoSQL benchmarks (such as YCSB) while ignoring consistency, distribution, and fault tolerance, or create custom micro-benchmarks
  • Poor Comparability: Each system uses customized micro-benchmarks with different metrics and techniques, making fair comparison impossible
  • Fragmented Evaluation: No unified framework exists for comprehensive evaluation of performance, scalability, availability, and consistency

2. Problem Significance

  • Practical Needs: Cloud computing and big data applications (search engines, social networks, video streaming, IoT) all depend on distributed coordination
  • Technical Evolution: Protocols have evolved from Paxos to Raft, and WAN-optimized variants such as WPaxos and SwiftPaxos continue to emerge
  • Wide Application: Critical systems like Google Spanner, Apache Kafka, and Twitter Manhattan all depend on coordination services
  • Evaluation Complexity: Distributed coordination systems require simultaneous consideration of multiple dimensions including performance, consistency, fault tolerance, and geographic distribution

3. Limitations of Existing Approaches

Inadequacies of Existing Benchmarking Tools:

  • YCSB: Single client process, does not support data access overlap, access locality, and other critical parameters
  • TPC-C: Primarily designed for transaction processing, unsuitable for coordination service-specific requirements
  • Jepsen: Requires deep understanding of tool internals, not a black-box test, difficult to adopt
  • Lack of WAN Support: Most tools do not support evaluation in geographically distributed scenarios

4. Research Motivation

Through a systematic investigation of evaluation practices across 30 distributed coordination systems, this paper aims to:

  • Identify commonalities and differences in current evaluation practices
  • Extract core requirements for evaluating distributed coordination systems
  • Analyze defects in existing benchmarking tools
  • Provide guidance for future development of standardized benchmarking tools

Core Contributions

  1. Systematic Investigation: Analyzed evaluation practices of 30 distributed coordination systems (13 consensus algorithms, 10 coordination services, 7 distributed applications)
  2. Topology Classification: Identified and defined 6 experimental topology structures (flat, star, multi-star, hierarchical, grid, centralized log), providing a framework for understanding system architecture
  3. Metrics and Parameter Framework:
    • Systematically organized 4 major evaluation metrics: performance, scalability, availability, consistency
    • Identified critical workload parameters: read-write ratio, data access overlap, access locality, object count and size, etc.
  4. Benchmark Requirements: Proposed 7 core requirements for distributed coordination system benchmarks:
    • Flexibility and complexity
    • WAN system support
    • Benchmark scalability
    • Ease of adoption
    • Black-box testing capability
    • Consistency verification capability
    • Fault injection capability
  5. Gap Analysis: Systematically analyzed capabilities and limitations of 10+ existing benchmarking tools (YCSB, TPC-C, Jepsen, Elle, etc.)
  6. Practical Guidance: Provided best practices and considerations for researchers and engineers evaluating distributed coordination systems

Methodology Details

Task Definition

This paper does not propose new technical methods but conducts systematic investigation and analysis, with tasks including:

  • Input: Papers and evaluation materials from 30 distributed coordination systems
  • Processing: Extract evaluation topologies, metrics, parameters, tools, and other information
  • Output: Systematic summary of evaluation practices, requirements analysis, and tool capability comparison

Research Method

1. System Selection Criteria

The authors selected three categories of systems based on relevance, timeliness, and impact:

Category I: Consensus Algorithms (13)

  • Paxos variants: Mencius, FPaxos, Multi-Paxos, Hybrid-Paxos, E-Paxos, M2 Paxos, WPaxos, SwiftPaxos, Omni-Paxos
  • Other protocols: Raft, Bizur, ZAB, Hydra

Category II: Coordination Services (10)

  • ZooKeeper, Tango, Calvin, WanKeeper, ZooNet, Boki, FlexLog, SplitFT, Fabric, Narwhal

Category III: Distributed Applications (7)

  • Spanner, DistributedLog, PNUTS, COPS, CockroachDB, OceanBase, ScalarDB

2. Topology Classification Framework

The authors defined 6 topologies based on quorum creation method and request processing method:

Topology Type | Characteristics | Representative Systems
Flat Topology | Multi-leader or leaderless, allows concurrent updates | Mencius, E-Paxos
Star Topology | Single-leader protocol | ZooKeeper, Raft, Hybrid-Paxos
Multi-Star Topology | Multiple quorums, each star-shaped, flat communication between leaders | ZooNet, M2 Paxos, Spanner
Hierarchical Topology | Multi-star with hierarchy among leaders | WanKeeper
Grid Topology | Uses grid quorum to optimize performance | FPaxos, WPaxos
Centralized Log Topology | Shared persistent log records execution order | Tango, Boki, Calvin
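
The grid topology row can be illustrated with a minimal sketch. It assumes the common reading of "grid quorum" in which one kind of quorum is a full row and the other a full column of a replica grid; the summary itself does not spell this out, and the node names and helper functions below are hypothetical. The point shown is only the intersection property: every row shares a node with every column, while two distinct rows share none.

```python
from itertools import product


def grid(rows: int, cols: int) -> list[list[str]]:
    """Lay out rows*cols hypothetical replicas as a grid of node names."""
    return [[f"n{r}{c}" for c in range(cols)] for r in range(rows)]


def row_quorums(g: list[list[str]]) -> list[set[str]]:
    """One quorum per grid row (assumed here to serve one protocol phase)."""
    return [set(row) for row in g]


def column_quorums(g: list[list[str]]) -> list[set[str]]:
    """One quorum per grid column (assumed here to serve the other phase)."""
    return [set(col) for col in zip(*g)]


def always_intersect(qs_a: list[set[str]], qs_b: list[set[str]]) -> bool:
    """True if every quorum in qs_a shares at least one node with every quorum in qs_b."""
    return all(a & b for a, b in product(qs_a, qs_b))


g = grid(3, 3)
rows, cols = row_quorums(g), column_quorums(g)
print(always_intersect(rows, cols))  # rows versus columns
print(always_intersect(rows, rows))  # rows versus rows
```

Under these assumptions a quorum has only 3 of the 9 nodes, which is smaller than a majority, yet any row and any column still overlap in exactly one replica.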

3. Data Extraction and Analysis

From each system's paper, the authors extracted:

  • Experimental Setup: Number of regions, servers, clients, test platform, benchmarking tools
  • Evaluation Metrics: Throughput, latency, scalability, availability, consistency
  • Workload Parameters: Read-write ratio, object count/size, data access overlap, access locality

Experimental Setup (Survey Findings)

Experimental Topology Distribution

The authors analyzed experimental settings of 30 systems with major findings:

Geographic Distribution:

  • Single-Region Deployment: Most systems (e.g., Raft, Multi-Paxos, ZooKeeper)
  • Multi-Region Deployment: WAN-optimized systems (e.g., WPaxos with 5 regions and 15 servers, SwiftPaxos with 13 regions)
  • Real Cloud Environments: Amazon EC2, Google Compute Engine, Alibaba ECS
  • Controlled Environments: Emulab, DETER (network latency controllable)

Cluster Scale:

  • Small-scale: 3-13 servers (most consensus algorithms)
  • Medium-scale: 15-100 servers (coordination services)
  • Large-scale: OceanBase reaches 1,557 servers, 360,000 clients/servers

Client Configuration:

  • Single client: Bizur, Omni-Paxos
  • Multi-threaded clients: Multi-Paxos (1-20 threads)
  • Distributed clients: E-Paxos (50 clients), PNUTS (300 clients)

Evaluation Metrics Usage

According to Table 2 statistics:

Metric Category | Systems Evaluated | Coverage Rate
Performance-Throughput | 28/30 | 93%
Performance-Latency | 27/30 | 90%
Scalability-Servers | 14/30 | 47%
Scalability-Clients | 8/30 | 27%
Availability-Failures | 14/30 | 47%
Availability-Partitions | 5/30 | 17%
Consistency | 8/30 | 27%

Key Findings:

  • Performance evaluation is nearly universal, but consistency evaluation is severely insufficient
  • Network partition testing is far less common than node failure testing
  • Scalability evaluation typically focuses only on server count, ignoring region scaling
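
The coverage rates above follow directly from the per-metric counts. As a quick arithmetic check, the sketch below recomputes them from the counts reported in this summary; the dictionary and function names are illustrative only.

```python
TOTAL_SYSTEMS = 30

# Number of surveyed systems that report each metric, as listed in this summary.
systems_evaluated = {
    "Performance-Throughput": 28,
    "Performance-Latency": 27,
    "Scalability-Servers": 14,
    "Scalability-Clients": 8,
    "Availability-Failures": 14,
    "Availability-Partitions": 5,
    "Consistency": 8,
}


def coverage_percent(count: int, total: int = TOTAL_SYSTEMS) -> int:
    """Coverage rate as a whole-number percentage."""
    return round(100 * count / total)


rates = {metric: coverage_percent(n) for metric, n in systems_evaluated.items()}
for metric, rate in rates.items():
    print(f"{metric}: {systems_evaluated[metric]}/{TOTAL_SYSTEMS} = {rate}%")
```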

Experimental Results (Survey Findings)

Major Finding 1: Inconsistent Workload Parameter Usage

Based on Table 3 analysis:

Read-Write Ratio

  • 100% Write Operations: Multi-Paxos, E-Paxos, Hybrid-Paxos (focus on conflicting commands)
  • 0-100% Variation: ZooKeeper, WanKeeper (demonstrate different scenarios)
  • Fixed Ratio: COPS (50% write), PNUTS (10% write)
  • Unspecified: Raft, FPaxos, and multiple other systems

Problem: Performance varies dramatically under different read-write ratios, but many systems test only single configurations

Data Access Overlap

  • 100% Overlap: Mencius, E-Paxos, Hybrid-Paxos (worst case)
  • 0-100% Variation: WanKeeper, Boki, FlexLog
  • Not Evaluated: Most single-leader systems (where overlap has minimal performance impact)

Key Insight: Multi-leader system performance depends heavily on access overlap, but this parameter is often left out of evaluations

Access Locality

  • Evaluated Systems: M2 Paxos (0-100%), WPaxos (70-90%), COPS (0-100%)
  • Not Evaluated: Most systems
  • Importance: Massive impact on systems using ownership mechanisms

Object Count

  • Specified Systems: Mencius (16-1024), M2 Paxos (1-1000), Omni-Paxos (500-50K)
  • Unspecified: Most systems, which limits understanding of conflict rates

Object Size

  • Small Objects: 6B-1KB (CPU-intensive workloads)
  • Large Objects: 1KB-8KB (network-intensive workloads)
  • Variation Range: Mencius (6B-4KB), SplitFT (128B-8KB)
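
To make the five workload parameters concrete, here is a minimal sketch of a per-client operation generator. It is not a tool from the paper: the `Workload` class, the `generate` function, the key naming scheme, and the way overlap and locality are interpreted (overlap as the probability of touching a pool shared by all clients, locality as the probability that a key's home region is the client's own) are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    write_ratio: float   # fraction of operations that are writes
    overlap: float       # probability an operation targets the shared key pool
    locality: float      # probability the key's home region is the client's region
    object_count: int    # number of distinct keys per pool
    object_size: int     # payload size in bytes for writes


def generate(w: Workload, client_id: int, region: str,
             regions: list[str], n_ops: int, seed: int = 0):
    """Return n_ops (kind, key, payload) tuples for one hypothetical client."""
    rng = random.Random(seed)
    # Remote candidates; fall back to the local region in a single-region setup.
    others = [r for r in regions if r != region] or [region]
    ops = []
    for _ in range(n_ops):
        home = region if rng.random() < w.locality else rng.choice(others)
        pool = "shared" if rng.random() < w.overlap else f"client{client_id}"
        key = f"{home}/{pool}/{rng.randrange(w.object_count)}"
        if rng.random() < w.write_ratio:
            ops.append(("write", key, bytes(w.object_size)))
        else:
            ops.append(("read", key, None))
    return ops


# Example: 10% writes, 50% overlap, 70% locality, 1000 keys, 1 KB objects.
mixed = Workload(write_ratio=0.1, overlap=0.5, locality=0.7,
                 object_count=1000, object_size=1024)
for kind, key, _ in generate(mixed, client_id=0, region="us",
                             regions=["us", "eu", "ap"], n_ops=5):
    print(kind, key)
```

Reporting all five values alongside results is what would let a reader estimate the conflict rate a system actually faced during an experiment.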

Major Finding 2: Diverse Scalability Evaluation Methods

Workload Scalability:

  • Hybrid-Paxos, E-Paxos: Increase concurrent client count
  • WPaxos: Adjust client rate limiting
  • Most systems: Test until saturation point

System Scalability:

  • Horizontal Scaling: ZooKeeper (3-13 replicas), Calvin (4-100 replicas)
  • Region Scaling: E-Paxos and Mencius (3-7 regions)
  • Vertical Scaling: M2 Paxos (vary CPU performance)

Problem: Lack of unified scalability testing methodology makes comparison difficult

Major Finding 3: Severely Insufficient Consistency Evaluation

Current Practices:

  • Testing Tools: Bizur uses Serialla, Multi-Paxos uses checksum verification
  • Jepsen Testing: ZooKeeper, CockroachDB (linearizability verification)
  • Elle Testing: ScalarDB (strict serializability verification)
  • Staleness Measurement: ZooNet, PNUTS, BG (but cannot prove strong consistency)

Core Issues:

  • Most systems claim "strong consistency" with vague definitions
  • Lack of systematic consistency verification methods
  • Staleness measurement insufficient to verify linearizability or serializability
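
The last point can be illustrated with a small sketch. It is a brute-force linearizability check for a single register, written for this summary rather than taken from the paper or from any existing tool; the `Op` type and `is_linearizable` function are hypothetical, and the exhaustive search is only practical for tiny histories. The first history contains a read that is stale by a small amount, which a staleness bound might accept, yet the history is not linearizable.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str     # "write" or "read"
    value: int    # value written, or value returned by the read
    start: float  # invocation time
    end: float    # response time


def is_linearizable(history: list[Op], initial: int = 0) -> bool:
    """Try every total order that respects real-time precedence."""
    for order in permutations(history):
        n = len(order)
        # Skip orders that place an operation before one that finished
        # strictly before it was invoked.
        if any(order[j].end < order[i].start
               for i in range(n) for j in range(i + 1, n)):
            continue
        state = initial
        for op in order:
            if op.kind == "write":
                state = op.value
            elif op.value != state:
                break  # this read could not have returned that value here
        else:
            return True
    return False


# The write completes at t=1; a read invoked at t=2 still returns the old value.
stale = [Op("write", 1, 0.0, 1.0), Op("read", 0, 2.0, 3.0)]
# The read overlaps the write, so returning the old value is allowed.
concurrent = [Op("write", 1, 0.0, 3.0), Op("read", 0, 1.0, 2.0)]

print(is_linearizable(stale))
print(is_linearizable(concurrent))
```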

Major Finding 4: Availability Evaluation Concentrated on Crash Failures

According to Table 4:

Failure Types:

  • Node Crashes: Most common (14/30 systems)
  • Network Partitions: Less common (5/30 systems)
  • Other Failures: Clock drift, memory corruption, etc., almost untested

Failure Count:

  • Single node failure: Most systems
  • Multiple node failures: ZooKeeper (2 followers), Omni-Paxos (1-2 nodes)

Testing Methods:

  • Measure throughput degradation during failures
  • Spanner: Crashes an entire region and shows that the Paxos group remains available
  • Hybrid-Paxos: Increase replica count to test availability improvement
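
Measuring throughput degradation during a failure can be sketched as follows. This is an illustrative calculation, not a procedure from the paper: the function names, the one-second window, and the synthetic trace (100 ops/s that drops to 20 ops/s for two seconds after a failure injected at t=10 s) are all made up for the example.

```python
import math


def windowed_throughput(completion_times, window_s, duration_s):
    """Operations per second in consecutive windows of window_s seconds."""
    counts = [0] * math.ceil(duration_s / window_s)
    for t in completion_times:
        if 0 <= t < duration_s:
            counts[int(t // window_s)] += 1
    return [c / window_s for c in counts]


def degradation(series, window_s, fail_at_s):
    """Fractional drop from the pre-failure mean to the worst post-failure window."""
    k = int(fail_at_s // window_s)
    if k == 0 or k >= len(series):
        raise ValueError("need at least one window before and after the failure")
    baseline = sum(series[:k]) / k
    return 1 - min(series[k:]) / baseline


# Synthetic trace: steady 100 ops/s, with a dip to 20 ops/s during seconds 10 and 11.
times = []
for second in range(20):
    rate = 20 if 10 <= second < 12 else 100
    times += [second + i / rate for i in range(rate)]

series = windowed_throughput(times, window_s=1.0, duration_s=20.0)
print(series)
print(degradation(series, window_s=1.0, fail_at_s=10.0))
```

A fuller report would also record how long throughput stays below the baseline, since recovery time matters as much as the depth of the dip.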

Related Work

Distributed System Benchmarking

NoSQL Database Benchmarks:

  • YCSB (2010): Most popular NoSQL benchmark, but lacks distributed client and WAN scenario support
  • YCSB+T (2014): Adds transaction support, but still single-process
  • YCSB++ (2011): Supports distributed clients, but relies on ZooKeeper synchronization, unsuitable for WAN

Application-Specific Benchmarks:

  • BG (2013): Social network workload, but uses locks to avoid conflicts
  • TPC-C (1992): Transaction processing standard, but not designed for coordination services
  • HiBench (2010): Hadoop benchmark, unsuitable for coordination systems

Big Data Benchmarks:

  • BigDataBench (2014): Covers multiple big data workloads
  • But all unsuitable for evaluating coordination service-specific requirements (consistency, fault tolerance, geographic distribution)

Consistency Testing Tools

Jepsen (2013-present):

  • Powerful consistency testing framework
  • Can detect linearizability violations
  • Requires deep tool understanding, not black-box testing

Elle (2020):

  • Based on Jepsen, more efficient isolation level detection
  • Builds transaction dependency graphs to identify violation cycles
  • Still requires customized workloads
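
The cycle search at the heart of the dependency-graph approach can be sketched in a few lines. This is only the generic graph step, assuming the dependency edges between transactions are already known; how Elle infers those edges from observed reads and writes is far more involved and is not reproduced here. The `find_cycle` function and the transaction names are hypothetical.

```python
def find_cycle(deps: dict[str, set[str]]):
    """Return one dependency cycle as a list of transaction names, or None.

    deps maps a transaction to the transactions that must come after it.
    """
    state: dict[str, str] = {}  # node -> "visiting" or "done"
    path: list[str] = []

    def visit(node: str):
        state[node] = "visiting"
        path.append(node)
        for nxt in sorted(deps.get(node, ())):
            if state.get(nxt) == "visiting":
                # Back edge: the cycle is the part of the path starting at nxt.
                return path[path.index(nxt):] + [nxt]
            if nxt not in state:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = "done"
        return None

    for node in sorted(deps):
        if node not in state:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# T1 must precede T2, T2 must precede T3, and T3 must precede T1: no serial order exists.
anomalous = {"T1": {"T2"}, "T2": {"T3"}, "T3": {"T1"}}
serial = {"T1": {"T2"}, "T2": {"T3"}}

print(find_cycle(anomalous))
print(find_cycle(serial))
```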

Other Testing Tools:

  • Serialla: Strict serializability testing used by Bizur
  • UPB (2013): Availability benchmark, but based on YCSB

Distributed System Surveys

Cloud Service Evaluation:

  • Elasticity evaluation, computing capability, cost-benefit analysis
  • But not specific to coordination services

File Systems and Data Warehouses:

  • Distributed file system benchmarking
  • Data warehouse query performance evaluation
  • Different from coordination system requirements

Coordination Service Surveys:

  • Algorithm comparison (Paxos variants)
  • Service characteristics analysis
  • Unique Contribution of This Paper: First systematic analysis of evaluation practices and benchmarking requirements

Conclusions and Discussion

Main Conclusions

  1. Fragmented Evaluation Practices: Among 30 systems, only 7 use standard benchmarks (YCSB, TPC-C, Jepsen); most use custom micro-benchmarks
  2. Uneven Metric Coverage:
    • Performance evaluation universal (93% of systems)
    • Consistency evaluation insufficient (27% of systems)
    • Network partition testing rare (17% of systems)
  3. Inconsistent Parameter Usage:
    • Critical parameters (access locality, data access overlap) often overlooked
    • Lack of standardized parameter configurations
    • Difficult to fairly compare different systems
  4. Existing Benchmarks Inadequate:
    • YCSB: Does not support distributed clients, WAN scenarios, access locality
    • TPC-C: Not designed for coordination services
    • Jepsen: Non-black-box, difficult to adopt
    • No tool satisfies all requirements
  5. 7 Major Benchmark Requirements:
    • Flexibility and complexity (support multi-dimensional parameter tuning)
    • WAN system support (geographic distribution, uneven latency)
    • Scalability (distributed load generation)
    • Ease of adoption (black-box testing, language-agnostic)
    • Performance benchmarking (throughput, latency, tail latency)
    • Consistency verification (linearizability, serializability)
    • Fault injection (crashes, partitions, clock drift)
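
Tail latency, named in the performance requirement above, is usually reported as a high percentile of per-request latency. The sketch below uses the nearest-rank definition, which is one common convention among several; the `percentile` helper and the sample data are illustrative and not taken from the paper.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(x) for x in range(1, 101)]
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
```

Reporting only the mean would hide the slowest requests, which are often the ones affected by leader changes or cross-region round trips.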

Limitations

  1. Sample Coverage: While covering 30 systems, may miss some emerging systems or domain-specific coordination services
  2. Timeliness: Rapid evolution of distributed systems means new evaluation practices and tools constantly emerge
  3. Depth of Analysis: Analysis of each system's evaluation practices based on public papers may not capture all implementation details
  4. Benchmark Tool Implementation: Paper identifies requirements but does not implement a complete benchmarking tool
  5. Consistency Models: Different systems define "strong consistency" differently, making unified evaluation standards difficult

Future Directions

  1. Develop Standardized Benchmark Tools:
    • Support distributed clients and WAN scenarios
    • Provide flexible parameter configuration
    • Integrate consistency verification capability
    • Support multiple fault injection types
  2. Establish Evaluation Standards:
    • Define minimum required evaluation metric sets
    • Standardize workload parameter configurations
    • Establish consistency verification protocols
  3. Expand Survey Scope:
    • Include more emerging coordination protocols (e.g., DAG-based consensus)
    • Analyze blockchain consensus algorithm evaluation practices
    • Study coordination requirements in edge computing scenarios
  4. Empirical Research:
    • Re-evaluate existing systems using standard benchmarks
    • Quantify parameter impact on performance
    • Verify claimed consistency guarantees
  5. Automated Testing:
    • Develop automated consistency verification tools
    • Integrate with continuous integration/continuous deployment (CI/CD)
    • Support regression testing

In-Depth Evaluation

Strengths

1. Systematicity and Comprehensiveness

  • Breadth: Covers 30 systems spanning more than 25 years of research history (from Paxos in 1998 to the latest systems in 2024)
  • Depth: Detailed analysis of experimental setup, topologies, metrics, parameters
  • Clear Classification: Three-layer classification (algorithm-service-application) + six topology types

2. High Practical Value

  • Guidance: Provides evaluation best practices for developers
  • Clear Requirements: 7 benchmark requirements are actionable
  • Problem-Oriented: Clearly identifies specific shortcomings of existing tools

3. Rich Data

  • Comprehensive Tables: Table 1 (experimental setup), Table 2 (metric usage), Table 3 (workload parameters), Table 4 (failure and availability testing)
  • Quantitative Analysis: Metric coverage rates, parameter usage frequencies
  • Visualization: Clear diagrams of 6 topology types

4. Objective and Neutral

  • Does not favor specific systems or benchmarking tools
  • Fair analysis of each tool's strengths and weaknesses
  • Fact-based assessment rather than subjective judgment

5. Academic Rigor

  • Cites 85 references
  • Clear methodology (selection criteria, analysis framework)
  • Conclusions well-supported by data

Weaknesses

1. Lack of Quantitative Comparison

  • No performance difference data for different evaluation methods
  • No quantification of parameter choice impact on results
  • Missing statistical analysis (correlation, significance testing)

2. Insufficient Implementation Verification

  • No actual development of benchmark tool meeting requirements
  • No experimental verification of requirement feasibility
  • Lack of prototype system evaluation

3. Shallow Consistency Evaluation Analysis

  • Insufficient discussion of differences between consistency models
  • No specific consistency verification methodology provided
  • Missing complexity analysis of consistency testing

4. Limited WAN Scenario Analysis

  • The paper emphasizes the importance of WAN deployments, but its specific analysis of them is thin
  • Insufficient discussion of different geographic distribution patterns
  • Missing challenges of cross-cloud, cross-continent deployment

5. Limited Coverage of Emerging Domains

  • Blockchain consensus algorithm evaluation practices not covered
  • Edge computing scenario coordination requirements not discussed
  • Machine learning system coordination evaluation not addressed

6. Insufficient Reproducibility Guidance

  • No detailed experimental reproduction guidelines
  • Lack of open-source datasets or evaluation scripts
  • No discussion of ensuring evaluation reproducibility

Impact

1. Academic Contribution

  • Fills Gap: First systematic investigation of distributed coordination system evaluation practices
  • Theoretical Value: Establishes evaluation framework and requirements system
  • Citation Potential: May become reference literature for evaluation methods in the field

2. Practical Value

  • Engineering Guidance: Helps developers choose appropriate evaluation methods
  • Benchmark Development: Provides requirements specification for new benchmark tools
  • Standardization Promotion: May drive establishment of evaluation standards

3. Limitations

  • Implementation Missing: Does not provide directly usable tools
  • Verification Insufficient: Requirement feasibility unverified through empirical study
  • Update Needed: Rapidly evolving field requires continuous updates

4. Applicable Scope

  • Direct Application: Researchers and engineers in distributed coordination systems
  • Indirect Application: Developers of distributed databases, blockchain, cloud computing systems
  • Educational Value: Can serve as reference material for distributed systems courses

Applicable Scenarios

1. Research Scenarios

  • New Protocol Development: Reference requirements checklist when designing evaluation plans
  • System Comparison: Select appropriate metrics and parameters for fair comparison
  • Paper Writing: Cite standard evaluation practices to enhance credibility

2. Engineering Scenarios

  • System Selection: Understand evaluation results and limitations of different systems
  • Performance Tuning: Identify critical parameters affecting performance
  • Failure Testing: Design comprehensive availability testing plans

3. Educational Scenarios

  • Course Teaching: Introduce distributed system evaluation methodology
  • Project Practice: Guide students in designing experiments and evaluation plans
  • Literature Review: Understand current research status in the field

4. Standardization Scenarios

  • Benchmark Development: Serve as requirements specification document
  • Industry Standards: Drive establishment of evaluation standards
  • Compliance Testing: Design coordination service compliance tests

Selected References

Classic Consensus Algorithms

  1. Lamport, L. (1998). The part-time parliament. ACM TOCS - Original Paxos paper
  2. Ongaro, D. & Ousterhout, J. (2014). In search of an understandable consensus algorithm. USENIX ATC - Raft algorithm

Coordination Services

  1. Hunt, P. et al. (2010). ZooKeeper: Wait-free coordination for internet-scale systems. USENIX ATC
  2. Balakrishnan, M. et al. (2013). Tango: Distributed data structures over a shared log. SOSP

Benchmarking Tools

  1. Cooper, B.F. et al. (2010). Benchmarking cloud serving systems with YCSB. SoCC
  2. Kingsbury, K. (2024). Jepsen tests - Consistency testing framework
  3. Kingsbury, K. & Alvaro, P. (2020). Elle: Inferring Isolation Anomalies from Experimental Observations

WAN Optimization

  1. Ailijiang, A. et al. (2017). Multileader WAN Paxos: Ruling the archipelago with fast consensus - WPaxos
  2. Mao, Y. et al. (2008). Mencius: Building efficient replicated state machines for WANs. OSDI

Distributed Applications

  1. Corbett, J.C. et al. (2013). Spanner: Google's globally distributed database. ACM TOCS

Summary: This paper is an important survey work in the field of distributed coordination system evaluation, systematically revealing the fragmentation problem in current evaluation practices and proposing requirements for standardized benchmarking. While lacking actual tool implementation, it provides clear direction for future research and engineering practice. For distributed systems researchers and engineers, this is essential reading for understanding evaluation methodology in the field.