How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis

Turkkan, Rodrigues, Kosar et al.

Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.

Basic Information

Paper ID: 2403.09445
Title: How to Evaluate Distributed Coordination Systems? -- A Survey and Analysis
Authors: Bekir Turkkan (IBM Research), Elvis Rodrigues (University at Buffalo), Tevfik Kosar (University at Buffalo), Aleksey Charapko (University of New Hampshire), Ailidani Ailijiang (Microsoft), Murat Demirbas (MongoDB)
Category: cs.DC (Distributed Computing)
Publication Date: arXiv preprint, last updated October 27, 2025
Paper Link: https://arxiv.org/abs/2403.09445

Abstract

Distributed coordination services and protocols are critical components of distributed systems, essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standardized benchmarking and evaluation tools, developers and researchers of distributed coordination services either use NoSQL standard benchmarks while ignoring consistency, distribution, and fault tolerance evaluation, or create their own ad-hoc micro-benchmarks that cannot be compared with other services. This research analyzes and compares the evaluation mechanisms of well-known and widely-used consensus algorithms, distributed coordination services, and distributed applications built on these services. The authors identify the most important requirements for benchmarking distributed coordination services, such as metrics and parameters for performance, scalability, availability, and consistency evaluation. Finally, the paper discusses why existing benchmarks fail to meet the complex evaluation requirements of distributed coordination systems.

Research Background and Motivation

1. Core Problem to Be Addressed

Distributed coordination systems (including consensus algorithms, coordination services, and distributed applications) lack standardized evaluation benchmarks, resulting in:

Incomplete Evaluation: Developers either use NoSQL benchmarks (such as YCSB) while ignoring consistency, distribution, and fault tolerance, or create custom micro-benchmarks
Poor Comparability: Each system uses customized micro-benchmarks with different metrics and techniques, making fair comparison impossible
Fragmented Evaluation: No unified framework exists for comprehensive evaluation of performance, scalability, availability, and consistency

2. Problem Significance

Practical Needs: Cloud computing and big data applications (search engines, social networks, video streaming, IoT) all depend on distributed coordination
Technical Evolution: New protocols continue to emerge, from Paxos to Raft and on to WAN-optimized variants such as WPaxos and SwiftPaxos
Wide Application: Critical systems like Google Spanner, Apache Kafka, and Twitter Manhattan all depend on coordination services
Evaluation Complexity: Distributed coordination systems require simultaneous consideration of multiple dimensions including performance, consistency, fault tolerance, and geographic distribution

3. Limitations of Existing Approaches

Inadequacies of Existing Benchmarking Tools:

YCSB: Runs as a single client process and does not support data access overlap, access locality, or other critical parameters
TPC-C: Primarily designed for transaction processing, unsuitable for coordination service-specific requirements
Jepsen: Requires deep understanding of tool internals, not a black-box test, difficult to adopt
Lack of WAN Support: Most tools do not support evaluation in geographically distributed scenarios

4. Research Motivation

Through systematic investigation of evaluation practices across 30+ distributed coordination systems, this paper aims to:

Identify commonalities and differences in current evaluation practices
Extract core requirements for evaluating distributed coordination systems
Analyze defects in existing benchmarking tools
Provide guidance for future development of standardized benchmarking tools

Core Contributions

Systematic Investigation: Analyzed evaluation practices of 30+ distributed coordination systems (including 13 consensus algorithms, 10 coordination services, 7 distributed applications)
Topology Classification: Identified and defined 6 experimental topology structures (flat, star, multi-star, hierarchical, grid, centralized log), providing a framework for understanding system architecture
Metrics and Parameter Framework:
- Systematically organized 4 major evaluation metrics: performance, scalability, availability, consistency
- Identified critical workload parameters: read-write ratio, data access overlap, access locality, object count and size, etc.
Benchmark Requirements: Proposed 7 core requirements for distributed coordination system benchmarks:
- Flexibility and complexity
- WAN system support
- Benchmark scalability
- Ease of adoption
- Black-box testing capability
- Consistency verification capability
- Fault injection capability
Gap Analysis: Systematically analyzed capabilities and limitations of 10+ existing benchmarking tools (YCSB, TPC-C, Jepsen, Elle, etc.)
Practical Guidance: Provided best practices and considerations for researchers and engineers evaluating distributed coordination systems

Methodology Details

Task Definition

This paper does not propose new technical methods but conducts systematic investigation and analysis, with tasks including:

Input: Papers and evaluation materials from 30+ distributed coordination systems
Processing: Extract evaluation topologies, metrics, parameters, tools, and other information
Output: Systematic summary of evaluation practices, requirements analysis, and tool capability comparison

Research Method

1. System Selection Criteria

The authors selected three categories of systems based on relevance, timeliness, and impact:

Category I: Consensus Algorithms (13)

Paxos variants: Mencius, FPaxos, Multi-Paxos, Hybrid-Paxos, E-Paxos, M2 Paxos, WPaxos, SwiftPaxos, Omni-Paxos
Other protocols: Raft, Bizur, ZAB, Hydra

Category II: Coordination Services (10)

ZooKeeper, Tango, Calvin, WanKeeper, ZooNet, Boki, FlexLog, SplitFT, Fabric, Narwhal

Category III: Distributed Applications (7)

Spanner, DistributedLog, PNUTS, COPS, CockroachDB, OceanBase, ScalarDB

2. Topology Classification Framework

The authors defined 6 topologies based on quorum creation method and request processing method:

Topology Type	Characteristics	Representative Systems
Flat Topology	Multi-leader or leaderless, allows concurrent updates	Mencius, E-Paxos
Star Topology	Single-leader protocol	ZooKeeper, Raft, Hybrid-Paxos
Multi-Star Topology	Multiple quorums, each star-shaped, flat communication between leaders	ZooNet, M2 Paxos, Spanner
Hierarchical Topology	Multi-star with hierarchy among leaders	WanKeeper
Grid Topology	Uses grid quorum to optimize performance	FPaxos, WPaxos
Centralized Log Topology	Shared persistent log records execution order	Tango, Boki, Calvin
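
The grid topology row can be made concrete with a small sketch. It assumes one common grid construction in which servers are laid out in a rows-by-columns grid, one family of quorums is formed by the rows and the other by the columns, so any quorum from one family shares a server with any quorum from the other. Which protocol phase uses rows and which uses columns, and the helper names below, are illustrative assumptions rather than details taken from the survey.

```python
from itertools import product

def grid(rows, cols):
    """Lay out server ids 0..rows*cols-1 in a rows x cols grid."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]

def row_quorums(g):
    """Each full row is one quorum."""
    return [set(row) for row in g]

def column_quorums(g):
    """Each full column is one quorum."""
    return [set(col) for col in zip(*g)]

def all_intersect(family_a, family_b):
    """True if every quorum in family_a shares a server with every quorum in family_b."""
    return all(a & b for a, b in product(family_a, family_b))

g = grid(3, 4)  # 12 servers: 3 rows, 4 columns
print(all_intersect(column_quorums(g), row_quorums(g)))  # rows vs columns
print(all_intersect(row_quorums(g), row_quorums(g)))     # rows vs rows
```

In this 12-server layout a row quorum has 4 members and a column quorum has 3, both smaller than a 7-server majority, which is the kind of saving a grid quorum aims for.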

3. Data Extraction and Analysis

From each system's paper, the authors extracted:

Experimental Setup: Number of regions, servers, clients, test platform, benchmarking tools
Evaluation Metrics: Throughput, latency, scalability, availability, consistency
Workload Parameters: Read-write ratio, object count/size, data access overlap, access locality

Experimental Setup (Survey Findings)

Experimental Topology Distribution

The authors analyzed experimental settings of 30 systems with major findings:

Geographic Distribution:

Single-Region Deployment: Most systems (e.g., Raft, Multi-Paxos, ZooKeeper)
Multi-Region Deployment: WAN-optimized systems (e.g., WPaxos with 5 regions and 15 servers, SwiftPaxos with 13 regions)
Real Cloud Environments: Amazon EC2, Google Compute Engine, Alibaba ECS
Controlled Environments: Emulab, DETER (network latency controllable)

Cluster Scale:

Small-scale: 3-13 servers (most consensus algorithms)
Medium-scale: 15-100 servers (coordination services)
Large-scale: OceanBase reaches 1,557 servers, 360,000 clients/servers

Client Configuration:

Single client: Bizur, Omni-Paxos
Multi-threaded clients: Multi-Paxos (1-20 threads)
Distributed clients: E-Paxos (50 clients), PNUTS (300 clients)

Evaluation Metrics Usage

According to Table 2 statistics:

Metric Category	Systems Evaluated	Coverage Rate
Performance-Throughput	28/30	93%
Performance-Latency	27/30	90%
Scalability-Servers	14/30	47%
Scalability-Clients	8/30	27%
Availability-Failures	14/30	47%
Availability-Partitions	5/30	17%
Consistency	8/30	27%
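
The coverage rates in the table follow directly from the counts. The short check below recomputes them, assuming each rate is the count divided by 30 and rounded to the nearest whole percent; the variable and function names are illustrative.

```python
# Counts copied from the table above; 30 is the number of surveyed systems.
TOTAL_SYSTEMS = 30
systems_evaluated = {
    "Performance-Throughput": 28,
    "Performance-Latency": 27,
    "Scalability-Servers": 14,
    "Scalability-Clients": 8,
    "Availability-Failures": 14,
    "Availability-Partitions": 5,
    "Consistency": 8,
}

def coverage_rate(count, total=TOTAL_SYSTEMS):
    """Share of systems as a whole-number percentage."""
    return round(100 * count / total)

coverage = {name: coverage_rate(n) for name, n in systems_evaluated.items()}
for name, pct in coverage.items():
    print(f"{name}: {systems_evaluated[name]}/{TOTAL_SYSTEMS} = {pct}%")
```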

Key Findings:

Performance evaluation is nearly universal, but consistency evaluation is severely insufficient
Network partition testing is far less common than node failure testing
Scalability evaluation typically focuses only on server count, ignoring region scaling

Experimental Results (Survey Findings)

Major Finding 1: Inconsistent Workload Parameter Usage

Based on Table 3 analysis:

Read-Write Ratio

100% Write Operations: Multi-Paxos, E-Paxos, Hybrid-Paxos (focus on conflicting commands)
0-100% Variation: ZooKeeper, WanKeeper (demonstrate different scenarios)
Fixed Ratio: COPS (50% write), PNUTS (10% write)
Unspecified: Raft, FPaxos, and multiple other systems

Problem: Performance varies dramatically under different read-write ratios, but many systems test only a single configuration

Data Access Overlap

100% Overlap: Mencius, E-Paxos, Hybrid-Paxos (worst case)
0-100% Variation: WanKeeper, Boki, FlexLog
Not Evaluated: Most single-leader systems (minimal performance impact)

Key Insight: Multi-leader system performance depends heavily on access overlap, but this parameter is often left out of evaluations

Access Locality

Evaluated Systems: M2 Paxos (0-100%), WPaxos (70-90%), COPS (0-100%)
Not Evaluated: Most systems
Importance: Massive impact on systems using ownership mechanisms

Object Count

Specified Systems: Mencius (16-1024), M2 Paxos (1-1000), Omni-Paxos (500-50K)
Unspecified in Most Systems: Limits understanding of conflict rates

Object Size

Small Objects: 6B-1KB (CPU-intensive workloads)
Large Objects: 1KB-8KB (network-intensive workloads)
Variation Range: Mencius (6B-4KB), SplitFT (128B-8KB)
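
To show how these parameters interact in a load generator, here is a minimal sketch. It is not taken from the paper or from any existing benchmark. It assumes "data access overlap" means the probability that a request targets a key pool shared by all clients rather than a client-private pool, and it leaves access locality out for brevity. `WorkloadConfig` and `generate_ops` are hypothetical names.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkloadConfig:
    write_ratio: float   # fraction of operations that are writes (0.0 to 1.0)
    overlap: float       # assumed: probability a request hits the shared key pool
    object_count: int    # number of keys in each pool
    object_size: int     # payload size in bytes for writes

def generate_ops(cfg, client_id, n, seed=0):
    """Return n (kind, key, payload) tuples for one client, reproducible per seed."""
    rng = random.Random(seed * 1_000_003 + client_id)
    ops = []
    for _ in range(n):
        # Shared keys can conflict across clients; private keys cannot.
        pool = "shared" if rng.random() < cfg.overlap else f"client{client_id}"
        key = f"{pool}/{rng.randrange(cfg.object_count)}"
        if rng.random() < cfg.write_ratio:
            ops.append(("write", key, bytes(cfg.object_size)))
        else:
            ops.append(("read", key, None))
    return ops

# Worst-case conflict setting: all writes, all on shared keys, 6-byte objects.
worst_case = generate_ops(WorkloadConfig(1.0, 1.0, 16, 6), client_id=0, n=5)
print(worst_case[0][:2])
```

Making every parameter an explicit field is the point of the sketch: a reader of an evaluation can then see exactly which read-write ratio, overlap, object count, and object size produced a result.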

Major Finding 2: Diverse Scalability Evaluation Methods

Workload Scalability:

Hybrid-Paxos, E-Paxos: Increase concurrent client count
WPaxos: Adjust client rate limiting
Most systems: Test until saturation point

System Scalability:

Horizontal Scaling: ZooKeeper (3-13 replicas), Calvin (4-100 replicas)
Region Scaling: E-Paxos and Mencius (3-7 regions)
Vertical Scaling: M2 Paxos (vary CPU performance)

Problem: Lack of unified scalability testing methodology makes comparison difficult

Major Finding 3: Severely Insufficient Consistency Evaluation

Current Practices:

Testing Tools: Bizur uses Serialla, Multi-Paxos uses checksum verification
Jepsen Testing: ZooKeeper, CockroachDB (linearizability verification)
Elle Testing: ScalarDB (strict serializability verification)
Staleness Measurement: ZooNet, PNUTS, BG (but cannot prove strong consistency)

Core Issues:

Most systems claim "strong consistency" with vague definitions
Lack of systematic consistency verification methods
Staleness measurement insufficient to verify linearizability or serializability
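
For contrast with staleness measurement, the sketch below shows what a linearizability check on a recorded history involves, for the simplest case of a single read/write register. It is a brute-force illustration under stated assumptions (each operation carries invoke and return timestamps, and the register starts as `None`). It is not how Jepsen or any production checker is implemented, and it is only practical for very small histories.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional

@dataclass(frozen=True)
class Op:
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    invoke: float          # time the client issued the operation
    ret: float             # time the client received the response

def respects_real_time(order):
    """An operation that returned before another was invoked must come first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.ret < earlier.invoke:
                return False
    return True

def register_semantics_ok(order, initial=None):
    """Every read must return the most recently written value."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True

def is_linearizable(history):
    """Try every total order; exponential, so only for tiny histories."""
    return any(respects_real_time(p) and register_semantics_ok(p)
               for p in permutations(history))

# Read of 1 starts after write(2) completed: no valid order exists.
stale = [Op("write", 1, 0, 1), Op("write", 2, 2, 3), Op("read", 1, 4, 5)]
# Read of 1 overlaps write(2): it can be ordered before that write.
concurrent = [Op("write", 1, 0, 1), Op("write", 2, 2, 6), Op("read", 1, 3, 4)]
print(is_linearizable(stale), is_linearizable(concurrent))  # False True
```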

Major Finding 4: Availability Evaluation Concentrated on Crash Failures

According to Table 4:

Failure Types:

Node Crashes: Most common (14/30 systems)
Network Partitions: Less common (5/30 systems)
Other Failures: Clock drift, memory corruption, etc., almost untested

Failure Count:

Single node failure: Most systems
Multiple node failures: ZooKeeper (2 followers), Omni-Paxos (1-2 nodes)

Testing Methods:

Measure throughput degradation during failures
Spanner: Crash entire region but Paxos group remains available
Hybrid-Paxos: Increase replica count to test availability improvement
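
Measuring throughput degradation during a failure can be sketched as follows. The example assumes the benchmark records a completion timestamp for every operation and knows the time window in which the fault was injected. The degradation definition used here (worst post-failure window relative to the pre-failure mean) is one reasonable choice, not a definition given in the survey.

```python
from collections import Counter

def throughput_per_window(completion_times, window=1.0):
    """Operations completed per second in each fixed-size time window."""
    counts = Counter(int(t // window) for t in completion_times)
    if not counts:
        return []
    return [counts.get(i, 0) / window for i in range(max(counts) + 1)]

def degradation(series, failure_window):
    """Relative drop of the worst window at or after the failure vs. the prior mean."""
    before = series[:failure_window]
    after = series[failure_window:]
    baseline = sum(before) / len(before)
    return 1.0 - min(after) / baseline

# Hypothetical trace: 4 ops/s normally, a crash injected in window 2.
times = [0.1, 0.2, 0.3, 0.4, 1.1, 1.2, 1.3, 1.4, 2.5, 3.1, 3.2, 3.3, 3.4]
series = throughput_per_window(times)
print(series, degradation(series, failure_window=2))
```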

Distributed System Benchmarking

NoSQL Database Benchmarks:

YCSB (2010): Most popular NoSQL benchmark, but lacks distributed client and WAN scenario support
YCSB+T (2014): Adds transaction support, but still single-process
YCSB++ (2011): Supports distributed clients, but relies on ZooKeeper synchronization, unsuitable for WAN

Application-Specific Benchmarks:

BG (2013): Social network workload, but uses locks to avoid conflicts
TPC-C (1992): Transaction processing standard, but not designed for coordination services
HiBench (2010): Hadoop benchmark, unsuitable for coordination systems

Big Data Benchmarks:

BigDataBench (2014): Covers multiple big data workloads
None of the benchmarks above, however, addresses coordination service-specific requirements (consistency, fault tolerance, geographic distribution)

Consistency Testing Tools

Jepsen (2013-present):

Powerful consistency testing framework
Can detect linearizability violations
Requires deep tool understanding, not black-box testing

Elle (2020):

Based on Jepsen, more efficient isolation level detection
Builds transaction dependency graphs to identify violation cycles
Still requires customized workloads
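
The idea of finding violation cycles in a transaction dependency graph can be illustrated with a plain depth-first search. In this sketch the dependency edges are supplied by hand, whereas Elle infers them from observed reads and writes. The code is a generic cycle finder, not Elle's implementation, and the transaction names are made up.

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes (first node repeated at the end), or None.

    graph maps each transaction to the transactions that must come after it.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge to a node on the current path closes a cycle.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None

# T1 must precede T2, T2 must precede T3, yet T3 must precede T1.
print(find_cycle({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))
```

A cycle means no serial order of the transactions satisfies all observed dependencies, which is the kind of anomaly such a checker reports.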

Other Testing Tools:

Serialla: Strict serializability testing used by Bizur
UPB (2013): Availability benchmark, but based on YCSB

Distributed System Surveys

Cloud Service Evaluation:

Elasticity evaluation, computing capability, cost-benefit analysis
But not specific to coordination services

File Systems and Data Warehouses:

Distributed file system benchmarking
Data warehouse query performance evaluation
Different from coordination system requirements

Coordination Service Surveys:

Algorithm comparison (Paxos variants)
Service characteristics analysis
Unique Contribution of This Paper: First systematic analysis of evaluation practices and benchmarking requirements

Conclusions and Discussion

Main Conclusions

Fragmented Evaluation Practices: Among 30 systems, only 7 use standard benchmarks (YCSB, TPC-C, Jepsen); most use custom micro-benchmarks
Uneven Metric Coverage:
- Performance evaluation universal (93% of systems)
- Consistency evaluation insufficient (27% of systems)
- Network partition testing rare (17% of systems)
Inconsistent Parameter Usage:
- Critical parameters (access locality, data access overlap) often overlooked
- Lack of standardized parameter configurations
- Difficult to fairly compare different systems
Existing Benchmarks Inadequate:
- YCSB: Does not support distributed clients, WAN scenarios, access locality
- TPC-C: Not designed for coordination services
- Jepsen: Non-black-box, difficult to adopt
- No tool satisfies all requirements
7 Major Benchmark Requirements:
- Flexibility and complexity (support multi-dimensional parameter tuning)
- WAN system support (geographic distribution, uneven latency)
- Scalability (distributed load generation)
- Ease of adoption (black-box testing, language-agnostic)
- Performance benchmarking (throughput, latency, tail latency)
- Consistency verification (linearizability, serializability)
- Fault injection (crashes, partitions, clock drift)

Limitations

Sample Coverage: While covering 30 systems, may miss some emerging systems or domain-specific coordination services
Timeliness: Rapid evolution of distributed systems means new evaluation practices and tools constantly emerge
Depth of Analysis: Analysis of each system's evaluation practices based on public papers may not capture all implementation details
Benchmark Tool Implementation: Paper identifies requirements but does not implement a complete benchmarking tool
Consistency Models: Different systems define "strong consistency" differently, making unified evaluation standards difficult

Future Directions

Develop Standardized Benchmark Tools:
- Support distributed clients and WAN scenarios
- Provide flexible parameter configuration
- Integrate consistency verification capability
- Support multiple fault injection types
Establish Evaluation Standards:
- Define minimum required evaluation metric sets
- Standardize workload parameter configurations
- Establish consistency verification protocols
Expand Survey Scope:
- Include more emerging coordination protocols (e.g., DAG-based consensus)
- Analyze blockchain consensus algorithm evaluation practices
- Study coordination requirements in edge computing scenarios
Empirical Research:
- Re-evaluate existing systems using standard benchmarks
- Quantify parameter impact on performance
- Verify claimed consistency guarantees
Automated Testing:
- Develop automated consistency verification tools
- Integrate with continuous integration/continuous deployment (CI/CD)
- Support regression testing

In-Depth Evaluation

Strengths

1. Systematicity and Comprehensiveness

Breadth: Covers 30 systems spanning roughly 26 years of research history (Paxos 1998 - latest systems 2024)
Depth: Detailed analysis of experimental setup, topologies, metrics, parameters
Clear Classification: Three-layer classification (algorithm-service-application) + six topology types

2. High Practical Value

Guidance: Provides evaluation best practices for developers
Clear Requirements: 7 benchmark requirements are actionable
Problem-Oriented: Clearly identifies specific shortcomings of existing tools

3. Rich Data

Comprehensive Tables: Table 1 (experimental setup), Table 2 (metric usage), Table 3 (workload parameters), Table 4 (availability and failure evaluation)
Quantitative Analysis: Metric coverage rates, parameter usage frequencies
Visualization: Clear diagrams of 6 topology types

4. Objective and Neutral

Does not favor specific systems or benchmarking tools
Fair analysis of each tool's strengths and weaknesses
Fact-based assessment rather than subjective judgment

5. Academic Rigor

Cites 85 references
Clear methodology (selection criteria, analysis framework)
Conclusions well-supported by data

Weaknesses

1. Lack of Quantitative Comparison

No performance difference data for different evaluation methods
No quantification of parameter choice impact on results
Missing statistical analysis (correlation, significance testing)

2. Insufficient Implementation Verification

No actual development of benchmark tool meeting requirements
No experimental verification of requirement feasibility
Lack of prototype system evaluation

3. Shallow Consistency Evaluation Analysis

Insufficient discussion of differences between consistency models
No specific consistency verification methodology provided
Missing complexity analysis of consistency testing

4. Limited WAN Scenario Analysis

While emphasizing WAN importance, specific analysis insufficient
Insufficient discussion of different geographic distribution patterns
Missing challenges of cross-cloud, cross-continent deployment

5. Inadequate Coverage of Emerging Trends

Blockchain consensus algorithm evaluation practices not covered
Edge computing scenario coordination requirements not discussed
Machine learning system coordination evaluation not addressed

6. Insufficient Reproducibility Guidance

No detailed experimental reproduction guidelines
Lack of open-source datasets or evaluation scripts
No discussion of ensuring evaluation reproducibility

Impact

1. Academic Contribution

Fills Gap: First systematic investigation of distributed coordination system evaluation practices
Theoretical Value: Establishes evaluation framework and requirements system
Citation Potential: May become reference literature for evaluation methods in the field

2. Practical Value

Engineering Guidance: Helps developers choose appropriate evaluation methods
Benchmark Development: Provides requirements specification for new benchmark tools
Standardization Promotion: May drive establishment of evaluation standards

3. Limitations

Implementation Missing: Does not provide directly usable tools
Verification Insufficient: Requirement feasibility unverified through empirical study
Update Needed: Rapidly evolving field requires continuous updates

4. Applicable Scope

Direct Application: Researchers and engineers in distributed coordination systems
Indirect Application: Developers of distributed databases, blockchain, cloud computing systems
Educational Value: Can serve as reference material for distributed systems courses

Applicable Scenarios

1. Research Scenarios

New Protocol Development: Reference requirements checklist when designing evaluation plans
System Comparison: Select appropriate metrics and parameters for fair comparison
Paper Writing: Cite standard evaluation practices to enhance credibility

2. Engineering Scenarios

System Selection: Understand evaluation results and limitations of different systems
Performance Tuning: Identify critical parameters affecting performance
Failure Testing: Design comprehensive availability testing plans

3. Educational Scenarios

Course Teaching: Introduce distributed system evaluation methodology
Project Practice: Guide students in designing experiments and evaluation plans
Literature Review: Understand current research status in the field

4. Standardization Scenarios

Benchmark Development: Serve as requirements specification document
Industry Standards: Drive establishment of evaluation standards
Compliance Testing: Design coordination service compliance tests

Selected References

Classic Consensus Algorithms

Lamport, L. (1998). The part-time parliament. ACM TOCS - Original Paxos paper
Ongaro, D. & Ousterhout, J. (2014). In search of an understandable consensus algorithm. USENIX ATC - Raft algorithm

Coordination Services

Hunt, P. et al. (2010). ZooKeeper: Wait-free coordination for internet-scale systems. USENIX ATC
Balakrishnan, M. et al. (2013). Tango: Distributed data structures over a shared log. SOSP

Benchmarking Tools

Cooper, B.F. et al. (2010). Benchmarking cloud serving systems with YCSB. SoCC
Kingsbury, K. (2024). Jepsen tests - Consistency testing framework
Kingsbury, K. & Alvaro, P. (2020). Elle: Inferring Isolation Anomalies from Experimental Observations

WAN Optimization

Ailijiang, A. et al. (2017). Multileader WAN Paxos: Ruling the archipelago with fast consensus - WPaxos
Mao, Y. et al. (2008). Mencius: Building efficient replicated state machines for WANs. OSDI

Distributed Applications

Corbett, J.C. et al. (2013). Spanner: Google's globally distributed database. ACM TOCS

Summary: This paper is an important survey work in the field of distributed coordination system evaluation, systematically revealing the fragmentation problem in current evaluation practices and proposing requirements for standardized benchmarking. While lacking actual tool implementation, it provides clear direction for future research and engineering practice. For distributed systems researchers and engineers, this is essential reading for understanding evaluation methodology in the field.