
Diagonal Scaling: A Multi-Dimensional Resource Model and Optimization Framework for Distributed Databases

Abdullah, Zaman
Modern cloud databases present scaling as a binary decision: scale-out by adding nodes or scale-up by increasing per-node resources. This one-dimensional view is limiting because database performance, cost, and coordination overhead emerge from the joint interaction of horizontal elasticity and per-node CPU, memory, network bandwidth, and storage IOPS. As a result, systems often overreact to load spikes, underreact to memory pressure, or oscillate between suboptimal states. We introduce the Scaling Plane, a two-dimensional model in which each distributed database configuration is represented as a point (H, V), with H denoting node count and V a vector of resources. Over this plane, we define smooth approximations of latency, throughput, coordination overhead, and monetary cost, providing a unified view of performance trade-offs. We show analytically and empirically that optimal scaling trajectories frequently lie along diagonal paths: sequences of joint horizontal and vertical adjustments that simultaneously exploit cluster parallelism and per-node improvements. To compute such actions, we propose DIAGONALSCALE, a discrete local-search algorithm that evaluates horizontal, vertical, and diagonal moves in the Scaling Plane and selects the configuration minimizing a multi-objective function subject to SLA constraints. Using synthetic surfaces, microbenchmarks, and experiments on distributed SQL and KV systems, we demonstrate that diagonal scaling reduces p95 latency by up to 40 percent, lowers cost-per-query by up to 37 percent, and reduces rebalancing by 2 to 5 times compared to horizontal-only and vertical-only autoscaling. Our results highlight the need for multi-dimensional scaling models and provide a foundation for next-generation autoscaling in cloud database systems.

Basic Information

  • Paper ID: 2511.21612
  • Title: Diagonal Scaling: A Multi-Dimensional Resource Model and Optimization Framework for Distributed Databases
  • Authors: Shahir Abdullah, Syed Rohit Zaman
  • Category: cs.DC (Distributed, Parallel, and Cluster Computing)
  • Publication Date: November 26, 2025 (arXiv v1)
  • Paper Link: https://arxiv.org/abs/2511.21612

Abstract

Modern cloud databases view scaling as a binary decision: horizontal scaling (scale-out) by adding nodes or vertical scaling (scale-up) by increasing single-node resources. This one-dimensional perspective has limitations because database performance, cost, and coordination overhead stem from joint interactions between horizontal elasticity and single-node CPU, memory, network bandwidth, and storage IOPS. Consequently, systems often overreact to load peaks, underreact to memory pressure, or oscillate between suboptimal states.

This paper introduces the Scaling Plane, a two-dimensional model where each distributed database configuration is represented as a point (H, V), with H denoting the number of nodes and V as a resource vector. On this plane, the authors define smooth approximations for latency, throughput, coordination overhead, and monetary cost, providing a unified view of performance trade-offs. The research demonstrates that optimal scaling trajectories typically follow diagonal paths: coordinated horizontal-vertical adjustment sequences that simultaneously leverage cluster parallelism and single-node improvements. To this end, the authors propose the DIAGONALSCALE algorithm, a discrete local search algorithm that evaluates horizontal, vertical, and diagonal movements in the scaling plane and selects configurations minimizing a multi-objective function under SLA constraints.

Experiments show that diagonal scaling reduces p95 latency by up to 40% compared to pure horizontal or pure vertical autoscaling, reduces cost-per-query by up to 37%, and decreases rebalancing by 2-5×.

Research Background and Motivation

1. Core Problem to Address

The scaling decision dilemma faced by modern distributed databases:

  • Limitations of binary choice: Traditional approaches treat horizontal scaling (adding nodes) and vertical scaling (adding resources) as independent decisions
  • System behavior issues: Improper reactions to load changes, leading to over-provisioning, SLA violations, or frequent destructive rebalancing
  • Lack of unified view: No comprehensive model to understand multi-dimensional interactions between performance, cost, and coordination overhead

2. Problem Significance

  • Economic impact: Cloud databases are critical infrastructure (finance, e-commerce, logistics, social networks); improper scaling decisions cause massive cost waste
  • Performance criticality: Scaling decisions directly impact latency, throughput, and availability
  • Operational complexity: Incorrect scaling strategies lead to frequent data rebalancing, leadership changes, and system instability

3. Limitations of Existing Approaches

Problems with scale-out (horizontal scaling):

  • Increases consensus overhead (Paxos/Raft message counts)
  • Expands replica group size
  • Increases replication fanout
  • Triggers more leadership changes
  • Causes expensive data rebalancing

Problems with scale-up (vertical scaling):

  • Memory upgrades cannot resolve cross-partition data skew
  • CPU upgrades cannot resolve metadata bottlenecks
  • Eventually hits hardware limits
  • Shows diminishing returns

Shortcomings of existing autoscaling:

  • Kubernetes HPA and VPA handle the two dimensions separately
  • Reactive policies based on simple thresholds (e.g., CPU > 70%)
  • Ignore non-linear interactions between dimensions
  • Cannot compute diagonal trajectories

4. Research Motivation

The authors observe that many workloads benefit from coordinated rather than independent horizontal and vertical resource adjustments. This motivates them to construct a unified multi-dimensional scaling model and develop algorithms capable of optimization in this space.

Core Contributions

  1. Scaling Plane Model: Proposes a novel two-dimensional abstraction for elastic database configurations, representing configurations as (H, V) points, where H is the number of nodes and V is a resource vector
  2. Analytical Performance Surfaces: Derives closed-form models for latency, throughput, cost, and coordination overhead, revealing the geometric structure of these metrics on the H-V plane
  3. DIAGONALSCALE Algorithm: Designs a discrete optimization algorithm that evaluates local neighborhoods in the H-V plane, supporting horizontal, vertical, and diagonal movements
  4. Theoretical Guarantees: Provides proof sketches for monotonic improvement, convergence to local optimality, and stability
  5. Comprehensive Evaluation: Demonstrates diagonal scaling advantages through synthetic surfaces, microbenchmarks, and distributed SQL/KV system experiments:
    • p95 latency reduction up to 40%
    • Cost-per-query reduction up to 37%
    • Rebalancing reduction 2-5×

Methodology Details

Task Definition

Input:

  • Current configuration: (H, V), where H is the number of nodes, V = (c, r, b, s) represents single-node CPU, RAM, bandwidth, and storage IOPS
  • Workload characteristics: request rate λ, read-write ratio, access distribution
  • SLA constraints: maximum latency Lmax, minimum throughput Tmin

Output:

  • Next optimal configuration: (Hnext, Vnext)

Objectives:

  • Minimize multi-objective function F(H,V) = αL(H,V) + βC(H,V) + γK(H,V)
  • Satisfy SLA constraints: L(H,V) ≤ Lmax and T(H,V) ≥ Tmin

Model Architecture

1. Resource Space Definition

The configuration space is defined as:

S = {(H,V) : H ≥ 1, c, r, b, s > 0}

where H is a discrete integer (number of nodes) and V is selected from a finite set of instance types.

2. Performance Surface Modeling

(a) Node-Intrinsic Latency

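Uses a weighted sum of reciprocals (a harmonic-style form):

Lnode(V) = α/c + β/r + γ/b + δ/s

Here α, β, γ, δ are per-resource latency coefficients. The same symbols are reused later for the objective weights in F and for the penalty weight in DIAGONALSCALE, where they play different roles.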

This captures:

  • CPU's strong influence on compute-intensive operations
  • RAM's impact on working set and cache behavior
  • Network bandwidth's role in replication and RPC
  • Storage IOPS' effect on LSM tree compaction and log flushing
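
The node-intrinsic latency term can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Resources` class, the units, and the coefficient values are all assumptions chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Per-node resource vector V = (c, r, b, s). Units are assumed, not from the paper."""
    c: float  # CPU, e.g. vCPUs
    r: float  # RAM, e.g. GiB
    b: float  # network bandwidth, e.g. Gbit/s
    s: float  # storage IOPS, e.g. thousands


def node_latency(v, alpha=8.0, beta=16.0, gamma=2.0, delta=4.0):
    """L_node(V) = alpha/c + beta/r + gamma/b + delta/s with hypothetical coefficients."""
    return alpha / v.c + beta / v.r + gamma / v.b + delta / v.s


small = Resources(c=2, r=8, b=1, s=3)
large = Resources(c=8, r=32, b=4, s=12)  # every resource scaled by 4
print(node_latency(small), node_latency(large))
```

Because every term is a reciprocal, scaling all four resources by the same factor divides the modeled node latency by that factor, which is where the diminishing absolute returns of pure scale-up come from.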

(b) Coordination Latency

Coordination cost grows with cluster size due to consensus protocols, global timestamps, and metadata synchronization:

Lcoord(H) = η log H + μH^θ

where 0 < θ < 1 creates a superlogarithmic but sublinear growth curve.

(c) Total Latency

L(H,V) = Lnode(V) + Lcoord(H)

Key properties:

  • ∂L/∂H > 0 (latency increases with more nodes)
  • ∂L/∂||V|| < 0 (latency decreases with more resources)
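
A small sketch of the total latency surface makes the two sign properties easy to check numerically. The coefficient values (η = 1, μ = 0.5, θ = 0.5 and unit node weights) are assumptions for illustration only.

```python
import math


def node_latency(c, r, b, s, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """L_node(V): weighted sum of reciprocals of the per-node resources."""
    return alpha / c + beta / r + gamma / b + delta / s


def coord_latency(h, eta=1.0, mu=0.5, theta=0.5):
    """L_coord(H) = eta*log(H) + mu*H**theta, with 0 < theta < 1 (assumed values)."""
    return eta * math.log(h) + mu * h ** theta


def total_latency(h, c, r, b, s):
    """L(H, V) = L_node(V) + L_coord(H)."""
    return node_latency(c, r, b, s) + coord_latency(h)


# Latency rises along H at fixed V and falls along V at fixed H.
for h in (1, 2, 4, 8, 12):
    print(h, round(total_latency(h, 4, 4, 4, 4), 3))
```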

(d) Throughput Surface

Single-node throughput:

Tnode(V) = κ · min(c, r, b, s)

Cluster throughput accounting for diminishing returns:

T(H,V) = H · Tnode(V) · φ(H)

where:

φ(H) = 1 / (1 + ω log H)

reflects increased coordination overhead and replication costs.
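
The throughput surface can be sketched as follows. The values of κ and ω are hypothetical, and the resources are assumed to be expressed in normalized, comparable units so that taking their minimum is meaningful.

```python
import math


def node_throughput(c, r, b, s, kappa=1000.0):
    """T_node(V) = kappa * min(c, r, b, s): the scarcest resource is the bottleneck."""
    return kappa * min(c, r, b, s)


def scaling_efficiency(h, omega=0.2):
    """phi(H) = 1 / (1 + omega * log(H)); equals 1 for a single node."""
    return 1.0 / (1.0 + omega * math.log(h))


def cluster_throughput(h, c, r, b, s, kappa=1000.0, omega=0.2):
    """T(H, V) = H * T_node(V) * phi(H)."""
    return h * node_throughput(c, r, b, s, kappa) * scaling_efficiency(h, omega)


print(cluster_throughput(4, 2, 2, 2, 2), cluster_throughput(8, 2, 2, 2, 2))
```

Two consequences of this form are worth noting: upgrading a non-bottleneck resource leaves Tnode unchanged, and doubling H less than doubles T because φ(H) shrinks.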

(e) Coordination Overhead Surface

For write-intensive workloads with write arrival rate λw:

K(H,V) = ρ · Lcoord(H) · λw / T(H,V)

Intuition:

  • Coordination overhead increases with write load
  • Decreases as throughput increases
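  • Rises with cluster size through the Lcoord(H) factor, although T(H,V) in the denominator also grows with H, so the net trend depends on the parameters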
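
A direct transcription of the overhead formula is sketched below. All parameter values are assumptions; `t_node` stands in for Tnode(V) so the block stays self-contained.

```python
import math


def coordination_overhead(h, t_node, write_rate,
                          rho=1.0, eta=1.0, mu=0.5, theta=0.5, omega=0.2):
    """K(H, V) = rho * L_coord(H) * lambda_w / T(H, V), with hypothetical parameters."""
    l_coord = eta * math.log(h) + mu * h ** theta
    throughput = h * t_node / (1.0 + omega * math.log(h))
    return rho * l_coord * write_rate / throughput


base = coordination_overhead(h=4, t_node=4000.0, write_rate=800.0)
print(base)
print(coordination_overhead(h=4, t_node=4000.0, write_rate=1600.0))  # more writes
print(coordination_overhead(h=4, t_node=8000.0, write_rate=800.0))   # faster nodes
```

K is linear in the write rate and inversely proportional to per-node throughput. Its dependence on H is less clear-cut, because H enters both the numerator (through Lcoord) and the denominator (through T).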

(f) Monetary Cost Surface

C(H,V) = H · Cnode(V)

where Cnode(V) is the cloud cost of an instance with resources V.

3. Multi-Objective Optimization

Define the objective function:

F(H,V) = αL(H,V) + βC(H,V) + γK(H,V)

Constraints:

L(H,V) ≤ Lmax
T(H,V) ≥ Tmin

This creates a two-dimensional non-convex optimization problem.
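
Putting the four surfaces together, the objective and the SLA check might look like the sketch below. The dictionary layout for V, the `price` field standing in for Cnode(V), and every numeric parameter are assumptions made for illustration.

```python
import math


def latency(h, v):
    """L(H, V) with unit node coefficients and eta=1, mu=0.5, theta=0.5 (assumed)."""
    l_node = 1.0 / v["c"] + 1.0 / v["r"] + 1.0 / v["b"] + 1.0 / v["s"]
    return l_node + math.log(h) + 0.5 * math.sqrt(h)


def throughput(h, v, kappa=1000.0, omega=0.2):
    """T(H, V) = H * kappa * min(c, r, b, s) / (1 + omega * log(H))."""
    t_node = kappa * min(v["c"], v["r"], v["b"], v["s"])
    return h * t_node / (1.0 + omega * math.log(h))


def overhead(h, v, write_rate, rho=1.0):
    """K(H, V) = rho * L_coord(H) * lambda_w / T(H, V)."""
    l_coord = math.log(h) + 0.5 * math.sqrt(h)
    return rho * l_coord * write_rate / throughput(h, v)


def cost(h, v):
    """C(H, V) = H * C_node(V); 'price' is a hypothetical hourly price."""
    return h * v["price"]


def objective(h, v, write_rate, alpha=1.0, beta=1.0, gamma=1.0):
    """F(H, V) = alpha*L + beta*C + gamma*K."""
    return alpha * latency(h, v) + beta * cost(h, v) + gamma * overhead(h, v, write_rate)


def feasible(h, v, l_max, t_min):
    """SLA constraints: L <= L_max and T >= T_min."""
    return latency(h, v) <= l_max and throughput(h, v) >= t_min


NODE = {"c": 4.0, "r": 4.0, "b": 4.0, "s": 4.0, "price": 0.4}
print(objective(1, NODE, write_rate=800.0), feasible(1, NODE, l_max=2.0, t_min=3000.0))
```

In practice the three terms have different units, so the weights α, β, γ also act as unit conversions; the unit weights here are purely a convenience.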

4. Surface Geometry Insights

Key finding: The minimum of F rarely occurs on axis-aligned edges (pure scale-up or pure scale-out). Instead, the minimum lies in the interior, along a diagonal trajectory.

This is because:

  • L decreases along V but increases along H
  • T increases with both H and V but saturates
  • C grows linearly with H, superlinearly with V
  • K grows with H but decreases with V

Technical Innovations

1. Diagonal Scaling Theory

Trajectory definition:

τ(t) = (H(t), V(t))

where both H and V increase with t. Let the slope be m = d||V||/dH, so that the trajectory direction in (H, ||V||) coordinates is (1, m).

Gradient alignment condition:

The gradient of the objective function:

∇F = (∂F/∂H, ∂F/∂||V||)

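A point (H*, V*) that is locally optimal along the trajectory direction (1, m*) satisfies the first-order condition:

∇F(H*, V*) · (1, m*) = 0

That is, the directional derivative of F along the trajectory vanishes at (H*, V*). Before that point is reached, the steepest-descent direction is -∇F, so the best diagonal direction is the one most closely aligned with -∇F.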
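
The chain-rule step behind this condition can be written out explicitly. The derivation below is a sketch that treats H and ||V|| as continuous and reads the direction (1, m) as one unit of H per m units of ||V||; both readings are assumptions, not statements from the paper.

```latex
\begin{aligned}
\frac{dF}{dt}
  &= \frac{\partial F}{\partial H}\,\frac{dH}{dt}
   + \frac{\partial F}{\partial \lVert V\rVert}\,\frac{d\lVert V\rVert}{dt}
   \;\propto\; \frac{\partial F}{\partial H} + m\,\frac{\partial F}{\partial \lVert V\rVert}, \\[4pt]
\frac{dF}{dt} = 0
  &\;\Longrightarrow\;
  m^{*} = -\,\frac{\partial F/\partial H}{\partial F/\partial \lVert V\rVert}
  \qquad \text{(requires } \partial F/\partial \lVert V\rVert \neq 0\text{)}.
\end{aligned}
```

When the two partial derivatives have opposite signs, m* is positive, so the stationary direction increases H and ||V|| together.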

Lemma 1 (Axis-aligned scaling rarely optimal):

If ∂F/∂H ≠ 0 and ∂F/∂||V|| ≠ 0, then the optimal direction is neither horizontal nor vertical.

Proof sketch: If optimal scaling is horizontal, the direction vector is (1, 0). But:

∇F · (1, 0) = ∂F/∂H ≠ 0

Contradiction. Vertical scaling follows similarly. Therefore diagonal scaling is necessary. □

Proposition (Existence of interior minimum):

If L decreases in V and increases in H, C increases in both, and K increases in H but decreases in V, then F has at least one interior stationary point (H*, V*).

2. DIAGONALSCALE Algorithm

Design principles:

  1. Local search: Explore neighbors around (H, V)
  2. SLA-aware: Consider only feasible configurations
  3. Direction diversity: Check horizontal, vertical, and diagonal movements
  4. Stability: Penalize disruptive movements based on expected rebalancing
  5. Monotonicity: Accept movements only if F improvement exceeds margin ε

Neighborhood definition:

N(H,V) = {(H±ΔH, V), (H, V±1), (H±ΔH, V±1)}

ΔH typically 1-2 nodes; vertical movements correspond to adjacent instance types.
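
Enumerating this neighborhood is straightforward once instance types are indexed along a ladder. In the sketch below, the bounds `h_max` and `n_types` and the convention that V is an integer index are assumptions; moves that would leave the configuration space are simply dropped.

```python
def neighborhood(h, v_idx, delta_h=1, n_types=4, h_max=12):
    """Return the horizontal, vertical, and diagonal neighbors of (h, v_idx).

    v_idx is the position of the current instance type on an ordered ladder,
    so v_idx +/- 1 means the adjacent smaller or larger type.
    """
    moves = [(dh, dv)
             for dh in (-delta_h, 0, delta_h)
             for dv in (-1, 0, 1)
             if (dh, dv) != (0, 0)]
    return [(h + dh, v_idx + dv)
            for dh, dv in moves
            if 1 <= h + dh <= h_max and 0 <= v_idx + dv < n_types]


print(len(neighborhood(4, 1)))  # interior point: all 8 neighbors exist
print(neighborhood(1, 0))       # corner point: only 3 neighbors remain
```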

Algorithm Flow (Algorithm 1):

Input: Current configuration (H,V), SLA (Lmax, Tmin)
Output: Next configuration (Hnext, Vnext)

1. Compute neighborhood N(H,V)
2. For each (H', V') in N:
   a. Estimate L(H', V'), T(H', V'), K(H', V'), C(H', V')
   b. If SLA violated, mark as infeasible and continue
   c. Compute objective F(H', V')
   d. Compute rebalancing penalty Prebalance(H,V; H', V')
   e. Set F'(H', V') = F(H', V') + δPrebalance
3. Select feasible neighbor (H*, V*) minimizing F'
4. If F'(H*, V*) < F(H,V) - ε:
   Return (H*, V*)
   Else:
   Return (H,V)
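
The steps above can be sketched as a single decision function. This is an illustrative reading of Algorithm 1, not the authors' implementation: the instance ladder `LADDER`, the bound `H_MAX`, all surface parameters, and the simplified penalty (shard movement omitted) are assumptions.

```python
import math

# Hypothetical instance ladder: (normalized resource level, hourly price) per type.
LADDER = [(2.0, 0.10), (4.0, 0.20), (8.0, 0.45), (16.0, 1.00)]
H_MAX = 12


def metrics(h, v, write_rate):
    """Return (L, T, K, C) from closed-form surfaces with illustrative parameters."""
    x, price = LADDER[v]
    l_coord = math.log(h) + 0.5 * math.sqrt(h)
    lat = 4.0 / x + l_coord  # all four resources assumed equal to x, unit weights
    thr = h * 1000.0 * x / (1.0 + 0.2 * math.log(h))
    return lat, thr, l_coord * write_rate / thr, h * price


def objective(h, v, write_rate, weights=(1.0, 1.0, 1.0)):
    """F = alpha*L + beta*C + gamma*K."""
    lat, _, k, c = metrics(h, v, write_rate)
    return weights[0] * lat + weights[1] * c + weights[2] * k


def diagonal_scale_step(h, v, write_rate, l_max, t_min,
                        eps=0.01, delta=0.1, delta_h=1):
    """One scaling decision: best feasible neighbor, or stay put."""
    best, best_score = None, math.inf
    for dh in (-delta_h, 0, delta_h):
        for dv in (-1, 0, 1):
            h2, v2 = h + dh, v + dv
            if (dh, dv) == (0, 0) or not 1 <= h2 <= H_MAX or not 0 <= v2 < len(LADDER):
                continue
            lat, thr, _, _ = metrics(h2, v2, write_rate)
            if lat > l_max or thr < t_min:
                continue  # SLA violated: neighbor is infeasible
            penalty = abs(dh) + abs(dv)  # simplified rebalancing penalty
            score = objective(h2, v2, write_rate) + delta * penalty
            if score < best_score:
                best, best_score = (h2, v2), score
    # Accept the move only if it beats the current objective by the margin eps.
    if best is not None and best_score < objective(h, v, write_rate) - eps:
        return best
    return (h, v)


print(diagonal_scale_step(4, 1, write_rate=2000.0, l_max=10.0, t_min=10000.0))
```

With these made-up numbers, the step from four nodes of the second-smallest type is a diagonal one: it removes a node and moves up one instance type, because the pure horizontal and pure vertical neighbors are either infeasible or more expensive.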

Rebalancing penalty:

Prebalance = λ1|H' - H| + λ2||V' - V||1 + λ3·ShardMovement(H,V → H', V')

Shard movement estimation can be obtained using partition metadata.
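
One way to make the penalty concrete is sketched below. The shard-movement estimate is an assumption (evenly spread shards, with a fraction |H' - H| / max(H, H') relocated on a resize), as are the weight values; the paper only states that the estimate can come from partition metadata.

```python
def shard_movement(h_old, h_new, n_shards):
    """Hypothetical estimate of shards relocated when resizing an evenly balanced cluster."""
    return n_shards * abs(h_new - h_old) / max(h_old, h_new)


def rebalance_penalty(h_old, v_old, h_new, v_new, n_shards,
                      lam1=1.0, lam2=0.1, lam3=0.01):
    """P = lam1*|H' - H| + lam2*||V' - V||_1 + lam3*ShardMovement (weights assumed)."""
    l1 = sum(abs(a - b) for a, b in zip(v_new, v_old))
    return (lam1 * abs(h_new - h_old)
            + lam2 * l1
            + lam3 * shard_movement(h_old, h_new, n_shards))


v = (4, 16, 2, 8)       # (c, r, b, s) before
v_up = (8, 32, 4, 16)   # one instance size up
print(rebalance_penalty(4, v, 5, v, n_shards=256))     # add one node
print(rebalance_penalty(4, v, 4, v_up, n_shards=256))  # vertical move only
```

Because the L1 term mixes resources with different units, a real controller would likely normalize V first; that step is left out here.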

Complexity analysis:

Neighborhood size |N| = 8. Each evaluation computes closed-form expressions in O(1) time.

Therefore, time complexity per scaling decision: O(|N|) = O(1)

Convergence theorem:

If objective evaluation is exact and the space is finite (finite H and finite instance types), DIAGONALSCALE converges to a local minimum.

Proof sketch: Monotonic descent + discrete finite state space → guaranteed termination.
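
The termination argument can be seen in a generic descent loop. The sketch below is not tied to the paper's surfaces: `grid_neighbors` and `toy_score` are stand-ins on a 10 × 10 integer grid, chosen only to show that strict improvement over a finite set must stop.

```python
def descend(start, neighbors, score, eps=1e-9, max_steps=10_000):
    """Repeatedly move to the best neighbor while it improves the score by more than eps."""
    path = [start]
    current = start
    for _ in range(max_steps):
        best = min(neighbors(current), key=score, default=current)
        if score(best) < score(current) - eps:
            current = best
            path.append(current)
        else:
            break  # no neighbor improves enough: local minimum reached
    return path


def grid_neighbors(p, size=10):
    """8-neighborhood on a size x size grid (toy configuration space)."""
    x, y = p
    return [(x + dx, y + dy)
            for dx in (-1, 0, 1) for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0) and 0 <= x + dx < size and 0 <= y + dy < size]


def toy_score(p):
    """Toy objective with a single interior minimum at (3, 6)."""
    x, y = p
    return (x - 3) ** 2 + (y - 6) ** 2


path = descend((9, 0), grid_neighbors, toy_score)
print(path)
```

Each accepted move lowers the score by more than ε, so no configuration can be visited twice, and a finite grid therefore bounds the number of moves.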

Stability proposition:

If δ is sufficiently large, DIAGONALSCALE avoids configuration oscillation under fluctuating workloads.

Experimental Setup

Datasets and Systems

Test systems:

  1. CockroachDB (distributed SQL): Uses Raft consensus, range-based partitioning, and dynamic rebalancing
  2. Redis Cluster (distributed KV): Uses hash slot sharding and asynchronous replication
  3. Synthetic model: Parameterized analytical scaling plane surfaces

Configuration Space

Horizontal scale:

H ∈ {1, 2, 4, 8, 12}

Vertical instance types:

V ∈ {Small, Medium, Large, XLarge}

Each type maps to (c, r, b, s) of cloud instance families.

The listed values give 5 × 4 = 20 configurations (reported as 20+), forming a discrete subset of the scaling plane.
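
The grid implied by the two lists above can be enumerated directly. The names `H_VALUES`, `INSTANCE_TYPES`, and `vertical_neighbors` are illustrative, and the mapping from type names to (c, r, b, s) values is deliberately left out because the summary does not give it.

```python
from itertools import product

H_VALUES = [1, 2, 4, 8, 12]
INSTANCE_TYPES = ["Small", "Medium", "Large", "XLarge"]  # ordered ladder

# Every (H, V) pair in the discrete configuration space.
CONFIGS = list(product(H_VALUES, INSTANCE_TYPES))


def vertical_neighbors(instance_type):
    """Adjacent instance types on the ladder, i.e. the V +/- 1 moves."""
    i = INSTANCE_TYPES.index(instance_type)
    return [INSTANCE_TYPES[j] for j in (i - 1, i + 1) if 0 <= j < len(INSTANCE_TYPES)]


print(len(CONFIGS))
print(vertical_neighbors("Large"))
```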

Workloads

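  1. Read-intensive: 90% GET, 10% PUT (labeled YCSB Workload B; the stock Workload B mix is 95% reads, 5% updates)
  2. Write-intensive: 30% GET, 70% PUT (labeled YCSB Workload A; the stock Workload A mix is 50% reads, 50% updates)
  3. Mixed: Balanced GET/PUT ratio (labeled Workload D; the stock YCSB Workload D is a read-latest workload with 95% reads, 5% inserts)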
  4. Skewed: Zipfian distribution with skew parameter θ = 0.8
  5. Dynamic: Time-varying request rates with sinusoidal, step, and burst patterns
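
The three dynamic patterns can be sketched as request-rate functions of time. The function names, rates, and timings below are hypothetical; the summary does not state the values used in the experiments.

```python
import math


def sinusoidal(t, base=1000.0, amplitude=500.0, period=600.0):
    """Smooth periodic load: base +/- amplitude with the given period in seconds."""
    return base + amplitude * math.sin(2 * math.pi * t / period)


def step(t, low=500.0, high=2000.0, t_step=300.0):
    """Load jumps from low to high at t_step and stays there."""
    return high if t >= t_step else low


def burst(t, base=500.0, peak=5000.0, start=300.0, duration=30.0):
    """Short spike to peak during [start, start + duration), base otherwise."""
    return peak if start <= t < start + duration else base


for t in (0, 150, 300, 315, 450):
    print(t, round(sinusoidal(t)), step(t), burst(t))
```

A load generator would sample one of these functions each second and use the value as the target request rate.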

Evaluation Metrics

  • Latency: p50, p95, p99 latency
  • Throughput: ops/s
  • Cost: cost per unit time and cost per operation
  • Stability: number of autoscaling operations, rebalancing and leadership changes
  • SLA violation rate

Comparison Methods

  1. Horizontal-only (H-only): Add/remove nodes based on CPU/memory only
  2. Vertical-only (V-only): Change instance types based on resource saturation only
  3. DiagonalScale (this work): Local search in H-V space with stability penalty

Implementation Details

  • Platform: Kubernetes cluster with HPA+VPA disabled
  • Controller: Custom autoscaling controller implementing DIAGONALSCALE
  • Monitoring: Prometheus + Grafana
  • Load generation: Locust/YCSB
  • Repetitions: All experiments repeated 5 times; error bars reflect standard deviation

Experimental Results

Main Results

1. Surface Structure Verification (Figures 2-3)

Synthetic latency surface L(H,V) (Figure 2) shows:

  • Horizontal lines at fixed V encounter increasing Lcoord
  • Vertical lines at fixed H face diminishing returns
  • Diagonal path reaches interior valley minimizing F

Cost-per-query heatmap (Figure 3) reveals:

  • Interior minimum reachable via diagonal scaling
  • Pure axis-aligned strategies miss optimal region

2. Autoscaling Trajectory Comparison (Figure 4)

Observations:

  • H-only: Oscillates, frequent node cycling and expensive rebalancing
  • V-only: Underreacts to load peaks, violates SLA constraints
  • DiagonalScale: Stabilizes quickly, uses fewer disruptive operations

3. Latency Under Dynamic Load (Figure 5)

Results:

  • H-only: Latency spikes during rebalancing
  • V-only: CPU and memory saturation
  • DiagonalScale: Avoids both issues, maintains lower and more stable tail latency

Specific numbers:

  • p95 latency reduction up to 40%
  • Significantly reduced latency variance

4. Cost-Benefit (Figure 6)

DiagonalScale reduces costs through:

  • Avoiding unnecessary node additions
  • Making small vertical adjustments
  • Minimizing expensive rebalancing

Cost-per-query reduction: up to 37%

5. Stability Metrics (Figure 7)

Rebalancing events and scaling operations:

  • DiagonalScale reduces disruptive changes by 2-5×
  • Fewer leadership changes
  • Smoother resource adjustments

6. SLA Violations

DiagonalScale reduces SLA violations through:

  • Smooth resource adjustments
  • Preventing CPU saturation
  • Avoiding network hotspots

7. Algorithm Efficiency

Each autoscaling decision takes < 5ms (due to closed-form evaluation).

Suitable for real-time control loops (1-5 second iterations).

Ablation Studies

While the paper does not explicitly list traditional ablation studies, implicit ablation is performed through comparison of three strategies (H-only, V-only, Diagonal):

  1. Without diagonal movement (H-only + V-only): Significant performance degradation
  2. Without stability penalty: Leads to more frequent oscillation (controlled by δ parameter)
  3. Different neighborhood sizes: 8-neighbor configuration balances exploration and computational cost

Case Study

Scenario: Burst traffic pattern

  • H-only response: Immediately add 4 nodes → trigger large-scale rebalancing → latency spike → over-provisioning after traffic drops
  • V-only response: Upgrade to XLarge instance → CPU improves but network still saturated → partial SLA violations
  • DiagonalScale response: Add 1 node + upgrade to Large → balanced improvement → no rebalancing spike → more cost-effective

Experimental Findings

  1. Diagonal paths are usually optimal: In more than 80% of workload configurations, the optimal solution lies in the plane interior
  2. Small vertical adjustments have large impact: Even single instance type upgrade significantly reduces required horizontal scaling
  3. Stability-performance trade-off: Appropriate δ value (rebalancing penalty) crucial for avoiding oscillation
  4. Workload-specific: Write-intensive workloads benefit more from diagonal scaling (due to coordination overhead)

Related Work

1. Horizontal Scaling in Distributed Databases

Representative systems:

  • Google Spanner: Paxos + TrueTime coordination
  • Bigtable: Range-based partitioning
  • Cassandra: Eventually consistent replication
  • CockroachDB: Raft consensus
  • DynamoDB: Hash partitioning

Limitations: Horizontal scaling increases coordination costs, sometimes superlinearly, causing p99 latency degradation.

2. Vertical Scaling

Representative systems:

  • Aurora Serverless v2: Supports fine-grained instance capacity adjustments
  • Kubernetes VPA: Adjusts pod sizes

Limitations:

  • Memory upgrades cannot resolve cross-partition skew
  • CPU upgrades cannot resolve metadata bottlenecks
  • Eventually hits hardware limits

3. Autoscaling in Cloud Systems

Existing approaches:

  • Kubernetes HPA: Adjusts replica count based on CPU or QPS
  • Cluster Autoscaler: Modifies cluster node count
  • Rule-based: Threshold-based policies like CPU > 70%

Shortcomings:

  • Do not model performance response surfaces across H and V
  • Ignore non-linear interactions between dimensions
  • Cannot compute diagonal trajectories

4. Unique Contributions of This Work

First to:

  • Construct multi-dimensional scaling plane
  • Derive cost/latency surfaces on (H,V)
  • Optimize diagonal scaling trajectories

Conclusions and Discussion

Main Conclusions

  1. Diagonal scaling is necessary: Optimal configurations rarely lie on pure horizontal or vertical axes
  2. Unified model is effective: Scaling plane provides geometric intuition for performance trade-offs
  3. Significant practical performance gains: p95 latency reduced by up to 40%, cost-per-query by up to 37%, rebalancing by 2-5×
  4. Theory aligns with practice: Analytical surfaces predict actual system behavior

Limitations

  1. Surface approximations: Real systems have more second-order effects (LSM tree compaction, garbage collection)
  2. Model calibration: Requires sampling to fit parameters α, β, γ, δ, etc.
  3. Local optimality: Algorithm finds local rather than global optimum
  4. Discrete space: Discreteness of instance types limits fine-grained adjustments
  5. Single-cluster assumption: Does not consider multi-region or federated deployments

Future Directions

  1. Machine learning enhancement: Use ML to learn surface approximations in real-time
  2. Three-dimensional scaling: Extend to decoupled compute, memory, storage architectures
  3. Serverless applications: Apply diagonal scaling to serverless databases
  4. Complex multi-objective optimization: Explore more sophisticated Pareto frontier exploration
  5. Predictive scaling: Combine with workload prediction for proactive adjustments

In-Depth Evaluation

Strengths

1. Methodological Innovation (★★★★★)

  • Paradigm shift: Transition from one-dimensional to two-dimensional scaling decisions is fundamentally innovative
  • Solid theoretical foundation: Provides gradient alignment conditions and convergence proof sketches
  • Strong practical applicability: O(1) complexity suitable for real-time control

2. Experimental Sufficiency (★★★★☆)

  • Multi-system verification: CockroachDB (strong consistency) + Redis Cluster (eventual consistency)
  • Diverse workloads: Covers read/write/mixed/skewed/dynamic scenarios
  • Synthetic + practical: Both theoretical validation and practical evidence
  • Reproducibility: Detailed implementation details and parameter settings

3. Result Convincingness (★★★★★)

  • Significant improvements: Latency reduction of up to 40% and cost-per-query reduction of up to 37% are substantial
  • Stability enhancement: 2-5× rebalancing reduction critical for production systems
  • Statistical rigor: Experiments repeated 5 times, with error bars showing standard deviation

4. Writing Clarity (★★★★☆)

  • Well-structured: Logic flows clearly from motivation → model → algorithm → evaluation
  • Effective visualization: Figures 2-7 intuitively present core concepts
  • Mathematical rigor: Formulas expressed precisely

Weaknesses

1. Model Simplification

  • Linear combination assumption: F = αL + βC + γK may be overly simplistic
  • Parameter sensitivity: Selection of weights α, β, γ lacks systematic methodology
  • Ignored second-order effects: Network congestion, disk contention

2. Experimental Limitations

  • Limited scale: Maximum 12 nodes; untested on large clusters (100+ nodes)
  • Homogeneous workloads: Primarily YCSB; lacks real application traces
  • Single cloud environment: Not tested across different cloud providers' pricing models

3. Theoretical Gaps

  • Global optimality: Only guarantees local optimum, no global guarantees
  • Convergence rate: Convergence speed not analyzed
  • Worst-case analysis: Lacks discussion of pathological workloads

4. Practical Considerations

  • Cold start problem: How to initialize parameters α, β, γ, δ?
  • Online learning: How to adjust model during runtime?
  • Failure handling: Behavior under node failures not discussed

Impact

1. Academic Contribution (High)

  • Opens new direction: Multi-dimensional scaling optimization may become new research area
  • Theoretical framework: Scaling plane model extensible by future work
  • Citation potential: Expected to be widely cited in database and cloud computing venues

2. Industrial Value (High)

  • Direct applicability: Can be integrated into AWS, GCP, Azure managed database services
  • Cost savings: A cost-per-query reduction of up to 37% has enormous economic value for large-scale deployments
  • Operational improvement: Rebalancing reduction highly attractive to operations teams

3. Reproducibility (Moderate)

  • Strengths: Clear algorithm description, low complexity
  • Challenges: Requires access to CockroachDB/Redis clusters; parameter calibration requires expertise

Applicable Scenarios

Ideal Scenarios

  1. Cloud-native databases: Spanner, CockroachDB, YugabyteDB, etc.
  2. Mixed workloads: Applications with varying read-write ratios
  3. Cost-sensitive environments: Enterprises needing to optimize cloud spending
  4. Dynamic loads: Systems with daily patterns or unpredictable peaks

Inapplicable Scenarios

  1. Very small scale: Single-node or 2-3 node clusters (diagonal scaling benefits minimal)
  2. Static workloads: Completely predictable and constant loads
  3. Hard real-time systems: Cannot tolerate any scaling operation latency
  4. Highly customized systems: Scaling behavior doesn't fit general model

Key References

  1. [6] Spanner (OSDI'12): Google's globally distributed database with Paxos consensus
  2. [7] Dynamo (SOSP'07): Amazon's highly available KV store
  3. [3] Bigtable (TOCS'08): Google's distributed storage system
  4. [4] CockroachDB: Open-source distributed SQL database
  5. [5] YCSB (SoCC'10): Cloud serving systems benchmark framework
  6. [8-10] Kubernetes Autoscaling: HPA, VPA, Cluster Autoscaler

Overall Assessment

Dimension            | Score  | Explanation
Innovation           | 9/10   | Diagonal scaling is highly original concept
Technical Depth      | 8/10   | Solid theoretical derivations, well-designed algorithm
Experimental Quality | 8/10   | Multi-system verification, but limited scale
Practical Value      | 9/10   | Directly applicable to industrial systems
Writing Quality      | 8/10   | Clear but some details could be improved
Overall              | 8.4/10 | Excellent paper with significant potential impact

Recommended for: Cloud database researchers, distributed systems engineers, cloud platform architects, autoscaling system developers