The demand for computing in our daily lives has led to the proliferation of data centers that power many indispensable services. Computing has also become essential to research in many scientific fields, which requires supercomputers with vast computing capabilities to produce results in reasonable time. The scale and complexity of these systems, compared with our day-to-day devices, is like that of a living organism compared with a single cell. To make them work properly, we need state-of-the-art technology and engineering, not just raw resources. Interconnecting the compute nodes that make up the whole is a delicate task, as the network can become the bottleneck for the entire infrastructure. In this work, we explore two aspects of the network: how to prevent degradation under heavy use with congestion control, and how to save energy when idle with power management. We also examine how the two may interact.
Combined power management and congestion control in High-Speed Ethernet-based Networks for Supercomputers and Data Centers
- Paper ID: 2511.10159
- Title: Combined power management and congestion control in High-Speed Ethernet-based Networks for Supercomputers and Data Centers
- Authors: Miguel Sánchez de la Rosa, Francisco J. Andújar, Jesus Escudero-Sahuquillo, José L. Sánchez, Francisco J. Alfaro-Cortés
- Institutions: Universidad de Castilla-La Mancha (Spain), Universidad de Valladolid (Spain)
- Classification: cs.AR (Computer Architecture)
- Publication Date: November 13, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2511.10159
As data centers and supercomputers continue to scale, interconnection networks have become a potential bottleneck for entire systems. This paper investigates the joint optimization of two critical aspects of high-speed Ethernet networks, (1) preventing performance degradation under heavy load through congestion control and (2) saving energy during idle periods through power management, and explores how the two interact. The study demonstrates that an appropriate static queue scheme (SQS) combined with a dynamic power management technique (such as PerfBound) can minimize performance loss while reducing energy consumption.
This paper addresses high-speed Ethernet interconnection networks for supercomputers and data centers, exploring how to simultaneously achieve:
- Energy efficiency optimization: Reducing power consumption during network idle periods
- Congestion control: Maintaining network performance under high loads
- Synergistic integration: Understanding the interactive effects between power management and congestion control
- Growing share of power: As CPUs become more energy-proportional, the interconnection network accounts for a growing share of total system power consumption
- Performance bottleneck: Network congestion causes Head-of-Line (HoL) blocking, severely degrading overall performance
- Application characteristics: HPC applications typically alternate between computation and communication phases, providing energy-saving opportunities
- Practical requirements: Link power consumption is independent of network activity, consuming the same energy even when idle
Power Management Aspects:
- Fixed PDT (Power-down Threshold): All links use the same threshold, so it cannot adapt to different link usage patterns
- Pre-tuning required: Workloads must be run in advance to tune the threshold, which consumes additional energy
- Performance degradation: Latency overhead exists when entering and exiting Low Power Idle (LPI) states
Congestion Control Aspects:
- HoL blocking: Hot flows monopolize shared resources, causing severe performance degradation for cold flows
- Congestion backpropagation: Congestion propagates to sources through stop-start or credit-based mechanisms
- Lack of co-design: Power management and congestion control are typically studied independently
- Co-optimization opportunities: Power management and congestion control may interact with each other, requiring joint design
- Practical requirements: Both InfiniBand and Ethernet have standard energy-saving mechanisms (IBTA and EEE), but these are often disabled to pursue maximum performance
- Filling research gaps: Lack of systematic research on the combined effects of these two technologies
- First systematic study of the synergistic effects between static queue schemes (SQS) and power management techniques, revealing the dual impact of congestion control on energy consumption and performance
- Experimental evaluation of combinations of 4 SQS schemes (1Q, BBQ, DBBM, Flow2SL) with multiple power management strategies (fixed PDT and dynamic PerfBound)
- Discovery of DBBM advantages: Destination-Based Buffer Management demonstrates significant advantages when combined with power management, minimizing latency and execution time overhead while reducing energy consumption
- Practical guidance: Provides experimental evidence and configuration recommendations for joint optimization of power management and congestion control in supercomputer and data center network design
Input:
- Network topology and traffic patterns
- Static queue scheme (SQS) configuration
- Power management parameters (PDT values or PerfBound configuration)
Output:
- Network energy consumption
- Network latency
- Application execution time
Constraints:
- Performance degradation must be controlled within acceptable ranges
- Maintain lossless network characteristics
Static queue schemes (SQS) are used to mitigate HoL blocking by distributing traffic, and therefore congestion, across different virtual channels (VCs):
- 1Q (Single Queue): Baseline scheme where all flows share a single queue
- BBQ (Bubble-Based Queuing): Bubble-based queuing mechanism that reduces HoL blocking in Dragonfly topology
- DBBM (Destination-Based Buffer Management): Allocates buffers based on destination, isolating congestion of different flows
- Flow2SL (Flow to Service Level): Maps flows to different service levels, enabling finer-grained queue management
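The difference between 1Q and DBBM can be illustrated with a small queue-selection function. This is a minimal sketch, assuming DBBM picks the queue as the destination ID modulo the number of queues; `select_queue` and its parameters are hypothetical names, and BBQ and Flow2SL are left out because their mapping rules are not described here.

```python
def select_queue(scheme: str, dst: int, num_queues: int) -> int:
    """Return the index of the queue (VC) that buffers a packet for `dst`."""
    if scheme == "1Q":
        # Baseline: every flow shares queue 0, so a hot flow can block all others.
        return 0
    if scheme == "DBBM":
        # Assumed mapping: destination ID modulo the number of queues.
        return dst % num_queues
    raise ValueError(f"unsupported scheme: {scheme}")


# Hypothetical example: destination 5 is congested ("hot"), 2 and 6 are not.
for scheme in ("1Q", "DBBM"):
    queues = {dst: select_queue(scheme, dst, num_queues=4) for dst in (5, 2, 6)}
    print(scheme, queues)
```

With four queues, the hot destination 5 lands in queue 1 while destinations 2 and 6 share queue 2, so the cold flows no longer wait behind the hot one. Destinations that collide modulo the queue count (for example 1 and 5) would still share a queue.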
Fixed PDT Approach
- Principle: After packet transmission ceases, links remain active for a fixed duration (PDT)
- Parameters: Tested PDT values from 1 s (1e0 s) down to 1 ns (1e-9 s), plus PDT = 0 s (immediate sleep)
- Advantages: Simple implementation
- Disadvantages: Cannot adapt to different link usage patterns
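The trade-off behind a fixed PDT can be sketched with a toy model of a single link. The function below is an illustration under simplifying assumptions, not the simulator used in the paper: transition energy is ignored, a wake-up does not shift later transmissions, and the trace and the name `fixed_pdt_activity` are hypothetical.

```python
def fixed_pdt_activity(busy, pdt, end_time):
    """Estimate how long a link stays powered under a fixed PDT.

    busy     -- sorted, non-overlapping (start, end) transmission intervals
    pdt      -- idle time the link waits before entering low-power mode
    end_time -- end of the observation window

    Returns (active_fraction, wakeups).
    """
    active = 0.0
    wakeups = 0
    for i, (start, end) in enumerate(busy):
        active += end - start                    # time spent transmitting
        has_next = i + 1 < len(busy)
        next_start = busy[i + 1][0] if has_next else end_time
        gap = next_start - end                   # idle time after this burst
        if gap > pdt:
            active += pdt                        # stays on for PDT, then sleeps
            if has_next:
                wakeups += 1                     # next packet pays a wake-up
        else:
            active += gap                        # gap too short to sleep
    return active / (end_time - busy[0][0]), wakeups


# Hypothetical trace: three bursts in a 10-second window.
trace = [(0.0, 1.0), (1.5, 2.0), (6.0, 7.0)]
for pdt in (10.0, 1.0, 0.0):
    print(pdt, fixed_pdt_activity(trace, pdt, end_time=10.0))
```

On this trace the active fraction falls from 1.0 to 0.5 to 0.25 as the PDT shrinks, while the number of wake-ups rises from 0 to 1 to 2, which is the energy-versus-latency tension a single global threshold has to balance.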
PerfBound Dynamic Approach
- Principle: Dynamically computes PDT values for each port to satisfy preset performance degradation limits
- Mechanism: Based on histogram management data structures
- Three histogram management strategies: regular histogram, circular histogram, and self-clearing histogram
- Advantages: Adaptive adjustment without pre-tuning
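To make the idea of a per-port, histogram-driven PDT concrete, here is a hedged sketch. It is not the published PerfBound algorithm: it assumes each port records its idle gaps in a plain (never-cleared) histogram and picks the smallest candidate PDT whose estimated wake-up delay stays within a fraction `bound` of the elapsed time. `IdleHistogram`, `choose_pdt`, `t_wake`, and all numeric values are hypothetical.

```python
import bisect


class IdleHistogram:
    """Counts idle gaps per bin; bin i holds gaps in (edges[i-1], edges[i]]."""

    def __init__(self, edges):
        self.edges = sorted(edges)                   # candidate PDT values (s)
        self.counts = [0] * (len(self.edges) + 1)    # last bin: gap > edges[-1]

    def record(self, gap):
        self.counts[bisect.bisect_left(self.edges, gap)] += 1


def choose_pdt(hist, t_wake, bound, elapsed):
    """Smallest candidate PDT whose wake-up cost fits the degradation budget."""
    budget = bound * elapsed                         # allowed extra delay (s)
    for k, pdt in enumerate(hist.edges):
        # Gaps longer than this PDT would put the link to sleep and
        # charge one wake-up penalty to the next packet.
        wakeups = sum(hist.counts[k + 1:])
        if wakeups * t_wake <= budget:
            return pdt
    return float("inf")                              # never sleep


hist = IdleHistogram([1e-6, 1e-5, 1e-4, 1e-3])
for gap, n in ((5e-6, 500), (5e-5, 50), (5e-4, 5)):  # synthetic idle gaps
    for _ in range(n):
        hist.record(gap)

print(choose_pdt(hist, t_wake=4e-6, bound=0.01, elapsed=0.1))
```

Because each port keeps its own histogram, a lightly used link can settle on a small PDT while a busy one keeps a larger value, which is the adaptivity a single fixed threshold lacks.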
- Co-evaluation framework: First to evaluate SQS and power management as coupled systems rather than independent optimizations
- Multi-dimensional analysis: Simultaneously examines three critical metrics: energy consumption, network latency, and execution time
- Discovery of DBBM synergistic effects: Reveals special advantages of DBBM in power management scenarios:
- Better buffer management reduces unnecessary link wake-ups
- Destination-based traffic isolation makes more links eligible for sleep states
- Practical orientation: Based on standardized technologies (EEE), research results can be directly applied to real systems
- Testing platform: High-speed Ethernet network simulator based on BXIv3
- Network type: Lossless network
- Flow control mechanism: Supports virtual channels (VCs) and priority-based flow control (PFC)
- Energy Consumption:
- Percentage relative to baseline without power-saving
- Lower is better
- Network Latency:
- Average percentage increase in application-layer network latency
- Measured relative to baseline without power-saving
- Execution Time Increase:
- Percentage increase in total application execution time
- Reflects overall performance impact
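The three metrics above are all expressed relative to a run without power saving. A small helper makes the normalization explicit; the function name, dictionary keys, and the numbers in the example are placeholders chosen for illustration, not measurements from the paper.

```python
def relative_metrics(baseline, run):
    """Normalize one run against the no-power-saving baseline.

    Both arguments are dicts with 'energy', 'latency' and 'exec_time' keys
    in any consistent units.
    """
    return {
        # Energy as a percentage of the baseline (lower is better).
        "energy_pct": 100.0 * run["energy"] / baseline["energy"],
        # Percentage increase over the baseline (0 means no overhead).
        "latency_increase_pct": 100.0 * (run["latency"] / baseline["latency"] - 1.0),
        "exec_time_increase_pct": 100.0 * (run["exec_time"] / baseline["exec_time"] - 1.0),
    }


# Placeholder values, for illustration only.
baseline = {"energy": 200.0, "latency": 2.0, "exec_time": 50.0}
run = {"energy": 152.0, "latency": 2.1, "exec_time": 51.0}
print(relative_metrics(baseline, run))
```

Reporting energy as a share of the baseline and the other two metrics as increases matches how the results are quoted later (for example, "76%" energy versus a "1-3%" execution-time increase).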
SQS Schemes:
- 1Q (baseline)
- BBQ
- DBBM
- Flow2SL
Power Management Schemes:
- No power-saving
- Fixed PDT (8 different values: 1e0 s to 1e-9 s, and 0 s)
- PerfBound (3 histogram management strategies)
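The evaluation is a cross product of queue schemes and power-management settings. The sketch below enumerates such a grid; `build_grid` and the dictionary layout are hypothetical, and the PDT list passed in the example is a shortened stand-in for the values actually tested.

```python
from itertools import product

SQS_SCHEMES = ["1Q", "BBQ", "DBBM", "Flow2SL"]
PERFBOUND_HISTOGRAMS = ["regular", "circular", "self-clearing"]


def build_grid(pdt_values):
    """Enumerate every (SQS, power-management) combination to simulate."""
    power = [("none", None)]                                  # baseline
    power += [("fixed_pdt", pdt) for pdt in pdt_values]       # one per PDT
    power += [("perfbound", h) for h in PERFBOUND_HISTOGRAMS]
    return [
        {"sqs": sqs, "power": kind, "param": param}
        for sqs, (kind, param) in product(SQS_SCHEMES, power)
    ]


# Shortened, illustrative PDT list: 4 * (1 + 3 + 3) = 28 configurations.
grid = build_grid(pdt_values=[1e-3, 1e-6, 0.0])
print(len(grid))
```

Keeping the baseline as an explicit entry for every SQS scheme means each power-saving run can be normalized against a baseline that uses the same queue scheme.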
- PDT range: From 1 second to 1 nanosecond, covering multiple orders of magnitude
- PerfBound configuration: Sets performance degradation limits and dynamically adjusts PDT
- Test scenarios: Simulates typical HPC workloads with alternating computation and communication phases
Fixed PDT Effects (Figure 1a):
- Dominant factor: PDT value is the primary determinant of energy consumption
- Minimal SQS impact: Energy consumption differences between different SQS schemes are insignificant (under fixed PDT)
- Energy consumption range: Decreases from 100% (no power-saving) to approximately 16% (at PDT=1e-9s)
- Trend: Smaller PDT values lead to lower energy consumption, but increase performance degradation risk
PerfBound Effects (Figure 1b):
- More conservative savings: Energy consumption ranges from 76% to 100% of the no-power-saving baseline, which is higher than the most aggressive fixed PDT settings (approximately 16%)
- DBBM advantages emerge: DBBM combined with PerfBound achieves the lowest energy consumption
- Histogram strategy impact: Three histogram management strategies show minimal differences (approximately 80%-96%)
- Synergistic effects: DBBM's buffer management characteristics produce synergistic effects with dynamic PDT adjustment
Fixed PDT Impact (Figure 2a):
- Latency increase range: From 1.1% to 102.1%
- PDT critical threshold: Clear performance inflection point exists
- Very small PDT (e.g., 1e-9s): Significant latency increase (>80%)
- Moderate PDT (e.g., 1e-5s to 1e-6s): Controllable latency increase (<20%)
- SQS differentiation:
- DBBM performs best: Minimal latency increase across all PDT values
- 1Q performs worst: Most significant latency increase
- BBQ and Flow2SL intermediate: Performance between the two extremes
PerfBound Impact (Figure 2b):
- More pronounced SQS differences: Performance differences between different SQS schemes are amplified with PerfBound
- DBBM advantages prominent: Latency increase approximately 5-10%
- 1Q disadvantages evident: Latency increase can reach 40-45%
- Minimal histogram strategy impact: Differences between three strategies within 5%
Fixed PDT Impact (Figure 3a):
- Overall trend: Execution time overhead increases as PDT decreases
- DBBM significant advantages:
- Execution time increase only 1-3%
- Notably lower than other SQS schemes (3-8%)
- 1Q worst case: Overhead can reach 8% under strict PDT
PerfBound Impact (Figure 3b):
- More pronounced SQS effects:
- DBBM: 1-3% increase
- BBQ and Flow2SL: 3-5% increase
- 1Q: 5-8% increase
- Histogram strategy: Minimal impact on execution time
- Performance-energy tradeoff: DBBM achieves optimal performance-energy balance
- DBBM's superior performance:
- Consistently performs best across all power management configurations
- Successfully reduces energy consumption while controlling performance degradation to minimum levels
- Execution time overhead only 1-3%, while energy consumption can be reduced by 20-24% (using PerfBound)
- Confirmed synergistic effects:
- Power management and congestion control are not independent
- Good SQS can enhance power management effectiveness
- DBBM's destination-based buffer management enables more links to enter sleep states
- Effectiveness of PerfBound:
- Compared to fixed PDT, PerfBound adapts dynamically
- Maximizes energy savings while guaranteeing performance constraints
- Best results achieved when combined with DBBM
- Limited histogram strategy impact:
- Minimal differences between three histogram management strategies
- Indicates that PerfBound's core mechanism is key, with implementation details having minor impact
- EEE standards and improvements:
- IEEE 802.3az (EEE): Ethernet energy efficiency standard defining Low Power Idle (LPI) states
- Fixed PDT [12]: Saravanan et al. proposed maintaining link activity for fixed duration after transmission
- PerfBound [13]: Dynamically computes PDT values to satisfy preset performance degradation limits
- Paper's improvement [4]: Enhanced PerfBound version proposed by the authors
- Energy-proportional networks:
- Abts et al. [1]: Pioneering work on energy-proportional data center networks
- InfiniBand power-saving [5]: Software-managed power reduction techniques in IBTA standard
- Static Queue Schemes (SQS):
- BBQ [14]: Direct queuing scheme for Dragonfly topology
- DBBM [9]: Destination-based buffer management reducing HoL blocking
- Flow2SL [15]: Efficient queuing scheme for minimal path routing
- Dynamic Virtual Lanes (DVL):
- DVL [6, 10]: Dynamically allocates VCs to congested flows, isolating congestion effects
- End-to-end flow control:
- PFC [16]: Priority-based flow control operating on individual VCs
- SFC [7, 8]: Source flow control, completely stopping injection
- DCQCN [16]: Data Center Quantized Congestion Notification, throttling congested flows
- DCTCP [2]: Data Center TCP, ECN-based congestion control
Distinctions:
- First systematic study of synergistic effects between SQS and power management
- Provides comprehensive multi-dimensional evaluation (energy, latency, execution time)
- Reveals special advantages of DBBM in energy-saving scenarios
Advantages:
- More comprehensive experimental design (4 SQS × multiple power management strategies)
- High practical value based on standardized technologies
- Provides clear guidance for real system configuration
- Necessity of co-optimization: Power management and congestion control must be considered jointly, with significant interactions between them
- Recommended use of DBBM: In scenarios requiring simultaneous consideration of energy efficiency and performance, DBBM is the optimal choice:
- Energy consumption reduction of 20-24% (compared to no power-saving)
- Performance degradation only 1-3%
- Minimal network latency increase
- Applicability of PerfBound: Dynamic PDT adjustment outperforms fixed PDT, enabling adaptive optimization across different workloads
- Practical value: Research results can be directly applied to EEE-based high-speed Ethernet systems
- Limited experimental scope:
- Only tested 4 SQS schemes
- Does not cover all possible network topologies
- Workload characteristics not detailed
- Lack of theoretical analysis:
- Primarily based on experimental observations
- Lacks theoretical explanation for DBBM advantages
- No mathematical model for performance-energy relationship
- Insufficient implementation details:
- Specific PerfBound parameter configuration not detailed
- Implementation details of histogram management strategies unclear
- Lacks verification on actual hardware
- Insufficient consideration of dynamic scenarios:
- Does not study adaptability to workload changes
- Lacks analysis of burst traffic
- Does not consider abnormal situations such as network faults
While not explicitly stated in the paper, the following research directions can be inferred:
- Extended experiments:
- Test more SQS schemes and network topologies
- Evaluation using real HPC applications
- Verification on actual hardware
- Theoretical modeling:
- Establish analytical models for performance-energy relationships
- Theoretically explain sources of DBBM advantages
- Provide theoretical guidance for optimal configuration
- Dynamic optimization:
- Develop online adaptive algorithms
- Incorporate workload prediction
- Combine machine learning for parameter optimization
- Hardware co-design:
- Explore hardware-level optimization opportunities
- Design dedicated power management circuits
- Optimize state transition latency
- Important and practical research problem:
- Addresses actual needs of supercomputers and data centers
- Energy consumption increasingly critical, with real-world significance
- Based on standardized technologies, easy to deploy
- Systematic and comprehensive research methodology:
- Combined evaluation of multiple SQS and power management strategies
- Comprehensive analysis of three key metrics
- Reasonable experimental design with sufficient comparisons
- Findings with practical value:
- DBBM advantages clear and significant
- Provides clear guidance for system configuration
- Quantifies performance-energy tradeoffs
- Clear and concise writing:
- Reasonable structure with clear logic
- Intuitive figures with easily understandable results
- Sufficient background introduction
- Insufficient experimental depth:
- Lacks detailed workload descriptions
- Network scale and topology details not specified
- Missing statistical significance analysis
- Only average values, lacking variance or confidence intervals
- Limited theoretical contributions:
- Primarily experimental work
- Lacks theoretical explanation of phenomena
- No design principles or methodological guidance provided
- Insufficient depth of analysis:
- Does not analyze fundamental reasons for DBBM advantages
- Lacks discussion of different traffic patterns
- Does not explore generalizability of results
- Brief related work discussion:
- Simple enumeration in Section 2
- Lacks in-depth comparison with existing work
- Lacks clear positioning of this paper
- Lacks actual verification:
- Based only on simulation experiments
- Not verified on real systems
- Implementation costs and deployment difficulties not discussed
Contribution to the field:
- Medium-high: Fills research gap in co-optimization
- Provides practical guidance for HPC and data center network design
- Promotes application of energy-saving technologies in high-performance networks
Practical value:
- High: Based on standardized technologies, easy to implement
- DBBM + PerfBound combination can be directly applied
- 20%+ energy savings have economic value
Reproducibility:
- Medium:
- Clear method description but insufficient details
- Lacks public code and datasets
- Requires specialized simulator or hardware platform
Citation potential:
- Expected to be cited in HPC networking and green computing fields
- Provides baseline for subsequent co-optimization research
- Limited theoretical contributions may affect long-term impact
Most suitable scenarios:
- Supercomputer interconnection networks:
- Clear separation between computation and communication phases
- Energy-sensitive but high performance requirements
- Using high-speed Ethernet or InfiniBand
- Data center networks:
- Large load fluctuations with energy-saving opportunities
- Need to guarantee low latency
- Adopting lossless Ethernet
- Cloud computing infrastructure:
- Multi-tenant environments requiring congestion isolation
- Energy costs are important consideration
- Diverse workloads
Less suitable scenarios:
- Real-time systems: Extremely sensitive to latency jitter
- Small-scale networks: Energy-saving benefits not significant
- Continuously high-load systems: Lack of energy-saving opportunities
[1] Abts et al., 2010 - Pioneering work on energy-proportional data center networks
[3] Christensen et al., 2010 - IEEE 802.3az EEE standard
[9] Nachiondo et al., 2010 - DBBM buffer management scheme
[13] Saravanan & Carpenter, 2018 - PerfBound dynamic PDT method
[15] Yébenes et al., 2015 - Flow2SL queuing scheme
[16] Zhu et al., 2015 - DCQCN congestion control
This is a practical, experiment-oriented research paper addressing energy consumption optimization in supercomputer and data center networks, systematically evaluating the synergistic effects between congestion control and power management. The paper's main value lies in:
- Filling research gaps: First systematic study of interactions between two technologies
- High practical value: DBBM + PerfBound combination can be directly applied, achieving 20%+ energy savings with <3% performance degradation
- Comprehensive experiments: Full comparison of multiple scheme combinations
Main limitations include limited theoretical depth, lack of deep explanation of phenomena, and absence of actual system verification. However, as an application-oriented paper, its experimental results and practical guidance have significant value and are expected to positively impact green transformation of HPC and data center networks.
Recommendation Score: ⭐⭐⭐⭐ (4/5) - Highly valuable reference for researchers and engineers working on HPC networking and green computing.