LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
Kang, Song, Kim
Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, namely OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.
Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capabilities of small language models by decomposing complex problems into sequential sub-stages. However, this approach comes at the cost of increased latency. The authors observe that existing adaptive acceleration techniques (such as layer skipping) struggle to balance efficiency and accuracy in this setting, facing two key challenges: (1) differential skip sensitivity across stages, and (2) generation of redundant output tokens. To address these issues, this paper proposes LiteStage, a latency-aware layer skipping framework tailored for multi-stage reasoning. LiteStage combines stage-wise offline search for optimal layer budget allocation with confidence-based online generation early-exit mechanisms to suppress unnecessary decoding. Experiments on three benchmarks—OBQA, CSQA, and StrategyQA—demonstrate that LiteStage achieves up to 1.70× speedup with less than 4.0% accuracy loss, outperforming previous training-free layer skipping methods.
Multi-stage reasoning enhances the reasoning capabilities of small language models by decomposing complex problems into multiple sequential sub-problems. For example, TinyThinker employs three-stage reasoning: Recall, Analysis, and Summary. While this approach effectively improves reasoning quality, it inevitably increases inference latency.
Through in-depth analysis, the authors identify two critical issues:
Differential Skip Sensitivity Across Stages: Different reasoning stages exhibit significantly varying sensitivity to layer skipping. Experiments demonstrate that Stage 3 (Summary stage) is most sensitive to layer skipping, while Stage 1 (Recall stage) is relatively robust.
Redundant Token Generation: Although layer skipping reduces per-token computational cost, it often results in generating more tokens, which paradoxically increases end-to-end latency.
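The second issue follows from simple arithmetic: decoding latency is roughly the number of generated tokens times the per-token cost, so a cheaper token does not help if more tokens are produced. The sketch below uses made-up numbers (they are not measurements from the paper) and ignores prefill time.

```python
def end_to_end_latency(num_tokens: int, ms_per_token: float) -> float:
    """Approximate decoding latency in milliseconds (prefill ignored)."""
    return num_tokens * ms_per_token

# Hypothetical numbers for illustration only, not taken from the paper.
full = end_to_end_latency(num_tokens=100, ms_per_token=20.0)
skipped = end_to_end_latency(num_tokens=140, ms_per_token=16.0)

# Per-token cost drops by 20%, but 40% more tokens are generated,
# so the layer-skipped run is slower end to end.
speedup = full / skipped
print(f"full={full:.0f} ms, skipped={skipped:.0f} ms, speedup={speedup:.2f}x")
```

With these assumed values the layer-skipped run takes longer overall, which is the situation the latency-aware design is meant to avoid.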
Existing layer skipping methods (such as SkipDecode, UnifiedSkip, and AdaSkip) typically apply a uniform skipping strategy that cannot adapt to the characteristics of individual stages in multi-stage reasoning. This has two consequences:
Sensitive stages are compressed excessively, causing sharp accuracy degradation
The increase in generation length caused by layer skipping is overlooked
Proposes LiteStage Framework: The first latency-aware layer skipping framework specifically designed for multi-stage reasoning, effectively addressing differential stage sensitivity and redundant token generation.
Stage-wise Layer Budget Allocation Strategy: Designs a greedy search algorithm proceeding from the slowest to fastest stage, allocating optimal layer skipping budgets for each reasoning stage.
Confidence-driven Generation Early-exit Mechanism: Introduces online confidence monitoring to dynamically terminate low-confidence redundant generation, further enhancing inference efficiency.
Significant Performance Improvements: Achieves 1.16-1.70× speedup on three benchmark datasets with only 0.4-4.0% accuracy loss, substantially surpassing existing training-free methods.
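The stage-wise allocation described above can be pictured as a greedy loop that visits stages from slowest to fastest and fixes one budget at a time. The following is a minimal sketch under stated assumptions, not the authors' implementation: `greedy_stage_search`, the `evaluate` callback, the `min_accuracy` constraint, and the toy cost model are all hypothetical names and numbers introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]  # stage name -> number of skipped layers (assumed encoding)

def greedy_stage_search(
    stage_latency: Dict[str, float],
    budgets: List[int],
    evaluate: Callable[[Config], Tuple[float, float]],
    min_accuracy: float,
) -> Config:
    """Pick one skip budget per stage, visiting the slowest stage first."""
    config: Config = {stage: 0 for stage in stage_latency}
    order = sorted(stage_latency, key=stage_latency.get, reverse=True)
    for stage in order:
        best_budget = config[stage]
        _, best_latency = evaluate(config)
        for budget in budgets:
            trial = {**config, stage: budget}
            accuracy, latency = evaluate(trial)
            # Accept a budget only if it stays accurate enough AND is actually faster.
            if accuracy >= min_accuracy and latency < best_latency:
                best_budget, best_latency = budget, latency
        config[stage] = best_budget  # earlier decisions stay fixed for later stages
    return config

# Toy cost model (hypothetical): the summary stage loses accuracy fastest.
PENALTY = {"recall": 0.001, "analysis": 0.002, "summary": 0.01}
BASE_LATENCY = {"recall": 3.0, "analysis": 5.0, "summary": 1.0}

def toy_evaluate(config: Config) -> Tuple[float, float]:
    accuracy = 0.80 - sum(PENALTY[s] * k for s, k in config.items())
    latency = sum(BASE_LATENCY[s] * (1 - 0.02 * k) for s, k in config.items())
    return accuracy, latency

result = greedy_stage_search(BASE_LATENCY, [0, 4, 8], toy_evaluate, min_accuracy=0.77)
print(result)
```

In this toy setup the search skips layers in the two robust stages and leaves the sensitive summary stage untouched, mirroring the sensitivity pattern the paper reports. A real search would replace `toy_evaluate` with measured accuracy and latency on a calibration set.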
Non-uniform Layer Budget Allocation: Adaptively allocates different layer skipping budgets according to stage-specific sensitivity differences, avoiding excessive compression in sensitive stages.
Latency-aware Optimization: Considers not only accuracy but also actual inference latency, automatically excluding configurations that skip more layers but incur higher latency.
Dynamic Generation Control: Actively controls generation length through confidence monitoring, mitigating redundant token issues caused by layer skipping.
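One way to read the latency-aware step is as a filter over measured candidates: a configuration that skips more layers is kept only if it is actually faster than every configuration that skips fewer. The sketch below is an illustration under that assumption; the `Candidate` type, the function name, and the sample measurements are hypothetical and not from the paper.

```python
from typing import List, NamedTuple

class Candidate(NamedTuple):
    skipped_layers: int
    accuracy: float
    latency_ms: float

def prune_slower_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Drop candidates that skip more layers yet are not faster than a lighter one."""
    kept: List[Candidate] = []
    best_latency = float("inf")
    for cand in sorted(candidates, key=lambda c: c.skipped_layers):
        if cand.latency_ms < best_latency:
            kept.append(cand)
            best_latency = cand.latency_ms
    return kept

# Hypothetical measurements: skipping 8 layers generates more tokens and ends up slower.
measured = [
    Candidate(0, 0.80, 2000.0),
    Candidate(4, 0.79, 1700.0),
    Candidate(8, 0.78, 1850.0),
    Candidate(12, 0.76, 1500.0),
]
kept = prune_slower_candidates(measured)
print([c.skipped_layers for c in kept])
```

Here the 8-layer candidate is discarded because it costs accuracy without buying any latency over the 4-layer one.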
Differential Stage Sensitivity: Single-stage skipping experiments confirm that Stage 3 is most sensitive to layer skipping, with its accuracy curve nearly determining the overall performance upper bound.
Latency Paradox: More layer skipping does not always yield faster inference; due to increased generation length, certain configurations paradoxically result in higher latency.
Confidence Patterns: Layer-skipped models exhibit monotonically decreasing token confidence, while full-layer models may recover confidence in later stages.
Concrete examples from CSQA demonstrate that generation early-exit effectively truncates low-confidence redundant text while preserving core reasoning logic, maintaining consistent final answers.
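A confidence-based early exit of this kind can be sketched as a decoding loop that tracks recent token confidences and stops once they stay low. This is a minimal sketch, not the paper's exact rule: the sliding-window mean, the `threshold` and `window` values, and the `next_token` callback (which stands in for a model step returning a token id and its top-1 probability) are assumptions made for illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple

def generate_with_early_exit(
    next_token: Callable[[List[int]], Tuple[int, float]],
    prompt: List[int],
    eos_id: int,
    max_new_tokens: int = 64,
    threshold: float = 0.5,
    window: int = 4,
) -> List[int]:
    """Decode until EOS, the token limit, or sustained low confidence."""
    recent: Deque[float] = deque(maxlen=window)  # small cache of recent confidences
    output: List[int] = []
    for _ in range(max_new_tokens):
        token, confidence = next_token(prompt + output)
        if token == eos_id:
            break
        output.append(token)
        recent.append(confidence)
        # Exit once the window is full and its mean confidence drops below the threshold.
        if len(recent) == window and sum(recent) / window < threshold:
            break
    return output

# Toy stand-in for a layer-skipped model: confidence decays by 0.1 per token.
def toy_next_token(tokens: List[int]) -> Tuple[int, float]:
    step = len(tokens) - 1  # the prompt below has exactly one token
    return 100 + step, max(0.0, 0.9 - 0.1 * step)

generated = generate_with_early_exit(toy_next_token, prompt=[1], eos_id=2)
print(len(generated))
```

With the toy model, generation stops well before the 64-token limit once the windowed confidence falls under the threshold. Averaging over a window rather than reacting to a single token is a design choice in this sketch that avoids cutting off on one uncertain token.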
Non-uniform Sensitivity in Multi-stage Reasoning: Different reasoning stages exhibit significantly varying sensitivity to layer compression, requiring differentiated optimization strategies.
Necessity of Latency-aware Optimization: Pure layer skipping may degrade latency due to increased generation length, necessitating joint consideration of accuracy and latency.
Effectiveness of Generation Control: Confidence-based generation early-exit effectively mitigates redundant generation caused by layer skipping.
Offline Search Overhead: Compared to other training-free methods, LiteStage's offline configuration requires more computational resources (approximately 1-7.6 hours).
Model Architecture Dependency: Primarily validated on Llama-series models, with limited effectiveness on other architectures like Qwen.
Scope Limitations: Specifically designed for multi-stage reasoning scenarios; applicability to single-stage reasoning remains insufficiently validated.
Accurate Problem Identification: Precisely identifies key bottlenecks in multi-stage reasoning, including differential stage sensitivity and redundant generation.
Reasonable Method Design: The offline-online combined framework design is elegant, ensuring optimization effectiveness while controlling runtime overhead.
Comprehensive Experimental Design: Thoroughly validates method effectiveness through detailed motivation experiments, ablation studies, and case analyses.
High Practical Value: As a training-free method, demonstrates strong potential for real-world applications.
Insufficient Theoretical Analysis: Lacks theoretical explanation for differential stage sensitivity, relying primarily on empirical observations.
Heuristic Parameter Settings: Key parameters such as confidence threshold and cache size are primarily heuristically determined, lacking systematic analysis.
Limited Generalizability: Performance varies significantly across different model architectures, with room for improvement in generalization.
Academic Contribution: First systematic study of layer skipping optimization in multi-stage reasoning, providing new perspectives for related research.
Practical Value: Provides practical solutions for efficient inference in small language models, facilitating edge deployment.
Reproducibility: Provides complete code implementation, facilitating subsequent research and applications.
The paper cites multiple important related works, including:
TinyThinker (Piao and Park, 2024): Representative work in multi-stage reasoning
AdaSkip (He et al., 2025): A recent method for sub-layer skipping
Mixture-of-Depths (Raposo et al., 2024): Pioneering work in dynamic computation allocation
Overall Assessment: This paper proposes an innovative solution to layer skipping optimization in multi-stage reasoning, contributing both empirical insights and practical speedups. Despite certain limitations, it opens new research directions for efficient inference in small language models and has clear academic and practical value.