2025-11-17T14:34:12.785982

LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Kang, Song, Kim
Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks, namely OBQA, CSQA, and StrategyQA, show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

Basic Information

Abstract

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capabilities of small language models by decomposing complex problems into sequential sub-stages. However, this approach comes at the cost of increased latency. The authors observe that existing adaptive acceleration techniques (such as layer skipping) struggle to balance efficiency and accuracy in this setting, facing two key challenges: (1) differential skip sensitivity across stages, and (2) generation of redundant output tokens. To address these issues, this paper proposes LiteStage, a latency-aware layer skipping framework tailored for multi-stage reasoning. LiteStage combines stage-wise offline search for optimal layer budget allocation with confidence-based online generation early-exit mechanisms to suppress unnecessary decoding. Experiments on three benchmarks—OBQA, CSQA, and StrategyQA—demonstrate that LiteStage achieves up to 1.70× speedup with less than 4.0% accuracy loss, outperforming previous training-free layer skipping methods.

Research Background and Motivation

Problem Definition

Multi-stage reasoning enhances the reasoning capabilities of small language models by decomposing complex problems into multiple sequential sub-problems. For example, TinyThinker employs three-stage reasoning: Recall, Analysis, and Summary. While this approach effectively improves reasoning quality, it inevitably increases inference latency.

Core Challenges

Through in-depth analysis, the authors identify two critical issues:

  1. Differential Skip Sensitivity Across Stages: Different reasoning stages exhibit significantly varying sensitivity to layer skipping. Experiments demonstrate that Stage 3 (Summary stage) is most sensitive to layer skipping, while Stage 1 (Recall stage) is relatively robust.
  2. Redundant Token Generation: Although layer skipping reduces per-token computational cost, it often results in generating more tokens, which paradoxically increases end-to-end latency.

Limitations of Existing Methods

Existing layer skipping methods (such as SkipDecode, UnifiedSkip, AdaSkip) typically apply a uniform skipping strategy that cannot adapt to the different stages of multi-stage reasoning. As a result, they:

  • Over-compress sensitive stages, causing sharp accuracy degradation
  • Overlook the increase in generation length caused by layer skipping
  • Lack latency-aware optimization mechanisms

Core Contributions

  1. Proposes LiteStage Framework: The first latency-aware layer skipping framework specifically designed for multi-stage reasoning, effectively addressing differential stage sensitivity and redundant token generation.
  2. Stage-wise Layer Budget Allocation Strategy: Designs a greedy search algorithm proceeding from the slowest to fastest stage, allocating optimal layer skipping budgets for each reasoning stage.
  3. Confidence-driven Generation Early-exit Mechanism: Introduces online confidence monitoring to dynamically terminate low-confidence redundant generation, further enhancing inference efficiency.
  4. Significant Performance Improvements: Achieves 1.16-1.70× speedup on three benchmark datasets with only 0.4-4.0% accuracy loss, substantially surpassing existing training-free methods.

Methodology Details

Task Definition

Given a test dataset D, the objective is to find stage-wise layer budgets L that minimize inference latency within a given accuracy threshold ε:

argmin_L (1/|D|) ∑_{d∈D} T(M_L(d))
subject to: A(M_L(d)) ≥ A(M(d)) - ε

where T and A denote inference latency and accuracy respectively, and M_L and M represent models with and without layer skipping.
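
The objective can be illustrated with a minimal sketch, reading the constraint as "accuracy may drop by at most ε relative to the full model". The function name `select_budget`, the dictionary keys, and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from statistics import mean

def select_budget(candidates, full_accuracy, epsilon):
    """Return the lowest-latency candidate whose accuracy drop is at most epsilon.

    candidates: list of dicts with hypothetical keys "budgets" (per-stage skip
    counts), "accuracy", and "latencies" (per-example seconds).
    """
    feasible = [c for c in candidates if c["accuracy"] >= full_accuracy - epsilon]
    if not feasible:
        return None  # no configuration satisfies the accuracy constraint
    # Among feasible configurations, minimize the mean latency over the dataset.
    return min(feasible, key=lambda c: mean(c["latencies"]))

# Made-up measurements for three configurations of a three-stage pipeline.
candidates = [
    {"budgets": (0, 0, 0), "accuracy": 0.64, "latencies": [2.0, 2.2]},
    {"budgets": (6, 4, 2), "accuracy": 0.61, "latencies": [1.5, 1.7]},
    {"budgets": (10, 8, 6), "accuracy": 0.52, "latencies": [1.1, 1.3]},
]
best = select_budget(candidates, full_accuracy=0.64, epsilon=0.04)
print(best["budgets"])
```

The fastest configuration is rejected here because it violates the accuracy constraint, so the search settles on the intermediate one.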

Model Architecture

LiteStage comprises two complementary components:

1. Offline Configuration

Step 1: Layer Importance Estimation

  • Employs sub-layer-level cosine similarity as an importance proxy
  • Separately computes importance for multi-head self-attention (MHSA) and feed-forward networks (FFN), averaging over N inputs x_n to sub-layer j:
I^(j)_MHSA = (1/N) ∑_{n=0}^{N-1} cos(MHSA^(j)(x_n) + x_n, x_n)
I^(j)_FFN = (1/N) ∑_{n=0}^{N-1} cos(FFN^(j)(x_n) + x_n, x_n)
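
The similarity score above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the authors' implementation: sub-layers are modeled as functions on lists of floats, `near_identity` and `rotate` are toy stand-ins, and the reading that a score close to 1 marks a sub-layer as a cheaper candidate for skipping is the usual interpretation of such proxies rather than something stated in this summary.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sublayer_similarity(sublayer, inputs):
    """Mean of cos(sublayer(x) + x, x) over all inputs x."""
    total = 0.0
    for x in inputs:
        out = [o + xi for o, xi in zip(sublayer(x), x)]  # residual connection
        total += cosine(out, x)
    return total / len(inputs)

# Toy sub-layers (hypothetical): one barely changes x, one adds an
# orthogonal component of the same magnitude.
def near_identity(x):
    return [0.01 * xi for xi in x]

def rotate(x):
    return [-x[1], x[0]]

inputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
sim_identity = sublayer_similarity(near_identity, inputs)
sim_rotate = sublayer_similarity(rotate, inputs)
print(sim_identity, sim_rotate)
```

The first sub-layer leaves the direction of its input unchanged and scores close to 1, while the second changes it substantially and scores lower.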

Step 2: Layer Budget Search

  • Initiates greedy search from the slowest reasoning stage
  • Constructs accuracy-latency curves, selecting optimal latency configurations under accuracy constraints
  • Performs stage-by-stage optimization to accurately reflect inter-stage interactions
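
One way to picture the search in Step 2 is the following sketch. It assumes a hypothetical callback `evaluate(budgets)` that returns measured accuracy and latency for a full pipeline run, and `toy_evaluate` is a made-up stand-in for those measurements; the real procedure may differ in how candidates are enumerated and scored.

```python
def greedy_stage_search(stage_latency, skip_options, evaluate, min_accuracy):
    """Assign a skip count to each stage, visiting the slowest stage first.

    stage_latency: {stage: full-model seconds}, used only to order the stages
    skip_options:  candidate numbers of skipped sub-layers, e.g. [0, 4, 8]
    evaluate:      hypothetical callback, budgets -> (accuracy, latency)
    """
    budgets = {stage: 0 for stage in stage_latency}
    order = sorted(stage_latency, key=stage_latency.get, reverse=True)
    for stage in order:
        best_skip, best_latency = 0, None
        for skip in skip_options:
            accuracy, latency = evaluate({**budgets, stage: skip})
            if accuracy < min_accuracy:
                continue  # violates the accuracy constraint
            if best_latency is None or latency < best_latency:
                best_skip, best_latency = skip, latency
        budgets[stage] = best_skip  # fixed before the next stage is searched
    return budgets

def toy_evaluate(budgets):
    # Made-up model: the summary stage loses accuracy fastest, and skipping 8
    # sub-layers in the analysis stage lengthens its output enough to cost time.
    drop = {"recall": 0.001, "analysis": 0.002, "summary": 0.01}
    base = {"recall": 1.0, "analysis": 1.5, "summary": 0.5}
    accuracy = 0.64 - sum(drop[s] * k for s, k in budgets.items())
    latency = 0.0
    for s, k in budgets.items():
        length_penalty = 1.6 if (s == "analysis" and k == 8) else 1.0
        latency += base[s] * (1 - 0.03 * k) * length_penalty
    return accuracy, latency

stage_latency = {"recall": 1.0, "analysis": 1.5, "summary": 0.5}
found = greedy_stage_search(stage_latency, [0, 4, 8], toy_evaluate, min_accuracy=0.60)
print(found)
```

Because each candidate is scored by measured latency rather than by the number of skipped layers, the configuration that skips more of the analysis stage but runs slower is discarded automatically.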

2. Online Adjustment

Step 3: Generation Early-exit

  • Maintains confidence cache for the most recent n tokens
  • Computes average confidence μ_Conf, terminating generation early when below threshold
  • Confidence is defined as the maximum logit value for each token
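
The early-exit rule in Step 3 can be sketched as a sliding-window check. The function below is an assumption-laden illustration: it takes a stream of per-token confidence scores in place of a real decoding loop, and it waits until the cache holds n scores before testing the threshold, a detail this summary does not specify.

```python
from collections import deque

def steps_before_exit(step_confidences, threshold=0.5, cache_size=5):
    """Return how many decoding steps run before the early exit triggers.

    step_confidences: iterable of per-token confidence scores, standing in
    for a real decoding loop (hypothetical interface).
    """
    cache = deque(maxlen=cache_size)  # keeps only the most recent scores
    steps = 0
    for conf in step_confidences:
        steps += 1
        cache.append(conf)
        # Assumption: only test once the cache is full.
        if len(cache) == cache_size and sum(cache) / cache_size < threshold:
            break  # mean confidence of the last n tokens fell below threshold
    return steps

# Made-up confidence traces: one that collapses and one that stays high.
collapsing = [0.9, 0.9, 0.8, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.1]
steady = [0.9] * 10
print(steps_before_exit(collapsing), steps_before_exit(steady))
```

Averaging over a short window, instead of reacting to a single low-confidence token, keeps one uncertain token from ending an otherwise confident generation.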

Technical Innovations

  1. Non-uniform Layer Budget Allocation: Adaptively allocates different layer skipping budgets according to stage-specific sensitivity differences, avoiding excessive compression in sensitive stages.
  2. Latency-aware Optimization: Considers not only accuracy but also actual inference latency, automatically excluding configurations that skip more layers but incur higher latency.
  3. Dynamic Generation Control: Actively controls generation length through confidence monitoring, mitigating redundant token issues caused by layer skipping.

Experimental Setup

Datasets

Employs TinyThinker's three-stage reasoning pipeline, evaluated on three question-answering benchmarks:

  • OpenBookQA (OBQA): Open-book science question answering
  • CommonsenseQA (CSQA): Commonsense reasoning QA
  • StrategyQA: Yes/no questions requiring implicit multi-step reasoning

Evaluation Metrics

  • Accuracy: Question answering correctness rate
  • Speedup: Inference speed improvement relative to full-layer models
  • Latency: End-to-end inference time

Baseline Methods

  • SkipDecode: Progressive deep layer skipping
  • UnifiedSkip: Periodic layer skipping
  • AdaSkip: Sub-layer importance estimation based on cosine similarity

Implementation Details

  • Primarily uses TinyLlama-1.1B-Chat-v1.0 model
  • Training for 10 epochs with batch sizes 16 (OBQA/CSQA) or 24 (StrategyQA)
  • Learning rate: 5×10^-5
  • Self-consistency protocol with 10-iteration evaluation
  • Confidence threshold: 0.5, cache size: n=5

Experimental Results

Main Results

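LiteStage outperforms the baseline layer skipping methods on all three benchmark datasets. Relative to the full-layer model (Baseline Accuracy below), it reaches the following accuracy and speedup: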

Dataset      Baseline Accuracy   LiteStage Accuracy   Speedup
OBQA         64.0%               60.0%                1.32×
CSQA         54.8%               53.2%                1.16×
StrategyQA   62.4%               62.0%                1.70×
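
As a quick check of the reported ranges, the accuracy drops and the latency reductions implied by the speedups can be computed directly from the rows above. The sketch assumes that speedup is defined as full-model latency divided by LiteStage latency; the function name `summarize` is illustrative.

```python
# Rows copied from the results table: (dataset, baseline %, LiteStage %, speedup)
results = [
    ("OBQA", 64.0, 60.0, 1.32),
    ("CSQA", 54.8, 53.2, 1.16),
    ("StrategyQA", 62.4, 62.0, 1.70),
]

def summarize(rows):
    """Return {dataset: (accuracy drop in points, latency reduction in %)}."""
    summary = {}
    for name, base_acc, lite_acc, speedup in rows:
        drop = round(base_acc - lite_acc, 1)
        # A speedup of s means latency shrinks to 1/s of the original.
        latency_cut = round(100 * (1 - 1 / speedup), 1)
        summary[name] = (drop, latency_cut)
    return summary

for name, (drop, cut) in summarize(results).items():
    print(f"{name}: -{drop} accuracy points, {cut}% lower latency")
```

The drops span 0.4 to 4.0 points, matching the range quoted in the contributions.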

Key Findings

  1. Differential Stage Sensitivity: Single-stage skipping experiments confirm that Stage 3 is most sensitive to layer skipping, with its accuracy curve nearly determining the overall performance upper bound.
  2. Latency Paradox: More layer skipping does not always yield faster inference; due to increased generation length, certain configurations paradoxically result in higher latency.
  3. Confidence Patterns: Layer-skipped models exhibit monotonically decreasing token confidence, while full-layer models may recover confidence in later stages.
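
The latency paradox in finding 2 follows from simple arithmetic: end-to-end latency is roughly the number of generated tokens times the per-token cost. The sketch below assumes that per-token cost scales linearly with the number of layers executed, and every number in it is made up for illustration.

```python
def end_to_end_latency(num_tokens, layers_kept, total_layers, full_token_ms):
    """Decode time in ms if per-token cost scales with the layers executed."""
    return num_tokens * full_token_ms * layers_kept / total_layers

# Hypothetical 20-layer model at 10 ms per token with all layers active.
full = end_to_end_latency(num_tokens=100, layers_kept=20, total_layers=20, full_token_ms=10)
light = end_to_end_latency(num_tokens=110, layers_kept=15, total_layers=20, full_token_ms=10)
heavy = end_to_end_latency(num_tokens=220, layers_kept=10, total_layers=20, full_token_ms=10)
print(full, light, heavy)
```

In this toy setting, halving the per-token cost does not help once the output more than doubles in length, which is why the generation length has to be controlled alongside the layer budget.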

Ablation Studies

Effect of Non-uniform Layer Budgets:

  • At equivalent layer skipping counts, LiteStage achieves significantly higher accuracy than uniform skipping strategies
  • Performance gaps further widen as skipping increases

Contribution of Generation Early-exit:

  • Under light layer skipping, early-exit has minimal impact (-0.5% decoding steps)
  • Under heavy layer skipping, can reduce decoding steps by up to 82.5%
  • Accuracy remains stable, occasionally even improving

Case Analysis

Concrete examples from CSQA demonstrate that generation early-exit effectively truncates low-confidence redundant text while preserving core reasoning logic, maintaining consistent final answers.

Related Work

Multi-stage Generation

  • TinyThinker: Proposes three-stage reasoning cycle of Recall-Analysis-Summary
  • DeAR: Employs Decompose-Analyze-Reconsider process
  • CasCoD: Cascade-style distillation of decomposed reasoning chains
  • Self-Discover: Dynamically organizes reasoning structures

Layer Skipping Techniques

Training-based Methods:

  • LayerSkip, DeeBERT, EE-LLM: Intermediate layer early-exit
  • Mixture-of-Depths: Requires training models and routers

Training-free Methods:

  • SkipDecode: Progressive deep layer skipping
  • UnifiedSkip: Periodic layer skipping
  • ShortGPT: Cosine similarity-based layer importance for layer removal
  • AdaSkip: Sub-layer importance estimation

Generation Early-exit

Existing generation early-exit methods primarily target verbose reasoning models and do not address the longer generations caused by model compression.

Conclusions and Discussion

Main Conclusions

  1. Non-uniform Sensitivity in Multi-stage Reasoning: Different reasoning stages exhibit significantly varying sensitivity to layer compression, requiring differentiated optimization strategies.
  2. Necessity of Latency-aware Optimization: Pure layer skipping may degrade latency due to increased generation length, necessitating joint consideration of accuracy and latency.
  3. Effectiveness of Generation Control: Confidence-based generation early-exit effectively mitigates redundant generation caused by layer skipping.

Limitations

  1. Offline Search Overhead: Compared to other training-free methods, LiteStage's offline configuration requires more computational resources (approximately 1-7.6 hours).
  2. Model Architecture Dependency: Primarily validated on Llama-series models, with limited effectiveness on other architectures like Qwen.
  3. Scope Limitations: Specifically designed for multi-stage reasoning scenarios; applicability to single-stage reasoning remains insufficiently validated.

Future Directions

  1. Extension to More Model Architectures: Investigate skip sensitivity characteristics across different architectures
  2. Dynamic Budget Allocation: Develop runtime-adaptive layer budget adjustment mechanisms
  3. Multimodal Reasoning Optimization: Extend framework to vision-language and other multimodal reasoning tasks

In-depth Evaluation

Strengths

  1. Accurate Problem Identification: Precisely identifies key bottlenecks in multi-stage reasoning, including differential stage sensitivity and redundant generation.
  2. Reasonable Method Design: The offline-online combined framework design is elegant, ensuring optimization effectiveness while controlling runtime overhead.
  3. Comprehensive Experimental Design: Thoroughly validates method effectiveness through detailed motivation experiments, ablation studies, and case analyses.
  4. High Practical Value: As a training-free method, demonstrates strong potential for real-world applications.

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for differential stage sensitivity, relying primarily on empirical observations.
  2. Heuristic Parameter Settings: Key parameters such as confidence threshold and cache size are primarily heuristically determined, lacking systematic analysis.
  3. Limited Generalizability: Performance varies significantly across different model architectures, with room for improvement in generalization.

Impact

  1. Academic Contribution: First systematic study of layer skipping optimization in multi-stage reasoning, providing new perspectives for related research.
  2. Practical Value: Provides practical solutions for efficient inference in small language models, facilitating edge deployment.
  3. Reproducibility: Provides complete code implementation, facilitating subsequent research and applications.

Applicable Scenarios

LiteStage is particularly suitable for:

  • Resource-constrained edge device deployment
  • Complex tasks requiring multi-stage reasoning
  • Latency-sensitive real-time applications
  • Inference acceleration for small language models

References

The paper cites multiple important related works, including:

  • TinyThinker (Piao and Park, 2024): Representative work in multi-stage reasoning
  • AdaSkip (He et al., 2025): Latest methods in sub-layer skipping
  • Mixture-of-Depths (Raposo et al., 2024): Pioneering work in dynamic computation allocation

Overall Assessment: This paper proposes an innovative approach to layer skipping in multi-stage reasoning, contributing both empirical insights and practical speedups. Despite the limitations noted above, it opens new research directions for efficient inference in small language models and has clear academic and practical value.