LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Kang, Song, Kim

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

Basic Information

Paper ID: 2510.14211
Title: LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning
Authors: Beomseok Kang, Jiwon Song, Jae-Joon Kim (Seoul National University)
Classification: cs.CL, cs.AI
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.14211
Code Link: https://github.com/beomseokg/LiteStage

Abstract

Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capabilities of small language models by decomposing complex problems into sequential sub-stages. However, this approach comes at the cost of increased latency. The authors observe that existing adaptive acceleration techniques (such as layer skipping) struggle to balance efficiency and accuracy in this setting, facing two key challenges: (1) differential skip sensitivity across stages, and (2) generation of redundant output tokens. To address these issues, this paper proposes LiteStage, a latency-aware layer skipping framework tailored for multi-stage reasoning. LiteStage combines stage-wise offline search for optimal layer budget allocation with confidence-based online generation early-exit mechanisms to suppress unnecessary decoding. Experiments on three benchmarks—OBQA, CSQA, and StrategyQA—demonstrate that LiteStage achieves up to 1.70× speedup with less than 4.0% accuracy loss, outperforming previous training-free layer skipping methods.

Research Background and Motivation

Problem Definition

Multi-stage reasoning enhances the reasoning capabilities of small language models by decomposing complex problems into multiple sequential sub-problems. For example, TinyThinker employs three-stage reasoning: Recall, Analysis, and Summary. While this approach effectively improves reasoning quality, it inevitably increases inference latency.

Core Challenges

Through in-depth analysis, the authors identify two critical issues:

Differential Skip Sensitivity Across Stages: Different reasoning stages exhibit significantly varying sensitivity to layer skipping. Experiments demonstrate that Stage 3 (Summary stage) is most sensitive to layer skipping, while Stage 1 (Recall stage) is relatively robust.
Redundant Token Generation: Although layer skipping reduces per-token computational cost, it often causes the model to generate more tokens, which can paradoxically increase end-to-end latency.

Limitations of Existing Methods

Existing layer skipping methods (such as SkipDecode, UnifiedSkip, AdaSkip) typically employ uniform skipping strategies that cannot adapt to the characteristics of different stages in multi-stage reasoning, leading to:

Excessive compression in sensitive stages causing sharp accuracy degradation
Overlooking the issue of increased generation length caused by layer skipping
Lack of latency-aware optimization mechanisms

Core Contributions

Proposes LiteStage Framework: The first latency-aware layer skipping framework specifically designed for multi-stage reasoning, effectively addressing differential stage sensitivity and redundant token generation.
Stage-wise Layer Budget Allocation Strategy: Designs a greedy search algorithm proceeding from the slowest to fastest stage, allocating optimal layer skipping budgets for each reasoning stage.
Confidence-driven Generation Early-exit Mechanism: Introduces online confidence monitoring to dynamically terminate low-confidence redundant generation, further enhancing inference efficiency.
Significant Performance Improvements: Achieves 1.16-1.70× speedup on three benchmark datasets with only 0.4-4.0% accuracy loss, substantially surpassing existing training-free methods.

Methodology Details

Task Definition

Given a test dataset D, the objective is to find stage-wise layer budgets L that minimize the average inference latency while keeping the accuracy drop within a given threshold ε:

argmin_L (1/|D|) ∑_{d∈D} T(M_L(d))
subject to: A(M_L(d)) ≥ A(M(d)) - ε

where T and A denote inference latency and accuracy respectively, and M_L and M represent models with and without layer skipping.
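
To make the objective concrete, here is a minimal Python sketch of the selection rule, assuming accuracy and mean latency have already been measured for each candidate budget. The Candidate record, the pick_budget helper, and all numbers are hypothetical illustrations, not the authors' code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """One measured configuration (hypothetical record for illustration)."""
    budgets: Tuple[int, ...]  # layers skipped in each reasoning stage
    accuracy: float           # accuracy on the search set, in [0, 1]
    latency: float            # mean end-to-end latency per sample, seconds


def pick_budget(candidates: Sequence[Candidate], full_accuracy: float,
                eps: float) -> Optional[Candidate]:
    """Return the lowest-latency candidate with accuracy >= full_accuracy - eps."""
    feasible = [c for c in candidates if c.accuracy >= full_accuracy - eps]
    if not feasible:
        return None  # no configuration satisfies the accuracy constraint
    return min(feasible, key=lambda c: c.latency)


# Made-up measurements, for illustration only.
candidates = [
    Candidate((0, 0, 0), accuracy=0.64, latency=2.00),  # full model
    Candidate((4, 2, 0), accuracy=0.62, latency=1.60),
    Candidate((8, 4, 2), accuracy=0.57, latency=1.40),  # violates eps = 0.04
    Candidate((6, 6, 0), accuracy=0.61, latency=1.75),
]
best = pick_budget(candidates, full_accuracy=0.64, eps=0.04)
print(best.budgets)
```

The fastest candidate (1.40 s) is rejected because its accuracy falls below 0.64 - 0.04, so the rule returns the fastest configuration that still satisfies the constraint.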

Model Architecture

LiteStage comprises two complementary components:

1. Offline Configuration

Step 1: Layer Importance Estimation

Employs sub-layer-level cosine similarity as importance proxy
Separately computes importance for multi-head self-attention (MHSA) and feed-forward networks (FFN):

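I^(j)_MHSA = (1/N) ∑_{n=0}^{N-1} cos(MHSA^(j)(x_n) + x_n, x_n)
I^(j)_FFN = (1/N) ∑_{n=0}^{N-1} cos(FFN^(j)(x_n) + x_n, x_n)

where j indexes the layer and x_n is the n-th of the N sub-layer inputs being averaged over.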
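
The two formulas share one computation: the mean cosine similarity between a sub-layer's input and its residual output. The following is a small pure-Python sketch of that computation, assuming each input is a non-zero vector and the sub-layer is any callable that returns a vector of the same size. The names cosine and sublayer_score and the two toy sub-layers are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sublayer_score(sublayer: Callable[[Vector], List[float]],
                   inputs: Sequence[Vector]) -> float:
    """Mean of cos(f(x) + x, x) over all inputs x, where f is the sub-layer."""
    total = 0.0
    for x in inputs:
        residual_out = [fx + xi for fx, xi in zip(sublayer(x), x)]
        total += cosine(residual_out, x)
    return total / len(inputs)


# Toy 2-D sub-layers (illustrative stand-ins for MHSA or FFN).
def tiny_update(x: Vector) -> List[float]:
    return [0.01 * v for v in x]   # output stays parallel to the input


def rotation(x: Vector) -> List[float]:
    return [-x[1], x[0]]           # adds a component orthogonal to the input


inputs = [(1.0, 0.0), (0.5, 2.0), (-3.0, 1.0)]
score_tiny = sublayer_score(tiny_update, inputs)
score_rot = sublayer_score(rotation, inputs)
print(round(score_tiny, 4), round(score_rot, 4))
```

A score close to 1 means the sub-layer barely changes the direction of its input. How these scores are ranked into skip decisions follows the paper and is not reproduced here.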

Step 2: Layer Budget Search

Initiates greedy search from the slowest reasoning stage
Constructs accuracy-latency curves, selecting optimal latency configurations under accuracy constraints
Performs stage-by-stage optimization to accurately reflect inter-stage interactions
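
The search steps above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the authors' implementation: the evaluate callable stands in for running the full multi-stage pipeline on a search set, and the toy accuracy and latency model (including the SENSITIVITY values and the quadratic token-growth term) is invented for the example.

```python
from typing import Callable, Dict, Sequence, Tuple

Budgets = Dict[int, int]                             # stage index -> skipped layers
Evaluate = Callable[[Budgets], Tuple[float, float]]  # -> (accuracy, latency)


def greedy_stage_search(stage_latency: Sequence[float], options: Sequence[int],
                        evaluate: Evaluate, eps: float) -> Budgets:
    """Fix one stage's budget at a time, visiting stages from slowest to fastest."""
    budgets: Budgets = {s: 0 for s in range(len(stage_latency))}
    full_acc, _ = evaluate(budgets)
    order = sorted(budgets, key=lambda s: stage_latency[s], reverse=True)
    for stage in order:
        best_k, best_lat = budgets[stage], evaluate(budgets)[1]
        for k in options:
            # Earlier stages keep their chosen budgets, so interactions are measured.
            acc, lat = evaluate({**budgets, stage: k})
            if acc >= full_acc - eps and lat < best_lat:
                best_k, best_lat = k, lat
        budgets[stage] = best_k
    return budgets


# Toy model (hypothetical): the last stage is the most skip-sensitive, and heavy
# skipping lengthens generation, so latency is not monotonic in the budget.
STAGE_LATENCY = [1.0, 0.6, 0.4]
SENSITIVITY = [0.001, 0.002, 0.010]


def toy_evaluate(b: Budgets) -> Tuple[float, float]:
    acc = 0.64 - sum(SENSITIVITY[s] * k for s, k in b.items())
    lat = sum(STAGE_LATENCY[s] * (1 - 0.05 * k) * (1 + 0.01 * k * k)
              for s, k in b.items())
    return acc, lat


result = greedy_stage_search(STAGE_LATENCY, [0, 2, 4, 6, 8], toy_evaluate, eps=0.04)
print(result)
```

In this toy setting, the two robust stages stop at a budget of 4 because larger budgets are slower end to end, and the sensitive stage stops at 2 because a budget of 4 would break the accuracy constraint.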

2. Online Adjustment

Step 3: Generation Early-exit

Maintains confidence cache for the most recent n tokens
Computes the average confidence μ_Conf over the cache, terminating generation early when it falls below a threshold
Confidence is defined as the maximum logit value for each token
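
A minimal sketch of the early-exit rule follows. It treats confidence as an opaque per-token score supplied by the decoder and uses the cache size n=5 and threshold 0.5 reported under Implementation Details. Waiting until the cache is full before checking is an assumption of this sketch, and generate_with_early_exit and the sample stream are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, List, Tuple


def generate_with_early_exit(steps: Iterable[Tuple[str, float]], n: int = 5,
                             threshold: float = 0.5) -> List[str]:
    """Consume (token, confidence) pairs and stop once the mean confidence
    of the n most recent tokens falls below the threshold."""
    cache: Deque[float] = deque(maxlen=n)  # oldest confidence drops out automatically
    output: List[str] = []
    for token, confidence in steps:
        output.append(token)
        cache.append(confidence)
        if len(cache) == n and sum(cache) / n < threshold:
            break  # low-confidence tail: stop decoding this stage
    return output


# Made-up decoding trace: confidence decays and never recovers in time.
confidences = [0.9, 0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.9]
stream = [(f"t{i}", c) for i, c in enumerate(confidences)]
kept = generate_with_early_exit(stream)
print(len(kept), "of", len(stream), "tokens kept")
```

Averaging over a short window rather than testing a single token keeps one uncertain token from ending generation prematurely.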

Technical Innovations

Non-uniform Layer Budget Allocation: Adaptively allocates different layer skipping budgets according to stage-specific sensitivity differences, avoiding excessive compression in sensitive stages.
Latency-aware Optimization: Considers not only accuracy but also actual inference latency, automatically excluding configurations that skip more layers but incur higher latency.
Dynamic Generation Control: Actively controls generation length through confidence monitoring, mitigating redundant token issues caused by layer skipping.

Experimental Setup

Datasets

Employs TinyThinker's three-stage reasoning pipeline, evaluated on three question-answering benchmarks:

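OpenBookQA (OBQA): Multiple-choice elementary science QA that requires combining a provided fact with common knowledge
CommonsenseQA (CSQA): Multiple-choice commonsense reasoning QA
StrategyQA: Yes/no QA requiring implicit multi-step reasoning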

Evaluation Metrics

Accuracy: Question answering correctness rate
Speedup: Inference speed improvement relative to full-layer models
Latency: End-to-end inference time

Baseline Methods

SkipDecode: Progressive deep layer skipping
UnifiedSkip: Periodic layer skipping
AdaSkip: Sub-layer importance estimation based on cosine similarity

Implementation Details

Primarily uses TinyLlama-1.1B-Chat-v1.0 model
Training for 10 epochs with batch sizes 16 (OBQA/CSQA) or 24 (StrategyQA)
Learning rate: 5×10^-5
Self-consistency protocol with 10-iteration evaluation
Confidence threshold: 0.5, cache size: n=5

Experimental Results

Main Results

LiteStage outperforms the baseline layer-skipping methods on all three benchmark datasets. Accuracy and speedup relative to the full-layer model:

Dataset	Full-layer Accuracy	LiteStage Accuracy	Speedup
OBQA	64.0%	60.0%	1.32×
CSQA	54.8%	53.2%	1.16×
StrategyQA	62.4%	62.0%	1.70×
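
The accuracy losses quoted elsewhere in this summary (0.4 to 4.0 points) follow directly from the table. The short check below recomputes them, along with the relative latency implied by each speedup. It assumes speedup is the ratio of reference latency to LiteStage latency; the variable names are illustrative.

```python
# (reference accuracy %, LiteStage accuracy %, speedup), copied from the table above.
results = {
    "OBQA": (64.0, 60.0, 1.32),
    "CSQA": (54.8, 53.2, 1.16),
    "StrategyQA": (62.4, 62.0, 1.70),
}

losses = {}
relative_latency = {}
for name, (reference, litestage, speedup) in results.items():
    losses[name] = round(reference - litestage, 1)        # percentage points
    relative_latency[name] = round(100.0 / speedup, 1)    # % of reference latency
    print(f"{name}: -{losses[name]} points, {relative_latency[name]}% latency")
```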

Key Findings

Differential Stage Sensitivity: Single-stage skipping experiments confirm that Stage 3 is most sensitive to layer skipping, with its accuracy curve nearly determining the overall performance upper bound.
Latency Paradox: More layer skipping does not always yield faster inference; due to increased generation length, certain configurations paradoxically result in higher latency.
Confidence Patterns: Layer-skipped models exhibit monotonically decreasing token confidence, while full-layer models may recover confidence in later stages.
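
The latency paradox is easy to see with a simple cost model in which decoding latency is the number of generated tokens times the per-token cost. The token counts and per-token times below are hypothetical and chosen only to illustrate the effect.

```python
def end_to_end_ms(num_tokens: int, ms_per_token: int) -> int:
    """Decoding latency under a simple tokens-times-cost model."""
    return num_tokens * ms_per_token


full = end_to_end_ms(num_tokens=100, ms_per_token=10)          # full-layer model
skipped = end_to_end_ms(num_tokens=160, ms_per_token=7)        # cheaper tokens, longer output
skipped_exit = end_to_end_ms(num_tokens=90, ms_per_token=7)    # same skipping, truncated output

print(full, skipped, skipped_exit)
```

Here a 30% cheaper token does not help when the output grows by 60%, while truncating the redundant tail turns the same skip configuration into a net speedup.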

Ablation Studies

Effect of Non-uniform Layer Budgets:

At equivalent layer skipping counts, LiteStage achieves significantly higher accuracy than uniform skipping strategies
Performance gaps further widen as skipping increases

Contribution of Generation Early-exit:

Under light layer skipping, early-exit has minimal impact (-0.5% decoding steps)
Under heavy layer skipping, can reduce decoding steps by up to 82.5%
Accuracy remains stable, occasionally even improving

Case Analysis

Concrete examples from CSQA demonstrate that generation early-exit effectively truncates low-confidence redundant text while preserving core reasoning logic, maintaining consistent final answers.

Related Work

Multi-stage Generation

TinyThinker: Proposes three-stage reasoning cycle of Recall-Analysis-Summary
DeAR: Employs Decompose-Analyze-Reconsider process
CasCoD: Cascade-style distillation of decomposed reasoning chains
Self-Discover: Dynamically organizes reasoning structures

Layer Skipping Techniques

Training-based Methods:

LayerSkip, DeeBERT, EE-LLM: Intermediate layer early-exit
Mixture-of-Depths: Requires training models and routers

Training-free Methods:

SkipDecode: Progressive deep layer skipping
Unified Skipping: Periodic skipping
ShortGPT: Cosine similarity-based
AdaSkip: Sub-layer importance estimation

Generation Early-exit

Existing methods primarily target verbose reasoning models and do not address the longer generations caused by model compression.

Conclusions and Discussion

Main Conclusions

Non-uniform Sensitivity in Multi-stage Reasoning: Different reasoning stages exhibit significantly varying sensitivity to layer compression, requiring differentiated optimization strategies.
Necessity of Latency-aware Optimization: Pure layer skipping may degrade latency due to increased generation length, necessitating joint consideration of accuracy and latency.
Effectiveness of Generation Control: Confidence-based generation early-exit effectively mitigates redundant generation caused by layer skipping.

Limitations

Offline Search Overhead: Compared to other training-free methods, LiteStage's offline configuration requires more computational resources (approximately 1-7.6 hours).
Model Architecture Dependency: Primarily validated on Llama-series models, with limited effectiveness on other architectures like Qwen.
Scope Limitations: Specifically designed for multi-stage reasoning scenarios; applicability to single-stage reasoning remains insufficiently validated.

Future Directions

Extension to More Model Architectures: Investigate skip sensitivity characteristics across different architectures
Dynamic Budget Allocation: Develop runtime-adaptive layer budget adjustment mechanisms
Multimodal Reasoning Optimization: Extend framework to vision-language and other multimodal reasoning tasks

In-depth Evaluation

Strengths

Accurate Problem Identification: Precisely identifies key bottlenecks in multi-stage reasoning, including differential stage sensitivity and redundant generation.
Reasonable Method Design: The offline-online combined framework design is elegant, ensuring optimization effectiveness while controlling runtime overhead.
Comprehensive Experimental Design: Thoroughly validates method effectiveness through detailed motivation experiments, ablation studies, and case analyses.
High Practical Value: As a training-free method, demonstrates strong potential for real-world applications.

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for differential stage sensitivity, relying primarily on empirical observations.
Heuristic Parameter Settings: Key parameters such as confidence threshold and cache size are primarily heuristically determined, lacking systematic analysis.
Limited Generalizability: Performance varies significantly across different model architectures, with room for improvement in generalization.

Impact

Academic Contribution: First systematic study of layer skipping optimization in multi-stage reasoning, providing new perspectives for related research.
Practical Value: Provides practical solutions for efficient inference in small language models, facilitating edge deployment.
Reproducibility: Provides complete code implementation, facilitating subsequent research and applications.

Applicable Scenarios

LiteStage is particularly suitable for:

Resource-constrained edge device deployment
Complex tasks requiring multi-stage reasoning
Latency-sensitive real-time applications
Inference acceleration for small language models

References

The paper cites multiple important related works, including:

TinyThinker (Piao and Park, 2024): Representative work in multi-stage reasoning
AdaSkip (He et al., 2025): Latest methods in sub-layer skipping
Mixture-of-Depths (Raposo et al., 2024): Pioneering work in dynamic computation allocation

Overall Assessment: This paper proposes an innovative solution to layer skipping optimization in multi-stage reasoning, contributing both empirical insights and practical effectiveness. Despite the limitations noted above, including the lack of theoretical analysis, it opens new research directions for efficient inference in small language models and has both academic and practical value.