
Order Matters: Rethinking Prompt Construction in In-Context Learning

Li, Wang, Wang et al.

In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.

Basic Information

Paper ID: 2511.09700
Title: Order Matters: Rethinking Prompt Construction in In-Context Learning
Authors: Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang (UC San Diego & Cushing Academy)
Classification: cs.CL (Computation and Language)
Publication Date: November 12, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2511.09700

Abstract

This paper challenges a fundamental assumption in the in-context learning (ICL) field: that example selection is more important than example ordering. Through systematic experiments on classification and generation tasks, the authors find that performance fluctuations caused by example ordering are comparable to the impact of completely replacing the example set. The research covers multiple open-source model families ranging from 0.5B to 27B parameters and GPT-5. Furthermore, the study demonstrates that strong orderings approaching oracle performance can be identified using only a development set. These findings call for a re-examination of prompt construction strategies in ICL, emphasizing that example selection and ordering are equally important.

Research Background and Motivation

1. Problem to Be Addressed

In in-context learning, large language models perform new tasks by conditioning on a small number of examples, without gradient updates or task-specific fine-tuning. ICL performance is known to be sensitive to the examples provided, yet most existing research assumes that example selection matters more than example ordering, so research effort has concentrated on selection.

2. Importance of the Problem

Practical Significance: If ordering is as important as selection, the current research paradigm focusing solely on example selection may miss an important dimension for performance improvement
Theoretical Significance: Understanding ordering sensitivity helps reveal the context processing mechanisms of LLMs
Application Value: Optimizing ordering may improve model performance without retraining or collecting new examples, at the cost of only extra inference on a development set

3. Limitations of Existing Methods

Research Bias: Most work implicitly assumes ordering is a secondary factor, lacking systematic quantitative comparisons
Methodological Flaws: Previous research often conflates the effects of ordering and selection when comparing them
Insufficient Practical Guidance: Lack of effective methods for identifying optimal orderings in real applications

4. Research Motivation

The authors use controlled experimental design to independently vary selection and ordering, systematically quantifying the relative impact of both factors and challenging conventional wisdom in the field.

Core Contributions

Quantitative Evidence: Through controlled experiments, the authors show that the performance impact of example ordering is comparable to that of example selection, with an average ordering-sensitivity standard deviation of 0.01970 versus a selection sensitivity of 0.02251 (only 14% higher)
Practical Method: Proposes a development set-based ordering identification method that requires evaluating only 64-128 candidate permutations to recover near-oracle performance (99% for classification tasks, 95% for generation tasks)
Systematic Analysis: Comprehensive evaluation across 8 datasets, 14 models (0.5B-27B parameters), and two task types (classification/generation)
Important Findings:
- Ordering effects do not vary monotonically with model scale
- Generation tasks are more sensitive to selection (r=1.46), while classification tasks show nearly equal sensitivity to both (r=1.09)
- Optimal ordering is highly dataset-dependent with poor cross-dataset transferability

Methodology Details

Task Definition

The research focuses on few-shot in-context learning with tasks including:

Classification Tasks: Given k annotated examples and a test input, predict the class label
Generation Tasks: Given k examples and a query, generate a free-form answer

Core Research Question: Quantify the relative impact of example ordering and example selection on ICL performance

Experimental Design Framework

1. Default Ordering Definition

To isolate the effects of ordering and selection, a consistent default ordering is defined:

Classification Tasks: Group by label alphabetically, sort within groups alphabetically
Generation Tasks: Sort all examples alphabetically
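
The default ordering described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `(text, label)` tuple representation, the choice of the input text as the alphabetical sort key, and the function names are all assumptions. The point it shows is that once every example set has a canonical order, a permutation of positions means the same thing for every set.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (input text, label or answer).
Example = Tuple[str, str]


def default_order_classification(examples: Sequence[Example]) -> List[Example]:
    # Group by label alphabetically, then sort by input text within each group.
    return sorted(examples, key=lambda ex: (ex[1], ex[0]))


def default_order_generation(examples: Sequence[Example]) -> List[Example]:
    # Sort all examples alphabetically by input text (assumed sort key).
    return sorted(examples, key=lambda ex: ex[0])


def apply_permutation(ordered: Sequence[Example], perm: Sequence[int]) -> List[Example]:
    # perm[p] is the default-order index of the example placed at prompt position p.
    assert sorted(perm) == list(range(len(ordered))), "perm must be a permutation"
    return [ordered[i] for i in perm]


demo = [
    ("stocks fall", "Business"),
    ("team wins", "Sports"),
    ("banks merge", "Business"),
    ("coach quits", "Sports"),
]
base = default_order_classification(demo)
shuffled = apply_permutation(base, [2, 0, 3, 1])
```

Because `apply_permutation` indexes into the default order, the same permutation can be reused across different example sets, which is what allows ordering to be held fixed while selection varies.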

2. Controlled Variable Experiments

Construct M=10 different example sets $S_1, \dots, S_M$ and evaluate P=10 random permutations $\pi_1, \dots, \pi_P$ for each set, giving an accuracy matrix:

$A = [a_{i,j}]$, where $a_{i,j} = \text{Acc}(S_i, \pi_j \mid D_{\text{test}})$

Sensitivity Metrics

Order Sensitivity

Calculate the standard deviation of different permutations for each example set, then average:

$\sigma^{(M)} = \frac{1}{M}\sum_{i=1}^{M} \text{std}(a_{i,1}, ..., a_{i,P})$

This measures the impact of changing ordering while fixing the example set.

Selection Sensitivity

Calculate the standard deviation of different example sets for each permutation, then average:

$\sigma^{(P)} = \frac{1}{P}\sum_{j=1}^{P} \text{std}(a_{1,j}, ..., a_{M,j})$

This measures the impact of changing the example set while fixing the ordering.

Relative Importance Ratio

$r = \frac{\sigma^{(P)}}{\sigma^{(M)}}$

r ≈ 1: Both factors have comparable impact
r > 1: Selection is more important
r < 1: Ordering is more important
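
The three quantities above follow directly from the accuracy matrix. The sketch below is one way to compute them, assuming rows index example sets and columns index permutations. The use of the population standard deviation (`statistics.pstdev`) is an assumption, since the summary does not say whether the population or sample form is used, and the toy matrix is made up for illustration.

```python
from statistics import mean, pstdev
from typing import Sequence

Matrix = Sequence[Sequence[float]]  # acc[i][j] = accuracy of set i under permutation j


def order_sensitivity(acc: Matrix) -> float:
    # Std across permutations within each example set (row), averaged over sets.
    return mean(pstdev(row) for row in acc)


def selection_sensitivity(acc: Matrix) -> float:
    # Std across example sets within each permutation (column), averaged over permutations.
    columns = list(zip(*acc))
    return mean(pstdev(col) for col in columns)


def importance_ratio(acc: Matrix) -> float:
    # r > 1: selection matters more; r < 1: ordering matters more.
    return selection_sensitivity(acc) / order_sensitivity(acc)


# Toy 2 x 3 matrix (illustrative values only, not from the paper).
toy = [
    [0.70, 0.74, 0.72],
    [0.80, 0.78, 0.82],
]
r = importance_ratio(toy)
```

In this toy matrix the two rows differ far more from each other than the entries within a row do, so `r` comes out above 1, which is the "selection is more important" regime.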

Method for Finding Optimal Ordering

Algorithm Flow (Algorithm 1)

Input: Example sets S₁, ..., S_M (M=10), development set Ddev, test set Dtest, number of permutations P=128
For each example set Sᵢ (i = 1, ..., M):
    1. Generate P random permutations {πⱼ}, j = 1, ..., P
    2. Evaluate each permutation on the development set: aⱼ = Acc(Sᵢ, πⱼ | Ddev)
    3. Select the best permutation on the development set: π* = argmaxⱼ aⱼ
    4. Evaluate it on the test set: a* = Acc(Sᵢ, π* | Dtest)
    5. Record oracle performance: amax = maxⱼ Acc(Sᵢ, πⱼ | Dtest)
Return: {(a*, amax)} for each Sᵢ
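
The loop body of the algorithm above can be sketched for a single example set as follows. This is an illustration under stated assumptions rather than the authors' implementation: the `Scorer` callable, the `search_ordering` name, and the `toy_score` stand-in are hypothetical, and a real scorer would build the prompt and query a model.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer: takes an ordered example list and a split name
# ("dev" or "test") and returns accuracy in [0, 1].
Scorer = Callable[[Sequence[str], str], float]


def search_ordering(
    examples: Sequence[str],
    score: Scorer,
    num_perms: int = 128,
    seed: int = 0,
) -> Tuple[List[str], float, float]:
    """Return (dev-selected ordering, its test accuracy, oracle test accuracy)."""
    rng = random.Random(seed)

    # Step 1: sample candidate orderings (duplicates are possible for small k).
    candidates = []
    for _ in range(num_perms):
        order = list(examples)
        rng.shuffle(order)
        candidates.append(order)

    # Step 2: score every candidate on the development split.
    dev_acc = [score(order, "dev") for order in candidates]

    # Step 3: keep the candidate with the highest development accuracy.
    best = max(range(num_perms), key=lambda j: dev_acc[j])

    # Steps 4-5: test accuracy of the selected ordering, plus the oracle
    # maximum over all candidates (reported only as an upper bound).
    test_acc = [score(order, "test") for order in candidates]
    return candidates[best], test_acc[best], max(test_acc)


def toy_score(order: Sequence[str], split: str) -> float:
    # Stand-in for a real model call: rewards placing "a" late in the prompt.
    position_bonus = 0.1 * order.index("a") / (len(order) - 1)
    return 0.5 + position_bonus - (0.02 if split == "test" else 0.0)


best_order, highest_dev, oracle = search_ordering(["a", "b", "c", "d"], toy_score, num_perms=16)
```

Test labels enter only through the two returned accuracies; the choice of ordering depends on the development split alone, which is what separates the Highest-Dev result from the oracle.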

Key Parameter Studies

Number of Permutations P: Study impact from 16 to 128
Development Set Size |Ddev|: Study impact from 50 to 1000 samples
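
One inexpensive way to run the permutation-count study is to score the full pool of candidates once and then replay smaller budgets on prefixes of that pool. The sketch below assumes exactly that design; the summary does not say how the paper subsamples candidates, and `prefix_study` and its inputs are hypothetical names.

```python
from typing import Dict, Sequence, Tuple


def prefix_study(
    dev_acc: Sequence[float],
    test_acc: Sequence[float],
    budgets: Sequence[int] = (16, 32, 64, 128),
) -> Dict[int, Tuple[float, float]]:
    """Map each budget P to (test accuracy of the dev-selected ordering, oracle test accuracy)."""
    results = {}
    for p in budgets:
        if p > len(dev_acc):
            continue  # not enough scored candidates for this budget
        # Pick the best of the first p candidates using development accuracy only.
        best = max(range(p), key=lambda j: dev_acc[j])
        results[p] = (test_acc[best], max(test_acc[:p]))
    return results


# Tiny made-up example with four candidates and budgets of 2 and 4.
curve = prefix_study([0.5, 0.7, 0.6, 0.9], [0.55, 0.65, 0.70, 0.60], budgets=(2, 4))
```

The development-set-size study would need a second loop that rescores candidates on subsampled development sets, which is not shown here.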

Technical Innovations

Experimental Design Innovation: Through default ordering definition, first achieves complete decoupling of selection and ordering effects
Measurement Method: Proposes grouped standard deviation as a unified sensitivity metric, enabling fair comparison of both factors
Practical Balance: Method requires no oracle access to test labels, only a small-scale development set (250 samples suffice)
Systematic Evaluation: First comprehensive comparison of ordering vs. selection across multiple models, tasks, and scales

Experimental Setup

Datasets

Classification Tasks (5 datasets)

Dataset	Number of Classes	Examples k
AG News	4	8
NYT-Topics	9	18
NYT-Locations	10	20
DBPedia	14	28
MMLU	4	8

Generation Tasks (3 datasets)

GSM8K: Math word problems (k=8)
MMLU-Pro: Multi-task understanding (k=8)
MATH: Math problem solving (k=8)

Data Split:

Development Set Ddev: 1000 samples (for ordering selection)
Test Set Dtest: 500 samples (for final evaluation)
Classification tasks use oversampling to ensure class balance

Evaluation Metrics

Classification Tasks: Accuracy
Generation Tasks: Exact Match or numerical tolerance matching
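
For the generation metric, a scoring function along the following lines would combine exact match with a numerical tolerance. It is a sketch only: the tolerance value, the whitespace and case normalization, and the comma stripping are assumptions, and extracting the final answer from a model's free-form output is not shown.

```python
import math
import re


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before string comparison.
    return re.sub(r"\s+", " ", text.strip().lower())


def is_correct(prediction: str, reference: str, rel_tol: float = 1e-4) -> bool:
    """Numeric answers match within a tolerance; everything else needs an exact match."""
    try:
        pred_val = float(prediction.replace(",", ""))
        ref_val = float(reference.replace(",", ""))
    except ValueError:
        # At least one side is not a number: fall back to normalized exact match.
        return normalize(prediction) == normalize(reference)
    return math.isclose(pred_val, ref_val, rel_tol=rel_tol, abs_tol=1e-9)
```

For example, `is_correct("1,200", "1200.0")` is true under these assumptions, while `is_correct("3.14", "3.15")` is not.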

Comparison Methods

Average: Average performance across all random permutations (baseline)
Highest-Dev: Optimal permutation selected on development set evaluated on test set (proposed method)
Max: Optimal permutation across all permutations on test set (oracle upper bound)

Implementation Details

Model Coverage (14 open-source models plus GPT-5-Nano)

Qwen2.5 Series: 0.5B, 1.5B, 3B, 7B
Gemma-2 Series: 2B, 9B
Gemma Series: 2B, 7B
Llama 3 Series: 1B, 3B, 8B
DeepSeek-R1-Distill: 1.5B, 7B
Gemma-3: 27B
GPT-5-Nano

Experimental Parameters

Sensitivity Experiments: M=10 example sets, P=10 permutations
Ordering Search Experiments: M=10 example sets, P=128 permutations
Development Set Size Study: 50-1000 samples

Experimental Results

Main Results: Ordering vs. Selection Sensitivity

Overall Findings

Order Sensitivity: σ^(M) = 0.01970
Selection Sensitivity: σ^(P) = 0.02251
Relative Difference: Selection is only 14% higher than ordering

This result challenges conventional wisdom, suggesting that the importance of ordering has been substantially underestimated.
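
As a quick arithmetic check on the two averages reported above, the relative importance ratio and its inverse work out to:

$r = \frac{\sigma^{(P)}}{\sigma^{(M)}} = \frac{0.02251}{0.01970} \approx 1.14, \qquad \frac{\sigma^{(M)}}{\sigma^{(P)}} = \frac{0.01970}{0.02251} \approx 0.875$

So selection sensitivity is about 14% higher than ordering sensitivity, or equivalently ordering sensitivity is about 88% of selection sensitivity.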

Analysis by Model Scale (Table 2 Key Findings)

Model	Scale	Ordering	Selection	r Value
Qwen2.5	0.5B	0.0223	0.0245	1.10
Qwen2.5	7B	0.0119	0.0155	1.30
Gemma-3	27B	0.0157	0.0262	1.67
GPT-5-Nano	-	0.0234	0.0198	0.85

Key Insights:

Smaller Models More Sensitive: Qwen2.5-0.5B's ordering sensitivity is roughly 1.9x that of Qwen2.5-7B (0.0223 vs. 0.0119), and its selection sensitivity is roughly 1.6x (0.0245 vs. 0.0155)
No Monotonic Trend: r values do not vary monotonically with model scale
Proprietary Model Anomaly: GPT-5-Nano is more sensitive to ordering than to selection (r<1), possibly reflecting different training strategies

Analysis by Task Type (Table 3)

Task Type	Ordering	Selection	r Value
Classification (Average)	0.0226	0.0246	1.09
Generation (Average)	0.0154	0.0222	1.46

Important Findings:

Classification Tasks: Ordering and selection are nearly equally important (r≈1)
Generation Tasks: Selection is relatively more important (r=1.46), but ordering sensitivity is still roughly 68% of selection sensitivity (1/1.46 ≈ 0.68)

Dataset-Level Differences

Cases Where Ordering Is Comparable or More Important:

NYT-Topics: r=0.97 (ordering slightly more influential)
AG News: r=1.01 (nearly equal)

Cases Where Selection is More Important:

GSM8K: r=1.58
MATH: r=1.33

This indicates that task characteristics influence the relative importance of both factors.

Effectiveness of Finding Optimal Ordering

Classification Task Results (Figures 3a, 3c)

Impact of Number of Permutations P:
- P=16: Recovers 98% of oracle performance
- P=128: Recovers 99% of oracle performance
- Average performance consistently lags optimal performance by 5-6 percentage points
Impact of Development Set Size:
- 50 samples: Already shows clear effect
- 250 samples: Performance stabilizes
- 1000 samples: Diminishing marginal returns

Generation Task Results (Figures 3b, 3d)

Impact of Number of Permutations P:
- P=64-100: Recovers 95% of oracle performance
- Requires more permutations to match classification task effectiveness
Development Set Size: Similarly stabilizes after 250 samples

Specific Dataset Performance (Tables 5, 6)

Classification Task Example (DBPedia, Qwen2.5-7B):

Average: 0.774
Highest-Dev: 0.795
Max: 0.800
Improvement: +2.1 percentage points (2.7% relative improvement)

Generation Task Example (GSM8K, Llama-3.1-8B):

Average: 0.658
Highest-Dev: 0.669
Max: 0.696
Improvement: +1.1 percentage points, but still gap from oracle
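
The two examples above can be related to the "fraction of oracle performance recovered" figures quoted earlier. The sketch below assumes that recovery means the ratio of Highest-Dev to Max accuracy; the summary does not state the exact definition, so treat that as an assumption.

```python
def oracle_recovery(highest_dev: float, oracle_max: float) -> float:
    # Assumed definition: share of the oracle accuracy reached by the dev-selected ordering.
    return highest_dev / oracle_max


def gain_over_average(highest_dev: float, average: float) -> float:
    # Absolute improvement over the mean accuracy across random orderings.
    return highest_dev - average


dbpedia_recovery = oracle_recovery(0.795, 0.800)   # DBPedia, Qwen2.5-7B
gsm8k_recovery = oracle_recovery(0.669, 0.696)     # GSM8K, Llama-3.1-8B
dbpedia_gain = gain_over_average(0.795, 0.774)
gsm8k_gain = gain_over_average(0.669, 0.658)
```

Under this reading the DBPedia example recovers a little over 99% of the oracle and the GSM8K example about 96%, in line with the pattern that generation tasks leave a larger gap to the oracle.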

Ordering Transferability Experiments (Table 7)

Cross-Dataset Transfer (GSM8K ↔ MATH)

Model	GSM8K Optimal	MATH Optimal	GSM8K→MATH	MATH→GSM8K	Transfer Rate
Qwen2.5-7B	0.616	0.244	0.207	0.593	0.905
Average	0.439	0.188	0.145	0.400	0.798

Key Findings:

Post-transfer performance approaches the average performance over random orderings on the target dataset
Average transfer rate is only 79.8%, indicating that optimal ordering is highly dataset-dependent
Even for related tasks (two math datasets), ordering is difficult to transfer

Ablation Experiments: Key Factor Analysis

While the paper does not explicitly label ablation experiments, parameter variation experiments reveal:

Marginal Effect of Permutation Count P:
- 16→32: Significant improvement
- 32→64: Moderate improvement
- 64→128: Diminishing marginal returns
Threshold Effect of Development Set Size:
- <250 samples: Rapid performance improvement
- 250 samples: Performance plateaus
- Recommendation: Use 250-500 sample development set in practice

Case Analysis

The paper does not provide specific qualitative analysis examples, but numerical results suggest:

Maximum Fluctuation Case (Table 4):

Llama-3.1-8B on DBPedia:
- Order Sensitivity: 0.08791
- Selection Sensitivity: 0.13226
- Taking two standard deviations as a rough band, changing ordering alone corresponds to about ±17.6 percentage points of accuracy fluctuation (2 × 0.08791 ≈ 0.176)

Most Stable Case:

Gemma-3-27B on most tasks:
- Order Sensitivity: 0.00545-0.00802
- Suggests larger models can be more robust to ordering, though the trend is not strictly monotonic across model sizes

Related Work

1. Research on Prompt Ordering Sensitivity

Zhao et al. (2021): Showed that GPT-3's few-shot accuracy is highly sensitive to prompt choices including example ordering, with accuracy fluctuating by tens of percentage points, and attributed this partly to biases such as a recency bias toward labels appearing near the end of the prompt
Lu et al. (2022): Showed that some orderings achieve near-SOTA performance while others drop accuracy to near-random levels

This Paper's Contribution: First quantitatively compares relative impact of ordering and selection, rather than merely observing ordering's existence

2. Example Selection vs. Ordering Effects

Min et al. (2022): Emphasized importance of example selection
Rubin et al. (2022): Proposed retrieval-based example selection methods
Zhang et al. (2022), Guo et al. (2024): Recent research begins recognizing that ordering may be equally important as selection

This Paper's Contribution: Through controlled experimental design, first provides quantitative comparison of both factors' impact (r value)

3. Strategies for Mitigating Ordering Sensitivity

Heuristic Methods: Sampling permutations on development set (Zhao et al., 2021; Zhang et al., 2022)
Adaptive Methods: Dynamic reordering based on test queries (Guo et al., 2024)
Reinforcement Learning: RL-based search (Bhope et al., 2023)

This Paper's Contribution: Proposes a simple yet effective development-set selection method, showing that near-optimal ordering can be achieved without complex algorithms

4. How This Paper Extends Prior Work

This paper extends existing work in the following aspects:

Broader Scope: 14 models, 8 datasets, classification + generation tasks
More Rigorous Methods: Achieves complete decoupling of ordering and selection comparison through default ordering
More Systematic Findings: Quantifies relative impact, studies transferability, analyzes model scale effects

Conclusions and Discussion

Main Conclusions

Core Finding: The performance impact of example ordering is comparable to example selection, with ordering sensitivity averaging 88% of selection sensitivity (r=1.14)
Practical Method: Evaluating 64-128 permutations and 250 development samples suffices to find near-optimal ordering
Universality: This finding holds across models from 0.5B to 27B parameters, classification and generation tasks
Specificity: Optimal ordering is highly dataset-dependent with poor cross-dataset transferability (transfer rate 79.8%)
Model Scale Effects: Smaller models are more sensitive, but relative importance of ordering vs. selection does not vary monotonically with scale

Limitations

Acknowledged by Authors

Model Coverage: Does not include full GPT-5 and Claude due to budget and API constraints
Language Limitation: Only evaluates English tasks, does not consider multilingual scenarios
Task Types: Does not cover code generation, retrieval-augmented generation, dialogue, etc.
Evaluation Metrics: Only uses accuracy, does not consider other dimensions (calibration, robustness)

Other Potential Limitations

Example Quantity: k values fixed at 2|C| or 8, does not systematically study impact of different shot numbers
Default Ordering Definition: While alphabetical ordering is reasonable, it may introduce minor biases
Computational Cost: Evaluating 128 permutations × 10 example sets still requires substantial computation, may require trade-offs in practical applications
Insufficient Theoretical Explanation: Lacks deep mechanistic analysis of why ordering is so important

Future Directions

Directions Proposed by Paper

Test larger-scale models (full GPT-5)
Extend to other languages
Explore different shot regimes (few-shot, many-shot)
Evaluate code generation and RAG tasks

Other Directions Worth Exploring

Mechanistic Research: Use attention visualization and other methods to understand underlying causes of ordering sensitivity
Automated Methods: Develop adaptive ordering optimization algorithms without requiring development sets
Cross-Task Transfer: Research whether task-agnostic ordering strategies can be learned
Interaction with Other Factors: Study joint optimization of ordering with prompt templates and instructions

In-Depth Evaluation

Strengths

1. Methodological Rigor ⭐⭐⭐⭐⭐

Controlled Experimental Design: Achieves complete decoupling of selection and ordering through default ordering, avoiding confounding factors
Systematic Evaluation: 14 models × 8 datasets × 2 task types provides broad coverage
Reasonable Metrics: Grouped standard deviation as unified metric enables direct comparison of both factors

2. Importance of Findings ⭐⭐⭐⭐⭐

Challenges Conventional Wisdom: Shows ordering and selection have comparable impact, challenging field assumptions
High Practical Value: Ordering optimization provides a 2-3 percentage point performance improvement without additional training or new examples
Theoretical Significance: Reveals LLM sensitivity to context structure, providing new perspective for understanding model behavior

3. Strong Practicality ⭐⭐⭐⭐

Simple Method: No complex algorithms needed, only requires evaluating candidate permutations on development set
Reasonable Resource Requirements: 250-sample development set + 64 permutations achieves good results
Easy to Reproduce: Paper provides detailed experimental setup and pseudocode

4. Clear Writing ⭐⭐⭐⭐⭐

Logical Structure: Clear progression from motivation to methods to experiments
Effective Visualization: Figure 1's matrix diagram intuitively shows experimental design
Comprehensive Data: Appendix provides complete model-dataset level results

Weaknesses

1. Insufficient Theoretical Explanation ⭐⭐

Lacks Mechanistic Analysis: Does not deeply explore why ordering is so important
No Attention Analysis: Does not verify hypotheses through attention weights
Lacks Interpretability: Does not analyze what makes an ordering "good"

2. Experimental Design Limitations ⭐⭐⭐

Permutation Sampling Strategy: Random sampling may miss certain special effective ordering patterns
Default Ordering Impact: Alphabetical ordering itself may not be truly "neutral" baseline
Example Set Construction: M=10 may be insufficient to fully represent selection diversity

3. Insufficient Transferability Research ⭐⭐

Limited Dataset Pairs: Only tests GSM8K and MATH, both math tasks, lacks cross-domain testing
Lacks Failure Analysis: Does not deeply investigate why transfer fails
Missing Positive Transfer Cases: Are there scenarios where ordering can transfer?

4. Limited Practical Application Guidance ⭐⭐⭐

No Ordering Design Principles: Does not summarize practical heuristic rules for ordering construction
Insufficient Cost Analysis: Does not quantify actual time and API costs of evaluating 128 permutations
Multi-Example-Set Scenarios: How to simultaneously optimize example sets and ordering in practice?

Impact Assessment

1. Contribution to Field ⭐⭐⭐⭐⭐

Paradigm Shift: May trigger shift in ICL research from "selection-focused" to "selection + ordering equally important"
Inspires Follow-up Research: Expected to catalyze numerous works on ordering optimization and mechanistic understanding
Practical Impact: May change industry best practices in prompt engineering

2. Practical Value ⭐⭐⭐⭐

Immediately Applicable: Simple method can be applied to existing systems
High Cost-Benefit: Small cost yields significant improvement (2-3 percentage points)
Broad Applicability: Effective across models and tasks

3. Reproducibility ⭐⭐⭐⭐

Strengths:
- Uses public models and datasets
- Provides detailed hyperparameter settings
- Appendix includes complete results
Weaknesses:
- Code not open-sourced (as of paper publication)
- Some experiments require substantial computational resources

4. Potential Citation Value

This paper is expected to become an important reference in ICL literature because:

Provides benchmark comparison data for ordering vs. selection
Simple method facilitates reproduction and extension in follow-up work
Challenges fundamental field assumptions with milestone significance

Applicable Scenarios

Highly Applicable ✅

Few-shot Classification Tasks: Paper shows ordering matters most relative to selection on classification (r≈1)
Resource-Constrained Scenarios: When unable to expand example sets, ordering optimization is low-cost improvement
Fixed Example Set Scenarios: Some applications have fixed example sets, making ordering optimization the only option
Sufficient Development Set Scenarios: When 250+ annotated samples available for ordering selection

Moderately Applicable ⚠️

Generation Tasks: Slightly weaker than classification (r=1.46), but still worth attempting
Cross-Task Applications: Requires re-searching ordering for each new task
Large Model Applications: While more stable, large models still exhibit ordering sensitivity

Not Applicable ❌

Zero-Shot Scenarios: Method depends on multi-example ICL
Extremely Small Development Sets: <50 samples shows unstable performance
Real-Time Interactive Systems: Cannot pre-evaluate 128 permutations
Cross-Domain Transfer: Ordering learned from one dataset transfers poorly to others

Implications for Future Research

Re-examine ICL Assumptions: Are other factors assumed secondary (e.g., example format, label word choice) also underestimated?
Joint Optimization Framework: Future work should develop methods simultaneously optimizing selection and ordering, rather than handling independently
Mechanistic Research: Urgent need for theoretical work explaining roots of ordering sensitivity (position bias? attention mechanisms?)
Adaptive Methods: Develop online ordering optimization algorithms without requiring development sets
Robustness Research: How to train models insensitive to ordering?

Key References

Brown et al. (2020) - Language Models are Few-Shot Learners (GPT-3 paper, establishes ICL paradigm)
Zhao et al. (2021) - Calibrate Before Use: Improving Few-Shot Performance of Language Models (Early study of few-shot prompt sensitivity, including ordering)
Lu et al. (2022) - Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Min et al. (2022) - Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (Emphasizes example selection)
Guo et al. (2024) - DEmO: Dynamic Example Ordering for In-Context Learning (Dynamic ordering optimization)

Summary Evaluation

This is a high-quality, high-impact research work whose core value lies in:

Challenging Field Assumptions: Provides rigorous evidence that ordering and selection are comparably important
Providing Practical Solutions: Simple yet effective development set selection method
Strong Systematicity: Comprehensive evaluation across models, tasks, and scales
High Inspirational Value: Points multiple important directions for future research

Main weaknesses are insufficient theoretical explanation and limited transferability research, but these do not diminish its status as an important contribution to ICL literature.

Recommended For: All researchers and engineers working on ICL, prompt engineering, and LLM applications.

Rating: ⭐⭐⭐⭐½ (4.5/5)