Order Matters: Rethinking Prompt Construction in In-Context Learning

Li, Wang, Wang et al.
In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.

Basic Information

  • Paper ID: 2511.09700
  • Title: Order Matters: Rethinking Prompt Construction in In-Context Learning
  • Authors: Warren Li, Yiqian Wang, Zihan Wang, Jingbo Shang (UC San Diego & Cushing Academy)
  • Classification: cs.CL (Computation and Language)
  • Publication Date: November 12, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2511.09700

Abstract

This paper challenges a fundamental assumption in the in-context learning (ICL) field: that example selection is more important than example ordering. Through systematic experiments on classification and generation tasks, the authors find that performance fluctuations caused by example ordering are comparable to the impact of completely replacing the example set. The research covers multiple open-source model families ranging from 0.5B to 27B parameters and GPT-5. Furthermore, the study demonstrates that strong orderings approaching oracle performance can be identified using only a development set. These findings call for a re-examination of prompt construction strategies in ICL, emphasizing that example selection and ordering are equally important.

Research Background and Motivation

1. Problem to Be Addressed

In in-context learning, large language models perform new tasks by conditioning on a small number of examples without gradient updates or task-specific fine-tuning. While ICL performance is known to be sensitive to the examples provided, most existing research assumes that example selection matters more than example ordering, so research effort has concentrated on example selection.

2. Importance of the Problem

  • Practical Significance: If ordering is as important as selection, the current research paradigm focusing solely on example selection may miss an important dimension for performance improvement
  • Theoretical Significance: Understanding ordering sensitivity helps reveal the context processing mechanisms of LLMs
  • Application Value: Optimizing ordering may improve model performance without retraining or additional examples

3. Limitations of Existing Methods

  • Research Bias: Most work implicitly assumes ordering is a secondary factor, lacking systematic quantitative comparisons
  • Methodological Flaws: Previous research often conflates the effects of ordering and selection when comparing them
  • Insufficient Practical Guidance: Lack of effective methods for identifying optimal orderings in real applications

4. Research Motivation

The authors use controlled experimental design to independently vary selection and ordering, systematically quantifying the relative impact of both factors and challenging conventional wisdom in the field.

Core Contributions

  1. Quantitative Evidence: Through controlled experiments, the authors show that the performance impact of example ordering is comparable to that of example selection: the average order sensitivity (standard deviation) is 0.01970 versus 0.02251 for selection sensitivity, so selection is only about 14% higher
  2. Practical Method: Proposes a development set-based ordering identification method that requires evaluating only 64-128 candidate permutations to recover near-oracle performance (99% for classification tasks, 95% for generation tasks)
  3. Systematic Analysis: Comprehensive evaluation across 8 datasets, 14 models (0.5B-27B parameters), and two task types (classification/generation)
  4. Important Findings:
    • Ordering effects do not vary monotonically with model scale
    • Generation tasks are more sensitive to selection (r=1.46), while classification tasks show nearly equal sensitivity to both (r=1.09)
    • Optimal ordering is highly dataset-dependent with poor cross-dataset transferability

Methodology Details

Task Definition

The research focuses on few-shot in-context learning with tasks including:

  • Classification Tasks: Given k annotated examples and a test input, predict the class label
  • Generation Tasks: Given k examples and a query, generate a free-form answer
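To make the setup concrete, the following is a minimal sketch of how k demonstrations and a query might be concatenated into a single prompt. The `build_prompt` helper and the "Input:/Output:" template are illustrative assumptions, not the paper's actual prompt format.

```python
from typing import List, Tuple


def build_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Concatenate (input, output) demonstrations, then append the query.

    The "Input:/Output:" template is a hypothetical choice for illustration.
    """
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")  # the model completes this line
    return "\n\n".join(blocks)


demos = [
    ("Stocks rallied on Friday.", "Business"),
    ("The striker scored twice.", "Sports"),
]
prompt = build_prompt(demos, "New chip doubles battery life.")
reordered = build_prompt(demos[::-1], "New chip doubles battery life.")
```

The same example set yields a different prompt string under a different ordering, which is the degree of freedom the paper studies.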

Core Research Question: Quantify the relative impact of example ordering and example selection on ICL performance

Experimental Design Framework

1. Default Ordering Definition

To isolate the effects of ordering and selection, a consistent default ordering is defined:

  • Classification Tasks: Group by label alphabetically, sort within groups alphabetically
  • Generation Tasks: Sort all examples alphabetically

2. Controlled Variable Experiments

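Construct M=10 different example sets S₁,...,S_M and evaluate the same P=10 random permutations π₁,...,π_P on each set. Because every set has a default ordering, a permutation πⱼ denotes the same rearrangement of positions regardless of which set it is applied to: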

Accuracy Matrix A = [aᵢ,ⱼ]
where aᵢ,ⱼ = Acc(Sᵢ, πⱼ | Dₜₑₛₜ)

Sensitivity Metrics

Order Sensitivity

Calculate the standard deviation of different permutations for each example set, then average:

\sigma^{(M)} = \frac{1}{M}\sum_{i=1}^{M} \text{std}(a_{i,1}, \ldots, a_{i,P})

This measures the impact of changing ordering while fixing the example set.

Selection Sensitivity

Calculate the standard deviation of different example sets for each permutation, then average:

\sigma^{(P)} = \frac{1}{P}\sum_{j=1}^{P} \text{std}(a_{1,j}, \ldots, a_{M,j})

This measures the impact of changing the example set while fixing the ordering.

Relative Importance Ratio

r = \frac{\sigma^{(P)}}{\sigma^{(M)}}

  • r ≈ 1: Both factors have comparable impact
  • r > 1: Selection is more important
  • r < 1: Ordering is more important
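The three quantities above can be computed directly from an M×P accuracy matrix. The sketch below uses only the Python standard library and assumes the population standard deviation (`pstdev`); the summary does not say whether the paper uses the population or sample form, and the function name `sensitivities` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def sensitivities(A: List[List[float]]) -> Tuple[float, float, float]:
    """A[i][j] is the accuracy of example set i under permutation j.

    Returns (order sensitivity, selection sensitivity, ratio r).
    Population std is an assumption; swap in statistics.stdev if needed.
    """
    # sigma^(M): std across permutations within each set, averaged over sets
    order_sens = mean(pstdev(row) for row in A)
    # sigma^(P): std across sets within each permutation, averaged over permutations
    selection_sens = mean(pstdev(col) for col in zip(*A))
    return order_sens, selection_sens, selection_sens / order_sens


# Toy 2x2 matrix: rows differ by 0.10, columns differ by 0.04
toy = [[0.70, 0.74],
       [0.80, 0.84]]
order_sens, selection_sens, r = sensitivities(toy)
```

In this toy matrix, swapping the example set moves accuracy more than swapping the permutation, so r comes out above 1 (selection dominates).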

Method for Finding Optimal Ordering

Algorithm Flow (Algorithm 1)

Input: Example set Sᵢ, development set Ddev, test set Dtest, number of permutations P=128
For each example set Sᵢ (repeat M=10 times):
    1. Generate P random permutations {πⱼ}
    2. Evaluate each permutation on development set: aⱼ = Acc(Sᵢ, πⱼ | Ddev)
    3. Select optimal permutation: π* = argmax aⱼ
    4. Evaluate on test set: a* = Acc(Sᵢ, π* | Dtest)
    5. Record oracle performance: amax = max Acc(Sᵢ, πⱼ | Dtest)
Return: {a*, amax}
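The flow above, for a single example set, might look like the following sketch. The `accuracy` callable stands in for running the model on a dataset with a given demonstration order; it and the function name `search_ordering` are hypothetical, and candidate permutations are drawn independently, so duplicates are possible.

```python
import random
from typing import Callable, List, Sequence, Tuple


def search_ordering(
    examples: Sequence,
    accuracy: Callable[[List, Sequence], float],
    dev_set: Sequence,
    test_set: Sequence,
    num_perms: int = 128,
    seed: int = 0,
) -> Tuple[List, float, float]:
    """Pick the ordering with the best dev accuracy; report test and oracle scores."""
    rng = random.Random(seed)
    # Step 1: sample candidate orderings of the same example set
    candidates = [rng.sample(list(examples), k=len(examples)) for _ in range(num_perms)]
    # Steps 2-3: score each candidate on the dev set and keep the best
    best = max(candidates, key=lambda order: accuracy(order, dev_set))
    # Step 4: test accuracy of the dev-selected ordering (a*)
    a_star = accuracy(best, test_set)
    # Step 5: oracle upper bound, which peeks at test labels (a_max)
    a_max = max(accuracy(order, test_set) for order in candidates)
    return best, a_star, a_max


def toy_accuracy(order, data):
    """Stand-in scorer with no model: fraction of positions matching `data`."""
    return sum(x == y for x, y in zip(order, data)) / len(data)


demos = ["a", "b", "c"]
best, a_star, a_max = search_ordering(
    demos, toy_accuracy, dev_set=["a", "b", "c"], test_set=["a", "b", "c"], num_perms=16
)
```

In the toy run the dev and test sets are identical, so the dev-selected ordering trivially matches the oracle; with a real model the gap between `a_star` and `a_max` is what the paper reports. Repeating the call for each of the M example sets reproduces the outer loop.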

Key Parameter Studies

  • Number of Permutations P: Study impact from 16 to 128
  • Development Set Size |Ddev|: Study impact from 50 to 1000 samples

Technical Innovations

  1. Experimental Design Innovation: Through default ordering definition, first achieves complete decoupling of selection and ordering effects
  2. Measurement Method: Proposes grouped standard deviation as a unified sensitivity metric, enabling fair comparison of both factors
  3. Practical Balance: Method requires no oracle access to test labels, only a small-scale development set (250 samples suffice)
  4. Systematic Evaluation: First comprehensive comparison of ordering vs. selection across multiple models, tasks, and scales

Experimental Setup

Datasets

Classification Tasks (5 datasets)

Dataset          Number of Classes   Examples k
AG News          4                   8
NYT-Topics       9                   18
NYT-Locations    10                  20
DBPedia          14                  28
MMLU             4                   8

Generation Tasks (3 datasets)

  • GSM8K: Math word problems (k=8)
  • MMLU-Pro: Multi-task understanding (k=8)
  • MATH: Math problem solving (k=8)

Data Split:

  • Development Set Ddev: 1000 samples (for ordering selection)
  • Test Set Dtest: 500 samples (for final evaluation)
  • Classification tasks use oversampling to ensure class balance
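One way class-balanced splits could be drawn is sketched below. This is only an illustration of oversampling: the summary does not describe the paper's procedure, so sampling with replacement within each class, the equal per-class quota, and the name `balanced_sample` are all assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import List, Sequence, Tuple


def balanced_sample(items: Sequence[Tuple[str, str]], size: int, seed: int = 0) -> List[Tuple[str, str]]:
    """Draw `size` (text, label) items with an equal number per class.

    Rare classes are oversampled by drawing with replacement (assumed scheme).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in items:
        by_label[label].append((text, label))
    per_class = size // len(by_label)  # any remainder is dropped for simplicity
    sample = []
    for label in sorted(by_label):
        sample.extend(rng.choices(by_label[label], k=per_class))
    rng.shuffle(sample)
    return sample


pool = [("t0", "A"), ("t1", "A"), ("t2", "A"), ("t3", "B")]
split = balanced_sample(pool, size=8)
counts = Counter(label for _, label in split)
```

With a four-class dataset such as AG News and a 1000-sample development set, this scheme would give 250 samples per class.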

Evaluation Metrics

  • Classification Tasks: Accuracy
  • Generation Tasks: Exact Match or numerical tolerance matching
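A scorer combining exact match with a numerical tolerance could be sketched as follows. The relative tolerance of 1e-4, the comma stripping, and the name `is_correct` are assumptions; the summary does not give the paper's matching rules.

```python
import math


def is_correct(prediction: str, reference: str, rel_tol: float = 1e-4) -> bool:
    """Exact string match first, then a numeric comparison within a tolerance."""
    pred, ref = prediction.strip(), reference.strip()
    if pred == ref:
        return True
    try:
        # Strip thousands separators before parsing (assumed normalization)
        pred_num = float(pred.replace(",", ""))
        ref_num = float(ref.replace(",", ""))
    except ValueError:
        return False  # non-numeric answers must match exactly
    return math.isclose(pred_num, ref_num, rel_tol=rel_tol)


checks = [
    is_correct("42", "42.0"),
    is_correct("1,000", "1000"),
    is_correct("Paris", "paris"),
    is_correct("3.14", "3.15"),
]
```

The numeric fallback matters for math datasets such as GSM8K, where "42" and "42.0" should count as the same answer.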

Comparison Methods

  • Average: Average performance across all random permutations (baseline)
  • Highest-Dev: Optimal permutation selected on development set evaluated on test set (proposed method)
  • Max: Optimal permutation across all permutations on test set (oracle upper bound)

Implementation Details

Model Coverage (14 models)

  • Qwen2.5 Series: 0.5B, 1.5B, 3B, 7B
  • Gemma-2 Series: 2B, 9B
  • Gemma Series: 2B, 7B
  • Llama 3 Series: 1B, 3B, 8B
  • DeepSeek-R1-Distill: 1.5B, 7B
  • Gemma-3: 27B
  • GPT-5-Nano

Experimental Parameters

  • Sensitivity Experiments: M=10 example sets, P=10 permutations
  • Ordering Search Experiments: M=10 example sets, P=128 permutations
  • Development Set Size Study: 50-1000 samples

Experimental Results

Main Results: Ordering vs. Selection Sensitivity

Overall Findings

  • Order Sensitivity: σ^(M) = 0.01970
  • Selection Sensitivity: σ^(P) = 0.02251
  • Relative Difference: Selection is only 14% higher than ordering

This result challenges conventional wisdom and suggests that the importance of ordering has been substantially underestimated.

Analysis by Model Scale (Table 2 Key Findings)

Model         Scale   Ordering   Selection   r Value
Qwen2.5       0.5B    0.0223     0.0245      1.10
Qwen2.5       7B      0.0119     0.0155      1.30
Gemma-3       27B     0.0157     0.0262      1.67
GPT-5-Nano    -       0.0234     0.0198      0.85

Key Insights:

  1. Smaller Models More Sensitive: Within Qwen2.5, the 0.5B model's order sensitivity (0.0223) is roughly 1.9x that of the 7B model (0.0119)
  2. No Monotonic Trend: r values do not vary monotonically with model scale
  3. Proprietary Model Anomaly: GPT-5-Nano is more sensitive to ordering than to selection (r<1), possibly reflecting different training strategies
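As a quick consistency check, the r values can be recomputed from the reported ordering and selection sensitivities. The snippet only re-derives numbers already quoted in this summary; the dictionary keys are labels chosen here for readability.

```python
# (ordering sensitivity, selection sensitivity) as quoted in this summary
reported = {
    "Qwen2.5-0.5B": (0.0223, 0.0245),
    "Qwen2.5-7B": (0.0119, 0.0155),
    "Gemma-3-27B": (0.0157, 0.0262),
    "GPT-5-Nano": (0.0234, 0.0198),
}

# r = selection sensitivity / ordering sensitivity, rounded as in the table
ratios = {name: round(sel / order, 2) for name, (order, sel) in reported.items()}

# Overall ratio from the aggregate sensitivities 0.01970 and 0.02251
overall_r = 0.02251 / 0.01970

# Ordering sensitivity of the 0.5B model relative to the 7B model
small_vs_large = reported["Qwen2.5-0.5B"][0] / reported["Qwen2.5-7B"][0]
```

The recomputed ratios agree with the table, and the overall ratio of about 1.14 corresponds to selection sensitivity being about 14% higher than order sensitivity.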

Analysis by Task Type (Table 3)

Task Type                  Ordering   Selection   r Value
Classification (Average)   0.0226     0.0246      1.09
Generation (Average)       0.0154     0.0222      1.46

Important Findings:

  • Classification Tasks: Ordering and selection are nearly equally important (r≈1)
  • Generation Tasks: Selection is relatively more important (r=1.46), but order sensitivity is still about 68% of selection sensitivity (1/1.46 ≈ 0.68)

Dataset-Level Differences

Cases Where Ordering Matters as Much as or More Than Selection:

  • NYT-Topics: r=0.97 (ordering slightly more influential)
  • AG News: r=1.01 (nearly equal)

Cases Where Selection is More Important:

  • GSM8K: r=1.58
  • MATH: r=1.33

This indicates that task characteristics influence the relative importance of both factors.

Effectiveness of Finding Optimal Ordering

Classification Task Results (Figures 3a, 3c)

  • Impact of Number of Permutations P:
    • P=16: Recovers 98% of oracle performance
    • P=128: Recovers 99% of oracle performance
    • Average performance consistently lags optimal performance by 5-6 percentage points
  • Impact of Development Set Size:
    • 50 samples: Already shows clear effect
    • 250 samples: Performance stabilizes
    • 1000 samples: Diminishing marginal returns

Generation Task Results (Figures 3b, 3d)

  • Impact of Number of Permutations P:
    • P=64-100: Recovers 95% of oracle performance
    • Requires more permutations to match classification task effectiveness
  • Development Set Size: Similarly stabilizes after 250 samples

Specific Dataset Performance (Tables 5, 6)

Classification Task Example (DBPedia, Qwen2.5-7B):

  • Average: 0.774
  • Highest-Dev: 0.795
  • Max: 0.800
  • Improvement: +2.1 percentage points for Highest-Dev over Average (2.7% relative improvement)

Generation Task Example (GSM8K, Llama-3.1-8B):

  • Average: 0.658
  • Highest-Dev: 0.669
  • Max: 0.696
  • Improvement: +1.1 percentage points for Highest-Dev over Average, but a 2.7-point gap to the oracle remains
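The two examples above can be related to the "share of oracle performance recovered" figures with a small calculation. The sketch assumes recovery is defined as Highest-Dev accuracy divided by Max accuracy; the summary does not state the paper's exact definition, and `oracle_recovery` is a hypothetical helper.

```python
def oracle_recovery(highest_dev: float, oracle_max: float) -> float:
    """Fraction of oracle accuracy reached by the dev-selected ordering (assumed definition)."""
    return highest_dev / oracle_max


# DBPedia, Qwen2.5-7B: Average 0.774, Highest-Dev 0.795, Max 0.800
dbpedia_recovery = oracle_recovery(0.795, 0.800)
dbpedia_gain = 0.795 - 0.774  # absolute gain over the average ordering

# GSM8K, Llama-3.1-8B: Average 0.658, Highest-Dev 0.669, Max 0.696
gsm8k_recovery = oracle_recovery(0.669, 0.696)
gsm8k_gap = 0.696 - 0.669  # remaining gap to the oracle
```

Under this definition the classification example recovers roughly 99% of oracle accuracy and the generation example roughly 96%, in line with the aggregate figures reported for the two task types.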

Ordering Transferability Experiments (Table 7)

Cross-Dataset Transfer (GSM8K ↔ MATH)

Model        GSM8K Optimal   MATH Optimal   GSM8K→MATH   MATH→GSM8K   Transfer Rate
Qwen2.5-7B   0.616           0.244          0.207        0.593        0.905
Average      0.439           0.188          0.145        0.400        0.798

Key Findings:

  • Post-transfer performance approaches the random average performance of the target dataset
  • Average transfer rate is only 79.8%, indicating that optimal ordering is highly dataset-dependent
  • Even for related tasks (two math datasets), ordering is difficult to transfer
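The summary does not define the transfer rate. One reading that approximately reproduces the Qwen2.5-7B row is the mean of the two "transferred accuracy divided by in-domain optimal accuracy" ratios. The sketch below is that assumed definition only, with a hypothetical function name.

```python
def transfer_rate(opt_a: float, opt_b: float, a_to_b: float, b_to_a: float) -> float:
    """Mean of transferred / in-domain-optimal accuracy in both directions.

    This definition is an assumption inferred from the table, not taken from the paper.
    """
    return (a_to_b / opt_b + b_to_a / opt_a) / 2


# Qwen2.5-7B row: GSM8K optimal, MATH optimal, GSM8K->MATH, MATH->GSM8K
qwen_rate = transfer_rate(opt_a=0.616, opt_b=0.244, a_to_b=0.207, b_to_a=0.593)
```

For Qwen2.5-7B this gives about 0.906 from the rounded table entries, close to the reported 0.905. The Average row need not satisfy the same identity, since an average of per-model rates generally differs from a rate computed on averaged accuracies.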

Ablation Experiments: Key Factor Analysis

While the paper does not explicitly label ablation experiments, parameter variation experiments reveal:

  1. Marginal Effect of Permutation Count P:
    • 16→32: Significant improvement
    • 32→64: Moderate improvement
    • 64→128: Diminishing marginal returns
  2. Threshold Effect of Development Set Size:
    • <250 samples: Rapid performance improvement
    • ≥250 samples: Performance plateaus
    • Recommendation: Use a 250-500 sample development set in practice

Case Analysis

The paper does not provide specific qualitative analysis examples, but numerical results suggest:

Maximum Fluctuation Case (Table 4):

  • Llama-3.1-8B on DBPedia:
    • Order Sensitivity: 0.08791
    • Selection Sensitivity: 0.13226
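    • With a standard deviation of about 8.8 percentage points, a two-standard-deviation band spans roughly ±17.6 percentage points (2 × 0.08791), so changing ordering alone can swing accuracy substantially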

Most Stable Case:

  • Gemma-3-27B on most tasks:
    • Order Sensitivity: 0.00545-0.00802
    • Larger models demonstrate better robustness

Related Work

1. Research on Prompt Ordering Sensitivity

  • Zhao et al. (2021): Among the first to systematically show that GPT-3 is highly sensitive to example ordering, with accuracy fluctuating by tens of percentage points, attributed in part to recency bias (over-reliance on examples near the end of the prompt)
  • Lu et al. (2022): Showed that well-chosen orderings can achieve near-SOTA performance, while poor orderings drop accuracy to near-random levels

This Paper's Contribution: First quantitatively compares relative impact of ordering and selection, rather than merely observing ordering's existence

2. Example Selection vs. Ordering Effects

  • Min et al. (2022): Emphasized importance of example selection
  • Rubin et al. (2022): Proposed retrieval-based example selection methods
  • Zhang et al. (2022), Guo et al. (2024): Recent research begins recognizing that ordering may be as important as selection

This Paper's Contribution: Through controlled experimental design, first provides quantitative comparison of both factors' impact (r value)

3. Strategies for Mitigating Ordering Sensitivity

  • Heuristic Methods: Sampling permutations on development set (Zhao et al., 2021; Zhang et al., 2022)
  • Adaptive Methods: Dynamic reordering based on test queries (Guo et al., 2024)
  • Reinforcement Learning: RL-based search (Bhope et al., 2023)

This Paper's Contribution: Proposes simple yet effective development set selection method, proving that near-optimal ordering can be achieved without complex algorithms

4. Relationship to This Work

This paper extends existing work in the following aspects:

  • Broader Scope: 14 models, 8 datasets, classification + generation tasks
  • More Rigorous Methods: Achieves complete decoupling of ordering and selection comparison through default ordering
  • More Systematic Findings: Quantifies relative impact, studies transferability, analyzes model scale effects

Conclusions and Discussion

Main Conclusions

  1. Core Finding: The performance impact of example ordering is comparable to example selection, with ordering sensitivity averaging 88% of selection sensitivity (r=1.14)
  2. Practical Method: Evaluating 64-128 permutations and 250 development samples suffices to find near-optimal ordering
  3. Universality: This finding holds across models from 0.5B to 27B parameters, classification and generation tasks
  4. Specificity: Optimal ordering is highly dataset-dependent with poor cross-dataset transferability (transfer rate 79.8%)
  5. Model Scale Effects: Smaller models are more sensitive, but relative importance of ordering vs. selection does not vary monotonically with scale

Limitations

Acknowledged by Authors

  1. Model Coverage: Does not include full GPT-5 and Claude due to budget and API constraints
  2. Language Limitation: Only evaluates English tasks, does not consider multilingual scenarios
  3. Task Types: Does not cover code generation, retrieval-augmented generation, dialogue, etc.
  4. Evaluation Metrics: Only uses accuracy, does not consider other dimensions (calibration, robustness)

Other Potential Limitations

  1. Example Quantity: k values fixed at 2|C| or 8, does not systematically study impact of different shot numbers
  2. Default Ordering Definition: While alphabetical ordering is reasonable, it may introduce minor biases
  3. Computational Cost: Evaluating 128 permutations × 10 example sets still requires substantial computation, may require trade-offs in practical applications
  4. Insufficient Theoretical Explanation: Lacks deep mechanistic analysis of why ordering is so important

Future Directions

Directions Proposed by Paper

  1. Test larger-scale models (full GPT-5)
  2. Extend to other languages
  3. Explore different shot regimes (few-shot, many-shot)
  4. Evaluate code generation and RAG tasks

Other Directions Worth Exploring

  1. Mechanistic Research: Use attention visualization and other methods to understand underlying causes of ordering sensitivity
  2. Automated Methods: Develop adaptive ordering optimization algorithms without requiring development sets
  3. Cross-Task Transfer: Research whether task-agnostic ordering strategies can be learned
  4. Interaction with Other Factors: Study joint optimization of ordering with prompt templates and instructions

In-Depth Evaluation

Strengths

1. Methodological Rigor ⭐⭐⭐⭐⭐

  • Controlled Experimental Design: Achieves complete decoupling of selection and ordering through default ordering, avoiding confounding factors
  • Systematic Evaluation: 14 models × 8 datasets × 2 task types provides broad coverage
  • Reasonable Metrics: Grouped standard deviation as unified metric enables direct comparison of both factors

2. Importance of Findings ⭐⭐⭐⭐⭐

  • Challenges Conventional Wisdom: Shows that ordering's impact is comparable to selection's, challenging a common assumption in the field
  • High Practical Value: Ordering optimization yields a 2-3 percentage point improvement without retraining or additional examples, at the cost of evaluating candidate permutations on a development set
  • Theoretical Significance: Reveals LLM sensitivity to context structure, providing new perspective for understanding model behavior

3. Strong Practicality ⭐⭐⭐⭐

  • Simple Method: No complex algorithms needed, only requires evaluating candidate permutations on development set
  • Reasonable Resource Requirements: 250-sample development set + 64 permutations achieves good results
  • Easy to Reproduce: Paper provides detailed experimental setup and pseudocode

4. Clear Writing ⭐⭐⭐⭐⭐

  • Logical Structure: Clear progression from motivation to methods to experiments
  • Effective Visualization: Figure 1's matrix diagram intuitively shows experimental design
  • Comprehensive Data: Appendix provides complete model-dataset level results

Weaknesses

1. Insufficient Theoretical Explanation ⭐⭐

  • Lacks Mechanistic Analysis: Does not deeply explore why ordering is so important
  • No Attention Analysis: Does not verify hypotheses through attention weights
  • Lacks Interpretability: Does not analyze what makes an ordering "good"

2. Experimental Design Limitations ⭐⭐⭐

  • Permutation Sampling Strategy: Random sampling may miss certain special effective ordering patterns
  • Default Ordering Impact: Alphabetical ordering itself may not be truly "neutral" baseline
  • Example Set Construction: M=10 may be insufficient to fully represent selection diversity

3. Insufficient Transferability Research ⭐⭐

  • Limited Dataset Pairs: Only tests GSM8K and MATH, both math tasks, lacks cross-domain testing
  • Lacks Failure Analysis: Does not deeply investigate why transfer fails
  • Missing Positive Transfer Cases: Are there scenarios where ordering can transfer?

4. Limited Practical Application Guidance ⭐⭐⭐

  • No Ordering Design Principles: Does not summarize practical heuristic rules for ordering construction
  • Insufficient Cost Analysis: Does not quantify actual time and API costs of evaluating 128 permutations
  • Multi-Example-Set Scenarios: How to simultaneously optimize example sets and ordering in practice?

Impact Assessment

1. Contribution to Field ⭐⭐⭐⭐⭐

  • Paradigm Shift: May trigger shift in ICL research from "selection-focused" to "selection + ordering equally important"
  • Inspires Follow-up Research: Expected to catalyze numerous works on ordering optimization and mechanistic understanding
  • Practical Impact: May change industry best practices in prompt engineering

2. Practical Value ⭐⭐⭐⭐

  • Immediately Applicable: Simple method can be applied to existing systems
  • High Cost-Benefit: Small cost yields significant improvement (2-3 percentage points)
  • Broad Applicability: Effective across models and tasks

3. Reproducibility ⭐⭐⭐⭐

  • Strengths:
    • Uses public models and datasets
    • Provides detailed hyperparameter settings
    • Appendix includes complete results
  • Weaknesses:
    • Code not open-sourced (as of paper publication)
    • Some experiments require substantial computational resources

4. Potential Citation Value

This paper is expected to become an important reference in ICL literature because:

  • Provides benchmark comparison data for ordering vs. selection
  • Simple method facilitates reproduction and extension in follow-up work
  • Challenges fundamental field assumptions with milestone significance

Applicable Scenarios

Highly Applicable ✅

  1. Few-shot Classification Tasks: Paper shows ordering matters most, relative to selection, on classification (r≈1)
  2. Resource-Constrained Scenarios: When unable to expand example sets, ordering optimization is low-cost improvement
  3. Fixed Example Set Scenarios: Some applications have fixed example sets, making ordering optimization the only option
  4. Sufficient Development Set Scenarios: When 250+ annotated samples available for ordering selection

Moderately Applicable ⚠️

  1. Generation Tasks: Slightly weaker than classification (r=1.46), but still worth attempting
  2. Cross-Task Applications: Requires re-searching ordering for each new task
  3. Large Model Applications: While more stable, large models still exhibit ordering sensitivity

Not Applicable ❌

  1. Zero-Shot Scenarios: Method depends on multi-example ICL
  2. Extremely Small Development Sets: <50 samples shows unstable performance
  3. Real-Time Interactive Systems: Cannot pre-evaluate 128 permutations
  4. Cross-Domain Transfer: Ordering learned from one dataset transfers poorly to others

Implications for Future Research

  1. Re-examine ICL Assumptions: Are other factors assumed secondary (e.g., example format, label word choice) also underestimated?
  2. Joint Optimization Framework: Future work should develop methods simultaneously optimizing selection and ordering, rather than handling independently
  3. Mechanistic Research: Urgent need for theoretical work explaining roots of ordering sensitivity (position bias? attention mechanisms?)
  4. Adaptive Methods: Develop online ordering optimization algorithms without requiring development sets
  5. Robustness Research: How to train models insensitive to ordering?

Key References

  1. Brown et al. (2020) - Language Models are Few-Shot Learners (GPT-3 paper, establishes ICL paradigm)
  2. Zhao et al. (2021) - Calibrate Before Use: Improving Few-Shot Performance of Language Models (Early systematic study of ordering sensitivity)
  3. Lu et al. (2022) - Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
  4. Min et al. (2022) - Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? (Emphasizes example selection)
  5. Guo et al. (2024) - DEmO: Dynamic Example Ordering for In-Context Learning (Dynamic ordering optimization)

Summary Evaluation

This is a high-quality, high-impact research work whose core value lies in:

  1. Challenging Field Assumptions: Provides controlled evidence that ordering's impact is comparable to selection's
  2. Providing Practical Solutions: Simple yet effective development set selection method
  3. Strong Systematicity: Comprehensive evaluation across models, tasks, and scales
  4. High Inspirational Value: Points multiple important directions for future research

Main weaknesses are insufficient theoretical explanation and limited transferability research, but these do not diminish its status as an important contribution to ICL literature.

Recommended For: All researchers and engineers working on ICL, prompt engineering, and LLM applications.

Rating: ⭐⭐⭐⭐½ (4.5/5)