Order Matters: Rethinking Prompt Construction in In-Context Learning
Li, Wang, Wang et al.
In-context learning (ICL) enables large language models to perform new tasks by conditioning on a sequence of examples. Most prior work reasonably and intuitively assumes that which examples are chosen has a far greater effect on performance than how those examples are ordered, leading to a focus on example selection. We revisit this assumption and conduct a systematic comparison between the effect of selection and ordering. Through controlled experiments on both classification and generation tasks, using multiple open-source model families (0.5B to 27B parameters) and GPT-5, we find that the variance in performance due to different example orderings is comparable to that from using entirely different example sets. Furthermore, we show that strong orderings can be identified using only a development set, achieving performance close to an oracle that selects the best ordering based on test labels. Our findings highlight the equal and intertwined importance of example selection and ordering in prompt design, calling for a reexamination of the assumptions held in ICL.
This paper challenges a fundamental assumption in the in-context learning (ICL) field: that example selection is more important than example ordering. Through systematic experiments on classification and generation tasks, the authors find that performance fluctuations caused by example ordering are comparable to the impact of completely replacing the example set. The research covers multiple open-source model families ranging from 0.5B to 27B parameters and GPT-5. Furthermore, the study demonstrates that strong orderings approaching oracle performance can be identified using only a development set. These findings call for a re-examination of prompt construction strategies in ICL, emphasizing that example selection and ordering are equally important.
In-context learning lets large language models perform new tasks by conditioning on a small number of examples, without gradient updates or task-specific fine-tuning. While ICL performance is known to be sensitive to the in-context examples, most existing research assumes that example selection matters more than example ordering, which has concentrated research effort on selection.
Practical Significance: If ordering is as important as selection, the current research paradigm focusing solely on example selection may miss an important dimension for performance improvement
Theoretical Significance: Understanding ordering sensitivity helps reveal the context processing mechanisms of LLMs
Application Value: Optimizing ordering may improve model performance without retraining the model or collecting new examples
The authors use a controlled experimental design that varies selection and ordering independently, quantifying the relative impact of each factor and challenging conventional wisdom in the field.
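One way to picture the controlled design is sketched below. It is a minimal illustration, not the authors' code: the function names, the use of population standard deviation, and the `acc_fn` callback (which maps an ordered list of examples to an accuracy) are all assumptions made here. Ordering sensitivity holds the example set fixed and shuffles it; selection sensitivity holds the ordering fixed (alphabetical, the default the summary mentions) and swaps the example set.

```python
import random
import statistics
from typing import Callable, List, Sequence

# Hypothetical callback: takes examples in prompt order, returns accuracy.
AccFn = Callable[[Sequence[str]], float]

def ordering_sensitivity(acc_fn: AccFn, example_sets: List[List[str]],
                         num_perms: int, rng: random.Random) -> float:
    """Mean over example sets of the std of accuracy across random orderings."""
    per_set_std = []
    for examples in example_sets:
        accs = []
        for _ in range(num_perms):
            perm = list(examples)
            rng.shuffle(perm)          # vary ordering only; the set is fixed
            accs.append(acc_fn(perm))
        per_set_std.append(statistics.pstdev(accs))
    return statistics.mean(per_set_std)

def selection_sensitivity(acc_fn: AccFn, example_sets: List[List[str]]) -> float:
    """Std of accuracy across example sets, each in alphabetical order."""
    accs = [acc_fn(sorted(examples)) for examples in example_sets]
    return statistics.pstdev(accs)

if __name__ == "__main__":
    # Toy stand-in for a model: accuracy depends on which example comes first.
    toy = lambda ordered: 0.6 + 0.01 * (len(ordered[0]) % 5)
    sets = [["ant", "bison", "cat"], ["dog", "eagle", "ferret"]]
    print(ordering_sensitivity(toy, sets, num_perms=32, rng=random.Random(0)))
    print(selection_sensitivity(toy, sets))
```

Comparing the two numbers gives the kind of selection-versus-ordering comparison the paper reports, though the paper's exact aggregation may differ from this sketch.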
Quantitative Evidence: Through controlled experiments, the authors show that the performance impact of example ordering is comparable to that of example selection: the average standard deviation of performance is 0.01970 across orderings versus 0.02251 across example sets (selection is only 14% higher)
Practical Method: Proposes a development set-based ordering identification method that requires evaluating only 64-128 candidate permutations to recover near-oracle performance (99% for classification tasks, 95% for generation tasks)
Systematic Analysis: Comprehensive evaluation across 8 datasets, 14 models (0.5B-27B parameters), and two task types (classification/generation)
Important Findings:
Ordering effects do not vary monotonically with model scale
Generation tasks are more sensitive to selection (r=1.46), while classification tasks show nearly equal sensitivity to both (r=1.09)
Optimal ordering is highly dataset-dependent with poor cross-dataset transferability
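The ratio r quoted in these findings is not defined explicitly in this summary. Assuming r is selection sensitivity divided by ordering sensitivity (the symbols σ_sel and σ_ord are labels introduced here for illustration), the overall standard deviations reported above can be checked against the 14% figure:

```latex
\[
r = \frac{\sigma_{\text{sel}}}{\sigma_{\text{ord}}}
  = \frac{0.02251}{0.01970} \approx 1.14,
\qquad
\frac{1}{r} = \frac{0.01970}{0.02251} \approx 0.88 .
\]
```

Under this reading, r > 1 means selection matters more, so r=1.46 for generation indicates a clear tilt toward selection while r=1.09 for classification is close to parity. The paper may average per-setting ratios rather than take a ratio of averages, so small differences from these values would not be surprising.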
Input: Example sets {Sᵢ} (i = 1..M, M=10), development set Ddev, test set Dtest, number of permutations P=128
For each example set Sᵢ:
1. Generate P random permutations {πⱼ} of Sᵢ
2. Evaluate each permutation on the development set: aⱼ = Acc(Sᵢ, πⱼ | Ddev)
3. Select the permutation with the highest development accuracy: π* = argmax over πⱼ of aⱼ
4. Evaluate π* on the test set: a* = Acc(Sᵢ, π* | Dtest)
5. Record oracle performance: amax = maxⱼ Acc(Sᵢ, πⱼ | Dtest)
Return: {a*, amax} for each Sᵢ
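The procedure above can be sketched in Python for a single example set. This is a minimal illustration under stated assumptions rather than the authors' implementation: `select_ordering` and the `accuracy(ordered_examples, dataset)` callback are hypothetical names, and in practice the callback would build a prompt from the ordered examples and score the model on the dataset.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer: (examples in prompt order, dataset) -> accuracy.
AccFn = Callable[[Sequence[str], Sequence], float]

def select_ordering(examples: Sequence[str], dev_set: Sequence,
                    test_set: Sequence, accuracy: AccFn,
                    num_perms: int = 128, seed: int = 0
                    ) -> Tuple[List[str], float, float]:
    """Pick the best ordering on dev; report its test accuracy and the oracle."""
    rng = random.Random(seed)
    # Step 1: sample P random permutations of the example set.
    perms = []
    for _ in range(num_perms):
        perm = list(examples)
        rng.shuffle(perm)
        perms.append(perm)
    # Steps 2-3: score each permutation on dev and keep the best one.
    dev_accs = [accuracy(p, dev_set) for p in perms]
    best_idx = max(range(num_perms), key=dev_accs.__getitem__)
    # Steps 4-5: test accuracy of the chosen ordering, and the oracle
    # that picks the best ordering using test labels.
    test_accs = [accuracy(p, test_set) for p in perms]
    return perms[best_idx], test_accs[best_idx], max(test_accs)

if __name__ == "__main__":
    # Toy scorer: fraction of dataset items equal to the first example.
    toy = lambda ordered, data: sum(x == ordered[0] for x in data) / len(data)
    best, a_star, a_max = select_ordering(
        ["a", "b", "c"], ["a", "a", "b"], ["a", "b", "b"], toy)
    print(best, a_star, a_max)
```

Running this once per example set and averaging a* / amax gives a recovery rate comparable in spirit to the near-oracle percentages quoted earlier. Sampled permutations may repeat; deduplicating them is a design choice this sketch leaves out.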
Zhao et al. (2021): First to systematically show that GPT-3 is highly sensitive to example ordering, with accuracy fluctuating by tens of percentage points, attributed to the model's over-reliance on early context
Lu et al. (2022): Showed that optimal ordering can achieve near-SOTA performance, while poor ordering drops accuracy to random levels
This Paper's Contribution: First to quantitatively compare the relative impact of ordering and selection, rather than merely observing that ordering effects exist
Heuristic Methods: Sampling permutations on development set (Zhao et al., 2021; Zhang et al., 2022)
Adaptive Methods: Dynamic reordering based on test queries (Guo et al., 2024)
Reinforcement Learning: RL-based search (Bhope et al., 2023)
This Paper's Contribution: Proposes a simple yet effective development-set selection method, showing that near-optimal orderings can be found without complex algorithms
Core Finding: The performance impact of example ordering is comparable to example selection, with ordering sensitivity averaging 88% of selection sensitivity (r=1.14)
Practical Method: Evaluating 64-128 permutations on 250 development samples suffices to find a near-optimal ordering
Universality: This finding holds across models from 0.5B to 27B parameters and across both classification and generation tasks
Specificity: Optimal ordering is highly dataset-dependent with poor cross-dataset transferability (transfer rate 79.8%)
Model Scale Effects: Smaller models are more sensitive, but the relative importance of ordering vs. selection does not vary monotonically with scale
Example Quantity: k is fixed at 2|C| or 8; the paper does not systematically study the impact of different shot counts
Default Ordering Definition: The alphabetical default ordering is reasonable but may introduce minor biases
Computational Cost: Evaluating 128 permutations × 10 example sets still requires substantial computation, so practical applications may need to trade search budget against accuracy
Insufficient Theoretical Explanation: The paper lacks a deep mechanistic analysis of why ordering is so important
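To give the computational-cost point a rough scale, the sketch below counts model calls for the development-set search. It assumes one model call per development sample per candidate prompt, and it combines the 128 × 10 figure with the 250 development samples mentioned in the findings; that combination is an assumption of this summary, not a number reported by the paper. The function name is hypothetical.

```python
def dev_search_calls(num_perms: int, num_sets: int, dev_size: int) -> int:
    """Model calls needed to score every permutation of every example set on dev."""
    # Assumption: each (example set, permutation, dev sample) triple is one call.
    return num_perms * num_sets * dev_size

if __name__ == "__main__":
    for perms in (64, 128):
        print(perms, dev_search_calls(perms, num_sets=10, dev_size=250))
```

Under these assumptions the count is per dataset and per model and excludes test-set evaluation, so the total across 8 datasets and 14 models would be correspondingly larger.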
This is a high-quality, high-impact research work whose core value lies in:
Challenging Field Assumptions: Shows through controlled experiments that ordering and selection are comparably important
Providing Practical Solutions: Simple yet effective development set selection method
Strong Systematicity: Comprehensive evaluation across models, tasks, and scales
High Inspirational Value: Points to multiple important directions for future research
The main weaknesses are the insufficient theoretical explanation and the limited study of transferability, but these do not diminish its status as an important contribution to the ICL literature.
Recommended For: All researchers and engineers working on ICL, prompt engineering, and LLM applications.