Predicting Task Performance with Context-aware Scaling Laws
Montgomery, Park, Tu et al.
Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at https://github.com/wang-research-lab/context-scaling.
Traditional neural network scaling laws have revolutionized our understanding of large language models by linking upstream metrics (such as cross-entropy loss) to design factors (such as model size, training data, and compute). However, these conventional laws fail to capture downstream task performance, where context plays a critical role. This paper proposes an intuitive and interpretable framework that models downstream performance as a joint function of training compute and provided context. The authors validate the framework empirically by fitting it to the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, commonsense reasoning, and machine translation. The results show that the framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude of training compute, and reliably extrapolates performance as the amount of context increases.
Traditional neural network scaling laws primarily focus on upstream metrics (such as cross-entropy loss), but in practical applications, downstream task performance often diverges from these upstream trends. Existing work on predicting downstream performance typically relies on overly complex methods with poor interpretability.
Practical Necessity: Accurate estimates of downstream performance can guide model development and reveal emergent or saturating behavior on specific tasks while requiring fewer expensive experiments
Theoretical Gap: Existing scaling laws overlook context length, a critical factor in downstream task performance
Design Guidance: Understanding the interaction between compute and context utilization is essential for designing efficient long-context LLMs
Proposed Context-aware Scaling Laws Framework: Extends traditional neural scaling laws to downstream tasks by incorporating context length and context constraints, providing more accurate LLM performance modeling
Large-scale Empirical Validation: Fits the framework on 3 tasks using extended-context variants of Llama-2 models, demonstrating that the scaling laws generalize across 3 orders of magnitude of training compute, 4 orders of magnitude of context length, and different context extension techniques
Interpretable Theoretical Tool: Provides an interpretable framework for understanding the interaction between compute, context, and downstream performance, offering guidance for future long-context LLM design
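To make the idea of fitting a scaling law concrete, the sketch below fits a single saturating term of the form P = 1 - exp(-A * x^alpha) to synthetic, noise-free data. This summary does not describe the paper's actual fitting procedure, so everything here is an assumption: the single-term form, the function name `fit_saturating_power_law`, and the parameter values are all hypothetical. The sketch relies on the identity ln(-ln(1 - P)) = ln(A) + alpha * ln(x), which turns the fit into ordinary least squares on a straight line.

```python
import math


def fit_saturating_power_law(xs, ps):
    """Fit P = 1 - exp(-A * x**alpha) by linear least squares.

    Uses ln(-ln(1 - P)) = ln(A) + alpha * ln(x), so every x must be
    positive and every P must lie strictly between 0 and 1.
    Hypothetical helper for illustration, not the paper's method.
    """
    us = [math.log(x) for x in xs]
    vs = [math.log(-math.log(1.0 - p)) for p in ps]
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    # Slope and intercept of the least-squares line v = ln(A) + alpha * u.
    s_uu = sum((u - mean_u) ** 2 for u in us)
    s_uv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    alpha = s_uv / s_uu
    A = math.exp(mean_v - alpha * mean_u)
    return A, alpha


# Synthetic data generated from made-up parameters (not results from the paper).
true_A, true_alpha = 0.5, 0.3
xs = [0.01, 0.1, 1.0, 10.0, 100.0]  # normalized compute, arbitrary units
ps = [1.0 - math.exp(-true_A * x ** true_alpha) for x in xs]

A_hat, alpha_hat = fit_saturating_power_law(xs, ps)
print(f"recovered A={A_hat:.4f}, alpha={alpha_hat:.4f}")
```

With noisy measurements, or with a product of several terms, this linearization no longer applies directly and a nonlinear least-squares routine would be the more natural choice.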
Multiplicative Form: Compute and context are complementary rather than additive; significant deficiency in one dimension limits gains from the other
Saturating Power Laws: The exponential formulation keeps predicted performance below the theoretical maximum of 1.0
Penalty Mechanism: When context exceeds model limits, generated tokens fall outside the range the model can reliably predict, causing sharp performance degradation
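The three properties above can be combined into a single expression. The sketch below is one form consistent with them and is not necessarily the paper's exact equation: two saturating terms of the form 1 - exp(-A * x^alpha) are multiplied together, and the product is scaled by a sigmoid penalty that switches off once the prompt exceeds the model's context limit. All function names, parameter names (`A`, `alpha`, `B`, `beta`, `sharpness`), and numeric values are illustrative assumptions, not fitted values from the paper.

```python
import math


def saturating(x: float, scale: float, exponent: float) -> float:
    """Saturating power law: grows from 0 with x and is bounded above by 1."""
    return 1.0 - math.exp(-scale * x ** exponent)


def context_penalty(n_prompt: float, n_limit: float, sharpness: float = 0.01) -> float:
    """Sigmoid penalty: close to 1 when the prompt fits well within the
    context limit, 0.5 exactly at the limit, and close to 0 beyond it."""
    z = sharpness * (n_prompt - n_limit)
    z = max(-700.0, min(700.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(z))


def predicted_performance(compute, n_prompt, n_limit,
                          A=0.5, alpha=0.3, B=0.05, beta=0.5):
    """Hypothetical context-aware scaling law (illustrative parameters only).

    compute  : training compute in normalized, arbitrary units
    n_prompt : number of context tokens supplied to the model
    n_limit  : the model's context limit in tokens
    """
    compute_term = saturating(compute, A, alpha)
    context_term = saturating(n_prompt, B, beta)
    # Multiplicative form: a weak term in either dimension caps the product.
    return compute_term * context_term * context_penalty(n_prompt, n_limit)


# More context helps while the prompt fits, then the penalty takes over.
for n_prompt in (100, 1000, 4000, 8000):
    print(n_prompt, round(predicted_performance(1.0, n_prompt, 4096), 4))

# Very low compute limits performance regardless of how much context is given.
print(predicted_performance(1e-6, 4000, 4096))
```

Because the terms are multiplied rather than added, the prediction can never exceed the smallest of the three factors, which is how the sketch reflects the claim that a deficiency in one dimension limits gains from the other.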
This paper makes important contributions to scaling law research by systematically incorporating context length into downstream task performance prediction for the first time, providing valuable theoretical tools and practical guidance for designing and optimizing long-context LLMs.