2025-11-24T16:43:16.687108

In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

Wakayama, Suzuki

This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. We introduce a principled risk decomposition that separates the total ICL risk into two orthogonal components: Bayes Gap and Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on this gap, which explicitly clarifies the dependence on the number of pretraining prompts and their context length. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty. Our key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples. Together, these results provide a unified view of ICL: the Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.

academic

In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

Basic Information

Paper ID: 2510.10981
Title: In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
Authors: Tomoya Wakayama (RIKEN AIP), Taiji Suzuki (The University of Tokyo, RIKEN AIP)
Categories: stat.ML cs.LG
Published: October 13, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.10981v1

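Abstract

This paper develops a finite-sample statistical theory for in-context learning (ICL), analyzed within a meta-learning framework that accommodates mixtures of diverse task types. It introduces a principled risk decomposition that splits the total ICL risk into two orthogonal components: the Bayes Gap and the Posterior Variance. The Bayes Gap quantifies how well the trained model approximates the Bayes-optimal in-context predictor. For a uniform-attention Transformer, the paper derives a non-asymptotic upper bound on this gap that makes explicit its dependence on the number of pretraining prompts and on the context length. The Posterior Variance is a model-independent risk that represents the intrinsic task uncertainty. The key finding is that this term is determined solely by the difficulty of the true underlying task, while the uncertainty arising from the task mixture vanishes exponentially fast with only a few in-context examples.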

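Research Background and Motivation

Problem Background

Since GPT-3, large language models have shown a striking capacity for in-context learning: they adapt to a new task from only a handful of input-output examples, with no parameter updates. The phenomenon appears across many datasets and task formats and is central to modern LLM workflows.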

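Research Motivation

Missing theory: Although ICL is widely regarded as a form of implicit Bayesian inference, existing theory does not fully exploit the theoretical relationship between ICL and Bayesian inference
Practical need: Modern LLM deployments share two constraints: prompts at inference time are short, and upstream pretraining covers heterogeneous task types. Both call for a concrete finite-sample analysis of prediction error
Theoretical gap: Existing work lacks a statistical theory that (i) jointly couples the pretraining scale $N$ and the prompt length $p$, and (ii) accommodates a mixture of heterogeneous task types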

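Limitations of Existing Methods

Earlier theory focuses mainly on information-theoretic analyses or nonparametric rates under specific architectures and settings
It does not fully capture the joint effect of $p$ and $N$
It offers no theoretical account of ICL behavior in mixed-task settings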

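Core Contributions

Principled risk decomposition: Proposes an orthogonal decomposition of the ICL risk: ICL risk = Bayes Gap + Posterior Variance
Non-asymptotic upper bound: For the uniform-attention Transformer, gives a non-asymptotic upper bound on the Bayes Gap that makes explicit the coupled dependence on the number of pretraining prompts $N$ and the context length $p$ (polylogarithmic factors omitted): $E[R_{BG}(M_{\hat{\theta}})] \lesssim m^{-2\alpha/d_{eff}} + \frac{m}{pN} + \frac{1}{N}$
Task identification theory: Proves that, within the task mixture, the posterior over the task index concentrates exponentially fast on the true task, so ICL converges rapidly to the optimal algorithm for the true task
Stability under distribution shift: Characterizes stability under input distribution shift, showing that the Bayes Gap grows in proportion to the Wasserstein distance between the distributions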

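Method Details

Task Definition

The paper considers a meta-learning framework that accommodates a finite mixture of $T$ distinct task types:

Prompt generation process:

Sample the task type: $I \sim \text{Categorical}(\alpha)$
Given $I=i$, sample the task function: $f \sim P_{F_i}$
For $k = 1, \ldots, p+1$:
- Sample the input: $x_k \overset{i.i.d.}{\sim} P_X$
- Generate the output: $y_k = f(x_k) + \varepsilon_k$
Form the prompt of length $p$: $P = (x_1,y_1,\ldots,x_p,y_p,x_{p+1})$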
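The generative process above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's setup: the uniform input distribution on $[-1,1]$, the Gaussian noise, and the two task families (`linear_task`, `sine_task`) are all hypothetical choices made only so that the example runs.

```python
import math
import random

def sample_prompt(p, alpha, task_samplers, noise_sd=0.1, rng=None):
    """Draw one prompt: task type, p labeled pairs, a query input, and its target."""
    if rng is None:
        rng = random.Random()
    # Step 1: task type I ~ Categorical(alpha)
    i = rng.choices(range(len(alpha)), weights=alpha, k=1)[0]
    # Step 2: task function f ~ P_{F_i}
    f = task_samplers[i](rng)
    # Step 3: p + 1 i.i.d. inputs (assumed uniform on [-1, 1]) with noisy outputs
    xs = [rng.uniform(-1.0, 1.0) for _ in range(p + 1)]
    ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    # The prompt keeps p labeled pairs plus the query x_{p+1}; y_{p+1} is held out
    context = list(zip(xs[:p], ys[:p]))
    return i, context, xs[p], ys[p]

def linear_task(rng):
    # Hypothetical task family 0: f(x) = w * x with a random slope
    w = rng.uniform(-1.0, 1.0)
    return lambda x: w * x

def sine_task(rng):
    # Hypothetical task family 1: f(x) = sin(3x + phase) with a random phase
    phase = rng.uniform(0.0, math.pi)
    return lambda x: math.sin(3.0 * x + phase)

rng = random.Random(0)
task_type, context, x_query, y_target = sample_prompt(
    p=5, alpha=[0.5, 0.5], task_samplers=[linear_task, sine_task], rng=rng)
```

Here `context` plays the role of $(x_1,y_1,\ldots,x_p,y_p)$ and `x_query` the role of $x_{p+1}$; both hypothetical families are bounded on $[-1,1]$, in the spirit of a bounded-task assumption.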

Model Architecture

Uniform-attention Transformer: $M_\theta(P^k) := \rho_\theta\left(\frac{1}{k}\sum_{i=1}^k \phi_\theta(x_i,y_i), x_{k+1}\right)$

where:

Feature encoder $\phi_\theta: U \to \Delta_{m-1}$: a feedforward ReLU network of depth $D_\phi$ followed by a renormalization layer
Decoder $\rho_\theta: \Delta_{m-1} \times C \to \mathbb{R}$: a feedforward ReLU network of depth $D_\rho$
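The structure of this predictor (encode each pair, average, decode together with the query) can be sketched as follows. The sketch is illustrative only: a single random ReLU layer stands in for the deep encoder, a softmax stands in for the renormalization layer (an assumption; the summary does not specify its form), and a linear map stands in for the decoder network. No training is performed.

```python
import math
import random

def softmax(z):
    # Maps a vector onto the probability simplex (assumed form of renormalization)
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

def make_model(m, seed=0):
    rng = random.Random(seed)
    # Random weights: hypothetical stand-ins for the trained networks phi and rho
    W = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(m)]
    b = [rng.gauss(0, 1) for _ in range(m)]
    V = [rng.gauss(0, 1) for _ in range(m + 1)]

    def phi(x, y):
        # One ReLU layer on (x, y), then renormalization onto the simplex
        h = [max(0.0, w[0] * x + w[1] * y + bj) for w, bj in zip(W, b)]
        return softmax(h)

    def rho(pooled, x_query):
        # Linear decoder on the pooled features and the query input
        return sum(v * u for v, u in zip(V, pooled + [x_query]))

    def model(context, x_query):
        # Uniform attention: every context pair receives weight 1/k
        k = len(context)
        feats = [phi(x, y) for x, y in context]
        pooled = [sum(f[j] for f in feats) / k for j in range(m)]
        return rho(pooled, x_query)

    return model, phi

model, phi = make_model(m=4)
ctx = [(0.1, 0.2), (-0.3, 0.5), (0.7, -0.1)]
out_original = model(ctx, 0.4)
out_reversed = model(list(reversed(ctx)), 0.4)
```

Because the context enters only through an average, `out_original` and `out_reversed` agree up to floating-point rounding: the architecture is permutation invariant in the context pairs by construction.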

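Bayes-Optimal Predictor

Minimizing the ICL risk is equivalent to minimizing the Bayes risk, and the optimal predictor is the posterior mean: $M_{\text{Bayes}}(P^k) := E_{I\sim P_{I|D^k}} E_{f\sim P_{F_I|D^k}}[f(x_{k+1})]$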
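For intuition, the posterior mean can be computed exactly in a toy setting where each task type contains only finitely many candidate functions. The sketch below assumes a uniform prior within each family and Gaussian noise with known standard deviation; these are simplifications chosen for illustration and are not taken from the paper.

```python
import math

def bayes_predict(context, x_query, alpha, families, noise_sd=0.5):
    """Posterior-mean prediction for a finite mixture of finite function families."""
    log_w, preds = [], []
    for a_i, family in zip(alpha, families):
        for f in family:
            # Log prior: mixture weight times a uniform prior inside the family
            lp = math.log(a_i / len(family))
            # Gaussian log-likelihood of the context under f (shared constants cancel)
            lp -= sum((y - f(x)) ** 2 for x, y in context) / (2 * noise_sd ** 2)
            log_w.append(lp)
            preds.append(f(x_query))
    # Normalize in log space for numerical stability
    mx = max(log_w)
    w = [math.exp(l - mx) for l in log_w]
    return sum(wi * pi for wi, pi in zip(w, preds)) / sum(w)

# Hypothetical families: type 0 = {x, -x}, type 1 = {constant 1}
families = [[lambda x: x, lambda x: -x], [lambda x: 1.0]]
alpha = [0.5, 0.5]

prior_pred = bayes_predict([], 1.0, alpha, families)
ctx = [(x, x) for x in (-1.0, -0.5, 0.5, 1.0)] * 5   # noise-free data from f(x) = x
post_pred = bayes_predict(ctx, 0.3, alpha, families)
post_pred_shuffled = bayes_predict(ctx[::-1], 0.3, alpha, families)
```

With an empty context the prediction is the prior mean; after twenty examples generated by $f(x)=x$ the posterior mass sits almost entirely on that function, and reordering the context leaves the prediction unchanged.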

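Technical Innovations

Theoretical basis for permutation invariance: Proves that the Bayes predictor is permutation invariant, which gives theoretical support for the uniform-attention architecture
Application of sequential learning theory: Uses sequential learning theory to handle the $p$ in-context examples within a prompt, combined with classical learning theory to handle the $N$ meta-training prompts
Optimal-transport approximation theory: Constructs a partition of unity based on soft histograms to encode the prompt, and approximates the Bayes predictor through a McShane extension with respect to a discrete 1-Wasserstein metric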

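Experimental Setup

Theoretical Analysis Framework

The paper is primarily a theoretical analysis and adopts the following setup:

Assumptions:

Assumption 1: bounded task functions, $|f(x)| \leq B_f$
Assumption 2: bounded inputs and conditional independence, $\|x\|_2 \leq B_X$

Network size:

Feature encoder: $S(\phi_\theta) \leq C_\phi m^{1/d_{eff}}$
Decoder: $S(\rho_\theta) \leq C_\rho m^{1/2}$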

Evaluation Metric

The ICL risk is defined as: $R(M) = \frac{1}{p}\sum_{k=1}^p E_{I,f,D^k,x_{k+1}}\left[(f(x_{k+1}) - M(P^k))^2\right]$
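The risk averages the squared error over every prefix length $k = 1, \ldots, p$, which a Monte Carlo estimate makes concrete. In the sketch below, the task family (random constant functions), the input distribution, the noise level, and the predictor `mean_model` are all hypothetical; they were picked because the exact risk is then easy to work out by hand.

```python
import random

def estimate_icl_risk(model, sample_f, p, n_prompts=2000, noise_sd=0.1, seed=0):
    """Monte Carlo estimate of the prefix-averaged squared error R(M)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_prompts):
        f = sample_f(rng)
        xs = [rng.uniform(-1.0, 1.0) for _ in range(p + 1)]
        ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
        for k in range(1, p + 1):
            # Prefix prompt: the first k labeled pairs plus the query x_{k+1}
            context = list(zip(xs[:k], ys[:k]))
            # Error is measured against the noiseless target f(x_{k+1})
            total += (f(xs[k]) - model(context, xs[k])) ** 2
    return total / (n_prompts * p)

def const_task(rng):
    # Hypothetical task family: constant functions f(x) = c
    c = rng.uniform(-1.0, 1.0)
    return lambda x: c

def mean_model(context, x_query):
    # Hypothetical predictor: the average of the observed outputs
    return sum(y for _, y in context) / len(context)

risk = estimate_icl_risk(mean_model, const_task, p=4)
```

For this toy pair the error at prefix length $k$ is the variance of an average of $k$ noise terms, $\sigma^2/k$, so the exact risk is $\sigma^2 (1 + \tfrac12 + \tfrac13 + \tfrac14)/4 \approx 0.0052$ for $\sigma = 0.1$; the estimate should land close to that value.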

Experimental Results

Main Theoretical Results

Theorem 1 (Risk decomposition): $R(M) = R_{BG}(M) + R_{PV}$, where:

Bayes Gap: $R_{BG}(M) := \frac{1}{p}\sum_{k=1}^p E[(M(P^k) - M_{\text{Bayes}}(P^k))^2]$
Posterior Variance: $R_{PV} := \frac{1}{p}\sum_{k=1}^p E[\text{Var}_{f\sim P(f|D^k)}(f(x_{k+1}))]$
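One way to see why the two components do not interact is the standard orthogonality argument for conditional expectations, sketched here for a single prefix length $k$ (this is a textbook argument and not a quotation of the paper's proof). Writing $M = M(P^k)$ and $M_B = M_{\text{Bayes}}(P^k)$:

$$
\begin{aligned}
E\left[(f(x_{k+1}) - M)^2\right]
&= E\left[(f(x_{k+1}) - M_B)^2\right] + E\left[(M_B - M)^2\right] \\
&\quad + 2\,E\left[(f(x_{k+1}) - M_B)(M_B - M)\right].
\end{aligned}
$$

Conditional on the prompt $P^k$, the factor $M_B - M$ is fixed, and $E[f(x_{k+1}) - M_B \mid P^k] = 0$ because $M_B$ is the posterior mean, so the cross term vanishes by the tower property. The first term equals $E[\text{Var}(f(x_{k+1}) \mid P^k)]$, and averaging both remaining terms over $k = 1, \ldots, p$ gives $R(M) = R_{BG}(M) + R_{PV}$.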

Theorem 2 (Upper bound on the Bayes Gap): Under a Hölder condition, for the uniform-attention Transformer: $E[R_{BG}(M_{\hat{\theta}})] \lesssim m^{-2\alpha/d_{eff}} + \frac{m}{pN}\text{polylog}(pN) + \frac{1}{N}\text{polylog}(pN)$

Choosing $m^* \asymp (pN)^{d_{eff}/(d_{eff}+2\alpha)}$ yields: $E[R_{BG}(M_{\hat{\theta}})] \lesssim (pN)^{-2\alpha/(d_{eff}+2\alpha)} + N^{-1}$
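The choice of $m^*$ follows from balancing the approximation term against the estimation term. The worked steps below ignore polylogarithmic factors and read $m$ as the encoder's feature dimension, $\alpha$ as the Hölder smoothness exponent, and $d_{eff}$ as an effective dimension; the latter two readings are an interpretation, since this summary does not define them.

$$
\begin{aligned}
m^{-2\alpha/d_{eff}} = \frac{m}{pN}
\;&\Longleftrightarrow\; m^{(d_{eff}+2\alpha)/d_{eff}} = pN
\;\Longleftrightarrow\; m = (pN)^{d_{eff}/(d_{eff}+2\alpha)}, \\
(m^*)^{-2\alpha/d_{eff}} &= (pN)^{-2\alpha/(d_{eff}+2\alpha)}, \qquad
\frac{m^*}{pN} = (pN)^{\frac{d_{eff}}{d_{eff}+2\alpha} - 1} = (pN)^{-2\alpha/(d_{eff}+2\alpha)}.
\end{aligned}
$$

Both terms therefore decay at the same rate in the product $pN$, while the remaining $1/N$ term depends only on the number of pretraining prompts: a longer context cannot compensate for having too few prompts.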

Theorem 3 (Posterior Variance analysis): Under a log-likelihood-ratio condition: $E_{D^k,x|I=i^*}[\text{Var}_{f|D^k}\{f(x)\}] \leq \inf_M \sup_{f\in F_{i^*}} E[(f(x_{k+1}) - M(P^k))^2|f] + 5B_f^2\left(\frac{1-\alpha_{i^*}}{\alpha_{i^*}}e^{-D_{\min}k/2} + (T-1)e^{-Ck}\right)$
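The second term of this bound is the price of not knowing the task type, and evaluating it numerically shows how quickly it fades with $k$. The values used below for $D_{\min}$, $C$, $T$, $\alpha_{i^*}$ and $B_f$ are hypothetical placeholders; the theorem as summarized here does not supply concrete constants.

```python
import math

def mixture_term(k, alpha_true, T, d_min, c, b_f=1.0):
    """Evaluate 5 B_f^2 ((1 - a)/a * exp(-D_min k / 2) + (T - 1) exp(-C k))."""
    prior_odds = (1.0 - alpha_true) / alpha_true
    return 5.0 * b_f ** 2 * (prior_odds * math.exp(-d_min * k / 2.0)
                             + (T - 1) * math.exp(-c * k))

# Hypothetical constants: 4 task types, true type has prior weight 0.25
values = {k: mixture_term(k, alpha_true=0.25, T=4, d_min=1.0, c=1.0)
          for k in (0, 1, 5, 10, 20)}
for k, v in values.items():
    print(f"k={k:2d}  mixture term = {v:.3e}")
```

Under these placeholder constants the term falls by several orders of magnitude within twenty examples, leaving the minimax risk of the true task class $F_{i^*}$ as the dominant part of the bound.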