
Pathwise guessing in categorical time series with unbounded alphabets

Chazottes, Gallo, Takahashi

The following learning problem arises naturally in various applications: Given a finite sample from a categorical or count time series, can we learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data using the values from the remaining parts? Unlike classical approaches in statistical inference, our approach avoids explicitly estimating the conditional probabilities. We propose a non-parametric guessing function with a learning rate independent of the alphabet size. Our analysis focuses on a broad class of time series models that encompasses finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures. We provide a margin condition that controls the rate of convergence for the risk. Additionally, we establish a minimax lower bound for the convergence rate of the risk associated with our guessing problem. This lower bound matches the upper bound achieved by our estimator up to a logarithmic factor, demonstrating its near-optimality.


Basic information

Paper ID: 2501.06547
Title: Pathwise guessing in categorical time series with unbounded alphabets
Authors: J.-R. Chazottes, S. Gallo, D. Y. Takahashi
Categories: math.ST math.PR stat.TH
Published: October 16, 2025
Paper link: https://arxiv.org/abs/2501.06547

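Abstract

The paper studies a learning problem that arises naturally in many applications: given a finite sample from a categorical or count time series, can one learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data from the values of the remaining parts? Unlike classical approaches in statistical inference, the method avoids explicitly estimating the conditional probabilities. The authors propose a non-parametric guessing function whose learning rate is independent of the alphabet size. The analysis covers a broad class of time series models that includes finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures.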

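Background and motivation

Importance of the problem

Driven by applications: Prediction and interpolation are fundamental problems in science and are widely used for categorical time series, particularly with the rise of large language models, which can be viewed as categorical time series models over large alphabets.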
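Limitations of traditional methods:
- Classical methods rely on pointwise estimates of all transition probabilities
- Guessing becomes difficult when the alphabet is large or the transition probabilities are small
- Accurately estimating rare events requires large amounts of data, which is infeasible in practice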
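Current challenges:
- Both the alphabet size and the order of dependence are typically large
- Models with unbounded dependence and unbounded alphabet size must be handled
- With large alphabets, traditional methods may struggle to estimate the probabilities of all possible transitions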

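Research motivation

The authors propose a more practical approach: focus on the most likely events, that is, predict the most probable outcome and give less weight to rare, unlikely events. This approach is particularly suited to sequences over large or infinite symbol sets.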

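Core contributions

A non-parametric guessing function: its learning rate is independent of the alphabet size, and it applies to a broad class of categorical time series
A theoretical framework: valid for arbitrary alphabet size, with relaxed constraints on memory or order
A margin condition: controls the rate of convergence of the risk
A minimax lower bound: shows that the proposed estimator is nearly optimal, since the lower bound matches the upper bound up to a logarithmic factor
First treatment of the infinite-alphabet case: relevant when the alphabet size has no a priori upper bound or may grow with the sample size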

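Method details

Task definition

Given two independent copies $(X_j)_{j \in \mathbb{Z}}$ and $(Y_j)_{j \in \mathbb{Z}}$ of the same process, the goal is to use the values on the data set $D$ to guess the values on the guessing set $G$. As the formulas indicate, the guessing function is built from the sample $X^n_1$ and its risk is measured on the copy $Y$.

Estimator: $\hat{f}^n_{D,G} : A^n \times A^D \to A^G$

Excess risk: $R(\hat{f}^n_{D,G}) := \sup_{b \in A^D} \left( \tilde{P}(\hat{f}^n_{D,G}(Y_D) \neq Y_G | Y_D = b) - \inf_{a \in A^G} \tilde{P}(a \neq Y_G | Y_D = b) \right) \tilde{P}(Y_D = b)$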

Model architecture

Core estimator: $\hat{f}^n_{D,G}[X^n_1](b) := \arg\max_{a \in A^G} \frac{N^n_{D,G}[X^n_1](b,a)}{N^n_{D,G}[X^n_1](b)}$

where the counting function is defined as: $N^n_{D,G}[X^n_1](b,a) := \sum_{i=0}^{n-1} \mathbf{1}\{X_{\theta^i D} = b, X_{\theta^i G} = a\}$
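
The counting estimator can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `D` and `G` are encoded as tuples of integer offsets, the shift $\theta^i$ is read as adding `i` to every offset, only shifts whose indices all fall inside the sample are counted, and an unseen context returns `None`. The names `fit_guesser` and `guess` are hypothetical. Because $N^n_{D,G}[X^n_1](b)$ does not depend on $a$, maximizing the ratio is the same as maximizing the joint count.

```python
from collections import Counter, defaultdict

def fit_guesser(x, D, G):
    """Build a counting-based guessing function from the sample x.

    D, G: tuples of integer offsets (assumed encoding of the index sets).
    """
    offsets = tuple(D) + tuple(G)
    lo, hi = min(offsets), max(offsets)
    counts = defaultdict(Counter)  # counts[b][a] plays the role of N(b, a)
    # Assumption: keep only shifts i for which every index i + offset is in range.
    for i in range(-lo, len(x) - hi):
        b = tuple(x[i + d] for d in D)
        a = tuple(x[i + g] for g in G)
        counts[b][a] += 1

    def guess(b):
        seen = counts.get(tuple(b))
        if not seen:
            return None  # context never observed (behaviour chosen for this sketch)
        # N(b) is constant in a, so the argmax of N(b, a) / N(b) is the argmax of N(b, a).
        return seen.most_common(1)[0][0]

    return guess

# Guess the next symbol (offset 0) from the previous one (offset -1).
guess = fit_guesser(list("abababac"), D=(-1,), G=(0,))
print(guess(("a",)), guess(("b",)), guess(("z",)))
```

Note that the sketch never estimates a full conditional distribution: it only tracks which continuation is most frequent for each observed context.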

Main assumptions

Assumption A: Let $(X_i)_{i \in \mathbb{Z}}$ be a stationary process with law $P$. The process satisfies Assumption A if: $\Gamma(P) := \prod_{j=0}^{\infty} (1 - \text{Var}_j(p)) > 0$

where the variation is defined as: $\text{Var}_n(p) := \sup\left\{\frac{1}{2}\sum_{a \in A}|p(a|x) - p(a|y)| : x,y \in A^{\mathbb{Z}_-}, x_i = y_i, i \geq -n\right\}$

Margin condition

For each $b \in A^D$, define: $\delta_{D,G}(b) = \inf\{P(X_G \neq c, X_D = b) - \inf_{a \in A^G} P(X_G \neq a, X_D = b) > 0 : c \in A^G\}$

The margin is: $\delta_{D,G} := \inf_{b \in A^D} \delta_{D,G}(b)$
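
Since $P(X_G \neq c, X_D = b) = P(X_D = b) - P(X_G = c, X_D = b)$, the quantity inside the infimum equals $\max_a P(X_G = a, X_D = b) - P(X_G = c, X_D = b)$, so $\delta_{D,G}(b)$ is the gap between the largest joint probability and the next strictly smaller one. The following sketch computes this for a finite, explicitly listed set of guesses. The function name `margin`, the dictionary encoding of the joint law, and the toy probabilities are assumptions made for illustration only.

```python
from fractions import Fraction

def margin(joint, contexts, guesses):
    """joint[(b, a)] = P(X_D = b, X_G = a); missing keys mean probability 0.

    Returns the per-context margins and their minimum.
    """
    deltas = {}
    for b in contexts:
        probs = [joint.get((b, a), 0) for a in guesses]
        top = max(probs)
        # Strictly positive gaps between the best guess and every other guess.
        gaps = [top - p for p in probs if top - p > 0]
        deltas[b] = min(gaps) if gaps else float("inf")  # infimum of an empty set
    return deltas, min(deltas.values())

# Toy joint law on binary contexts and guesses (illustrative numbers only).
joint = {
    ("0", "0"): Fraction(4, 10), ("0", "1"): Fraction(1, 10),
    ("1", "0"): Fraction(2, 10), ("1", "1"): Fraction(3, 10),
}
per_context, delta = margin(joint, contexts=["0", "1"], guesses=["0", "1"])
print(per_context, delta)
```

Exact fractions are used here to avoid floating-point noise when testing whether a gap is strictly positive.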

Main theoretical results

Upper bound (Theorem 3.1)

If the sample size $n$ satisfies a specific condition, then: $R(\hat{f}^n_{D,G}) \leq \varepsilon \land \beta_{D,G}$

Rate of convergence (Corollary 3.1)

Weak margin: if $\delta_n\sqrt{\frac{n}{\log n}} \to 0$, then: $R(\hat{f}^n_{D,G}) \leq \frac{1}{2}\sqrt{\frac{\log n}{n}} \land \beta_{D,G}$
Strong margin: if $\delta_n\sqrt{\frac{n}{\log n}} \to \infty$, then: $R(\hat{f}^n_{D,G}) \leq \exp\left(-\frac{\Gamma^2 n \delta_n^2}{8(|G|+|D|)^2}\right) \land \beta_{D,G}$
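
To get a feel for the two regimes, the right-hand sides of the corollary can be evaluated numerically. The sketch below is only arithmetic on the displayed formulas: the cap by $\beta_{D,G}$ is left out because its value is not specified here, the margin is held fixed instead of varying with $n$, and the parameter values (`delta=0.1`, `gamma=0.5`, `size=2`) are arbitrary choices for illustration.

```python
import math

def slow_rate(n):
    """Right-hand side (1/2) * sqrt(log n / n) of the weak-margin bound."""
    return 0.5 * math.sqrt(math.log(n) / n)

def fast_rate(n, delta, gamma, size):
    """Right-hand side exp(-Gamma^2 n delta^2 / (8 size^2)), with size = |G| + |D|."""
    return math.exp(-(gamma ** 2) * n * delta ** 2 / (8 * size ** 2))

for n in (10**4, 10**6):
    print(n, slow_rate(n), fast_rate(n, delta=0.1, gamma=0.5, size=2))
```

With these illustrative values the exponential expression is still larger than the $\sqrt{\log n / n}$ expression at $n = 10^4$ and far smaller at $n = 10^6$, which shows that the exponential bound only becomes small once $n\delta_n^2$ is large relative to $(|G|+|D|)^2/\Gamma^2$.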

Minimax lower bound (Theorem 3.2)

Minimax lower bounds are established in two regimes:

Small margin: $\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{P}_n} R(\psi_n; P) \geq \frac{e^{-1}}{\sqrt{n}}\left(\frac{1}{4}\right)^{|G|+|D|}$
Large margin: $\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{Q}_n} R(\psi_n; P) \geq \delta_n e^{-n\delta_n^2}\left(\frac{1}{4}\right)^{|D|+|G|}$