
Pathwise guessing in categorical time series with unbounded alphabets

Chazottes, Gallo, Takahashi

The following learning problem arises naturally in various applications: Given a finite sample from a categorical or count time series, can we learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data using the values from the remaining parts? Unlike classical approaches in statistical inference, our approach avoids explicitly estimating the conditional probabilities. We propose a non-parametric guessing function with a learning rate independent of the alphabet size. Our analysis focuses on a broad class of time series models that encompasses finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures. We provide a margin condition that controls the rate of convergence for the risk. Additionally, we establish a minimax lower bound for the convergence rate of the risk associated with our guessing problem. This lower bound matches the upper bound achieved by our estimator up to a logarithmic factor, demonstrating its near-optimality.

Basic Information

Paper ID: 2501.06547
Title: Pathwise guessing in categorical time series with unbounded alphabets
Authors: J.-R. Chazottes, S. Gallo, D. Y. Takahashi
Classification: math.ST math.PR stat.TH
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2501.06547

Abstract

This paper investigates a learning problem that arises naturally in various applications: given a finite sample of a categorical or count time series, can one learn a function of the sample that (approximately) maximizes the probability of correctly guessing the values on a given portion of the data from the values on the remaining portion? Unlike classical statistical inference methods, the proposed approach avoids explicit estimation of conditional probabilities. The authors present a nonparametric guessing function whose learning rate is independent of the alphabet size. The analysis covers a broad class of time series models including finite-order Markov chains, certain hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures.

Research Background and Motivation

Importance of the Problem

Practical Applications: Prediction and interpolation are fundamental problems in science with widespread applications in categorical time series, particularly in the context of large language models, which can be viewed as categorical time series models with large alphabets.
Limitations of Traditional Methods:
- Classical methods rely on pointwise estimation of all transition probabilities
- When alphabet size is large or transition probabilities are small, guessing becomes difficult
- Accurate estimation of rare events requires large amounts of data, which is impractical
- Traditional approaches may struggle to estimate probabilities of all possible transitions in large alphabet cases
Existing Challenges:
- Both alphabet size and dependency order are typically high
- Need to handle models with unbounded dependencies and alphabet sizes
- Conventional methods may be inefficient for large alphabet scenarios

Research Motivation

The authors propose a more practical approach: focusing on the most likely events, i.e., predicting the most probable outcomes while assigning less weight to rare, unlikely events. This approach is particularly suitable for handling sequences with large or infinite symbol sets.

Core Contributions

Proposes a nonparametric guessing function: Learning rate independent of alphabet size, applicable to a broad class of categorical time series
Establishes a theoretical framework: Applicable to arbitrary alphabet sizes, relaxing constraints on memory or order
Provides margin conditions: Controlling convergence rates of risk
Establishes minimax lower bounds: Proving approximate optimality of the proposed estimator, with lower and upper bounds matching up to logarithmic factors
First consideration of infinite alphabet case: Important when alphabet size has no prior upper bound or grows with sample size

Methodology Details

Task Definition

Given two independent copies $(X_j)_{j \in \mathbb{Z}}$ and $(Y_j)_{j \in \mathbb{Z}}$ of the same process, a guessing function is learned from the sample $X^n_1$ and then used to guess the values of $Y$ on the guessing set $G$ from its values on the data set $D$ , where $D$ and $G$ are sets of time indices.

Estimator Definition: $\hat{f}^n_{D,G} : A^n \times A^D \to A^G$

Excess Risk: $R(\hat{f}^n_{D,G}) := \sup_{b \in A^D} \left( \tilde{P}(\hat{f}^n_{D,G}(Y_D) \neq Y_G | Y_D = b) - \inf_{a \in A^G} \tilde{P}(a \neq Y_G | Y_D = b) \right) \tilde{P}(Y_D = b)$
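
One way to read this definition, sketched here for a guess map $g : A^D \to A^G$ that is fixed rather than learned from the sample (an assumption made only for this illustration): multiplying the conditional probabilities by $\tilde{P}(Y_D = b)$ turns them into joint probabilities, and the marginal $\tilde{P}(Y_D = b)$ cancels in the difference.

$$
\left( \tilde{P}(g(b) \neq Y_G \mid Y_D = b) - \inf_{a \in A^G} \tilde{P}(a \neq Y_G \mid Y_D = b) \right) \tilde{P}(Y_D = b) = \sup_{a \in A^G} \tilde{P}(Y_G = a, Y_D = b) - \tilde{P}(Y_G = g(b), Y_D = b)
$$

For a fixed $g$ , the excess risk is therefore the largest shortfall, over contexts $b$ , between the joint probability of the best possible guess and that of the guess actually made. It vanishes when $g(b)$ attains the largest joint probability for every $b$ .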

Model Architecture

Core Estimator: $\hat{f}^n_{D,G}[X^n_1](b) := \arg\max_{a \in A^G} \frac{N^n_{D,G}[X^n_1](b,a)}{N^n_{D,G}[X^n_1](b)}$

where the counting function is defined as: $N^n_{D,G}[X^n_1](b,a) := \sum_{i=0}^{n-1} \mathbf{1}\{X_{\theta^i D} = b, X_{\theta^i G} = a\}$
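
The counting estimator above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `fit_guesser` and `guess` are hypothetical, $D$ and $G$ are passed as tuples of integer offsets, and shifts that would reach outside the sample are simply skipped (a boundary convention assumed here, which the summary does not specify).

```python
from collections import Counter, defaultdict


def fit_guesser(sample, D, G):
    """Build an empirical guessing function from a sample.

    sample: sequence of hashable symbols (x_1, ..., x_n), stored 0-based.
    D, G:   tuples of integer offsets (data set and guessing set).
    """
    offsets = tuple(D) + tuple(G)
    lo, hi = min(offsets), max(offsets)
    counts = defaultdict(Counter)  # counts[b][a] plays the role of N(b, a)
    # Assumption: only shifts that keep every index inside the sample are counted.
    for i in range(-lo, len(sample) - hi):
        b = tuple(sample[i + d] for d in D)
        a = tuple(sample[i + g] for g in G)
        counts[b][a] += 1

    def guess(b, default=None):
        seen = counts.get(tuple(b))
        if not seen:
            return default  # context b was never observed in the sample
        # N(b) does not depend on a, so maximizing N(b, a) / N(b)
        # is the same as maximizing N(b, a).
        return seen.most_common(1)[0][0]

    return guess


# Usage: guess the current symbol (offset 0) from the previous one (offset -1).
sample = [0, 1, 0, 1, 0, 1, 0, 2]
guess = fit_guesser(sample, D=(-1,), G=(0,))
print(guess((0,)))  # most frequent successor of 0 in the sample
```

Note that the function only stores counts for contexts that actually occur, so memory grows with the number of distinct observed patterns rather than with the alphabet size. Ties are broken by first occurrence, which is another arbitrary choice of this sketch.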

Main Assumptions

Assumption A: Let $(X_i)_{i \in \mathbb{Z}}$ be a stationary process with measure $P$ . It satisfies Assumption A if: $\Gamma(P) := \prod_{j=0}^{\infty} (1 - \text{Var}_j(p)) > 0$

where variation is defined as: $\text{Var}_n(p) := \sup\left\{\frac{1}{2}\sum_{a \in A}|p(a|x) - p(a|y)| : x,y \in A^{\mathbb{Z}_-}, x_i = y_i, i \geq -n\right\}$
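
To get a feel for Assumption A, recall the standard fact that an infinite product $\prod_j (1 - a_j)$ with $0 \leq a_j < 1$ is positive exactly when $\sum_j a_j < \infty$ . The sketch below evaluates partial products for two hypothetical variation sequences chosen only for illustration (they are not taken from the paper): a geometric one, which is summable, and a harmonic-type one, which is not.

```python
def gamma_partial(var, terms):
    """Partial product prod_{j=0}^{terms-1} (1 - var(j)) for a variation sequence var."""
    prod = 1.0
    for j in range(terms):
        prod *= 1.0 - var(j)
    return prod


def geometric(j):
    """Hypothetical summable variation: Var_j = 2^{-(j+1)}."""
    return 0.5 * 0.5 ** j


def harmonic(j):
    """Hypothetical non-summable variation: Var_j = 1 / (j + 2)."""
    return 1.0 / (j + 2)


for terms in (10, 100, 1000):
    print(terms, gamma_partial(geometric, terms), gamma_partial(harmonic, terms))
```

The geometric partial products stabilize at a positive value, while the harmonic ones telescope to $1/(\text{terms}+1)$ and tend to zero, so only the first sequence would be compatible with $\Gamma(P) > 0$ .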

Margin Conditions

For each $b \in A^D$ , define: $\delta_{D,G}(b) = \inf\{P(X_G \neq c, X_D = b) - \inf_{a \in A^G} P(X_G \neq a, X_D = b) > 0 : c \in A^G\}$

Margin: $\delta_{D,G} := \inf_{b \in A^D} \delta_{D,G}(b)$
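
Since $P(X_G \neq c, X_D = b) = P(X_D = b) - P(X_G = c, X_D = b)$ , the quantity inside the infimum equals $\max_a P(X_G = a, X_D = b) - P(X_G = c, X_D = b)$ , so $\delta_{D,G}(b)$ is the gap between the largest joint probability and the next strictly smaller one. The following sketch computes it from a table of joint probabilities. The table layout `joint[b][a]` and the function names are assumptions of this example, and the table is assumed to list every $a \in A^G$ that matters for each $b$ .

```python
def margin_at(joint_b):
    """Margin delta(b) from joint_b[a] = P(X_G = a, X_D = b)."""
    top = max(joint_b.values())
    # Only candidates c that are strictly worse than the best guess contribute.
    gaps = [top - p for p in joint_b.values() if p < top]
    return min(gaps) if gaps else float("inf")  # infimum of the empty set


def margin(joint):
    """Overall margin: the smallest delta(b) over the contexts b in the table."""
    return min(margin_at(joint_b) for joint_b in joint.values())


# Hypothetical joint distribution with two contexts.
joint = {
    "b0": {"x": 0.30, "y": 0.20, "z": 0.10},
    "b1": {"x": 0.35, "y": 0.05},
}
print(margin_at(joint["b0"]), margin_at(joint["b1"]), margin(joint))
```

A small margin means that some context has a runner-up almost as likely as the best guess, which is the situation where telling them apart from data is hardest.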

Main Theoretical Results

Upper Bound Results (Theorem 3.1)

If the sample size $n$ satisfies certain conditions, then: $R(\hat{f}^n_{D,G}) \leq \varepsilon \land \beta_{D,G}$ , where $\land$ denotes the minimum.

Convergence Rates (Corollary 3.1)

When the margin condition is weak: If $\delta_n\sqrt{\frac{n}{\log n}} \to 0$ , then: $R(\hat{f}^n_{D,G}) \leq \frac{1}{2}\sqrt{\frac{\log n}{n}} \land \beta_{D,G}$
When the margin condition is strong: If $\delta_n\sqrt{\frac{n}{\log n}} \to \infty$ , then: $R(\hat{f}^n_{D,G}) \leq \exp\left(-\frac{\Gamma^2 n \delta_n^2}{8(|G|+|D|)^2}\right) \land \beta_{D,G}$
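
The two regimes can be compared numerically with a small sketch that evaluates the displayed bounds. The function names are hypothetical, and because this summary does not define $\beta_{D,G}$ , it is passed as an optional cap that defaults to no cap at all.

```python
import math


def weak_margin_bound(n, beta=float("inf")):
    """(1/2) * sqrt(log n / n), capped by beta."""
    return min(0.5 * math.sqrt(math.log(n) / n), beta)


def strong_margin_bound(n, delta, gamma, size_G, size_D, beta=float("inf")):
    """exp(-Gamma^2 * n * delta^2 / (8 * (|G| + |D|)^2)), capped by beta."""
    exponent = -(gamma ** 2) * n * delta ** 2 / (8 * (size_G + size_D) ** 2)
    return min(math.exp(exponent), beta)


# Illustrative values only: Gamma = 0.5, fixed margin 0.1, |G| = |D| = 1.
for n in (10**3, 10**4, 10**5):
    print(n, weak_margin_bound(n), strong_margin_bound(n, 0.1, 0.5, 1, 1))
```

With a margin that stays fixed, the exponential bound eventually falls far below the $\sqrt{\log n / n}$ rate, but for moderate $n$ the constants $\Gamma$ and $|G|+|D|$ can make it the weaker of the two.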

Minimax Lower Bounds (Theorem 3.2)

Establishes minimax lower bounds in two scenarios:

Small margin case: $\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{P}_n} R(\psi_n; P) \geq \frac{e^{-1}}{\sqrt{n}}\left(\frac{1}{4}\right)^{|G|+|D|}$
Large margin case: $\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{Q}_n} R(\psi_n; P) \geq \delta_n e^{-n\delta_n^2}\left(\frac{1}{4}\right)^{|D|+|G|}$
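
To see what a match up to a logarithmic factor looks like in the small margin case, one can divide the weak-margin upper bound of Corollary 3.1 by the lower bound above. This is only a comparison of the expressions as displayed in this summary, ignoring the cap by $\beta_{D,G}$ and assuming both bounds are read for the same $D$ and $G$ :

$$
\frac{\frac{1}{2}\sqrt{\frac{\log n}{n}}}{\frac{e^{-1}}{\sqrt{n}}\left(\frac{1}{4}\right)^{|G|+|D|}} = \frac{e}{2}\, 4^{|G|+|D|} \sqrt{\log n}
$$

For fixed $D$ and $G$ the ratio grows only like $\sqrt{\log n}$ , while the dependence on $|G|+|D|$ enters through a constant factor.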

Application Examples

The paper demonstrates that Assumption A applies to various important models:

Markov Chains

For Markov chains with state space $A$ and transition matrix $Q$ , the condition simplifies to a bound on the Dobrushin ergodic coefficient: $d(Q) := \sup_{a,b \in A} \|Q(a,\cdot) - Q(b,\cdot)\|_{TV} < 1$
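
For a finite state space the coefficient can be computed directly from the rows of $Q$ . The sketch below assumes the total variation distance is half the $\ell^1$ distance, in line with the factor $\frac{1}{2}$ in the definition of $\text{Var}_n(p)$ ; the function name `dobrushin` and the example matrices are hypothetical.

```python
def dobrushin(Q):
    """d(Q): largest total variation distance between two rows of Q.

    Q is a list of rows, each row a probability vector over the same finite state space.
    """
    rows = range(len(Q))
    return max(
        0.5 * sum(abs(p - q) for p, q in zip(Q[a], Q[b]))
        for a in rows
        for b in rows
    )


mixing = [[0.9, 0.1], [0.2, 0.8]]   # rows overlap, so d(Q) < 1
frozen = [[1.0, 0.0], [0.0, 1.0]]   # rows have disjoint support, so d(Q) = 1
print(dobrushin(mixing), dobrushin(frozen))
```

The second matrix shows how the condition fails: when two states lead to transition rows with disjoint supports, the coefficient equals 1.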

Autoregressive Models

Binary autoregressive process with transition probabilities: $p(a|x) = \Upsilon\left(a\sum_{j=1}^{\infty}\xi_j x_{-j} + a\xi_0\right)$

Poisson Regression

Poisson regression model for count time series: $p(a|x) = \frac{e^{-v(x)}v(x)^a}{a!}$ where $v(x) = \exp\left(\sum_{j=1}^{\infty}\xi_j \min\{x_{-j}, c\}\right)$
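
The transition probabilities of this model are easy to evaluate once the infinite sum is truncated. The sketch below assumes that only finitely many coefficients $\xi_j$ are nonzero, so the past is a finite list with `past[0]` standing for $x_{-1}$ ; the function names and the numerical values are hypothetical.

```python
import math


def poisson_intensity(past, xi, c):
    """v(x) = exp(sum_j xi_j * min(x_{-j}, c)), truncated to len(xi) terms."""
    return math.exp(sum(w * min(x, c) for w, x in zip(xi, past)))


def poisson_transition(a, past, xi, c):
    """p(a | x) = exp(-v(x)) * v(x)^a / a! for a count a = 0, 1, 2, ..."""
    v = poisson_intensity(past, xi, c)
    return math.exp(-v) * v ** a / math.factorial(a)


def most_likely_count(past, xi, c):
    """A mode of the Poisson law with mean v(x), namely floor(v(x))."""
    return math.floor(poisson_intensity(past, xi, c))


xi = [0.5, -0.2]      # hypothetical coefficients xi_1, xi_2
past = [3, 10]        # x_{-1} = 3, x_{-2} = 10 (clipped to c below)
print(poisson_intensity(past, xi, c=4), most_likely_count(past, xi, c=4))
```

The clipping $\min\{x_{-j}, c\}$ keeps the intensity bounded even though the counts themselves are unbounded. When guessing the next count from the full relevant past, the best guess is a mode of this Poisson law, which is why no estimate of the full distribution is needed.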

Gibbs Measures

One-dimensional Gibbs measures satisfying: $P(X_\Lambda = x_\Lambda | X_{\Lambda^c} = y_{\Lambda^c}) = \frac{\exp(-\beta H^\Phi_\Lambda(x_\Lambda y_{\Lambda^c}))}{Z^\Phi_\Lambda(y)}$

Technical Innovations

Avoids explicit probability estimation: No need to estimate all conditional probabilities; focuses only on the most likely outcomes
Alphabet-size-independent learning rate: Key advantage for handling large or infinite alphabets
Dvoretzky-Kiefer-Wolfowitz-type inequalities: Establishes new concentration inequalities for random chains
Unified framework: Covers a broad class of time series models

Experimental and Proof Techniques

Main Proof Techniques

Concentration inequalities: Uses modified Dvoretzky-Kiefer-Wolfowitz inequalities
Coupling methods: Controls probability differences under different conditions
Le Cam-type arguments: Establishes minimax lower bounds
Variational analysis: Controls oscillation of variations through potential functions

Key Lemmas

Proposition 3.1: Establishes relationship between $\beta_{D,G}$ and set sizes
Proposition 4.1: Provides concrete variational bounds for Gibbs measures
Theorem A.1: Extension of Dvoretzky-Kiefer-Wolfowitz-type inequalities

Traditional Methods

Classical prediction: Based on pointwise estimation of transition probabilities
PAC learning framework: Studies optimal rates for learning conditional probabilities
Parametric regression models: Flexible but with restrictive assumptions

Advantages of This Work

Handles large alphabets: Learning rate independent of alphabet size
Nonparametric approach: Avoids restrictive assumptions of parametric models
Theoretical guarantees: Provides approximately optimal convergence rates

Conclusions and Discussion

Main Conclusions

Proposes a nonparametric guessing method applicable to unbounded alphabets
Establishes learning rates independent of alphabet size
Proves approximate optimality of the method (up to logarithmic factors)
Provides a unified framework for a broad class of time series models

Limitations

Verification of Assumption A: Verifying Assumption A in practical applications may be challenging
Finite sample performance: Theoretical results are asymptotic; finite sample behavior may differ
Computational complexity: The paper does not discuss algorithmic computational complexity in detail

Future Directions

Algorithm implementation: Develop efficient algorithmic implementations
Practical applications: Validate the method in practical applications such as large language models
Extension to other loss functions: Consider different risk measures

In-Depth Evaluation

Strengths

Significant theoretical contribution: First to address infinite alphabet case with complete theoretical framework
Strong methodological innovation: Avoiding explicit probability estimation has practical value
Rigorous analysis: Provides upper bounds and matching lower bounds, proving approximate optimality
Broad applicability: Unified framework covers multiple important time series models

Weaknesses

Lack of experimental validation: Paper is purely theoretical without numerical experiments or practical applications
Insufficient algorithmic details: Limited discussion of practical implementation and computational complexity
Difficult assumption verification: Methods for verifying Assumption A in practice are unclear

Impact

High theoretical value: Provides new theoretical tools for handling large alphabet time series
Significant practical potential: Important for modern applications like large language models
Method generality: Framework potentially applicable to related problems

Applicable Scenarios

Large language models: Text generation tasks with large vocabularies
Bioinformatics: DNA/protein sequence analysis
Network traffic analysis: Network behavior prediction with large state spaces
Financial time series: High-frequency trading data analysis

References

The paper cites 26 relevant references spanning multiple fields including Markov chain theory, statistical learning theory, dynamical systems, and probability theory, providing solid theoretical foundations for this work.