2025-11-24T23:22:17.314102

Pathwise guessing in categorical time series with unbounded alphabets

Chazottes, Gallo, Takahashi
The following learning problem arises naturally in various applications: Given a finite sample from a categorical or count time series, can we learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data using the values from the remaining parts? Unlike classical approaches in statistical inference, our approach avoids explicitly estimating the conditional probabilities. We propose a non-parametric guessing function with a learning rate independent of the alphabet size. Our analysis focuses on a broad class of time series models that encompasses finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures. We provide a margin condition that controls the rate of convergence for the risk. Additionally, we establish a minimax lower bound for the convergence rate of the risk associated with our guessing problem. This lower bound matches the upper bound achieved by our estimator up to a logarithmic factor, demonstrating its near-optimality.
academic

Pathwise guessing in categorical time series with unbounded alphabets

Basic Information

  • Paper ID: 2501.06547
  • Title: Pathwise guessing in categorical time series with unbounded alphabets
  • Authors: J.-R. Chazottes, S. Gallo, D. Y. Takahashi
  • Classification: math.ST math.PR stat.TH
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2501.06547

Abstract

This paper investigates a learning problem that naturally arises in various applications: given a finite sample of a categorical or count time series, can one learn a sample function that (approximately) maximizes the probability of correctly guessing the value of a given portion of data using the remaining data? Unlike classical statistical inference methods, the proposed approach avoids explicit estimation of conditional probabilities. The authors present a nonparametric guessing function with learning rates independent of alphabet size, with analysis covering a broad class of time series models including finite-order Markov chains, certain hidden Markov chains, Poisson regression for counting processes, and one-dimensional Gibbs measures.

Research Background and Motivation

Importance of the Problem

  1. Practical Applications: Prediction and interpolation are fundamental problems in science with widespread applications in categorical time series, particularly in the context of large language models, which can be viewed as categorical time series models with large alphabets.
  2. Limitations of Traditional Methods:
    • Classical methods rely on pointwise estimation of all transition probabilities
    • When alphabet size is large or transition probabilities are small, guessing becomes difficult
    • Accurate estimation of rare events requires large amounts of data, which is impractical
    • Traditional approaches may struggle to estimate probabilities of all possible transitions in large alphabet cases
  3. Existing Challenges:
    • Both alphabet size and dependency order are typically high
    • Need to handle models with unbounded dependencies and alphabet sizes
    • Conventional methods may be inefficient for large alphabet scenarios

Research Motivation

The authors propose a more practical approach: focusing on the most likely events, i.e., predicting the most probable outcomes while assigning less weight to rare, unlikely events. This approach is particularly suitable for handling sequences with large or infinite symbol sets.

Core Contributions

  1. Proposes a nonparametric guessing function: Learning rate independent of alphabet size, applicable to a broad class of categorical time series
  2. Establishes a theoretical framework: Applicable to arbitrary alphabet sizes, relaxing constraints on memory or order
  3. Provides margin conditions: Controlling convergence rates of risk
  4. Establishes minimax lower bounds: Proving approximate optimality of the proposed estimator, with lower and upper bounds matching up to logarithmic factors
  5. First consideration of infinite alphabet case: Important when alphabet size has no prior upper bound or grows with sample size

Methodology Details

Task Definition

Given two independent and identically distributed process copies (Xj)jZ(X_j)_{j \in \mathbb{Z}} and (Yj)jZ(Y_j)_{j \in \mathbb{Z}}, the goal is to predict values on the guessing set GG using information from dataset DD.

Estimator Definition: f^D,Gn:An×ADAG\hat{f}^n_{D,G} : A^n \times A^D \to A^G

Excess Risk: R(f^D,Gn):=supbAD(P~(f^D,Gn(YD)YGYD=b)infaAGP~(aYGYD=b))P~(YD=b)R(\hat{f}^n_{D,G}) := \sup_{b \in A^D} \left( \tilde{P}(\hat{f}^n_{D,G}(Y_D) \neq Y_G | Y_D = b) - \inf_{a \in A^G} \tilde{P}(a \neq Y_G | Y_D = b) \right) \tilde{P}(Y_D = b)

Model Architecture

Core Estimator: f^D,Gn[X1n](b):=argmaxaAGND,Gn[X1n](b,a)ND,Gn[X1n](b)\hat{f}^n_{D,G}[X^n_1](b) := \arg\max_{a \in A^G} \frac{N^n_{D,G}[X^n_1](b,a)}{N^n_{D,G}[X^n_1](b)}

where the counting function is defined as: ND,Gn[X1n](b,a):=i=0n11{XθiD=b,XθiG=a}N^n_{D,G}[X^n_1](b,a) := \sum_{i=0}^{n-1} \mathbf{1}\{X_{\theta^i D} = b, X_{\theta^i G} = a\}

Main Assumptions

Assumption A: Let (Xi)iZ(X_i)_{i \in \mathbb{Z}} be a stationary process with measure PP. It satisfies Assumption A if: Γ(P):=j=0(1Varj(p))>0\Gamma(P) := \prod_{j=0}^{\infty} (1 - \text{Var}_j(p)) > 0

where variation is defined as: Varn(p):=sup{12aAp(ax)p(ay):x,yAZ,xi=yi,in}\text{Var}_n(p) := \sup\left\{\frac{1}{2}\sum_{a \in A}|p(a|x) - p(a|y)| : x,y \in A^{\mathbb{Z}_-}, x_i = y_i, i \geq -n\right\}

Margin Conditions

For each bADb \in A^D, define: δD,G(b)=inf{P(XGc,XD=b)infaAGP(XGa,XD=b)>0:cAG}\delta_{D,G}(b) = \inf\{P(X_G \neq c, X_D = b) - \inf_{a \in A^G} P(X_G \neq a, X_D = b) > 0 : c \in A^G\}

Margin: δD,G:=infbADδD,G(b)\delta_{D,G} := \inf_{b \in A^D} \delta_{D,G}(b)

Main Theoretical Results

Upper Bound Results (Theorem 3.1)

If sample size nn satisfies certain conditions, then: R(f^D,Gn)εβD,GR(\hat{f}^n_{D,G}) \leq \varepsilon \land \beta_{D,G}

Convergence Rates (Corollary 3.1)

  1. When margin condition is weak: If δnnlogn0\delta_n\sqrt{\frac{n}{\log n}} \to 0, then: R(f^D,Gn)12lognnβD,GR(\hat{f}^n_{D,G}) \leq \frac{1}{2}\sqrt{\frac{\log n}{n}} \land \beta_{D,G}
  2. When margin condition is strong: If δnnlogn\delta_n\sqrt{\frac{n}{\log n}} \to \infty, then: R(f^D,Gn)exp(Γ2nδn28(G+D)2)βD,GR(\hat{f}^n_{D,G}) \leq \exp\left(-\frac{\Gamma^2 n \delta_n^2}{8(|G|+|D|)^2}\right) \land \beta_{D,G}

Minimax Lower Bounds (Theorem 3.2)

Establishes minimax lower bounds in two scenarios:

  1. Small margin case: infψnΨnsupPPnR(ψn;P)e1n(14)G+D\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{P}_n} R(\psi_n; P) \geq \frac{e^{-1}}{\sqrt{n}}\left(\frac{1}{4}\right)^{|G|+|D|}
  2. Large margin case: infψnΨnsupPQnR(ψn;P)δnenδn2(14)D+G\inf_{\psi_n \in \Psi_n} \sup_{P \in \mathcal{Q}_n} R(\psi_n; P) \geq \delta_n e^{-n\delta_n^2}\left(\frac{1}{4}\right)^{|D|+|G|}

Application Examples

The paper demonstrates that Assumption A applies to various important models:

Markov Chains

For Markov chains with state space AA and transition matrix QQ, the condition simplifies to the Dobrushin ergodic coefficient: d(Q):=supa,bAQ(a,)Q(b,)TV<1d(Q) := \sup_{a,b \in A} \|Q(a,\cdot) - Q(b,\cdot)\|_{TV} < 1

Autoregressive Models

Binary autoregressive process with transition probabilities: p(ax)=Υ(aj=1ξjxj+aξ0)p(a|x) = \Upsilon\left(a\sum_{j=1}^{\infty}\xi_j x_{-j} + a\xi_0\right)

Poisson Regression

Poisson regression model for count time series: p(ax)=ev(x)v(x)aa!p(a|x) = \frac{e^{-v(x)}v(x)^a}{a!} where v(x)=exp(j=1ξjmin{xj,c})v(x) = \exp\left(\sum_{j=1}^{\infty}\xi_j \min\{x_{-j}, c\}\right)

Gibbs Measures

One-dimensional Gibbs measures satisfying: P(XΛ=xΛXΛc=yΛc)=exp(βHΛΦ(xΛyΛc))ZΛΦ(y)P(X_\Lambda = x_\Lambda | X_{\Lambda^c} = y_{\Lambda^c}) = \frac{\exp(-\beta H^\Phi_\Lambda(x_\Lambda y_{\Lambda^c}))}{Z^\Phi_\Lambda(y)}

Technical Innovations

  1. Avoids explicit probability estimation: No need to estimate all conditional probabilities; focuses only on the most likely outcomes
  2. Alphabet-size-independent learning rate: Key advantage for handling large or infinite alphabets
  3. Dvoretzky-Kiefer-Wolfowitz-type inequalities: Establishes new concentration inequalities for random chains
  4. Unified framework: Covers a broad class of time series models

Experimental and Proof Techniques

Main Proof Techniques

  1. Concentration inequalities: Uses modified Dvoretzky-Kiefer-Wolfowitz inequalities
  2. Coupling methods: Controls probability differences under different conditions
  3. Le Cam-type arguments: Establishes minimax lower bounds
  4. Variational analysis: Controls oscillation of variations through potential functions

Key Lemmas

  • Proposition 3.1: Establishes relationship between βD,G\beta_{D,G} and set sizes
  • Proposition 4.1: Provides concrete variational bounds for Gibbs measures
  • Theorem A.1: Extension of Dvoretzky-Kiefer-Wolfowitz-type inequalities

Traditional Methods

  1. Classical prediction: Based on pointwise estimation of transition probabilities
  2. PAC learning framework: Studies optimal rates for learning conditional probabilities
  3. Parametric regression models: Flexible but with restrictive assumptions

Advantages of This Work

  1. Handles large alphabets: Learning rate independent of alphabet size
  2. Nonparametric approach: Avoids restrictive assumptions of parametric models
  3. Theoretical guarantees: Provides approximately optimal convergence rates

Conclusions and Discussion

Main Conclusions

  1. Proposes a nonparametric guessing method applicable to unbounded alphabets
  2. Establishes learning rates independent of alphabet size
  3. Proves approximate optimality of the method (up to logarithmic factors)
  4. Provides a unified framework for a broad class of time series models

Limitations

  1. Verification of Assumption A: Verifying Assumption A in practical applications may be challenging
  2. Finite sample performance: Theoretical results are asymptotic; finite sample behavior may differ
  3. Computational complexity: The paper does not discuss algorithmic computational complexity in detail

Future Directions

  1. Algorithm implementation: Develop efficient algorithmic implementations
  2. Practical applications: Validate the method in practical applications such as large language models
  3. Extension to other loss functions: Consider different risk measures

In-Depth Evaluation

Strengths

  1. Significant theoretical contribution: First to address infinite alphabet case with complete theoretical framework
  2. Strong methodological innovation: Avoiding explicit probability estimation has practical value
  3. Rigorous analysis: Provides upper bounds and matching lower bounds, proving approximate optimality
  4. Broad applicability: Unified framework covers multiple important time series models

Weaknesses

  1. Lack of experimental validation: Paper is purely theoretical without numerical experiments or practical applications
  2. Insufficient algorithmic details: Limited discussion of practical implementation and computational complexity
  3. Difficult assumption verification: Methods for verifying Assumption A in practice are unclear

Impact

  1. High theoretical value: Provides new theoretical tools for handling large alphabet time series
  2. Significant practical potential: Important for modern applications like large language models
  3. Method generality: Framework potentially applicable to related problems

Applicable Scenarios

  1. Large language models: Text generation tasks with large vocabularies
  2. Bioinformatics: DNA/protein sequence analysis
  3. Network traffic analysis: Network behavior prediction with large state spaces
  4. Financial time series: High-frequency trading data analysis

References

The paper cites 26 relevant references spanning multiple fields including Markov chain theory, statistical learning theory, dynamical systems, and probability theory, providing solid theoretical foundations for this work.