Pathwise guessing in categorical time series with unbounded alphabets
Chazottes, Gallo, Takahashi
The following learning problem arises naturally in various applications: Given a finite sample from a categorical or count time series, can we learn a function of the sample that (nearly) maximizes the probability of correctly guessing the values of a given portion of the data using the values from the remaining parts? Unlike classical approaches in statistical inference, our approach avoids explicitly estimating the conditional probabilities.
We propose a non-parametric guessing function with a learning rate independent of the alphabet size. Our analysis focuses on a broad class of time series models that encompasses finite-order Markov chains, some hidden Markov chains, Poisson regression for count processes, and one-dimensional Gibbs measures.
We provide a margin condition that controls the rate of convergence for the risk. Additionally, we establish a minimax lower bound for the convergence rate of the risk associated with our guessing problem. This lower bound matches the upper bound achieved by our estimator up to a logarithmic factor, demonstrating its near-optimality.
academic
Pathwise guessing in categorical time series with unbounded alphabets
This paper investigates a learning problem that naturally arises in various applications: given a finite sample of a categorical or count time series, can one learn a sample function that (approximately) maximizes the probability of correctly guessing the value of a given portion of data using the remaining data? Unlike classical statistical inference methods, the proposed approach avoids explicit estimation of conditional probabilities. The authors present a nonparametric guessing function with learning rates independent of alphabet size, with analysis covering a broad class of time series models including finite-order Markov chains, certain hidden Markov chains, Poisson regression for counting processes, and one-dimensional Gibbs measures.
Practical Applications: Prediction and interpolation are fundamental problems in science with widespread applications in categorical time series, particularly in the context of large language models, which can be viewed as categorical time series models with large alphabets.
Limitations of Traditional Methods:
Classical methods rely on pointwise estimation of all transition probabilities
When alphabet size is large or transition probabilities are small, guessing becomes difficult
Accurate estimation of rare events requires large amounts of data, which is impractical
Traditional approaches may struggle to estimate probabilities of all possible transitions in large alphabet cases
Existing Challenges:
Both alphabet size and dependency order are typically high
Need to handle models with unbounded dependencies and alphabet sizes
Conventional methods may be inefficient for large alphabet scenarios
The authors propose a more practical approach: focusing on the most likely events, i.e., predicting the most probable outcomes while assigning less weight to rare, unlikely events. This approach is particularly suitable for handling sequences with large or infinite symbol sets.
Proposes a nonparametric guessing function: Learning rate independent of alphabet size, applicable to a broad class of categorical time series
Establishes a theoretical framework: Applicable to arbitrary alphabet sizes, relaxing constraints on memory or order
Provides margin conditions: Controlling convergence rates of risk
Establishes minimax lower bounds: Proving approximate optimality of the proposed estimator, with lower and upper bounds matching up to logarithmic factors
First consideration of infinite alphabet case: Important when alphabet size has no prior upper bound or grows with sample size
Given two independent and identically distributed process copies (Xj)j∈Z and (Yj)j∈Z, the goal is to predict values on the guessing set G using information from dataset D.
For Markov chains with state space A and transition matrix Q, the condition simplifies to the Dobrushin ergodic coefficient:
d(Q):=supa,b∈A∥Q(a,⋅)−Q(b,⋅)∥TV<1
The paper cites 26 relevant references spanning multiple fields including Markov chain theory, statistical learning theory, dynamical systems, and probability theory, providing solid theoretical foundations for this work.