The Pitfalls of Continuous Heavy-Tailed Distributions in High-Frequency Data Analysis

Holý
We address the challenges of modeling high-frequency integer price changes in financial markets using continuous distributions, particularly the Student's t-distribution. We demonstrate that traditional GARCH models, which rely on continuous distributions, are ill-suited for high-frequency data due to the discreteness of price changes. We propose a modification to the maximum likelihood estimation procedure that accounts for the discrete nature of observations while still using continuous distributions. Our approach involves modeling the log-likelihood in terms of intervals corresponding to the rounding of continuous price changes to the nearest integer. The findings highlight the importance of adjusting for discreteness in volatility analysis and provide a framework for incorporating any continuous distribution for modeling high-frequency prices.

Basic Information

  • Paper ID: 2510.09785
  • Title: The Pitfalls of Continuous Heavy-Tailed Distributions in High-Frequency Data Analysis
  • Author: Vladimír Holý (Prague University of Economics and Business)
  • Classification: q-fin.ST (Statistical Finance)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09785

Abstract

This paper investigates the challenges of modeling high-frequency integer price changes in financial markets using continuous distributions, particularly Student's t distribution. The author demonstrates that traditional GARCH models are unsuitable for high-frequency data analysis due to the discrete nature of price changes. The paper proposes a corrected maximum likelihood estimation method that accounts for the discrete characteristics of observations while employing continuous distributions. The approach expresses the log-likelihood in terms of intervals, with each integer observation standing for the range of continuous price changes that round to it. The findings emphasize the importance of adjusting for discreteness in volatility analysis and provide a framework for applying any continuous distribution to high-frequency price modeling.

Research Background and Motivation

Problem Definition

  1. Core Issue: Traditional GARCH models using continuous distributions (e.g., Student's t distribution) have fundamental defects when modeling high-frequency financial data
  2. Specific Manifestation: When price changes are integers with frequent zero values, Student's t distribution degenerates into an inverted-U shape with density concentrated at point zero and extremely heavy tails
  3. Practical Impact: This degeneracy causes likelihood function explosion, parameter estimation failure, and meaningless or misleading model results

Research Significance

  1. Practical Importance: Increasing high-frequency trading intensity makes price discreteness issues more pronounced
  2. Risk Management: Incorrect volatility models affect risk management, portfolio optimization, and derivative pricing
  3. Academic Value: Fills theoretical gaps in continuous distribution modeling of discrete data

Limitations of Existing Methods

  1. Traditional GARCH Models: Assume continuous price changes, ignoring the discrete nature of high-frequency data
  2. Existing Discrete Models: Primarily based on Skellam distribution, limiting flexibility in distribution selection
  3. Software Package Issues: Multiple R packages impose artificial lower bounds on degrees of freedom parameters, masking true optimization problems

Core Contributions

  1. Cautionary Finding: Clearly identifies the unsuitability of standard GARCH models with heavy-tailed continuous distributions for high-frequency data
  2. Theoretical Innovation: Proposes interval maximum likelihood estimation method, treating integer observations as rounded continuous values
  3. Methodological Framework: Establishes a framework applicable to any continuous distribution for high-frequency price modeling
  4. Empirical Verification: Validates the method's effectiveness through empirical analysis of multiple stocks

Methodology Details

Task Definition

  • Input: High-frequency stock price change sequences (integer values with abundant zeros)
  • Output: Estimates of time-varying volatility parameters and distribution parameters
  • Constraint: Maintain continuous distribution usage while handling data discreteness

Problems with Traditional Methods

GARCH Model

Standard GARCH model:

y_t = μ + e_t, e_t ~ t(0, σ²_t, ν)
σ²_t = ω + αe²_{t-1} + φσ²_{t-1}
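
The variance recursion above can be sketched in a few lines. This is a minimal illustration rather than any package's implementation; the function name `garch_variance`, the starting value `sigma2_init`, and the parameter values are assumptions chosen for the example.

```python
def garch_variance(y, mu, omega, alpha, phi, sigma2_init):
    """Conditional variances from sigma2_t = omega + alpha*e_{t-1}^2 + phi*sigma2_{t-1}."""
    sigma2 = [sigma2_init]  # assumed starting value for the first observation
    for t in range(1, len(y)):
        e_prev = y[t - 1] - mu  # previous innovation e_{t-1} = y_{t-1} - mu
        sigma2.append(omega + alpha * e_prev ** 2 + phi * sigma2[-1])
    return sigma2


# Hypothetical integer price changes with many zeros
y = [0, 1, 0, 0, -2, 0, 1]
print(garch_variance(y, mu=0.0, omega=0.1, alpha=0.2, phi=0.7, sigma2_init=1.0))
```

Nothing in this recursion knows that y_t is an integer. The discreteness problem enters through the density used in the likelihood, not through the variance equation itself.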

Score-Driven Model

y_t ~ t(μ, σ²_t, ν)
ln σ²_t = ω + α∇_{ln σ²}(y_{t-1}; μ, σ²_{t-1}, ν) + φ ln σ²_{t-1}

Issues

When ν → 0, Student's t distribution degenerates:

  • σ² → 0 (down to the numerical lower bound 2^{-1074}, the smallest positive double-precision number)
  • Density explodes at zero, forming an inverted-U shape
  • Log-likelihood reaches extreme values (e.g., 72 per observation vs. a typical value of about -2)
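
The degeneracy can be reproduced on a toy sample. The sketch below evaluates the ordinary (continuous) Student's t log-density on hypothetical data with 70% zeros and shows the average log-likelihood growing without bound as σ shrinks. The sample, the value ν = 0.2, and the function names are assumptions for illustration, not figures from the paper.

```python
import math


def t_logpdf(y, mu, sigma, nu):
    """Log-density of a location-scale Student's t distribution."""
    z2 = ((y - mu) / sigma) ** 2
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi) - math.log(sigma)
            - (nu + 1) / 2 * math.log1p(z2 / nu))


def mean_loglik(ys, sigma, nu):
    """Average continuous log-likelihood per observation with mu fixed at 0."""
    return sum(t_logpdf(y, 0.0, sigma, nu) for y in ys) / len(ys)


# Hypothetical sample: 70% zeros, the rest one-tick moves
ys = [0] * 70 + [1] * 15 + [-1] * 15

for sigma in (1.0, 1e-10, 1e-50, 1e-100):
    print(sigma, mean_loglik(ys, sigma, nu=0.2))
```

Each zero contributes -ln σ, which diverges as σ → 0, while with a small ν the penalty on nonzero observations grows only at rate ν·ln σ. With enough zeros the first effect wins, so an optimizer is rewarded for pushing σ toward the floating-point floor.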

Interval Maximum Likelihood Estimation Method

Core Concept

Treat integer observation y as a continuous value rounded to the nearest integer, i.e., y corresponds to interval (y-0.5, y+0.5].

Mathematical Formulation

Interval log-likelihood function:

ℓ(p|y) = Σ_{t=1}^n ln[F((y_t - μ_t + 0.5)/σ_t | ν) - F((y_t - μ_t - 0.5)/σ_t | ν)]

where F(·|ν) is the cumulative distribution function of Student's t distribution.
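
A minimal sketch of this interval log-likelihood follows. To stay within the standard library it uses the closed-form CDF of Student's t with ν = 2 as a stand-in for the general F(·|ν); that special case, the helper names `t2_cdf` and `interval_loglik`, and the sample data are assumptions made for the example.

```python
import math


def t2_cdf(x):
    """CDF of the standard Student's t distribution with nu = 2 (closed form)."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))


def interval_loglik(ys, mus, sigmas, cdf):
    """Sum over t of ln[F((y - mu + 0.5)/sigma) - F((y - mu - 0.5)/sigma)]."""
    total = 0.0
    for y, mu, sigma in zip(ys, mus, sigmas):
        prob = cdf((y - mu + 0.5) / sigma) - cdf((y - mu - 0.5) / sigma)
        if prob <= 0.0:  # interval probability underflowed to zero
            return -math.inf
        total += math.log(prob)
    return total


# Hypothetical integer price changes
ys = [0, 0, 1, 0, -1, 0, 0, 2]
n = len(ys)
for sigma in (1.0, 0.1, 1e-6):
    print(sigma, interval_loglik(ys, [0.0] * n, [sigma] * n, t2_cdf))
```

Every term is the logarithm of a probability, so the total can never exceed zero. Shrinking σ no longer yields an unbounded gain, because the nonzero observations then fall in intervals with vanishing probability.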

Modified Score Function

∇_{ln σ²}(y; μ, σ², ν) = [(y-μ-0.5)f((y-μ-0.5)/σ|ν) - (y-μ+0.5)f((y-μ+0.5)/σ|ν)] / [2σF((y-μ+0.5)/σ|ν) - 2σF((y-μ-0.5)/σ|ν)]
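
The formula can be checked numerically against a finite-difference derivative of the interval log-probability with respect to ln σ². The sketch below does this for Student's t with ν = 2, whose density and CDF have closed forms; the choice of ν = 2, the test point, and all function names are assumptions for illustration.

```python
import math


def t2_pdf(x):
    """Density of the standard Student's t distribution with nu = 2."""
    return (2.0 + x * x) ** -1.5


def t2_cdf(x):
    """CDF of the standard Student's t distribution with nu = 2."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))


def interval_score(y, mu, sigma2, pdf, cdf):
    """Score of the interval log-likelihood with respect to ln sigma^2."""
    sigma = math.sqrt(sigma2)
    lo, hi = y - mu - 0.5, y - mu + 0.5
    num = lo * pdf(lo / sigma) - hi * pdf(hi / sigma)
    den = 2.0 * sigma * (cdf(hi / sigma) - cdf(lo / sigma))
    return num / den


def interval_logprob(y, mu, sigma2, cdf):
    """ln[F((y - mu + 0.5)/sigma) - F((y - mu - 0.5)/sigma)]."""
    sigma = math.sqrt(sigma2)
    return math.log(cdf((y - mu + 0.5) / sigma) - cdf((y - mu - 0.5) / sigma))


# Central finite difference in ln sigma^2 at an arbitrary test point
y, mu, sigma2, h = 2, 0.3, 1.7, 1e-6
numeric = (interval_logprob(y, mu, sigma2 * math.exp(h), t2_cdf)
           - interval_logprob(y, mu, sigma2 * math.exp(-h), t2_cdf)) / (2 * h)
analytic = interval_score(y, mu, sigma2, t2_pdf, t2_cdf)
print(analytic, numeric)
```

The sign behaves as one would expect: an observation at the center of the distribution gives a negative score (a smaller variance would raise its probability), while an observation far in the tail gives a positive one.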

Complete Model Specification

Location Parameter Dynamics

μ_t = θ(y_{t-1} - μ_{t-1})

Captures market microstructure noise.

Scale Parameter Dynamics

ln σ²_t = ω + ln ŝ_t + e_t
e_t = α∇_{ln σ²}(y_{t-1}; μ_{t-1}, σ²_{t-1}, ν) + φe_{t-1}

where ŝ_t estimates intraday volatility patterns through smoothing splines.
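
Put together, the location and scale recursions form a simple filter. The sketch below is one possible reading of the equations above, with several assumptions made loudly: ν is fixed at 2 so that the density and CDF have closed forms, the initial values are μ_1 = 0 and e_1 = 0, and the intraday pattern ŝ_t is supplied by the caller (defaulting to 1) instead of being estimated by smoothing splines. The name `run_filter` and the parameter values are hypothetical.

```python
import math


def t2_pdf(x):
    """Density of the standard Student's t with nu = 2 (assumed stand-in)."""
    return (2.0 + x * x) ** -1.5


def t2_cdf(x):
    """CDF of the standard Student's t with nu = 2 (assumed stand-in)."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))


def interval_score(y, mu, sigma2):
    """Score of the interval log-likelihood with respect to ln sigma^2."""
    sigma = math.sqrt(sigma2)
    lo, hi = y - mu - 0.5, y - mu + 0.5
    num = lo * t2_pdf(lo / sigma) - hi * t2_pdf(hi / sigma)
    den = 2.0 * sigma * (t2_cdf(hi / sigma) - t2_cdf(lo / sigma))
    return num / den


def run_filter(ys, theta, omega, alpha, phi, s_hat=None):
    """Return the filtered paths (mu_t, sigma2_t) for the integer series ys."""
    n = len(ys)
    s_hat = [1.0] * n if s_hat is None else s_hat  # flat intraday pattern by default
    mus, sigma2s = [], []
    mu, e = 0.0, 0.0  # assumed initial values
    for t in range(n):
        if t > 0:
            score = interval_score(ys[t - 1], mus[-1], sigma2s[-1])
            mu = theta * (ys[t - 1] - mus[-1])  # mu_t = theta*(y_{t-1} - mu_{t-1})
            e = alpha * score + phi * e         # e_t = alpha*score + phi*e_{t-1}
        mus.append(mu)
        sigma2s.append(math.exp(omega + math.log(s_hat[t]) + e))
    return mus, sigma2s


ys = [0, 0, 3, 0, -1, 0]
mus, sigma2s = run_filter(ys, theta=-0.2, omega=0.0, alpha=0.5, phi=0.9)
for t, (m, s2) in enumerate(zip(mus, sigma2s), start=1):
    print(t, round(m, 4), round(s2, 4))
```

In this toy run the variance drifts down while the series sits at zero and jumps up in the period after the three-tick move, which is the qualitative behavior the score-driven update is meant to deliver.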

Experimental Setup

Dataset

  1. Primary Data: IBM stock (NYSE, full year 2024)
  2. Supplementary Data: MCD (NYSE), CSCO and MSFT (NASDAQ)
  3. Data Scale: Over 15 million tick-by-tick transaction observations
  4. Frequency Settings: 0.1 seconds, 1 second, 10 seconds, 60 seconds, 300 seconds

Data Preprocessing

  1. Standard Cleaning: Remove after-hours data, records without prices, outliers
  2. Outlier Definition: Exceeding 10 times the mean absolute deviation within a 201-observation rolling window
  3. Aggregation Method: Last-trade-price method
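
The aggregation step can be sketched as follows. This is a generic last-trade-price (previous-tick) scheme under stated assumptions, not the paper's code: timestamps are seconds since the session start, trades are sorted by time, an interval without trades carries the previous price forward, and the tick size is one cent. The function names and the sample trades are hypothetical.

```python
def last_trade_bars(trades, start, end, step):
    """Last traded price in each interval [start + k*step, start + (k+1)*step)."""
    n_bins = int(round((end - start) / step))
    bars = [None] * n_bins
    for ts, price in trades:  # assumed sorted by timestamp
        k = int((ts - start) // step)
        if 0 <= k < n_bins:
            bars[k] = price  # later trades in the same bin overwrite earlier ones
    last = None
    for k in range(n_bins):
        if bars[k] is None:
            bars[k] = last  # no trade in this bin: carry the previous price forward
        else:
            last = bars[k]
    return bars


def tick_changes(bars, tick=0.01):
    """Integer price changes measured in ticks (assumed tick size of one cent)."""
    ticks = [round(p / tick) for p in bars if p is not None]
    return [b - a for a, b in zip(ticks, ticks[1:])]


# Hypothetical trades: (seconds since start, price)
trades = [(0.2, 100.00), (0.7, 100.01), (1.1, 100.01), (3.4, 100.00), (3.9, 100.00)]
bars = last_trade_bars(trades, start=0.0, end=5.0, step=1.0)
print(bars)
print(tick_changes(bars))
```

Shorter intervals leave more bins unchanged from their predecessor, which is why the share of zero price changes rises with the sampling frequency.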

Evaluation Metrics

  1. Log-Likelihood (ℓ): Model goodness-of-fit
  2. ARCH-LM Statistic: Tests for remaining ARCH effects, i.e., autocorrelation in squared residuals
  3. Out-of-Sample Performance: Next-day prediction capability
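
For reference, the textbook ARCH-LM statistic with a single lag is n·R² from regressing e²_t on a constant and e²_{t-1}, which reduces to n times the squared correlation between the two. The sketch below implements only that one-lag case. The lag order and the residual definition used in the paper are not stated here, so both are assumptions, as are the function name and the sample residuals.

```python
def arch_lm_one_lag(residuals):
    """Engle's ARCH-LM statistic with one lag: n * R^2 of e_t^2 on e_{t-1}^2."""
    sq = [e * e for e in residuals]
    x, y = sq[:-1], sq[1:]  # lagged and current squared residuals
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # constant squared residuals: no ARCH effect detectable
    return n * sxy * sxy / (sxx * syy)


# Hypothetical residuals: a calm regime followed by a volatile one
clustered = [0.1] * 50 + [3.0] * 50
print(arch_lm_one_lag(clustered))
```

Under the null of no remaining ARCH effects the one-lag statistic is asymptotically χ² with one degree of freedom, so values well above 3.84 (the 5% critical value) point to volatility clustering that the model has not captured.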

Comparison Methods

  1. Continuous Distributions: Normal distribution (interval estimation), Student's t distribution (interval estimation)
  2. Discrete Distributions: Skellam distribution, zero-inflated Skellam distribution
  3. Software Packages: rugarch, fGarch, GAS, gasmodel

Experimental Results

Main Findings

Failure of Traditional Methods

Table 1 Results Show:

  • At 1-second frequency, gasmodel package estimates ν=0.220 (median), other packages constrained by artificial lower bounds
  • Massive log-likelihood differences: gasmodel at 72/observation vs. others at approximately -2/observation
  • At 1-minute frequency, results from all packages relatively consistent

Performance of Interval Method

Table 2 Results Show:

  • 1-second frequency: Zero-inflated Skellam optimal (ℓ=-1.700), Student's t second (ℓ=-1.841)
  • 1-minute frequency: Student's t optimal (ℓ=-3.550), slightly better than other methods
  • Very low residual ARCH effects, indicating effective capture of time-varying volatility

Out-of-Sample Performance

  • Student's t, Skellam, and zero-inflated Skellam models show stable performance
  • Normal distribution assigns zero likelihood on 56% of days at 1-second frequency, making it unsuitable for prediction
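
One plausible mechanism behind a zero likelihood is floating-point underflow of the interval probability: a thin-tailed normal can assign a probability that rounds to exactly zero to a large price move in a low-volatility regime. The sketch below illustrates this with a hypothetical 20-tick move at σ = 0.5 and compares it with Student's t at ν = 2 (closed-form CDF). The numbers are assumptions chosen to trigger the effect, not values from the paper.

```python
import math
from statistics import NormalDist


def t2_cdf(x):
    """CDF of the standard Student's t distribution with nu = 2 (closed form)."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))


def interval_logprob(y, mu, sigma, cdf):
    """ln P(y - 0.5 < Y <= y + 0.5); returns -inf when the probability is zero."""
    prob = cdf((y - mu + 0.5) / sigma) - cdf((y - mu - 0.5) / sigma)
    return math.log(prob) if prob > 0.0 else -math.inf


normal_cdf = NormalDist().cdf

y, mu, sigma = 20, 0.0, 0.5  # hypothetical 20-tick jump in a calm period
print("normal:", interval_logprob(y, mu, sigma, normal_cdf))
print("t(2):  ", interval_logprob(y, mu, sigma, t2_cdf))
```

A single such observation sends the whole day's log-likelihood to negative infinity under the normal, while the heavy-tailed alternative stays finite. A more careful implementation would difference survival functions in the upper tail to delay the underflow, but that does not change the qualitative contrast.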

Distribution Fit Analysis

Figure 3 Shows:

  • 1-second frequency: Student's t distribution overestimates probabilities of -1 and 1, underestimates others
  • 1-minute frequency: No systematic bias, but slight underestimation of zero probability

Multi-Stock Verification

Appendix Results:

  • MCD stock: Similar degeneracy behavior to IBM
  • CSCO stock: Higher proportion of zeros, problem more severe
  • MSFT stock: More dispersed distribution, traditional methods relatively stable but still problematic

Related Work

High-Frequency Data Modeling Development

  1. Early Research: Ghysels and Jasiak (1998), Engle (2000), Meddahi et al. (2006)
  2. Discrete Models: Koopman et al. (2017, 2018), Catania et al. (2022), Holý (2024)
  3. Score-Driven Models: Creal et al. (2013) theoretical foundation

Paper Positioning

  1. Distinction from Discrete Methods: Maintains flexibility of continuous distribution usage
  2. Supplements Existing Theory: Follows up on an issue that Holý (2024) noted but did not investigate thoroughly
  3. Practical Value: Provides warnings for existing software package users

Conclusions and Discussion

Main Conclusions

  1. Theoretical Conclusion: Student's t distribution is unsuitable for modeling integer price changes with frequent zero values
  2. Methodological Conclusion: Interval maximum likelihood estimation effectively resolves the problems of fitting continuous distributions to discrete data
  3. Practical Conclusion: Method performs excellently on relatively low-frequency (1-minute) data; higher-frequency data requires more complex distributions

Limitations

  1. Scope of Application: Student's t distribution still lacks flexibility for ultra-high-frequency data
  2. Computational Complexity: Interval estimation increases computational burden
  3. Parameter Constraints: May require lower bounds on score coefficients in certain cases

Future Directions

  1. Distribution Extension: Apply method to other continuous distributions
  2. Theoretical Refinement: Investigate asymptotic properties of interval estimation
  3. Practical Applications: Applications in risk management and derivative pricing

In-Depth Evaluation

Strengths

  1. Accurate Problem Identification: Clearly identifies an overlooked but important practical issue
  2. Concise Solution: Interval estimation method is simple, effective, and easy to implement
  3. Comprehensive Empirical Analysis: Full verification across multiple software packages, stocks, and frequencies
  4. High Practical Value: Provides clear warnings and solutions for practitioners

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical property analysis of interval estimation method
  2. Computational Efficiency: Does not discuss computational complexity and optimization strategies
  3. Limited Model Comparison: Primarily compares with basic discrete distributions, lacking more advanced benchmarks
  4. Parameter Selection: Interval choice (0.5) lacks theoretical justification

Impact

  1. Academic Contribution: Fills gap in continuous distribution modeling of discrete data
  2. Practical Value: Direct application value for high-frequency trading and risk management
  3. Method Generalizability: Framework extensible to other continuous distributions and application domains

Applicable Scenarios

  1. High-Frequency Financial Data: Particularly markets where price changes are quoted in minimum units
  2. Discrete Observations of Continuous Processes: Other time series with rounding errors
  3. Volatility Modeling: Risk management applications requiring continuous distribution flexibility

References

This paper cites important literature in financial econometrics, high-frequency data analysis, and time series modeling, including:

  • Engle (1982, 2000, 2002) - GARCH models and high-frequency data analysis foundations
  • Creal et al. (2013) - Score-Driven model theory
  • Koopman et al. (2017, 2018) - Dynamic modeling of discrete price changes
  • Holý (2024) - Related discrete GARCH model research

Overall Assessment: This paper provides a concise and effective solution to an important but overlooked practical problem. Although somewhat limited in theoretical depth, its empirical analysis is comprehensive, its conclusions are credible, and it makes a significant contribution to high-frequency financial data analysis.