Joint modeling and inference of multiple-subject high-dimensional sparse vector autoregressive models

Kim, Fisher, Pipiras
The multiple-subject vector autoregression (multi-VAR) model captures heterogeneous network Granger causality across subjects by decomposing individual sparse VAR transition matrices into commonly shared and subject-unique paths. The model has been applied to characterize hidden shared and unique paths among subjects and has demonstrated performance compared to methods commonly used in psychology and neuroscience. Despite this innovation, the model suffers from using a weighted median for identifying the common effects, leading to statistical inefficiency as the convergence rates of the common and unique paths are determined by the least sparse subject and the smallest sample size across all subjects. We propose a new identifiability condition for the multi-VAR model based on a communication-efficient data integration framework. We show that this approach achieves convergence rates tailored to each subject's sparsity level and sample size. Furthermore, we develop hypothesis tests to assess the nullity and homogeneity of individual paths, using Wald-type test statistics constructed from individual debiased estimators. A test for the significance of the common paths can also be derived through the framework. Simulation studies under various heterogeneity scenarios and a real data application demonstrate the performance of the proposed method compared to existing benchmarks across standard evaluation metrics.

Basic Information

  • Paper ID: 2510.14044
  • Title: Joint modeling and inference of multiple-subject high-dimensional sparse vector autoregressive models
  • Authors: Younghoon Kim (Cornell University), Zachary F. Fisher (University of North Carolina at Chapel Hill), Vladas Pipiras (University of North Carolina at Chapel Hill)
  • Classification: stat.ME (Statistics - Methodology)
  • Publication Date: October 17, 2025
  • Paper Link: https://arxiv.org/abs/2510.14044

Abstract

The multi-subject vector autoregressive (multi-VAR) model captures heterogeneous network Granger causality across subjects by decomposing individual sparse VAR transition matrices into common shared pathways and subject-specific pathways. Although this model has been applied to characterize hidden shared and unique pathways across subjects and has demonstrated superior performance compared to commonly used methods in psychology and neuroscience, its use of weighted medians to identify common effects suffers from statistical efficiency issues, as convergence rates for common and unique pathways are determined by the least sparse subject and the minimum sample size across all subjects. This paper proposes new identifiability conditions for the multi-VAR model based on a communication-efficient data integration framework, enabling customized convergence rates tailored to each subject's sparsity level and sample size. Additionally, a hypothesis testing framework is developed to assess the nullity and homogeneity of individual pathways using Wald-type test statistics constructed from subject-specific debiased estimators, from which significance tests for common pathways can be derived.

Research Background and Motivation

Problem Definition

The core problems addressed in this research concern statistical efficiency and inference in multi-subject high-dimensional sparse vector autoregressive modeling, specifically:

  1. Statistical Efficiency Issue: The existing multi-VAR model uses weighted medians to identify common effects, so convergence rates are limited by the least sparse subject and the minimum sample size, and the model fails to exploit each subject's own sparsity level and sample size.
  2. Missing Inference Framework: There is no formal hypothesis testing framework for multi-subject VAR models, which prevents assessment of the nullity, homogeneity, and significance of individual pathways.

Research Significance

This problem is important in the following domains:

  • Neuroscience: Analyzing brain network connectivity patterns across multiple subjects, identifying common and subject-specific neural connections
  • Psychology: Understanding individual differences and common psychological processes
  • Genomics: Analyzing common and subject-specific patterns in gene regulatory networks
  • Finance: Modeling systematic and idiosyncratic risks in financial time series

Limitations of Existing Methods

The original multi-VAR approach has the following limitations:

  1. Suboptimal Convergence Rate: ∥α̂^(k) - α^(k)∥₂ ≤ O_P(√(max_k ∥α^(k)∥₀ · log(d²p) / N_k)), so every subject's rate is constrained by the least sparse subject
  2. Low Computational Efficiency: Requires stacking all subject equations to solve large-scale optimization problems
  3. Lack of Inference Tools: Unable to perform statistical testing and uncertainty quantification

Core Contributions

  1. Proposes New Identifiability Conditions: Based on a communication-efficient data integration framework, avoiding statistical efficiency issues of weighted median methods
  2. Achieves Subject-Specific Convergence Rates: Convergence rates now depend on each subject's own sparsity level and sample size, rather than global worst-case scenarios
  3. Constructs Complete Inference Framework: Develops three classes of hypothesis tests: nullity tests, homogeneity tests, and significance tests
  4. Provides Theoretical Guarantees: Establishes convergence rates for estimators and asymptotic distribution theory for test statistics
  5. Improves Computational Efficiency: Employs a separate estimation and aggregation strategy, significantly reducing computational complexity

Methodology Details

Task Definition

Given K subjects' d-dimensional time series {X_t^(k)}, each subject with T_k time points, the objectives are:

  1. Estimate Common Pathways α^(0): VAR transition matrix parameters shared by all subjects
  2. Estimate Unique Pathways α^(k): Parameters specific to subject k
  3. Satisfy Decomposition Relationship: β^(k) = α^(0) + α^(k), where β^(k) is the complete parameter vector for subject k

Model Architecture

1. VAR Model Specification

Each subject follows a VAR(p) model:

X_t^(k) = Φ₁^(k)X_{t-1}^(k) + ... + Φ_p^(k)X_{t-p}^(k) + ε_t^(k)

where ε_t^(k) ~ N(0, Σ_ε^(k)), Σ_ε^(k) = diag(σ²_{k,1}, ..., σ²_{k,d})
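To make the specification concrete, the following is a minimal sketch of simulating one subject's series from a VAR(p) model with diagonal innovation covariance. The function name `simulate_var`, the burn-in length, and the example coefficients are illustrative assumptions, not taken from the paper.

```python
import random


def simulate_var(phis, sigmas, T, seed=0, burn_in=100):
    """Simulate T observations from X_t = sum_l Phi_l X_{t-l} + eps_t.

    phis   : list of p transition matrices, each d x d (nested lists)
    sigmas : list of d innovation standard deviations (diagonal Sigma_eps)
    """
    rng = random.Random(seed)
    p, d = len(phis), len(sigmas)
    # Start from zeros; the burn-in lets the process forget this start.
    history = [[0.0] * d for _ in range(p)]
    out = []
    for t in range(T + burn_in):
        x_new = []
        for i in range(d):
            # history[-1 - l] holds X_{t-1-l}, i.e. lag l + 1.
            mean = sum(
                phis[l][i][j] * history[-1 - l][j]
                for l in range(p)
                for j in range(d)
            )
            x_new.append(mean + rng.gauss(0.0, sigmas[i]))
        history.append(x_new)
        history = history[-p:]
        if t >= burn_in:
            out.append(x_new)
    return out


# Hypothetical VAR(1) example with d = 2 (coefficients chosen for illustration).
phi1 = [[0.5, 0.0],
        [0.2, 0.3]]
series = simulate_var([phi1], sigmas=[1.0, 1.0], T=200)
print(len(series), len(series[0]))
```

In a multi-subject setting one would call this once per subject, with each subject's transition matrices built as a shared part plus a subject-specific part.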

2. Estimation Procedure

Step 1: Subject-Specific Estimation. For each subject k and variable i, apply Lasso regression:

β̂_i^(k) = argmin_{β_i^(k)} {1/(2N_k)||Y_i^(k) - X^(k)β_i^(k)||²₂ + λ_i^(k)||β_i^(k)||₁}
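The per-equation Lasso above can be sketched in plain Python with cyclic coordinate descent. This is only an illustration of the objective 1/(2N)||y - Xβ||²₂ + λ||β||₁: the helper names `lagged_design` and `lasso_cd` are hypothetical, the solver choice is an assumption (the summary does not say which solver the authors use), and the effective sample size N = T - p is assumed from the usual VAR regression setup.

```python
def lagged_design(series, p):
    """Build the regression form of a VAR(p): each row of X stacks
    X_{t-1}, ..., X_{t-p}; the matching row of Y is X_t.
    Assumes the effective sample size is N = T - p."""
    X, Y = [], []
    for t in range(p, len(series)):
        row = []
        for lag in range(1, p + 1):
            row.extend(series[t - lag])
        X.append(row)
        Y.append(list(series[t]))
    return X, Y


def soft_threshold(z, lam):
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise 1/(2n) * ||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, q = len(X), len(X[0])
    beta = [0.0] * q
    resid = list(y)  # residual y - X beta, with beta = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(q)]
    for _ in range(n_iter):
        for j in range(q):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy usage: orthogonal design, so each coefficient is shrunk independently.
print(lasso_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=0.5))
```

For subject k and variable i, `y` would be column i of `Y` from `lagged_design`, and λ would be chosen per equation, for example by cross-validation.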

Step 2: Debiased Estimation. Compute debiased estimators:

β̃_i^(k) = β̂_i^(k) + (1/N_k)Θ̂^(k)X^(k)'(Y_i^(k) - X^(k)β̂_i^(k))

where Θ̂^(k) is an approximate inverse of the Hessian matrix, computed via nodewise regression.
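The correction term can be written out directly. The sketch below takes Θ̂ as given rather than computing it by nodewise regression; in the toy usage, Θ̂ is the exact inverse of the sample Gram matrix X'X/N (possible only because the example is low-dimensional), in which case the debiased estimate coincides with ordinary least squares. The function name `debias` and the numbers are illustrative assumptions.

```python
def debias(X, y, beta_hat, theta):
    """Return beta_hat + (1/n) * Theta X'(y - X beta_hat).

    theta is an approximate inverse of the Gram matrix X'X/n; here it is
    supplied by the caller (the nodewise-regression step is not shown).
    """
    n, q = len(X), len(beta_hat)
    resid = [y[i] - sum(X[i][j] * beta_hat[j] for j in range(q)) for i in range(n)]
    # score = X'(y - X beta_hat) / n
    score = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(q)]
    return [
        beta_hat[j] + sum(theta[j][l] * score[l] for l in range(q))
        for j in range(q)
    ]


# Toy usage with n = 3, q = 2. Gram matrix X'X/n = [[2, 1], [1, 2]] / 3,
# whose exact inverse is [[2, -1], [-1, 2]].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.5]
theta_exact = [[2.0, -1.0], [-1.0, 2.0]]
beta_tilde = debias(X, y, beta_hat=[1.0, 2.0], theta=theta_exact)
print(beta_tilde)
```

In high dimensions the Gram matrix is not invertible, which is why an approximate inverse is needed in the first place.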

Step 3: Robust Aggregation. Identify common pathways using a redescending loss function:

(α̃_i^(0))_j = argmin_{x∈ℝ} {∑_{k=1}^K min{((β̃_i^(k))_j - x)², η_j²}}
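For a single coordinate j, the aggregation above is a one-dimensional minimisation of a truncated quadratic loss. A minimal sketch follows, assuming an exhaustive search is acceptable for small K; the search strategy and the names `truncated_loss` and `robust_common_value` are assumptions, not the paper's algorithm. The search relies on the fact that, at any x, the subjects contributing less than η² form a contiguous run of the sorted values, and the loss restricted to a fixed run is minimised at that run's mean.

```python
def truncated_loss(x, values, eta):
    """Sum over subjects of min{(value - x)^2, eta^2}."""
    return sum(min((v - x) ** 2, eta ** 2) for v in values)


def robust_common_value(values, eta):
    """Minimise the truncated quadratic loss over x.

    Candidates are the means of all contiguous runs of the sorted values;
    a global minimiser is always among them.
    """
    vals = sorted(values)
    K = len(vals)
    best_x, best_loss = None, float("inf")
    for lo in range(K):
        running = 0.0
        for hi in range(lo, K):
            running += vals[hi]
            cand = running / (hi - lo + 1)
            loss = truncated_loss(cand, vals, eta)
            if loss < best_loss:
                best_x, best_loss = cand, loss
    return best_x


# Hypothetical debiased estimates of one path for K = 4 subjects:
# three subjects agree near 1.0, one subject deviates strongly.
estimates = [1.0, 1.1, 0.9, 5.0]
common = robust_common_value(estimates, eta=0.5)
plain_mean = sum(estimates) / len(estimates)
print(common, plain_mean)
```

The deviating subject contributes at most η² to the loss, so it does not drag the common value towards itself the way it drags the plain mean.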

Step 4: Sparsification. Apply hard or soft thresholding to recover sparsity (hard thresholding, HT, is shown here):

α̂_i^(0) = HT_{δ₀}(α̃_i^(0))
α̂_i^(k) = HT_{δₖ}(β̃_i^(k) - α̃_i^(0))
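The two thresholding formulas translate almost line for line into code. This sketch assumes HT_δ sets entries with absolute value at most δ to zero and keeps the rest unchanged; the function names and the numbers are illustrative.

```python
def hard_threshold(vec, delta):
    """HT_delta: keep entries with |v| > delta, zero out the rest."""
    return [v if abs(v) > delta else 0.0 for v in vec]


def sparsify(beta_tilde_by_subject, alpha0_tilde, delta0, deltas):
    """Return (alpha0_hat, [alpha_hat for each subject]).

    The unique part is thresholded after subtracting the unthresholded
    common estimate alpha0_tilde, as in the formulas above.
    """
    alpha0_hat = hard_threshold(alpha0_tilde, delta0)
    alpha_hats = [
        hard_threshold([b - a for b, a in zip(beta, alpha0_tilde)], dk)
        for beta, dk in zip(beta_tilde_by_subject, deltas)
    ]
    return alpha0_hat, alpha_hats


# Hypothetical values for two subjects and three paths.
alpha0_tilde = [0.5, 0.02, 0.0]
beta_tildes = [[0.52, 0.03, 0.4], [0.49, -0.01, 0.01]]
alpha0_hat, alpha_hats = sparsify(beta_tildes, alpha0_tilde, delta0=0.1, deltas=[0.15, 0.15])
print(alpha0_hat)
print(alpha_hats)
```

Here the first path is retained as common, and only the first subject keeps a unique effect (on the third path).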

Technical Innovations

  1. Robust M-Estimators: Treats common effect identification as a measurement contamination problem, using redescending loss functions to handle outliers
  2. Subject-Specific Thresholds: δₖ ~ √(log q/Nₖ), δ₀ ~ √(log q/(KN_)), fully leveraging sample information from each subject
  3. Communication-Efficient Framework: Avoids global optimization; each subject can compute independently followed by aggregation

Experimental Setup

Datasets

Simulated Data

  • Parameter Settings: K ∈ {10,15}, d ∈ {10,20}, average sample length T ∈ {50,200}
  • Heterogeneity Levels: (s₀,sₖ) ∈ {(0.02,0.04), (0.03,0.03), (0.04,0.02)}, corresponding to high, medium, and low heterogeneity
  • Overall Sparsity: Fixed at 6%
  • Repetitions: 50 repetitions for each setting

Real Data

  • Data Source: Human Connectome Project (HCP) emotion processing task fMRI data
  • Subjects: 12 female subjects, ages 22-30
  • Brain Parcellation: Schaefer2018 400-parcel atlas, mapped to 17 functional networks
  • Sample Length: Average Tₖ = 165 time points

Evaluation Metrics

Estimation Performance

  • RMSE: ∥α̂ - α∥₂/∥α∥₂
  • Sensitivity: Proportion of truly non-zero parameters that are estimated as non-zero
  • Specificity: Proportion of truly zero parameters that are estimated as zero
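The three estimation metrics can be computed as follows. This is a small sketch on flattened parameter vectors; the function names and the example vectors are illustrative, and the error metric follows the formula given above (a norm of the error relative to the norm of the truth).

```python
import math


def relative_error(est, truth):
    """||est - truth||_2 / ||truth||_2."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(est, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den


def sensitivity(est, truth):
    """Share of truly non-zero entries that are estimated as non-zero."""
    nonzero = [e for e, t in zip(est, truth) if t != 0.0]
    if not nonzero:
        return float("nan")
    return sum(1 for e in nonzero if e != 0.0) / len(nonzero)


def specificity(est, truth):
    """Share of truly zero entries that are estimated as zero."""
    zero = [e for e, t in zip(est, truth) if t == 0.0]
    if not zero:
        return float("nan")
    return sum(1 for e in zero if e == 0.0) / len(zero)


# Hypothetical estimate and truth.
est = [0.9, 0.0, 0.0, 0.2]
truth = [1.0, 0.5, 0.0, 0.0]
print(relative_error(est, truth), sensitivity(est, truth), specificity(est, truth))
```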

Inference Performance

  • FDR: False discovery rate
  • Power: Statistical power
  • Computation Time: Speedup ratio relative to baseline methods
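For a single simulation run, the false discovery proportion and the empirical power can be computed from the test decisions and the known truth; averaging the false discovery proportion over repetitions gives an estimate of the FDR. The sketch below is an assumption about how such metrics are typically computed, and the function name `fdp_and_power` is hypothetical.

```python
def fdp_and_power(rejected, truly_nonnull):
    """rejected[i]      : True if hypothesis i was rejected
    truly_nonnull[i] : True if the null is actually false for hypothesis i
    Returns (false discovery proportion, empirical power)."""
    n_rejected = sum(1 for r in rejected if r)
    false_disc = sum(1 for r, t in zip(rejected, truly_nonnull) if r and not t)
    true_disc = sum(1 for r, t in zip(rejected, truly_nonnull) if r and t)
    n_nonnull = sum(1 for t in truly_nonnull if t)
    # By convention the proportion is 0 when nothing is rejected.
    fdp = false_disc / n_rejected if n_rejected > 0 else 0.0
    power = true_disc / n_nonnull if n_nonnull > 0 else float("nan")
    return fdp, power


# Hypothetical run with four tested paths.
fdp, power = fdp_and_power(
    rejected=[True, True, False, True],
    truly_nonnull=[True, False, False, True],
)
print(fdp, power)
```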

Comparison Methods

  • multi-VAR: Original multi-subject VAR model
  • multi-VAR(A): multi-VAR with adaptive Lasso penalty

Experimental Results

Main Results

Estimation Performance

  1. Low-Dimensional Case (d=10): Proposed method outperforms existing methods in RMSE
  2. High-Dimensional Case (d=20): Performance gap narrows as sample size increases
  3. Sensitivity and Specificity: Comparable to adaptive multi-VAR, indicating subject-specific thresholds function similarly to adaptive weights

Computational Efficiency

The proposed method is faster than the baseline methods, and the reported speedup is larger in the larger setting:

  • d=10, T=50: Speedup ratio approximately 2-3×
  • d=20, T=200: Speedup ratio reaches 60-100×

Convergence Rate Improvement

Theoretical analysis demonstrates subject-specific convergence rates:

  • Common Pathways: ∥α̂^(0) - α^(0)∥₂ ≤ O_P(√(s₀,max log d²/(KN_)))
  • Unique Pathways: ∥α̂^(k) - α^(k)∥₂ ≤ O_P(√(sₖ,max log d²/Nₖ))
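A small numerical illustration of why subject-specific rates matter for the unique pathways: the sketch below evaluates √(s log d² / N) once per subject with that subject's own sparsity and sample size, and once with the largest sparsity and smallest sample size across subjects. Constants are ignored, the same log factor is used in both cases to isolate the effect of s and N, and all numbers are hypothetical.

```python
import math


def rate(sparsity, d, n):
    """sqrt(s * log(d^2) / n), ignoring constants."""
    return math.sqrt(sparsity * math.log(d ** 2) / n)


# Hypothetical subjects: (number of non-zero unique paths, sample size).
d = 10
sparsities = [3, 5, 12]
sample_sizes = [200, 150, 50]

worst_case = rate(max(sparsities), d, min(sample_sizes))
subject_rates = [rate(s, d, n) for s, n in zip(sparsities, sample_sizes)]
for k, r in enumerate(subject_rates, start=1):
    print(f"subject {k}: subject-specific {r:.3f} vs worst-case {worst_case:.3f}")
```

Only the subject that is both the least sparse and the shortest attains the worst-case value; every other subject has a strictly smaller rate.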

Inference Results

Hypothesis Testing Performance

  1. Nullity Tests: FDR between 0.0-0.6, power 0.5-1.0
  2. Homogeneity Tests: FDR between 0.0-0.6, power 0.4-1.0
  3. Significance Tests: FDR consistently 0, power 0.25-1.0

Testing performance improves with increasing sample size and is robust to changes in the dimension d.
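As a reminder of the mechanics behind a Wald-type nullity test for a single path, the sketch below forms z = estimate / standard error and a two-sided p-value from the standard normal distribution. The paper's exact statistic and variance estimator are not given in this summary, so the standard error is taken as an input and the function name `wald_nullity_test` is hypothetical.

```python
from statistics import NormalDist


def wald_nullity_test(estimate, std_error, level=0.05):
    """Two-sided Wald-type test of H0: coefficient = 0.

    estimate  : debiased estimate of one path
    std_error : its estimated standard error (assumed given)
    Returns (z statistic, p-value, reject flag).
    """
    z = estimate / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < level


# Hypothetical debiased estimate 0.3 with standard error 0.1.
z, p_value, reject = wald_nullity_test(0.3, 0.1)
print(round(z, 2), round(p_value, 4), reject)
```

When many paths are tested at once, the p-values would additionally need a multiple-testing adjustment to control the FDR; the summary does not state which procedure is used.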

Real Data Application

Brain Network Discovery

  1. Common Connections: Identifies emotion processing-related brain network connections shared across all subjects
  2. Individual Differences: Compared to baseline methods, proposed method identifies sparser yet more interpretable connection patterns
  3. Biological Significance: Discovered connections align with known neural mechanisms of emotion processing

Key Findings

  • Bidirectional connections between ventral attention network A and default mode network B
  • Connections from frontoparietal network A to limbic system B
  • Connections within the limbic system, from limbic network A to limbic network B

Related Work

Multi-Subject Time Series Modeling

  1. Multi-Class VAR Models (Wilms et al., 2018): Uses fused Lasso to encourage similarity across subjects
  2. Non-Overlapping Support Models (Skripnikov & Michailidis, 2019): Distinguishes common and unique components through non-convex penalties
  3. Joint VAR Models (Manomaisaowapak & Songsiri, 2022): Uses group Lasso to identify common components

High-Dimensional Time Series

  • Sparse VAR Modeling: Application of Lasso-type methods in high-dimensional settings
  • Debiased Estimation: Statistical inference theory in high-dimensional regression
  • Robust Estimation: M-estimator methods for handling heterogeneous data

Advantages of This Work

Compared to existing methods, this paper is the first to provide:

  1. Theoretically guaranteed subject-specific convergence rates
  2. Complete statistical inference framework
  3. Communication-efficient computational strategy

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: New identifiability conditions significantly improve statistical efficiency of multi-VAR models
  2. Theoretical Contribution: Establishes subject-specific convergence rate theory, breaking through global limitations of existing methods
  3. Practical Value: Inference framework fills important gap in multi-subject high-dimensional time series modeling
  4. Application Prospects: Demonstrates good application potential in neuroscience and related fields

Limitations

  1. Distributional Assumptions: Currently limited to Gaussian innovations; extension to heavy-tailed distributions remains challenging
  2. Parameter Tuning: Lack of standardized criteria for parameter grid selection in cross-validation
  3. Higher-Order Lags: Design of structured penalties for VAR(p) models needs refinement

Future Directions

  1. Distributional Extensions: Handling more general innovation distributions such as sub-exponential distributions
  2. Clustering Extensions: Incorporating clustering decomposition with partially shared pathways
  3. Structured Modeling: Overlapping group sparse methods for higher-order lags

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete convergence rate analysis and asymptotic distribution theory
  2. Methodological Innovation: Cleverly combines robust estimation and communication-efficient framework
  3. Comprehensive Experiments: Covers multiple heterogeneity scenarios and real data validation
  4. High Practical Value: Addresses important theoretical and practical problems in the field

Weaknesses

  1. Computational Complexity: Three-layer cross-validation for parameter selection has high computational cost
  2. Assumption Conditions: Technical conditions in Assumption 2.2 are relatively stringent
  3. Extensibility: Extension of method to more complex model structures needs verification

Impact

  1. Academic Contribution: Provides new theoretical framework for multi-subject high-dimensional time series analysis
  2. Application Value: Has broad application prospects in neuroscience, psychology, and related fields
  3. Reproducibility: Provides complete R package implementation facilitating research reproduction

Applicable Scenarios

  • Multi-subject brain network analysis
  • Individual difference research
  • Heterogeneous time series modeling
  • High-dimensional VAR applications requiring statistical inference

References

The paper cites abundant relevant literature covering multiple domains including high-dimensional statistics, time series analysis, and robust estimation, providing solid theoretical foundation for the research.