Joint modeling and inference of multiple-subject high-dimensional sparse vector autoregressive models

Kim, Fisher, Pipiras

The multiple-subject vector autoregression (multi-VAR) model captures heterogeneous network Granger causality across subjects by decomposing individual sparse VAR transition matrices into commonly shared and subject-unique paths. The model has been applied to characterize hidden shared and unique paths among subjects and has demonstrated performance compared to methods commonly used in psychology and neuroscience. Despite this innovation, the model suffers from using a weighted median for identifying the common effects, leading to statistical inefficiency as the convergence rates of the common and unique paths are determined by the least sparse subject and the smallest sample size across all subjects. We propose a new identifiability condition for the multi-VAR model based on a communication-efficient data integration framework. We show that this approach achieves convergence rates tailored to each subject's sparsity level and sample size. Furthermore, we develop hypothesis tests to assess the nullity and homogeneity of individual paths, using Wald-type test statistics constructed from individual debiased estimators. A test for the significance of the common paths can also be derived through the framework. Simulation studies under various heterogeneity scenarios and a real data application demonstrate the performance of the proposed method compared to existing benchmark across standard evaluation metrics.

Basic Information

Paper ID: 2510.14044
Title: Joint modeling and inference of multiple-subject high-dimensional sparse vector autoregressive models
Authors: Younghoon Kim (Cornell University), Zachary F. Fisher (University of North Carolina at Chapel Hill), Vladas Pipiras (University of North Carolina at Chapel Hill)
Classification: stat.ME (Statistics - Methodology)
Publication Date: October 17, 2025
Paper Link: https://arxiv.org/abs/2510.14044

Abstract

The multi-subject vector autoregressive (multi-VAR) model captures heterogeneous network Granger causality across subjects by decomposing individual sparse VAR transition matrices into common shared pathways and subject-specific pathways. Although this model has been applied to characterize hidden shared and unique pathways across subjects and has demonstrated superior performance compared to commonly used methods in psychology and neuroscience, its use of weighted medians to identify common effects suffers from statistical efficiency issues, as convergence rates for common and unique pathways are determined by the least sparse subject and the minimum sample size across all subjects. This paper proposes new identifiability conditions for the multi-VAR model based on a communication-efficient data integration framework, enabling customized convergence rates tailored to each subject's sparsity level and sample size. Additionally, a hypothesis testing framework is developed to assess the nullity and homogeneity of individual pathways using Wald-type test statistics constructed from subject-specific debiased estimators, from which significance tests for common pathways can be derived.

Research Background and Motivation

Problem Definition

The core problems addressed in this research concern statistical efficiency and inference in multi-subject high-dimensional sparse vector autoregressive modeling, specifically:

Statistical Efficiency Issue: The existing multi-VAR model uses weighted medians to identify common effects, resulting in convergence rates limited by the least sparse subject and minimum sample size, failing to fully leverage the heterogeneous characteristics of each subject.
Missing Inference Framework: Lack of formal hypothesis testing framework for multi-subject VAR models, preventing assessment of individual pathway significance, nullity, and homogeneity.

Research Significance

This problem is important in the following domains:

Neuroscience: Analyzing brain network connectivity patterns across multiple subjects, identifying common and subject-specific neural connections
Psychology: Understanding individual differences and common psychological processes
Genomics: Analyzing common and subject-specific patterns in gene regulatory networks
Finance: Modeling systematic and idiosyncratic risks in financial time series

Limitations of Existing Methods

The original multi-VAR approach has the following limitations:

Suboptimal Convergence Rate: ∥α̂^(k) - α^(k)∥₂ ≤ O_P(√(max_k ∥α^(k)∥₀ · log(d²p) / N_k)), so every subject's rate is constrained by the least sparse subject
Low Computational Efficiency: Requires stacking all subject equations to solve large-scale optimization problems
Lack of Inference Tools: Unable to perform statistical testing and uncertainty quantification

Core Contributions

Proposes New Identifiability Conditions: Based on a communication-efficient data integration framework, avoiding statistical efficiency issues of weighted median methods
Achieves Subject-Specific Convergence Rates: Convergence rates now depend on each subject's own sparsity level and sample size, rather than global worst-case scenarios
Constructs Complete Inference Framework: Develops three classes of hypothesis tests: nullity tests, homogeneity tests, and significance tests
Provides Theoretical Guarantees: Establishes convergence rates for estimators and asymptotic distribution theory for test statistics
Improves Computational Efficiency: Employs a separate estimation and aggregation strategy, significantly reducing computational complexity

Methodology Details

Task Definition

Given K subjects' d-dimensional time series {X_t^(k)}, each subject with T_k time points, the objectives are:

Estimate Common Pathways α^(0): VAR transition matrix parameters shared by all subjects
Estimate Unique Pathways α^(k): Parameters specific to subject k
Satisfy Decomposition Relationship: β^(k) = α^(0) + α^(k), where β^(k) is the complete parameter vector for subject k

Model Architecture

1. VAR Model Specification

Each subject follows a VAR(p) model:

X_t^(k) = Φ₁^(k)X_{t-1}^(k) + ... + Φ_p^(k)X_{t-p}^(k) + ε_t^(k)

where ε_t^(k) ~ N(0, Σ_ε^(k)), Σ_ε^(k) = diag(σ²_{k,1}, ..., σ²_{k,d})
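
As an illustration of the specification above, the following sketch simulates one subject from a VAR(1) with diagonal Gaussian noise and arranges the series into the lagged regression form that a row-wise estimator would use. The function names, the transition matrix, the choice p = 1, and the convention that the effective sample size is N_k = T_k - p are assumptions made for this example and are not taken from the paper.

```python
import random

def simulate_var1(phi, sigma, T, seed=0):
    """Simulate X_t = phi @ X_{t-1} + eps_t with eps_t ~ N(0, diag(sigma^2))."""
    rng = random.Random(seed)
    d = len(phi)
    x = [[0.0] * d]  # start the series at zero
    for _ in range(T - 1):
        prev = x[-1]
        x.append([
            sum(phi[i][j] * prev[j] for j in range(d)) + rng.gauss(0.0, sigma[i])
            for i in range(d)
        ])
    return x

def lagged_design(x, p=1):
    """Return (Y, X): Y holds X_t for t >= p, X holds the stacked p lags."""
    Y, X = [], []
    for t in range(p, len(x)):
        Y.append(list(x[t]))
        # Each row stacks X_{t-1}, ..., X_{t-p} into one vector of length d*p.
        X.append([v for lag in range(1, p + 1) for v in x[t - lag]])
    return Y, X

phi = [[0.5, 0.0], [0.2, 0.3]]  # hypothetical sparse transition matrix
series = simulate_var1(phi, sigma=[1.0, 1.0], T=100)
Y, X = lagged_design(series, p=1)
print(len(Y), len(X[0]))  # rows (assumed N_k = T_k - p) and columns (d*p)
```

Column i of Y then plays the role of a response for variable i, regressed on the shared lagged design X.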

2. Estimation Procedure

Step 1: Subject-Specific Estimation. For each subject k and variable i, apply Lasso regression:

β̂_i^(k) = argmin_{β_i^(k)} {1/(2N_k)||Y_i^(k) - X^(k)β_i^(k)||²₂ + λ_i^(k)||β_i^(k)||₁}
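
A minimal sketch of this Lasso objective, solved by cyclic coordinate descent in plain Python. Coordinate descent is a standard solver for this objective and is used here only for illustration; the summary does not say which solver the authors use, and the names `lasso_cd` and `soft_threshold` as well as the toy data are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: sign(z) * max(|z| - lam, 0)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 1/(2N) * ||y - X b||_2^2 + lam * ||b||_1 by coordinate descent."""
    n, q = len(X), len(X[0])
    b = [0.0] * q
    resid = list(y)  # residual y - X b, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(q)]
    for _ in range(n_iter):
        for j in range(q):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes b_j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b

# Toy data with orthogonal columns: a strong first effect, a weak second one.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.1, 2.0, 0.1]
b_hat = lasso_cd(X_toy, y_toy, lam=0.1)
print(b_hat)
```

In the toy example the weak coefficient is set exactly to zero and the strong one is shrunk toward zero, which is the bias that the debiasing step is meant to correct.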

Step 2: Debiased Estimation. Compute debiased estimators:

β̃_i^(k) = β̂_i^(k) + (1/N_k)Θ̂^(k)X^(k)'(Y_i^(k) - X^(k)β̂_i^(k))

where Θ̂^(k) is an approximate inverse of the Hessian matrix, computed via nodewise regression.
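
The debiasing formula can be sketched as a one-step correction. In this sketch the matrix `theta` is supplied by the caller; the nodewise regression that the paper uses to compute Θ̂^(k) is not reproduced here. The function name `debias` and the toy design are assumptions. The toy design is chosen so that X'X/N is the identity, which makes the identity matrix an exact inverse and lets the result be checked by hand.

```python
def debias(beta_hat, X, y, theta):
    """Return beta_hat + (1/N) * theta @ X' @ (y - X @ beta_hat)."""
    n, q = len(X), len(X[0])
    resid = [y[i] - sum(X[i][j] * beta_hat[j] for j in range(q)) for i in range(n)]
    # score = X' resid / N, the gradient direction used for the correction
    score = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(q)]
    return [
        beta_hat[j] + sum(theta[j][l] * score[l] for l in range(q))
        for j in range(q)
    ]

# Toy design with X'X/N = I, so theta = I is the exact inverse here.
X_toy = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y_toy = [3.0, 1.0, 3.0, 1.0]
beta_lasso = [1.5, 0.5]            # a shrunken estimate, e.g. from a Lasso fit
identity = [[1.0, 0.0], [0.0, 1.0]]
beta_tilde = debias(beta_lasso, X_toy, y_toy, identity)
print(beta_tilde)
```

With an exact inverse the correction undoes the shrinkage completely and returns the least-squares solution; in high dimensions Θ̂^(k) is only approximate, so the correction removes the bias only approximately.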

Step 3: Robust Aggregation. Identify common pathways using a redescending loss function:

(α̃_i^(0))_j = argmin_{x∈ℝ} {∑_{k=1}^K min{((β̃_i^(k))_j - x)², η_j²}}
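
The scalar problem above can be solved exactly for small K by enumeration. The sketch below relies on one observation: at the minimizer, x is the mean of the values lying within η of it, and those values form a contiguous run once sorted, so it is enough to try the mean of every contiguous window. This is an illustrative solver, not necessarily the one used by the authors, and the function names and example values are hypothetical.

```python
def truncated_loss(x, values, eta):
    """Objective: sum over k of min{(v_k - x)^2, eta^2}."""
    return sum(min((v - x) ** 2, eta ** 2) for v in values)

def robust_common_effect(values, eta):
    """Global minimizer of the truncated squared loss over real x."""
    v = sorted(values)
    best_x, best_loss = v[0], truncated_loss(v[0], v, eta)
    for lo in range(len(v)):
        total = 0.0
        for hi in range(lo, len(v)):
            total += v[hi]
            x = total / (hi - lo + 1)  # mean of the window v[lo..hi]
            loss = truncated_loss(x, v, eta)
            if loss < best_loss:
                best_x, best_loss = x, loss
    return best_x

# Four subjects agree on an effect near 0.5; one subject deviates strongly.
estimates = [0.50, 0.52, 0.48, 0.51, 3.00]
common = robust_common_effect(estimates, eta=0.3)
print(common)
```

The deviating subject contributes only the constant η² to the loss, so it does not pull the common effect away from the agreeing subjects, whereas a plain mean of the five values would be close to 1.0.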

Step 4: Sparsification. Apply hard or soft thresholding to recover sparsity:

α̂_i^(0) = HT_{δ₀}(α̃_i^(0))
α̂_i^(k) = HT_{δₖ}(β̃_i^(k) - α̃_i^(0))
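
A short sketch of the two thresholding operators and of the sparsification step as written above, where the unique paths are obtained from the difference between the debiased estimate and the unthresholded common estimate. The threshold values and input vectors are made-up numbers for illustration, and the function names are hypothetical.

```python
import math

def hard_threshold(vec, delta):
    """Keep entries with |v| > delta unchanged; set the rest to zero."""
    return [v if abs(v) > delta else 0.0 for v in vec]

def soft_threshold_vec(vec, delta):
    """Shrink every entry toward zero by delta; small entries become zero."""
    out = []
    for v in vec:
        mag = abs(v) - delta
        out.append(math.copysign(mag, v) if mag > 0 else 0.0)
    return out

alpha0_tilde = [0.50, 0.03, 0.00]    # aggregated common estimate (illustrative)
beta_tilde_k = [0.52, 0.40, -0.02]   # debiased estimate for subject k (illustrative)
delta0, delta_k = 0.05, 0.10         # illustrative thresholds

alpha0_hat = hard_threshold(alpha0_tilde, delta0)
alphak_hat = hard_threshold(
    [b - a for b, a in zip(beta_tilde_k, alpha0_tilde)], delta_k
)
print(alpha0_hat, alphak_hat)
```

Here the first coordinate is kept as a common path, the second is kept as a unique path for subject k, and the third is removed from both.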

Technical Innovations

Robust M-Estimators: Treats common effect identification as a measurement contamination problem, using redescending loss functions to handle outliers
Subject-Specific Thresholds: δₖ ~ √(log q/Nₖ), δ₀ ~ √(log q/(KN_)), fully leveraging sample information from each subject
Communication-Efficient Framework: Avoids global optimization; each subject can compute independently followed by aggregation

Experimental Setup

Datasets

Simulated Data

Parameter Settings: K ∈ {10,15}, d ∈ {10,20}, average sample length T ∈ {50,200}
Heterogeneity Levels: (s₀,sₖ) ∈ {(0.02,0.04), (0.03,0.03), (0.04,0.02)}, corresponding to high, medium, and low heterogeneity
Overall Sparsity: Fixed at 6%
Repetitions: 50 repetitions for each setting
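
To make the sparsity settings concrete, the sketch below draws a common support and K unique supports for a d × d transition matrix with fractions s₀ and sₖ of non-zero entries. Placing the unique supports outside the common support, the use of a single lag, and the function name `draw_supports` are assumptions for this example; the summary does not describe how the paper generates the supports or the coefficient values.

```python
import random

def draw_supports(d, s0, sk, K, seed=0):
    """Draw a common support and K unique supports among the d*d entries.

    s0 and sk are the fractions of common and unique non-zero entries.
    Unique supports are drawn outside the common support (an assumption).
    """
    rng = random.Random(seed)
    cells = [(i, j) for i in range(d) for j in range(d)]
    n0, nk = round(s0 * d * d), round(sk * d * d)
    common = set(rng.sample(cells, n0))
    rest = [c for c in cells if c not in common]
    unique = [set(rng.sample(rest, nk)) for _ in range(K)]
    return common, unique

# High-heterogeneity setting from the list above: (s0, sk) = (0.02, 0.04).
common_support, unique_supports = draw_supports(d=10, s0=0.02, sk=0.04, K=10)
per_subject = len(common_support) + len(unique_supports[0])
print(len(common_support), len(unique_supports[0]), per_subject / 100)
```

Under these assumptions each subject has 6 non-zero entries out of 100, matching the fixed overall sparsity of 6%.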

Real Data

Data Source: Human Connectome Project (HCP) emotion processing task fMRI data
Subjects: 12 female subjects, ages 22-30
Brain Parcellation: Schaefer2018 400-parcel atlas, mapped to 17 functional networks
Sample Length: Average Tₖ = 165 time points

Evaluation Metrics

Estimation Performance

RMSE: ∥α̂ - α∥₂/∥α∥₂
Sensitivity: Proportion of correctly identified non-zero parameters
Specificity: Proportion of correctly identified zero parameters
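
These three metrics can be computed as follows. The sketch treats an entry as selected when its estimate is exactly non-zero, which is natural after thresholding but is an assumption here, and it implements the error as the relative ℓ₂ error given in the formula above. Function names and example vectors are hypothetical.

```python
import math

def relative_error(est, truth):
    """||est - truth||_2 / ||truth||_2."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(est, truth)))
    den = math.sqrt(sum(t ** 2 for t in truth))
    return num / den

def support_rates(est, truth):
    """Return (sensitivity, specificity) of the estimated support."""
    hits = [e != 0 for e, t in zip(est, truth) if t != 0]   # true non-zeros
    clean = [e == 0 for e, t in zip(est, truth) if t == 0]  # true zeros
    sens = sum(hits) / len(hits) if hits else float("nan")
    spec = sum(clean) / len(clean) if clean else float("nan")
    return sens, spec

truth = [0.5, 0.0, 0.3, 0.0]
est = [0.4, 0.0, 0.0, 0.1]
err = relative_error(est, truth)
sens, spec = support_rates(est, truth)
print(err, sens, spec)
```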

Inference Performance

FDR: False discovery rate
Power: Statistical power
Computation Time: Speedup ratio relative to baseline methods
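
For a single simulation run, the false discovery proportion and the empirical power can be computed from the rejection decisions and the known truth, as in the sketch below. FDR is then estimated by averaging the false discovery proportion over repetitions. The function name and the example decisions are hypothetical.

```python
def fdp_and_power(rejected, is_null):
    """rejected[i]: test i was rejected; is_null[i]: null hypothesis i is true."""
    false_disc = sum(1 for r, n in zip(rejected, is_null) if r and n)
    true_disc = sum(1 for r, n in zip(rejected, is_null) if r and not n)
    n_rej = false_disc + true_disc
    n_alt = sum(1 for n in is_null if not n)
    fdp = false_disc / n_rej if n_rej else 0.0       # no rejections: FDP is 0
    power = true_disc / n_alt if n_alt else float("nan")
    return fdp, power

rejected = [True, True, False, True]
is_null = [False, True, False, False]
fdp, power = fdp_and_power(rejected, is_null)
print(fdp, power)
```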

Comparison Methods

multi-VAR: Original multi-subject VAR model
multi-VAR(A): multi-VAR with adaptive Lasso penalty

Experimental Results

Main Results

Estimation Performance

Low-Dimensional Case (d=10): Proposed method outperforms existing methods in RMSE
High-Dimensional Case (d=20): Performance gap narrows as sample size increases
Sensitivity and Specificity: Comparable to adaptive multi-VAR, indicating subject-specific thresholds function similarly to adaptive weights

Computational Efficiency

The proposed method is faster than the baseline methods, and the speedup grows with dimension and sample length:

d=10, T=50: Speedup ratio approximately 2-3×
d=20, T=200: Speedup ratio reaches 60-100×

Convergence Rate Improvement

Theoretical analysis demonstrates subject-specific convergence rates:

Common Pathways: ∥α̂^(0) - α^(0)∥₂ ≤ O_P(√(s₀,max log d²/(KN_)))
Unique Pathways: ∥α̂^(k) - α^(k)∥₂ ≤ O_P(√(sₖ,max log d²/Nₖ))

Inference Results

Hypothesis Testing Performance

Nullity Tests: FDR between 0.0 and 0.6, power between 0.5 and 1.0
Homogeneity Tests: FDR between 0.0 and 0.6, power between 0.4 and 1.0
Significance Tests: FDR consistently 0, power between 0.25 and 1.0

Testing performance improves with increasing sample size and is robust to dimensional changes.
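
As a rough illustration of how Wald-type tests can be built from debiased estimates, the sketch below uses the generic textbook form: an estimate divided by its standard error, compared with a standard normal distribution. The paper's exact statistics, variance estimators, and multiple-testing procedure are not given in this summary, so the pairwise form of the homogeneity test, the independence of subjects, the function names, and the input numbers are all assumptions.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def nullity_test(beta_tilde, se):
    """Wald z-test of H0: beta = 0 from a debiased estimate and its standard error."""
    z = beta_tilde / se
    return z, two_sided_p(z)

def homogeneity_test(beta_k, se_k, beta_l, se_l):
    """Wald z-test of H0: beta^(k) = beta^(l), assuming independent subjects."""
    z = (beta_k - beta_l) / math.sqrt(se_k ** 2 + se_l ** 2)
    return z, two_sided_p(z)

z_null, p_null = nullity_test(0.196, 0.1)
z_hom, p_hom = homogeneity_test(0.5, 0.1, 0.5, 0.1)
print(z_null, p_null, z_hom, p_hom)
```

When many paths are tested at once, the resulting p-values would still need a multiplicity correction before FDR can be controlled.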

Real Data Application

Brain Network Discovery

Common Connections: Identifies emotion processing-related brain network connections shared across all subjects
Individual Differences: Compared to baseline methods, proposed method identifies sparser yet more interpretable connection patterns
Biological Significance: Discovered connections align with known neural mechanisms of emotion processing

Key Findings

Bidirectional connections between ventral attention network A and default mode network B
Connections from frontoparietal network A to limbic system B
Connections within limbic system from A to B

Related Work

Multi-Subject Time Series Modeling

Multi-Class VAR Models (Wilms et al., 2018): Uses fused Lasso to encourage similarity across subjects
Non-Overlapping Support Models (Skripnikov & Michailidis, 2019): Distinguishes common and unique components through non-convex penalties
Joint VAR Models (Manomaisaowapak & Songsiri, 2022): Uses group Lasso to identify common components

High-Dimensional Time Series

Sparse VAR Modeling: Application of Lasso-type methods in high-dimensional settings
Debiased Estimation: Statistical inference theory in high-dimensional regression
Robust Estimation: M-estimator methods for handling heterogeneous data

Advantages of This Work

Compared to existing methods, this paper is the first to provide:

Theoretically guaranteed subject-specific convergence rates
Complete statistical inference framework
Communication-efficient computational strategy

Conclusions and Discussion

Main Conclusions

Method Effectiveness: New identifiability conditions significantly improve statistical efficiency of multi-VAR models
Theoretical Contribution: Establishes subject-specific convergence rate theory, breaking through global limitations of existing methods
Practical Value: Inference framework fills important gap in multi-subject high-dimensional time series modeling
Application Prospects: Demonstrates good application potential in neuroscience and related fields

Limitations

Distributional Assumptions: Currently limited to Gaussian innovations; extension to heavy-tailed distributions remains challenging
Parameter Tuning: Lack of standardized criteria for parameter grid selection in cross-validation
Higher-Order Lags: Design of structured penalties for VAR(p) models needs refinement

Future Directions

Distributional Extensions: Handling more general innovation distributions such as sub-exponential distributions
Clustering Extensions: Incorporating clustering decomposition with partially shared pathways
Structured Modeling: Overlapping group sparse methods for higher-order lags

In-Depth Evaluation

Strengths

Theoretical Rigor: Provides complete convergence rate analysis and asymptotic distribution theory
Methodological Innovation: Cleverly combines robust estimation and communication-efficient framework
Comprehensive Experiments: Covers multiple heterogeneity scenarios and real data validation
High Practical Value: Addresses important theoretical and practical problems in the field

Weaknesses

Computational Complexity: Three-layer cross-validation for parameter selection has high computational cost
Assumption Conditions: Technical conditions in Assumption 2.2 are relatively stringent
Extensibility: Extension of method to more complex model structures needs verification

Impact

Academic Contribution: Provides new theoretical framework for multi-subject high-dimensional time series analysis
Application Value: Has broad application prospects in neuroscience, psychology, and related fields
Reproducibility: Provides complete R package implementation facilitating research reproduction

Applicable Scenarios

Multi-subject brain network analysis
Individual difference research
Heterogeneous time series modeling
High-dimensional VAR applications requiring statistical inference
