Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage

Nie, Kumar, Chen et al.
Recent advances in machine learning such as Long Short-Term Memory (LSTM) models and Transformers have been widely adopted in hydrological applications, demonstrating impressive performance amongst deep learning models and outperforming physical models in various tasks. However, their superiority in predicting land surface states such as terrestrial water storage (TWS) that are dominated by many factors such as natural variability and human driven modifications remains unclear. Here, using the open-access, globally representative HydroGlobe dataset - comprising a baseline version derived solely from a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation - we show that linear regression is a robust benchmark, outperforming the more complex LSTM and Temporal Fusion Transformer for TWS prediction. Our findings highlight the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models. Additionally, we emphasize the critical need to establish globally representative benchmark datasets that capture the combined impact of natural variability and human interventions.

Rethinking Deep Learning: Linear Regression Remains a Key Benchmark in Predicting Terrestrial Water Storage

Basic Information

  • Paper ID: 2510.10799
  • Title: Rethinking deep learning: linear regression remains a key benchmark in predicting terrestrial water storage
  • Authors: Wanshu Nie, Sujay V. Kumar, Junyu Chen, Long Zhao, Olya Skulovich, Jinwoong Yoo, Justin Pflug, Shahryar Khalique Ahmad, Goutam Konapala
  • Classification: cs.LG physics.ao-ph physics.geo-ph
  • Institutions: NASA Goddard Space Flight Center, Johns Hopkins University, et al.
  • Paper Link: https://arxiv.org/abs/2510.10799

Abstract

In recent years, machine learning techniques such as Long Short-Term Memory (LSTM) networks and Transformers have been widely adopted in hydrological applications, performing strongly among deep learning models and surpassing physics-based models in various tasks. However, it remains unclear whether they are superior for predicting land surface states such as Terrestrial Water Storage (TWS), which are shaped by many factors, including natural variability and human-driven modifications. Using the open-access, globally representative HydroGlobe dataset (a baseline version based solely on a land surface model simulation and an advanced version incorporating multi-source remote sensing data assimilation), this study shows that linear regression is a robust benchmark that outperforms the more complex LSTM and Temporal Fusion Transformer models in TWS prediction. The results underline the importance of including traditional statistical models as benchmarks when developing and evaluating deep learning models, and the need to establish globally representative benchmark datasets that capture the combined effects of natural variability and human interventions.

Research Background and Motivation

Problem Definition

Terrestrial Water Storage (TWS) is a key indicator of global freshwater availability, encompassing all forms of water stored on land, including soil moisture, groundwater, surface water, and snow. Accurate TWS estimation is critical for ecosystem protection, agricultural support, and water and food security.

Research Motivation

  1. Popularity of Deep Learning in Hydrology: Deep learning models such as LSTM and Transformer have become increasingly popular in hydrological applications, particularly excelling in tasks such as rainfall-runoff modeling.
  2. Non-stationarity Challenges: TWS is influenced by complex interactions between climate variability and human activities (such as groundwater extraction, land use change, and reservoir operations), exhibiting strong non-stationarity.
  3. Benchmark Selection Issues: Existing research often compares only among deep learning models, lacking comparisons with simple statistical methods.
  4. Dataset Limitations: Lack of globally representative benchmark datasets that comprehensively reflect both natural and anthropogenic influences.

Limitations of Existing Methods

  1. LSTM Limitations: Computationally expensive on long input sequences; limited ability to capture long-term dependencies when trained on shorter sequences.
  2. Transformer Challenges: Self-attention mechanisms are inherently permutation-invariant, potentially leading to loss of temporal information.
  3. Evaluation Bias: Lack of systematic comparison with traditional statistical methods.
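
The permutation point in item 2 can be made concrete with a minimal sketch. It assumes a toy self-attention layer with identity query/key/value projections and no positional encoding (neither is taken from the paper). Under those assumptions, shuffling the input time steps only shuffles the outputs in the same way, so any order-agnostic summary of them, such as a mean, cannot tell the two orderings apart.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention with identity Q/K/V projections
    and no positional encoding (a simplifying assumption for illustration)."""
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out

def mean_pool(rows):
    """Average the rows, which discards their order."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

x = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # three time steps, two features
perm = [2, 0, 1]                           # an arbitrary reordering of the steps
x_shuffled = [x[i] for i in perm]

y = self_attention(x)
y_shuffled = self_attention(x_shuffled)

# Row k of the shuffled output matches row perm[k] of the original output.
same_rows = all(
    math.isclose(a, b)
    for k in range(len(perm))
    for a, b in zip(y_shuffled[k], y[perm[k]])
)
# The pooled summaries are therefore identical.
same_pool = all(math.isclose(a, b)
                for a, b in zip(mean_pool(y), mean_pool(y_shuffled)))
```

This is why Transformer-style models rely on positional or temporal encodings to recover ordering information.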

Core Contributions

  1. Systematic Benchmark Comparison: First systematic comparison of linear regression, LSTM, and Temporal Fusion Transformer (TFT) performance in global-scale TWS prediction tasks.
  2. HydroGlobe Dataset Application: Utilization of a global hydrological dataset with two versions capturing natural variability (OL) and anthropogenic impacts (DA).
  3. Evidence for Linear Regression as a Strong Benchmark: Demonstration that the basin-specific linear regression model (Linear_single) outperforms the more complex LSTM and TFT models on most metrics in the main TWS prediction experiments.
  4. Non-stationarity Analysis: In-depth analysis of performance differences among models in handling non-stationary environments.
  5. Emphasis on Benchmark Importance: Highlighting the importance of including traditional statistical benchmarks in deep learning model evaluation.

Methodology Details

Task Definition

  • Input: Monthly features from the past 12 months (precipitation, temperature, Leaf Area Index LAI, surface soil moisture SSMC) and static features (elevation, slope, soil texture, land cover, etc.)
  • Output: Terrestrial Water Storage (TWS) for the current month
  • Constraint: Historical TWS values are not used as input features, simulating realistic prediction scenarios.

Model Architecture

1. Linear Regression Models

  • Linear_single (Baseline Model): Linear regression model trained separately for each basin.
  • Linear_glob: Global linear model trained using data from all basins.

Feature Composition:

  • Lagged time-varying features: 48 (4 variables × 12 monthly lags of precipitation, temperature, LAI, SSMC)
  • Monthly categorical variables: 11 (proxies for seasonal effects)
  • Trend features: 1 (time index)
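
The feature composition above adds up to 48 + 11 + 1 = 60 inputs per sample. The sketch below shows one way such a row could be assembled. The column names, the choice of lags 1 to 12 (rather than 0 to 11), the January reference month, and the assumption that each series starts in January are illustrative and not confirmed by the paper.

```python
VARIABLES = ["precip", "temp", "lai", "ssmc"]  # hypothetical column names
N_LAGS = 12

def build_row(series, t, n_lags=N_LAGS):
    """Build the feature row for target month t.

    series maps each variable name to a list of monthly values; t is a
    0-based month index, with index 0 assumed to be a January.
    """
    # 4 variables x 12 lags = 48 lagged features (lag 1 is the previous month).
    lagged = [series[v][t - k] for v in VARIABLES for k in range(1, n_lags + 1)]
    # 11 month dummies; January (month 0) is the reference category.
    month = t % 12
    dummies = [1.0 if month == m else 0.0 for m in range(1, 12)]
    # 1 trend feature: the time index itself.
    trend = [float(t)]
    return lagged + dummies + trend

# Synthetic three-year series, only to exercise the function.
series = {v: [float(i) for i in range(36)] for v in VARIABLES}
row_jan = build_row(series, 12)   # a January target
row_feb = build_row(series, 13)   # a February target
```

Each row would then be paired with the TWS value for month t as the regression target; TWS itself never appears among the inputs.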

2. Deep Learning Models

  • LSTM: Single-layer LSTM network processing time-varying and static inputs.
  • Temporal Fusion Transformer (TFT): Hybrid architecture combining LSTM units and multi-head attention mechanisms.

Technical Innovations

  1. Comparative Dataset Design: Evaluation of model performance under different degrees of non-stationarity through OL and DA versions.
  2. Comprehensive Evaluation Framework: Experiments including different sequence lengths, prediction horizons, and temporal resolutions.
  3. Interpretability Analysis: Model behavior analysis using SHAP values and attention weights.
  4. Fair Comparison Strategy: Use of identical loss functions (quantile loss) and evaluation metrics.
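
For reference, the quantile (pinball) loss mentioned in item 4 can be sketched as follows. This is the standard textbook definition; which quantiles the paper actually trains on is not stated in this summary, so the default q = 0.5 is an assumption.

```python
def quantile_loss(y_true, y_pred, q=0.5):
    """Mean pinball loss for quantile q in (0, 1).

    Under-prediction (y_true > y_pred) is weighted by q and
    over-prediction by (1 - q).
    """
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        err = yt - yp
        total += max(q * err, (q - 1.0) * err)
    return total / len(y_true)

under = quantile_loss([2.0], [0.0], q=0.9)  # model predicts too low
over = quantile_loss([0.0], [2.0], q=0.9)   # model predicts too high
```

With q = 0.5 the loss reduces to half the mean absolute error, so it penalizes errors in both directions equally.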

Experimental Setup

Dataset

HydroGlobe Dataset:

  • Spatiotemporal Coverage: 2003-2020, 10km spatial resolution, 515 global basins.
  • OL Version: Baseline simulation based solely on Noah-MP land surface model.
  • DA Version: Data assimilation product integrating GRACE TWS, ESA CCI soil moisture, and MODIS LAI.

Data Partition:

  • Training Period: 2003-2015 (linear models); 2003-2012 (deep learning models)
  • Validation Period: 2013-2015 (deep learning models only)
  • Test Period: 2016-2020
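
The partition above can be written as a small lookup, sketched here with a hypothetical `model_family` label ("linear" or "deep"). The year boundaries are those listed above; the function itself is illustrative.

```python
TEST_YEARS = range(2016, 2021)  # 2016-2020 inclusive

def assign_split(year, model_family):
    """Return 'train', 'validation', or 'test' for a calendar year.

    model_family is a hypothetical label: 'linear' or 'deep'.
    """
    if year in TEST_YEARS:
        return "test"
    if not 2003 <= year <= 2015:
        raise ValueError(f"year {year} is outside the 2003-2020 record")
    # Only the deep learning models hold out 2013-2015 for validation;
    # the linear models use those years for training.
    if model_family == "deep" and year >= 2013:
        return "validation"
    return "train"
```

Both model families are evaluated on the same 2016-2020 test years, so the comparison differs only in how the pre-2016 data is used.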

Evaluation Metrics

  • Bias: Systematic error
  • Root Mean Square Error (RMSE): Overall prediction accuracy
  • Correlation Coefficient: Strength of linear relationship
  • Nash-Sutcliffe Efficiency (NSE): Model's ability to explain variance
  • Kling-Gupta Efficiency (KGE): Comprehensive evaluation metric

NSE Formula: NSE = 1 - \frac{\sum_{t=1}^{T}(y_{pred} - y_{obs})^2}{\sum_{t=1}^{T}(y_{obs} - \overline{y_{obs}})^2}

KGE Formula: KGE = 1 - \sqrt{(r-1)^2 + (\frac{\sigma_{pred}}{\sigma_{obs}}-1)^2 + (\frac{\mu_{pred}}{\mu_{obs}}-1)^2}
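
The five metrics can be computed directly from the definitions above. The sketch below uses only the Python standard library and population (not sample) standard deviations, which is an assumption; the paper's exact implementation is not given in this summary.

```python
import math
from statistics import mean, pstdev

def bias(obs, pred):
    """Mean of (prediction - observation)."""
    return mean(p - o for o, p in zip(obs, pred))

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(mean((p - o) ** 2 for o, p in zip(obs, pred)))

def pearson_r(obs, pred):
    """Pearson correlation coefficient (undefined for constant series)."""
    mo, mp = mean(obs), mean(pred)
    cov = mean((o - mo) * (p - mp) for o, p in zip(obs, pred))
    return cov / (pstdev(obs) * pstdev(pred))

def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 is perfect, 0 matches the observed mean."""
    mo = mean(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den

def kge(obs, pred):
    """Kling-Gupta Efficiency from correlation, variability ratio, and mean ratio."""
    r = pearson_r(obs, pred)
    alpha = pstdev(pred) / pstdev(obs)   # sigma_pred / sigma_obs
    beta = mean(pred) / mean(obs)        # mu_pred / mu_obs
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = [1.0, 2.0, 3.0, 4.0]
shifted = [o + 1.0 for o in obs]   # perfect shape, constant offset of +1
```

Note that the mean-ratio term in KGE becomes ill-conditioned when the observed mean is close to zero, which can happen if TWS is expressed as anomalies.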

Comparison Methods

  • Traditional Methods: Random Forest, LightGBM
  • Deep Learning: LSTM, Temporal Fusion Transformer
  • Baselines: Basin-specific and global linear regression

Experimental Results

Main Results

OL Dataset Performance

Linear_single significantly outperforms the other three models on all evaluation metrics (except bias):

  • Best Performance Ranking: Linear_single > TFT > LSTM > Linear_glob
  • TFT shows best performance on bias metric, even surpassing Linear_single
  • Linear_glob performs worst, particularly on correlation and NSE metrics

DA Dataset Performance

Linear_single again outperforms other models, but overall performance declines:

  • All models perform worse on DA dataset compared to OL dataset
  • Strong non-stationarity (more negative TWS trends) challenges all models
  • LSTM performs worst in handling strong non-stationarity

Spatial Distribution Analysis

  • In basins with strong negative TWS trends, best-performing models are primarily Linear_single or TFT
  • LSTM struggles to predict trends in basins exhibiting strong non-stationarity

Ablation Studies

Sequence Length Impact

Testing different input sequence lengths from 6-18 months:

  • LSTM and TFT: Increased sequence length does not significantly improve performance
  • SHAP Analysis: LSTM primarily relies on recent time steps, utilizing less historical information
  • Attention Analysis: TFT's attention patterns are inconsistent across different sequence lengths

Prediction Task Performance

Experiments with prediction horizons of 1 to 6 months:

  • Short-term Prediction (≤3 months): Linear_single performs best
  • Long-term Prediction (>3 months): TFT shows more stable performance, surpassing Linear_single
  • LSTM: Performs worst across all prediction horizons

Temporal Resolution Impact

Training with daily data:

  • Training data increases from 55,620 to 375,435 points
  • No significant performance improvement across all models
  • Indicates that training data scale is not a limiting factor

Non-stationarity Handling Mechanisms

Discovery through removal of TFT's temporal index embedding:

  • Temporal embedding is the primary mechanism for TFT's non-stationarity handling
  • After removal, performance declines markedly in basins with strong declining TWS trends
  • Self-attention mechanism alone is insufficient for handling non-stationarity
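
Why an explicit time index matters under a declining trend can be illustrated with a linear model rather than TFT itself. The sketch below is a synthetic, noise-free toy (the seasonal driver, the decline of 0.5 units per month, and the 13-year/5-year split are all made up for illustration): a regression on the seasonal driver alone cannot follow the decline into the test years, while adding the time index lets it extrapolate.

```python
import math

def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def rmse(rows, y, w):
    """Root mean square error of the linear prediction rows @ w."""
    sq = [(sum(a * b for a, b in zip(r, w)) - t) ** 2 for r, t in zip(rows, y)]
    return math.sqrt(sum(sq) / len(y))

# Synthetic basin: a seasonal driver plus a steady decline in storage.
months = list(range(216))                              # 18 years, monthly
driver = [math.sin(2 * math.pi * t / 12) for t in months]
tws = [10.0 * d - 0.5 * t for d, t in zip(driver, months)]

train, test = slice(0, 156), slice(156, 216)           # 13 years / 5 years

no_trend = [[1.0, d] for d in driver]                  # intercept + driver
with_trend = [[1.0, d, float(t)] for d, t in zip(driver, months)]

w_a = fit_ols(no_trend[train], tws[train])
w_b = fit_ols(with_trend[train], tws[train])
rmse_no_trend = rmse(no_trend[test], tws[test], w_a)
rmse_with_trend = rmse(with_trend[test], tws[test], w_b)
```

In this toy setting the model without the time index stays anchored near the training-period mean and misses the continued decline, which mirrors the behavior described above when TFT loses its temporal index embedding.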

Tree Model Comparison

Random Forest and LightGBM compared with Linear_single:

  • Linear_single outperforms tree models on most metrics
  • Tree models perform worse in basins with severe distribution shift
  • Demonstrates that increased model complexity does not necessarily improve performance

Related Work

Deep Learning Applications in Hydrology

  1. LSTM Advantages: Consistently outperforms physics-based models in rainfall-runoff modeling, with capabilities for sequence processing and cross-basin generalization.
  2. Transformer Development: Introduced to hydrology following success in natural language processing, but effectiveness in time series tasks remains controversial.
  3. Benchmark Issues: Existing research often compares only among deep learning models, lacking comparison with simple methods.

Time Series Prediction Controversy

Recent research questions the necessity of Transformers in time series tasks:

  • Permutation invariance of self-attention may lead to loss of temporal information
  • Simple models can achieve comparable performance in certain tasks
  • Emphasizes the importance of selecting appropriate benchmarks

Conclusions and Discussion

Main Conclusions

  1. Robustness of Linear Regression: Basin-specific linear regression (Linear_single) outperforms LSTM and TFT on most metrics in the main TWS prediction experiments, although TFT is more stable at prediction horizons beyond 3 months.
  2. Importance of Benchmarks: Traditional statistical methods should serve as important benchmarks in deep learning model evaluation.
  3. Criticality of Datasets: Need for globally representative datasets reflecting both natural and anthropogenic influences.
  4. Non-stationarity Challenges: All models face difficulties in handling non-stationarity caused by anthropogenic impacts.

Limitations

  1. Task Specificity: Conclusions may be specific to TWS prediction tasks and not necessarily applicable to other hydrological applications.
  2. Feature Limitations: Lack of explicit anthropogenic intervention features (such as irrigation withdrawal) may limit deep learning model advantages.
  3. Temporal Range: 18 years of data may be insufficient for fully evaluating long-term dependencies.
  4. Spatial Scale: Basin-scale aggregation may mask sub-grid scale complexity.

Future Directions

  1. Feature Engineering: Development of better proxy variables for anthropogenic activities.
  2. Architectural Innovation: Design of deep learning architectures specifically addressing non-stationarity.
  3. Pre-training Strategies: Exploration of foundation models in hydrology.
  4. Multi-scale Modeling: Integration of information across different spatiotemporal scales.

In-Depth Evaluation

Strengths

  1. Rigorous Research Design: Systematic comparative experiments with multi-dimensional analysis.
  2. High-Quality Dataset: HydroGlobe dataset with global representativeness, incorporating both natural and anthropogenic influences.
  3. In-depth Analysis: Detailed model behavior analysis through interpretability methods such as SHAP values and attention weights.
  4. High Practical Value: Provides important methodological guidance for deep learning applications in hydrology.
  5. Clear Writing: Logical structure, rich figures and tables, facilitating comprehension.

Shortcomings

  1. Limited Generalizability: Conclusions primarily based on TWS prediction tasks; applicability to other hydrological applications requires verification.
  2. Model Selection: While representative models are selected, not all latest deep learning architectures are covered.
  3. Hyperparameter Optimization: Use of identical hyperparameters across different experiments may not be entirely fair.
  4. Missing Physical Constraints: Does not consider the role of physical constraints in models.

Impact

  1. Academic Contribution: Challenges the notion that deep learning is inherently superior in hydrology.
  2. Methodological Value: Emphasizes the importance of benchmark selection and fair comparison.
  3. Practical Guidance: Provides important reference for practitioners in hydrology regarding model selection.
  4. Dataset Contribution: HydroGlobe dataset provides valuable resource for subsequent research.

Applicable Scenarios

  1. Water Resource Management: Provides guidance for water resource management departments in selecting TWS prediction tools.
  2. Climate Impact Assessment: Evaluates impacts of climate change and human activities on water cycles.
  3. Extreme Event Early Warning: Early warning for hydrological extreme events such as floods and droughts.
  4. Academic Research: Provides benchmarks and datasets for machine learning research in hydrology.


Overall Assessment: This is a high-quality interdisciplinary research paper that, through rigorous experimental design and in-depth analysis, challenges common assumptions about deep learning applications in hydrology, emphasizing the value of traditional statistical methods and the importance of appropriate benchmark selection. The research results have important methodological significance for both the hydrology and machine learning communities.