Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models
Wolniewicz, Kelebek, Mestici et al.
Operational forecasting of the ionosphere remains a critical space weather challenge due to sparse observations, complex coupling across geospatial layers, and a growing need for timely, accurate predictions that support Global Navigation Satellite System (GNSS), communications, aviation safety, and satellite operations. As part of the 2025 NASA Heliolab, we present a curated, open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure, designed specifically to support next-generation forecasting models and address gaps in current operational frameworks. Our workflow integrates a large selection of data sources comprising Solar Dynamics Observatory data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL's Global Ionospheric Maps of Total Electron Content (GIM-TEC). We also incorporate geospatially sparse data such as the TEC derived from the World-Wide GNSS Receiver Network and crowdsourced Android smartphone measurements. This novel heterogeneous dataset is temporally and spatially aligned into a single, modular data structure that supports both physical and data-driven modeling. Leveraging this dataset, we train and benchmark several spatiotemporal machine learning architectures for forecasting vertical TEC under both quiet and geomagnetically active conditions. This work presents an extensive dataset and modeling pipeline that enables exploration of not only ionospheric dynamics but also broader Sun-Earth interactions, supporting both scientific inquiry and operational forecasting efforts.
Title: Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models
Authors: Linnea M. Wolniewicz, Halil S. Kelebek, Simone Mestici, Michael D. Vergalla, Giacomo Acciarini, Bala Poduval, Olga Verkhoglyadova, Madhulika Guhathakurta, Thomas E. Berger, Atılım Güneş Baydin, Frank Soboczenski
Institutions: University of Hawai'i at Mānoa, University of Oxford, Università degli Studi di Roma Sapienza, Free Flight Research Lab, ESA, University of New Hampshire, NASA JPL, NASA Headquarters, University of Colorado Boulder, University of York & King's College London
Publication Venue: NeurIPS 2025 Workshop: Machine Learning for the Physical Sciences
Operational forecasting of the ionosphere represents a critical challenge in space weather, with primary difficulties arising from sparse observational data, complex coupling across geospace layers, and growing demand for timely and accurate predictions supporting Global Navigation Satellite Systems (GNSS), communications, aviation safety, and satellite operations. As part of the 2025 NASA Heliolab project, this paper presents a carefully curated open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure. The dataset synthesizes multiple data sources including Solar Dynamics Observatory (SDO) data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL's Global Ionospheric Maps of Total Electron Content (GIM-TEC). The research team trained and benchmarked multiple spatiotemporal machine learning architectures for predicting vertical TEC under both quiet and geomagnetically active conditions, providing support for both scientific research and operational forecasting.
Ionospheric forecasting faces three major challenges:
Data Sparsity: Observational data is unevenly distributed in time and space
Multi-scale Coupling: Complex interactions exist between solar activity, magnetosphere, and ionosphere-thermosphere systems
Urgent Operational Needs: Modern technological infrastructure (GNSS, satellite constellations, aviation networks, power grids) increasingly depends on accurate and timely space weather forecasts
Space weather events (such as solar flares and coronal mass ejections) can cause:
Satellite operation disruptions (e.g., the February 2022 geomagnetic storm that caused 38 Starlink satellites to reenter the atmosphere)
Degraded GNSS accuracy
Disrupted radio communications
Power grid failures
With the rapid expansion of LEO satellite constellations and deepening dependence on space infrastructure, accurate ionospheric forecasting has become critical.
Research objective: construct a standardized, machine learning-ready dataset that integrates heterogeneous multi-source observational data, unifies their spatiotemporal scales, and provides a foundation for developing, testing, and benchmarking advanced ML architectures, ultimately enabling a digital twin of the ionosphere.
Constructed the first comprehensive ML-ready ionospheric dataset: Integrates 8 major data sources spanning 14 years (2010-2024) of multimodal observational data
Achieved spatiotemporal alignment of heterogeneous data:
Handled temporal frequency differences across data sources (from 15 seconds to daily)
Unified missing value representation and handling strategies
Provided multiple temporal resolution options (as fine as 15 minutes)
Provided a geomagnetic storm event catalog (MESTICI scale):
Based on Kp index and NOAA G-level standards
Considers event duration
Prevents data leakage between training/validation sets
Open-sourced data and code:
Google Cloud public storage bucket
GitHub open-source processing code
PyTorch dataset interface
Benchmarked multiple ML models (IonCast series):
LSTM baseline model
Spherical Fourier Neural Operator (SFNO) model
GraphCast-inspired model
Produced forecasts at a 12-hour lead time that outperform a persistence baseline
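To make the persistence comparison concrete, the sketch below scores a persistence forecast at a 12-hour lead time and defines a simple skill score relative to it. This is an illustration only, not the paper's evaluation code: the synthetic vertical TEC series, the 15-minute cadence, and the function names (`rmse`, `persistence_rmse`, `skill_score`) are all assumptions.

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over paired values, skipping pairs that contain NaN."""
    sq = [(p - t) ** 2 for p, t in zip(pred, truth)
          if not (math.isnan(p) or math.isnan(t))]
    return math.sqrt(sum(sq) / len(sq))

def persistence_rmse(series, lead_steps):
    """Persistence predicts that the value at step i + lead_steps equals the value at step i."""
    forecast = series[:-lead_steps]
    truth = series[lead_steps:]
    return rmse(forecast, truth)

def skill_score(model_rmse, baseline_rmse):
    """Positive when the model beats the baseline, zero when they tie."""
    return 1.0 - model_rmse / baseline_rmse

# Synthetic vertical TEC (in TECU) with a purely diurnal cycle; assumed 15-minute cadence.
cadence_min = 15
steps_per_day = 24 * 60 // cadence_min   # 96 samples per day
lead_steps = 12 * 60 // cadence_min      # 48 samples = 12 hours
vtec = [20 + 10 * math.sin(2 * math.pi * i / steps_per_day)
        for i in range(3 * steps_per_day)]

baseline = persistence_rmse(vtec, lead_steps)
print(f"persistence RMSE at 12 h lead: {baseline:.2f} TECU")
```

On a purely diurnal signal, a 12-hour lead is the worst case for persistence because the forecast is half a cycle out of phase. Real TEC is less regular, so the baseline a model has to beat depends on the period and conditions being evaluated.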
Temporal Reference: SDO-FM data range as baseline (2010-05-13 to 2024-08-01)
Missing Value Handling:
Standardize all missing values to NaN
Process non-standard sentinel values in OMNI dataset
Remove feature columns with large-scale missing data
Forward Fill Strategy:
- Define maximum rewind time
- For most data streams: rewind time = native frequency
- OMNI exception: rewind time = 50 minutes
- Gaps exceeding rewind time: skip timestamp
Resample to Unified Frequency: Use forward fill as simple interpolation strategy
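The preprocessing steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the released pipeline: the sentinel codes in `SENTINELS` are placeholders (the real fill values vary by OMNI field), the helper names are hypothetical, and treating a sample exactly as old as the rewind limit as still usable is a boundary choice made here.

```python
import math
from bisect import bisect_right
from datetime import datetime, timedelta

# Placeholder sentinel codes (assumption); check each OMNI field's documented fill value.
SENTINELS = {9999.99, 99999.9}

def to_nan(value, sentinels=SENTINELS):
    """Map None and sentinel codes to NaN; pass real values through as floats."""
    if value is None or value in sentinels:
        return math.nan
    return float(value)

def align_forward_fill(times, values, target_times, max_rewind):
    """Forward-fill onto target_times, looking back at most max_rewind.

    `times` must be sorted ascending. Targets with no valid sample inside
    the rewind window are skipped rather than filled.
    """
    valid = [(t, v) for t, v in zip(times, values) if not math.isnan(v)]
    valid_times = [t for t, _ in valid]
    aligned = []
    for target in target_times:
        i = bisect_right(valid_times, target) - 1   # latest sample at or before target
        if i < 0 or target - valid_times[i] > max_rewind:
            continue                                # gap too long: skip this timestamp
        aligned.append((target, valid[i][1]))
    return aligned

# Toy stream: minutes after t0 and raw values, with one sentinel and one long gap.
t0 = datetime(2015, 3, 17)
raw = [(0, 5.0), (5, 9999.99), (10, 6.0), (120, 7.0)]
times = [t0 + timedelta(minutes=m) for m, _ in raw]
values = [to_nan(v) for _, v in raw]

# Resample to a 15-minute grid with the 50-minute rewind limit used for OMNI.
targets = [t0 + timedelta(minutes=15 * k) for k in range(9)]
aligned = align_forward_fill(times, values, targets, timedelta(minutes=50))
for stamp, value in aligned:
    print(stamp.strftime("%H:%M"), value)
```

In this toy stream the grid points at 01:15, 01:30, and 01:45 are dropped because the most recent valid sample (00:10) is more than 50 minutes old. Dropping them keeps stale values from being propagated across long outages.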
Based on NOAA G-level standards combined with event duration:
| Event ID | Kp Range | NOAA Level | Duration |
| --- | --- | --- | --- |
| G0Hℓ | Kp < 5 | Quiet | ℓ hours |
| G1Hℓ | 5 ≤ Kp < 6 | Minor | ℓ hours |
| G2Hℓ | 6 ≤ Kp < 7 | Moderate | ℓ hours |
| G3Hℓ | 7 ≤ Kp < 8 | Strong | ℓ hours |
| G4Hℓ | 8 ≤ Kp < 9 | Severe | ℓ hours |
| G5Hℓ | Kp ≥ 9 | Extreme | ℓ hours |
Purpose: keep model validation physically meaningful and prevent data leakage, which occurs when samples from the same geomagnetic storm event are scattered across the training and validation sets.
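As a rough illustration of how such a catalog could be built and used, the sketch below maps Kp values to the G-levels listed above, groups contiguous storm intervals into labeled events, and assigns whole events to a single split. The event definition (a contiguous run of 3-hour intervals with Kp ≥ 5, labeled by its peak level and total duration) and the round-robin split are assumptions made for this example; the paper's catalog may define events differently.

```python
def g_level(kp):
    """Map a Kp value to its G-level index using the Kp thresholds above (0 = quiet)."""
    if kp < 5:
        return 0
    return min(int(kp) - 4, 5)

def find_events(kp_series, step_hours=3):
    """Group contiguous storm intervals (Kp >= 5) into events labeled G<level>H<hours>.

    Assumption: the event level is the peak G-level reached during the run.
    """
    events, start = [], None
    # A trailing quiet value closes any run still open at the end of the series.
    for i, kp in enumerate(list(kp_series) + [0.0]):
        if kp >= 5 and start is None:
            start = i
        elif kp < 5 and start is not None:
            peak = max(kp_series[start:i])
            hours = (i - start) * step_hours
            events.append({"start": start, "stop": i,
                           "id": f"G{g_level(peak)}H{hours}"})
            start = None
    return events

def split_by_event(events, val_every=5):
    """Send every val_every-th event to validation; an event is never divided."""
    train, val = [], []
    for n, event in enumerate(events):
        (val if n % val_every == val_every - 1 else train).append(event)
    return train, val

# Toy 3-hourly Kp record containing two storms.
kp = [2, 3, 5.33, 6.67, 5, 4, 1, 7.33, 8, 2]
events = find_events(kp)
train, val = split_by_event(events, val_every=2)
print([e["id"] for e in events])
```

Splitting on event boundaries rather than on individual timestamps is what prevents leakage: neighboring samples from one storm are strongly correlated, so placing them on both sides of the split would inflate validation scores.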
This paper represents an important infrastructure contribution to the space weather forecasting field. Rather than proposing new algorithms, it addresses a more fundamental problem: providing standardized, high-quality datasets for machine learning research. This type of contribution is often undervalued in the AI community but is actually key to advancing the field.
The paper's greatest value lies in:
Significantly lowering entry barriers, enabling more ML researchers to participate in space weather research