Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models

Wolniewicz, Kelebek, Mestici et al.
Operational forecasting of the ionosphere remains a critical space weather challenge due to sparse observations, complex coupling across geospatial layers, and a growing need for timely, accurate predictions that support Global Navigation Satellite System (GNSS), communications, aviation safety, as well as satellite operations. As part of the 2025 NASA Heliolab, we present a curated, open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure, designed specifically to support next-generation forecasting models and address gaps in current operational frameworks. Our workflow integrates a large selection of data sources comprising Solar Dynamics Observatory data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL's Global Ionospheric Maps of Total Electron Content (GIM-TEC). We also implement geospatially sparse data such as the TEC derived from the World-Wide GNSS Receiver Network and crowdsourced Android smartphone measurements. This novel heterogeneous dataset is temporally and spatially aligned into a single, modular data structure that supports both physical and data-driven modeling. Leveraging this dataset, we train and benchmark several spatiotemporal machine learning architectures for forecasting vertical TEC under both quiet and geomagnetically active conditions. This work presents an extensive dataset and modeling pipeline that enables exploration of not only ionospheric dynamics but also broader Sun-Earth interactions, supporting both scientific inquiry and operational forecasting efforts.

Basic Information

  • Paper ID: 2511.15743
  • Title: Connecting the Dots: A Machine Learning Ready Dataset for Ionospheric Forecasting Models
  • Authors: Linnea M. Wolniewicz, Halil S. Kelebek, Simone Mestici, Michael D. Vergalla, Giacomo Acciarini, Bala Poduval, Olga Verkhoglyadova, Madhulika Guhathakurta, Thomas E. Berger, Atılım Güneş Baydin, Frank Soboczenski
  • Institutions: University of Hawai'i at Mānoa, University of Oxford, Università degli Studi di Roma Sapienza, Free Flight Research Lab, ESA, University of New Hampshire, NASA JPL, NASA Headquarters, University of Colorado Boulder, University of York & King's College London
  • Publication Venue: NeurIPS 2025 Workshop: Machine Learning for the Physical Sciences
  • Paper Link: https://arxiv.org/abs/2511.15743

Abstract

Operational forecasting of the ionosphere represents a critical challenge in space weather, with primary difficulties arising from sparse observational data, complex coupling across geospace layers, and growing demand for timely and accurate predictions supporting Global Navigation Satellite Systems (GNSS), communications, aviation safety, and satellite operations. As part of the 2025 NASA Heliolab project, this paper presents a carefully curated open-access dataset that integrates diverse ionospheric and heliospheric measurements into a coherent, machine learning-ready structure. The dataset synthesizes multiple data sources including Solar Dynamics Observatory (SDO) data, solar irradiance indices (F10.7), solar wind parameters (velocity and interplanetary magnetic field), geomagnetic activity indices (Kp, AE, SYM-H), and NASA JPL's Global Ionospheric Maps of Total Electron Content (GIM-TEC). The research team trained and benchmarked multiple spatiotemporal machine learning architectures for predicting vertical TEC under both quiet and geomagnetically active conditions, providing support for both scientific research and operational forecasting.

Research Background and Motivation

1. Core Problems to Address

Ionospheric forecasting faces three major challenges:

  • Data Sparsity: Observational data is unevenly distributed in time and space
  • Multi-scale Coupling: Complex interactions exist between solar activity, magnetosphere, and ionosphere-thermosphere systems
  • Urgent Operational Needs: Modern technological infrastructure (GNSS, satellite constellations, aviation networks, power grids) increasingly depends on accurate and timely space weather forecasts

2. Problem Significance

Space weather events (such as solar flares and coronal mass ejections) can cause:

  • Satellite operation disruptions (e.g., the February 2022 event in which 38 Starlink satellites reentered the atmosphere)
  • Degraded GNSS accuracy
  • Disrupted radio communications
  • Power grid failures

With the rapid expansion of LEO satellite constellations and deepening dependence on space infrastructure, accurate ionospheric forecasting has become critical.

3. Limitations of Existing Approaches

  • Data Heterogeneity: Existing data sources vary dramatically in resolution, format, and temporal frequency
  • Lack of Standardization: Data products are not designed for machine learning workflows
  • Heavy Preprocessing Burden: Substantial manual processing is required before models can be trained
  • Difficult Systematic Comparison: The absence of standardized datasets hinders systematic model comparison

4. Research Motivation

Construct a machine learning-ready standardized dataset that integrates heterogeneous multi-source observational data, unifies spatiotemporal scales, and provides a foundation for developing, testing, and benchmarking advanced ML architectures, ultimately enabling a digital twin of the ionosphere.

Core Contributions

  1. Constructed the first comprehensive ML-ready ionospheric dataset: Integrates 8 major data sources spanning 14 years (2010-2024) of multimodal observational data
  2. Achieved spatiotemporal alignment of heterogeneous data:
    • Handled temporal frequency differences across data sources (from 15 seconds to daily)
    • Unified missing value representation and handling strategies
    • Provided multiple temporal resolution options (up to 15 minutes)
  3. Provided a geomagnetic storm event catalog (MESTICI scale):
    • Based on Kp index and NOAA G-level standards
    • Considers event duration
    • Prevents data leakage between training/validation sets
  4. Open-sourced data and code:
    • Google Cloud public storage bucket
    • GitHub open-source processing code
    • PyTorch dataset interface
  5. Benchmarked multiple ML models (IonCast series):
    • LSTM baseline model
    • Spherical Fourier Neural Operator (SFNO)
    • GraphCast-inspired model
    • Achieved 12-hour lead time forecasts outperforming persistence baseline

Methodology Details

Task Definition

Objective: Predict spatiotemporal evolution of global ionospheric total electron content (TEC)

Inputs:

  • Solar-driven data (SDO EUV radiation embeddings, F10.7 and other solar flux indices)
  • Geomagnetic-driven data (Kp, AE, SYM-H and other geomagnetic indices)
  • Solar wind parameters (velocity, interplanetary magnetic field components)
  • Orbital mechanics features (solar zenith angle, lunar position, etc.)
  • Quasi-dipole coordinate system transformations
  • Historical TEC maps (sparse and dense)

Outputs:

  • TEC forecast maps on global 1°×1° grid
  • Maximum lead time: 12 hours
  • Temporal resolution: 15 minutes

Constraints:

  • Must handle both geomagnetically quiet and active conditions
  • Must address missing data and irregular sampling
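
The solar zenith angle listed among the orbital mechanics inputs follows from standard spherical astronomy: cos(SZA) = sin(lat)·sin(dec) + cos(lat)·cos(dec)·cos(h), where lat is the latitude, dec the solar declination, and h the local hour angle. The sketch below evaluates it for a single grid point. The function name and argument conventions are illustrative assumptions; the dataset's own pipeline may compute this feature differently.

```python
import math

def solar_zenith_angle_deg(lat_deg, declination_deg, hour_angle_deg):
    """Solar zenith angle (degrees) from latitude, solar declination, and hour angle.

    Hypothetical helper for illustration only, not the dataset's API.
    """
    lat = math.radians(lat_deg)
    dec = math.radians(declination_deg)
    ha = math.radians(hour_angle_deg)
    cos_sza = (math.sin(lat) * math.sin(dec)
               + math.cos(lat) * math.cos(dec) * math.cos(ha))
    # Clamp against floating-point drift before taking the arccosine.
    cos_sza = max(-1.0, min(1.0, cos_sza))
    return math.degrees(math.acos(cos_sza))

# Sun directly overhead at the equator at local noon on an equinox.
overhead = solar_zenith_angle_deg(0.0, 0.0, 0.0)
# Same point six hours later: the Sun sits on the horizon.
horizon = solar_zenith_angle_deg(0.0, 0.0, 90.0)
```

Evaluating this over the 1°×1° grid at each timestamp would yield one such feature map per time step.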

Dataset Architecture

Data Source Integration (see Table 1)

Data Source       | Key Features                              | Temporal Frequency | Time Range
OMNI2             | AU/AL/AE, SYM-H, IMF, solar wind speed    | 1 minute           | 2010-05-13 to 2024-08-01
NOAA/GFZ          | Ap, Kp indices                            | 3 hours            | 1997-01-01 to 2025-10-12
JPL-D             | Dense TEC maps (1°×1°)                    | 15 minutes         | 2010-05-13 to 2024-07-31
Madrigal          | Sparse TEC maps (GNSS receivers)          | 5 minutes          | 2010-01-01 to 2024-08-01
SDO-FM            | EUV radiation embeddings                  | 15 seconds         | 2010-05-13 to 2024-08-01
SET               | F10.7 and multi-wavelength fluxes         | Daily              | 1997-01-01 to 2025-10-12
Orbital Mechanics | Solar/lunar geometry parameters           | Variable           | Computed as needed
Quasi-dipole      | Magnetic field coordinate transformations | Annual             | 2010-2024

Data Alignment Strategy

  1. Temporal Reference: SDO-FM data range as baseline (2010-05-13 to 2024-08-01)
  2. Missing Value Handling:
    • Standardize all missing values to NaN
    • Process non-standard sentinel values in OMNI dataset
    • Remove feature columns with large-scale missing data
  3. Forward Fill Strategy:
    • Define a maximum rewind time
    • For most data streams: rewind time = native frequency
    • OMNI exception: rewind time = 50 minutes
    • Gaps exceeding the rewind time: skip the timestamp
  4. Resample to Unified Frequency: Use forward fill as a simple interpolation strategy
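
The alignment steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released processing code: the helper names `clean` and `forward_fill`, the sentinel value 9999.99, and the sample timestamps are all hypothetical. Only the 50-minute OMNI rewind limit and the 15-minute target cadence come from the summary.

```python
import math
from bisect import bisect_right
from datetime import datetime, timedelta

def clean(values, sentinels):
    """Replace sentinel fill values with NaN so every gap has one representation."""
    return [math.nan if v in sentinels else v for v in values]

def forward_fill(obs_times, obs_values, target_times, max_rewind):
    """For each target time, use the latest valid observation no older than max_rewind.

    obs_times must be sorted ascending. Targets with no usable observation get NaN.
    """
    valid = [(t, v) for t, v in zip(obs_times, obs_values) if not math.isnan(v)]
    times = [t for t, _ in valid]
    filled = []
    for target in target_times:
        i = bisect_right(times, target) - 1  # latest valid observation at or before target
        if i >= 0 and target - times[i] <= max_rewind:
            filled.append(valid[i][1])
        else:
            filled.append(math.nan)  # gap exceeds the rewind time
    return filled

# Illustrative observations with one sentinel-coded gap (values are made up).
t0 = datetime(2015, 3, 17)
obs_times = [t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=20)]
obs_values = clean([1.0, 9999.99, 2.0], sentinels={9999.99})

# Resample to a 15-minute grid with the 50-minute OMNI rewind limit.
grid = [t0 + timedelta(minutes=15 * k) for k in range(7)]
filled = forward_fill(obs_times, obs_values, grid, max_rewind=timedelta(minutes=50))

# Timestamps whose gap exceeded the rewind time are skipped.
kept = [(t, v) for t, v in zip(grid, filled) if not math.isnan(v)]
```

Here the last observation at 00:20 is carried forward through 01:00, while the 01:15 and 01:30 grid points fall more than 50 minutes after it and are dropped.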

Geomagnetic Storm Event Classification (MESTICI Scale)

Based on NOAA G-level standards combined with event duration:

Event ID | Kp Range   | NOAA Level | Duration
G0Hℓ     | Kp < 5     | Quiet      | ℓ hours
G1Hℓ     | 5 ≤ Kp < 6 | Minor      | ℓ hours
G2Hℓ     | 6 ≤ Kp < 7 | Moderate   | ℓ hours
G3Hℓ     | 7 ≤ Kp < 8 | Strong     | ℓ hours
G4Hℓ     | 8 ≤ Kp < 9 | Severe     | ℓ hours
G5Hℓ     | Kp ≥ 9     | Extreme    | ℓ hours

Purpose: Ensure physical reasonableness of model validation and prevent data leakage where the same geomagnetic storm event is scattered across training and validation sets.
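
As a rough illustration of how such labels could be derived, the sketch below maps Kp values to G-levels using the thresholds in the table and groups consecutive samples of the same level into one event. The grouping rule is an assumption made for this example (the summary does not spell out how the paper delimits events), and `g_level` and `label_events` are hypothetical names. The 3-hour cadence is the Kp frequency listed in Table 1.

```python
from itertools import groupby

KP_CADENCE_HOURS = 3  # Kp is reported every 3 hours

def g_level(kp):
    """Map a Kp value to a G-level: Kp < 5 is G0, 5 <= Kp < 6 is G1, ..., Kp >= 9 is G5."""
    if kp < 5:
        return 0
    return min(int(kp) - 4, 5)

def label_events(kp_series):
    """Group consecutive samples with the same G-level into events.

    Returns (event_id, start, stop) tuples with half-open sample index ranges.
    Assumes one event is one contiguous run at a single level.
    """
    events = []
    start = 0
    for level, run in groupby(kp_series, key=g_level):
        n = len(list(run))
        hours = n * KP_CADENCE_HOURS
        events.append((f"G{level}H{hours}", start, start + n))
        start += n
    return events

# Made-up Kp sequence: quiet, a minor storm, a strong spike, quiet again.
kp = [2.0, 3.33, 5.0, 5.67, 7.0, 4.0]
events = label_events(kp)
```

For this sequence the events come out as G0H6, G1H6, G3H3, and G0H3, each carrying the sample range it covers.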

Technical Innovations

  1. Multimodal Data Fusion:
    • First alignment of dense and sparse TEC maps with solar and geomagnetic drivers
    • Integrates multi-level data from satellite observations to crowdsourced smartphone measurements
  2. Unified Temporal Scales:
    • Handles temporal frequencies spanning nearly four orders of magnitude (15 seconds to daily, a factor of 5,760)
    • Flexible resampling mechanism with user-customizable target frequencies
  3. Physics-Informed Integration:
    • Includes orbital mechanics features (solar zenith angle, etc.)
    • Provides quasi-dipole coordinate transformations for better magnetic field geometry representation
  4. Event-Aware Data Partitioning:
    • Avoids data leakage from traditional random partitioning
    • Maintains integrity of geomagnetic storm events
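
Event-aware partitioning can be sketched as assigning whole events, rather than individual time steps, to the training or validation set. The example below is a minimal illustration: `split_by_event`, the validation fraction, and the random assignment of events are assumptions for this sketch, and the released code may partition differently (for instance chronologically).

```python
import random

def split_by_event(events, val_fraction=0.2, seed=0):
    """Assign whole events to train or validation so no event straddles the split.

    events: list of (event_id, start, stop) with half-open sample index ranges.
    """
    order = list(range(len(events)))
    random.Random(seed).shuffle(order)  # reproducible shuffle of event indices
    n_val = max(1, round(val_fraction * len(events)))
    val_events = set(order[:n_val])

    train_idx, val_idx = [], []
    for k, (_, start, stop) in enumerate(events):
        target = val_idx if k in val_events else train_idx
        target.extend(range(start, stop))
    return train_idx, val_idx

# Hypothetical event catalog covering nine time steps.
events = [("G0H6", 0, 2), ("G1H6", 2, 4), ("G3H3", 4, 5),
          ("G0H3", 5, 6), ("G2H9", 6, 9)]
train_idx, val_idx = split_by_event(events, val_fraction=0.2, seed=0)
```

Because every sample of an event lands on the same side of the split, a storm seen during training cannot reappear, a few time steps later, in validation.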

Experimental Setup

Dataset Scale

  • Time Span: 2010-05-13 to 2024-08-01 (approximately 14 years)
  • Spatial Resolution: 1°×1° global grid (180×360 = 64,800 grid points)
  • Temporal Resolution: 15 minutes (for training)
  • Total Samples: Approximately 500,000 time steps (based on 15-minute frequency)
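
The scale figures above can be checked with a few lines of date arithmetic. This is only a sanity check of the numbers quoted in this summary; the variable names are illustrative.

```python
from datetime import date

start, end = date(2010, 5, 13), date(2024, 8, 1)

steps_per_day = 24 * 60 // 15   # 15-minute cadence
n_days = (end - start).days
n_steps = n_days * steps_per_day

n_grid = 180 * 360              # 1°×1° global grid
lead_steps = 12 * 60 // 15      # 12-hour lead time in 15-minute steps
```

The span is 5,194 days, giving 498,624 time steps at 15-minute cadence before any timestamps are skipped for gaps, which is consistent with the "approximately 500,000" figure. A 12-hour lead time corresponds to 48 forecast steps.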

Data Preprocessing

  1. Normalization: Specific normalization schemes for each data stream
  2. Missing Value Handling: Forward fill (configurable maximum rewind time)
  3. Event Classification: MESTICI labels based on Kp index
  4. Data Partitioning: Split by event boundaries to prevent leakage

IonCast Model Architectures

The paper trained three model architectures (detailed results in reference [21]):

  1. LSTM Baseline:
    • Classical time series model
    • Handles temporal dependencies
  2. Spherical Fourier Neural Operator (SFNO):
    • Neural operator based on spherical geometry
    • Suitable for global-scale physical field modeling
    • Inspired by FourCastNet
  3. GraphCast-Inspired Model:
    • Graph neural network architecture
    • References DeepMind's weather forecasting model
    • Handles irregular grids and multi-scale interactions

Evaluation Metrics

The paper states that the models "outperform persistence baseline" but does not detail the specific metrics used. Common TEC prediction metrics include:

  • RMSE (Root Mean Square Error)
  • MAE (Mean Absolute Error)
  • Correlation coefficient
  • Skill Score
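
Since the paper's exact metrics are not given, the sketch below shows one common way to compare a model against a persistence baseline: the persistence forecast for time t + h is simply the observation at time t, and an MSE-based skill score of 1 - MSE_model / MSE_persistence is positive when the model beats persistence. The function names, the skill-score definition, and the toy numbers are assumptions for illustration, not the paper's evaluation protocol.

```python
import math

def rmse(pred, truth):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def persistence_pairs(series, lead_steps):
    """Pair each observation with the value lead_steps later.

    The persistence forecast for t + lead_steps is the observation at t.
    """
    return series[:-lead_steps], series[lead_steps:]

def skill_score(model_rmse, baseline_rmse):
    """MSE-based skill: 1 is perfect, 0 matches the baseline, negative is worse."""
    return 1.0 - (model_rmse / baseline_rmse) ** 2

# Toy TEC series at a single grid point (made-up values, in TECU).
tec = [10.0, 12.0, 14.0, 16.0, 18.0]
lead = 2  # a 12-hour lead at 15-minute cadence would be 48 steps

persist_pred, truth = persistence_pairs(tec, lead)
model_pred = [13.0, 16.0, 19.0]  # hypothetical model output for the same targets

baseline_rmse = rmse(persist_pred, truth)
model_rmse = rmse(model_pred, truth)
skill = skill_score(model_rmse, baseline_rmse)
```

In this toy case persistence is off by 4 TECU at every step, so the hypothetical model scores a skill close to 1. For global maps the same computation would be averaged over grid points, typically with latitude weighting.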

Experimental Results

Main Results

The paper focuses primarily on dataset construction with relatively brief model performance descriptions:

  1. IonCast Model Performance:
    • Outperforms persistence forecast baseline
    • Produces accurate 12-hour lead time predictions
    • Effective under both geomagnetically quiet and active conditions
  2. Model Comparison:
    • Trained three architectures: LSTM, SFNO, and a GraphCast-inspired model
    • Detailed benchmarking results published in companion paper [21]

Dataset Validation

Demonstrated through Figure 2 (MESTICI scale visualization):

  • Temporal distribution of geomagnetic events from 2010-2024
  • Event frequency across different intensity levels (G0-G5)
  • Distribution characteristics of event durations

Observations:

  • G0 (quiet) conditions dominate
  • G1-G2 (minor to moderate) events are common
  • G4-G5 (severe to extreme) events are rare but critical

Case Studies

The paper does not provide specific TEC forecast case examples but demonstrates data alignment visualization through Figure 1:

  • Shows temporal alignment of multiple data streams
  • Displays spatial distribution of sparse and dense TEC maps
  • Illustrates integration of orbital mechanics and quasi-dipole features

Experimental Findings

  1. Data Heterogeneity Challenges:
    • OMNI dataset contains multi-year large-scale gaps
    • Inconsistent missing value encoding across data sources
    • Requires careful fill strategy design balancing completeness and timeliness
  2. Importance of Event-Aware Partitioning:
    • Traditional random partitioning causes data leakage from the same storm event
    • Physics-based event boundary partitioning is more appropriate
  3. Potential of Multimodal Fusion:
    • Integrating solar, geomagnetic, and ionospheric data captures Sun-Earth interactions
    • Provides unified platform for physics-driven and data-driven modeling

Related Work

Ionospheric Modeling Field

  1. Traditional Physics Models:
    • Numerical simulations based on physical equations
    • High computational cost, difficult for real-time operation
  2. Empirical Models:
    • Such as International Reference Ionosphere (IRI)
    • Depend on statistical relationships, limited predictive power for extreme events
  3. Data Assimilation Methods:
    • Combine observations and physical models
    • Require complex algorithms and computational resources

Machine Learning Applications in Space Weather

  1. Solar Activity Prediction:
    • SDO Foundation Model [16]: Uses deep learning for solar observations
    • This paper integrates SDO-FM embeddings as input features
  2. Geomagnetic Index Prediction:
    • Uses LSTM and other time series models to predict Dst, Kp indices
    • This paper uses these indices as drivers rather than prediction targets
  3. TEC Prediction:
    • Existing work mostly uses single data sources
    • Lacks standardized datasets and benchmarks

ML Breakthroughs in Weather Forecasting

  1. GraphCast [25]: DeepMind's global weather forecasting model
  2. FourCastNet [24]: Fourier neural operator-based probabilistic weather forecasting
  3. This Paper's Inspiration: Transfers successful weather forecasting experience to ionospheric forecasting

Unique Contributions of This Work

  • First comprehensive ML-ready ionospheric dataset: Integrates broadest range of data sources
  • Open Access: Data and code fully public
  • Event-Aware Design: Considers physical characteristics of space weather
  • Modular Structure: Supports multiple modeling paradigms

Conclusions and Discussion

Main Conclusions

  1. Successfully constructed the first comprehensive ML-ready ionospheric dataset:
    • Integrates 8 major data sources
    • Aligned spatiotemporal data into unified structure
    • Covers 14 years of observational data
  2. Provided complete open-source ecosystem:
    • Google Cloud public data storage
    • GitHub open-source processing code
    • PyTorch data loading interface
  3. Validated dataset effectiveness:
    • IonCast models outperform persistence baseline
    • Support 12-hour lead time forecasts
    • Perform well under multiple geomagnetic conditions
  4. Provided standardized benchmark for community:
    • Unified data format
    • Consistent evaluation protocols
    • Reproducible experimental setup

Limitations

  1. Limited Temporal Coverage:
    • Constrained by SDO data, covers only 2010-2024
    • Lacks data from before Solar Cycle 24
    • Does not completely cover Solar Cycle 25
  2. Simplified Missing Value Handling:
    • Uses simple forward fill
    • May not suit all application scenarios
    • Does not explore more complex interpolation methods (e.g., physics-constrained interpolation)
  3. Fixed Spatial Resolution:
    • 1°×1° grid may be insufficient for capturing small-scale structures
    • Does not provide multi-resolution options
  4. Insufficient Model Performance Details:
    • Paper focuses on dataset construction
    • Model benchmarking results relatively brief
    • Detailed evaluation requires reference to companion paper [21]
  5. Computational Resource Requirements:
    • Large dataset size (Google Cloud storage)
    • Training global models requires significant computational resources
    • May limit access for some researchers

Future Directions

  1. Dataset Expansion:
    • Integrate additional data sources (ICON satellite, Swarm constellation)
    • Extend temporal coverage
    • Increase spatial resolution
  2. Advanced Preprocessing Methods:
    • Physics-constrained data interpolation
    • Smarter missing value imputation
    • Automated data quality control
  3. Model Improvements:
    • Develop Physics-Informed Neural Networks (PINNs)
    • Explore Transformer architectures
    • Quantify uncertainty
  4. Operational Deployment:
    • Real-time data stream integration
    • Low-latency prediction systems
    • Integration with existing operational systems
  5. Digital Twin Vision:
    • Build complete ionospheric digital twin
    • Support what-if scenario analysis
    • Multi-physics coupled modeling

In-Depth Evaluation

Strengths

  1. Fills Important Gap:
    • Addresses long-standing ML community need for standardized ionospheric dataset
    • Lowers entry barriers to the field
    • Enables systematic model comparison
  2. Comprehensive Data Integration:
    • 8 major data sources span complete Sun-to-ionosphere chain
    • Includes both dense and sparse observations supporting diverse modeling needs
    • 14-year span covers multiple solar activity phases
  3. Excellent Technical Implementation:
    • Carefully handles heterogeneous data alignment
    • Event-aware partitioning prevents leakage
    • Provides flexible configuration options
  4. Openness and Reproducibility:
    • Data fully public (Google Cloud)
    • Code open-source (GitHub)
    • Detailed documentation for easy use
  5. Cross-Disciplinary Value:
    • Supports both physics-based and data-driven modeling
    • Promotes cross-fertilization between space physics and machine learning
    • Facilitates both scientific discovery and operational applications
  6. Timeliness:
    • Aligns with new NASA and ESA missions (TRACERS, Vigil)
    • Responds to urgent space weather forecasting needs
    • Synchronizes with latest ML weather forecasting advances

Weaknesses

  1. Insufficient Model Evaluation:
    • Paper focuses on dataset, model section relatively brief
    • Lacks detailed performance numbers and comparison tables
    • Missing error analysis and failure cases
  2. Conservative Missing Value Handling:
    • Simple forward fill method
    • Does not explore more advanced interpolation techniques
    • Handling of large-scale OMNI gaps may be overly aggressive (direct column deletion)
  3. Limited Physics Validation:
    • Insufficient discussion of prediction result physical reasonableness
    • Lacks comparison with physics models
    • Does not analyze whether models learn physical laws
  4. Insufficient Extreme Event Coverage:
    • G4-G5 level events are rare
    • May result in poor model predictive power for extreme events
    • Does not discuss class imbalance problem
  5. Unquantified Computational Costs:
    • Does not report data processing and model training time
    • Does not discuss real-time forecasting feasibility
    • Lacks resource requirement guidance
  6. Insufficient Regional Characteristics Consideration:
    • Global 1°×1° grid may mask regional differences
    • Does not discuss prediction difficulty variations across latitudes
    • Lacks analysis of special regions (polar, equatorial, etc.)

Impact

  1. Contribution to Field:
    • High Impact: Solves critical community pain point
    • Expected to become standard dataset for ionospheric ML research
    • Promotes paradigm shift in space weather forecasting
  2. Practical Value:
    • Direct Applications: Supports GNSS, communications, aviation, and other industries
    • Policy Impact: Provides tools for NASA, ESA, and other agency decisions
    • Safety Value: Enhances early warning capability for space weather hazards
  3. Reproducibility:
    • Excellent: Data and code fully public
    • Clear documentation enables easy community use
    • Provides solid foundation for follow-up research
  4. Academic Impact:
    • Expected to be widely cited
    • Likely to spawn series of follow-up studies
    • Promotes cross-fusion of physical sciences and AI

Applicable Scenarios

  1. Scientific Research:
    • Explore ionospheric dynamics mechanisms
    • Study Sun-Earth interactions
    • Validate physics models
  2. Operational Forecasting:
    • GNSS accuracy correction
    • Satellite operation decision support
    • Aviation route planning
  3. Education and Training:
    • Space weather course teaching data
    • ML application examples in physical sciences
    • Student projects and competitions
  4. Model Development:
    • Benchmark testing for new architectures
    • Pre-training data for transfer learning
    • Base models for ensemble learning
  5. Inapplicable Scenarios:
    • Applications requiring ultra-high spatial resolution (<1°)
    • Systems requiring real-time (second-level) response
    • Historical research before 2010

Selected References

  1. Berger et al. (2020): Impact of space weather uncertainty on aviation
  2. Kataoka et al. (2022): Analysis of February 2022 Starlink satellite reentry event
  3. Walsh et al. (2024): SDO Foundation Model - foundation model for solar observations
  4. Lam et al. (2023): GraphCast - DeepMind's weather forecasting breakthrough
  5. Bonev et al. (2025): FourCastNet 3 - geometric approach to probabilistic weather forecasting
  6. Kelebek et al. (2025): IonCast - detailed modeling study based on this dataset

Summary

This paper represents an important infrastructure contribution to the space weather forecasting field. Rather than proposing new algorithms, it addresses a more fundamental problem: providing standardized, high-quality datasets for machine learning research. This type of contribution is often undervalued in the AI community but is actually key to advancing the field.

The paper's greatest value lies in:

  1. Significantly lowering entry barriers, enabling more ML researchers to participate in space weather research
  2. Providing unified benchmarks enabling systematic method comparison
  3. Integrating data spanning multiple orders of magnitude in spatiotemporal scales, demonstrating best practices in data engineering

Recommendations for future users:

  • Carefully study data processing code to understand design choices
  • Adjust missing value handling strategies according to specific applications
  • Perform feature engineering informed by physics knowledge
  • Pay attention to class imbalance in extreme events
  • Validate predictions' physical reasonableness through comparison with physics models

This work lays the foundation for an "ImageNet moment" in ionospheric forecasting and is expected to catalyze a series of innovative research efforts.