What Do Temporal Graph Learning Models Learn?

Hayes, Schumacher, Strohmaier

Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.

Basic Information

Paper ID: 2510.09416
Title: What Do Temporal Graph Learning Models Learn?
Authors: Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
Classification: cs.LG cs.SI
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09416

Abstract

Temporal graph learning has become a central topic in graph representation learning, with numerous benchmarks demonstrating strong performance of state-of-the-art models. However, recent research has raised concerns about the reliability of benchmark results, identifying problems in commonly used evaluation protocols and the surprising competitiveness of simple heuristic methods. This contrast raises a question: which properties of the underlying graph do temporal graph learning models actually utilize to form predictions? This paper addresses this question by systematically evaluating the ability of seven models to capture eight fundamental properties related to temporal graph link structure. These properties include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using synthetic and real-world datasets, the paper analyzes how well models learn these properties. The findings present a mixed picture: models capture certain properties well but fail to reproduce others, exposing important limitations.

Research Background and Motivation

Problem Background

Reliability Issues in Benchmark Evaluation: Despite temporal graph learning models demonstrating excellent performance on various benchmarks, recent research has discovered flaws in evaluation protocols, including issues with test sets and evaluation metrics that lead to unrealistic results.
Competitiveness of Simple Heuristics: Surprisingly, simple heuristic methods that predict edges involving recently active and globally popular nodes achieve performance comparable to many state-of-the-art models.
Lack of Model Interpretability: Even when specific models perform well on given benchmark datasets, it remains unclear which factors contribute to this performance, and more specifically, which graph properties models utilize to form predictions.

Research Motivation

This study takes a step back and assesses whether popular temporal graph learning models can learn simple, interpretable properties of temporal graphs. The goal is to provide practical insights for real-world applications of these models and to promote more interpretability-focused evaluation.

Core Contributions

Proposes a Novel Evaluation Framework: Systematically evaluates the ability of temporal graph learning models to capture intuitive temporal network properties
Identifies Limitations of Existing Models: Discovers that models have limitations in distinguishing edge directionality, detecting periodic patterns, or emphasizing recently observed graph dynamics
Provides Practical Guidance: Offers insights for real-world applications of deep graph learning models
Establishes Interpretability Benchmarks: Provides benchmarks for more interpretability-focused evaluation of temporal graph learning models, complementing existing performance-oriented benchmarks

Methodology Details

Task Definition

This paper evaluates the ability of seven state-of-the-art temporal graph learning models to learn eight fundamental graph properties:

General Graph Features: Temporal granularity, edge directionality, density
Temporal Patterns: Persistence, periodicity, recency
Edge Formation Mechanisms: Homophily, preferential attachment

Evaluation Framework

Model Selection

The paper evaluates seven representative models:

DyGFormer: Transformer-based dynamic graph model
GraphMixer: Temporal network model with simplified architecture
DyRep: Recurrent neural network-based representation learning
JODIE: Joint dynamic user and item embedding
TGN: Temporal graph network
TCL: Transformer-based dynamic graph modeling with contrastive learning
TGAT: Inductive temporal graph representation learning

Dataset Design

Real-world Datasets: Enron email network, UCI message network, Wikipedia editing network
Synthetic Datasets: Artificial graphs designed for specific properties, such as stochastic block models (SBM) for homophily testing and Barabási-Albert models for preferential attachment testing
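
As a rough illustration of what such synthetic graphs can look like, the sketch below samples a two-group stochastic block model at every time step and grows a Barabási-Albert style graph. It is not the paper's generator: the function names, the alternating group assignment, the use of the arrival index as timestamp, and all parameter values are assumptions chosen for the example.

```python
import random

def temporal_sbm_edges(n_nodes, n_steps, p_in, p_out, seed=0):
    """Sample directed edges (u, v, t) from a two-group SBM at every time step."""
    rng = random.Random(seed)
    group = [i % 2 for i in range(n_nodes)]  # assumption: alternate group labels
    edges = []
    for t in range(n_steps):
        for u in range(n_nodes):
            for v in range(n_nodes):
                if u == v:
                    continue
                # Homophily: same-group pairs connect with higher probability.
                p = p_in if group[u] == group[v] else p_out
                if rng.random() < p:
                    edges.append((u, v, t))
    return edges, group

def barabasi_albert_edges(n_nodes, m, seed=0):
    """Each new node links to m distinct existing nodes, chosen proportionally to degree."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new node connects to the m seed nodes
    repeated = []             # every node appears once per incident edge
    for new in range(m, n_nodes):
        for t in targets:
            edges.append((new, t, new))  # assumption: timestamp = arrival index
        repeated.extend(targets)
        repeated.extend([new] * m)
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))  # degree-proportional sampling
        targets = list(chosen)
    return edges

sbm_edges, group = temporal_sbm_edges(n_nodes=20, n_steps=5, p_in=0.3, p_out=0.02)
intra = sum(group[u] == group[v] for u, v, _ in sbm_edges) / len(sbm_edges)
ba_edges = barabasi_albert_edges(n_nodes=200, m=2)
print(f"SBM intra-group share: {intra:.2f}; BA edges: {len(ba_edges)}")
```

With p_in much larger than p_out, most sampled edges fall inside a group, which is the homophily signal a model would need to pick up.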

Evaluation Methods

Specialized experiments designed for each property:

Combination of synthetic and real-world datasets
Variable control to isolate the effects of specific properties
Performance evaluation through probability scores, accuracy, and other metrics

Technical Innovations

Systematic Evaluation Methodology: First systematic evaluation of temporal graph models' learning ability for fundamental graph properties
Multi-dimensional Property Analysis: Covers properties across three dimensions: structural, temporal, and mechanistic
Synthetic Data Validation: Validates models' learning ability for specific properties through carefully designed synthetic datasets
Interpretability-Oriented Approach: Evaluates models from an interpretability perspective rather than pure performance

Experimental Setup

Dataset Details

Dataset	Nodes	Continuous Edges	Discrete Edges	Unique Edges	Discrete Time Steps
Enron	184	125,235	10,472	3,125	45 (months)
UCI	1,899	59,835	26,628	20,296	29 (weeks)
Wikipedia	9,277	157,474	65,085	18,257	745 (hours)
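
The relationship between the edge columns can be illustrated with a small discretization sketch. It assumes that "Discrete Edges" counts distinct (source, target, time step) triples after binning timestamps and that "Unique Edges" counts distinct (source, target) pairs; the summary does not spell this out, and the helper name and bin width below are hypothetical.

```python
def discretize(edges, bin_width):
    """Map continuous timestamps to integer steps and drop repeats within a step."""
    t0 = min(t for _, _, t in edges)
    seen = set()
    out = []
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        step = int((t - t0) // bin_width)
        key = (u, v, step)
        if key not in seen:  # keep one copy of each edge per time step
            seen.add(key)
            out.append(key)
    return out

# Toy continuous-time edge list (timestamps in seconds), binned into hours.
continuous = [(0, 1, 0.0), (0, 1, 10.0), (1, 2, 50.0), (0, 1, 3700.0), (1, 2, 3650.0)]
discrete = discretize(continuous, bin_width=3600.0)
unique_edges = {(u, v) for u, v, _ in discrete}
n_steps = len({s for _, _, s in discrete})
print(len(continuous), len(discrete), len(unique_edges), n_steps)
```

In this toy case, five continuous-time events collapse to four discrete edges over two unique node pairs and two time steps.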

Evaluation Metrics

ROC-AUC: For link prediction performance evaluation
Balanced accuracy: For classification tasks
Probability score distribution: For analyzing model prediction behavior
Edge grouping statistics: For quantitative analysis of specific properties
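
For reference, the first two metrics can be computed from labels and probability scores as in the following dependency-free sketch. The 0.5 decision threshold and the toy labels and scores are assumptions for illustration, not values from the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
auc = roc_auc(labels, scores)
bal = balanced_accuracy(labels, scores)
print(f"ROC-AUC = {auc:.3f}, balanced accuracy = {bal:.3f}")
```

ROC-AUC depends only on the ranking of scores, while balanced accuracy depends on the threshold, so a model can rank well and still classify poorly when its scores are miscalibrated.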

Implementation Details

Learning rate: 1e-4
Batch size: 200
Loss function: BCELoss
Optimizer: Adam
Maximum training epochs: 300
Early stopping tolerance: 1e-6
Temporal feature dimension: 100
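
The listed settings could be collected in a configuration object like the one sketched below. How the early stopping tolerance is applied is an assumption here (an epoch counts as an improvement only if the validation metric rises by more than the tolerance), and the patience value is hypothetical since the summary does not report one.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    learning_rate: float = 1e-4
    batch_size: int = 200
    max_epochs: int = 300
    tolerance: float = 1e-6
    time_feat_dim: int = 100
    patience: int = 20  # hypothetical; not reported in this summary

def epochs_run(val_metrics, cfg):
    """Return how many epochs run before early stopping (or the epoch cap) ends training."""
    best = float("-inf")
    stale = 0
    for epoch, metric in enumerate(val_metrics[:cfg.max_epochs], start=1):
        if metric > best + cfg.tolerance:
            best, stale = metric, 0   # meaningful improvement: reset the counter
        else:
            stale += 1
            if stale >= cfg.patience:
                return epoch          # no improvement for `patience` epochs
    return min(len(val_metrics), cfg.max_epochs)

cfg = TrainConfig(patience=2)
stopped_at = epochs_run([0.5, 0.6, 0.6, 0.6, 0.7], cfg)
print(stopped_at)
```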

Experimental Results

Summary of Main Findings

Graph Property	DyGFormer	DyRep	JODIE	GraphMixer	TCL	TGAT	TGN
Temporal Granularity	∼	✓	✓	✓	∼	∼	✓
Directionality	✗	✗	✗	✗	✗	✗	✗
Density	✗	✗	✗	✗	✗	✗	✗
Persistence	✓	✗	✗	∼	∼	✓	✗
Periodicity	✗	✗	✗	✓	✓	∼	∼
Recency	✗	✗	✗	✗	✗	✗	✗
Homophily	✓	∼	✗	∼	✓	∼	∼
Preferential Attachment	✓	✓	✓	✓	✓	✓	✓

Detailed Results Analysis

1. Temporal Granularity

Flattening timestamps severely impairs performance, indicating that models do utilize temporal information
GraphMixer and DyRep show the largest performance drops when timestamps are discretized
TGAT performs better on discrete time steps

2. Edge Directionality

Key Finding: All models fail to effectively distinguish edge directionality
For approximately 50% of edges, the prediction probability difference between forward and reverse edges is less than 0.02
Even with bidirectional training, most models still produce approximately symmetric predictions
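
A symmetry check of this kind can be sketched as follows: for every node pair scored in both directions, compare the two probabilities and count how often they differ by less than 0.02. The function name and the toy scores are assumptions; in practice the scores would come from a trained model.

```python
def symmetric_fraction(scores, eps=0.02):
    """Share of node pairs whose forward and reverse scores differ by less than eps."""
    diffs = [
        abs(p - scores[(v, u)])
        for (u, v), p in scores.items()
        if u < v and (v, u) in scores  # visit each unordered pair once
    ]
    return sum(d < eps for d in diffs) / len(diffs)

# Hypothetical model scores for directed candidate edges.
scores = {
    (0, 1): 0.91, (1, 0): 0.90,
    (0, 2): 0.80, (2, 0): 0.30,
    (1, 2): 0.55, (2, 1): 0.54,
    (3, 4): 0.50, (4, 3): 0.10,
}
frac = symmetric_fraction(scores)
print(f"{frac:.0%} of pairs are scored almost symmetrically")
```

A model that encodes direction should give clearly different scores to (u, v) and (v, u) whenever only one direction was observed, so a large share of near-identical pairs points to direction-agnostic predictions.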

3. Density

Important Limitation: All models fail to learn graph density
Predicted density is typically several orders of magnitude lower than true density
Models tend to predict all edges as negative when observing large numbers of negative samples
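
One way to make the density comparison concrete is to score every ordered node pair, threshold the scores, and compare the resulting edge share with the true one. This is a sketch under assumptions: the 0.5 threshold, the directed-graph density formula without self-loops, and the stand-in `score_fn` are illustrative and not taken from the paper.

```python
def density(n_nodes, edges):
    """Directed density: distinct edges divided by the number of ordered node pairs."""
    return len(set(edges)) / (n_nodes * (n_nodes - 1))

def predicted_density(n_nodes, score_fn, threshold=0.5):
    """Density of the graph obtained by thresholding scores of all ordered pairs."""
    predicted = [
        (u, v)
        for u in range(n_nodes)
        for v in range(n_nodes)
        if u != v and score_fn(u, v) >= threshold
    ]
    return density(n_nodes, predicted)

true_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]

def score_fn(u, v):
    # Hypothetical stand-in for a model that is confident about a single edge only.
    return 0.9 if (u, v) == (0, 1) else 0.1

true_d = density(5, true_edges)
pred_d = predicted_density(5, score_fn)
print(f"true density = {true_d:.2f}, predicted density = {pred_d:.2f}")
```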

4. Persistence

DyGFormer and TGAT can learn persistent graphs
JODIE and TGN perform poorly on this simple task

5. Periodicity

GraphMixer and TCL can effectively distinguish between odd and even time steps
DyGFormer cannot distinguish time steps, behaving similarly to the EdgeBank baseline
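
The following sketch shows why a pure memorization baseline cannot solve this task. It builds a graph that alternates between two edge sets on even and odd steps and applies an unlimited-memory EdgeBank-style predictor, which scores any previously seen edge as positive. The class is a simplified stand-in written for this example, and the edge sets are hypothetical.

```python
def periodic_graph(edges_even, edges_odd, n_steps):
    """Edge list (u, v, t) that uses one edge set on even steps and another on odd steps."""
    return [
        (u, v, t)
        for t in range(n_steps)
        for (u, v) in (edges_even if t % 2 == 0 else edges_odd)
    ]

class EdgeBank:
    """Memorization baseline: predicts 1.0 for any edge seen so far, else 0.0."""

    def __init__(self):
        self.memory = set()

    def update(self, u, v):
        self.memory.add((u, v))

    def predict(self, u, v):
        return 1.0 if (u, v) in self.memory else 0.0

even = [(0, 1), (2, 3)]
odd = [(1, 0), (3, 2)]
bank = EdgeBank()
for u, v, t in periodic_graph(even, odd, n_steps=6):
    bank.update(u, v)

# The next step (t = 6) is even, so only the `even` edges are true positives there.
scores_true = [bank.predict(u, v) for u, v in even]
scores_false = [bank.predict(u, v) for u, v in odd]
print(scores_true, scores_false)
```

Because both edge sets are in memory, the baseline assigns the same score to edges that will and will not occur at the next step. A model that behaves like this has not learned the parity of the time step.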

6. Recency

Surprising Result: All models fail to emphasize recently observed edges
The average probability score of edges does not vary with their last observation time
This contrasts with the success of heuristic methods based on recently active nodes
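
A recency profile of this kind can be computed by grouping candidate edges by the time elapsed since they were last observed and averaging the predicted scores per group. The helper name and the toy values below are assumptions for illustration.

```python
from collections import defaultdict

def mean_score_by_age(last_seen, scores, t_now):
    """Average predicted score per elapsed time since the edge was last observed."""
    buckets = defaultdict(list)
    for edge, p in scores.items():
        buckets[t_now - last_seen[edge]].append(p)
    return {age: sum(ps) / len(ps) for age, ps in sorted(buckets.items())}

# Hypothetical last observation times and model scores at t_now = 10.
last_seen = {(0, 1): 9, (1, 2): 9, (2, 3): 5, (3, 4): 1}
scores = {(0, 1): 0.5, (1, 2): 0.7, (2, 3): 0.6, (3, 4): 0.6}
profile = mean_score_by_age(last_seen, scores, t_now=10)
print(profile)
```

A recency-aware model would show scores that decrease with age; a flat profile like the one in this toy example corresponds to the behaviour described above.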

7. Homophily

DyGFormer and TCL can predict intra-group links in a balanced manner
JODIE is extremely biased toward group 0
Most models tend to predict links within group 1 more frequently
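
Group bias in predicted links can be quantified by counting predicted intra-group edges per group, as in this minimal sketch. The group labels, the predicted edge list, and the min/max balance ratio are assumptions made for the example and are not the paper's exact statistic.

```python
from collections import Counter

def intra_group_counts(predicted_edges, group):
    """Count predicted edges whose endpoints share a group, keyed by that group."""
    counts = Counter()
    for u, v in predicted_edges:
        if group[u] == group[v]:
            counts[group[u]] += 1
    return counts

group = {0: 0, 1: 0, 2: 1, 3: 1}
predicted = [(0, 1), (2, 3), (3, 2), (0, 2)]  # hypothetical predicted positives
counts = intra_group_counts(predicted, group)
balance = min(counts.values()) / max(counts.values())
print(dict(counts), f"balance = {balance:.2f}")
```

A balance ratio close to 1 indicates that intra-group links are predicted evenly across groups, while a ratio near 0 indicates that one group dominates.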

8. Preferential Attachment

Consistent Success: All models learn preferential attachment
Edges involving high-degree nodes receive higher average probability scores
Predictions follow power-law degree distribution patterns
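
The degree effect can be checked by comparing the mean score of candidate edges that point to high-degree nodes with the mean score of the rest. The sketch below splits at the median training degree; the split rule, the helper name, and the toy data are assumptions.

```python
from collections import Counter

def mean_score_by_degree(train_edges, scores):
    """Mean score of candidate edges, split by whether the target's degree exceeds the median."""
    degree = Counter()
    for u, v in train_edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(degree.values())
    median = ranked[len(ranked) // 2]
    high = [p for (u, v), p in scores.items() if degree[v] > median]
    low = [p for (u, v), p in scores.items() if degree[v] <= median]
    return sum(high) / len(high), sum(low) / len(low)

train_edges = [(1, 0), (2, 0), (3, 0), (4, 0), (2, 1)]  # node 0 is a hub
scores = {(5, 0): 0.9, (5, 1): 0.4, (5, 3): 0.2}        # hypothetical model scores
high_mean, low_mean = mean_score_by_degree(train_edges, scores)
print(f"high-degree targets: {high_mean:.2f}, others: {low_mean:.2f}")
```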

Related Work

Dynamic Graph Learning Benchmarks

Temporal Graph Benchmark (TGB): Evaluates the quality of temporal graph neural networks
BenchTemp: Focuses on benchmarks for temporal graph data
Unified Framework: Connects discrete-time and continuous-time models

Limitations of Temporal Link Prediction Models

EdgeBank Baseline: Simple baseline achieves performance similar to state-of-the-art methods
Temporal Pattern Learning Limitations: Perturbing timestamps has only a small impact on performance, which suggests limited use of temporal patterns
Success of Heuristic Methods: Heuristics based on popularity and recent activity perform comparably to many complex models

Conclusions and Discussion

Main Conclusions

Mixed Performance: Models perform well on certain properties (such as preferential attachment) but have serious limitations in other aspects (such as distinguishing directionality and predicting density)
Consistent Limitations: All models fail to distinguish edge directionality, do not emphasize recency, and cannot accurately predict density
Model Differences: Different models show significant variations in learning specific properties, providing guidance for model selection in practical applications

Limitations

Dataset Constraints: Because of the breadth of the experiments, only a limited number of datasets is used, and these may not be representative of all temporal network datasets
Property Selection: The eight properties evaluated are not exhaustive; other important graph properties deserve consideration
Model Scope: Includes only continuous-time models; does not cover models for discrete-time settings

Future Directions

Model Improvements: Design new models to address discovered limitations (density, directionality, recency)
Framework Extensions:
- Add more graph property evaluations
- Include discrete-time models
- Consider heterogeneous networks
Application Guidance: Recommend suitable models for different application scenarios based on property learning capabilities

In-Depth Evaluation

Strengths

Systematic Approach: The first systematic evaluation of temporal graph learning models from an interpretability perspective, filling an important gap
Rigorous Methodology: Combining synthetic and real-world datasets with controlled-variable experimental design supports the reliability of the results
Important Findings: Reveals serious limitations of seemingly powerful models in learning fundamental properties, with significant practical value
Application-Oriented: Provides practical guidance for model selection and application, rather than focusing solely on benchmark performance

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why certain models fail on specific properties
Missing Improvement Solutions: Primarily identifies problems without providing specific improvement suggestions or methods
Limited Evaluation Metrics: Some experiments may benefit from more diverse evaluation metrics for comprehensive assessment of model capabilities

Impact

Academic Value: Introduces a new evaluation perspective to temporal graph learning, potentially influencing future model design and evaluation standards
Practical Value: Provides important reference for practitioners in selecting appropriate models, avoiding blind pursuit of benchmark performance
Research Inspiration: Exposed limitations provide clear improvement directions for future research

Applicable Scenarios

Model Selection: Guidance for model selection in specific applications requiring consideration of edge directionality, density prediction, and other properties
Benchmark Design: Reference for designing more comprehensive temporal graph learning benchmarks
Model Development: Provides improvement targets and evaluation standards for developing new temporal graph learning models

References

The paper cites extensive related work, including:

Temporal graph benchmark-related work (TGB, BenchTemp, etc.)
Research on limitations of temporal graph learning models
Critical research on graph learning evaluation methods
Classical graph models (stochastic block models, Barabási-Albert models, etc.)

Overall Assessment: This work reveals important limitations of temporal graph learning models through a systematic, interpretability-oriented evaluation. The methodology is rigorous and the findings are practically relevant, offering a new perspective and clear directions for improvement. Although the theoretical analysis and the proposed remedies leave room for improvement, the contribution should help move the field toward more interpretability-focused and practically grounded evaluation.