Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.
- Paper ID: 2510.09416
- Title: What Do Temporal Graph Learning Models Learn?
- Authors: Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
- Classification: cs.LG cs.SI
- Publication Date: October 10, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.09416
Temporal graph learning has become a central topic in graph representation learning, with numerous benchmarks demonstrating strong performance of state-of-the-art models. However, recent research has raised concerns about the reliability of benchmark results, identifying problems in commonly used evaluation protocols and the surprising competitiveness of simple heuristic methods. This contrast raises a question: which properties of the underlying graph do temporal graph learning models actually utilize to form predictions? This paper addresses this question by systematically evaluating the ability of seven models to capture eight fundamental properties related to temporal graph link structure. These properties include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using synthetic and real-world datasets, the paper analyzes how well models learn these properties. The findings present a mixed picture: models capture certain properties well but fail to reproduce others, exposing important limitations.
- Reliability Issues in Benchmark Evaluation: Despite temporal graph learning models demonstrating excellent performance on various benchmarks, recent research has discovered flaws in evaluation protocols, including issues with test sets and evaluation metrics that lead to unrealistic results.
- Competitiveness of Simple Heuristics: Surprisingly, simple heuristic methods that predict edges involving recently active and globally popular nodes achieve performance comparable to many state-of-the-art models.
- Lack of Model Interpretability: Even when specific models perform well on given benchmark datasets, it remains unclear which factors contribute to this performance, and more specifically, which graph properties models utilize to form predictions.
This study takes a step back and assesses how well popular temporal graph learning models learn simple, interpretable properties of temporal graphs. The goal is to provide practical insights for real-world applications of these models and to promote more interpretability-focused evaluation.
- Proposes a Novel Evaluation Framework: Systematically evaluates the ability of temporal graph learning models to capture intuitive temporal network properties
- Identifies Limitations of Existing Models: Discovers that models have limitations in distinguishing edge directionality, detecting periodic patterns, or emphasizing recently observed graph dynamics
- Provides Practical Guidance: Offers insights for real-world applications of temporal graph learning models
- Establishes Interpretability Benchmarks: Provides benchmarks for more interpretability-focused evaluation of temporal graph learning models, complementing existing performance-oriented benchmarks
This paper evaluates the ability of seven state-of-the-art temporal graph learning models to learn eight fundamental graph properties:
- General Graph Features: Temporal granularity, edge directionality, density
- Temporal Patterns: Persistence, periodicity, recency
- Edge Formation Mechanisms: Homophily, preferential attachment
The seven representative models evaluated are:
- DyGFormer: Transformer-based dynamic graph model
- GraphMixer: Temporal network model with simplified architecture
- DyRep: Recurrent neural network-based representation learning
- JODIE: Joint dynamic user and item embedding
- TGN: Temporal graph network
- TCL: Transformer-based dynamic graph modeling with contrastive learning
- TGAT: Inductive temporal graph representation learning
- Real-world Datasets: Enron email network, UCI message network, Wikipedia editing network
- Synthetic Datasets: Artificial graphs designed for specific properties, such as stochastic block models (SBM) for homophily testing and Barabási-Albert models for preferential attachment testing
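As a rough illustration of how a synthetic homophily dataset could be built, the following sketch samples a toy temporal graph from a two-block stochastic block model. The exact generation procedure used in the paper is not described here, so the function names, the alternating group assignment, and the parameters `p_in` and `p_out` are all assumptions made for this example.

```python
import random

def temporal_sbm(n_nodes, n_steps, p_in, p_out, seed=0):
    """Sample a toy temporal graph from a two-block stochastic block model.

    At every time step, each ordered pair (u, v) with u != v becomes an
    edge with probability p_in if both nodes share a group, else p_out.
    Returns (edges, group), where edges is a list of (u, v, t) triples.
    """
    rng = random.Random(seed)
    group = {u: u % 2 for u in range(n_nodes)}  # assumed: alternate labels 0/1
    edges = []
    for t in range(n_steps):
        for u in range(n_nodes):
            for v in range(n_nodes):
                if u == v:
                    continue
                p = p_in if group[u] == group[v] else p_out
                if rng.random() < p:
                    edges.append((u, v, t))
    return edges, group

def intra_group_fraction(edges, group):
    """Fraction of edges whose endpoints belong to the same group."""
    same = sum(1 for u, v, _ in edges if group[u] == group[v])
    return same / len(edges) if edges else 0.0

edges, group = temporal_sbm(n_nodes=20, n_steps=5, p_in=0.3, p_out=0.02)
print(f"{len(edges)} edges, intra-group fraction = "
      f"{intra_group_fraction(edges, group):.2f}")
```

Because `p_in` is much larger than `p_out`, most sampled edges stay within a group, which is the homophily signal a model would be expected to pick up.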
Specialized experiments designed for each property:
- Combination of synthetic and real-world datasets
- Variable control to isolate the effects of specific properties
- Performance evaluation through probability scores, ROC-AUC, balanced accuracy, and edge grouping statistics
- Systematic Evaluation Methodology: First systematic evaluation of temporal graph models' learning ability for fundamental graph properties
- Multi-dimensional Property Analysis: Covers properties across three dimensions: structural, temporal, and mechanistic
- Synthetic Data Validation: Validates models' learning ability for specific properties through carefully designed synthetic datasets
- Interpretability-Oriented Approach: Evaluates models from an interpretability perspective rather than pure performance
| Dataset | Nodes | Continuous Edges | Discrete Edges | Unique Edges | Discrete Time Steps |
|---|---|---|---|---|---|
| Enron | 184 | 125,235 | 10,472 | 3,125 | 45 (months) |
| UCI | 1,899 | 59,835 | 26,628 | 20,296 | 29 (weeks) |
| Wikipedia | 9,277 | 157,474 | 65,085 | 18,257 | 745 (hours) |
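The columns of the table can be read as counts before and after binning timestamps. The sketch below shows one plausible way to obtain them; it assumes that a discrete edge is a distinct `(source, target, time step)` triple and that a unique edge is a distinct `(source, target)` pair. The helper names and the toy data are hypothetical.

```python
def discretize(edges, bin_width):
    """Map continuous-time edges (u, v, t) to snapshot edges (u, v, step).

    Repeated interactions between the same ordered pair inside one bin
    collapse into a single discrete edge.
    """
    t0 = min(t for _, _, t in edges)
    snapshot_edges = {(u, v, int((t - t0) // bin_width)) for u, v, t in edges}
    return sorted(snapshot_edges, key=lambda e: (e[2], e[0], e[1]))

def summarize(edges, bin_width):
    """Count edges at the three levels of aggregation used in the table."""
    discrete = discretize(edges, bin_width)
    return {
        "continuous_edges": len(edges),
        "discrete_edges": len(discrete),
        "unique_edges": len({(u, v) for u, v, _ in edges}),
        "time_steps": len({step for _, _, step in discrete}),  # non-empty bins
    }

toy = [(0, 1, 0.0), (0, 1, 0.4), (1, 2, 0.9), (0, 1, 1.2), (2, 0, 2.5)]
stats = summarize(toy, bin_width=1.0)
print(stats)
```

In the toy data, the two interactions between nodes 0 and 1 in the first bin merge into one discrete edge, which is why the discrete count is lower than the continuous count.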
- ROC-AUC: For link prediction performance evaluation
- Balanced accuracy: For classification tasks
- Probability score distribution: For analyzing model prediction behavior
- Edge grouping statistics: For quantitative analysis of specific properties
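For reference, the two scalar metrics listed above can be computed as follows. This is a minimal pure-Python sketch for small inputs, not the evaluation code used in the paper; the threshold of 0.5 for balanced accuracy is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (quadratic time, small inputs only)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(labels, scores, threshold=0.5):
    """Mean of true-positive rate and true-negative rate at a threshold."""
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
print("ROC-AUC:", roc_auc(labels, scores))
print("balanced accuracy:", balanced_accuracy(labels, scores))
```

ROC-AUC depends only on the ranking of scores, whereas balanced accuracy depends on the chosen threshold, so a model can rank edges well and still score poorly on thresholded tasks.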
- Learning rate: 1e-4
- Batch size: 200
- Loss function: BCELoss
- Optimizer: Adam
- Maximum training epochs: 300
- Early stopping tolerance: 1e-6
- Temporal feature dimension: 100
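The hyperparameters above could be collected in a single configuration object, as in this sketch. The field names are invented for illustration, and the `improved` helper reflects only one assumed reading of the early stopping tolerance (a minimum improvement on a validation metric).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Training settings as listed above; field names are hypothetical."""
    learning_rate: float = 1e-4
    batch_size: int = 200
    loss: str = "BCELoss"
    optimizer: str = "Adam"
    max_epochs: int = 300
    early_stop_tolerance: float = 1e-6
    time_feat_dim: int = 100

def improved(best_metric, new_metric, tolerance):
    """Assumed reading of the tolerance: a validation metric counts as an
    improvement only if it beats the best value by more than `tolerance`."""
    return new_metric - best_metric > tolerance

cfg = TrainConfig()
print(cfg)
```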
| Graph Property | DyGFormer | DyRep | JODIE | GraphMixer | TCL | TGAT | TGN |
|---|---|---|---|---|---|---|---|
| Temporal Granularity | ∼ | ✓ | ✓ | ✓ | ∼ | ∼ | ✓ |
| Directionality | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Density | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Persistence | ✓ | ✗ | ✗ | ∼ | ∼ | ✓ | ✗ |
| Periodicity | ✗ | ✗ | ✗ | ✓ | ✓ | ∼ | ∼ |
| Recency | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Homophily | ✓ | ∼ | ✗ | ∼ | ✓ | ∼ | ∼ |
| Preferential Attachment | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
- Flattening timestamps severely impairs performance, indicating that models do utilize temporal information
- GraphMixer and DyRep show the largest performance drops when timestamps are discretized
- TGAT performs better on discrete time steps
- Key Finding: All models fail to effectively distinguish edge directionality
- For approximately 50% of edges, the prediction probability difference between forward and reverse edges is less than 0.02
- Even with bidirectional training, most models still produce approximately symmetric predictions
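The symmetry check described above can be expressed as a small diagnostic: for each observed edge, compare the score of the forward edge with the score of the reversed edge and count how often they differ by less than 0.02. The two scoring functions below are hypothetical stand-ins for a trained model.

```python
def symmetric_fraction(edges, score, eps=0.02):
    """Share of edges (u, v, t) whose forward and reverse scores differ by
    less than eps."""
    close = sum(
        1 for u, v, t in edges
        if abs(score(u, v, t) - score(v, u, t)) < eps
    )
    return close / len(edges)

# Hypothetical stand-in scorers; a trained model would replace these.
def symmetric_model(u, v, t):
    return 1.0 / (1.0 + abs(u - v))   # ignores edge direction entirely

def directed_model(u, v, t):
    return 0.9 if u < v else 0.1      # scores only "low -> high" edges highly

edges = [(0, 1, 0), (2, 5, 1), (3, 4, 2), (1, 6, 3)]
print("symmetric model:", symmetric_fraction(edges, symmetric_model))
print("directed model: ", symmetric_fraction(edges, directed_model))
```

A value near 1 indicates that the model treats edges as effectively undirected, which is the behavior reported for the evaluated models.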
- Important Limitation: All models fail to learn graph density
- Predicted density is typically several orders of magnitude lower than true density
- Models tend to predict all edges as negative when observing large numbers of negative samples
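One way to compare predicted and true density is to score every ordered node pair at a given time and count the pairs above a decision threshold. The sketch below assumes a directed graph without self-loops and a threshold of 0.5; `timid_model` is a hypothetical scorer that mimics a model predicting almost everything as negative.

```python
from itertools import permutations

def density(n_nodes, edge_set):
    """Directed density: present ordered pairs over all n*(n-1) possible ones."""
    return len(edge_set) / (n_nodes * (n_nodes - 1))

def predicted_density(n_nodes, score, t, threshold=0.5):
    """Score every ordered node pair at time t and keep those above threshold."""
    predicted = {
        (u, v) for u, v in permutations(range(n_nodes), 2)
        if score(u, v, t) >= threshold
    }
    return density(n_nodes, predicted)

true_edges = {(0, 1), (1, 2), (2, 3), (3, 0)}

def timid_model(u, v, t):
    # Hypothetical scorer that is confident about a single edge only.
    return 0.8 if (u, v) == (0, 1) else 0.05

true_d = density(4, true_edges)                   # 4 / 12
pred_d = predicted_density(4, timid_model, t=0)   # 1 / 12
print(f"true density = {true_d:.3f}, predicted density = {pred_d:.3f}")
```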
- DyGFormer and TGAT can learn persistent graphs
- JODIE and TGN perform poorly on this simple task
- GraphMixer and TCL can effectively distinguish between odd and even time steps
- DyGFormer cannot distinguish time steps, behaving similarly to the EdgeBank baseline
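To see why behaving like EdgeBank means failing at periodicity, consider a memorization baseline that scores an edge as positive whenever it has been seen before. The sketch below covers only an unlimited-memory variant, and the alternating odd/even edge sets are an assumed toy setup rather than the paper's exact construction.

```python
class EdgeBank:
    """Memorization baseline: score 1.0 for any edge seen so far, else 0.0.
    This sketches the unlimited-memory variant only."""

    def __init__(self):
        self.seen = set()

    def update(self, u, v):
        self.seen.add((u, v))

    def score(self, u, v):
        return 1.0 if (u, v) in self.seen else 0.0

even_edges = [(0, 1), (2, 3)]   # assumed active on even time steps
odd_edges = [(1, 2), (3, 0)]    # assumed active on odd time steps

bank = EdgeBank()
for t in range(4):              # observe two full periods
    for u, v in (even_edges if t % 2 == 0 else odd_edges):
        bank.update(u, v)

# At the next (even) step only even_edges are correct, but the memory
# cannot tell the two sets apart.
scores_even = [bank.score(u, v) for u, v in even_edges]
scores_odd = [bank.score(u, v) for u, v in odd_edges]
print(scores_even, scores_odd)
```

Both edge sets receive the same score once each has been observed, so a pure memorization strategy cannot separate odd from even time steps.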
- Surprising Result: All models fail to emphasize recently observed edges
- The average probability score of edges does not vary with their last observation time
- This contrasts with the success of heuristic methods based on recently active nodes
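The recency analysis can be approximated by grouping previously seen edges by how long ago they were last observed and averaging the model score per group. The following sketch assumes integer time steps; `flat_model` is a hypothetical scorer that ignores recency, which would produce the flat profile described above.

```python
from collections import defaultdict

def mean_score_by_age(history, t_now, score):
    """Average score of previously seen edges, grouped by how many steps ago
    each edge was last observed."""
    last_seen = {}
    for u, v, t in history:
        last_seen[(u, v)] = max(t, last_seen.get((u, v), t))
    buckets = defaultdict(list)
    for (u, v), t in last_seen.items():
        buckets[t_now - t].append(score(u, v, t_now))
    return {age: sum(s) / len(s) for age, s in sorted(buckets.items())}

def flat_model(u, v, t):
    # Hypothetical scorer that assigns the same score regardless of recency.
    return 0.5

history = [(0, 1, 0), (1, 2, 1), (0, 1, 2), (2, 3, 2)]
profile = mean_score_by_age(history, t_now=3, score=flat_model)
print(profile)
```

A recency-aware model would instead show scores that decrease as the age of the last observation grows.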
- DyGFormer and TCL can predict intra-group links in a balanced manner
- JODIE is extremely biased toward group 0
- Most models tend to predict links within group 1 more frequently
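Group bias of the kind described above can be quantified by splitting predicted links into those within group 0, those within group 1, and those between groups. This is a minimal sketch with a hypothetical two-group assignment and hand-picked predictions.

```python
def intra_group_shares(predicted_edges, group):
    """Split predicted links into within-group-0, within-group-1, and
    between-group shares."""
    counts = {"group_0": 0, "group_1": 0, "between": 0}
    for u, v in predicted_edges:
        if group[u] != group[v]:
            counts["between"] += 1
        else:
            counts[f"group_{group[u]}"] += 1
    total = len(predicted_edges)
    return {key: count / total for key, count in counts.items()}

group = {0: 0, 1: 0, 2: 1, 3: 1}              # hypothetical node-to-group map
predicted = [(0, 1), (1, 0), (2, 3), (0, 2)]  # hypothetical predicted links
shares = intra_group_shares(predicted, group)
print(shares)
```

If both groups are generated with the same parameters, a balanced model should yield similar shares for `group_0` and `group_1`; a large gap points to the kind of bias reported for JODIE.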
- Consistent Success: All models learn preferential attachment
- Edges involving high-degree nodes receive higher average probability scores
- Predictions follow power-law degree distribution patterns
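For context, a Barabási-Albert style edge stream, in which each arriving node attaches to existing nodes with probability proportional to their degree, can be generated as sketched below. The function names and parameters are illustrative assumptions, not the paper's data generation code.

```python
import random
from collections import Counter

def barabasi_albert_stream(n_nodes, m, seed=0):
    """Return timestamped edges (new_node, target, t) where each arriving
    node links to m distinct existing nodes chosen proportionally to degree."""
    rng = random.Random(seed)
    pool = []                    # each node appears once per incident edge
    targets = list(range(m))     # the first arrival links to all m seed nodes
    edges = []
    for t, new in enumerate(range(m, n_nodes)):
        for target in targets:
            edges.append((new, target, t))
            pool.extend([new, target])
        # Sample m distinct targets for the next arrival; drawing uniformly
        # from the pool is equivalent to degree-proportional sampling.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(pool))
        targets = sorted(chosen)
    return edges

def degrees(edges):
    """Total degree (in plus out) of every node."""
    deg = Counter()
    for u, v, _ in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

ba_edges = barabasi_albert_stream(n_nodes=200, m=2)
deg = degrees(ba_edges)
print("five largest degrees:", [d for _, d in deg.most_common(5)])
```

Nodes that arrive early tend to accumulate many more links than late arrivals, so a model that has learned preferential attachment should assign higher scores to edges that involve such high-degree nodes.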
- Temporal Graph Benchmark (TGB): Evaluates the quality of temporal graph neural networks
- BenchTemp: Focuses on benchmarks for temporal graph data
- Unified Framework: Connects discrete-time and continuous-time models
- EdgeBank Baseline: Simple baseline achieves performance similar to state-of-the-art methods
- Temporal Pattern Learning Limitations: Small impact of timestamp perturbation on performance
- Success of Heuristic Methods: Heuristics based on popularity and recent activity outperform complex models
- Mixed Performance: Models perform well on certain properties (such as preferential attachment) but have serious limitations in other aspects (such as distinguishing directionality and predicting density)
- Consistent Limitations: All models fail to distinguish edge directionality, do not emphasize recency, and cannot accurately predict density
- Model Differences: Different models show significant variations in learning specific properties, providing guidance for model selection in practical applications
- Dataset Constraints: Because the experiments are broad, only a limited number of datasets is used, and these may not be representative of all kinds of network data
- Property Selection: The eight properties evaluated are not exhaustive; other important graph properties deserve consideration
- Model Scope: Includes only continuous-time models; does not cover models for discrete-time settings
- Model Improvements: Design new models to address discovered limitations (density, directionality, recency)
- Framework Extensions:
  - Add more graph property evaluations
  - Include discrete-time models
  - Consider heterogeneous networks
- Application Guidance: Recommend suitable models for different application scenarios based on property learning capabilities
- Strong Systematicity: First systematic evaluation of temporal graph learning models from an interpretability perspective, filling an important gap
- Rigorous Methodology: Combination of synthetic and real-world datasets with controlled variable experimental design ensures result reliability
- Important Findings: Reveals serious limitations of seemingly powerful models in learning fundamental properties, with significant practical value
- Application-Oriented: Provides practical guidance for model selection and application, rather than focusing solely on benchmark performance
- Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why certain models fail on specific properties
- Missing Improvement Solutions: Primarily identifies problems without providing specific improvement suggestions or methods
- Limited Evaluation Metrics: Some experiments may benefit from more diverse evaluation metrics for comprehensive assessment of model capabilities
- Academic Value: Introduces a new evaluation perspective to temporal graph learning, potentially influencing future model design and evaluation standards
- Practical Value: Provides important reference for practitioners in selecting appropriate models, avoiding blind pursuit of benchmark performance
- Research Inspiration: Exposed limitations provide clear improvement directions for future research
- Model Selection: Guidance for model selection in specific applications requiring consideration of edge directionality, density prediction, and other properties
- Benchmark Design: Reference for designing more comprehensive temporal graph learning benchmarks
- Model Development: Provides improvement targets and evaluation standards for developing new temporal graph learning models
The paper cites extensive related work, including:
- Temporal graph benchmark-related work (TGB, BenchTemp, etc.)
- Research on limitations of temporal graph learning models
- Critical research on graph learning evaluation methods
- Classical graph models (stochastic block models, Barabási-Albert models, etc.)
Overall Assessment: This is a valuable piece of research that reveals important limitations of temporal graph learning models through systematic, interpretability-oriented evaluation. The methodology is rigorous, the findings have practical significance, and the work offers new perspectives and clear directions for improvement. Although the theoretical analysis and the proposed remedies leave room for improvement, the contributions are enough to move the field toward more interpretability-focused and practically grounded evaluation.