What Do Temporal Graph Learning Models Learn?

Hayes, Schumacher, Strohmaier

Learning on temporal graphs has become a central topic in graph representation learning, with numerous benchmarks indicating the strong performance of state-of-the-art models. However, recent work has raised concerns about the reliability of benchmark results, noting issues with commonly used evaluation protocols and the surprising competitiveness of simple heuristics. This contrast raises the question of which properties of the underlying graphs temporal graph learning models actually use to form their predictions. We address this by systematically evaluating seven models on their ability to capture eight fundamental attributes related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, we analyze how well models learn these attributes. Our findings reveal a mixed picture: models capture some attributes well but fail to reproduce others. With this, we expose important limitations. Overall, we believe that our results provide practical insights for the application of temporal graph learning models, and motivate more interpretability-driven evaluations in temporal graph learning research.

Basic Information

Paper ID: 2510.09416
Title: What Do Temporal Graph Learning Models Learn?
Authors: Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
Categories: cs.LG cs.SI
Published: 10 October 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.09416

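Abstract

Temporal graph learning has become a core topic in graph representation learning, and numerous benchmarks indicate strong performance of state-of-the-art models. Recent work, however, has raised concerns about the reliability of benchmark results, pointing to problems with commonly used evaluation protocols and to the surprising competitiveness of simple heuristics. This contrast raises a question: which properties of the underlying graph do temporal graph learning models actually use to form their predictions? The paper addresses this by systematically evaluating the ability of seven models to capture eight fundamental properties related to the link structure of temporal graphs. These include structural characteristics such as density, temporal patterns such as recency, and edge formation mechanisms such as homophily. Using both synthetic and real-world datasets, it analyzes how well models learn these properties. The findings are mixed: models capture some properties well but fail to reproduce others, which exposes important limitations.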

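Research Background and Motivation

Problem Background

Reliability of benchmark evaluation: Although temporal graph learning models perform well across a range of benchmarks, recent work has found flaws in the evaluation protocols, including problems with test sets and evaluation metrics that lead to unrealistic results.
Competitiveness of simple heuristics: Surprisingly, simple heuristics, such as predicting edges that involve recently active and globally popular nodes, perform on par with many state-of-the-art models.
Lack of model interpretability: Even when a particular model performs well on a given benchmark dataset, it is unclear which factors drive that performance and, more specifically, which graph properties the model uses to form its predictions.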
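
To make the heuristic baselines mentioned above concrete, the following is a minimal sketch of a scorer that favors globally popular and recently active destination nodes. It is not the implementation used in the cited work: the function names, the equal weighting of the two signals, and the `horizon` parameter are all assumptions made for illustration.

```python
from collections import Counter

def popularity_scores(train_edges):
    """Score each node by its share of destination appearances in training."""
    counts = Counter(dst for _, dst, _ in train_edges)
    total = sum(counts.values()) or 1
    return {node: c / total for node, c in counts.items()}

def last_seen(train_edges):
    """Map each node to the latest timestamp at which it was active."""
    seen = {}
    for src, dst, t in train_edges:
        seen[src] = max(seen.get(src, t), t)
        seen[dst] = max(seen.get(dst, t), t)
    return seen

def heuristic_score(src, dst, t, pop, seen, horizon=10.0):
    """Hypothetical rule: average of dst popularity and a linear recency bonus.

    `src` is unused here; it is kept so the signature matches a link scorer.
    """
    recency = 0.0
    if dst in seen:
        age = max(t - seen[dst], 0.0)
        recency = max(1.0 - age / horizon, 0.0)
    return 0.5 * pop.get(dst, 0.0) + 0.5 * recency

# Toy edge stream of (source, destination, timestamp) triples.
edges = [(0, 1, 1.0), (2, 1, 2.0), (0, 3, 3.0), (4, 1, 9.0)]
pop = popularity_scores(edges)
seen = last_seen(edges)
score_popular = heuristic_score(0, 1, 10.0, pop, seen)  # popular, recently active
score_stale = heuristic_score(0, 3, 10.0, pop, seen)    # rare, long inactive
```

A scorer of this kind needs no training, which is why its competitiveness with learned models is read as a warning sign about the benchmarks.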

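Research Motivation

This study takes a step back and evaluates how well popular graph learning models learn simple, interpretable properties of temporal graphs. The aim is to provide practical insights for applying temporal graph learning models and to encourage more interpretability-driven evaluation.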

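Core Contributions

A novel evaluation framework: a systematic evaluation of how well temporal graph learning models capture intuitive properties of temporal networks
Identified limitations of existing models: models struggle to distinguish edge direction, detect periodic patterns, and emphasize recently observed graph dynamics
Practical guidance: insights for applying deep graph learning models in practice
An interpretability benchmark: a basis for more interpretability-driven evaluation of temporal graph learning models, complementing existing performance-oriented benchmarks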

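Method Details

Task Definition

The paper evaluates the ability of seven state-of-the-art temporal graph learning models to learn eight fundamental graph properties:

General graph characteristics: temporal granularity, edge direction, density
Temporal patterns: persistence, periodicity, recency
Edge formation mechanisms: homophily, preferential attachment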
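
Two of these properties, density and persistence, can be quantified directly on graph snapshots. The sketch below shows one plausible way to do so. The definitions are assumptions chosen for illustration (density as the share of possible node pairs that carry an edge, persistence as the share of edges repeated from the previous snapshot) and may differ from those used in the paper.

```python
def snapshot_density(edges, num_nodes, directed=True):
    """Fraction of possible node pairs (excluding self-loops) that carry an edge."""
    unique = {
        (u, v) if directed else tuple(sorted((u, v)))
        for u, v in edges
        if u != v
    }
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return len(unique) / possible if possible else 0.0

def persistence(prev_edges, curr_edges):
    """Share of edges in the previous snapshot that reappear in the current one."""
    prev, curr = set(prev_edges), set(curr_edges)
    return len(prev & curr) / len(prev) if prev else 0.0

# Two toy snapshots over four nodes.
snap_0 = [(0, 1), (1, 2), (2, 3)]
snap_1 = [(0, 1), (1, 2), (3, 0)]
density_0 = snapshot_density(snap_0, num_nodes=4)   # 3 of 12 directed pairs
persist_01 = persistence(snap_0, snap_1)            # 2 of 3 edges repeat
```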

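Evaluation Framework

Model Selection

Seven representative models are evaluated:

DyGFormer: Transformer-based dynamic graph model
GraphMixer: temporal network model with a simplified architecture
DyRep: representation learning based on recurrent neural networks
JODIE: joint dynamic user and item embeddings
TGN: Temporal Graph Networks
TCL: Transformer-based dynamic graph modelling via contrastive learning
TGAT: inductive representation learning on temporal graphs

Dataset Design

Real-world datasets: Enron email network, UCI message network, Wikipedia edit network
Synthetic datasets: artificial graphs designed to target specific properties, such as the stochastic block model (SBM) for testing homophily and the Barabási-Albert model for testing preferential attachment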
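
As an illustration of the synthetic setup, the sketch below samples a temporal edge stream from a stochastic block model and measures how homophilous the result is. It is a minimal sketch under stated assumptions, not the paper's generator: drawing one independent snapshot per time step, the round-robin block assignment, and all parameter values are choices made here for illustration.

```python
import random

def temporal_sbm(num_nodes, num_blocks, p_in, p_out, num_steps, seed=0):
    """Sample one independent directed SBM snapshot per time step.

    Returns the block assignment and a list of (src, dst, t) edges.
    """
    rng = random.Random(seed)
    block = [node % num_blocks for node in range(num_nodes)]
    edges = []
    for t in range(num_steps):
        for u in range(num_nodes):
            for v in range(num_nodes):
                if u == v:
                    continue  # no self-loops
                p = p_in if block[u] == block[v] else p_out
                if rng.random() < p:
                    edges.append((u, v, t))
    return block, edges

def within_block_fraction(block, edges):
    """Fraction of edges whose endpoints share a block (a simple homophily measure)."""
    if not edges:
        return 0.0
    same = sum(block[u] == block[v] for u, v, _ in edges)
    return same / len(edges)

block, edges = temporal_sbm(
    num_nodes=40, num_blocks=2, p_in=0.3, p_out=0.02, num_steps=5, seed=42
)
homophily = within_block_fraction(block, edges)
```

With `p_in` much larger than `p_out`, most sampled edges fall within a block, so a model that has learned homophily should assign higher probabilities to within-block candidate edges than to cross-block ones.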

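Evaluation Method

A dedicated experiment is designed for each property:

Combining synthetic and real-world datasets
Controlling variables to isolate the effect of the specific property
Assessing model performance with metrics such as probability scores and accuracy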

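Technical Innovations

Systematic evaluation: the first systematic evaluation of how well temporal graph models learn fundamental graph properties
Multi-dimensional property analysis: covers properties along three dimensions, namely structure, time, and mechanism
Validation on synthetic data: carefully designed synthetic datasets verify whether models learn specific properties
Interpretability orientation: models are evaluated from an interpretability perspective rather than a purely performance-based one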

Experimental Setup

Dataset Details

Dataset	Nodes	Continuous edges	Discrete edges	Unique edges	Discrete time steps
Enron	184	125,235	10,472	3,125	45 (months)
UCI	1,899	59,835	26,628	20,296	29 (weeks)
Wikipedia	9,277	157,474	65,085	18,257	745 (hours)
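
The gap between the edge counts in the table comes from discretization: repeated interactions within one time step collapse into a single discrete edge. The sketch below shows one plausible reading of the columns on a toy event stream. The function names and the exact counting rules are assumptions, not taken from the paper.

```python
def discretize(events, step_length):
    """Map continuous timestamps to snapshot indices, dropping repeats within a snapshot."""
    t0 = min(t for _, _, t in events)
    snapshot_edges = {(u, v, int((t - t0) // step_length)) for u, v, t in events}
    return sorted(snapshot_edges, key=lambda e: (e[2], e[0], e[1]))

def edge_statistics(events, step_length):
    """Count edges at the three levels of aggregation used in the table."""
    discrete = discretize(events, step_length)
    return {
        "continuous_edges": len(events),                       # every timestamped event
        "discrete_edges": len(discrete),                       # unique (u, v, step)
        "unique_edges": len({(u, v) for u, v, _ in events}),   # unique (u, v)
        "time_steps": len({s for _, _, s in discrete}),        # non-empty snapshots only
    }

# Toy stream of (source, destination, timestamp) events.
events = [(0, 1, 0.0), (0, 1, 0.4), (1, 2, 0.9), (0, 1, 1.2), (2, 0, 2.5)]
stats = edge_statistics(events, step_length=1.0)
```

Here the two early `(0, 1)` events fall into the same snapshot and count once, while the later one lands in a new snapshot and counts again.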

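Evaluation Metrics

ROC-AUC: for evaluating link prediction performance
Balanced accuracy: for classification tasks
Probability score distributions: for analyzing model prediction behavior
Edge-group statistics: for quantitative analysis of specific properties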
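
For reference, the first two metrics can be computed from scratch as sketched below. This is a didactic version with quadratic-time AUC, assuming binary labels in {0, 1} with both classes present. The 0.5 decision threshold in the example is an assumption.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(labels, predictions):
    """Mean of the true-positive rate and the true-negative rate."""
    n_pos = sum(y == 1 for y in labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, predictions))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, predictions))
    return 0.5 * (tp / n_pos + tn / n_neg)

# Toy link-prediction output: three true edges, five negative samples.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
auc = roc_auc(labels, scores)
bal_acc = balanced_accuracy(labels, [int(s >= 0.5) for s in scores])
```

Balanced accuracy averages the per-class recall, so it stays informative when negatives outnumber positives, as is typical when negative edges are sampled.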

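Implementation Details

Learning rate: 1e-4
Batch size: 200
Loss function: BCELoss
Optimizer: Adam
Maximum training epochs: 300
Early stopping tolerance: 1e-6
Time feature dimension: 100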
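
The settings above could be collected in a configuration object and combined with a tolerance-based early-stopping rule, as in the sketch below. It is framework-free, so the Adam optimizer and BCELoss are not reproduced, and a list of precomputed validation scores stands in for real training epochs. The class and field names are hypothetical, and the `patience` value is an assumption because the text does not state one.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-4   # from the settings above
    batch_size: int = 200         # from the settings above
    max_epochs: int = 300         # from the settings above
    tolerance: float = 1e-6       # early-stopping tolerance from the settings above
    time_feat_dim: int = 100      # from the settings above
    patience: int = 20            # ASSUMPTION: not stated in the text

class EarlyStopper:
    """Stop once the metric has failed to improve by more than `tolerance`
    for `patience` consecutive epochs."""

    def __init__(self, patience, tolerance):
        self.patience = patience
        self.tolerance = tolerance
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one validation result; return True if training should stop."""
        if metric > self.best + self.tolerance:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(val_metrics, config):
    """Walk through validation metrics, one per epoch, until stopping triggers."""
    stopper = EarlyStopper(config.patience, config.tolerance)
    for epoch, metric in enumerate(val_metrics[: config.max_epochs], start=1):
        if stopper.step(metric):
            return epoch, stopper.best
    return min(len(val_metrics), config.max_epochs), stopper.best

config = TrainConfig(patience=3)
stopped_at, best = run([0.70, 0.75, 0.75, 0.74, 0.75, 0.80], config)
```

In this toy run the metric plateaus at 0.75 for three epochs, so training stops before the later improvement to 0.80 is seen, which illustrates the trade-off in choosing the patience.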

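Experimental Results

Summary of Main Findings

Graph property	DyGFormer	DyRep	JODIE	GraphMixer	TCL	TGAT	TGN
Temporal granularity	∼	✓	✓	✓	∼	∼	✓
Direction	✗	✗	✗	✗	✗	✗	✗
Density	✗	✗	✗	✗	✗	✗	✗
Persistence	✓	✗	✗	∼	∼	✓	✗
Periodicity	✗	✗	✗	✓	✓	∼	∼
Recency	✗	✗	✗	✗	✗	✗	✗
Homophily	✓	∼	✗	∼	✓	∼	∼
Preferential attachment	✓	✓	✓	✓	✓	✓	✓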