Do Transformers Really Perform Badfor Graph Representation?

microsoft/Graphormer: This is the official implementation for "Do Transformers Really Perform Bad for Graph Representation?". (github.com)

作者们发现关键问题在于如何补回Transformer模型的自注意力层丢失掉的图结构信息！不同于序列数据(NLP, Speech)或网格数据(CV)，图的结构信息是图数据特有的属性，且对图的性质预测起着重要的作用。

There are many attempts of leveraging Transformer into the graph domain, but the only effective way is replacing some key modules (e.g., feature aggregation) in classic GNN variants by the softmax attention[47,7,22,48,58,43,13]

[47] Graph attention networks. ICLR, 2018.
[7] Graph transformer for graph-to-sequence learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7464–7471, 2020.
[22] Heterogeneous graph transformer. In Proceedings of The Web Conference 2020, pages 2704–2710, 2020.
[48] Direct multi-hop attention based graph neural network.arXiv preprint arXiv:2009.14332, 2020.
[58] Graph-bert: Only attention is needed forlearning graph representations.arXiv preprint arXiv:2001.05140, 2020.
[43] Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information ProcessingSystems, 33, 2020.
[13] generalization of transformer networks to graphs. AAAI Workshop on Deep Learning on Graphs: Methods and Applications, 2021

对于每个节点，the self-attention 只计算节点和其他节点之间的语义相似性，而不考虑反映在节点上的图的结构信息和节点对之间的关系。

基于此，研究人员们在图预测任务上提出了Graphormer模型 —— 一个标准的Transformer模型，并且带有三种结构信息编码(中心性编码Centrality Encoding、空间编码Spatial Encoding以及边编码Edge Encoding)，帮助Graphormer模型编码图数据的结构信息。

Centrality Encoding: capture the node importance in the graph. In particular, we leverage the degree centrality for the centrality encoding, where a learnable vectoris assigned to each node according to its degree and added to the node features in the input layer.
Spatial Encoding: capture the structural relation between nodes.
Edge Encoding

通过使用上述编码，我们进一步从数学上证明了Graphormer具有很强的表达能力，因为许多流行的GNN变体只是它的特例。

2.1 Structural Encodings in Graphormer

2.1.1 a Centrality Encoding

In Graphormer, we use the degree centrality, which is one of the standard centrality measures inliterature, as an additional signal to the neural network. To be specific, we develop a Centrality Encoding which assigns each node two real-valued embedding vectors according to its indegree and outdegree.

2.1.2 a Centrality Encoding

An advantage of Transformer is its global receptive field.

Spatial Encoding:

In this paper, we choose φ(vi,vj) to be the distance of the shortest path (SPD) between vi and vj if the two nodes are connected. If not, we set the output ofφto be a special value, i.e., -1. We assign each (feasible) output value a learnable scalar which will serve as a bias term in the self-attention module. Denote Aij as the (i,j)-element of the Query-Key product matrix A, we have:

2.1.3 Edge Encoding in the Attention

In many graph tasks, edges also have structural features.

In the first method, the edge features areadded to the associated nodes’ features [21,29].

[21] Open graph benchmark: Datasets for machine learning on graphs.arXiv preprintarXiv:2005.00687, 2020.
[29] Deepergcn: All you need to train deepergcns.arXiv preprint arXiv:2006.07739, 2020

In the second method, for each node, its associated edges’ features will be used together with the node features in the aggregation [15,51,25].

[51] How powerful are graph neural networks?InInternational Conference on Learning Representations, 2019.
[25] Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016

However, such ways of using edge feature only propagate the edge information to its associated nodes, which may not be an effective way to leverage edge information in representation of the whole graph.

a new edge encoding method in Graphormer: