Memory-Associated Differential Learning: Paper and Code Walkthrough

Paper sources:

Paper PDF:

Memory-Associated Differential Learning paper

Paper code:

Memory-Associated Differential Learning code

Paper walkthrough:

1. Abstract

Conventional Supervised Learning approaches focus on the mapping from input features to output labels. After training, the learnt models alone are adapted onto testing features to predict testing labels in isolation, with training data wasted and their associations ignored. To take full advantage of the vast number of training data and their associations, we propose a novel learning paradigm called Memory-Associated Differential (MAD) Learning. We first introduce an additional component called Memory to memorize all the training data. Then we learn the differences of labels as well as the associations of features in the combination of a differential equation and some sampling methods. Finally, in the evaluating phase, we predict unknown labels by inferencing from the memorized facts plus the learnt differences and associations in a geometrically meaningful manner. We gently build this theory in unary situations and apply it on Image Recognition, then extend it into Link Prediction as a binary situation, in which our method outperforms strong state-of-the-art baselines on three citation networks and the ogbl-ddi dataset.

2. Introduction


Figure 1: The difference between Conventional Supervised Learning and MAD Learning. The former learns the mapping from features to labels in training data and applies this mapping to testing data, while the latter learns the differences and associations among data and infers testing labels from memorized training data.

3. Related Works

Instead of treating External Memory as a way to add more learnable parameters to store uninterpretable hidden states, we try to memorize the facts as they are, and then learn the differences and associations between them.

Most of the experiments in this article are designed to solve the Link Prediction problem, in which we predict whether a pair of nodes in a graph are likely to be connected, how much weight their edge bears, or what attributes their edge should have.

Although our method is derived from a different perspective, we point out that Matrix Factorization can be seen as a simplification of MAD Learning with no memory and no sampling.
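One way to see this (our own reading, not an equation from the paper): dropping the memory term and fixing the reference at the origin collapses the binary MAD estimate into a bilinear score of the two node embeddings, which is exactly the inner-product score used by Matrix Factorization:

    \hat r(u, v) = \underbrace{r(u_0, v_0)}_{\text{no memory: } 0} + \bigl(f(u) - \underbrace{f(u_0)}_{\text{no sampling: } 0}\bigr) \cdot g_1(v) = f(u) \cdot g_1(v)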

4. Proposed Approach

4.1 Memory-Associated Differential Learning

By applying Mean Value Theorem for Definite Integrals [Comenetz, 2002], we can estimate the unknown y with known y0 if x0 is close enough to x:
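Written out, the first-order estimate is (reconstructed from the description above and Figure 2(a); the equation image is not reproduced here):

    \hat y \mid y_0 = y_0 + (x - x_0)\, y'(x_0), \qquad \text{i.e. } y \approx y_0 + \Delta x \cdot y'(x)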

In this way, we connect the current prediction task y to a past fact y0, which can be stored in external memory, and convert the learning of our target function y(x) to the learning of a differential function y′(x), which in general is more accessible than the former.

4.2 Inferencing from Multiple References

To get a steady and accurate estimation of y, we can sample n references x₁, x₂, …, xₙ to get n estimations ŷ|y₁, ŷ|y₂, …, ŷ|yₙ and combine them with an aggregator such as the mean:
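With the mean as the aggregator, this reads (our reconstruction of the omitted equation):

    \hat y = \frac{1}{n} \sum_{i=1}^{n} \hat y \mid y_i = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i + (x - x_i)\, y'(x_i) \bigr)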

Here we adopt a function Softmin derived from Softmax, which rescales the inputted d-dimensional array v so that every element of v lies in the range [0, 1] and all of them sum to 1:
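Softmin is simply Softmax applied to the negated inputs, so for a d-dimensional array v (reconstructed definition; it matches torch.softmax(-dist) in the code below):

    \mathrm{Softmin}(v)_i = \frac{\exp(-v_i)}{\sum_{j=1}^{d} \exp(-v_j)}, \qquad i = 1, \dots, d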

By applying Softmin we get the aggregated estimation:
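Writing d_i = ‖f(x) − f(x_i)‖ for the distance to the i-th reference, the Softmin-weighted estimate is (reconstructed):

    \hat y = \sum_{i=1}^{n} \mathrm{Softmin}(d)_i \cdot \hat y \mid y_i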


Figure 2: (a) Memory-Associated Differential Learning infers labels from memorized ones following the first-order Taylor series approximation: y ≈ y₀ + ∆x · y′(x). (b) In binary MAD Learning, when v = v₀ holds, ∂r/∂u |(u,v) is simplified to ∂r/∂u |v, since it is the change of r after u is moved slightly relative to u₀ with v held fixed.

4.3 Soft Sentinels and Uncertainty

We introduce a mechanism on top of Softmin named Soft Sentinel. A Soft Sentinel is a dummy element mixed into the array of estimations, carrying no information (e.g. the logit is 0) but a set distance (e.g. 0).

The estimation after adding k Soft Sentinels at distance 1 is:
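Since each sentinel contributes logit 0 at distance 1, the Softmin weights are simply renormalized over the enlarged array, giving (our reconstruction, consistent with the zeros/ones concatenation in MAD.forward below):

    \hat y = \sum_{i=1}^{n} \frac{\exp(-d_i)}{\sum_{j=1}^{n} \exp(-d_j) + k\, e^{-1}} \cdot \hat y \mid y_i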

When Soft Sentinels are involved, only estimations given by close-enough references have much impact on the final result, so unreliable estimations are suppressed.

4.4 Other Details
For the sake of flexibility and performance, we usually do not use the inputted features x directly, but first convert x into a position f(x).

To adapt to this situation, we generally wrap the memory with an adaptor function m such as a one-layer MLP, getting ŷ|y₀ = m(y₀) + (f(x) − f(x₀)) · g(x), where g(x) stands for the gradient.

When the encodings of nodes are dynamic and no features are provided, we usually adopt Random Mode in the training phase for efficiency and adopt Dynamic NN Mode in the evaluation phase for performance.
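In citations.py the two modes correspond to how the reference nodes mid0/mid1 are drawn. A minimal sketch of the contrast (the helper sample_references is our own name; note that in the script below the nns() call is actually commented out, so Random Mode is used throughout):

import torch

def sample_references(mad, src, dst, n_nodes, n_samples, random_mode=True):
    if random_mode:
        # Random Mode: uniformly random reference nodes, cheap enough for training
        n = src.shape[0]
        mid0 = torch.randint(0, n_nodes, (n, n_samples))
        mid1 = torch.randint(0, n_nodes, (n, n_samples))
        return mid0, mid1
    # Dynamic NN Mode: nearest neighbours in feature space, cached by MAD.nns(),
    # which the paper recommends for the evaluation phase
    return mad.nns(src, dst)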

4.5 Binary MAD Learning
We model the relationship between a pair of nodes in a graph by extending MAD Learning into binary situations.

Therefore, we may further assume ∂r/∂u |(u,v) = g₁(v) if v = v₀ and ∂r/∂v |(u,v) = g₂(u) if u = u₀, making
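the pair of first-order estimates (reconstructed to match logits1 and logits2 in the code below):

    \hat r(u, v) = r(u_0, v) + \bigl(f(u) - f(u_0)\bigr) \cdot g_1(v) \quad (\text{with } v \text{ fixed}),
    \hat r(u, v) = r(u, v_0) + \bigl(f(v) - f(v_0)\bigr) \cdot g_2(u) \quad (\text{with } u \text{ fixed}).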

Here g₁(·) is the destination differential function and g₂(·) is the source differential function. If the edge is undirected, these two functions can be shared.

5. Experiments

In the training phase, we sample arbitrary pairs of nodes to construct negative samples [Grover and Leskovec, 2016] and compare the scores between connected pairs and negative samples with Cross-Entropy as the loss function:

where y is the number of positive samples and n of negative samples, py(i) is the predicted probability of the i-th positive sample and pn(i) of the i-th negative sample. In the evaluating phase, we record the scores not only in Dynamic NN Mode but also in Random Mode.
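Matching this description and the training loop below (-log(p_pos).mean() - log(1 - p_neg).mean()), the loss is (reconstructed):

    \mathcal{L} = -\frac{1}{y} \sum_{i=1}^{y} \log p_y(i) \; - \; \frac{1}{n} \sum_{i=1}^{n} \log\bigl(1 - p_n(i)\bigr)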

We have these three experimental settings to examine the contribution of Softmin and Soft Sentinels (a sketch of the three aggregators follows the list):

mean: estimations are aggregated by the mean function.
softmin: estimations given by different references are summed up, weighted by the results of Softmin applied to the distances.
sentinel: the softmin estimation with 8 Soft Sentinels at distance 1 added.

As is shown in Figure 4(b), there is not much difference between mean and softmin. But when mixed with Soft Sentinels, MAD Learning performs better and converges faster.
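The three settings map directly onto the last lines of MAD.forward. A minimal sketch of the three aggregators (assuming logits and dist of shape (batch, n_references), as in the code below; aggregate and k_sentinels are our own names):

import torch

def aggregate(logits, dist, mode='sentinel', k_sentinels=8):
    if mode == 'mean':
        # mean: plain average of the per-reference estimates
        return torch.sigmoid(logits.mean(dim=1))
    if mode == 'sentinel':
        # sentinel: append k dummy estimates with logit 0 at distance 1
        n = logits.shape[0]
        logits = torch.cat((logits, torch.zeros(n, k_sentinels, device=logits.device)), dim=1)
        dist = torch.cat((dist, torch.ones(n, k_sentinels, device=dist.device)), dim=1)
    # softmin (and sentinel): weight the estimates by Softmin of the distances
    weights = torch.softmax(-dist, dim=1)
    return torch.sigmoid((logits * weights).sum(dim=1))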

We repeat that MAD Learning does not predict directly. From another point of view, this experiment implies that indirect references can also be beneficial, on par with direct information.

6. Discussion

By extending it from a scalar to a vector, MAD Learning can be used for graphs with featured edges.

We also point out that MAD Learning can learn relations in heterogeneous graphs where nodes belong to different types (usually represented by encodings in different lengths). The only requirement is that positions of the source nodes should match with gradients of the destination nodes and vice versa.

7. Conclusion

In this work, we explore a novel learning paradigm which is flexible, effective and interpretable. The outstanding results, especially on Link Prediction, open the door for several research directions:

  1. The most important part of MAD Learning is memory. However, MAD Learning has to index the whole training data for random access. In Link Prediction, we implement memory as a dense adjacency matrix, which occupies a huge amount of space. The way to shrink memory and improve the utilization of space should be investigated in the future.

  2. Based on memory as the ground-truth, MAD Learning appends some difference as the second part. We implement this difference simply as the product of distance and differential function, but we believe there exist different ways to model it.

  3. The third part of MAD Learning is the similarity, which is used to assign weights to estimations given by different references. We reuse distance to compute the similarity, but decoupling it with other embeddings and other measurements such as the inner product is also worth exploring.

  4. In this work, we deliberately do not combine direct information, in order to focus only on MAD Learning. Since MAD Learning takes another, parallel route to predict, we believe integrating MAD Learning and Conventional Supervised Learning is also a promising direction.


Code walkthrough:

For the MAD class I mainly referred to a senior student's post, Memory-Associated Differential Learning论文Link Prediction源码解读 (I'm still pretty green at this =。=).
The analysis below focuses on citations.py:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader
import dgl
import dgl.nn
from sklearn import metrics

# hyperparameters
g_data_name = 'pubmed'  # cora | citeseer | pubmed
g_toy = False   # the "toy" branches look like the author's own debugging path
g_dim = 32      # 32-dimensional node embeddings, as in the paper
n_samples = 8   # 8 references per prediction
total_epoch = 200
lr = 0.005
if g_toy:
    g_dim = g_dim // 2
elif g_data_name == 'pubmed':
    # hyperparameters are changed for link prediction on pubmed
    g_dim, n_samples, total_epoch, lr = 64, 64, 2000, 0.001

def gpu(x):
    return x.cuda() if torch.cuda.is_available() else x

def cpu(x):
    return x.cpu() if torch.cuda.is_available() else x

def ip(x, y):
    # batched inner product: unsqueeze(-2) inserts a dimension in the second-to-last
    # position and unsqueeze(-1) in the last, so @ multiplies (1, d) by (d, 1);
    # squeeze(-1) then removes the trailing size-1 dimensions
    return (x.unsqueeze(-2) @ y.unsqueeze(-1)).squeeze(-1).squeeze(-1)

class MAD(nn.Module):
    def __init__(self, in_feats, n_nodes, node_feats,
                 n_samples, mem, feats, gather2neighbor=False):
        super(self.__class__, self).__init__()
        self.n_nodes = n_nodes
        self.node_feats = node_feats
        self.n_samples = n_samples
        self.mem = mem
        self.feats = feats
        self.gather2neighbor = gather2neighbor
        self.f = gpu(nn.Linear(in_feats, node_feats))
        self.g = (None if gather2neighbor
                  else gpu(nn.Linear(in_feats, node_feats)))
        # self.g plays the role of g(u) and g(v) in the paper: g(u) is the partial
        # derivative of r(u, v) with respect to u at v_0. With v fixed at v_0, it
        # describes the small change of r on top of r(u_0, v_0) caused by moving u
        # slightly away from u_0 (the same idea as in calculus).
        self.adapt = gpu(nn.Linear(1, 1))
        self.nn = None

    def nns(self, src, dst):    # debugging code (Dynamic NN Mode)
        if self.nn is None:
            n = self.n_samples
            self.nn = gpu(torch.empty((self.n_nodes, n), dtype=int))
            for perm in DataLoader(range(self.n_nodes), 64, shuffle=False):
                self.nn[perm] = (
                    self.feats[perm].unsqueeze(1) - self.feats.unsqueeze(0)
                ).norm(dim=-1).topk(1 + n, largest=False).indices[..., 1:]
        return self.nn[src], self.nn[dst]

    def recall(self, src, dst):
        # Called from forward as self.recall(mid0, dst.unsqueeze(1)), so
        # src = mid0 has shape (1024, 8) and dst has shape (1024, 1).
        # self.mem[src, dst] relies on broadcasting: the single element in each row
        # of dst is paired with the 8 elements in the same row of src, giving 8
        # (row, column) coordinates per edge. Since mem is the adjacency matrix of
        # the training set, self.mem[src, dst] fetches 1024 x 8 values of r_0.
        if self.mem is None:
            return 0
        return self.adapt((0.0 + self.mem[src, dst]).unsqueeze(-1)).squeeze(-1)
        # self.mem is a boolean matrix; adding 0.0 turns the looked-up r_0 values
        # into floats. The values are not used as r_0(u_0, v_0) directly: they are
        # passed through self.adapt (nn.Linear(1, 1)), i.e. a learnable affine
        # rescaling that the paper does not mention explicitly.

    def forward(self, src, dst):
        # Called as mad(train_src[perm], train_dst[perm]), so
        # src = train_src[perm] and dst = train_dst[perm].
        n = src.shape[0]    # number of edges in this batch
        feats = self.feats  # node features
        g = self.f if self.gather2neighbor else self.g
        mid0 = torch.randint(0, self.n_nodes, (n, self.n_samples))
        mid1 = torch.randint(0, self.n_nodes, (n, self.n_samples))
        # mid0, mid1 = self.nns(src, dst)
        # mid0 and mid1 are (n, n_samples) tensors of random reference node ids
        # drawn from [0, n_nodes) (Random Mode).
        srcdiff = self.f(feats[src]).unsqueeze(1) - self.f(feats[mid0])
        # feats[src] has shape (1024, 1433): batch size x raw feature dimension.
        # After the linear map self.f it becomes (1024, 32). mid0 has shape
        # (1024, 8), so feats[mid0] is (1024, 8, 1433) and self.f(feats[mid0]) is
        # (1024, 8, 32). Every source node is differenced against its 8 reference
        # nodes, implementing (u - u_0) from Section 3.5, with 1024 x 8 different
        # u_0. unsqueeze(1) turns self.f(feats[src]) into (1024, 1, 32) so that
        # broadcasting subtracts all 8 references at once; srcdiff is (1024, 8, 32).
        logits1 = (ip(srcdiff, g(feats[dst]).unsqueeze(1))
                   + self.recall(mid0, dst.unsqueeze(1)))
        # logits1 computes g(v)·(u - u_0) + r_0 for v = v_0, batched: 1024 edges
        # with 8 estimates each, where u and v are the embeddings of an edge's
        # source and destination nodes. In detail: ip() performs the batched
        # g(v)·(u - u_0) of Section 3.5 (Link Prediction). srcdiff holds the
        # 1024 x 8 vectors (u - u_0); g(feats[dst]) holds the 1024 vectors g(v)
        # with shape (1024, 32), unsqueezed to (1024, 1, 32) so it broadcasts
        # against srcdiff of shape (1024, 8, 32). Inside ip() the operands become
        # (1024, 8, 1, 32) and (1024, 1, 32, 1); x @ y multiplies the innermost
        # two dimensions, (1, 32) x (32, 1) = (1, 1), i.e. the element-wise
        # product-and-sum of g(v) with (u - u_0). The result (1024, 8, 1, 1) is
        # squeezed to (1024, 8). self.recall(mid0, dst.unsqueeze(1)) then adds the
        # 1024 x 8 memorized values r_0(u_0, v_0) (see recall above), completing
        # g(v)·(u - u_0) + r_0 for v = v_0.
        dstdiff = self.f(feats[dst]).unsqueeze(1) - self.f(feats[mid1])
        # Symmetric term: every destination node is differenced against its 8
        # reference nodes, implementing (v - v_0) with 8 different v_0 per edge.
        logits2 = (ip(dstdiff, g(feats[src]).unsqueeze(1))
                   + self.recall(src.unsqueeze(1), mid1))
        # logits2 computes the 8 estimates g(u)·(v - v_0) + r_0 for u = u_0.
        logits = torch.cat((logits1, logits2), dim=1)
        dist = torch.cat((srcdiff, dstdiff), dim=1).norm(dim=2)
        # srcdiff and dstdiff are both (1024, 8, 32), so the concatenation is
        # (1024, 16, 32). norm(dim=2) computes the distances of the paper over
        # dimension 2; since p is not given it defaults to the 2-norm, so dist
        # has shape (1024, 16).
        logits = torch.cat((logits, gpu(torch.zeros(n, self.n_samples))), dim=1)
        # 8 Soft Sentinels at distance 1: each sentinel has logit 0, hence zeros
        dist = torch.cat((dist, gpu(torch.ones(n, self.n_samples))), dim=1)
        # each Soft Sentinel sits at distance 1, hence ones
        return torch.sigmoid(ip(logits, torch.softmax(-dist, dim=1)))
        # softmax(-dist) is exactly Softmin(dist); the weighted sum of logits is
        # passed through a sigmoid to give the edge probability

dataset = (dgl.data.CoraGraphDataset() if g_data_name == 'cora'
           else dgl.data.CiteseerGraphDataset() if g_data_name == 'citeseer'
           else dgl.data.PubmedGraphDataset())
# DGL's built-in datasets; a tutorial is at
# https://docs.dgl.ai/tutorials/blitz/2_dglgraph.html
graph = dataset[0]
# take the first (and only) graph
src, dst = graph.edges()
# all edges of the graph, as parallel arrays of source and destination node ids
node_features = gpu(graph.ndata['feat'])
# node features
node_labels = gpu(graph.ndata['label'])
# node labels
train_mask = graph.ndata['train_mask']
# 1-D boolean mask over the nodes; True means the node belongs to the training set
valid_mask = graph.ndata['val_mask']
# boolean mask for the validation set
test_mask = graph.ndata['test_mask']
# boolean mask for the test set
n_nodes = graph.num_nodes()
# number of nodes
n_features = node_features.shape[1]
# shape[1] is the number of columns, i.e. the node feature dimension
n_labels = int(node_labels.max().item() + 1)
# item() extracts the value of a single-element tensor without changing its type;
# the number of label classes is obtained as the maximum label value + 1
adj = gpu(torch.zeros((n_nodes, n_nodes), dtype=bool))
adj[src, dst] = 1
adj[dst, src] = 1
# build an all-zero adjacency matrix, then fill in the edges and symmetrize it
if g_toy:
    # toy mode uses the whole graph; this seems to be the author's own debugging path
    mem = None
    train_src = gpu(src)
    train_dst = gpu(dst)
    mlp = gpu(nn.Linear(g_dim, n_labels))
    params = list(mlp.parameters())
    print('mlp params:', sum(p.numel() for p in params))
    mlp_opt = optim.Adam(params, lr=lr)
else:
    n = src.shape[0]
    # shape[0] is the size of the first dimension, i.e. the number of edges
    perm = torch.randperm(n)
    # random permutation of the edge indices
    val_num = int(0.05 * n)   # size of the validation split
    test_num = int(0.1 * n)   # size of the test split
    train_src = gpu(src[perm[val_num + test_num:]])
    train_dst = gpu(dst[perm[val_num + test_num:]])
    # the training set is whatever is left after the validation and test splits
    val_src = gpu(src[perm[:val_num]])
    val_dst = gpu(dst[perm[:val_num]])
    # validation split
    test_src = gpu(src[perm[val_num:val_num + test_num]])
    test_dst = gpu(dst[perm[val_num:val_num + test_num]])
    # test split
    train_src, train_dst = (torch.cat((train_src, train_dst)),
                            torch.cat((train_dst, train_src)))
    # torch.cat concatenates two tensors; each split is symmetrized this way
    val_src, val_dst = (torch.cat((val_src, val_dst)),
                        torch.cat((val_dst, val_src)))
    test_src, test_dst = (torch.cat((test_src, test_dst)),
                          torch.cat((test_dst, test_src)))
    mem = gpu(torch.zeros((n_nodes, n_nodes), dtype=bool))
    mem[train_src, train_dst] = 1
    # the memory is the symmetrized adjacency matrix of the training edges

total_aucs = []
total_aps = []
for run in range(10):
    torch.manual_seed(run)  # set the random seed
    mad = MAD(              # build the MAD model
        in_feats=n_features,
        n_nodes=n_nodes,
        node_feats=g_dim,
        n_samples=n_samples,
        mem=mem,
        feats=node_features,
        gather2neighbor=g_toy,
    )
    params = list(mad.parameters())
    print('params:', sum(p.numel() for p in params))
    # list() turns the parameter generator into a list; once the network is built,
    # all its parameters live in parameters(), and the total count is printed here
    opt = optim.Adam(params, lr=0.01)   # Adam optimizer
    best_aucs = [0, 0]
    best_aps = [0, 0]
    best_accs = [0, 0]
    for epoch in range(1, total_epoch + 1):
        mad.train()  # switch mad to training mode
        for perm in DataLoader(range(train_src.shape[0]), batch_size=1024, shuffle=True):
            # shuffle the training edge indices and iterate over batches of 1024
            opt.zero_grad()  # clear gradients left over from the previous step
            p_pos = mad(train_src[perm], train_dst[perm])
            # calls MAD.forward on the positive edges
            neg_src = gpu(torch.randint(0, n_nodes, (perm.shape[0], )))
            neg_dst = gpu(torch.randint(0, n_nodes, (perm.shape[0], )))
            # torch.randint draws random node ids in the given range, creating the
            # endpoints of candidate negative-sample edges
            idx = ~(mem[neg_src, neg_dst])
            # ~ is the negation operator: randomly drawn pairs that happen to be
            # training edges are filtered out of the negative samples
            p_neg = mad(neg_src[idx], neg_dst[idx])
            loss = (-torch.log(1e-5 + 1 - p_neg).mean()
                    - torch.log(1e-5 + p_pos).mean())
            loss.backward()
            opt.step()
        if epoch % 10:
            continue
        if g_toy:
            with torch.no_grad():
                embed = mad.f(node_features)
            for i in range(100):
                mlp.train()
                mlp_opt.zero_grad()
                logits = mlp(embed)
                loss = F.cross_entropy(logits[train_mask], node_labels[train_mask])
                loss.backward()
                mlp_opt.step()
            with torch.no_grad():
                logits = mlp(embed)
                _, indices = torch.max(logits[valid_mask], dim=1)
                labels = node_labels[valid_mask]
                v_acc = torch.sum(indices == labels).item() * 1.0 / len(labels)
                _, indices = torch.max(logits[test_mask], dim=1)
                labels = node_labels[test_mask]
                t_acc = torch.sum(indices == labels).item() * 1.0 / len(labels)
                if v_acc > best_accs[0]:
                    best_accs = [v_acc, t_acc]
                    print(epoch, 'acc:', v_acc, t_acc)
            continue
        with torch.no_grad():
            mad.eval()
            aucs = []
            aps = []
            for src, dst in ((val_src, val_dst), (test_src, test_dst)):
                p_pos = mad(src, dst)
                n = src.shape[0]
                perm = torch.randperm(n * 2)
                neg_src = torch.cat((src, gpu(torch.randint(0, n_nodes, (n, )))))[perm]
                neg_dst = torch.cat((gpu(torch.randint(0, n_nodes, (n, ))), dst))[perm]
                idx = ~(adj[neg_src, neg_dst])
                neg_src = neg_src[idx][:n]
                neg_dst = neg_dst[idx][:n]
                p_neg = mad(neg_src, neg_dst)
                y_true = cpu(torch.cat((p_pos * 0 + 1, p_neg * 0)))
                y_score = cpu(torch.cat((p_pos, p_neg)))
                fpr, tpr, _ = metrics.roc_curve(y_true, y_score, pos_label=1)
                # roc_curve returns:
                #   fpr: array of false positive rates as the threshold varies
                #   tpr: array of true positive rates as the threshold varies
                auc = metrics.auc(fpr, tpr)
                ap = metrics.average_precision_score(y_true, y_score)
                aucs.append(auc)
                aps.append(ap)
            if aucs[0] > best_aucs[0]:
                best_aucs = aucs
                print(epoch, 'auc:', aucs)
            if aps[0] > best_aps[0]:
                best_aps = aps
                print(epoch, 'ap:', aps)
    print(run, 'best auc:', best_aucs)
    print(run, 'best ap:', best_aucs)
    print(run, 'best acc (toy):', best_accs)
    total_aucs.append(best_aucs[1])
    total_aps.append(best_aps[1])
total_aucs = torch.tensor(total_aucs)
total_aps = torch.tensor(total_aps)
print('auc mean:', total_aucs.mean().item(), 'std:', total_aucs.std().item())
print('ap mean:', total_aps.mean().item(), 'std:', total_aps.std().item())

最后补上学长对MAD函数最后一个语句的分析,我确实想不到啊~
