Table of Contents

  • Dataset
    • PPI dataset information
      • toy-ppi-G.json: graph information
      • toy-ppi-class_map.json
      • toy-ppi-id_map.json
      • toy-ppi-walks.txt
      • toy-ppi-feats.npy
  • Experiment requirements
    • Installing Docker and running the program
    • Running
  • Code analysis
    • `__init__.py`
    • `utils.py`
    • `neigh_samplers.py`
    • `models.py`
    • `layers.py`
    • `minibatch.py`
    • `aggregators.py`
    • `prediction.py`
    • `supervised_train.py`
    • `unsupervised_train.py`
    • `inits.py`
  • Others
    • `citation_eval.py`
    • `ppi_eval.py`
    • `reddit_eval.py`
  • References

Dataset

Dataset   #Graphs  #Nodes   #Edges     #Features  #Labels (y)
Cora      1        2708     5429       1433       7
Citeseer  1        3327     4732       3703       6
Pubmed    1        19717    44338      500        3
PPI       24       56944    818716     50         121
Reddit    1        232965   11606919   602        41
Nell      1        65755    266144     61278      210

PPI dataset information

toy-ppi-G.json: graph information

{
  directed: false
  graph: {
    name: disjoint_union(,)
  }
  nodes: [
    {
      test: false
      id: 0
      features: [ ... ]
      val: false
      label: [ ... ]
    }
    {...}
    ...
  ]
  links: [
    {
      test_removed: false
      train_removed: false
      target: 800   # id of the target node (by default edges point from the smaller id to the larger one)
      source: 0     # id of the source node, listed in order starting from node 0
    }
    {...}
    ...
  ]
}
  • name: disjoint_union(,) is the name of the graph
  • toy-ppi-G.json contains only one graph (probably because node classification only needs a single graph; a graph classification task would need several)
  • As can be seen, this is an undirected graph made up of a nodes set and a links set; each set is a list, and every node or link in it is stored as a dictionary
  • The source downloaded from GitHub seems to have no links data? It is actually there; the file is simply too large to be displayed in full (for example, only nodes up to 1883 are shown, while there are 14754 in total)

toy-ppi-class_map.json

The format is: {"0": [1, 0, 0,…],…,"14754": [1, 1, 0, 0,…]}

toy-ppi-id_map.json

A one-to-one mapping between node ids and indices; the format is: {"0": 0, "1": 1,…, "14754": 14754}

toy-ppi-walks.txt

0    708
0   3163
0   276
0   1789
...
1   15
1   1455
1   1327
1   317
1   63
1   1420
...
9715    7369
9715    8983
9715    6983
  • Co-occurrence pairs from random walks starting at a node and reaching its neighbors; 198 pairs are taken for each node (so duplicates are possible). A minimal loading sketch is shown below.
  • For example, "0 708" means a walk from node 0 reached node 708.
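A minimal sketch (plain Python; the file path is assumed to be relative to the repo root) of how these walk pairs can be read:

# Read the random-walk co-occurrence pairs ("src dst" per line, whitespace separated).
pairs = []
with open("example_data/toy-ppi-walks.txt") as fp:
    for line in fp:
        src, dst = map(int, line.split())
        pairs.append((src, dst))

print(pairs[:3])   # e.g. [(0, 708), (0, 3163), (0, 276)]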

toy-ppi-feats.npy

The pre-trained node features.

The data is mainly handled with two functions:
(1) np.save("test.npy", data)   ---- save the data
(2) data = np.load("test.npy")  ---- load the data
For example, saving a list:

z = [[[1, 2, 3], ['w']], [[1, 2, 3], ['w']]]
np.save('test.npy', z)
x = np.load('test.npy')
x
-> array([[list([1, 2, 3]), list(['w'])],
          [list([1, 2, 3]), list(['w'])]], dtype=object)

For example, saving a dictionary:

x
-> {0: 'wpy', 1: 'scg'}
np.save('test.npy',x)
x = np.load('test.npy')
x
->array({0: 'wpy', 1: 'scg'}, dtype=object)

After loading data that was saved as a dictionary, first call
data.item()
to convert the numpy.ndarray object back into a dict.
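A minimal sketch of the round trip (note that recent NumPy versions additionally require allow_pickle=True when loading object arrays):

import numpy as np

d = {0: 'wpy', 1: 'scg'}
np.save('test.npy', d)                        # stored as a 0-d object array

data = np.load('test.npy', allow_pickle=True)
print(type(data))                             # <class 'numpy.ndarray'>
print(data.item())                            # {0: 'wpy', 1: 'scg'} -- back to a dict
print(type(data.item()))                      # <class 'dict'>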

Experiment requirements

  • The networkx version must be <= 1.11; mine was 2.3, which caused the following error:
  File "/home/yyl/桌面/Papers/GraphSage/GraphSAGE-master/graphsage/utils.py", line 20, in load_data
    G_data = json.load(open(prefix + "-G.json"))
FileNotFoundError: [Errno 2] No such file or directory: '-G.json'

Switching to version 1.11 fixes it: pip install networkx==1.11

Installing Docker and running the program

Reference: https://www.cnblogs.com/shiyublog/p/9858786.html

Running

Run unsupervised_train.py from the command line:

python -m graphsage.unsupervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10
which is equivalent to
python ./graphsage/unsupervised_train.py  --train_prefix ./example_data/toy-ppi --model graphsage_mean --max_total_steps 1000 --validate_iter 10

Note that the dataset path above differs from the one given officially. If you run it in PyCharm, you need to change the values of the train_prefix, model, etc. flags; also note that the flag format differs between the IDE and the command line. In the IDE:

flags.DEFINE_string('model', 'graphsage_mean', 'model names. See README for possible values.')
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi', 'prefix identifying training data. must be specified.')

Possible values of the model flag

  • graphsage_mean – GraphSage with mean-based aggregator
  • graphsage_seq – GraphSage with LSTM-based aggregator
  • graphsage_maxpool – GraphSage with max-pooling aggregator (as described in the NIPS 2017 paper)
  • graphsage_meanpool – GraphSage with mean-pooling aggregator (a variant of the pooling aggregator, where the element-wise mean replaces the element-wise max).
  • gcn – GraphSage with GCN-based aggregator
  • n2v – an implementation of DeepWalk (called n2v for short in the code.)
Output of the run:
Loading training data..
Removed 0 nodes that lacked proper annotations due to networkx versioning issues
Loaded data.. now preprocessing..
Done loading training data..
Unexpected missing: 0
9716 train nodes
5039 test nodes
2019-11-05 20:49:37.036227: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 20:49:37.075335: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-11-05 20:49:37.076045: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: yyl-Z370-HD3
2019-11-05 20:49:37.076058: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: yyl-Z370-HD3
2019-11-05 20:49:37.076154: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 430.34.0
2019-11-05 20:49:37.076205: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 430.34.0
2019-11-05 20:49:37.076214: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 430.34.0
Epoch: 0001
2019-11-05 20:49:41.997221: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.001801: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.018491: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 20:49:42.024759: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0000 train_loss= 18.78112 train_mrr= 0.25004 train_mrr_ema= 0.25004 val_loss= 19.16457 val_mrr= 0.21838 val_mrr_ema= 0.21838 time= 1.29920
2019-11-05 20:49:42.517739: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0050 train_loss= 18.49755 train_mrr= 0.18471 train_mrr_ema= 0.22649 val_loss= 18.71147 val_mrr= 0.24803 val_mrr_ema= 0.20979 time= 0.14266
Iter: 0100 train_loss= 18.56170 train_mrr= 0.16419 train_mrr_ema= 0.21142 val_loss= 18.31936 val_mrr= 0.18485 val_mrr_ema= 0.21664 time= 0.12880
Iter: 0150 train_loss= 17.91184 train_mrr= 0.19551 train_mrr_ema= 0.20265 val_loss= 18.08941 val_mrr= 0.18543 val_mrr_ema= 0.21273 time= 0.12436
Iter: 0200 train_loss= 17.14944 train_mrr= 0.22213 train_mrr_ema= 0.19652 val_loss= 17.85653 val_mrr= 0.18931 val_mrr_ema= 0.20871 time= 0.12251
Iter: 0250 train_loss= 17.14205 train_mrr= 0.18210 train_mrr_ema= 0.19319 val_loss= 17.21363 val_mrr= 0.20572 val_mrr_ema= 0.20632 time= 0.12138
Iter: 0300 train_loss= 16.96532 train_mrr= 0.18807 train_mrr_ema= 0.19088 val_loss= 16.77304 val_mrr= 0.17903 val_mrr_ema= 0.20116 time= 0.12090
Iter: 0350 train_loss= 16.62087 train_mrr= 0.18385 train_mrr_ema= 0.18813 val_loss= 16.53304 val_mrr= 0.21569 val_mrr_ema= 0.21001 time= 0.12049
Iter: 0400 train_loss= 16.15347 train_mrr= 0.19938 train_mrr_ema= 0.18743 val_loss= 16.20924 val_mrr= 0.20107 val_mrr_ema= 0.20434 time= 0.11976
Iter: 0450 train_loss= 15.92187 train_mrr= 0.18782 train_mrr_ema= 0.18764 val_loss= 16.09361 val_mrr= 0.21507 val_mrr_ema= 0.20851 time= 0.11877
Iter: 0500 train_loss= 15.61726 train_mrr= 0.20567 train_mrr_ema= 0.18762 val_loss= 15.82525 val_mrr= 0.20445 val_mrr_ema= 0.20857 time= 0.11862
Iter: 0550 train_loss= 15.49840 train_mrr= 0.18336 train_mrr_ema= 0.18751 val_loss= 15.63526 val_mrr= 0.21188 val_mrr_ema= 0.20637 time= 0.11833
Iter: 0600 train_loss= 15.29428 train_mrr= 0.18559 train_mrr_ema= 0.18814 val_loss= 15.43966 val_mrr= 0.18432 val_mrr_ema= 0.20902 time= 0.11812
Iter: 0650 train_loss= 15.22701 train_mrr= 0.18361 train_mrr_ema= 0.18770 val_loss= 15.30805 val_mrr= 0.18943 val_mrr_ema= 0.20266 time= 0.11810
Iter: 0700 train_loss= 15.12402 train_mrr= 0.17775 train_mrr_ema= 0.18618 val_loss= 15.19020 val_mrr= 0.19980 val_mrr_ema= 0.20185 time= 0.11768
Iter: 0750 train_loss= 14.99245 train_mrr= 0.18521 train_mrr_ema= 0.18568 val_loss= 15.03517 val_mrr= 0.21548 val_mrr_ema= 0.20307 time= 0.11793
Iter: 0800 train_loss= 14.90705 train_mrr= 0.19703 train_mrr_ema= 0.18656 val_loss= 14.99781 val_mrr= 0.22909 val_mrr_ema= 0.20756 time= 0.11749
Iter: 0850 train_loss= 14.89672 train_mrr= 0.17020 train_mrr_ema= 0.18719 val_loss= 14.98761 val_mrr= 0.21074 val_mrr_ema= 0.21088 time= 0.11691
Iter: 0900 train_loss= 14.79581 train_mrr= 0.20458 train_mrr_ema= 0.18726 val_loss= 14.86050 val_mrr= 0.20545 val_mrr_ema= 0.20940 time= 0.11659
Iter: 0950 train_loss= 14.78135 train_mrr= 0.17666 train_mrr_ema= 0.18843 val_loss= 14.79613 val_mrr= 0.20859 val_mrr_ema= 0.20693 time= 0.11645
Iter: 1000 train_loss= 14.75321 train_mrr= 0.19232 train_mrr_ema= 0.18724 val_loss= 14.77326 val_mrr= 0.21229 val_mrr_ema= 0.20599 time= 0.11612
Optimization Finished!
  • As can be seen, unsupervised_train.py ran only one epoch, 1000 iterations in total, with validation every 10 iterations
  • batch_size: 512
  • python -m graphsage.unsupervised_train runs it as a module, so no explicit path is needed
  • python ./graphsage/unsupervised_train.py runs it directly as a script file

Run supervised_train.py; note that the value of train_prefix also needs to be changed: …/example_data/toy-ppi

python -m graphsage.supervised_train --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid
which is equivalent to
python ./graphsage/supervised_train.py --train_prefix ./example_data/toy-ppi --model graphsage_mean --sigmoid

If the following error occurs:

  File "/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/cluster/supervised.py", line 14, in <module>from scipy.misc import comb
ImportError: cannot import name 'comb'

then edit lib\site-packages\sklearn\metrics\cluster\supervised.py and change from scipy.misc import comb to from scipy.special import comb.

Loading training data..
Removed 0 nodes that lacked proper annotations due to networkx versioning issues
Loaded data.. now preprocessing..
Done loading training data..
2019-11-05 21:37:11.763295: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 21:37:11.770739: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-11-05 21:37:11.770767: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: yyl-Z370-HD3
2019-11-05 21:37:11.770772: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: yyl-Z370-HD3
2019-11-05 21:37:11.770792: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 430.34.0
2019-11-05 21:37:11.770810: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 430.34.0
2019-11-05 21:37:11.770815: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 430.34.0
Epoch: 0001
2019-11-05 21:37:12.108357: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.112443: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.'precision', 'predicted', average, warn_for)
/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1076: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.'recall', 'true', average, warn_for)
Iter: 0000 train_loss= 0.69289 train_f1_mic= 0.34109 train_f1_mac= 0.29393 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.38797
2019-11-05 21:37:12.349277: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.353043: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
2019-11-05 21:37:12.411469: W tensorflow/core/framework/allocator.cc:122] Allocation of 25600000 exceeds 10% of system memory.
Iter: 0005 train_loss= 0.58413 train_f1_mic= 0.36509 train_f1_mac= 0.08875 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.11344
Iter: 0010 train_loss= 0.54682 train_f1_mic= 0.40012 train_f1_mac= 0.10253 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.08710
Iter: 0015 train_loss= 0.54642 train_f1_mic= 0.37319 train_f1_mac= 0.08884 val_loss= 0.66505 val_f1_mic= 0.37310 val_f1_mac= 0.13088 time= 0.07637
Epoch: 0002
..........
Epoch: 0010
Iter: 0004 train_loss= 0.48976 train_f1_mic= 0.49845 train_f1_mac= 0.27817 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05695
Iter: 0009 train_loss= 0.49231 train_f1_mic= 0.48284 train_f1_mac= 0.26799 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05683
Iter: 0014 train_loss= 0.48515 train_f1_mic= 0.50462 train_f1_mac= 0.28371 val_loss= 0.52274 val_f1_mic= 0.49617 val_f1_mac= 0.29091 time= 0.05664
Optimization Finished!
Full validation stats: loss= 0.51017 f1_micro= 0.52730 f1_macro= 0.32655 time= 0.18505
Writing test set stats to file (don't peak!)

Note

/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1076: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.'recall', 'true', average, warn_for)

Cause: some labels present in y_true are never predicted in y_pred; this warning is raised when the predictions contain no samples for a label that actually exists in the ground truth, and the F1 for that label is then treated as 0.
For example:
y_true = (0, 1, 2, 3, 4)
y_pred = (0, 1, 1, 3, 4)
Label '2' is never predicted, so the F-score cannot be computed for that label and is therefore set to 0.0.
However, since the average score over all classes must still include this 0 score, scikit-learn issues a warning to remind you; it is a warning, not an error.

/home/yyl/anaconda3/lib/python3.5/site-packages/sklearn/metrics/classification.py:1074: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.'precision', 'predicted', average, warn_for)
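A minimal sketch reproducing the warning with scikit-learn (toy labels, not the PPI data):

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 4]
y_pred = [0, 1, 1, 3, 4]   # label 2 is never predicted

# Emits UndefinedMetricWarning: the F-score of label 2 is set to 0.0,
# but it still counts toward the macro average.
print(f1_score(y_true, y_pred, average='macro'))   # ~0.7333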

Code analysis

__init__.py

from __future__ import print_function
# Even in Python 2.x, print must then be called with parentheses, as in Python 3.x.
from __future__ import division
# Imports the future language feature "division" (true division).
# Without this import, the "/" operator performs truncating division (in Python 2);
# with it, "/" performs true division and "//" performs truncating (floor) division.
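A quick illustration of the effect (without the import, the "/" result below would be 3 under Python 2):

from __future__ import division

print(7 / 2)    # 3.5  -- true division
print(7 // 2)   # 3    -- truncating (floor) division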

utils.py

(1) Difference between type() and isinstance()
isinstance() checks whether an object is of a given (known) type, similar to type().
isinstance(object, classinfo)
Parameters
object – the instance to check.
classinfo – a direct or indirect class name, a basic type, or a tuple of them.

Return value
Returns True if the type of the object matches classinfo, otherwise False.

>>>a = 2
>>> isinstance (a,int)
True
>>> isinstance (a,str)
False
>>> isinstance (a,(str,int,list))    # True if a matches any type in the tuple
True

type() does not consider a subclass to be a kind of its parent class; inheritance is ignored.
isinstance() does consider a subclass to be a kind of its parent class; inheritance is taken into account.

class A:
    pass

class B(A):
    pass

isinstance(A(), A)    # returns True
type(A()) == A        # returns True
isinstance(B(), A)    # returns True
type(B()) == A        # returns False

(2) G.nodes()
Returns the graph's nodes n (and, with data=True, the node attributes nodedata).

    G_data = json.load(open(prefix + "-G.json"))
    G = json_graph.node_link_graph(G_data)  # Return graph from node-link data format
    print("G.nodes():", G.nodes)
    # G.nodes(): <bound method Graph.nodes of <networkx.classes.graph.Graph object at 0x7f9a6af9cf60>>
    print("list(G.nodes):", list(G))
    # list(G.nodes): [0, 1, 2, 3, 4,...,14754]
    # print("G.nodes.data():", G.nodes.data())  # AttributeError: 'function' object has no attribute 'data'
    print(list(G.nodes(data=True)))  # with nodedata
    # roughly of the form: [(0, {'val': False, 'label': [...], ...}), (1, {...}), (2, {...}), ...]
    # Check whether G.nodes()[0] is an int (i.e. without nodedata)
    if isinstance(G.nodes()[0], int):
        # if so, convert n to int
        conversion = lambda n : int(n)
    else:
        conversion = lambda n : n

(3) G.edges()
The code iterates over G.edges(); each iteration takes one tuple from the list, and edge[0], edge[1] are its two endpoints.
If at least one of the two endpoints has its val / test attribute set, the edge's 'train_removed' is set to True, otherwise to False.
This operation guarantees that 'train_removed' is never missing.

    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
                G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False

G.edges() returns an edge_list, [( , ), ( , ), … ( , )]. Each element of the list holds the two endpoints of one edge. If data = True is set, edge attributes such as weights are shown as well.

>>> G = nx.Graph()   # or DiGraph, MultiGraph, MultiDiGraph, etc
>>> G.add_path([0,1,2])
>>> G.add_edge(2,3,weight=5)
>>> G.edges()
[(0, 1), (1, 2), (2, 3)]
>>> G.edges(data=True) # default edge data is {} (empty dictionary)
[(0, 1, {}), (1, 2, {}), (2, 3, {'weight': 5})]
>>> list(G.edges_iter(data='weight', default=1))
[(0, 1, 1), (1, 2, 1), (2, 3, 5)]
>>> G.edges([0,3])
[(0, 1), (3, 2)]
>>> G.edges(0)
[(0, 1)]

(4) Getting the training features and standardizing them
Here "if not feats is None" is equivalent to "if feats is not None".
Nodes whose val and test flags are both False are selected as training data; their indices in the feature matrix are obtained via id_map and collected in the train_ids array.
train_feats then gathers the features of these nodes using the indices in train_ids.

if normalize and not feats is None:
    from sklearn.preprocessing import StandardScaler
    train_ids = np.array([id_map[n] for n in G.nodes()
                          if not G.node[n]['val'] and not G.node[n]['test']])
    train_feats = feats[train_ids]
    scaler = StandardScaler()
    scaler.fit(train_feats)
    feats = scaler.transform(feats)

(5) Usage of StandardScaler
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
Methods:

  • fit(X[, y]): Compute the mean and std to be used for later scaling, i.e. the column means and column standard deviations of the data to be preprocessed
  • transform(X[, y, copy]): Perform standardization by centering and scaling, producing the standardized matrix. fit must be called first; transform then standardizes using the mean and std computed by fit: (x_i - mean) / std
  • fit_transform(X[, y]): Fit to data, then transform it; equivalent to fit followed by transform
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler(copy=True, with_mean=True, with_std=True)
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
# Computed as follows:
# mean: [0.5, 0.5]
# variance: 1/4 * [(0 - 0.5)^2 * 2 + (1 - 0.5)^2 * 2] = 1/4 = 0.25
# std: 0.5
# standardizing [2, 2]: (2 - 0.5) / 0.5 = 3

(6) Usage of map()
map(function, iterable, …)
map() maps the given function over the given sequence(s).
The first argument, function, is called with each element of the sequence, and a new list containing the return value of each call is produced (in Python 2; see the note after the example).
Example:

>>> def square(x):             # compute the square
...     return x ** 2
...
>>> map(square, [1,2,3,4,5])   # square every element of the list
[1, 4, 9, 16, 25]
>>> map(lambda x: x ** 2, [1, 2, 3, 4, 5])  # using an anonymous lambda
[1, 4, 9, 16, 25]
# With two lists, the elements at the same positions are added together
>>> map(lambda x, y: x + y, [1, 3, 5, 7, 9], [2, 4, 6, 8, 10])
[3, 7, 11, 15, 19]
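The outputs above are Python 2 behavior; in Python 3, map() returns a lazy iterator (which is why the walks printed in utils.py appear as <map object at 0x...>), so list() is needed to materialize the result:

squares = map(lambda x: x ** 2, [1, 2, 3, 4, 5])
print(squares)        # <map object at 0x...> -- an iterator, not a list
print(list(squares))  # [1, 4, 9, 16, 25]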

Complete code of utils.py

from __future__ import print_function

import numpy as np
import random
import json
import sys
import os

import networkx as nx
from networkx.readwrite import json_graph

version_info = list(map(int, nx.__version__.split('.')))
major = version_info[0]
minor = version_info[1]
assert (major <= 1) and (minor <= 11), "networkx major version > 1.11"

WALK_LEN=5
N_WALKS=50

def load_data(prefix, normalize=True, load_walks=False):
    G_data = json.load(open(prefix + "-G.json"))
    G = json_graph.node_link_graph(G_data)  # Return graph from node-link data format
    print("G.nodes():", G.nodes)
    # G.nodes(): <bound method Graph.nodes of <networkx.classes.graph.Graph object at 0x7f9a6af9cf60>>
    # print("list(G.nodes):", list(G))
    # list(G.nodes): [0, 1, 2, 3, 4,...,14754]
    # print("G.nodes.data():", G.nodes.data())  # AttributeError: 'function' object has no attribute 'data'
    # print(list(G.nodes(data=True)))  # with nodedata
    # roughly of the form: [(0, {'val': False, 'label': [...], ...}), (1, {...}), (2, {...}), ...]

    # Define the conversion function:
    # check whether G.nodes()[0] is an int (i.e. without nodedata)
    if isinstance(G.nodes()[0], int):
        # if so, convert n to int
        conversion = lambda n : int(n)
    else:
        conversion = lambda n : n

    # Node features are stored in a .npy file; if it does not exist, feats = None
    if os.path.exists(prefix + "-feats.npy"):
        feats = np.load(prefix + "-feats.npy")
    else:
        print("No features present.. Only identity features will be used.")
        feats = None

    id_map = json.load(open(prefix + "-id_map.json"))
    # print("id_map:", id_map)
    # id_map: {'143': 143, '10758': 10758, '13438': 13438, ...
    # In id_map the keys k are str and the values v are int; convert everything to int
    id_map = {conversion(k): int(v) for k, v in id_map.items()}
    # print("id_map:", id_map)
    # id_map: {0: 0, 1: 1, 2: 2...
    walks = []

    class_map = json.load(open(prefix + "-class_map.json"))
    # print("class_map:", class_map)
    # {"0": [1, 0, 0,...],...,"14754": [1, 1, 0, 0,...]}
    if isinstance(list(class_map.values())[0], list):
        lab_conversion = lambda n : n
    else:
        lab_conversion = lambda n : int(n)
    # As above, convert the keys and values of class_map and return the new class_map
    class_map = {conversion(k): lab_conversion(v) for k, v in class_map.items()}
    # print("class_map:", class_map)
    # {0: [1, 0, 0,...],...,14754: [1, 1, 0, 0,...]}

    # The nodes removed here are the ones that lack the 'val' or 'test' attribute entirely,
    # not the ones whose 'val'/'test' value is None.
    # Note the different meaning of "if not 'val' in G.node[node]" versus "if not G.node[n]['val']".
    ## Remove all nodes that do not have val/test annotations
    ## (necessary because of networkx weirdness with the Reddit data)
    broken_count = 0  # counts the removed nodes that had no val or test attribute
    for node in G.nodes():
        if not 'val' in G.node[node] or not 'test' in G.node[node]:
            G.remove_node(node)
            broken_count += 1
    print("Removed {:d} nodes that lacked proper annotations due to networkx versioning issues".format(broken_count))

    ## Make sure the graph has edge train_removed annotations
    ## (some datasets might already have this..)
    # The loop takes one tuple from the edge list at a time; edge[0] and edge[1] are the two endpoints.
    # If at least one endpoint has val / test set, the edge's 'train_removed' is set to True, otherwise False.
    # This guarantees that 'train_removed' is never missing.
    print("Loaded data.. now preprocessing..")
    for edge in G.edges():
        if (G.node[edge[0]]['val'] or G.node[edge[1]]['val'] or
                G.node[edge[0]]['test'] or G.node[edge[1]]['test']):
            G[edge[0]][edge[1]]['train_removed'] = True
        else:
            G[edge[0]][edge[1]]['train_removed'] = False

    # Get the training features and standardize them
    # Here "if not feats is None" is equivalent to "if feats is not None"
    if normalize and not feats is None:
        from sklearn.preprocessing import StandardScaler
        # Nodes whose val and test flags are both False are selected as training data;
        # their indices in the feature matrix are looked up via id_map and collected in train_ids.
        # train_feats gathers the features of these nodes by the indices in train_ids.
        train_ids = np.array([id_map[n] for n in G.nodes()
                              if not G.node[n]['val'] and not G.node[n]['test']])
        train_feats = feats[train_ids]
        ## Standardize the data so that every feature dimension has zero mean and unit variance,
        ## so that dimensions with large values do not dominate the predictions
        scaler = StandardScaler()
        scaler.fit(train_feats)
        feats = scaler.transform(feats)

    # If load_walks is True, load the random walks
    if load_walks:
        with open(prefix + "-walks.txt") as fp:
            for line in fp:
                # map converts the walk entries with conversion (to int here);
                # walks starts as [] and each appended element is a pair of co-occurring nodes
                walks.append(map(conversion, line.split()))
        # print("walks:", walks)
        # [<map object at 0x7f5bc0d68da0>, <map object at 0x7f5bc0d68e48>, <map object at 0x7f5bc0d68f28>....]
        print("len(walks):", len(walks))
        # len(walks): 1895817

    return G, feats, id_map, walks, class_map

def run_random_walks(G, nodes, num_walks=N_WALKS):
    pairs = []
    for count, node in enumerate(nodes):
        if G.degree(node) == 0:
            continue
        for i in range(num_walks):
            curr_node = node
            for j in range(WALK_LEN):
                next_node = random.choice(G.neighbors(curr_node))
                # self co-occurrences are useless
                if curr_node != node:
                    pairs.append((node, curr_node))
                curr_node = next_node
        if count % 1000 == 0:
            print("Done walks for", count, "nodes")
    return pairs

if __name__ == "__main__":
    """ Run random walks """
    graph_file = sys.argv[1]
    out_file = sys.argv[2]
    G_data = json.load(open(graph_file))
    G = json_graph.node_link_graph(G_data)
    nodes = [n for n in G.nodes() if not G.node[n]["val"] and not G.node[n]["test"]]
    G = G.subgraph(nodes)
    pairs = run_random_walks(G, nodes)
    with open(out_file, "w") as fp:
        fp.write("\n".join([str(p[0]) + "\t" + str(p[1]) for p in pairs]))

neigh_samplers.py

Defines the uniform sampler that samples from a node's neighbors.
Uniform: tf.random_shuffle shuffles along dimension 0, i.e. it shuffles rows, which is what makes the subsequent sampling "uniform"; to apply it to the neighbor columns, the matrix is transposed before and after the shuffle.
Sampling: the slice then amounts to randomly picking num_samples neighbors while keeping all of their attributes.
The resulting adj_lists is the uniformly sub-sampled matrix of neighbor information. A toy NumPy analogue is sketched below.
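A toy NumPy analogue (made-up adjacency values) of the transpose-shuffle-transpose-slice trick:

import numpy as np

adj_lists = np.array([[ 1,  2,  3,  4],
                      [ 5,  6,  7,  8],
                      [ 9, 10, 11, 12]])   # row i: padded neighbor ids of node i
num_samples = 2

# np.random.permutation (like tf.random_shuffle) only shuffles the first axis,
# so transposing first shuffles the neighbor columns instead of the node rows.
shuffled = np.random.permutation(adj_lists.T).T
sampled = shuffled[:, :num_samples]        # keep num_samples neighbors per node
print(sampled)                             # e.g. [[ 3  1] [ 7  5] [11  9]]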

Complete code of neigh_samplers.py

from __future__ import division
from __future__ import print_function

from graphsage.layers import Layer

import tensorflow as tf
flags = tf.app.flags
FLAGS = flags.FLAGS

"""
Classes that are used to sample node neighborhoods
"""

class UniformNeighborSampler(Layer):
    """
    Uniformly samples neighbors.
    Assumes that adj lists are padded with random re-sampling
    """
    def __init__(self, adj_info, **kwargs):
        super(UniformNeighborSampler, self).__init__(**kwargs)
        self.adj_info = adj_info

    def _call(self, inputs):
        ids, num_samples = inputs
        # look up the adjacency row (neighbor list) of each id in adj_info
        adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
        adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
        adj_lists = tf.slice(adj_lists, [0, 0], [-1, num_samples])
        return adj_lists

models.py

(1) namedtuple: a named tuple, i.e. a tuple whose fields can be accessed by name
Example:

import collections

MyTupleClass = collections.namedtuple('MyTupleClass', ['name', 'age', 'job'])
obj = MyTupleClass("Tomsom", 12, 'Cooker')
print(obj.name)  # Tomsom
print(obj.age)   # 12
print(obj.job)   # Cooker

# Field names can also be given as one space-separated string,
# meaning this namedtuple has three fields
Person = collections.namedtuple('Person', 'name age gender')
print('Type of Person:', type(Person))  # Type of Person: <class 'type'>
Bob = Person(name='Bob', age=30, gender='male')
print('Representation:', Bob)  # Representation: Person(name='Bob', age=30, gender='male')
Jane = Person(name='Jane', age=29, gender='female')
print('Field by Name:', Jane.name)  # Field by Name: Jane
for people in [Bob, Jane]:
    print("%s is %d years old %s" % people)
# Bob is 30 years old male
# Jane is 29 years old female

# Note that field names may not be Python keywords (such as class, def)
# and may not be duplicated (e.g. two 'age' fields); otherwise an error is raised.
# In practice this cannot always be avoided, e.g. when field names come from database records,
# so namedtuple offers a rename mode: with rename=True, keywords and duplicate names
# are renamed automatically.
with_class = collections.namedtuple('Person', 'name age class gender', rename=True)
print(with_class._fields)  # ('name', 'age', '_2', 'gender')
two_ages = collections.namedtuple('Person', 'name age gender age', rename=True)
print(two_ages._fields)    # ('name', 'age', 'gender', '_3')
# With rename=True:
# in the first tuple, class is renamed to '_2';
# in the second tuple, the duplicated age is renamed to '_3'.
# Renaming uses an underscore plus the field's index.

(2) The main methods of class SampleAndAggregate(GeneralizedModel)

  • def __init__(self, placeholders, features, adj, degrees, layer_infos, concat=True, aggregator_type="mean", model_size="small", identity_dim=0, **kwargs)
  • def sample(self, inputs, layer_infos, batch_size=None)
  • def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None, aggregators=None, name=None, concat=False, model_size="small")
  • def _build(self)
  • def build(self)
  • def _loss(self)
  • def _accuracy(self)

(3) The classes and their inheritance relations

              Model
             /     \
           MLP    GeneralizedModel
                    /          \
         Node2VecModel    SampleAndAggregate

  • The difference between Model and GeneralizedModel is that Model's build() constructs a sequential layer model, which is removed in GeneralizedModel
  • self.outputs must be assigned in build() of GeneralizedModel's subclasses
  • The sequential-layer idiom takes an input, obtains an output via layer(), feeds that output as the input of the next layer(), and so on; the result of the last layer is taken as the final output (see the sketch below)
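A minimal runnable sketch of that idiom; the loop mirrors the one in Model.build(), while the "layers" here are just toy callables:

# Toy stand-ins for layers: each "layer" is simply a callable.
layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
inputs = 5

# The sequential-layer idiom from Model.build():
activations = [inputs]
for layer in layers:
    hidden = layer(activations[-1])   # each layer consumes the previous layer's output
    activations.append(hidden)
outputs = activations[-1]             # the last layer's result is the model output

print(activations)                    # [5, 6, 12, 9]
print(outputs)                        # 9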

(4) class SampleAndAggregate(GeneralizedModel)

  • Where self.features in __init__() comes from:

    features (parameter)      tf.get_variable() -> identity features
            |                               |
      self.features                   self.embeds     --> at least one of them is not None
              \                            /           --> concatenated if both are not None
               \                          /
                      self.features

  • self.dims in __init__():
    self.dims is a list; each entry records the dimensionality of one layer of the network.
    self.dims[0] equals the number of columns of self.features: (0 if features is None else features.shape[1]) + identity_dim (note: features here is the constructor parameter, not self.features).
    The following entries are the output_dim of each layer, i.e. the number of hidden units.

  • sample(inputs, layer_infos, batch_size=None)
    The sampling algorithm is described in detail in Appendix A, Algorithm 2 of the paper.

sampler = layer_infos[t].neigh_sampler

When the function is called, layer_infos has been assigned; in unsupervised_train.py, neigh_sampler is set to UniformNeighborSampler, which is defined in neigh_samplers.py as class UniformNeighborSampler(Layer).

The goal is, for the input samples[k] (the nodes obtained in the previous sampling step, where samples[k] is obtained by sampling the neighbors of the nodes in samples[k-1]), to pick the ids of num_samples neighbor nodes (corresponding to N(u) in the paper). (The return value is adj_lists, i.e. the adjacency matrix truncated to num_samples columns.)

Note the difference between support_size and num_samples:

num_samples is the number of neighbor nodes sampled for each node u at the current depth;

support_size is the number of nodes whose information influences the embedding of the current node u. It is influenced by its num_samples direct neighbors at the current layer, and those neighbors are in turn influenced by their num_samples neighbors at the earlier depth, and so on. Hence support_size is the running product of num_samples over all depths so far. For batch_size input nodes, the total number of support nodes is support_size * batch_size.

Finally, each support_size is appended to the support_sizes array.

sample() eventually returns the samples array containing the sampled nodes at each depth, and the support_sizes array containing the number of supporting nodes at each depth. A toy walkthrough follows.
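A toy walkthrough (made-up num_samples values) of how sample() accumulates support_sizes:

# e.g. two layers with layer_infos[0].num_samples = 25 and layer_infos[1].num_samples = 10
num_samples = [25, 10]
batch_size = 512

support_size = 1
support_sizes = [support_size]
for k in range(len(num_samples)):
    t = len(num_samples) - k - 1          # layers are traversed from the last to the first
    support_size *= num_samples[t]
    support_sizes.append(support_size)

print(support_sizes)                      # [1, 10, 250]
print([s * batch_size for s in support_sizes])  # total support nodes per depth: [512, 5120, 128000]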

(5)tf.nn.fixed_unigram_candidate_sampler

(1) Purpose:
Samples classes according to a user-provided probability distribution.
If the classes follow a uniform distribution, use uniform_candidate_sampler;
if words are the classes and we know words follow a Zipfian distribution, use log_uniform_candidate_sampler;
if the class distribution is known from statistics or other sources, use nn.fixed_unigram_candidate_sampler;
if the class distribution is really unknown, use tf.nn.learned_unigram_candidate_sampler.

(2) Parameters:
a. num_sampled: the sampled candidates are drawn without replacement (if unique = True) or with replacement (if unique = False) from the base distribution; unique = True can be seen as sampling without replacement, unique = False as sampling with replacement.
b. distortion: the distortion used in the word2vec frequency energy table formulation,
f^(3/4) / total(f^(3/4));
in word2vec the energy is counted by word frequency, while in GraphSAGE it is counted by node degrees,
so in unigrams = [] each entry records one node's degree.
c. unigrams: the degree of each node.

(3) Returns:
a. sampled_candidates: A tensor of type int64 and shape [num_sampled]. The sampled classes.
b. true_expected_count: A tensor of type float. Same shape as true_classes. The expected counts under the sampling distribution of each of true_classes.
c. sampled_expected_count: A tensor of type float. Same shape as sampled_candidates. The expected counts under the sampling distribution of each of sampled_candidates.
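A minimal TF 1.x sketch of this sampler used for degree-based negative sampling; the degrees and labels below are made-up toy values, not the PPI data:

import tensorflow as tf

degrees = [3.0, 1.0, 4.0, 2.0]                    # per-node degrees act as unnormalized probabilities
labels = tf.constant([[1], [3]], dtype=tf.int64)  # "true" context nodes for a batch of size 2

neg_samples, true_exp, sampled_exp = tf.nn.fixed_unigram_candidate_sampler(
    true_classes=labels,
    num_true=1,
    num_sampled=5,          # draw 5 negative samples
    unique=False,           # sampling with replacement
    range_max=len(degrees),
    distortion=0.75,        # flatten the distribution, as in word2vec
    unigrams=degrees)

with tf.Session() as sess:
    print(sess.run(neg_samples))   # e.g. [2 0 2 3 0]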

Complete code of models.py

from collections import namedtupleimport tensorflow as tf
import mathimport graphsage.layers as layers
import graphsage.metrics as metricsfrom .prediction import BipartiteEdgePredLayer
from .aggregators import MeanAggregator, MaxPoolingAggregator, MeanPoolingAggregator, SeqAggregator, GCNAggregatorflags = tf.app.flags
FLAGS = flags.FLAGS# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras packageclass Model(object):def __init__(self, **kwargs):allowed_kwargs = {'name', 'logging', 'model_size'}for kwarg in kwargs.keys():assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwargname = kwargs.get('name')if not name:name = self.__class__.__name__.lower()self.name = namelogging = kwargs.get('logging', False)self.logging = loggingself.vars = {}self.placeholders = {}self.layers = []self.activations = []self.inputs = Noneself.outputs = Noneself.loss = 0self.accuracy = 0self.optimizer = Noneself.opt_op = Nonedef _build(self):raise NotImplementedErrordef build(self):""" Wrapper for _build() """with tf.variable_scope(self.name):self._build()# Build sequential layer modelself.activations.append(self.inputs)for layer in self.layers:hidden = layer(self.activations[-1])self.activations.append(hidden)self.outputs = self.activations[-1]# 这部分sequential layer model模型在GeneralizedModel的build()中被删去# Store model variables for easy accessvariables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)self.vars = {var.name: var for var in variables}# Build metricsself._loss()self._accuracy()self.opt_op = self.optimizer.minimize(self.loss)def predict(self):passdef _loss(self):raise NotImplementedErrordef _accuracy(self):raise NotImplementedErrordef save(self, sess=None):if not sess:raise AttributeError("TensorFlow session not provided.")saver = tf.train.Saver(self.vars)save_path = saver.save(sess, "tmp/%s.ckpt" % self.name)print("Model saved in file: %s" % save_path)def load(self, sess=None):if not sess:raise AttributeError("TensorFlow session not provided.")saver = tf.train.Saver(self.vars)save_path = "tmp/%s.ckpt" % self.namesaver.restore(sess, save_path)print("Model restored from file: %s" % save_path)class MLP(Model):""" A standard multi-layer perceptron """def __init__(self, placeholders, dims, categorical=True, **kwargs):super(MLP, self).__init__(**kwargs)self.dims = dimsself.input_dim = dims[0]self.output_dim = dims[-1]self.placeholders = placeholdersself.categorical = categoricalself.inputs = placeholders['features']self.labels = placeholders['labels']self.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)self.build()def _loss(self):# Weight decay lossfor var in self.layers[0].vars.values():self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)# Cross entropy errorif self.categorical:self.loss += metrics.masked_softmax_cross_entropy(self.outputs, self.placeholders['labels'],self.placeholders['labels_mask'])# L2else:diff = self.labels - self.outputsself.loss += tf.reduce_sum(tf.sqrt(tf.reduce_sum(diff * diff, axis=1)))def _accuracy(self):if self.categorical:self.accuracy = metrics.masked_accuracy(self.outputs, self.placeholders['labels'],self.placeholders['labels_mask'])def _build(self):self.layers.append(layers.Dense(input_dim=self.input_dim,output_dim=self.dims[1],act=tf.nn.relu,dropout=self.placeholders['dropout'],sparse_inputs=False,logging=self.logging))self.layers.append(layers.Dense(input_dim=self.dims[1],output_dim=self.output_dim,act=lambda x: x,dropout=self.placeholders['dropout'],logging=self.logging))def predict(self):return tf.nn.softmax(self.outputs)class GeneralizedModel(Model):"""Base class for models that aren't constructed from traditional, sequential layers.Subclasses must set self.outputs in _build method(Removes the layers idiom from build method of the Model class)"""def __init__(self, **kwargs):super(GeneralizedModel, self).__init__(**kwargs)def build(self):""" Wrapper for _build() 
"""with tf.variable_scope(self.name):self._build()# Store model variables for easy accessvariables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.name)self.vars = {var.name: var for var in variables}# Build metricsself._loss()self._accuracy()self.opt_op = self.optimizer.minimize(self.loss)# SAGEInfo is a namedtuple that specifies the parameters of the recursive GraphSAGE layers
SAGEInfo = namedtuple("SAGEInfo",['layer_name', # name of the layer (to get feature embedding etc.)'neigh_sampler', # callable neigh_sampler constructor'num_samples','output_dim' # the output (i.e., hidden) dimension])class SampleAndAggregate(GeneralizedModel):"""Base implementation of unsupervised GraphSAGE"""def __init__(self, placeholders, features, adj, degrees,layer_infos, concat=True, aggregator_type="mean", model_size="small", identity_dim=0,**kwargs):'''Args:- placeholders: Stanford TensorFlow placeholder object.- features: Numpy array with node features. NOTE: Pass a None object to train in featureless mode (identity features for nodes)!- adj: Numpy array with adjacency lists (padded with random re-samples)- degrees: Numpy array with node degrees. - layer_infos: List of SAGEInfo namedtuples that describe the parameters of all the recursive layers. See SAGEInfo definition above.- concat: whether to concatenate during recursive iterations- aggregator_type: how to aggregate neighbor information- model_size: one of "small" and "big"- identity_dim: Set to positive int to use identity features (slow and cannot generalize, but better accuracy)'''super(SampleAndAggregate, self).__init__(**kwargs)if aggregator_type == "mean":self.aggregator_cls = MeanAggregatorelif aggregator_type == "seq":self.aggregator_cls = SeqAggregatorelif aggregator_type == "maxpool":self.aggregator_cls = MaxPoolingAggregatorelif aggregator_type == "meanpool":self.aggregator_cls = MeanPoolingAggregatorelif aggregator_type == "gcn":self.aggregator_cls = GCNAggregatorelse:raise Exception("Unknown aggregator: ", self.aggregator_cls)# get info from placeholders...self.inputs1 = placeholders["batch1"]self.inputs2 = placeholders["batch2"]self.model_size = model_sizeself.adj_info = adjif identity_dim > 0:self.embeds = tf.get_variable("node_embeddings", [adj.get_shape().as_list()[0], identity_dim])else:self.embeds = Noneif features is None: if identity_dim == 0:raise Exception("Must have a positive value for identity feature dimension if no input features given.")self.features = self.embedselse:self.features = tf.Variable(tf.constant(features, dtype=tf.float32), trainable=False)if not self.embeds is None:self.features = tf.concat([self.embeds, self.features], axis=1)self.degrees = degreesself.concat = concatself.dims = [(0 if features is None else features.shape[1]) + identity_dim]self.dims.extend([layer_infos[i].output_dim for i in range(len(layer_infos))])self.batch_size = placeholders["batch_size"]self.placeholders = placeholdersself.layer_infos = layer_infosself.optimizer = tf.train.AdamOptimizer(learning_rate=FLAGS.learning_rate)self.build()def sample(self, inputs, layer_infos, batch_size=None):""" Sample neighbors to be the supportive fields for multi-layer convolutions.Args:inputs: batch inputsbatch_size: the number of inputs (different for batch inputs and negative samples)."""if batch_size is None:batch_size = self.batch_sizesamples = [inputs]# size of convolution support at each layer per nodesupport_size = 1support_sizes = [support_size]for k in range(len(layer_infos)):t = len(layer_infos) - k - 1support_size *= layer_infos[t].num_samplessampler = layer_infos[t].neigh_samplernode = sampler((samples[k], layer_infos[t].num_samples))samples.append(tf.reshape(node, [support_size * batch_size,]))support_sizes.append(support_size)return samples, support_sizesdef aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,aggregators=None, name=None, concat=False, model_size="small"):""" 
At each layer, aggregate hidden representations of neighbors to compute the hidden representations at next layer.Args:samples: a list of samples of variable hops away for convolving at each layer of thenetwork. Length is the number of layers + 1. Each is a vector of node indices.input_features: the input features for each sample of various hops away.dims: a list of dimensions of the hidden representations from the input layer to thefinal layer. Length is the number of layers + 1.num_samples: list of number of samples for each layer.support_sizes: the number of nodes to gather information from for each layer.batch_size: the number of inputs (different for batch inputs and negative samples).Returns:The hidden representation at the final layer for all nodes in batch"""if batch_size is None:batch_size = self.batch_size# length: number of layers + 1hidden = [tf.nn.embedding_lookup(input_features, node_samples) for node_samples in samples]new_agg = aggregators is Noneif new_agg:aggregators = []for layer in range(len(num_samples)):if new_agg:dim_mult = 2 if concat and (layer != 0) else 1# aggregator at current layerif layer == len(num_samples) - 1:aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1], act=lambda x : x,dropout=self.placeholders['dropout'], name=name, concat=concat, model_size=model_size)else:aggregator = self.aggregator_cls(dim_mult*dims[layer], dims[layer+1],dropout=self.placeholders['dropout'], name=name, concat=concat, model_size=model_size)aggregators.append(aggregator)else:aggregator = aggregators[layer]# hidden representation at current layer for all support nodes that are various hops awaynext_hidden = []# as layer increases, the number of support nodes needed decreasesfor hop in range(len(num_samples) - layer):dim_mult = 2 if concat and (layer != 0) else 1neigh_dims = [batch_size * support_sizes[hop], num_samples[len(num_samples) - hop - 1], dim_mult*dims[layer]]h = aggregator((hidden[hop],tf.reshape(hidden[hop + 1], neigh_dims)))next_hidden.append(h)hidden = next_hiddenreturn hidden[0], aggregatorsdef _build(self):labels = tf.reshape(tf.cast(self.placeholders['batch2'], dtype=tf.int64),[self.batch_size, 1])self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(true_classes=labels,num_true=1,num_sampled=FLAGS.neg_sample_size,unique=False,range_max=len(self.degrees),distortion=0.75,unigrams=self.degrees.tolist()))# perform "convolution"samples1, support_sizes1 = self.sample(self.inputs1, self.layer_infos)samples2, support_sizes2 = self.sample(self.inputs2, self.layer_infos)num_samples = [layer_info.num_samples for layer_info in self.layer_infos]self.outputs1, self.aggregators = self.aggregate(samples1, [self.features], self.dims, num_samples,support_sizes1, concat=self.concat, model_size=self.model_size)self.outputs2, _ = self.aggregate(samples2, [self.features], self.dims, num_samples,support_sizes2, aggregators=self.aggregators, concat=self.concat,model_size=self.model_size)neg_samples, neg_support_sizes = self.sample(self.neg_samples, self.layer_infos,FLAGS.neg_sample_size)self.neg_outputs, _ = self.aggregate(neg_samples, [self.features], self.dims, num_samples,neg_support_sizes, batch_size=FLAGS.neg_sample_size, aggregators=self.aggregators,concat=self.concat, model_size=self.model_size)dim_mult = 2 if self.concat else 1self.link_pred_layer = BipartiteEdgePredLayer(dim_mult*self.dims[-1],dim_mult*self.dims[-1], self.placeholders, act=tf.nn.sigmoid, bilinear_weights=False,name='edge_predict')self.outputs1 = tf.nn.l2_normalize(self.outputs1, 
1)self.outputs2 = tf.nn.l2_normalize(self.outputs2, 1)self.neg_outputs = tf.nn.l2_normalize(self.neg_outputs, 1)def build(self):self._build()# TF graph managementself._loss()self._accuracy()self.loss = self.loss / tf.cast(self.batch_size, tf.float32)grads_and_vars = self.optimizer.compute_gradients(self.loss)clipped_grads_and_vars = [(tf.clip_by_value(grad, -5.0, 5.0) if grad is not None else None, var) for grad, var in grads_and_vars]self.grad, _ = clipped_grads_and_vars[0]self.opt_op = self.optimizer.apply_gradients(clipped_grads_and_vars)def _loss(self):for aggregator in self.aggregators:for var in aggregator.vars.values():self.loss += FLAGS.weight_decay * tf.nn.l2_loss(var)self.loss += self.link_pred_layer.loss(self.outputs1, self.outputs2, self.neg_outputs) tf.summary.scalar('loss', self.loss)def _accuracy(self):# shape: [batch_size]aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)# shape : [batch_size x num_neg_samples]self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])_aff = tf.expand_dims(aff, axis=1)self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])size = tf.shape(self.aff_all)[1]_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))tf.summary.scalar('mrr', self.mrr)class Node2VecModel(GeneralizedModel):def __init__(self, placeholders, dict_size, degrees, name=None,nodevec_dim=50, lr=0.001, **kwargs):""" Simple version of Node2Vec/DeepWalk algorithm.Args:dict_size: the total number of nodes.degrees: numpy array of node degrees, ordered as in the data's id_mapnodevec_dim: dimension of the vector representation of node.lr: learning rate of optimizer."""super(Node2VecModel, self).__init__(**kwargs)self.placeholders = placeholdersself.degrees = degreesself.inputs1 = placeholders["batch1"]self.inputs2 = placeholders["batch2"]self.batch_size = placeholders['batch_size']self.hidden_dim = nodevec_dim# following the tensorflow word2vec tutorialself.target_embeds = tf.Variable(tf.random_uniform([dict_size, nodevec_dim], -1, 1),name="target_embeds")self.context_embeds = tf.Variable(tf.truncated_normal([dict_size, nodevec_dim],stddev=1.0 / math.sqrt(nodevec_dim)),name="context_embeds")self.context_bias = tf.Variable(tf.zeros([dict_size]),name="context_bias")self.optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)self.build()def _build(self):labels = tf.reshape(tf.cast(self.placeholders['batch2'], dtype=tf.int64),[self.batch_size, 1])self.neg_samples, _, _ = (tf.nn.fixed_unigram_candidate_sampler(true_classes=labels,num_true=1,num_sampled=FLAGS.neg_sample_size,unique=True,range_max=len(self.degrees),distortion=0.75,unigrams=self.degrees.tolist()))self.outputs1 = tf.nn.embedding_lookup(self.target_embeds, self.inputs1)self.outputs2 = tf.nn.embedding_lookup(self.context_embeds, self.inputs2)self.outputs2_bias = tf.nn.embedding_lookup(self.context_bias, self.inputs2)self.neg_outputs = tf.nn.embedding_lookup(self.context_embeds, self.neg_samples)self.neg_outputs_bias = tf.nn.embedding_lookup(self.context_bias, self.neg_samples)self.link_pred_layer = BipartiteEdgePredLayer(self.hidden_dim, self.hidden_dim,self.placeholders, bilinear_weights=False)def build(self):self._build()# TF graph managementself._loss()self._minimize()self._accuracy()def _minimize(self):self.opt_op = self.optimizer.minimize(self.loss)def 
_loss(self):aff = tf.reduce_sum(tf.multiply(self.outputs1, self.outputs2), 1) + self.outputs2_biasneg_aff = tf.matmul(self.outputs1, tf.transpose(self.neg_outputs)) + self.neg_outputs_biastrue_xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(aff), logits=aff)negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(neg_aff), logits=neg_aff)loss = tf.reduce_sum(true_xent) + tf.reduce_sum(negative_xent)self.loss = loss / tf.cast(self.batch_size, tf.float32)tf.summary.scalar('loss', self.loss)def _accuracy(self):# shape: [batch_size]aff = self.link_pred_layer.affinity(self.outputs1, self.outputs2)# shape : [batch_size x num_neg_samples]self.neg_aff = self.link_pred_layer.neg_cost(self.outputs1, self.neg_outputs)self.neg_aff = tf.reshape(self.neg_aff, [self.batch_size, FLAGS.neg_sample_size])_aff = tf.expand_dims(aff, axis=1)self.aff_all = tf.concat(axis=1, values=[self.neg_aff, _aff])size = tf.shape(self.aff_all)[1]_, indices_of_ranks = tf.nn.top_k(self.aff_all, k=size)_, self.ranks = tf.nn.top_k(-indices_of_ranks, k=size)self.mrr = tf.reduce_mean(tf.div(1.0, tf.cast(self.ranks[:, -1] + 1, tf.float32)))tf.summary.scalar('mrr', self.mrr)

layers.py

(1) _LAYER_UIDS = {}
_LAYER_UIDS = {} is a dictionary that records each layer type and how many times it has appeared.
In class Layer, when no variable-scope name is given, the number of times the Layer class has been instantiated is used to assign distinct layer ids.
Example:

class Layer():
    def __init__(self):
        layer = self.__class__.__name__
        name = layer + '_' + str(get_layer_uid(layer))
        print(name)

layer1 = Layer()
layer2 = Layer()
# Output:
# Layer_1
# Layer_2

(2) class Layer
class Layer defines the basic API for all layer objects.
Methods:

  • __init__(): reads the name, logging, model_size keyword arguments and initializes the instance variables name, vars{}, logging, sparse_inputs
  • _call(inputs): defines the layer's computation graph: takes the input and returns the output
  • __call__(inputs): acts as a wrapper (decorator-like) around _call(); in addition to the basic _call() functionality, it uses tf.summary.histogram() so that histograms of the input and output distributions can be inspected
  • _log_vars(): logs all variables; each variable in vars is displayed as a histogram

(3) class Dense
The Dense layer implements the basic functionality of a fully connected layer, i.e. it ultimately computes Relu(Wx + b).

  • __init__(): reads and initializes the member variables; the purpose of num_features_nonzero and featureless is still unclear to me
  • _call(): implements and returns Relu(Wx + b) (see the toy sketch below)
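A toy NumPy sketch (made-up x, W, b) of what _call() computes:

import numpy as np

x = np.array([[1.0, -2.0, 3.0]])     # one input row, input_dim = 3
W = np.full((3, 2), 0.5)             # weights, shape (input_dim, output_dim)
b = np.array([0.1, -5.0])            # bias, shape (output_dim,)

output = np.maximum(x @ W + b, 0.0)  # ReLU(Wx + b)
print(output)                        # [[1.1 0. ]]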

Complete code of layers.py

from __future__ import division
from __future__ import print_functionimport tensorflow as tffrom graphsage.inits import zerosflags = tf.app.flags
FLAGS = flags.FLAGS# DISCLAIMER:
# Boilerplate parts of this code file were originally forked from
# https://github.com/tkipf/gcn
# which itself was very inspired by the keras package# global unique layer ID dictionary for layer name assignment
_LAYER_UIDS = {} #记录layer及其出现次数的字典#作用: 在class Layer中,当未赋variable scope的name时,通过实例化Layer的次数来标定不同的layer_id
def get_layer_uid(layer_name=''):"""Helper function, assigns unique layer IDs."""if layer_name not in _LAYER_UIDS:_LAYER_UIDS[layer_name] = 1  #若layer_name从未出现过,如今出现了,则将_LAYER_UIDS[layer_name]设为1;否则累加return 1else:_LAYER_UIDS[layer_name] += 1return _LAYER_UIDS[layer_name]class Layer(object):"""Base layer class. Defines basic API for all layer objects.Implementation inspired by keras (http://keras.io).# Propertiesname: String, defines the variable scope of the layer.logging: Boolean, switches Tensorflow histogram logging on/off# Methods_call(inputs): Defines computation graph of layer(i.e. takes input, returns output)__call__(inputs): Wrapper for _call()_log_vars(): Log all variables"""def __init__(self, **kwargs):allowed_kwargs = {'name', 'logging', 'model_size'}for kwarg in kwargs.keys():assert kwarg in allowed_kwargs, 'Invalid keyword argument: ' + kwargname = kwargs.get('name')if not name:layer = self.__class__.__name__.lower()name = layer + '_' + str(get_layer_uid(layer))self.name = nameself.vars = {}logging = kwargs.get('logging', False)self.logging = loggingself.sparse_inputs = Falsedef _call(self, inputs):return inputs#重写了__call__ python内置函数,可以把对象当函数调用,__call__输入为inputs,输出为outputsdef __call__(self, inputs):with tf.name_scope(self.name):if self.logging and not self.sparse_inputs:tf.summary.histogram(self.name + '/inputs', inputs)outputs = self._call(inputs) #在子类也可以把对象当作函数调用,即调用__call__时也会自动调用_call()函数if self.logging:tf.summary.histogram(self.name + '/outputs', outputs)return outputsdef _log_vars(self):for var in self.vars:tf.summary.histogram(self.name + '/vars/' + var, self.vars[var])class Dense(Layer):"""Dense layer."""def __init__(self, input_dim, output_dim, dropout=0., act=tf.nn.relu, placeholders=None, bias=True, featureless=False, sparse_inputs=False, **kwargs):super(Dense, self).__init__(**kwargs)self.dropout = dropoutself.act = actself.featureless = featurelessself.bias = biasself.input_dim = input_dimself.output_dim = output_dim# helper variable for sparse dropoutself.sparse_inputs = sparse_inputsif sparse_inputs:self.num_features_nonzero = placeholders['num_features_nonzero']with tf.variable_scope(self.name + '_vars'):self.vars['weights'] = tf.get_variable('weights', shape=(input_dim, output_dim),dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer(),regularizer=tf.contrib.layers.l2_regularizer(FLAGS.weight_decay))if self.bias:self.vars['bias'] = zeros([output_dim], name='bias')if self.logging:self._log_vars()def _call(self, inputs):x = inputsx = tf.nn.dropout(x, 1-self.dropout)# transformoutput = tf.matmul(x, self.vars['weights'])# biasif self.bias:output += self.vars['bias']return self.act(output)

minibatch.py

Complete code of minibatch.py

from __future__ import division
from __future__ import print_functionimport numpy as npnp.random.seed(123)class EdgeMinibatchIterator(object):""" This minibatch iterator iterates over batches of sampled edges orrandom pairs of co-occuring edges.G -- networkx graphid2idx -- dict mapping node ids to index in feature tensorplaceholders -- tensorflow placeholders objectcontext_pairs -- if not none, then a list of co-occuring node pairs (from random walks)batch_size -- size of the minibatchesmax_degree -- maximum size of the downsampled adjacency listsn2v_retrain -- signals that the iterator is being used to add new embeddings to a n2v modelfixed_n2v -- signals that the iterator is being used to retrain n2v with only existing nodes as context"""def __init__(self, G, id2idx, placeholders, context_pairs=None, batch_size=100, max_degree=25,n2v_retrain=False, fixed_n2v=False,**kwargs):self.G = Gself.nodes = G.nodes()self.id2idx = id2idxself.placeholders = placeholdersself.batch_size = batch_sizeself.max_degree = max_degreeself.batch_num = 0## 函数shuffle与permutation都是对原来的数组进行重新洗牌,即随机打乱原来的元素顺序# permutation不直接在原来的数组上进行操作,而是返回一个新的打乱顺序的数组,并不改变原来的数组。self.nodes = np.random.permutation(G.nodes())self.adj, self.deg = self.construct_adj()self.test_adj = self.construct_test_adj()if context_pairs is None:edges = G.edges()else:edges = context_pairsself.train_edges = self.edges = np.random.permutation(edges)if not n2v_retrain:self.train_edges = self._remove_isolated(self.train_edges)self.val_edges = [e for e in G.edges() if G[e[0]][e[1]]['train_removed']]else:if fixed_n2v:self.train_edges = self.val_edges = self._n2v_prune(self.edges)else:self.train_edges = self.val_edges = self.edgesprint(len([n for n in G.nodes() if not G.node[n]['test'] and not G.node[n]['val']]), 'train nodes')print(len([n for n in G.nodes() if G.node[n]['test'] or G.node[n]['val']]), 'test nodes')self.val_set_size = len(self.val_edges)def _n2v_prune(self, edges):is_val = lambda n : self.G.node[n]["val"] or self.G.node[n]["test"]return [e for e in edges if not is_val(e[1])]def _remove_isolated(self, edge_list):new_edge_list = []missing = 0for n1, n2 in edge_list:if not n1 in self.G.node or not n2 in self.G.node:missing += 1continueif (self.deg[self.id2idx[n1]] == 0 or self.deg[self.id2idx[n2]] == 0) \and (not self.G.node[n1]['test'] or self.G.node[n1]['val']) \and (not self.G.node[n2]['test'] or self.G.node[n2]['val']):continueelse:new_edge_list.append((n1,n2))print("Unexpected missing:", missing)return new_edge_listdef construct_adj(self):adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))deg = np.zeros((len(self.id2idx),))for nodeid in self.G.nodes():if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:continueneighbors = np.array([self.id2idx[neighbor] for neighbor in self.G.neighbors(nodeid)if (not self.G[nodeid][neighbor]['train_removed'])])deg[self.id2idx[nodeid]] = len(neighbors)if len(neighbors) == 0:continueif len(neighbors) > self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=False)elif len(neighbors) < self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=True)adj[self.id2idx[nodeid], :] = neighborsreturn adj, degdef construct_test_adj(self):adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))for nodeid in self.G.nodes():neighbors = np.array([self.id2idx[neighbor] for neighbor in self.G.neighbors(nodeid)])if len(neighbors) == 0:continueif len(neighbors) > self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=False)elif len(neighbors) < 
self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=True)adj[self.id2idx[nodeid], :] = neighborsreturn adjdef end(self):return self.batch_num * self.batch_size >= len(self.train_edges)def batch_feed_dict(self, batch_edges):batch1 = []batch2 = []for node1, node2 in batch_edges:batch1.append(self.id2idx[node1])batch2.append(self.id2idx[node2])feed_dict = dict()feed_dict.update({self.placeholders['batch_size'] : len(batch_edges)})feed_dict.update({self.placeholders['batch1']: batch1})feed_dict.update({self.placeholders['batch2']: batch2})return feed_dict#函数中获取下个edgeminibatch的起始与终止序号,将batch后的边的信息传给batch_feed_dict(self, batch_edges)函数,更新placeholders中的batch1, batch2, batch_size信息def next_minibatch_feed_dict(self):start_idx = self.batch_num * self.batch_sizeself.batch_num += 1end_idx = min(start_idx + self.batch_size, len(self.train_edges))batch_edges = self.train_edges[start_idx : end_idx]return self.batch_feed_dict(batch_edges)def num_training_batches(self):return len(self.train_edges) // self.batch_size + 1def val_feed_dict(self, size=None):edge_list = self.val_edgesif size is None:return self.batch_feed_dict(edge_list)else:ind = np.random.permutation(len(edge_list))val_edges = [edge_list[i] for i in ind[:min(size, len(ind))]]return self.batch_feed_dict(val_edges)def incremental_val_feed_dict(self, size, iter_num):edge_list = self.val_edgesval_edges = edge_list[iter_num*size:min((iter_num+1)*size, len(edge_list))]return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(self.val_edges), val_edgesdef incremental_embed_feed_dict(self, size, iter_num):node_list = self.nodesval_nodes = node_list[iter_num*size:min((iter_num+1)*size, len(node_list))]val_edges = [(n,n) for n in val_nodes]return self.batch_feed_dict(val_edges), (iter_num+1)*size >= len(node_list), val_edgesdef label_val(self):train_edges = []val_edges = []for n1, n2 in self.G.edges():if (self.G.node[n1]['val'] or self.G.node[n1]['test'] or self.G.node[n2]['val'] or self.G.node[n2]['test']):val_edges.append((n1,n2))else:train_edges.append((n1,n2))return train_edges, val_edges# shuffle直接在原来的数组上进行操作,改变原来数组的顺序,无返回值def shuffle(self):""" Re-shuffle the training set.Also reset the batch number."""self.train_edges = np.random.permutation(self.train_edges)self.nodes = np.random.permutation(self.nodes)self.batch_num = 0class NodeMinibatchIterator(object):""" This minibatch iterator iterates over nodes for supervised learning.G -- networkx graphid2idx -- dict mapping node ids to integer values indexing feature tensorplaceholders -- standard tensorflow placeholders object for feedinglabel_map -- map from node ids to class values (integer or list)num_classes -- number of output classesbatch_size -- size of the minibatchesmax_degree -- maximum size of the downsampled adjacency lists"""def __init__(self, G, id2idx, placeholders, label_map, num_classes, batch_size=100, max_degree=25,**kwargs):self.G = Gself.nodes = G.nodes()self.id2idx = id2idxself.placeholders = placeholdersself.batch_size = batch_sizeself.max_degree = max_degreeself.batch_num = 0self.label_map = label_mapself.num_classes = num_classesself.adj, self.deg = self.construct_adj()self.test_adj = self.construct_test_adj()self.val_nodes = [n for n in self.G.nodes() if self.G.node[n]['val']]self.test_nodes = [n for n in self.G.nodes() if self.G.node[n]['test']]self.no_train_nodes_set = set(self.val_nodes + self.test_nodes)self.train_nodes = set(G.nodes()).difference(self.no_train_nodes_set)# don't train on nodes that only have edges to test 
setself.train_nodes = [n for n in self.train_nodes if self.deg[id2idx[n]] > 0]def _make_label_vec(self, node):label = self.label_map[node]if isinstance(label, list):label_vec = np.array(label)else:label_vec = np.zeros((self.num_classes))class_ind = self.label_map[node]label_vec[class_ind] = 1return label_vecdef construct_adj(self):# 该矩阵记录训练数据中各节点的邻居节点的编号# 采样只取max_degree个邻居节点,采样方法见下# 同样进行了行数加一操作adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))# 该矩阵记录了每个节点的度数deg = np.zeros((len(self.id2idx),))for nodeid in self.G.nodes():# 在选取邻居节点时进行了筛选,对于G.neighbors(nodeid) 点node的邻居,# 只取该node与neighbor相连的边的train_removed = False的neighbor# 也就是只取不是val, test的节点# neighbors得到了邻居节点编号数列if self.G.node[nodeid]['test'] or self.G.node[nodeid]['val']:continueneighbors = np.array([self.id2idx[neighbor] for neighbor in self.G.neighbors(nodeid)if (not self.G[nodeid][neighbor]['train_removed'])])# deg各位取值为该位对应nodeid的节点的度数,# 也即经过上面筛选后得到的邻居数deg[self.id2idx[nodeid]] = len(neighbors)if len(neighbors) == 0:continueif len(neighbors) > self.max_degree:                # np.random.choice为选取size大小的数列neighbors = np.random.choice(neighbors, self.max_degree, replace=False)elif len(neighbors) < self.max_degree:# 经过choice随机选取,得到了固定大小max_degree = 25的直接相连的邻居数列neighbors = np.random.choice(neighbors, self.max_degree, replace=True)# 把该node的邻居数列,赋值给adj矩阵中对应nodeid位的向量adj[self.id2idx[nodeid], :] = neighborsreturn adj, deg#在construct_test_adj()  函数中,与上不同之处在于,可以直接得到邻居而无需根据val/test/train_removed筛选def construct_test_adj(self):adj = len(self.id2idx)*np.ones((len(self.id2idx)+1, self.max_degree))for nodeid in self.G.nodes():neighbors = np.array([self.id2idx[neighbor] for neighbor in self.G.neighbors(nodeid)])if len(neighbors) == 0:continueif len(neighbors) > self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=False)elif len(neighbors) < self.max_degree:neighbors = np.random.choice(neighbors, self.max_degree, replace=True)adj[self.id2idx[nodeid], :] = neighborsreturn adjdef end(self):return self.batch_num * self.batch_size >= len(self.train_nodes)#也即next_minibatch_feed_dict()返回的是下一个edge minibatch的placeholders信息def batch_feed_dict(self, batch_nodes, val=False):batch1id = batch_nodesbatch1 = [self.id2idx[n] for n in batch1id]labels = np.vstack([self._make_label_vec(node) for node in batch1id])feed_dict = dict()feed_dict.update({self.placeholders['batch_size'] : len(batch1)})feed_dict.update({self.placeholders['batch']: batch1})feed_dict.update({self.placeholders['labels']: labels})return feed_dict, labelsdef node_val_feed_dict(self, size=None, test=False):if test:val_nodes = self.test_nodeselse:val_nodes = self.val_nodesif not size is None:val_nodes = np.random.choice(val_nodes, size, replace=True)# add a dummy neighborret_val = self.batch_feed_dict(val_nodes)return ret_val[0], ret_val[1]def incremental_node_val_feed_dict(self, size, iter_num, test=False):if test:val_nodes = self.test_nodeselse:val_nodes = self.val_nodesval_node_subset = val_nodes[iter_num*size:min((iter_num+1)*size, len(val_nodes))]# add a dummy neighborret_val = self.batch_feed_dict(val_node_subset)return ret_val[0], ret_val[1], (iter_num+1)*size >= len(val_nodes), val_node_subsetdef num_training_batches(self):return len(self.train_nodes) // self.batch_size + 1def next_minibatch_feed_dict(self):start_idx = self.batch_num * self.batch_sizeself.batch_num += 1end_idx = min(start_idx + self.batch_size, len(self.train_nodes))batch_nodes = self.train_nodes[start_idx : end_idx]return self.batch_feed_dict(batch_nodes)def 
incremental_embed_feed_dict(self, size, iter_num):node_list = self.nodesval_nodes = node_list[iter_num*size:min((iter_num+1)*size, len(node_list))]return self.batch_feed_dict(val_nodes), (iter_num+1)*size >= len(node_list), val_nodesdef shuffle(self):""" Re-shuffle the training set.Also reset the batch number."""self.train_nodes = np.random.permutation(self.train_nodes)self.batch_num = 0
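The neighbor downsampling in construct_adj() / construct_test_adj() boils down to np.random.choice: sample without replacement when a node has more than max_degree neighbors, and with replacement (duplicates allowed) when it has fewer. A minimal standalone sketch of just that step (the neighbor ids and the max_degree value below are made up for illustration):

import numpy as np

max_degree = 5                        # hypothetical value; the scripts use FLAGS.max_degree

few_neighbors = np.array([3, 7, 9])   # a node with fewer neighbors than max_degree
padded = np.random.choice(few_neighbors, max_degree, replace=True)       # duplicates allowed

many_neighbors = np.arange(12)        # a node with more neighbors than max_degree
truncated = np.random.choice(many_neighbors, max_degree, replace=False)  # no duplicates

print(padded)      # e.g. [7 3 3 9 7]
print(truncated)   # e.g. [ 4  0 11  2  8]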

aggregators.py

The aggregator classes in this file implement the GraphSAGE aggregation step of the form:

$\mathbf{h}^k_v \leftarrow \sigma(\mathbf{W} \cdot \mathrm{MEAN}(\lbrace \mathbf{h}^{k-1}_v \rbrace \cup \lbrace \mathbf{h}^{k-1}_u, \forall u \in \mathcal{N}(v) \rbrace))$

(1)class GCNAggregator(Layer)
这里__init__()与MeanAggregator基本相同,在_call()的实现中略有不同

def _call(self, inputs):
    self_vecs, neigh_vecs = inputs

    neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
    self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
    means = tf.reduce_mean(tf.concat([neigh_vecs,
            tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)

    # [nodes] x [out_dim]
    output = tf.matmul(means, self.vars['weights'])

    # bias
    if self.bias:
        output += self.vars['bias']

    return self.act(output)

How means is computed in _call():

  • tf.expand_dims(self_vecs, axis=1) inserts a new axis into self_vecs so that it can be concatenated with neigh_vecs along axis 1;
  • the concatenation effectively appends each node's own vector to its neighbor vectors as one extra "neighbor";
  • taking the mean of the concatenated tensor along axis 1 yields means.
    means is then multiplied by the weight matrix vars['weights'], the bias vars['bias'] is added (if enabled), and the result is passed through the activation function (ReLU).

A small example (the multiplication by W is omitted here):

import tensorflow as tf

neigh_vecs = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
self_vecs = [2, 3, 4]

means = tf.reduce_mean(tf.concat([neigh_vecs,
                                  tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)

print(tf.shape(self_vecs))
print(tf.expand_dims(self_vecs, axis=0))
# Tensor("ExpandDims_1:0", shape=(1, 3), dtype=int32)
print(tf.expand_dims(self_vecs, axis=1))
# Tensor("ExpandDims_2:0", shape=(3, 1), dtype=int32)

sess = tf.Session()
print(sess.run(tf.expand_dims(self_vecs, axis=1)))
# [[2]
#  [3]
#  [4]]
print(sess.run(tf.concat([neigh_vecs, tf.expand_dims(self_vecs, axis=1)], axis=1)))
# [[1 2 3 2]
#  [4 5 6 3]
#  [7 8 9 4]]
print(means)
# Tensor("Mean:0", shape=(3,), dtype=int32)
print(sess.run(tf.reduce_mean(tf.concat([neigh_vecs, tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)))
# [2 4 7]
# integer (int32) row means:
# [[1 2 3 2]   = 8 // 4  = 2
#  [4 5 6 3]   = 18 // 4 = 4
#  [7 8 9 4]]  = 28 // 4 = 7

bias = [1]
output = means + bias
print(sess.run(output))
# [3 5 8]
# [2 + 1, 4 + 1, 7 + 1] = [3, 5, 8]
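For completeness, here is a hedged sketch that also includes the multiplication by the weight matrix W, using the tensor shapes GCNAggregator._call actually sees (batch x sampled neighbors x feature dim); all shapes and values are made up for illustration:

import tensorflow as tf

# Hypothetical shapes: a batch of 2 nodes, 3 sampled neighbors each, feature dim 4, output dim 5.
self_vecs  = tf.random_normal([2, 4])     # [batch, input_dim]
neigh_vecs = tf.random_normal([2, 3, 4])  # [batch, num_samples, input_dim]
W          = tf.random_normal([4, 5])     # stands in for vars['weights']

# Append each node's own vector as one extra "neighbor", then average over axis 1.
means = tf.reduce_mean(
    tf.concat([neigh_vecs, tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)  # [2, 4]

output = tf.nn.relu(tf.matmul(means, W))  # [2, 5], as in GCNAggregator._call (bias omitted)

with tf.Session() as sess:
    m, o = sess.run([means, output])
    print(m.shape, o.shape)   # (2, 4) (2, 5)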

aggregators.py完整代码

import tensorflow as tffrom .layers import Layer, Dense
from .inits import glorot, zerosclass MeanAggregator(Layer):"""Aggregates via mean followed by matmul and non-linearity."""#__init_() 用于获取并初始化成员变量 dropout, bias(False), act(ReLu), concat(False), input_dim, output_dim, name(Variable scopr)def __init__(self, input_dim, output_dim, neigh_input_dim=None,dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):super(MeanAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''#用glorot()方法初始化节点v的权值矩阵 vars['self_weights'] 和邻居节点均值u的权值矩阵 vars['neigh_weights']with tf.variable_scope(self.name + name + '_vars'):self.vars['neigh_weights'] = glorot([neigh_input_dim, output_dim],name='neigh_weights')self.vars['self_weights'] = glorot([input_dim, output_dim],name='self_weights')#用零向量初始化vars['bias']if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')#若logging为True,则调用 layers.py 中 class Layer()的成员函数_log_vars(), 生成vars中各个变量的直方图if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimdef _call(self, inputs):self_vecs, neigh_vecs = inputs#tf.nn.dropout(x, keep_prob, noise_shape=None, seed=None, name=None)#输出的非0元素是原来的 “1/keep_prob” 倍,以保证总和不变neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)neigh_means = tf.reduce_mean(neigh_vecs, axis=1)# [nodes] x [out_dim]from_neighs = tf.matmul(neigh_means, self.vars['neigh_weights'])from_self = tf.matmul(self_vecs, self.vars["self_weights"])if not self.concat:output = tf.add_n([from_self, from_neighs])else:output = tf.concat([from_self, from_neighs], axis=1) #在concat后其维数变为之前的2倍# biasif self.bias:output += self.vars['bias']return self.act(output)#这里__init__()与MeanAggregator基本相同,在_call()的实现中略有不同
class GCNAggregator(Layer):"""Aggregates via mean followed by matmul and non-linearity.Same matmul parameters are used self vector and neighbor vectors."""def __init__(self, input_dim, output_dim, neigh_input_dim=None,dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):super(GCNAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''with tf.variable_scope(self.name + name + '_vars'):self.vars['weights'] = glorot([neigh_input_dim, output_dim],name='neigh_weights')if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimdef _call(self, inputs):self_vecs, neigh_vecs = inputsneigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)# 其中对means求解时,# 1.先将self_vecs行列转换(tf.expand_dims(self_vecs, axis=1)),# 2.之后self_vecs的行数与neigh_vecs行数相同时,将二者concat, 即相当于在原先的neigh_vecs矩阵后面新增一列self_vecs的转置# 3.最后将得到的矩阵每行求均值,即得means.# 之后means与权值矩阵vars['weights']# 求内积,并加上vars['bias'], 最终将该值带入激活函数(ReLu)means = tf.reduce_mean(tf.concat([neigh_vecs, tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)# [nodes] x [out_dim]output = tf.matmul(means, self.vars['weights'])# biasif self.bias:output += self.vars['bias']return self.act(output)class MaxPoolingAggregator(Layer):""" Aggregates via max-pooling over MLP functions."""def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):super(MaxPoolingAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''if model_size == "small":hidden_dim = self.hidden_dim = 512elif model_size == "big":hidden_dim = self.hidden_dim = 1024self.mlp_layers = []self.mlp_layers.append(Dense(input_dim=neigh_input_dim,output_dim=hidden_dim,act=tf.nn.relu,dropout=dropout,sparse_inputs=False,logging=self.logging))with tf.variable_scope(self.name + name + '_vars'):self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],name='neigh_weights')self.vars['self_weights'] = glorot([input_dim, output_dim],name='self_weights')if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimself.neigh_input_dim = neigh_input_dimdef _call(self, inputs):self_vecs, neigh_vecs = inputsneigh_h = neigh_vecsdims = tf.shape(neigh_h)batch_size = dims[0]num_neighbors = dims[1]# [nodes * sampled neighbors] x [hidden_dim]h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))for l in self.mlp_layers:h_reshaped = l(h_reshaped)neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))neigh_h = tf.reduce_max(neigh_h, axis=1)from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])from_self = tf.matmul(self_vecs, self.vars["self_weights"])if not self.concat:output = tf.add_n([from_self, from_neighs])else:output = tf.concat([from_self, from_neighs], axis=1)# biasif self.bias:output += self.vars['bias']return self.act(output)class MeanPoolingAggregator(Layer):""" Aggregates via mean-pooling over MLP functions."""def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,dropout=0., 
bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):super(MeanPoolingAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''if model_size == "small":hidden_dim = self.hidden_dim = 512elif model_size == "big":hidden_dim = self.hidden_dim = 1024self.mlp_layers = []self.mlp_layers.append(Dense(input_dim=neigh_input_dim,output_dim=hidden_dim,act=tf.nn.relu,dropout=dropout,sparse_inputs=False,logging=self.logging))with tf.variable_scope(self.name + name + '_vars'):self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],name='neigh_weights')self.vars['self_weights'] = glorot([input_dim, output_dim],name='self_weights')if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimself.neigh_input_dim = neigh_input_dimdef _call(self, inputs):self_vecs, neigh_vecs = inputsneigh_h = neigh_vecsdims = tf.shape(neigh_h)batch_size = dims[0]num_neighbors = dims[1]# [nodes * sampled neighbors] x [hidden_dim]h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))for l in self.mlp_layers:h_reshaped = l(h_reshaped)neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))neigh_h = tf.reduce_mean(neigh_h, axis=1)from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])from_self = tf.matmul(self_vecs, self.vars["self_weights"])if not self.concat:output = tf.add_n([from_self, from_neighs])else:output = tf.concat([from_self, from_neighs], axis=1)# biasif self.bias:output += self.vars['bias']return self.act(output)class TwoMaxLayerPoolingAggregator(Layer):""" Aggregates via pooling over two MLP functions."""def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):super(TwoMaxLayerPoolingAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''if model_size == "small":hidden_dim_1 = self.hidden_dim_1 = 512hidden_dim_2 = self.hidden_dim_2 = 256elif model_size == "big":hidden_dim_1 = self.hidden_dim_1 = 1024hidden_dim_2 = self.hidden_dim_2 = 512self.mlp_layers = []self.mlp_layers.append(Dense(input_dim=neigh_input_dim,output_dim=hidden_dim_1,act=tf.nn.relu,dropout=dropout,sparse_inputs=False,logging=self.logging))self.mlp_layers.append(Dense(input_dim=hidden_dim_1,output_dim=hidden_dim_2,act=tf.nn.relu,dropout=dropout,sparse_inputs=False,logging=self.logging))with tf.variable_scope(self.name + name + '_vars'):self.vars['neigh_weights'] = glorot([hidden_dim_2, output_dim],name='neigh_weights')self.vars['self_weights'] = glorot([input_dim, output_dim],name='self_weights')if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimself.neigh_input_dim = neigh_input_dimdef _call(self, inputs):self_vecs, neigh_vecs = inputsneigh_h = neigh_vecsdims = tf.shape(neigh_h)batch_size = dims[0]num_neighbors = dims[1]# [nodes * sampled neighbors] x [hidden_dim]h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))for l in self.mlp_layers:h_reshaped = l(h_reshaped)neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, 
self.hidden_dim_2))neigh_h = tf.reduce_max(neigh_h, axis=1)from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])from_self = tf.matmul(self_vecs, self.vars["self_weights"])if not self.concat:output = tf.add_n([from_self, from_neighs])else:output = tf.concat([from_self, from_neighs], axis=1)# biasif self.bias:output += self.vars['bias']return self.act(output)class SeqAggregator(Layer):""" Aggregates via a standard LSTM."""def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,dropout=0., bias=False, act=tf.nn.relu, name=None,  concat=False, **kwargs):super(SeqAggregator, self).__init__(**kwargs)self.dropout = dropoutself.bias = biasself.act = actself.concat = concatif neigh_input_dim is None:neigh_input_dim = input_dimif name is not None:name = '/' + nameelse:name = ''if model_size == "small":hidden_dim = self.hidden_dim = 128elif model_size == "big":hidden_dim = self.hidden_dim = 256with tf.variable_scope(self.name + name + '_vars'):self.vars['neigh_weights'] = glorot([hidden_dim, output_dim],name='neigh_weights')self.vars['self_weights'] = glorot([input_dim, output_dim],name='self_weights')if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if self.logging:self._log_vars()self.input_dim = input_dimself.output_dim = output_dimself.neigh_input_dim = neigh_input_dimself.cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_dim)def _call(self, inputs):self_vecs, neigh_vecs = inputsdims = tf.shape(neigh_vecs)batch_size = dims[0]initial_state = self.cell.zero_state(batch_size, tf.float32)used = tf.sign(tf.reduce_max(tf.abs(neigh_vecs), axis=2))length = tf.reduce_sum(used, axis=1)length = tf.maximum(length, tf.constant(1.))length = tf.cast(length, tf.int32)with tf.variable_scope(self.name) as scope:try:rnn_outputs, rnn_states = tf.nn.dynamic_rnn(self.cell, neigh_vecs,initial_state=initial_state, dtype=tf.float32, time_major=False,sequence_length=length)except ValueError:scope.reuse_variables()rnn_outputs, rnn_states = tf.nn.dynamic_rnn(self.cell, neigh_vecs,initial_state=initial_state, dtype=tf.float32, time_major=False,sequence_length=length)batch_size = tf.shape(rnn_outputs)[0]max_len = tf.shape(rnn_outputs)[1]out_size = int(rnn_outputs.get_shape()[2])index = tf.range(0, batch_size) * max_len + (length - 1)flat = tf.reshape(rnn_outputs, [-1, out_size])neigh_h = tf.gather(flat, index)from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])from_self = tf.matmul(self_vecs, self.vars["self_weights"])output = tf.add_n([from_self, from_neighs])if not self.concat:output = tf.add_n([from_self, from_neighs])else:output = tf.concat([from_self, from_neighs], axis=1)# biasif self.bias:output += self.vars['bias']return self.act(output)
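As the comment in MeanAggregator._call notes, concat=True doubles the output dimension relative to concat=False; a quick shape check with toy tensors (the shapes are illustrative only):

import tensorflow as tf

from_self   = tf.zeros([2, 128])   # [batch, output_dim]
from_neighs = tf.zeros([2, 128])

added    = tf.add_n([from_self, from_neighs])           # concat=False -> shape (2, 128)
concated = tf.concat([from_self, from_neighs], axis=1)  # concat=True  -> shape (2, 256)

print(added.shape, concated.shape)   # (2, 128) (2, 256)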

prediction.py

prediction.py完整代码

from __future__ import division
from __future__ import print_functionfrom graphsage.inits import zeros
from graphsage.layers import Layer
import tensorflow as tfflags = tf.app.flags
FLAGS = flags.FLAGSclass BipartiteEdgePredLayer(Layer):def __init__(self, input_dim1, input_dim2, placeholders, dropout=False, act=tf.nn.sigmoid,loss_fn='xent', neg_sample_weights=1.0,bias=False, bilinear_weights=False, **kwargs):"""Basic class that applies skip-gram-like loss(i.e., dot product of node+target and node and negative samples)Args:bilinear_weights: use a bilinear weight for affinity calculation: u^T A v. If set tofalse, it is assumed that input dimensions are the same and the affinity will be based on dot product."""super(BipartiteEdgePredLayer, self).__init__(**kwargs)self.input_dim1 = input_dim1self.input_dim2 = input_dim2self.act = actself.bias = biasself.eps = 1e-7# Margin for hinge lossself.margin = 0.1self.neg_sample_weights = neg_sample_weightsself.bilinear_weights = bilinear_weightsif dropout:self.dropout = placeholders['dropout']else:self.dropout = 0.# output a likelihood termself.output_dim = 1with tf.variable_scope(self.name + '_vars'):# bilinear formif bilinear_weights:#self.vars['weights'] = glorot([input_dim1, input_dim2],#                              name='pred_weights')self.vars['weights'] = tf.get_variable('pred_weights', shape=(input_dim1, input_dim2),dtype=tf.float32, initializer=tf.contrib.layers.xavier_initializer())if self.bias:self.vars['bias'] = zeros([self.output_dim], name='bias')if loss_fn == 'xent':self.loss_fn = self._xent_losselif loss_fn == 'skipgram':self.loss_fn = self._skipgram_losselif loss_fn == 'hinge':self.loss_fn = self._hinge_lossif self.logging:self._log_vars()def affinity(self, inputs1, inputs2):""" Affinity score between batch of inputs1 and inputs2.Args:inputs1: tensor of shape [batch_size x feature_size]."""# shape: [batch_size, input_dim1]if self.bilinear_weights:prod = tf.matmul(inputs2, tf.transpose(self.vars['weights']))self.prod = prodresult = tf.reduce_sum(inputs1 * prod, axis=1)else:result = tf.reduce_sum(inputs1 * inputs2, axis=1)return resultdef neg_cost(self, inputs1, neg_samples, hard_neg_samples=None):""" For each input in batch, compute the sum of its affinity to negative samples.Returns:Tensor of shape [batch_size x num_neg_samples]. For each node, a list of affinities tonegative samples is computed."""if self.bilinear_weights:inputs1 = tf.matmul(inputs1, self.vars['weights'])neg_aff = tf.matmul(inputs1, tf.transpose(neg_samples))return neg_affdef loss(self, inputs1, inputs2, neg_samples):""" negative sampling loss.Args:neg_samples: tensor of shape [num_neg_samples x input_dim2]. 
Negative samples for allinputs in batch inputs1."""return self.loss_fn(inputs1, inputs2, neg_samples)def _xent_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):aff = self.affinity(inputs1, inputs2)neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)true_xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.ones_like(aff), logits=aff)negative_xent = tf.nn.sigmoid_cross_entropy_with_logits(labels=tf.zeros_like(neg_aff), logits=neg_aff)loss = tf.reduce_sum(true_xent) + self.neg_sample_weights * tf.reduce_sum(negative_xent)return lossdef _skipgram_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):aff = self.affinity(inputs1, inputs2)neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)neg_cost = tf.log(tf.reduce_sum(tf.exp(neg_aff), axis=1))loss = tf.reduce_sum(aff - neg_cost)return lossdef _hinge_loss(self, inputs1, inputs2, neg_samples, hard_neg_samples=None):aff = self.affinity(inputs1, inputs2)neg_aff = self.neg_cost(inputs1, neg_samples, hard_neg_samples)diff = tf.nn.relu(tf.subtract(neg_aff, tf.expand_dims(aff, 1) - self.margin), name='diff')loss = tf.reduce_sum(diff)self.neg_shape = tf.shape(neg_aff)return lossdef weights_norm(self):return tf.nn.l2_norm(self.vars['weights'])
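To make the default loss concrete, the sketch below re-derives _xent_loss in plain numpy for the bilinear_weights=False path with neg_sample_weights=1.0: the dot product of a true pair is treated as a positive logit, each affinity to a negative sample as a negative logit, and the two sigmoid cross-entropy terms are summed. The embeddings are random, made-up values:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Made-up embeddings: 2 positive pairs, 3 negative samples, embedding dim 4.
inputs1     = np.random.randn(2, 4)   # embeddings of the batch nodes (outputs1)
inputs2     = np.random.randn(2, 4)   # embeddings of their co-occurring nodes (outputs2)
neg_samples = np.random.randn(3, 4)   # embeddings of the negative samples

aff     = np.sum(inputs1 * inputs2, axis=1)   # [2]    affinity of each true pair
neg_aff = inputs1 @ neg_samples.T             # [2, 3] affinity of each node to each negative

true_xent     = -np.log(sigmoid(aff))            # push true pairs towards high affinity
negative_xent = -np.log(1.0 - sigmoid(neg_aff))  # push negatives towards low affinity

neg_sample_weights = 1.0
loss = np.sum(true_xent) + neg_sample_weights * np.sum(negative_xent)
print(loss)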

supervised_train.py

supervised_train.py完整代码

from __future__ import division
from __future__ import print_functionimport os
import time
import tensorflow as tf
import numpy as np
import sklearn
from sklearn import metricsfrom graphsage.supervised_models import SupervisedGraphsage
from graphsage.models import SAGEInfo
from graphsage.minibatch import NodeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_dataos.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)# Settings
flags = tf.app.flags
FLAGS = flags.FLAGStf.app.flags.DEFINE_boolean('log_device_placement', False,"""Whether to log device placement.""")
#core params..
flags.DEFINE_string('model', 'graphsage_mean', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.01, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi', 'prefix identifying training data. must be specified.')# left to default values in main experiments
flags.DEFINE_integer('epochs', 10, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 128, 'maximum node degree.')
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of samples in layer 2')
flags.DEFINE_integer('samples_3', 0, 'number of users samples in layer 3. (Only for mean model)')
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_boolean('sigmoid', False, 'whether to use sigmoid loss')
flags.DEFINE_integer('identity_dim', 0, 'Set to positive value to use identity embedding features of that dimension. Default 0.')#logging, saving, validation settings etc.
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 1, "which gpu to use.")
flags.DEFINE_integer('print_every', 5, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10**10, "Maximum total number of iterations")os.environ["CUDA_VISIBLE_DEVICES"]=str(FLAGS.gpu)GPU_MEM_FRACTION = 0.8def calc_f1(y_true, y_pred):if not FLAGS.sigmoid:y_true = np.argmax(y_true, axis=1)y_pred = np.argmax(y_pred, axis=1)else:y_pred[y_pred > 0.5] = 1y_pred[y_pred <= 0.5] = 0return metrics.f1_score(y_true, y_pred, average="micro"), metrics.f1_score(y_true, y_pred, average="macro")# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):t_test = time.time()feed_dict_val, labels = minibatch_iter.node_val_feed_dict(size)node_outs_val = sess.run([model.preds, model.loss], feed_dict=feed_dict_val)mic, mac = calc_f1(labels, node_outs_val[0])return node_outs_val[1], mic, mac, (time.time() - t_test)def log_dir():log_dir = FLAGS.base_log_dir + "/sup-" + FLAGS.train_prefix.split("/")[-2]log_dir += "/{model:s}_{model_size:s}_{lr:0.4f}/".format(model=FLAGS.model,model_size=FLAGS.model_size,lr=FLAGS.learning_rate)if not os.path.exists(log_dir):os.makedirs(log_dir)return log_dirdef incremental_evaluate(sess, model, minibatch_iter, size, test=False):t_test = time.time()finished = Falseval_losses = []val_preds = []labels = []iter_num = 0finished = Falsewhile not finished:feed_dict_val, batch_labels, finished, _  = minibatch_iter.incremental_node_val_feed_dict(size, iter_num, test=test)node_outs_val = sess.run([model.preds, model.loss], feed_dict=feed_dict_val)val_preds.append(node_outs_val[0])labels.append(batch_labels)val_losses.append(node_outs_val[1])iter_num += 1val_preds = np.vstack(val_preds)labels = np.vstack(labels)f1_scores = calc_f1(labels, val_preds)return np.mean(val_losses), f1_scores[0], f1_scores[1], (time.time() - t_test)def construct_placeholders(num_classes):# Define placeholdersplaceholders = {'labels' : tf.placeholder(tf.float32, shape=(None, num_classes), name='labels'),'batch' : tf.placeholder(tf.int32, shape=(None), name='batch1'),'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),'batch_size' : tf.placeholder(tf.int32, name='batch_size'),}return placeholdersdef train(train_data, test_data=None):G = train_data[0]features = train_data[1]id_map = train_data[2]class_map  = train_data[4]if isinstance(list(class_map.values())[0], list):num_classes = len(list(class_map.values())[0])else:num_classes = len(set(class_map.values()))if not features is None:# pad with dummy zero vectorfeatures = np.vstack([features, np.zeros((features.shape[1],))])context_pairs = train_data[3] if FLAGS.random_context else Noneplaceholders = construct_placeholders(num_classes)minibatch = NodeMinibatchIterator(G, id_map,placeholders, class_map,num_classes,batch_size=FLAGS.batch_size,max_degree=FLAGS.max_degree, context_pairs = context_pairs)adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")if FLAGS.model == 'graphsage_mean':# Create modelsampler = UniformNeighborSampler(adj_info)if FLAGS.samples_3 != 0:layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2),SAGEInfo("node", sampler, FLAGS.samples_3, FLAGS.dim_2)]elif FLAGS.samples_2 != 0:layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]else:layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1)]model = SupervisedGraphsage(num_classes, placeholders, features,adj_info,minibatch.deg,layer_infos, model_size=FLAGS.model_size,sigmoid_loss = FLAGS.sigmoid,identity_dim = FLAGS.identity_dim,logging=True)elif FLAGS.model == 'gcn':# Create modelsampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2*FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, 2*FLAGS.dim_2)]model = SupervisedGraphsage(num_classes, placeholders, features,adj_info,minibatch.deg,layer_infos=layer_infos, 
aggregator_type="gcn",model_size=FLAGS.model_size,concat=False,sigmoid_loss = FLAGS.sigmoid,identity_dim = FLAGS.identity_dim,logging=True)elif FLAGS.model == 'graphsage_seq':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SupervisedGraphsage(num_classes, placeholders, features,adj_info,minibatch.deg,layer_infos=layer_infos, aggregator_type="seq",model_size=FLAGS.model_size,sigmoid_loss = FLAGS.sigmoid,identity_dim = FLAGS.identity_dim,logging=True)elif FLAGS.model == 'graphsage_maxpool':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SupervisedGraphsage(num_classes, placeholders, features,adj_info,minibatch.deg,layer_infos=layer_infos, aggregator_type="maxpool",model_size=FLAGS.model_size,sigmoid_loss = FLAGS.sigmoid,identity_dim = FLAGS.identity_dim,logging=True)elif FLAGS.model == 'graphsage_meanpool':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SupervisedGraphsage(num_classes, placeholders, features,adj_info,minibatch.deg,layer_infos=layer_infos, aggregator_type="meanpool",model_size=FLAGS.model_size,sigmoid_loss = FLAGS.sigmoid,identity_dim = FLAGS.identity_dim,logging=True)else:raise Exception('Error: model name unrecognized.')config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)config.gpu_options.allow_growth = True#config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTIONconfig.allow_soft_placement = True# Initialize sessionsess = tf.Session(config=config)merged = tf.summary.merge_all()summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)# Init variablessess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})# Train modeltotal_steps = 0avg_time = 0.0epoch_val_costs = []train_adj_info = tf.assign(adj_info, minibatch.adj)val_adj_info = tf.assign(adj_info, minibatch.test_adj)for epoch in range(FLAGS.epochs): minibatch.shuffle() iter = 0print('Epoch: %04d' % (epoch + 1))epoch_val_costs.append(0)while not minibatch.end():# Construct feed dictionaryfeed_dict, labels = minibatch.next_minibatch_feed_dict()feed_dict.update({placeholders['dropout']: FLAGS.dropout})t = time.time()# Training stepouts = sess.run([merged, model.opt_op, model.loss, model.preds], feed_dict=feed_dict)train_cost = outs[2]if iter % FLAGS.validate_iter == 0:# Validationsess.run(val_adj_info.op)if FLAGS.validate_batch_size == -1:val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)else:val_cost, val_f1_mic, val_f1_mac, duration = evaluate(sess, model, minibatch, FLAGS.validate_batch_size)sess.run(train_adj_info.op)epoch_val_costs[-1] += val_costif total_steps % FLAGS.print_every == 0:summary_writer.add_summary(outs[0], total_steps)# Print resultsavg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)if total_steps % FLAGS.print_every == 0:train_f1_mic, train_f1_mac = calc_f1(labels, outs[-1])print("Iter:", '%04d' % iter, "train_loss=", "{:.5f}".format(train_cost),"train_f1_mic=", "{:.5f}".format(train_f1_mic), "train_f1_mac=", "{:.5f}".format(train_f1_mac), "val_loss=", "{:.5f}".format(val_cost),"val_f1_mic=", "{:.5f}".format(val_f1_mic), "val_f1_mac=", "{:.5f}".format(val_f1_mac), "time=", 
"{:.5f}".format(avg_time))iter += 1total_steps += 1if total_steps > FLAGS.max_total_steps:breakif total_steps > FLAGS.max_total_steps:breakprint("Optimization Finished!")sess.run(val_adj_info.op)val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size)print("Full validation stats:","loss=", "{:.5f}".format(val_cost),"f1_micro=", "{:.5f}".format(val_f1_mic),"f1_macro=", "{:.5f}".format(val_f1_mac),"time=", "{:.5f}".format(duration))with open(log_dir() + "val_stats.txt", "w") as fp:fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f} time={:.5f}".format(val_cost, val_f1_mic, val_f1_mac, duration))print("Writing test set stats to file (don't peak!)")val_cost, val_f1_mic, val_f1_mac, duration = incremental_evaluate(sess, model, minibatch, FLAGS.batch_size, test=True)with open(log_dir() + "test_stats.txt", "w") as fp:fp.write("loss={:.5f} f1_micro={:.5f} f1_macro={:.5f}".format(val_cost, val_f1_mic, val_f1_mac))def main(argv=None):print("Loading training data..")train_data = load_data(FLAGS.train_prefix)print("Done loading training data..")train(train_data)if __name__ == '__main__':tf.app.run()

unsupervised_train.py

unsupervised_train.py完整代码

from __future__ import division
from __future__ import print_functionimport os
import time
import tensorflow as tf
import numpy as npfrom graphsage.models import SampleAndAggregate, SAGEInfo, Node2VecModel
from graphsage.minibatch import EdgeMinibatchIterator
from graphsage.neigh_samplers import UniformNeighborSampler
from graphsage.utils import load_dataos.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"# Set random seed
seed = 123
np.random.seed(seed)
tf.set_random_seed(seed)# Settings
flags = tf.app.flags
FLAGS = flags.FLAGStf.app.flags.DEFINE_boolean('log_device_placement', False,"""Whether to log device placement.""")
# core params..
flags.DEFINE_string('model', 'graphsage_maxpool', 'model names. See README for possible values.')
flags.DEFINE_float('learning_rate', 0.00001, 'initial learning rate.')
flags.DEFINE_string("model_size", "small", "Can be big or small; model specific def'ns")
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi','name of the object file that stores the training data. must be specified.')# left to default values in main experiments
flags.DEFINE_integer('epochs', 1, 'number of epochs to train.')
flags.DEFINE_float('dropout', 0.0, 'dropout rate (1 - keep probability).')
flags.DEFINE_float('weight_decay', 0.0, 'weight for l2 loss on embedding matrix.')
flags.DEFINE_integer('max_degree', 100, 'maximum node degree.')#对应论文中的K = 1 ,第一层S1 = 25; K = 2 ,第二层S2 = 10。
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of users samples in layer 2')#若有concat操作,则维度变为2倍
flags.DEFINE_integer('dim_1', 128, 'Size of output dim (final is 2x this, if using concat)')
flags.DEFINE_integer('dim_2', 128, 'Size of output dim (final is 2x this, if using concat)')flags.DEFINE_boolean('random_context', True, 'Whether to use random context or direct edges')
flags.DEFINE_integer('neg_sample_size', 20, 'number of negative samples')
flags.DEFINE_integer('batch_size', 512, 'minibatch size.')
flags.DEFINE_integer('n2v_test_epochs', 1, 'Number of new SGD epochs for n2v.')
flags.DEFINE_integer('identity_dim', 0,'Set to positive value to use identity embedding features of that dimension. Default 0.')# logging, saving, validation settings etc.
flags.DEFINE_boolean('save_embeddings', True, 'whether to save embeddings for all nodes after training')
flags.DEFINE_string('base_log_dir', '.', 'base directory for logging and saving embeddings')
flags.DEFINE_integer('validate_iter', 5000, "how often to run a validation minibatch.")
flags.DEFINE_integer('validate_batch_size', 256, "how many nodes per validation sample.")
flags.DEFINE_integer('gpu', 0, "which gpu to use.")
flags.DEFINE_integer('print_every', 50, "How often to print training info.")
flags.DEFINE_integer('max_total_steps', 10 ** 10, "Maximum total number of iterations")os.environ["CUDA_VISIBLE_DEVICES"] = str(FLAGS.gpu)  # 使用哪一块gpu,本人只有一块,需将1改为0
#
GPU_MEM_FRACTION = 0.8def log_dir():log_dir = FLAGS.base_log_dir + "/unsup-" + FLAGS.train_prefix.split("/")[-2]log_dir += "/{model:s}_{model_size:s}_{lr:0.6f}/".format(model=FLAGS.model,model_size=FLAGS.model_size,lr=FLAGS.learning_rate)if not os.path.exists(log_dir):os.makedirs(log_dir)return log_dir# Define model evaluation function
def evaluate(sess, model, minibatch_iter, size=None):t_test = time.time()feed_dict_val = minibatch_iter.val_feed_dict(size)outs_val = sess.run([model.loss, model.ranks, model.mrr],feed_dict=feed_dict_val)return outs_val[0], outs_val[1], outs_val[2], (time.time() - t_test)def incremental_evaluate(sess, model, minibatch_iter, size):t_test = time.time()finished = Falseval_losses = []val_mrrs = []iter_num = 0while not finished:feed_dict_val, finished, _ = minibatch_iter.incremental_val_feed_dict(size, iter_num)iter_num += 1outs_val = sess.run([model.loss, model.ranks, model.mrr],feed_dict=feed_dict_val)val_losses.append(outs_val[0])val_mrrs.append(outs_val[2])return np.mean(val_losses), np.mean(val_mrrs), (time.time() - t_test)def save_val_embeddings(sess, model, minibatch_iter, size, out_dir, mod=""):val_embeddings = []finished = Falseseen = set([])nodes = []iter_num = 0name = "val"while not finished:feed_dict_val, finished, edges = minibatch_iter.incremental_embed_feed_dict(size, iter_num)iter_num += 1outs_val = sess.run([model.loss, model.mrr, model.outputs1],feed_dict=feed_dict_val)# ONLY SAVE FOR embeds1 because of planetoidfor i, edge in enumerate(edges):if not edge[0] in seen:val_embeddings.append(outs_val[-1][i, :])nodes.append(edge[0])seen.add(edge[0])if not os.path.exists(out_dir):os.makedirs(out_dir)val_embeddings = np.vstack(val_embeddings)np.save(out_dir + name + mod + ".npy", val_embeddings)with open(out_dir + name + mod + ".txt", "w") as fp:fp.write("\n".join(map(str, nodes)))def construct_placeholders():# Define placeholdersplaceholders = {'batch1': tf.placeholder(tf.int32, shape=(None), name='batch1'),'batch2': tf.placeholder(tf.int32, shape=(None), name='batch2'),# negative samples for all nodes in the batch ,所有nodes均为负样本'neg_samples': tf.placeholder(tf.int32, shape=(None,),name='neg_sample_size'),'dropout': tf.placeholder_with_default(0., shape=(), name='dropout'),'batch_size': tf.placeholder(tf.int32, name='batch_size'),}return placeholdersdef train(train_data, test_data=None):G = train_data[0]features = train_data[1]   # 训练数据的featuresid_map = train_data[2] ## "n" : n;已经删除了节点是不具有'val'或'test'属性 的节点class_map=train_data[4]# print("class_map:",class_map)print("G:", G)# G: disjoint_union(, )print("features:", features)# features: [[-0.08760569 - 0.08760569 - 0.1132336... - 0.13184157 - 0.14681277#             - 0.14717815]#            [-0.08760569 - 0.08760569 - 0.1132336... - 0.13184157 - 0.14681277#             - 0.14717815]#            ...#           ]print("id_map:", id_map)# id_map: {0: 0, 1: 1,...,14754: 14754}print("feature.length",len(features))#feature.length 14755print("features.shape:", features.shape)# features.shape: (14755, 50)if not features is None:# pad with dummy zero vector#vstack为features添加列一行0向量,用于WX + b中与b相加features = np.vstack([features, np.zeros((features.shape[1],))])print("features.shape:", features.shape)# features.shape: (14756, 50)print(features[14755])  # 添加一个0向量# [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.#  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.#  0. 
0.]context_pairs = train_data[3] if FLAGS.random_context else None  # #random walk的点对placeholders = construct_placeholders()minibatch = EdgeMinibatchIterator(G,id_map,placeholders, batch_size=FLAGS.batch_size,max_degree=FLAGS.max_degree,num_neg_samples=FLAGS.neg_sample_size,context_pairs=context_pairs)adj_info_ph = tf.placeholder(tf.int32, shape=minibatch.adj.shape)adj_info = tf.Variable(adj_info_ph, trainable=False, name="adj_info")if FLAGS.model == 'graphsage_mean':# Create modelsampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SampleAndAggregate(placeholders,features,adj_info,minibatch.deg,layer_infos=layer_infos,model_size=FLAGS.model_size,identity_dim=FLAGS.identity_dim,logging=True)elif FLAGS.model == 'gcn':# Create modelsampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, 2 * FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, 2 * FLAGS.dim_2)]model = SampleAndAggregate(placeholders,features,adj_info,minibatch.deg,layer_infos=layer_infos,aggregator_type="gcn",model_size=FLAGS.model_size,identity_dim=FLAGS.identity_dim,concat=False,logging=True)elif FLAGS.model == 'graphsage_seq':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SampleAndAggregate(placeholders,features,adj_info,minibatch.deg,layer_infos=layer_infos,identity_dim=FLAGS.identity_dim,aggregator_type="seq",model_size=FLAGS.model_size,logging=True)elif FLAGS.model == 'graphsage_maxpool':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SampleAndAggregate(placeholders,features,adj_info,minibatch.deg,layer_infos=layer_infos,aggregator_type="maxpool",model_size=FLAGS.model_size,identity_dim=FLAGS.identity_dim,logging=True)elif FLAGS.model == 'graphsage_meanpool':sampler = UniformNeighborSampler(adj_info)layer_infos = [SAGEInfo("node", sampler, FLAGS.samples_1, FLAGS.dim_1),SAGEInfo("node", sampler, FLAGS.samples_2, FLAGS.dim_2)]model = SampleAndAggregate(placeholders,features,adj_info,minibatch.deg,layer_infos=layer_infos,aggregator_type="meanpool",model_size=FLAGS.model_size,identity_dim=FLAGS.identity_dim,logging=True)elif FLAGS.model == 'n2v':  # node2vecmodel = Node2VecModel(placeholders, features.shape[0],minibatch.deg,# 2x because graphsage uses concatnodevec_dim=2 * FLAGS.dim_1,lr=FLAGS.learning_rate)else:raise Exception('Error: model name unrecognized.')config = tf.ConfigProto(log_device_placement=FLAGS.log_device_placement)config.gpu_options.allow_growth = True  # 使用allow_growth option,刚一开始分配少量的GPU容量,然后按需慢慢的增加# 设置每个GPU应该拿出多少容量给进程使用,# per_process_gpu_memory_fraction =0.4代表 40%# config.gpu_options.per_process_gpu_memory_fraction = GPU_MEM_FRACTION# 自动选择运行设备# 在tf中,通过命令 "with tf.device('/cpu:0'):",允许手动设置操作运行的设备# 如果手动设置的设备不存在或者不可用,就会导致tf程序等待或异常,# 为了防止这种情况,可以设置tf.ConfigProto()中参数allow_soft_placement=True,# 允许tf自动选择一个存在并且可用的设备来运行操作。config.allow_soft_placement = True  # 如果指定的设备不存在,允许TF自动分配设备# Initialize sessionsess = tf.Session(config=config)# tf.summary()能够保存训练过程以及参数分布图并在tensorboard显示。# merge_all 可以将所有summary全部保存到磁盘,以便tensorboard显示。# 如果没有特殊要求,一般用这一句就可一显示训练时的各种信息了merged = tf.summary.merge_all()# 指定一个文件用来保存图# 格式:tf.summary.FileWritter(path,sess.graph)# 
可以调用其add_summary()方法将训练过程数据保存在filewriter指定的文件中summary_writer = tf.summary.FileWriter(log_dir(), sess.graph)# Init variablessess.run(tf.global_variables_initializer(), feed_dict={adj_info_ph: minibatch.adj})# Train modeltrain_shadow_mrr = Noneshadow_mrr = Nonetotal_steps = 0avg_time = 0.0epoch_val_costs = []train_adj_info = tf.assign(adj_info, minibatch.adj)val_adj_info = tf.assign(adj_info, minibatch.test_adj)for epoch in range(FLAGS.epochs):minibatch.shuffle()iter = 0print('Epoch: %04d' % (epoch + 1))epoch_val_costs.append(0)while not minibatch.end():# Construct feed dictionaryfeed_dict = minibatch.next_minibatch_feed_dict()feed_dict.update({placeholders['dropout']: FLAGS.dropout})t = time.time()# Training stepouts = sess.run([merged, model.opt_op, model.loss, model.ranks, model.aff_all,model.mrr, model.outputs1], feed_dict=feed_dict)train_cost = outs[2]train_mrr = outs[5]if train_shadow_mrr is None:train_shadow_mrr = train_mrr  #else:train_shadow_mrr -= (1 - 0.99) * (train_shadow_mrr - train_mrr)if iter % FLAGS.validate_iter == 0:# Validationsess.run(val_adj_info.op)val_cost, ranks, val_mrr, duration = evaluate(sess, model, minibatch, size=FLAGS.validate_batch_size)sess.run(train_adj_info.op)epoch_val_costs[-1] += val_costif shadow_mrr is None:shadow_mrr = val_mrrelse:shadow_mrr -= (1 - 0.99) * (shadow_mrr - val_mrr)if total_steps % FLAGS.print_every == 0:summary_writer.add_summary(outs[0], total_steps)# Print resultsavg_time = (avg_time * total_steps + time.time() - t) / (total_steps + 1)if total_steps % FLAGS.print_every == 0:print("Iter:", '%04d' % iter,"train_loss=", "{:.5f}".format(train_cost),"train_mrr=", "{:.5f}".format(train_mrr),"train_mrr_ema=", "{:.5f}".format(train_shadow_mrr),  # exponential moving average,指数滑动平均EMA"val_loss=", "{:.5f}".format(val_cost),"val_mrr=", "{:.5f}".format(val_mrr),"val_mrr_ema=", "{:.5f}".format(shadow_mrr),  # exponential moving average"time=", "{:.5f}".format(avg_time))iter += 1total_steps += 1if total_steps > FLAGS.max_total_steps:breakif total_steps > FLAGS.max_total_steps:breakprint("Optimization Finished!")if FLAGS.save_embeddings:  # 训练以后是否存储节点的embeddingssess.run(val_adj_info.op)save_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir())if FLAGS.model == "n2v":# stopping the gradient for the already trained nodestrain_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if not G.node[n]['val'] and not G.node[n]['test']],dtype=tf.int32)test_ids = tf.constant([[id_map[n]] for n in G.nodes_iter() if G.node[n]['val'] or G.node[n]['test']],dtype=tf.int32)update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(test_ids))no_update_nodes = tf.nn.embedding_lookup(model.context_embeds, tf.squeeze(train_ids))update_nodes = tf.scatter_nd(test_ids, update_nodes, tf.shape(model.context_embeds))no_update_nodes = tf.stop_gradient(tf.scatter_nd(train_ids, no_update_nodes, tf.shape(model.context_embeds)))model.context_embeds = update_nodes + no_update_nodessess.run(model.context_embeds)# run random walksfrom graphsage.utils import run_random_walksnodes = [n for n in G.nodes_iter() if G.node[n]["val"] or G.node[n]["test"]]start_time = time.time()pairs = run_random_walks(G, nodes, num_walks=50)walk_time = time.time() - start_timetest_minibatch = EdgeMinibatchIterator(G,id_map,placeholders, batch_size=FLAGS.batch_size,max_degree=FLAGS.max_degree,num_neg_samples=FLAGS.neg_sample_size,context_pairs=pairs,n2v_retrain=True,fixed_n2v=True)start_time = time.time()print("Doing test training for n2v.")test_steps = 0for epoch in 
range(FLAGS.n2v_test_epochs):test_minibatch.shuffle()while not test_minibatch.end():feed_dict = test_minibatch.next_minibatch_feed_dict()feed_dict.update({placeholders['dropout']: FLAGS.dropout})outs = sess.run([model.opt_op, model.loss, model.ranks, model.aff_all,model.mrr, model.outputs1], feed_dict=feed_dict)if test_steps % FLAGS.print_every == 0:print("Iter:", '%04d' % test_steps,"train_loss=", "{:.5f}".format(outs[1]),"train_mrr=", "{:.5f}".format(outs[-2]))test_steps += 1train_time = time.time() - start_timesave_val_embeddings(sess, model, minibatch, FLAGS.validate_batch_size, log_dir(), mod="-test")print("Total time: ", train_time + walk_time)print("Walk time: ", walk_time)print("Train time: ", train_time)# main函数,加载数据并训练
def main(argv=None):print("Loading training data..")train_data = load_data(FLAGS.train_prefix, load_walks=True)print("Done loading training data..")train(train_data)if __name__ == '__main__':tf.app.run()
# tf.app.run()的作用:通过处理flag解析,然后执行main函数
# 如果代码中的入口函数不叫main(),而是一个其他名字的函数,如test(),则应该这样写入口tf.app.run(test)
# 如果代码中的入口函数叫main(),则可以把入口写成tf.app.run()
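A minimal, hypothetical script illustrating the entry-point behaviour described in the comments above (TF 1.x flags API):

import tensorflow as tf

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('train_prefix', '../example_data/toy-ppi', 'prefix identifying training data.')

def main(argv=None):
    # Flags are parsed before main() is invoked.
    print("train_prefix =", FLAGS.train_prefix)

def test(argv=None):
    print("custom entry point, train_prefix =", FLAGS.train_prefix)

if __name__ == '__main__':
    tf.app.run()         # parses flags, then calls main()
    # tf.app.run(test)   # would call test() instead of main()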

inits.py

  • Glorot (Xavier) initialization keeps the variance consistent across layers: in the forward pass the variance of each layer's activations stays roughly constant, and in the backward pass the variance of each layer's gradients stays roughly constant. The parameters are drawn from a uniform distribution whose range is determined by the layer's fan-in and fan-out, i.e. U(-sqrt(6/(n_in + n_out)), sqrt(6/(n_in + n_out))).
    (Derivation: https://blog.csdn.net/yyl424525/article/details/100823398#4_Xavier_21)

This part is the same as in the GCN code:

import tensorflow as tf
import numpy as np

# Produce a Tensor of the given shape with values drawn uniformly from (-scale, scale),
# i.e. (-0.05, 0.05) by default.
def uniform(shape, scale=0.05, name=None):
    """Uniform init."""
    initial = tf.random_uniform(shape, minval=-scale, maxval=scale, dtype=tf.float32)
    return tf.Variable(initial, name=name)

def glorot(shape, name=None):
    """Glorot & Bengio (AISTATS 2010) init."""
    init_range = np.sqrt(6.0/(shape[0]+shape[1]))
    initial = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
    return tf.Variable(initial, name=name)

# Produce a Tensor of the given shape filled with zeros.
def zeros(shape, name=None):
    """All zeros."""
    initial = tf.zeros(shape, dtype=tf.float32)
    return tf.Variable(initial, name=name)

# Produce a Tensor of the given shape filled with ones.
def ones(shape, name=None):
    """All ones."""
    initial = tf.ones(shape, dtype=tf.float32)
    return tf.Variable(initial, name=name)
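As a quick sanity check of glorot(), for a weight matrix of shape [50, 128] (for example, the PPI feature dimension by dim_1; the shape here is only illustrative) the range works out to sqrt(6/(50+128)) ≈ 0.18:

import numpy as np
import tensorflow as tf

shape = [50, 128]   # illustrative fan-in x fan-out
init_range = np.sqrt(6.0 / (shape[0] + shape[1]))
print(init_range)   # ~0.1836

w = tf.random_uniform(shape, minval=-init_range, maxval=init_range, dtype=tf.float32)
with tf.Session() as sess:
    vals = sess.run(w)
    print(vals.min(), vals.max())   # both inside (-0.1836, 0.1836)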

其他

citation_eval.py

ppi_eval.py

reddit_eval.py

参考

博客园 listenviolet 的GraphSAGE 代码解析系列
