Table of Contents

1. Paper Reading
- 1.1. Preface
- 1.2. Reading Notes
  - 1.2.1 Motivation
  - 1.2.2 Related Work
  - 1.2.3 Social Influence
    - 1.2.3.1 r-neighbors
    - 1.2.3.2 Social Action
    - 1.2.3.3 Social Influence Locality
  - 1.2.4 DeepInf
- 1.3. Summary of the Paper
2. Code Reading
- 2.1. Notes
- 2.2. Reading the DeepInf Code
  - 2.2.1 A First Pass Through the Code
  - 2.2.2 Datasets

1. Paper Reading

1.1. Preface

I came across an interesting paper through a WeChat public account:

PDF: https://arxiv.org/pdf/1807.05560.pdf

Code: https://github.com/xptree/DeepInf

1.2. Reading Notes

DeepInf: Social Influence Prediction with Deep Learning

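1.2.1 Motivation

In the abstract, the authors state that the paper studies "social influence prediction".

Traditional social influence prediction methods rely on hand-crafted rules to extract user and network features, so they are limited by the designer's domain knowledge.
The authors therefore design a deep neural network, DeepInf, that learns latent feature representations of users and uses them to predict social influence.
Specifically, the paper proposes several strategies for incorporating both network structure and user features.

So which methods does the paper regard as traditional? A good place to start is the list of baselines: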

Logistic Regression (LR).
Support Vector Machine (SVM).
PSCN. In the authors' words: "As we model social influence locality prediction as a graph classification problem, we compare our framework with the state-of-the-art graph classification models, PSCN[34]." In other words, social influence locality prediction is cast as a graph classification problem, so DeepInf is compared against a graph classification model.

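Why do these three baselines count as the "traditional" social influence prediction methods the paper describes?
The authors split them into two groups. LR and SVM both use three kinds of features (Vertex, Embedding, Ego), as listed in the paper's feature table.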

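Vertex and Embedding are easy to understand: node features and a 64-dimensional DeepWalk embedding. But what does "active neighbors" in the Ego features mean? I looked up the reference cited for it:

Group formation in large social networks: membership, growth, and evolution.

That paper contains the sentence "Of those communities which had at least 1 post, we selected the 700 most active communities along with 300 at random from the others with at least 1 post."
That is, it selects a predefined number of the most active communities (those with posts) plus a random sample of the rest.

My first reading was that the Ego feature is simply the authors' own notion of "most active". The quoted sentence, however, describes how communities were sampled rather than defining the term. In DeepInf, a neighbor is more naturally read as "active" when it has already performed the action (for example, a retweet), which matches the action status introduced in section 1.2.3.2.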

Back to the main point: for LR and SVM, the three kinds of features above (Vertex, Embedding, Ego) serve as the classifier's training input.
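To make the Ego features concrete, here is a minimal sketch. It assumes, based on my reading of the paper's feature table, that they are the number and ratio of active neighbors, the density of the subgraph induced by the active neighbors, and the number of connected components in that subgraph. The function name `ego_features` and the toy graph are hypothetical, not taken from the DeepInf repository.

```python
from collections import deque


def ego_features(adj, ego, active):
    """Compute hand-crafted Ego features for one ego user.

    adj    : dict mapping node -> set of neighbor nodes (undirected graph)
    ego    : the ego user
    active : set of nodes whose action status is 1
    """
    neighbors = adj[ego]
    active_nbrs = neighbors & active

    n_active = len(active_nbrs)
    ratio = n_active / len(neighbors) if neighbors else 0.0

    # Edges of the subgraph induced by the active neighbors (each counted once).
    n_edges = sum(len(adj[u] & active_nbrs) for u in active_nbrs) // 2
    max_edges = n_active * (n_active - 1) // 2
    density = n_edges / max_edges if max_edges else 0.0

    # Connected components inside the induced subgraph, found by BFS.
    seen, components = set(), 0
    for start in active_nbrs:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u] & active_nbrs:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)

    return {"n_active": n_active, "ratio": ratio,
            "density": density, "components": components}


# Toy graph: ego 0 has neighbors 1..4; nodes 1 and 2 are linked; 1, 2, 3 are active.
adj = {0: {1, 2, 3, 4}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}}
feats = ego_features(adj, ego=0, active={1, 2, 3})
print(feats)
```

A feature vector like this, concatenated with the vertex features and the embedding, is what a classifier such as LR or SVM would be trained on.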

Note that PSCN comes from the paper "Learning Convolutional Neural Networks for Graphs" (ICML 2016), which is also a classification study based on graph neural networks.

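1.2.2 Related Work

From the previous section, this paper (DeepInf: Social Influence Prediction with Deep Learning) can be seen as a graph classification study that uses graph deep neural networks. How, then, does that relate to "social influence prediction"? I read the introduction with this question in mind.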

Social influence "refers to the phenomenon that a person’s emotions, opinions, or behaviors are affected by others."
"there is little doubt that social influence has become a prevalent, yet complex force that drives our social decisions, making a clear need for methodologies to characterize, understand, and quantify the underlying mechanisms and dynamics of social influence."

Put simply, social influence is the effect that other people have on a person's opinions, emotions, and decisions; it has been studied in [26, 32, 42, 43].
The authors' goal: "We aim to predict the action status of a user given the action statuses of her near neighbors and her local structural information."

The procedure, in the authors' words: "DeepInf, to represent both influence dynamics and network structures into a latent space. To predict the action status of a user v, we first sample her local neighbors through random walks with restart. After obtaining a local network as shown in Figure 1, we leverage both graph convolution and attention techniques to learn latent predictive signals."

Random walks with restart are used to sample each user's local structure; social influence itself is formalized in Section 2 of the paper.

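1.2.3 Social Influence

1.2.3.1 r-neighbors

The r-neighbors of a node v are the set of nodes whose shortest-path distance to v is at most r.

The subgraph induced by this set is called the r-ego network of v.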
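The two definitions above can be sketched with a depth-limited breadth-first search. This is a minimal illustration on an unweighted, undirected graph stored as a dictionary of neighbor sets; the function names `r_neighbors` and `r_ego_network` are my own and do not come from the DeepInf code.

```python
from collections import deque


def r_neighbors(adj, v, r):
    """Return {u: d(u, v)} for all u with shortest-path distance <= r."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue  # do not expand beyond radius r
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def r_ego_network(adj, v, r):
    """Return the subgraph induced by the r-neighbors of v."""
    nodes = set(r_neighbors(adj, v, r))
    return {u: adj[u] & nodes for u in nodes}


# Path graph 0 - 1 - 2 - 3 - 4
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
within_one_hop = sorted(r_neighbors(adj, 2, 1))
ego = r_ego_network(adj, 0, 2)
print(within_one_hop)
print(ego)
```

Note that under this definition v belongs to its own r-neighbors, since its distance to itself is 0.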

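1.2.3.2 Social Action

The paper's formal description is neat and worth noting: each user has a binary action status at each timestamp, indicating whether the user has performed the action (for example, a retweet) by that time.

1.2.3.3 Social Influence Locality

Building on the two definitions above, the paper introduces the notion of "social influence locality". Because "Social Action" represents retweeting behavior over time, "social influence locality" is also defined with respect to time: given the r-ego network of a user and the action statuses of her neighbors at time t, it models the probability that she is activated at the next time step.

Given N instances, the overall social influence prediction objective aggregates this prediction loss over all N instances.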
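The definitions in sections 1.2.3.1 to 1.2.3.3 can be written compactly as follows. This is a reconstruction from the prose above in the notation I recall from the paper, so treat the symbols as assumptions to check against the original: $d(u, v)$ is the shortest-path distance, $s_u^t$ the action status of user $u$ at time $t$, $\Delta t$ the prediction horizon, and $\Theta$ the model parameters. The last line assumes a negative log-likelihood loss.

```latex
\begin{align}
\Gamma_v^r &= \{\, u : d(u, v) \le r \,\} \\
S_v^t &= \{\, s_u^t : u \in \Gamma_v^r \setminus \{v\} \,\}, \qquad s_u^t \in \{0, 1\} \\
&P\left(s_v^{t+\Delta t} \mid G_v^r, S_v^t\right) \\
\mathcal{L}(\Theta) &= -\sum_{i=1}^{N} \log P_\Theta\left(s_{v_i}^{t_i+\Delta t} \mid G_{v_i}^r, S_{v_i}^{t_i}\right)
\end{align}
```

The first line is the r-neighbors set, the second the neighbors' action statuses, the third the activation probability, and the fourth the objective over N instances.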

1.2.4 DeepInf

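Step 1: neighbor sampling. The r-ego network of node v, denoted $G^{r}_v$, can be extracted with BFS. Because of the small-world property, however, this network can be very large, so its size is controlled by what amounts to a second round of sampling on the r-ego network: a random walk with restart. The walk is biased, moving to a neighbor with probability proportional to the edge weight, and at every step it returns to the starting node with some probability.
Step 2: the neural network model. Its purpose is to combine each node's structural properties with its action status, and finally to predict the user's retweet status (0 or 1).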
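Step 1 can be sketched as a random walk with restart that stops once a fixed number of distinct nodes has been collected. This is a minimal illustration, not the authors' preprocessing code: the function name `rwr_sample`, the default restart probability, the step limit, and the toy graph are all assumptions made for the example.

```python
import random


def rwr_sample(adj, start, size, restart_prob=0.8, max_steps=10_000, seed=0):
    """Sample up to `size` nodes around `start` by random walk with restart.

    adj maps node -> {neighbor: edge_weight} with positive weights. At each
    step the walker either jumps back to `start` (with probability
    restart_prob) or moves to a neighbor chosen with probability
    proportional to the edge weight.
    """
    rng = random.Random(seed)
    sampled = {start}
    current = start
    for _ in range(max_steps):
        if len(sampled) >= size:
            break
        if rng.random() < restart_prob or not adj[current]:
            current = start
            continue
        nbrs = list(adj[current])
        weights = [adj[current][n] for n in nbrs]
        current = rng.choices(nbrs, weights=weights, k=1)[0]
        sampled.add(current)
    return sampled


# Toy weighted graph (hypothetical weights).
adj = {
    "a": {"b": 2.0, "c": 1.0},
    "b": {"a": 2.0, "d": 1.0},
    "c": {"a": 1.0, "e": 1.0},
    "d": {"b": 1.0},
    "e": {"c": 1.0, "f": 1.0},
    "f": {"e": 1.0},
}
sample = rwr_sample(adj, start="a", size=4)
print(sample)
```

A high restart probability keeps the sample close to the ego user, and the size cap gives every instance a local network of bounded size regardless of how large the full r-ego network is.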

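Note that the features in (d) consist of two parts: two binary flags (whether the node is active, whether it is the ego user) and the node's network embedding.

The middle part of the framework figure is fairly standard, so I do not cover it here. What matters most is the final comparison, where the task becomes a pure classification problem: will the user retweet or not?

The comparison experiments likewise use logistic regression, support vector machines, and so on as classifiers, and compare how well each one classifies.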

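1.3. Summary of the Paper

From the reading above, the paper "DeepInf: Social Influence Prediction with Deep Learning"
addresses the problem of embedding the factors that may influence a user and then predicting whether the user will retweet a message.

Personally, I find the phrase "Social Influence Prediction" in the title rather grand.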

2. Code Reading

2.1. Notes

The code is at https://github.com/xptree/DeepInf

It helps to keep the paper's framework figure in mind while reading the code.

2.2. Reading the DeepInf Code

In train.py we find:

parser.add_argument('--model', type=str, default='gcn', help="models used")

The model argument defaults to gcn. The code that follows reads:

if args.model == "pscn":
    influence_dataset = PatchySanDataSet(
        args.file_dir, args.dim, args.seed, args.shuffle, args.model,
        sequence_size=args.sequence_size, stride=1, neighbor_size=args.neighbor_size)
else:
    influence_dataset = InfluenceDataSet(
        args.file_dir, args.dim, args.seed, args.shuffle, args.model)

So the dataset classes for pscn and gcn differ. Let us look at how the dataset is built for the gcn model.

2.2.1 A First Pass Through the Code

class InfluenceDataSet(Dataset):

This dataset class inherits from PyTorch's Dataset. In detail:

class InfluenceDataSet(Dataset):
    def __init__(self, file_dir, embedding_dim, seed, shuffle, model):
        self.graphs = np.load(os.path.join(file_dir, "adjacency_matrix.npy")).astype(np.float32)

        # self-loop trick, the input graphs should have no self-loop
        identity = np.identity(self.graphs.shape[1])
        self.graphs += identity
        self.graphs[self.graphs != 0] = 1.0
        if model == "gat" or model == "pscn":
            self.graphs = self.graphs.astype(np.dtype('B'))
        elif model == "gcn":
            # normalized graph laplacian for GCN: D^{-1/2}AD^{-1/2}
            for i in range(len(self.graphs)):
                graph = self.graphs[i]
                d_root_inv = 1. / np.sqrt(np.sum(graph, axis=1))
                graph = (graph.T * d_root_inv).T * d_root_inv
                self.graphs[i] = graph
        else:
            raise NotImplementedError
        logger.info("graphs loaded!")
        ...

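The constructor takes the file directory, the embedding dimension, the random seed, the shuffle flag, and the model name. The code above also shows that what gets loaded is a stack of graph adjacency matrices in NumPy format, one per instance.

The code then branches on the model. For gat or pscn it keeps the adjacency matrix with self-loops and casts it to a compact integer type; for gcn it applies the symmetric normalization D^{-1/2}AD^{-1/2}, which the code comment calls the normalized graph Laplacian. The call self.graphs.astype(np.dtype('B')) converts the data to unsigned 8-bit integers.

A quick test confirms that np.dtype('B') is uint8, the 8-bit unsigned integer type with range [0, 255].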
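Both branches can be checked on a toy input. The sketch below mirrors the operations in the excerpt above on a single 3-node path graph; the array names `as_uint8` and `normalized` are mine, and the toy graph is an assumption made for the example.

```python
import numpy as np

# A toy batch with one 3-node graph (path 0 - 1 - 2), no self-loops.
graphs = np.array([[[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]]], dtype=np.float32)

# Self-loop trick, as in the excerpt above.
graphs += np.identity(graphs.shape[1])
graphs[graphs != 0] = 1.0

# Branch for "gat"/"pscn": 'B' is the type code for unsigned 8-bit integers.
as_uint8 = graphs.astype(np.dtype('B'))
print(as_uint8.dtype)                                   # uint8
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)   # 0 255

# Branch for "gcn": symmetric normalization D^{-1/2} (A + I) D^{-1/2}.
normalized = graphs.copy()
for i in range(len(normalized)):
    graph = normalized[i]
    d_root_inv = 1.0 / np.sqrt(np.sum(graph, axis=1))
    normalized[i] = (graph.T * d_root_inv).T * d_root_inv
print(normalized[0])
```

With self-loops the degrees are 2, 3, 2, so the entry between nodes 0 and 1 becomes 1/sqrt(2 * 3) and the diagonal entry of node 0 becomes 1/2. Storing 0/1 matrices as uint8 rather than float32 also cuts memory use by a factor of four.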

class InfluenceDataSet(Dataset):
    def __init__(self, file_dir, embedding_dim, seed, shuffle, model):
        ...
        # whether a user has been influenced
        # whether he/she is the ego user
        self.influence_features = np.load(
            os.path.join(file_dir, "influence_feature.npy")).astype(np.float32)
        logger.info("influence features loaded!")

        self.labels = np.load(os.path.join(file_dir, "label.npy"))
        logger.info("labels loaded!")

        self.vertices = np.load(os.path.join(file_dir, "vertex_id.npy"))
        logger.info("vertex ids loaded!")
        ...
        if shuffle:
            self.graphs, self.influence_features, self.labels, self.vertices = \
                sklearn.utils.shuffle(
                    self.graphs, self.influence_features,
                    self.labels, self.vertices,
                    random_state=seed)

        vertex_features = np.load(os.path.join(file_dir, "vertex_feature.npy"))
        vertex_features = preprocessing.scale(vertex_features)
        self.vertex_features = torch.FloatTensor(vertex_features)
        logger.info("global vertex features loaded!")

        embedding_path = os.path.join(file_dir, "deepwalk.emb_%d" % embedding_dim)
        max_vertex_idx = np.max(self.vertices)
        embedding = load_w2v_feature(embedding_path, max_vertex_idx)
        self.embedding = torch.FloatTensor(embedding)
        logger.info("%d-dim embedding loaded!", embedding_dim)

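Next, the features mentioned in the paper are loaded, optionally followed by shuffling. Note that deepwalk.emb_%d is also a precomputed file that is simply read from disk.

Because the class inherits from PyTorch's Dataset, the __getitem__ method deserves attention:

def __getitem__(self, idx):
    return self.graphs[idx], self.influence_features[idx], self.labels[idx], self.vertices[idx]

So each item is a tuple of the graph structure (the adjacency matrix or its normalized form), the influence features, the label, and the corresponding vertex ids. The DeepWalk embedding is returned as a whole by a separate method:

def get_embedding(self):
    return self.embedding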

In train.py we see:

if args.model == "gcn":
    model = BatchGCN(pretrained_emb=influence_dataset.get_embedding(),
                     vertex_feature=influence_dataset.get_vertex_features(),
                     use_vertex_feature=args.use_vertex_feature,
                     n_units=n_units,
                     dropout=args.dropout,
                     instance_normalization=args.instance_normalization)

So the model is constructed with the DeepWalk output as the structural representation and the vertex features as node inputs; neither the adjacency matrix nor its normalized form is passed to the constructor. The vertex features are:

vertex_features = np.load(os.path.join(file_dir, "vertex_feature.npy"))
vertex_features = preprocessing.scale(vertex_features)
self.vertex_features = torch.FloatTensor(vertex_features)
logger.info("global vertex features loaded!")

These are again the authors' preprocessed node features in NumPy form. Note, however, how the dataset is split:

train_loader = DataLoader(influence_dataset, batch_size=args.batch,
                          sampler=ChunkSampler(valid_start - train_start, 0))
valid_loader = DataLoader(influence_dataset, batch_size=args.batch,
                          sampler=ChunkSampler(test_start - valid_start, valid_start))
test_loader = DataLoader(influence_dataset, batch_size=args.batch,
                         sampler=ChunkSampler(N - test_start, test_start))

Passing influence_dataset here means the __getitem__ method above is called, which brings back the graph matrices I was unsure about. Let us continue with the train function:
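The excerpts do not show ChunkSampler itself. Judging from its two arguments (a length and a start offset), it presumably yields a contiguous range of indices. The following is a dependency-free sketch under that assumption; a real sampler would subclass torch.utils.data.Sampler, and the split ratios below are hypothetical.

```python
class ChunkSampler:
    """Yield a contiguous block of dataset indices.

    Dependency-free stand-in; a real sampler would subclass
    torch.utils.data.Sampler so that DataLoader can consume it.
    """

    def __init__(self, num_samples, start=0):
        self.num_samples = num_samples
        self.start = start

    def __iter__(self):
        return iter(range(self.start, self.start + self.num_samples))

    def __len__(self):
        return self.num_samples


# Hypothetical split: 75% train, 12.5% validation, 12.5% test of N instances.
N = 16
train_start, valid_start, test_start = 0, int(N * 0.75), int(N * 0.875)

train_idx = list(ChunkSampler(valid_start - train_start, 0))
valid_idx = list(ChunkSampler(test_start - valid_start, valid_start))
test_idx = list(ChunkSampler(N - test_start, test_start))
print(train_idx, valid_idx, test_idx)
```

Because the dataset is shuffled once in its constructor with a fixed seed, slicing it into contiguous chunks still gives a random but reproducible train/validation/test split.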

def train(epoch, train_loader, valid_loader, test_loader, log_desc='train_'):
    model.train()

    loss = 0.
    total = 0.
    for i_batch, batch in enumerate(train_loader):
        graph, features, labels, vertices = batch
        ...
        output = model(features, vertices, graph)

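The model call above does use the structural matrix (graph), yet I could not find where the three models consume it, which remains an open question for me. Since graph is the third argument of model(...), it is presumably handled in each model's forward method rather than in its constructor.

To summarize: the DeepWalk embedding serves as the initial structural representation, and together with the features it is refined through multiple layers so that the node classification approaches the target. The composition of the dataset also deserves a closer look.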
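To see how the three arguments of model(features, vertices, graph) could fit together, here is a minimal NumPy sketch of one graph-convolution step. It is not the repository's BatchGCN: the function `gcn_layer`, the shapes, and the random stand-in data are assumptions chosen only to illustrate the data flow (look up embeddings by vertex id, append the two influence flags, then multiply by the normalized adjacency).

```python
import numpy as np


def gcn_layer(adj_norm, h, weight):
    """One graph-convolution step on a batch: relu(A_hat @ H @ W).

    adj_norm : (batch, n, n) normalized adjacency matrices
    h        : (batch, n, d_in) node features
    weight   : (d_in, d_out) weight matrix shared across the batch
    """
    return np.maximum(adj_norm @ h @ weight, 0.0)


rng = np.random.default_rng(0)
batch, n, emb_dim = 2, 4, 3

# Hypothetical stand-ins for the three things passed to model(...):
embedding = rng.normal(size=(10, emb_dim))          # pretrained table, 10 users
vertices = np.array([[0, 1, 2, 3], [4, 5, 6, 7]])   # vertex ids per ego network
features = rng.integers(0, 2, size=(batch, n, 2))   # (active?, ego?) flags
adj_norm = np.stack([np.eye(n)] * batch)            # placeholder adjacency

# Look up embeddings by vertex id and append the two influence features.
h0 = np.concatenate([embedding[vertices], features], axis=2)
weight = rng.normal(size=(emb_dim + 2, 5))
h1 = gcn_layer(adj_norm, h0, weight)
print(h0.shape, h1.shape)
```

Under this reading, the embedding table is fixed at construction time, while the per-instance adjacency matrix only enters at call time, which would explain why it does not appear in the constructor.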

2.2.2 Datasets

The project README says:

Datasets: Please download the preprocessed datasets from onedrive, dropbox, google drive, or baidu pan (with password 242g). If you are interested in un-preprocessed data, please download them from the following links.

Digg dataset

Twitter dataset

OAG dataset

Weibo dataset

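My plan was to download one preprocessed dataset and one raw dataset. The archives are not split into smaller files and are too large for my disk, so I gave up. Note that "242g" in the quote above is the Baidu Pan password, not a file size. Readers with enough storage may want to try.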
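For anyone who does download the preprocessed data, the .npy files can be inspected without loading them fully into memory by using NumPy's memory-mapped loading. The helper name `describe_npy` is hypothetical, and the file written below is only a small stand-in so the sketch runs without the real dataset.

```python
import os
import tempfile

import numpy as np


def describe_npy(path):
    """Return (shape, dtype) of a .npy file without reading it into memory."""
    arr = np.load(path, mmap_mode="r")
    return arr.shape, arr.dtype


# Stand-in file so the sketch runs without the real dataset.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "adjacency_matrix.npy")
np.save(path, np.zeros((5, 50, 50), dtype=np.uint8))

shape, dtype = describe_npy(path)
print(shape, dtype)
```

Pointing `describe_npy` at adjacency_matrix.npy, influence_feature.npy, label.npy, and vertex_id.npy gives a quick view of how many instances a dataset holds and how large each sampled ego network is.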
