Tsinghua University Releases OpenNE: An Open-Source Toolkit for Network Embedding

To make it easier to run experiments and research on network embedding / network representation learning (NE/NRL), researchers from the Department of Computer Science and Technology at Tsinghua University have released OpenNE on GitHub, a training and testing framework for NE/NRL. It unifies the input, output, and evaluation interfaces of NE models and re-implements several classic network representation learning models. The project is still under active development, and the authors also provide comparison results against other existing implementations.

Project link: https://github.com/thunlp/OpenNE


OpenNE: An open source toolkit for Network Embedding

This repository provides a standard NE/NRL (Network Representation Learning) training and testing framework. In this framework, we unify the input and output interfaces of different NE models and provide scalable options for each model. Moreover, we implement typical NE models under this framework based on TensorFlow, which enables these models to be trained with GPUs.

We developed this toolkit following the settings of DeepWalk. The implemented or modified models include DeepWalk, LINE, node2vec, GraRep, TADW, GCN, HOPE, GF, SDNE and LE. We will continue to implement more representative NE models according to our released NRL paper list. In particular, we welcome other researchers to contribute NE models to this toolkit based on our framework, and we will acknowledge such contributions in this project.

Usage

Installation

  • Clone this repo.
  • Enter the directory where you cloned it, and run the following commands:
    pip install -r requirements.txt
    cd src
    python setup.py install

The required package versions are:

  • numpy==1.13.1
  • networkx==2.0
  • scipy==0.19.1
  • tensorflow==1.3.0
  • gensim==3.0.1
  • scikit-learn==0.19.0

General Options

You can check out the other options available in OpenNE using:

python -m openne --help

  • --input, the input file of a network;
  • --graph-format, the format of the input graph, adjlist or edgelist;
  • --output, the output file of representations (GCN doesn't need it);
  • --representation-size, the number of latent dimensions to learn for each node; the default is 128;
  • --method, the NE model to learn, including deepwalk, line, node2vec, grarep, tadw, gcn, lap, gf, hope and sdne;
  • --directed, treat the graph as directed; this is a boolean flag;
  • --weighted, treat the graph as weighted; this is a boolean flag;
  • --label-file, the file of node labels; ignore this option if not testing;
  • --clf-ratio, the ratio of training data for node classification; the default is 0.5;
  • --epochs, the number of training epochs for LINE and GCN; the default is 5;

Example

To run "node2vec" on the BlogCatalog network and evaluate the learned representations on the multi-label node classification task, run the following command in the home directory of this project:

python -m openne --method node2vec --label-file data/blogCatalog/bc_labels.txt --input data/blogCatalog/bc_adjlist.txt --graph-format adjlist --output vec_all.txt --q 0.25 --p 0.25

To run "gcn" on the Cora network and evaluate the learned representations on the multi-label node classification task, run the following command in the home directory of this project:

python -m openne --method gcn --label-file data/cora/cora_labels.txt --input data/cora/cora_edgelist.txt --graph-format edgelist --feature-file data/cora/cora.features  --epochs 200 --output vec_all.txt --clf-ratio 0.1

Specific Options

DeepWalk and node2vec:

  • --number-walks, the number of random walks to start at each node; the default is 10;
  • --walk-length, the length of the random walk started at each node; the default is 80;
  • --workers, the number of parallel processes; the default is 8;
  • --window-size, the window size of skip-gram model; the default is 10;
  • --q, only for node2vec; the default is 1.0;
  • --p, only for node2vec; the default is 1.0;
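These options combine with the general ones above. As an illustration (this exact command is not in the README), a DeepWalk run on the Wiki dataset with all of the above defaults spelled out would look like:

python -m openne --method deepwalk --input data/wiki/Wiki_edgelist.txt --graph-format edgelist --label-file data/wiki/Wiki_category.txt --output vec_all.txt --number-walks 10 --walk-length 80 --window-size 10 --workers 8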

LINE:

  • --negative-ratio, the default is 5;
  • --order, 1 for the 1st-order, 2 for the 2nd-order and 3 for 1st + 2nd; the default is 3;
  • --no-auto-save, do not save embeddings early when training LINE; this is a boolean flag; when training LINE, F1 scores are calculated every epoch, and the embeddings are saved whenever the current F1 is the best so far.

GraRep:

  • --kstep, use the k-step transition probability matrix (make sure representation-size % kstep == 0).

TADW:

  • --lamb, a hyperparameter in TADW that controls the weight of the regularization terms.

GCN:

  • --feature-file, the file of node features;
  • --epochs, the number of training epochs for GCN; the default is 5;
  • --dropout, the dropout rate;
  • --weight-decay, the weight of the L2 loss on the embedding matrix;
  • --hidden, the number of units in the first hidden layer.

GraphFactorization:

  • --epochs, the number of training epochs for GraphFactorization; the default is 5;
  • --weight-decay, the weight of the L2 loss on the embedding matrix;
  • --lr, the learning rate; the default is 0.01.

SDNE:

  • --encoder-list, a list of the numbers of neurons in each encoder layer; the last number is the dimension of the output node representation; the default is [1000, 128];
  • --alpha, a hyperparameter in SDNE that controls the weight of the first-order proximity loss; the default is 1e-6;
  • --beta, used to construct the reweighting matrix B in the reconstruction loss (see the sketch after this list); the default is 5;
  • --nu1, controls the L1 loss of the weights in the autoencoder; the default is 1e-5;
  • --nu2, controls the L2 loss of the weights in the autoencoder; the default is 1e-4;
  • --bs, the batch size; the default is 200;
  • --lr, the learning rate; the default is 0.01.
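To make the role of --beta concrete, here is a minimal numpy sketch of the reweighting matrix B as defined in the SDNE paper (b_ij = beta for observed edges, 1 otherwise); the helper name build_B and the variable names are ours, not taken from the OpenNE code:

import numpy as np

def build_B(adj, beta=5.0):
    """Build the SDNE reweighting matrix B from an adjacency matrix.

    B has the value beta where an edge exists and 1 elsewhere, so the
    autoencoder's reconstruction loss ||(X_hat - X) * B||_F^2 penalizes
    reconstruction errors on observed edges more heavily than on non-edges.
    """
    B = np.ones_like(adj, dtype=float)
    B[adj > 0] = beta
    return B

# Toy example: a 3-node path graph.
adj = np.array([[0., 1., 0.],
                [1., 0., 1.],
                [0., 1., 0.]])
print(build_B(adj, beta=5.0))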

Input

The supported input format is an edgelist or an adjlist:

edgelist: node1 node2 <weight_float, optional>
adjlist: node n1 n2 n3 ... nk

The graph is assumed to be undirected and unweighted by default. These options can be changed by setting the appropriate flags.
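As an illustration of the edgelist format, the sketch below parses such a file with networkx (this is one way to read these files, not necessarily the loader used inside OpenNE):

import networkx as nx

# A tiny weighted edgelist in the "node1 node2 <weight_float>" format.
with open("toy_edgelist.txt", "w") as f:
    f.write("0 1 0.5\n1 2 1.0\n2 0 2.0\n")

# Graphs are undirected and unweighted by default; read the optional
# weight column explicitly, mirroring the --weighted flag.
G = nx.read_edgelist("toy_edgelist.txt", nodetype=str,
                     data=(("weight", float),), create_using=nx.Graph())
print(G.edges(data=True))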

If the model needs additional features, the supported feature input format is as follows (each feature_i should be a float):

node feature_1 feature_2 ... feature_n

Output

The output file has n+1 lines for a graph with n nodes. The first line has the following format:

num_of_nodes dim_of_representation

The next n lines are as follows:

node_id dim1 dim2 ... dimd

where dim1, ... , dimd is the d-dimensional representation learned by OpenNE.
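A file in this format can be read back into a dictionary of vectors with a few lines of plain Python (a minimal sketch; OpenNE does not ship this particular helper):

import numpy as np

def load_embeddings(path):
    """Read an OpenNE-style embedding file into {node_id: vector}."""
    embeddings = {}
    with open(path) as f:
        num_nodes, dim = (int(x) for x in f.readline().split())
        for line in f:
            parts = line.split()
            if parts:
                embeddings[parts[0]] = np.array(parts[1:], dtype=float)
    assert len(embeddings) == num_nodes
    assert all(v.shape == (dim,) for v in embeddings.values())
    return embeddings

# vectors = load_embeddings("vec_all.txt")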

Evaluation

If you want to evaluate the learned node representations, you can provide node labels. A portion of the nodes (default: 50%) is used to train a classifier, and F1 scores are computed on the rest of the dataset.

The supported input label format is

node label1 label2 label3...
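This protocol can be reproduced outside the toolkit with scikit-learn. Below is a minimal sketch using a one-vs-rest logistic regression on a 50/50 split; it approximates, but is not guaranteed to match, the classifier built into OpenNE:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def evaluate(embeddings, labels, clf_ratio=0.5, seed=0):
    """Train on a fraction of nodes, report (Micro-F1, Macro-F1) on the rest.

    embeddings: {node_id: vector}; labels: {node_id: [label, ...]}.
    """
    nodes = sorted(labels)
    X = np.stack([embeddings[n] for n in nodes])
    Y = MultiLabelBinarizer().fit_transform([labels[n] for n in nodes])
    X_tr, X_te, Y_tr, Y_te = train_test_split(
        X, Y, train_size=clf_ratio, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(X_tr, Y_tr)
    Y_pred = clf.predict(X_te)
    return (f1_score(Y_te, Y_pred, average="micro"),
            f1_score(Y_te, Y_pred, average="macro"))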

Embedding visualization

To show how to apply dimensionality reduction methods such as t-SNE and PCA to embedding visualization, we choose the 20 newsgroups dataset. Using the text features, we build the news network with kneighbors_graph in scikit-learn. We upload the results of different methods in t-SNE-PCA.pptx, where the colors of nodes represent their labels. A simple script is shown as follows:

cd visualization_example
python 20newsgroup.py
tensorboard --logdir=log/

After running TensorBoard, visit localhost:6006 to view the result.
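If you prefer a static plot to TensorBoard, the same idea can be sketched with scikit-learn and matplotlib (a generic example, independent of the 20newsgroup.py script above; the inputs vectors and labels are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(vectors, labels, out="tsne.png"):
    """Project embeddings to 2-D with t-SNE and color points by label."""
    nodes = sorted(vectors)
    X = np.stack([vectors[n] for n in nodes])
    y = np.array([labels[n] for n in nodes])
    X2 = TSNE(n_components=2, random_state=0).fit_transform(X)
    plt.figure(figsize=(6, 6))
    plt.scatter(X2[:, 0], X2[:, 1], c=y, s=8, cmap="tab20")
    plt.title("t-SNE of learned node embeddings")
    plt.savefig(out, dpi=150)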

Comparisons with other implementations

Running environment: 
BlogCatalog: CPU: Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz. 
Wiki, Cora: CPU: Intel(R) Core(TM) i5-7267U CPU @ 3.10GHz.

We show the node classification results of various methods on different datasets. We set the representation dimension to 128, kstep=4 in GraRep, and p=1, q=1 in node2vec.

Note that both GCN (a semi-supervised NE model) and TADW need additional text features as inputs. Thus, we evaluate these two models on Cora, in which each node has text information. We use 10% labeled data to train GCN.

BlogCatalog: 10312 nodes, 333983 edges, 39 labels, undirected:

  • data/blogCatalog/bc_adjlist.txt
  • data/blogCatalog/bc_edgelist.txt
  • data/blogCatalog/bc_labels.txt

Algorithm             Time    Micro-F1  Macro-F1
DeepWalk              271s    0.385     0.238
LINE 1st+2nd          2008s   0.398     0.235
node2vec              2623s   0.404     0.264
GraRep                -       -         -
OpenNE(DeepWalk)      986s    0.394     0.249
OpenNE(LINE 1st+2nd)  1555s   0.390     0.253
OpenNE(node2vec)      3501s   0.405     0.275
OpenNE(GraRep)        4178s   0.393     0.230

Wiki (the Wiki dataset is provided by the LBC project, but the original link is no longer available): 2405 nodes, 17981 edges, 19 labels, directed:

  • data/wiki/Wiki_edgelist.txt
  • data/wiki/Wiki_category.txt

Algorithm                   Time    Micro-F1  Macro-F1
DeepWalk                    52s     0.669     0.560
LINE 2nd                    70s     0.576     0.387
node2vec                    32s     0.651     0.541
GraRep                      19.6s   0.633     0.476
OpenNE(DeepWalk)            42s     0.658     0.570
OpenNE(LINE 2nd)            90s     0.661     0.521
OpenNE(node2vec)            33s     0.655     0.538
OpenNE(GraRep)              23.7s   0.649     0.507
OpenNE(GraphFactorization)  12.5s   0.637     0.450
OpenNE(HOPE)                3.2s    0.601     0.438
OpenNE(LaplacianEigenmaps)  4.9s    0.277     0.073
OpenNE(SDNE)                39.6s   0.643     0.498

Cora: 2708 nodes, 5429 edges, 7 labels, directed:

  • data/cora/cora_edgelist.txt
  • data/cora/cora.features
  • data/cora/cora_labels.txt

Algorithm     Dropout  Weight decay  Hidden  Dimension  Time   Accuracy
TADW          -        -             -       80*2       13.9s  0.780
GCN           0.5      5e-4          16      -          4.0s   0.790
OpenNE(TADW)  -        -             -       80*2       20.8s  0.791
OpenNE(GCN)   0.5      5e-4          16      -          5.5s   0.789
OpenNE(GCN)   0        5e-4          16      -          6.1s   0.779
OpenNE(GCN)   0.5      1e-4          16      -          5.4s   0.783
OpenNE(GCN)   0.5      5e-4          64      -          6.5s   0.779

Citing

If you find OpenNE useful for your research, please consider citing the following papers:

@inproceedings{perozzi2014deepwalk,
  title     = {DeepWalk: Online learning of social representations},
  author    = {Perozzi, Bryan and Al-Rfou, Rami and Skiena, Steven},
  booktitle = {Proceedings of KDD},
  pages     = {701--710},
  year      = {2014}
}

@inproceedings{tang2015line,
  title     = {LINE: Large-scale information network embedding},
  author    = {Tang, Jian and Qu, Meng and Wang, Mingzhe and Zhang, Ming and Yan, Jun and Mei, Qiaozhu},
  booktitle = {Proceedings of WWW},
  pages     = {1067--1077},
  year      = {2015}
}

@inproceedings{grover2016node2vec,
  title     = {node2vec: Scalable feature learning for networks},
  author    = {Grover, Aditya and Leskovec, Jure},
  booktitle = {Proceedings of KDD},
  pages     = {855--864},
  year      = {2016}
}

@article{kipf2016semi,
  title   = {Semi-supervised classification with graph convolutional networks},
  author  = {Kipf, Thomas N and Welling, Max},
  journal = {arXiv preprint arXiv:1609.02907},
  year    = {2016}
}

@inproceedings{cao2015grarep,
  title     = {GraRep: Learning graph representations with global structural information},
  author    = {Cao, Shaosheng and Lu, Wei and Xu, Qiongkai},
  booktitle = {Proceedings of CIKM},
  pages     = {891--900},
  year      = {2015}
}

@inproceedings{yang2015network,
  title     = {Network representation learning with rich text information},
  author    = {Yang, Cheng and Liu, Zhiyuan and Zhao, Deli and Sun, Maosong and Chang, Edward},
  booktitle = {Proceedings of IJCAI},
  year      = {2015}
}

@article{tu2017network,
  title   = {Network representation learning: an overview},
  author  = {Tu, Cunchao and Yang, Cheng and Liu, Zhiyuan and Sun, Maosong},
  journal = {SCIENTIA SINICA Informationis},
  volume  = {47},
  number  = {8},
  pages   = {980--996},
  year    = {2017}
}

@inproceedings{ou2016asymmetric,
  title        = {Asymmetric transitivity preserving graph embedding},
  author       = {Ou, Mingdong and Cui, Peng and Pei, Jian and Zhang, Ziwei and Zhu, Wenwu},
  booktitle    = {Proceedings of the 22nd ACM SIGKDD},
  pages        = {1105--1114},
  year         = {2016},
  organization = {ACM}
}

@inproceedings{belkin2002laplacian,
  title     = {Laplacian eigenmaps and spectral techniques for embedding and clustering},
  author    = {Belkin, Mikhail and Niyogi, Partha},
  booktitle = {Advances in Neural Information Processing Systems},
  pages     = {585--591},
  year      = {2002}
}

@inproceedings{ahmed2013distributed,
  title        = {Distributed large-scale natural graph factorization},
  author       = {Ahmed, Amr and Shervashidze, Nino and Narayanamurthy, Shravan and Josifovski, Vanja and Smola, Alexander J},
  booktitle    = {Proceedings of the 22nd International Conference on World Wide Web},
  pages        = {37--48},
  year         = {2013},
  organization = {ACM}
}

@inproceedings{wang2016structural,
  title        = {Structural deep network embedding},
  author       = {Wang, Daixin and Cui, Peng and Zhu, Wenwu},
  booktitle    = {Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
  pages        = {1225--1234},
  year         = {2016},
  organization = {ACM}
}

Sponsor

This research is supported by Tencent, MSRA, NSFC and BBDM-Lab.

