Formats of the datasets used by GCN: Cora, Citeseer, Pubmed, and Tox21

Table of contents

- Cora, Citeseer, Pubmed
- Taking Cora as an example
  - Data format examples
- Tox21 dataset

This post describes the file formats of several datasets used with graph convolutional networks (GCN).

Cora, Citeseer, Pubmed

├── gcn
│   ├── data          //graph data
│   │   ├── ind.citeseer.allx
│   │   ├── ind.citeseer.ally
│   │   ├── ind.citeseer.graph
│   │   ├── ind.citeseer.test.index
│   │   ├── ind.citeseer.tx
│   │   ├── ind.citeseer.ty
│   │   ├── ind.citeseer.x
│   │   ├── ind.citeseer.y
│   │   ├── ind.cora.allx
│   │   ├── ind.cora.ally
│   │   ├── ind.cora.graph
│   │   ├── ind.cora.test.index
│   │   ├── ind.cora.tx
│   │   ├── ind.cora.ty
│   │   ├── ind.cora.x
│   │   ├── ind.cora.y
│   │   ├── ind.pubmed.allx
│   │   ├── ind.pubmed.ally
│   │   ├── ind.pubmed.graph
│   │   ├── ind.pubmed.test.index
│   │   ├── ind.pubmed.tx
│   │   ├── ind.pubmed.ty
│   │   ├── ind.pubmed.x
│   │   └── ind.pubmed.y
│   ├── __init__.py
│   ├── inits.py    //shared initialization functions
│   ├── layers.py   //GCN layer definitions
│   ├── metrics.py  //evaluation metrics
│   ├── models.py   //model architecture definitions
│   ├── train.py    //training
│   └── utils.py    //utility functions
├── LICENCE
├── README.md
├── requirements.txt
└── setup.py

Each of the three datasets consists of the following eight files, stored in similar formats:

ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict object;
ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

All objects above must be saved using python pickle module.

Taking Cora as an example:
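ind.dataset_str.x => feature vectors of the training instances, a scipy.sparse.csr.csr_matrix object, shape: (140, 1433)
ind.dataset_str.tx => feature vectors of the test instances, shape: (1000, 1433)
ind.dataset_str.allx => feature vectors of the labeled and unlabeled training instances, a superset of ind.dataset_str.x, shape: (1708, 1433)
ind.dataset_str.y => one-hot labels of the training instances, a numpy.ndarray object, shape: (140, 7)
ind.dataset_str.ty => one-hot labels of the test instances, a numpy.ndarray object, shape: (1000, 7)
ind.dataset_str.ally => one-hot labels corresponding to ind.dataset_str.allx, shape: (1708, 7)
ind.dataset_str.graph => the graph data, a collections.defaultdict object in the format {index: [index_of_neighbor_nodes]}
ind.dataset_str.test.index => ids of the test instances, one id per line (1000 lines, matching the 1000 rows of tx)

All of the files above except test.index, which is plain text, must be stored with Python's pickle module.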
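To make the file list concrete, here is a minimal sketch of reading all eight files for one dataset. The function names and the returned dictionary layout are my own choices for illustration, not something the post prescribes. It assumes the seven pickled files sit next to a plain-text test.index file with one integer per line, and that scipy and numpy are installed so the pickled objects can be restored.

```python
import os
import pickle
import sys


def parse_index_file(filename):
    """Read a text file that holds one integer node index per line."""
    with open(filename) as f:
        return [int(line.strip()) for line in f if line.strip()]


def load_planetoid_files(data_dir, dataset_str):
    """Load the eight ind.<dataset_str>.* files into a dict keyed by suffix."""
    names = ["x", "y", "tx", "ty", "allx", "ally", "graph"]
    objects = {}
    for name in names:
        path = os.path.join(data_dir, "ind.{}.{}".format(dataset_str, name))
        with open(path, "rb") as f:
            if sys.version_info > (3, 0):
                # latin1 lets Python 3 read pickles written by Python 2
                objects[name] = pickle.load(f, encoding="latin1")
            else:
                objects[name] = pickle.load(f)
    index_path = os.path.join(data_dir, "ind.{}.test.index".format(dataset_str))
    objects["test.index"] = parse_index_file(index_path)
    return objects
```

With the directory layout shown earlier, load_planetoid_files("data", "cora")["x"].shape should match the (140, 1433) listed above.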

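(Additional note, reposting a comment by user ltrbless:

"Hello. The sentence 'ind.dataset_str.allx => feature vectors of the labeled and unlabeled training instances, a superset of ind.dataset_str.x, shape: (1708, 1433)' can easily be misread as meaning that some rows of ally are [0, 0, 0, 0, 0, 0, 0]. In fact every row comes with a label; when processing the data you can mask some of the labels yourself to run semi-supervised experiments."

Question: why does allx contain only 1708 instances rather than 2708?

Answer: because allx holds only the nodes used for training, 1708 in total. The other 1000 are assigned to the test set, that is, tx, so allx + tx = 2708. That should be the explanation.

)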

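The GCN in the paper Semi-Supervised Classification with Graph Convolutional Networks is trained in a semi-supervised way, so during training only part of the training nodes are treated as labeled and the rest are treated as unlabeled.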
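The arithmetic behind this split can be sketched in a few lines. The helper sample_mask and the constant names are hypothetical, and the sketch assumes that the 140 labeled training rows are the first rows of allx, which the post does not state explicitly.

```python
def sample_mask(indices, size):
    """Return a list of booleans that is True only at the given indices."""
    mask = [False] * size
    for i in indices:
        mask[i] = True
    return mask


# Cora sizes quoted in the text above.
NUM_ALLX = 1708  # rows of ind.cora.allx / ind.cora.ally
NUM_TX = 1000    # rows of ind.cora.tx / ind.cora.ty
NUM_X = 140      # rows of ind.cora.x / ind.cora.y
NUM_NODES = NUM_ALLX + NUM_TX  # all papers in the graph

# Assumption: the labeled training rows come first in allx.
train_mask = sample_mask(range(NUM_X), NUM_NODES)

# Every row of ally has a label, but a semi-supervised loss
# only reads the rows selected by the mask.
visible = sum(train_mask)
hidden = NUM_ALLX - visible
print(NUM_NODES, visible, hidden)  # 2708 140 1568
```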

Taking Cora as an example
Original dataset: http://linqs.cs.umd.edu/projects/projects/lbc/
Dataset split: https://github.com/kimiyoung/planetoid (Zhilin Yang, William W. Cohen, Ruslan Salakhutdinov, Revisiting Semi-Supervised Learning with Graph Embeddings, ICML 2016)

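The Cora dataset consists of machine learning papers and has become a very popular benchmark in graph deep learning. Each paper in the dataset belongs to one of the following seven classes:

Case-based
Genetic algorithms
Neural networks
Probabilistic methods
Reinforcement learning
Rule learning
Theory

The papers were selected so that, in the final corpus, every paper cites or is cited by at least one other paper. The whole corpus contains 2708 papers.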

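After stemming and stop-word removal, all words with a document frequency below 10 were dropped, which left 1433 unique words. The Cora features are therefore 1433-dimensional, and each entry is 0 or 1 depending on whether the corresponding word appears in the paper.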
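As a small illustration of this 0/1 encoding, the sketch below builds a binary feature vector for one paper. The five-word vocabulary and the token list are made up; the real Cora vocabulary has 1433 entries.

```python
def binary_bag_of_words(tokens, vocabulary):
    """Return 1.0 for every vocabulary word that occurs in tokens, else 0.0."""
    present = set(tokens)
    return [1.0 if word in present else 0.0 for word in vocabulary]


# Hypothetical vocabulary and paper, for illustration only.
vocabulary = ["graph", "kernel", "neural", "policy", "reward"]
paper_tokens = ["neural", "graph", "neural"]

features = binary_bag_of_words(paper_tokens, vocabulary)
print(features)  # [1.0, 0.0, 1.0, 0.0, 0.0]
```

A word that occurs twice still yields 1.0, because the feature records presence rather than frequency.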

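The variable data in the examples below is a scipy.sparse.csr.csr_matrix, that is, a sparse matrix; printing it lists the (row, column) coordinates and values of its nonzero entries.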
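To see why printing such a matrix yields coordinate and value triples, it helps to look at how the CSR layout stores data. The sketch below rebuilds the three arrays that scipy exposes as data, indices and indptr using plain Python lists; the helper names are my own and this is not scipy's implementation.

```python
def dense_to_csr(rows):
    """Convert a dense list of lists into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the nonzero value
                indices.append(col)  # its column index
        indptr.append(len(data))     # where the next row starts in data
    return data, indices, indptr


def iter_nonzero(data, indices, indptr):
    """Yield (row, column, value) for every stored entry."""
    for row in range(len(indptr) - 1):
        for k in range(indptr[row], indptr[row + 1]):
            yield row, indices[k], data[k]


dense = [[0.0, 1.0, 0.0, 1.0],
         [0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0]]
data, indices, indptr = dense_to_csr(dense)
print(indptr)  # [0, 2, 2, 3]
print(list(iter_nonzero(data, indices, indptr)))
# [(0, 1, 1.0), (0, 3, 1.0), (2, 0, 1.0)]
```

The empty second row shows up only as a repeated value in indptr, which is why sparse storage is cheap for mostly-zero feature matrices like Cora's.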

Data format examples

(1)--------------------------------------ind.cora.x
import sys
import pickle as pkl

def load_cora():
    with open("data/ind.cora.x", 'rb') as f:
        print(f)  # <_io.BufferedReader name='data/ind.cora.x'>
        if sys.version_info > (3, 0):
            data = pkl.load(f, encoding='latin1')
        else:
            data = pkl.load(f)
    print(type(data))     # <class 'scipy.sparse.csr.csr_matrix'>
    print(data.shape)     # (140, 1433): ind.cora.x has 140 rows and 1433 columns
    print(data.shape[0])  # row: 140
    print(data.shape[1])  # column: 1433
    print(data[1])
    # data is a scipy.sparse.csr.csr_matrix (a sparse matrix); printing it lists
    # the (row, column) coordinates and values of the nonzero entries
    # (0, 19)   1.0
    # (0, 88)    1.0
    # (0, 149)   1.0
    # (0, 212)   1.0
    # (0, 233)   1.0
    # (0, 332)   1.0
    # (0, 336)   1.0
    # (0, 359)   1.0
    # (0, 472)   1.0
    # (0, 507)   1.0
    # (0, 548)   1.0
    # ...
    # data[100] is a 1 x 1433 matrix, so asking it for row 1 fails:
    # print(data[100][1])  # IndexError: index (1) out of range
    nonzero = data.nonzero()
    print(nonzero)        # row and column coordinates of the nonzero entries
    # (array([  0,   0,   0, ..., 139, 139, 139], dtype=int32), array([  19,   81,  146, ..., 1263, 1274, 1393], dtype=int32))
    # nonzero is a tuple
    print(type(nonzero))  # <class 'tuple'>
    print(nonzero[0])     # rows: [  0   0   0 ... 139 139 139]
    print(nonzero[1])     # columns: [  19   81  146 ... 1263 1274 1393]
    print(nonzero[1][0])  # 19
    print(data.toarray())
# [[0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  ...
#  [0. 0. 0. ... 0. 1. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 1. 0. ... 0. 0. 0.]]

(2)--------------------------------------ind.cora.y
def load_cora():
    with open("data/ind.cora.y", 'rb') as f:
        print(f)  # <_io.BufferedReader name='data/ind.cora.y'>
        if sys.version_info > (3, 0):
            data = pkl.load(f, encoding='latin1')
        else:
            data = pkl.load(f)
    print(type(data))     # <class 'numpy.ndarray'>
    print(data.shape)     # (140, 7)
    print(data.shape[0])  # row: 140
    print(data.shape[1])  # column: 7
    print(data[1])        # [0 0 0 0 1 0 0]

(3)--------------------------------------ind.cora.graph
def load_cora():
    with open("data/ind.cora.graph", 'rb') as f:
        if sys.version_info > (3, 0):
            data = pkl.load(f, encoding='latin1')
        else:
            data = pkl.load(f)
    print(type(data))  # <class 'collections.defaultdict'>
    print(data)
# defaultdict(<class 'list'>, {0: [633, 1862, 2582], 1: [2, 652, 654], 2: [1986, 332, 1666, 1, 1454],
#   , ... ,
#   2706: [165, 2707, 1473, 169], 2707: [598, 165, 1473, 2706]})

(4)--------------------------------------ind.cora.test.index
test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
print("test index:",test_idx_reorder)
#test index: [2692, 2532, 2050, 1715, 2362, 2609, 2622, 1975, 2081, 1767, 2263,..]
print("min_index:",min(test_idx_reorder))
# min_index: 1708

(5) Special handling of isolated nodes in the Citeseer dataset
# Excerpt from the loading code: x, y, tx, ty and test_idx_reorder are loaded as in
# (1) to (4), and test_idx_range is the sorted version of test_idx_reorder.
# Handle the isolated nodes in citeseer
if dataset_str == 'citeseer':
    # Fix citeseer dataset (there are some isolated nodes in the graph)
    # Find isolated nodes, add them as zero-vecs into the right position
    test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder)+1)
    # print("test_idx_range_full.length",len(test_idx_range_full))
    # test_idx_range_full.length 1015
    # Create a sparse matrix in LIL format, tx_extended.shape=(1015, 3703)
    tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
    # Indices covered by the full range:
    # [2312 2313 2314 2315 2316 2317 2318 2319 2320 2321 2322 2323 2324 2325
    # ....
    # 3321 3322 3323 3324 3325 3326]
    # test_idx_range-min(test_idx_range): subtract min(test_idx_range) from every
    # element, so that the index values in test_idx_range start from 0
    tx_extended[test_idx_range-min(test_idx_range), :] = tx
    # print(tx_extended.shape)  # (1015, 3703)
    # print(tx_extended)
    # (0, 19) 1.0
    # (0, 21) 1.0
    # (0, 169) 1.0
    # (0, 170) 1.0
    # (0, 425) 1.0
    #  ...
    # (1014, 3243) 1.0
    # (1014, 3351) 1.0
    # (1014, 3472) 1.0
    tx = tx_extended
    # print(tx.shape)  # (1015, 3703)
    # 15 rows (997, 994, 993, 980, 938, ...) are all zeros
    ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
    ty_extended[test_idx_range-min(test_idx_range), :] = ty
    ty = ty_extended
    # for i in range(ty.shape[0]):
    #     print(i," ",ty[i])
    #     # 980 [0. 0. 0. 0. 0. 0.]
    #     # 994 [0. 0. 0. 0. 0. 0.]
    #     # 993 [0. 0. 0. 0. 0. 0.]
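The padding step in (5) is easier to follow on toy data. The sketch below reproduces the idea with plain lists: rows are placed at their offset from the smallest test index, and any index missing from the list (an isolated node) keeps an all-zero row. The function name and the numbers are invented for illustration, and the rows are assumed to be ordered by sorted test index.

```python
def pad_test_rows(test_idx, rows, width):
    """Place rows at (index - min index); missing indices stay all-zero."""
    lo, hi = min(test_idx), max(test_idx)
    padded = [[0.0] * width for _ in range(hi - lo + 1)]
    # Assumption: rows[k] belongs to the k-th smallest test index.
    for idx, row in zip(sorted(test_idx), rows):
        padded[idx - lo] = list(row)
    return padded


# Hypothetical test indices: 11 is missing, so it plays the isolated node.
test_idx = [12, 10, 13]
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

padded = pad_test_rows(test_idx, rows, width=2)
print(padded)
# [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
```

In the Citeseer case quoted above the same mechanism produces the 1015-row tx and ty that contain 15 all-zero rows.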

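# A small example of what a LIL-format sparse matrix looks like when printed
import numpy as np
import scipy.sparse as sp

A=np.array([[1,0,2,0],[0,0,0,0],[3,0,0,0],[1,0,0,4]])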
AS=sp.lil_matrix(A)
print(AS)
# (0, 0) 1
# (0, 2) 2
# (2, 0) 3
# (3, 0) 1
# (3, 3) 4
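Of the files shown in these examples, ind.cora.graph from (3) is the one that defines the topology. As a rough sketch of how such a {index: [index_of_neighbor_nodes]} dictionary can be turned into an edge list and a symmetric adjacency matrix, consider the code below. The helper names and the toy graph are hypothetical, and dropping self-loops and duplicate neighbors is a design choice of this sketch.

```python
def undirected_edges(graph):
    """Collect each undirected edge once as a sorted (small, large) pair."""
    edges = set()
    for node, neighbors in graph.items():
        for nb in neighbors:
            if node != nb:  # ignore self-loops in this sketch
                edges.add((min(node, nb), max(node, nb)))
    return sorted(edges)


def adjacency_matrix(graph, num_nodes):
    """Build a dense symmetric 0/1 adjacency matrix as a list of lists."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in undirected_edges(graph):
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


# Toy graph in the same format as ind.cora.graph (node 2 lists 3 twice).
toy_graph = {0: [1, 2], 1: [0], 2: [0, 3, 3], 3: [2]}

print(undirected_edges(toy_graph))        # [(0, 1), (0, 2), (2, 3)]
print(adjacency_matrix(toy_graph, 4)[2])  # [1, 0, 0, 1]
```

For a graph the size of Cora a dense matrix still fits in memory, but a sparse representation would be the more natural choice.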

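Tox21 dataset
This dataset comes from a 2014 competition, the Tox21 Data Challenge, hosted on an NIH site: https://tripod.nih.gov/tox21/challenge/about.jsp
PubChem is the open chemistry database of the US National Institutes of Health (NIH) and the world's largest collection of freely accessible chemical information.
PubChem's data is contributed by hundreds of data sources, including government agencies, chemical vendors, and journal publishers.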

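Toxicology in the 21st Century (Tox21) is a federal collaboration among the NIH, the Environmental Protection Agency, and the Food and Drug Administration that aims to develop better methods of toxicity assessment. The goal is to test quickly and efficiently whether certain compounds may disrupt processes in the human body in ways that lead to adverse health effects. The Tox21 dataset was used in one of its competitions and contains structural information on chemical compounds measured in 12 toxicity assays.

One molecule in the training set is stored in the following format: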

NCGC00255644-01
  Marvin  07111412562D          

 26 27  0  0  1  0            999 V2000
    4.5831   -4.3075    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.2840   -3.9061    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.9910   -4.3075    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.2840   -3.0973    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -1.6595    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -2.4863    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.1508   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -3.3010    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    0.7070   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -2.4863    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.1508   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7070   -3.7084    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.1508   -3.7084    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -2.4863    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -3.3010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5646   -2.0609    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.8577   -0.8388    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.1323   -4.4273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.3056   -4.4273    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000   -3.3010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.5646   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7189   -5.1463    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.2955   -0.8388    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0085   -1.2342    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2955    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4379   -4.1338    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  2  4  2  0  0  0  0
  6  5  1  1  0  0  0
  6  7  1  0  0  0  0
  6  8  1  0  0  0  0
  6  9  1  0  0  0  0
  7 10  1  0  0  0  0
  7 11  2  0  0  0  0
  8 12  1  0  0  0  0
  8 13  1  0  0  0  0
  8 26  1  6  0  0  0
  9 14  1  0  0  0  0
 10 15  1  0  0  0  0
 10 16  2  0  0  0  0
 11 17  1  0  0  0  0
 12 18  1  6  0  0  0
 12 19  1  1  0  0  0
 12 20  1  0  0  0  0
 13 15  1  0  0  0  0
 14 20  1  0  0  0  0
 16 21  1  0  0  0  0
 17 21  2  0  0  0  0
 18 22  1  0  0  0  0
 21 23  1  0  0  0  0
 23 24  1  0  0  0  0
 23 25  1  0  0  0  0
M  END
>  <Formula>
C22H35NO2

>  <FW>
345.5188 (60.0520+285.4668)

>  <DSSTox_CID>
27102

>  <Active>
0

$$$$

One molecule in the test set is stored in the following format:

NCGC00261443
  Marvin  10161415332D          

 20 22  0  0  1  0            999 V2000
    0.5185    2.9762    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    2.5637    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    1.7387    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.5185    1.3262    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661    1.5812    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7510    0.9137    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661    0.2463    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5210   -0.5383    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   -1.3056   -0.7933    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.3056   -1.6183    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
   -1.9731   -2.1032    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.7268   -1.7676    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5210   -1.8732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2661   -2.6578    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0361   -1.2058    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    0.7889   -1.2058    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.5185    0.5012    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.2330    0.0887    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.9475    0.5012    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9475    1.3262    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  1  1  0  0  0
  9 10  1  0  0  0  0
 10 11  1  1  0  0  0
 11 12  1  0  0  0  0
 10 13  1  0  0  0  0
 13 14  1  0  0  0  0
 13 15  1  0  0  0  0
  8 15  1  0  0  0  0
 15 16  1  6  0  0  0
  7 17  1  0  0  0  0
  4 17  1  0  0  0  0
 17 18  2  0  0  0  0
 18 19  1  0  0  0  0
 19 20  2  0  0  0  0
  3 20  1  0  0  0  0
M  END
>  <Compound ID>
NCGC00261443

>  <Compound Batch ID>
NCGC00261443-01

>  <NR-AR>
0

>  <NR-AR-LBD>
0

>  <NR-AhR>
0

>  <NR-ER>
0

>  <NR-ER-LBD>
0

>  <NR-PPAR-gamma>
0

>  <SR-ARE>
0

>  <SR-ATAD5>
1

>  <SR-HSE>
0

>  <SR-MMP>
0

>  <SR-p53>
0

$$$$
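The tagged fields at the end of a record (the Active flag in the training example, the NR-* and SR-* assay results in the test example) can be pulled out with a few lines of string handling. This is only a sketch: it assumes each header of the form ">  <Name>" is followed by its value on the next line, and the function name and the shortened sample record are my own.

```python
def parse_sdf_data_items(record):
    """Map each '>  <Name>' header in one record to the line that follows it."""
    items = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith(">") and "<" in line and line.endswith(">"):
            name = line[line.index("<") + 1:line.rindex(">")]
            value = lines[i + 1].strip() if i + 1 < len(lines) else ""
            items[name] = value
            i += 2  # skip the value line as well
        else:
            i += 1
    return items


# Shortened, hypothetical record tail modeled on the test example above.
record = (
    "M  END\n"
    ">  <Compound ID>\nNCGC00261443\n\n"
    ">  <NR-AR>\n0\n\n"
    ">  <SR-ATAD5>\n1\n\n"
    "$$$$\n"
)

items = parse_sdf_data_items(record)
labels = {k: int(v) for k, v in items.items() if k.startswith(("NR-", "SR-"))}
print(labels)  # {'NR-AR': 0, 'SR-ATAD5': 1}
```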

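The goal is presumably to use the molecular structures in the training set, together with their active/inactive labels, to predict the activity of the molecules in the test set. Each molecule in the training set probably forms one graph, with its atoms and bonds as the nodes and edges. However, I did not find a more detailed description of the atom and bond sections of the data, so I do not know what each row means.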
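The records shown above appear to follow the MDL molfile (V2000) layout used inside SDF files: three header lines, a counts line whose first two three-character fields give the number of atoms and bonds (26 and 27 in the training example), one line per atom with x, y, z coordinates and the element symbol, and one line per bond with the two 1-based atom numbers and the bond type. Under that assumption, a minimal sketch for turning one molecule into graph nodes and edges could look like this. The function name and the three-atom toy block are hypothetical.

```python
def parse_molblock(lines):
    """Parse a V2000 mol block (list of lines, header first) into atoms and bonds."""
    counts = lines[3]
    num_atoms = int(counts[0:3])  # fixed-width field, characters 1-3
    num_bonds = int(counts[3:6])  # fixed-width field, characters 4-6
    atoms = []
    for line in lines[4:4 + num_atoms]:
        fields = line.split()
        atoms.append({
            "x": float(fields[0]),
            "y": float(fields[1]),
            "z": float(fields[2]),
            "element": fields[3],
        })
    bonds = []
    for line in lines[4 + num_atoms:4 + num_atoms + num_bonds]:
        first, second, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        # Convert the 1-based atom numbers to 0-based node indices.
        bonds.append((first - 1, second - 1, order))
    return atoms, bonds


# Hypothetical three-atom block built from atoms 1, 2 and 4 of the training example.
toy_block = [
    "toy-molecule",
    "  hypothetical",
    "",
    "  3  2  0  0  0  0            999 V2000",
    "    4.5831   -4.3075    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    5.2840   -3.9061    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    5.2840   -3.0973    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  2  3  2  0  0  0  0",
    "M  END",
]

atoms, bonds = parse_molblock(toy_block)
print([a["element"] for a in atoms])  # ['O', 'C', 'O']
print(bonds)                          # [(0, 1, 1), (1, 2, 2)]
```

The atoms then serve as graph nodes (with the element as a node feature) and the bonds as edges (with the bond type as an edge feature). For real work a cheminformatics toolkit is likely a safer choice than hand-written parsing.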
