References:
https://towardsdatascience.com/learn-to-smell-molecules-with-graph-convolutional-neural-networks-62fa5a826af5
https://github.com/snap-stanford/ogb
https://github.com/dmlc/dgl/blob/master/examples/pytorch

https://programtalk.com/python-more-examples/ogb.utils.features.atom_to_feature_vector/
https://keras.io/examples/generative/wgan-graphs/

The ogb (Open Graph Benchmark) package provides utilities for building, manipulating, and loading graph data, including featurizers that map RDKit atoms and bonds to fixed-length integer feature vectors.
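For orientation, here is a minimal sketch (assuming ogb and rdkit are installed) of the feature helpers used throughout this post; the SMILES "CCO" is just a toy input of my choosing:

from rdkit import Chem
from ogb.utils.features import (atom_to_feature_vector, bond_to_feature_vector,
                                get_atom_feature_dims, get_bond_feature_dims)

mol = Chem.MolFromSmiles("CCO")        # ethanol, toy example
atom = mol.GetAtomWithIdx(0)           # first carbon atom
bond = mol.GetBondWithIdx(0)           # C-C bond

print(atom_to_feature_vector(atom))    # integer feature vector for one atom (9 slots)
print(bond_to_feature_vector(bond))    # integer feature vector for one bond
print(get_atom_feature_dims())         # cardinality of each atom feature slot (for embedding tables)
print(get_bond_feature_dims())         # cardinality of each bond feature slot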

1. Building molecular graphs from SMILES with ogb and dgl

import torch
import dgl
import torch_geometric
from ogb.utils.features import (atom_to_feature_vector, bond_to_feature_vector,
                                get_atom_feature_dims, get_bond_feature_dims)
from rdkit import Chem
from rdkit.Chem.rdmolops import GetAdjacencyMatrix
from torch.utils.data import Dataset, DataLoader
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn.functional as F
from scipy.constants import physical_constants
from typing import List, Tuple


def graph_only_collate(batch: List[Tuple]):
    # merge a list of DGLGraphs into a single batched graph
    return dgl.batch(batch)


class InferenceDataset(Dataset):
    def __init__(self, smiles_txt_path, device='cuda:0', transform=None, **kwargs):
        with open(smiles_txt_path) as file:
            lines = file.readlines()
            smiles_list = [line.rstrip() for line in lines]
        atom_slices = [0]
        edge_slices = [0]
        all_atom_features = []
        all_edge_features = []
        edge_indices = []  # edges of each molecule in COO format
        total_atoms = 0
        total_edges = 0
        n_atoms_list = []
        for mol_idx, smiles in tqdm(enumerate(smiles_list)):
            # get the molecule from its SMILES representation
            mol = Chem.MolFromSmiles(smiles)
            # add explicit hydrogens because they are not part of the SMILES representation
            mol = Chem.AddHs(mol)
            n_atoms = mol.GetNumAtoms()
            atom_features_list = []
            for atom in mol.GetAtoms():
                atom_features_list.append(atom_to_feature_vector(atom))
            all_atom_features.append(torch.tensor(atom_features_list, dtype=torch.long))
            edges_list = []
            edge_features_list = []
            for bond in mol.GetBonds():
                i = bond.GetBeginAtomIdx()
                j = bond.GetEndAtomIdx()
                edge_feature = bond_to_feature_vector(bond)
                # add edges in both directions
                edges_list.append((i, j))
                edge_features_list.append(edge_feature)
                edges_list.append((j, i))
                edge_features_list.append(edge_feature)
            # graph connectivity in COO format with shape [2, num_edges]
            edge_index = torch.tensor(edges_list, dtype=torch.long).T
            edge_features = torch.tensor(edge_features_list, dtype=torch.long)
            edge_indices.append(edge_index)
            all_edge_features.append(edge_features)
            total_edges += len(edges_list)
            total_atoms += n_atoms
            edge_slices.append(total_edges)
            atom_slices.append(total_atoms)
            n_atoms_list.append(n_atoms)
        self.n_atoms = torch.tensor(n_atoms_list)
        self.atom_slices = torch.tensor(atom_slices, dtype=torch.long)
        self.edge_slices = torch.tensor(edge_slices, dtype=torch.long)
        self.edge_indices = torch.cat(edge_indices, dim=1)
        self.all_atom_features = torch.cat(all_atom_features, dim=0)
        self.all_edge_features = torch.cat(all_edge_features, dim=0)

    def __len__(self):
        return len(self.atom_slices) - 1

    def __getitem__(self, idx):
        e_start = self.edge_slices[idx]
        e_end = self.edge_slices[idx + 1]
        start = self.atom_slices[idx]
        n_atoms = self.n_atoms[idx]
        edge_indices = self.edge_indices[:, e_start:e_end]
        g = dgl.graph((edge_indices[0], edge_indices[1]), num_nodes=n_atoms)
        g.ndata['feat'] = self.all_atom_features[start: start + n_atoms]
        g.edata['feat'] = self.all_edge_features[e_start:e_end]
        return g


# `device` and `args.smiles_txt_path` come from the surrounding script (e.g. argparse)
test_data = InferenceDataset(device=device, smiles_txt_path=args.smiles_txt_path)  # txt file of SMILES strings, one per line
test_loader = DataLoader(test_data, batch_size=2, collate_fn=graph_only_collate)  # standard torch DataLoader
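A short usage sketch for the loader built above, just to sanity-check the batching; `graph_only_collate` merges the per-molecule graphs into one batched DGLGraph, so each iteration yields a single graph object:

# inspect one batch from test_loader
for batched_g in test_loader:
    print(batched_g.batch_size)                          # number of molecules in this batch
    print(batched_g.num_nodes(), batched_g.num_edges())  # totals over the whole batch
    print(batched_g.ndata['feat'].shape)                 # [total atoms in batch, num atom features]
    print(batched_g.edata['feat'].shape)                 # [total edges in batch, num bond features]
    break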

Single-molecule test

smiles_list = ["OC1CC1(O)CC1CC1"]
atom_slices = [0]
edge_slices = [0]
all_atom_features = []
all_edge_features = []
edge_indices = []  # edges of each molecule in coo format
total_atoms = 0
total_edges = 0
n_atoms_list = []
for mol_idx, smiles in tqdm(enumerate(smiles_list)):
    # get the molecule using the smiles representation from the csv file
    mol = Chem.MolFromSmiles(smiles)
    # add hydrogen bonds to molecule because they are not in the smiles representation
    mol = Chem.AddHs(mol)
    print(Chem.MolToSmiles(mol))
    n_atoms = mol.GetNumAtoms()
    print(n_atoms)
    atom_features_list = []
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_to_feature_vector(atom))
    all_atom_features.append(torch.tensor(atom_features_list, dtype=torch.long))
    edges_list = []
    edge_features_list = []
    for bond in mol.GetBonds():
        i = bond.GetBeginAtomIdx()
        j = bond.GetEndAtomIdx()
        edge_feature = bond_to_feature_vector(bond)
        # add edges in both directions
        edges_list.append((i, j))
        edge_features_list.append(edge_feature)
        edges_list.append((j, i))
        edge_features_list.append(edge_feature)
    # Graph connectivity in COO format with shape [2, num_edges]
    edge_index = torch.tensor(edges_list, dtype=torch.long).T
    edge_features = torch.tensor(edge_features_list, dtype=torch.long)
    edge_indices.append(edge_index)
    all_edge_features.append(edge_features)
    total_edges += len(edges_list)
    total_atoms += n_atoms
    edge_slices.append(total_edges)
    atom_slices.append(total_atoms)
    n_atoms_list.append(n_atoms)

n_atoms = torch.tensor(n_atoms_list)
atom_slices = torch.tensor(atom_slices, dtype=torch.long)
edge_slices = torch.tensor(edge_slices, dtype=torch.long)
edge_indices = torch.cat(edge_indices, dim=1)
all_atom_features = torch.cat(all_atom_features, dim=0)
all_edge_features = torch.cat(all_edge_features, dim=0)

e_start = edge_slices[0]
e_end = edge_slices[0 + 1]
start = atom_slices[0]
n_atoms = n_atoms[0]
edge_indices = edge_indices[:, e_start:e_end]
g = dgl.graph((edge_indices[0], edge_indices[1]), num_nodes=n_atoms)
g.ndata['feat'] = all_atom_features[start: start + n_atoms]
g.edata['feat'] = all_edge_features[e_start:e_end]
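A small sanity check (my own addition, not part of the original snippet): the graph should agree with RDKit's own atom and bond counts for the same molecule, since every bond was added in both directions:

# compare the single-molecule graph against RDKit's counts for "OC1CC1(O)CC1CC1"
mol_check = Chem.AddHs(Chem.MolFromSmiles("OC1CC1(O)CC1CC1"))
assert g.num_nodes() == mol_check.GetNumAtoms()      # one node per atom, including explicit Hs
assert g.num_edges() == 2 * mol_check.GetNumBonds()  # each bond stored as two directed edges
print(g.ndata['feat'].shape, g.edata['feat'].shape)  # per-atom and per-edge feature tensors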

2. GNN model training code

row.csv (training data: SMILES plus comma-separated odor descriptors)

SMILES,SENTENCE
C/C=C/C(=O)C1CCC(C=C1C)(C)C,"fruity,rose"
COC(=O)OC,"fresh,ethereal,fruity"
Cc1cc2c([nH]1)cccc2,"resinous,animalic"
C1CCCCCCCC(=O)CCCCCCC1,"powdery,musk,animalic"
CC(CC(=O)OC1CC2C(C1(C)CC2)(C)C)C,"coniferous,camphor,fruity"
CCC[C@H](CCO)SC,tropicalfruit
from rdkit import Chem
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch
import dgl
from ogb.utils.features import (atom_to_feature_vector, bond_to_feature_vector,
                                get_atom_feature_dims, get_bond_feature_dims)


def smiles2graph(smiles_string):
    """Build a DGL graph from the adjacency matrix of a molecule."""
    mol = Chem.MolFromSmiles(smiles_string)
    A = Chem.GetAdjacencyMatrix(mol)
    A = np.asmatrix(A)
    nz = np.nonzero(A)
    src, dst = nz[0], nz[1]
    g = dgl.graph((src, dst))
    return g


def feat_vec(smiles_string):
    """Returns atom features for a molecule given a smiles string"""
    mol = Chem.MolFromSmiles(smiles_string)
    atom_features_list = []
    for atom in mol.GetAtoms():
        atom_features_list.append(atom_to_feature_vector(atom))
    x = np.array(atom_features_list, dtype=np.int64)
    return x


df = pd.read_csv("row.csv")
lista_senten = df['SENTENCE'].to_list()

# binary label: 1 if the odor description contains 'fruity', else 0
labels = []
for olor in lista_senten:
    olor = olor.split(",")
    if 'fruity' in olor:
        labels.append(1)
    else:
        labels.append(0)

lista_mols = df['SMILES'].to_list()
j = 0
graphs = []
exceptions = []
for mol in lista_mols:
    g_mol = smiles2graph(mol)
    try:
        g_mol.ndata['feat'] = torch.tensor(feat_vec(mol))
    except Exception:
        exceptions.append(j)
    graphs.append(g_mol)
    j += 1

from dgl.data import DGLDataset


class SyntheticDataset(DGLDataset):
    def __init__(self):
        super().__init__(name='synthetic')

    def process(self):
        self.graphs = graphs
        self.labels = torch.LongTensor(labels)

    def __getitem__(self, i):
        return self.graphs[i], self.labels[i]

    def __len__(self):
        return len(self.graphs)


dataset = SyntheticDataset()

from dgl.dataloading import GraphDataLoader
from torch.utils.data.sampler import SubsetRandomSampler

num_examples = len(dataset)
num_train = int(num_examples * 0.8)
train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))
train_dataloader = GraphDataLoader(dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(dataset, sampler=test_sampler, batch_size=5, drop_last=False)

from dgl.nn import GraphConv
from torch import nn
import torch.nn.functional as F


class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        g.ndata['h'] = h
        # graph-level readout: mean over node representations
        return dgl.mean_nodes(g, 'h')


model = GCN(9, 8, 2)  # 9 = number of ogb atom features, 2 = fruity / not fruity
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    for batched_graph, labels in train_dataloader:
        pred = model(batched_graph, batched_graph.ndata['feat'].float())
        # print(pred, labels)
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
    pred = model(batched_graph, batched_graph.ndata['feat'].float())
    num_correct += (pred.argmax(1) == labels).sum().item()
    num_tests += len(labels)
print('Test accuracy:', num_correct / num_tests)

## Save
import os
# `args.save_dir` and `args.name` come from the surrounding script (e.g. argparse)
torch.save(model.state_dict(), os.path.join(args.save_dir, args.name))
## Load: instantiate the model architecture first, then load the state dict
model.load_state_dict(torch.load(os.path.join(args.save_dir, args.name)))
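After reloading the weights, inference on a new molecule reuses the smiles2graph and feat_vec helpers defined above; a minimal sketch (the SMILES below is just an example taken from row.csv):

# predict the "fruity" probability for a new molecule with the reloaded model
model.eval()
new_smiles = "COC(=O)OC"
g_new = smiles2graph(new_smiles)
g_new.ndata['feat'] = torch.tensor(feat_vec(new_smiles))
with torch.no_grad():
    logits = model(g_new, g_new.ndata['feat'].float())   # shape [1, 2]
    prob_fruity = F.softmax(logits, dim=1)[0, 1].item()
print('P(fruity) =', prob_fruity)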

3. Installing the PyG (torch_geometric) graph framework

Reference: https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html

PyG is native to PyTorch; DGL, used above, was originally developed by Amazon and is also widely used.

The PyG install is error-prone, so it is best to create a dedicated conda environment with python==3.9. If the companion packages fail to build with errors such as "Microsoft Visual C++ 14.0 or greater is required. Get it wit...", download the prebuilt wheels and install them directly.

Prebuilt wheels are available at https://pytorch-geometric.com/whl; pick the ones matching your torch version (torch==1.12.1 here) when installing on Windows.

pip install torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric
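To verify the install, a minimal sketch that turns a SMILES string into a torch_geometric Data object; it assumes ogb is also installed and uses ogb's smiles2graph helper, which returns a dict of numpy arrays ('node_feat', 'edge_index', 'edge_feat'):

import torch
from torch_geometric.data import Data
from ogb.utils import smiles2graph   # ogb's own SMILES-to-graph converter

graph = smiles2graph("C/C=C/C(=O)C1CCC(C=C1C)(C)C")   # first molecule from row.csv
data = Data(
    x=torch.tensor(graph['node_feat'], dtype=torch.long),          # atom features
    edge_index=torch.tensor(graph['edge_index'], dtype=torch.long),  # COO connectivity [2, num_edges]
    edge_attr=torch.tensor(graph['edge_feat'], dtype=torch.long),   # bond features
)
print(data)   # e.g. Data(x=[num_atoms, 9], edge_index=[2, num_edges], edge_attr=[num_edges, 3])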

