PyTorch Single-Machine Multi-GPU Acceleration

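I spent two busy months going from receiving the raw data to loading the trained model in the engineering project and finishing the tests. It feels like I have not had a day off since returning to work after New Year's Day. Yesterday the project passed its phase-three review, so I can finally catch my breath and catch up on some writing. (I wrote this post before the Lunar New Year and am reposting it today.)
Multi-GPU parallelism
Always use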

torch.nn.parallel.DistributedDataParallel()
torch.nn.parallel.DistributedDataParallel()
torch.nn.parallel.DistributedDataParallel()

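Important things are worth saying three times!
Do not use torch.nn.DataParallel(model). In my case it gave almost no speedup. The project was under too much pressure for me to dig through the API docs and find out why, so I simply stopped using it. After switching to DDP mode, each epoch dropped from 900 s to 250 s (two 3090 GPUs, batch size 6), which speaks for itself.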

Import the packages and add the initialization code

import torch
from torch import distributed as dist

dist.init_process_group(backend="nccl")

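Parallelizing the model

model = your_model()
local_rank = torch.distributed.get_rank()  # equals the GPU index on a single machine
torch.cuda.set_device(local_rank)
model = model.cuda(local_rank)
# One process per GPU, so each process hands DDP only its own device.
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    find_unused_parameters=False, broadcast_buffers=False)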

Changing how the data is loaded

from torch.utils.data import DataLoader

dataset = Your_Dataset()
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader_train = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=False, num_workers=16,
                          pin_memory=True, drop_last=True, sampler=sampler)

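Note: compared with the usual loading code, the only additions are the sampler and shuffle=False in DataLoader(). This does not mean the data is no longer shuffled; shuffling is now handled by the sampler.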
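To make the sampler's role concrete, here is a minimal sketch of how DistributedSampler splits a dataset across processes. RangeDataset and indices_for_rank are hypothetical names used only for illustration, and num_replicas and rank are passed explicitly so the snippet runs without a process group or a GPU. Shuffling is turned off here only to make the split easy to read.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class RangeDataset(Dataset):
    """Hypothetical stand-in for Your_Dataset: sample i is just the integer i."""

    def __init__(self, size):
        self.size = size

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        return index


def indices_for_rank(rank, world_size=2, size=10):
    # Passing num_replicas and rank explicitly avoids needing init_process_group here.
    sampler = DistributedSampler(
        RangeDataset(size), num_replicas=world_size, rank=rank, shuffle=False
    )
    return list(sampler)


rank0 = indices_for_rank(0)
rank1 = indices_for_rank(1)
print(rank0)  # [0, 2, 4, 6, 8]
print(rank1)  # [1, 3, 5, 7, 9]
```

Each process sees a disjoint slice of the indices, which is why the DataLoader itself must not shuffle as well.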
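That is all it takes, and the script will now run in parallel. However, you will notice that the data does not seem to be reshuffled: the loading order is the same in every epoch! The official API documentation explains that one more line is needed, which sets the seed at the start of each epoch: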

sampler.set_epoch(epoch)
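The effect of set_epoch can be checked in isolation. The sketch below is an illustration under assumptions, not the project's code: ToyDataset and epoch_orders are hypothetical names, and num_replicas, rank and seed are fixed by hand so that no process group is needed.

```python
from torch.utils.data import Dataset
from torch.utils.data.distributed import DistributedSampler


class ToyDataset(Dataset):
    """Hypothetical stand-in for Your_Dataset with ten integer samples."""

    def __len__(self):
        return 10

    def __getitem__(self, index):
        return index


def epoch_orders(call_set_epoch, num_epochs=3):
    # shuffle=True is the default; the permutation is seeded with seed + epoch.
    sampler = DistributedSampler(
        ToyDataset(), num_replicas=2, rank=0, shuffle=True, seed=0
    )
    orders = []
    for epoch in range(num_epochs):
        if call_set_epoch:
            sampler.set_epoch(epoch)
        orders.append(list(sampler))
    return orders


without_set_epoch = epoch_orders(call_set_epoch=False)
with_set_epoch = epoch_orders(call_set_epoch=True)
print(without_set_epoch)  # the same order repeated for every epoch
print(with_set_epoch)     # the seed changes per epoch, so the orders normally differ
```

Without the call, the sampler's internal epoch stays at 0 and the same permutation is produced every time, which matches the behaviour described above.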

So the final code looks roughly like this:

import torch
from torch import distributed as dist
from torch.utils.data import DataLoader

dist.init_process_group(backend="nccl")

model = your_model()
local_rank = torch.distributed.get_rank()  # equals the GPU index on a single machine
torch.cuda.set_device(local_rank)
model = model.cuda(local_rank)
# One process per GPU, so each process hands DDP only its own device.
model = torch.nn.parallel.DistributedDataParallel(
    model, device_ids=[local_rank], output_device=local_rank,
    find_unused_parameters=False, broadcast_buffers=False)
...
dataset = Your_Dataset()
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
loader_train = DataLoader(dataset, batch_size=cfg.batch_size, shuffle=False, num_workers=16,
                          pin_memory=True, drop_last=True, sampler=sampler)
...
for epoch in range(cfg.epoch):
    sampler.set_epoch(epoch)  # reseed the shuffle for this epoch
    ...
    if local_rank == 0:  # save from a single process only
        torch.save(model.state_dict(), save_path)

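Finally, remember to do this on Linux; otherwise you keep running into inexplicable errors. Also, code written this way produces a warning at run time. I was rushing the project and had no time to deal with it, so if anyone knows the cause, please let me know. Thanks!

Running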

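python -m torch.distributed.launch --nproc_per_node=2 train.py

(--nproc_per_node=2 starts one process per GPU; without it the launcher starts only a single process.)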

Loading the model
A model saved after multi-GPU training has an extra "module." prefix on every parameter name. Strip that prefix when loading:

from collections import OrderedDict

import torch

def load_parallal_model(model, pretrain_dir):
    state_dict_ = torch.load(pretrain_dir, map_location='cuda:0')
    print('loaded pretrained weights from %s !' % pretrain_dir)
    state_dict = OrderedDict()
    # convert data-parallel keys to plain model keys ('module.' is 7 characters)
    for key in state_dict_:
        if key.startswith('module') and not key.startswith('module_list'):
            state_dict[key[7:]] = state_dict_[key]
        else:
            state_dict[key] = state_dict_[key]
    # check loaded parameters against the created model's parameters
    model_state_dict = model.state_dict()
    for key in state_dict:
        if key in model_state_dict:
            if state_dict[key].shape != model_state_dict[key].shape:
                print('Skip loading parameter {}, required shape{}, loaded shape{}.'.format(
                    key, model_state_dict[key].shape, state_dict[key].shape))
                state_dict[key] = model_state_dict[key]
        else:
            print('Drop parameter {}.'.format(key))
    # fall back to the model's own values for anything missing from the checkpoint
    for key in model_state_dict:
        if key not in state_dict:
            print('No param {}.'.format(key))
            state_dict[key] = model_state_dict[key]
    model.load_state_dict(state_dict, strict=False)
    return model
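The prefix handling can be tried on its own with a plain dictionary. This is a minimal sketch, not the function above: strip_module_prefix and the sample keys are hypothetical, and it matches on "module." including the dot, which is a slightly different test from the startswith('module') check used in the original.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(prefix):
            cleaned[key[len(prefix):]] = value
        else:
            cleaned[key] = value
    return cleaned


# Hypothetical checkpoint keys; the values would normally be tensors.
saved = OrderedDict([
    ("module.conv1.weight", "w1"),
    ("module.fc.bias", "b1"),
    ("module_list.0.weight", "w2"),
])
cleaned = strip_module_prefix(saved)
print(list(cleaned.keys()))
# ['conv1.weight', 'fc.bias', 'module_list.0.weight']
```

Matching on the dot leaves keys such as module_list.0.weight untouched without a separate exclusion.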

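Everything has to change for the sake of efficiency
To improve execution speed, remember: do not write loops! Do not write loops! Do not write loops! Especially in core code, avoid loops whenever you can!!! The colleague who left this code behind wrote a four-level nested loop to generate heatmaps for 3D data, and it nearly killed me.
The previous code: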

for id in class_ids:
    heatmap = np.zeros_like(img_np, dtype=np.float64)
    pos = sample[id]
    # clip the box of half-size r around pos to the volume bounds X, Y, Z
    range_heat = [[0, 0], [0, 0], [0, 0]]
    range_heat[0][0] = pos[0] - r[0] if pos[0] > r[0] else 0
    range_heat[1][0] = pos[1] - r[1] if pos[1] > r[1] else 0
    range_heat[2][0] = pos[2] - r[2] if pos[2] > r[2] else 0
    range_heat[0][1] = pos[0] + r[0] if pos[0] + r[0] < X else X
    range_heat[1][1] = pos[1] + r[1] if pos[1] + r[1] < Y else Y
    range_heat[2][1] = pos[2] + r[2] if pos[2] + r[2] < Z else Z
    # visit every voxel in the box one at a time
    for z in range(range_heat[2][0], range_heat[2][1]):
        for y in range(range_heat[1][0], range_heat[1][1]):
            for x in range(range_heat[0][0], range_heat[0][1]):
                d = np.sqrt(np.power(z - pos[2], 2) + np.power(y - pos[1], 2) + np.power(x - pos[0], 2))
                heat_value = value_transform(d)
                heatmap[z][y][x] = heat_value

As a result, one epoch took 250 s!!!
Here is the revised version:

import numpy as np

def gaussian3D(shape, sigma=1):
    # build the whole Gaussian patch at once via broadcasting
    m, n, p = [(ss - 1.) / 2. for ss in shape]
    y, x, q = np.ogrid[-m:m + 1, -n:n + 1, -p:p + 1]
    h = np.exp(-(x * x + y * y + q * q) / (2 * sigma * sigma))
    h[h < np.finfo(h.dtype).eps * h.max()] = 0
    return h

def draw_gaussian(heatmap, center, radius):
    diameter = 2 * radius + 1
    gaussian = gaussian3D((diameter, diameter, diameter), sigma=diameter / 6)
    c_x, c_y, c_z = int(center[0]), int(center[1]), int(center[2])
    Z, Y, X = heatmap.shape
    # clip the patch where it would extend past the volume
    left, right = min(c_x, radius), min(X - c_x, radius + 1)
    top, bottom = min(c_y, radius), min(Y - c_y, radius + 1)
    h_top, h_bottom = min(c_z, radius), min(Z - c_z, radius + 1)
    masked_heatmap = heatmap[c_z - h_top:c_z + h_bottom, c_y - top:c_y + bottom, c_x - left:c_x + right]
    masked_gaussian = gaussian[radius - h_top:radius + h_bottom, radius - top:radius + bottom, radius - left:radius + right]
    # write the element-wise maximum back into the heatmap view
    np.maximum(masked_heatmap, masked_gaussian, out=masked_heatmap)
    return heatmap

for id in class_ids:
    heatmap = np.zeros_like(img_np, dtype=np.float64)
    pos = sample[id]
    draw_gaussian(heatmap, pos, 9)
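As a sanity check on the idea, the sketch below compares a triple loop with a broadcast version of the same Gaussian patch. It is an illustration only: gaussian_loop and gaussian_vectorized are hypothetical names, the loop uses a Gaussian in place of the unknown value_transform from the old code, and the small-value thresholding from gaussian3D is left out.

```python
import numpy as np


def gaussian_loop(radius, sigma):
    """Fill a (2r+1)^3 Gaussian patch one voxel at a time."""
    size = 2 * radius + 1
    patch = np.zeros((size, size, size), dtype=np.float64)
    for z in range(size):
        for y in range(size):
            for x in range(size):
                d2 = (z - radius) ** 2 + (y - radius) ** 2 + (x - radius) ** 2
                patch[z, y, x] = np.exp(-d2 / (2 * sigma * sigma))
    return patch


def gaussian_vectorized(radius, sigma):
    """Build the same patch in one expression using np.ogrid broadcasting."""
    z, y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1, -radius:radius + 1]
    return np.exp(-(x * x + y * y + z * z) / (2 * sigma * sigma))


loop_patch = gaussian_loop(3, sigma=7 / 6)
vec_patch = gaussian_vectorized(3, sigma=7 / 6)
print(loop_patch.shape)                    # (7, 7, 7)
print(np.allclose(loop_patch, vec_patch))  # True
```

The three ogrid arrays have shapes (7, 1, 1), (1, 7, 1) and (1, 1, 7), so their sum broadcasts to the full volume without any Python-level loop.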

In the end, one epoch went from the original 900 s down to 150 s, and training that would have taken 2 to 3 days finished in 12 hours.
