PyTorch Study Notes, Part 21: Using the GPU

Contents

Introduction
1. CPU and GPU
2. Common torch.cuda methods
3. Multi-GPU scatter/gather parallelism in PyTorch
4. Common errors when loading GPU models

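Introduction

This section covers how to use the GPU to accelerate model computation. It introduces the to function of Tensor and of Module and how the two differ, then looks at the scatter/gather mechanism behind multi-GPU computation.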

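1. CPU and GPU

CPU (Central Processing Unit): consists mainly of a control unit and an arithmetic unit.
GPU (Graphics Processing Unit): designed for large-scale computation over uniform, mutually independent data.

An operation can only run when all of its operands live on the same processor: either all on the CPU, or all on the same GPU. So how do we move data between the CPU and the GPU in PyTorch? The to function handles this.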

# CPU -> GPU (reassign: for a Tensor, to returns a new object)
data = data.to("cuda")
# GPU -> CPU
data = data.to("cpu")
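
As a minimal sketch of the same-device rule above, the snippet below picks whichever device is available and moves both operands there before adding them. The variable names are illustrative, and the CPU fallback is an assumption added so the snippet also runs on a machine without a GPU.

```python
import torch

# Fall back to the CPU when no CUDA device is available (assumption for portability).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.ones((2, 2)).to(device)
b = torch.full((2, 2), 2.0).to(device)

# Both operands live on the same device, so the addition is valid,
# and the result is created on that device as well.
c = a + b
print(c.device, c.sum().item())
```

If one operand stayed on the CPU while the other sat on a GPU, the addition would raise a RuntimeError about mismatched devices.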

PyTorch has two kinds of objects that hold data:

Tensor
Module

Both provide a to function (for converting the data type and/or the device). Let's look at each in turn.

tensor.to(*args, **kwargs)

# Convert the data type
x = torch.ones((3, 3))
x = x.to(torch.float64)
# Convert the device
x = torch.ones((3, 3))
x = x.to("cuda")

module.to(*args, **kwargs)

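# Convert the dtype of the module's data (all floating-point parameters and buffers)
linear = nn.Linear(2, 2)
linear.to(torch.double)
# Move the module's data to another device
gpu1 = torch.device("cuda")
linear.to(gpu1)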

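The difference between the two: a Tensor's to is not in-place. When a conversion is needed it builds and returns a new tensor, so the result must be assigned. A Module's to works in place on the module's parameters and buffers, which is why it needs no "=" reassignment. The code below demonstrates both:

Tensor's to function: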

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# ========================== tensor to cuda
x_cpu = torch.ones((3, 3))
print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

x_gpu = x_cpu.to(device)
print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))

x_cpu:
device: cpu is_cuda: False id: 1515330671360
x_gpu:
device: cuda:0 is_cuda: True id: 1515354356800

The two ids differ, so a Tensor's to is not an in-place operation.
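
One detail worth knowing, sketched below on the CPU so that it runs anywhere: when a tensor already has the requested dtype and device, to returns the tensor itself rather than a copy. A new tensor is only built when a conversion actually happens. The variable names are illustrative.

```python
import torch

x = torch.ones((3, 3))            # float32 on the CPU by default

same = x.to(torch.float32)        # nothing to convert
converted = x.to(torch.float64)   # dtype changes, so a new tensor is built

print(same is x)         # True
print(converted is x)    # False
print(x.dtype)           # the original is untouched: torch.float32
```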
Module's to function:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = nn.Sequential(nn.Linear(3, 3))
print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

net.to(device)
print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

id:2807152911216 is_cuda: False
id:2807152911216 is_cuda: True

The id is unchanged, so a Module's to is an in-place operation.
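
A small sketch of a related point: Module.to also returns the module itself, so construction and placement can be chained in one expression. The layer sizes are arbitrary, and the CPU fallback is an assumption for machines without a GPU.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Sequential(nn.Linear(3, 3))
returned = net.to(device)
print(returned is net)   # True: the same module object comes back

# Because to returns the module, this common one-liner is safe.
net2 = nn.Linear(3, 3).to(torch.float64)
print(next(net2.parameters()).dtype)   # torch.float64
```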
Running on the GPU:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

x_cpu = torch.ones((3, 3))
x_gpu = x_cpu.to(device)

net = nn.Sequential(nn.Linear(3, 3))
net.to(device)

output = net(x_gpu)
print("output is_cuda: {}".format(output.is_cuda))

output is_cuda: True

2. Common torch.cuda methods

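torch.cuda.device_count(): returns the number of GPUs currently visible and available
torch.cuda.get_device_name(): returns the name of a GPU
torch.cuda.manual_seed(): sets the random seed for the current GPU
torch.cuda.manual_seed_all(): sets the random seed for all visible, available GPUs
torch.cuda.set_device(): selects which physical GPU acts as the main GPU (not recommended)
set_device is discouraged because it easily causes confusion in multi-GPU setups. Setting the environment variable instead is recommended: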
```
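# Make physical GPUs 2 and 3 visible; the script then sees only two logical GPUs.
# Set this before CUDA is initialized. setdefault keeps any value already set in the shell.
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")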
```
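Physical GPUs are the GPUs installed in the machine; logical GPUs are the ones visible to the Python script. The number of logical GPUs never exceeds the number of physical GPUs. Among the logical GPUs, GPU 0 is the main GPU by default.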
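
To make the logical/physical distinction concrete, here is a small sketch that maps logical indices to physical ids for a given CUDA_VISIBLE_DEVICES value. The helper logical_to_physical is hypothetical (it is not part of PyTorch), and it assumes the variable holds comma-separated integer ids.

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map logical GPU index -> physical GPU id for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper for illustration; assumes comma-separated integer ids.
    """
    physical_ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return dict(enumerate(physical_ids))


mapping = logical_to_physical("2,3")
print(mapping)  # {0: 2, 1: 3}
```

With this setting, "cuda:0" inside the script refers to physical GPU 2, and "cuda:1" to physical GPU 3.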

The code below tries out the torch.cuda methods:

import torch

gpu_id = 0
gpu_str = "cuda:{}".format(gpu_id)
device = torch.device(gpu_str if torch.cuda.is_available() else "cpu")

x_cpu = torch.ones((3, 3))
x_gpu = x_cpu.to(device)
print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))

device_count = torch.cuda.device_count()
print("\ndevice_count: {}".format(device_count))

device_name = torch.cuda.get_device_name(0)
print("\ndevice_name: {}".format(device_name))

x_gpu:
device: cuda:0 is_cuda: True id: 1535529732288
device_count: 1
device_name: GeForce GTX 1650
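
The seeding methods listed in this section are not exercised above, so here is a minimal sketch of how they might be used for reproducibility. The helper name seed_everything is an assumption, not a PyTorch function. The torch.cuda seeding call is silently ignored when CUDA is unavailable, so the snippet also runs on the CPU.

```python
import torch


def seed_everything(seed: int) -> None:
    """Hypothetical helper: seed the CPU generator and every visible GPU."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no effect when CUDA is unavailable


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

seed_everything(42)
first = torch.randn(3, device=device)

seed_everything(42)
second = torch.randn(3, device=device)

# Re-seeding reproduces the same random numbers on the same device.
print(torch.equal(first, second))
```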

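3. Multi-GPU scatter/gather parallelism in PyTorch

The multi-GPU scatter/gather mechanism: a batch of training data is split evenly and scattered across the GPUs, each GPU computes on its share in parallel, and the results are then gathered back onto the main GPU (by default, the first visible GPU).
The following shows how PyTorch implements this: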

torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)

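Purpose: wraps a model to implement the scatter/gather parallel mechanism
Main parameters:

module: the model to wrap and distribute
device_ids: the GPUs to distribute to; defaults to all visible, available GPUs
output_device: the device on which the results are gathered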
Below we implement multi-GPU parallel computation in code:

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ Manually select the GPUs
# flag = 0
flag = 1
if flag:
    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ============================ Automatically pick the main GPU by free memory
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        """Return the free memory of each GPU (False on Windows)."""
        import platform
        if 'Windows' != platform.system():
            # nvidia-smi: -q queries, -d Memory restricts the query to memory info;
            # grep keeps only the "Free" lines
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            with open('tmp.txt', 'r') as f:
                memory_gpu = [int(x.split()[2]) for x in f.readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
            print("GPU memory query is not supported on Windows")
        return memory_gpu

    gpu_memory = get_gpu_memory()
    if gpu_memory:
        print("\ngpu free memory: {}".format(gpu_memory))
        # Sort GPU indices by free memory, largest first
        gpu_list = np.argsort(gpu_memory)[::-1]
        gpu_list_str = ','.join(map(str, gpu_list))
        # Note: setdefault does nothing if the variable was already set above
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):
        # With one GPU each replica sees 16 samples; with two GPUs, 8 each
        print("\nbatch size in forward: {}".format(x.size()[0]))
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":
    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    # Move the data to the GPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    inputs, labels = inputs.to(device), labels.to(device)

    # model
    net = FooNet(neural_num=3, layers=3)
    # Wrap the model to enable scatter/gather parallelism
    net = nn.DataParallel(net)
    # Move the model to the GPU
    net.to(device)

    # training
    for epoch in range(1):
        # forward pass
        outputs = net(inputs)
        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))

GPU memory query is not supported on Windows
batch size in forward: 16
model outputs.size: torch.Size([16, 3])
CUDA_VISIBLE_DEVICES :0
device_count :1
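
The ordering step used in the script above can be isolated into a small pure-Python helper, sketched below. The function name order_by_free_memory and the memory figures are hypothetical; in the script the figures come from nvidia-smi.

```python
def order_by_free_memory(free_mb):
    """Return GPU indices sorted by free memory, largest first (hypothetical helper)."""
    return sorted(range(len(free_mb)), key=lambda i: free_mb[i], reverse=True)


free_mb = [1024, 8192, 4096]  # hypothetical free memory per GPU, in MiB
gpu_order = order_by_free_memory(free_mb)
visible = ",".join(map(str, gpu_order))
print(visible)  # 1,2,0
```

Listing the GPU with the most free memory first makes it logical GPU 0, which is the main GPU where DataParallel gathers its results.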

4. Common errors when loading GPU models

Error 1:

RuntimeError: Attempting to deserialize object on a CUDA device but
torch.cuda.is_available() is False. If you are running on a CPU-only machine, please
use torch.load with map_location=torch.device('cpu') to map your storages to the
CPU

This means we tried to deserialize a model on a machine where CUDA is unavailable, while the model was saved from a CUDA device.
Solution:

torch.load(path_state_dict, map_location="cpu")

With map_location set, a model saved on the GPU can be loaded on a CPU-only machine.
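
A minimal round-trip sketch of the map_location fix follows. An in-memory buffer stands in for the checkpoint file so that the snippet leaves nothing on disk; the layer sizes and variable names are arbitrary assumptions.

```python
import io

import torch
import torch.nn as nn

net = nn.Linear(3, 3)

# An in-memory buffer stands in for a checkpoint file (assumption for the demo).
buffer = io.BytesIO()
torch.save(net.state_dict(), buffer)
buffer.seek(0)

# map_location="cpu" remaps every stored tensor to the CPU while loading,
# regardless of the device it was saved from.
state_dict = torch.load(buffer, map_location="cpu")

restored = nn.Linear(3, 3)
restored.load_state_dict(state_dict)
print(all(t.device.type == "cpu" for t in state_dict.values()))
```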

import os
import sys
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):
        print("\nbatch size in forward: {}".format(x.size()[0]))
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if torch.cuda.device_count() < 2:
    print("Not enough GPUs; please run this in a multi-GPU environment")
    sys.exit(0)

gpu_list = [0, 1, 2, 3]
gpu_list_str = ','.join(map(str, gpu_list))
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = FooNet(neural_num=3, layers=3)
net = nn.DataParallel(net)
net.to(device)

# save
net_state_dict = net.state_dict()
path_state_dict = "./model_in_multi_gpu.pkl"
torch.save(net_state_dict, path_state_dict)

Error 2:

RuntimeError: Error(s) in loading state_dict for FooNet:
Missing key(s) in state_dict: "linears.0.weight", "linears.1.weight", "linears.2.weight".
Unexpected key(s) in state_dict: "module.linears.0.weight",
"module.linears.1.weight", "module.linears.2.weight".

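This error arises because the model was trained with multi-GPU parallelism and was therefore wrapped in DataParallel, which prefixes every layer name with "module.". When the saved state_dict is loaded into an unwrapped model, the key names no longer match.
Solution:
Rename the keys of the loaded state_dict with the code below: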

from collections import OrderedDict

new_state_dict = OrderedDict()
for k, v in state_dict_load.items():
    # strip the leading "module."
    namekey = k[7:] if k.startswith('module.') else k
    new_state_dict[namekey] = v

import os
import sys
import numpy as np
import torch
import torch.nn as nn


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):
        print("\nbatch size in forward: {}".format(x.size()[0]))
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if torch.cuda.device_count() < 2:
    print("Not enough GPUs; please run this in a multi-GPU environment")
    sys.exit(0)

net = FooNet(neural_num=3, layers=3)

path_state_dict = "./model_in_multi_gpu.pkl"
state_dict_load = torch.load(path_state_dict, map_location="cpu")
print("state_dict_load:\n{}".format(state_dict_load))

# net.load_state_dict(state_dict_load)  # loading directly raises Error 2

# remove the "module." prefix
from collections import OrderedDict
new_state_dict = OrderedDict()
for k, v in state_dict_load.items():
    namekey = k[7:] if k.startswith('module.') else k
    new_state_dict[namekey] = v
print("new_state_dict:\n{}".format(new_state_dict))

net.load_state_dict(new_state_dict)

state_dict_load:
OrderedDict([('module.linears.0.weight', tensor([[ 0.3337,  0.0317, -0.1331],[ 0.0431,  0.0454,  0.1235],[ 0.0575, -0.2903, -0.2634]])), ('module.linears.1.weight', tensor([[ 0.1235,  0.1520, -0.1611],[ 0.4511, -0.1460, -0.1098],[ 0.0653, -0.5025, -0.1693]])), ('module.linears.2.weight', tensor([[ 0.3657, -0.1107, -0.2341],[ 0.0657, -0.0194, -0.3119],[-0.0477, -0.1008,  0.2462]]))])
new_state_dict:
OrderedDict([('linears.0.weight', tensor([[ 0.3337,  0.0317, -0.1331],[ 0.0431,  0.0454,  0.1235],[ 0.0575, -0.2903, -0.2634]])), ('linears.1.weight', tensor([[ 0.1235,  0.1520, -0.1611],[ 0.4511, -0.1460, -0.1098],[ 0.0653, -0.5025, -0.1693]])), ('linears.2.weight', tensor([[ 0.3657, -0.1107, -0.2341],[ 0.0657, -0.0194, -0.3119],[-0.0477, -0.1008,  0.2462]]))])
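
If you control the saving side, an alternative to renaming keys afterwards is to save the state_dict of the wrapped module itself, reached through the wrapper's module attribute. The sketch below uses a tiny stand-in network rather than FooNet; the names are illustrative, and it also runs on a machine without a GPU.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 3, bias=False))
wrapped = nn.DataParallel(net)

# Keys from the wrapper carry the "module." prefix ...
print(list(wrapped.state_dict().keys()))          # ['module.0.weight']
# ... while the inner module's keys do not.
print(list(wrapped.module.state_dict().keys()))   # ['0.weight']

# A plain (unwrapped) model can load the inner state_dict directly.
fresh = nn.Sequential(nn.Linear(3, 3, bias=False))
fresh.load_state_dict(wrapped.module.state_dict())
```

Saving wrapped.module.state_dict() in the training script means the checkpoint loads into an unwrapped model without any key rewriting.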
