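Preface

This post is part two of a series on fine-grained classification with Hierarchical Bilinear Pooling (HBP). It continues from https://blog.csdn.net/DaZheng121/article/details/124337239
Following a suggestion from a senior labmate, I addressed the CUDA errors that appear when the batch size is too large. The result is a way to train with a large batch size on a low-spec machine.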

I. Enlarging the effective batch size with gradient accumulation

References:
https://zhuanlan.zhihu.com/p/445009191
https://www.cnblogs.com/lart/p/11628696.html

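1. Basics

First, you need to understand loss.backward(), optimizer.step() and optimizer.zero_grad().
loss.backward() computes the gradient of the loss with respect to each weight and adds it to the gradient already stored for that weight. optimizer.step() updates the weights using the stored gradients. optimizer.zero_grad() resets the gradients, that is, it clears the stored derivative of the loss with respect to each weight.

2. Further background

By default, backward() frees the computation graph once it has finished. The graph holds the intermediate results of the forward pass, which backpropagation needs. After one backward pass the gradients are kept in the attributes of each layer's parameters while the graph itself is released, so that GPU (or host) memory becomes available again. You can then run more batches, backpropagate each of them in the same way so that their gradients are added to the ones already stored, and update the parameters only once at the end. That is gradient accumulation.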
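The behaviour described above can be checked on a tiny example. This is only a minimal sketch on the CPU: the tensor `w`, the helper `loss_fn` and the input values are made up for illustration and are not part of the training code in this post.

```python
import torch

# One trainable tensor standing in for a layer's weights (illustrative values).
w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)


def loss_fn(x):
    # Hypothetical toy loss: d(loss)/dw equals x.
    return (w * x).sum()


# First backward pass: the gradient is stored in w.grad, the graph is freed.
loss_fn(torch.tensor([1.0, 1.0])).backward()
grad_after_first = w.grad.clone()

# Second backward pass without zero_grad(): the new gradient is added to w.grad.
loss_fn(torch.tensor([2.0, 3.0])).backward()
grad_after_second = w.grad.clone()

# One update with the accumulated gradient, then clear it.
optimizer.step()
optimizer.zero_grad()

# Depending on the PyTorch version, zero_grad() either sets .grad to None
# or fills it with zeros, so both cases are accepted here.
grad_cleared = w.grad is None or bool((w.grad == 0).all())

print(grad_after_first, grad_after_second, w.detach(), grad_cleared)
```

The second gradient is the sum of both passes, which is exactly what the accumulation trick relies on.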

3. Implementing gradient accumulation

for i, (images, target) in enumerate(train_loader):
    # 1. input and output
    images = images.cuda(non_blocking=True)
    target = torch.from_numpy(np.array(target)).float().cuda(non_blocking=True)
    outputs = model(images)
    loss = criterion(outputs, target)

    # 2.1 loss normalization: average over the accumulated batches
    loss = loss / accumulation_steps
    # 2.2 back propagation: gradients are added to the stored ones
    loss.backward()

    # 3. update the parameters of the net every accumulation_steps batches
    if ((i + 1) % accumulation_steps) == 0:
        optimizer.step()        # update parameters of net
        optimizer.zero_grad()   # reset gradient

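1. Get the loss: feed in the images and labels, run the forward pass to get the predictions, and compute the loss function.
2. Call loss.backward() to backpropagate and compute the current gradient.
3. Repeat steps 1 and 2 several times without clearing the gradients, so that each new gradient is added to the existing one.
4. After a set number of accumulations, first call optimizer.step() to update the network parameters from the accumulated gradient, then call optimizer.zero_grad() to clear the gradients and get ready for the next round of accumulation.

In short, gradient accumulation means: fetch one batch at a time, compute its gradient, do not clear it, and keep adding. After a set number of batches, update the network parameters from the accumulated gradient, clear the gradient, and start the next cycle.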
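To see why dividing the loss by accumulation_steps matters, here is a small numeric check in plain Python. It is a sketch under simple assumptions: a hypothetical one-weight model `w * x` with a mean squared error loss, made-up data, and micro-batches of equal size. It compares the gradient of one large batch with the gradient accumulated over several small ones.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Made-up data: 8 samples, split into 4 micro-batches of 2 samples each.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5
accumulation_steps = 4
micro_batch = len(xs) // accumulation_steps

# Gradient computed on the full batch of 8 samples at once.
full_batch_grad = grad_mse(w, xs, ys)

# Gradient accumulated over micro-batches, each scaled by 1 / accumulation_steps.
accumulated_grad = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro_batch, (i + 1) * micro_batch
    accumulated_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

The two values agree up to floating-point rounding. Note that this equivalence holds for losses averaged over samples; layers whose output depends on the batch itself, such as batch normalization, still only see the small batch.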

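Under certain conditions, a larger batch size gives better training results, and gradient accumulation enlarges the batch size in effect. With accumulation_steps set to 8, the effective batch size is 8 times the per-step batch size. When using it, keep in mind that the learning rate should also be increased accordingly.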
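The arithmetic behind the effective batch size can be written down directly. The sketch below also includes a linear learning-rate scaling rule, which is a common heuristic and my own assumption here, not something prescribed by the method above. The function names and the numbers are hypothetical.

```python
def effective_batch_size(micro_batch_size, accumulation_steps):
    """Number of samples that contribute to one optimizer.step()."""
    return micro_batch_size * accumulation_steps


def linearly_scaled_lr(base_lr, base_batch_size, micro_batch_size, accumulation_steps):
    """Scale the learning rate in proportion to the effective batch size.

    This linear rule is only a heuristic starting point; the right value
    still has to be tuned for the actual model and data.
    """
    eff = effective_batch_size(micro_batch_size, accumulation_steps)
    return base_lr * eff / base_batch_size


# Illustrative numbers: a learning rate tuned for batch size 4,
# reused with 8 accumulation steps of 4 samples each.
eff = effective_batch_size(4, 8)
lr = linearly_scaled_lr(0.01, 4, 4, 8)
print(eff, lr)
```

Treat the scaled value as a first guess and check it against the validation accuracy.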

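II. PyTorch implementation

1. Train2.py

Apart from using gradient accumulation to enlarge the effective batch size, the only difference from HBP (part one) is that patience and end_patient are set to smaller values.

The code is as follows: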

import torch
import torch.nn as nn
import torch.optim
import torch.utils.data
import torchvision
import os
import NetModel
import CUB200

# base_lr = 0.1
# batch_size = 24
num_epochs = 200
weight_decay = 1e-8
num_classes = 200
cub200_path = 'E:/DataSets/CUB_200_2011/'
save_model_path = 'model_saved'

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

fc = 1
ft = 2


def train(mode, Model, model_path, base_lr, batch_size, step_num):
    # Load the network.
    model = Model
    model = model.to(device)
    param_to_optim = []
    if mode == fc:
        # Optimize only the parameters that require gradients (the fc layer).
        for param in model.parameters():
            if not param.requires_grad:
                continue
            param_to_optim.append(param)
        optimizer = torch.optim.SGD(param_to_optim, lr=base_lr, momentum=0.9, weight_decay=weight_decay)
    elif mode == ft:
        # Load the saved model.
        model.load_state_dict(torch.load(os.path.join(save_model_path, model_path),
                                         map_location=lambda storage, loc: storage))
        # Optimize all parameters.
        optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9, weight_decay=weight_decay)

    criterion = nn.CrossEntropyLoss()

    # If the monitored value does not increase for 3 consecutive epochs,
    # the learning rate is multiplied by 0.1.
    # Note: the verbose argument is deprecated in recent PyTorch releases.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', factor=0.1, patience=2, verbose=True)

    # Calculate the mean and variance of each channel of sample data,
    # run it only once, and record the corresponding value
    # get_statistic()
    # Mean and variance of CUB_200 dataset are [0.4856, 0.4994, 0.4324], [0.1817, 0.1811, 0.1927]

    # Set up the data preprocessing
    train_transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(448),
        torchvision.transforms.CenterCrop(448),
        torchvision.transforms.RandomHorizontalFlip(),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize([0.4856, 0.4994, 0.4324], [0.1817, 0.1811, 0.1927])])
    test_transform = torchvision.transforms.Compose([
        torchvision.transforms.Resize(448),
        torchvision.transforms.CenterCrop(448),
        torchvision.transforms.ToTensor(),
        torchvision.transforms.Normalize([0.4856, 0.4994, 0.4324], [0.1817, 0.1811, 0.1927])])

    train_data = CUB200.CUB200(cub200_path, train=True, transform=train_transform)
    test_data = CUB200.CUB200(cub200_path, train=False, transform=test_transform)
    train_loader = torch.utils.data.DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=batch_size, shuffle=False)

    print('Start training ...')
    best_acc = 0.
    best_epoch = 0
    end_patient = 0
    training_accuracy = []
    testing_accuracy = []
    epochs = []
    size = len(train_loader.dataset)
    for epoch in range(num_epochs):
        correct = 0
        total = 0
        epoch_loss = 0.
        for i, (images, labels) in enumerate(train_loader):
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            # Scale the loss so the accumulated gradient is an average over step_num batches.
            loss = loss / step_num
            loss.backward()
            # Update the parameters once every step_num batches.
            if (i + 1) % step_num == 0:
                optimizer.step()
                optimizer.zero_grad()
            # Use .item() so that a plain float, not a tensor, is accumulated.
            epoch_loss += loss.item()
            _, prediction = torch.max(outputs.data, 1)
            correct += (prediction == labels).sum().item()
            total += labels.size(0)
            if (i + 1) % 480 == 0:
                print('Epoch %d: Iter %d/%d, Loss %g' % (epoch + 1, (i + 1) * batch_size, size, loss.item()))
        train_acc = 100 * correct / total
        print('Testing on test dataset...')
        test_acc = test_accuracy(model, test_loader)
        print('Epoch [{}/{}] Loss: {:.4f} Train_Acc: {:.4f}  Test1_Acc: {:.4f}'
              .format(epoch + 1, num_epochs, epoch_loss, train_acc, test_acc))
        scheduler.step(test_acc)
        training_accuracy.append(train_acc)
        testing_accuracy.append(test_acc)
        epochs.append(epoch)
        if test_acc > best_acc:
            if mode == fc:
                # Remove the previous best checkpoint before saving the new one.
                model_file = os.path.join(save_model_path, 'CUB_200_train_fc_epoch_%d_acc_%g.pth' %
                                          (best_epoch, best_acc))
                if os.path.isfile(model_file):
                    os.remove(model_file)
                end_patient = 0
                best_acc = test_acc
                best_epoch = epoch + 1
                print('The accuracy is improved, save model')
                torch.save(model.state_dict(), os.path.join(save_model_path,
                                                            'CUB_200_train_fc_epoch_%d_acc_%g.pth' %
                                                            (best_epoch, best_acc)))
            elif mode == ft:
                model_file = os.path.join(save_model_path, 'CUB_200_train_ft_epoch_%d_acc_%g.pth' %
                                          (best_epoch, best_acc))
                if os.path.isfile(model_file):
                    os.remove(model_file)
                end_patient = 0
                best_acc = test_acc
                best_epoch = epoch + 1
                print('The accuracy is improved, save model')
                torch.save(model.state_dict(), os.path.join(save_model_path,
                                                            'CUB_200_train_ft_epoch_%d_acc_%g.pth' %
                                                            (best_epoch, best_acc)))
        else:
            end_patient += 1
            print('Impatient: ', end_patient)
        # If the accuracy has not improved for 8 epochs, training ends.
        if end_patient >= 8:
            break
    print('After the training, the end of the epoch %d, the accuracy %g is the highest' % (best_epoch, best_acc))
    print('epochs:', epochs)
    print('training accuracy:', training_accuracy)
    print('testing accuracy:', testing_accuracy)


def test_accuracy(model, test_loader):
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, prediction = torch.max(outputs.data, 1)
            correct += (prediction == labels).sum().item()
            total += labels.size(0)
    model.train()
    return 100 * correct / total
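One detail of the training loop above is worth a note. When the number of batches in an epoch is not a multiple of step_num, the gradients of the last few batches are not applied at the end of the epoch; they stay stored and flow into the first update of the next epoch. A possible variant, which is my own suggestion and not part of Train2.py, is to run one extra update after the last batch. The helper below is hypothetical and only lists the batch counts at which optimizer.step() would run under that variant.

```python
def update_points(num_batches, step_num):
    """Return the 1-based batch counts after which optimizer.step() would run.

    Regular updates happen every step_num batches. If batches are left over
    at the end of the epoch, one extra update is added after the last batch.
    """
    points = [i for i in range(1, num_batches + 1) if i % step_num == 0]
    if num_batches % step_num != 0:
        points.append(num_batches)
    return points


# 10 batches with step_num = 4: updates after batches 4 and 8,
# plus one extra update for the 2 leftover batches.
print(update_points(10, 4))
# 8 batches with step_num = 4: no leftover, so no extra update.
print(update_points(8, 4))
```

Since each loss is divided by step_num, the extra update is based on fewer samples and is therefore smaller than a regular one.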

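2. main.py

Because of the limits of my machine, the largest batch size that fits during training is 6 in fc mode and 1 in ft mode. So when enlarging the effective batch size, step_num and the batch size have to be changed together.

The code is as follows: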

import Train
import Train2
import NetModel

step_num1 = 8
step_num2 = 16
model = NetModel.HBP(pretrained=False)
model_path = 'CUB_200_train_fc_epoch_42_acc_79.6859.pth'
base_lr = 0.1
batch_size = 24

fc = 1
fc_base_lr = 0.1
fc_batch_size = int(6*step_num1/step_num1)     # max=6
ft = 2
ft_base_lr = 0.001
ft_batch_size = int(step_num2/step_num2)  # max=1

mode = ft
if mode == fc:
    model = NetModel.HBP(pretrained=True)
    base_lr = fc_base_lr
    batch_size = fc_batch_size
    Train2.train(mode=mode, Model=model, model_path=model_path, base_lr=base_lr,
                 batch_size=batch_size, step_num=step_num1)
elif mode == ft:
    base_lr = ft_base_lr
    batch_size = ft_batch_size
    Train2.train(mode=mode, Model=model, model_path=model_path, base_lr=base_lr,
                 batch_size=batch_size, step_num=step_num2)

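III. Results

1. Fine-tuning from the model trained with Train.py

For Train.py, see https://blog.csdn.net/DaZheng121/article/details/124337239

The initial learning rate was set to 0.0001.
With gradient accumulation, the effective batch size was 8.

The fine-tuning results are not encouraging. One possible reason is that the accuracy on the training set had already reached 100%. Another is that the initial learning rate was too small to escape a local optimum. Since training took too long, I decided to raise the batch size directly in order to improve the model's generalization.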

2. Results with Train2.py

Compared with 1, generalization improved and the overall accuracy rose by 2.4%.

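Summary

This post presented a way to train a model with a large batch size on a device with little GPU memory. It is also a way around having to shrink the batch size because of CUDA out of memory errors.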
