A Collection of CNN Tricks for Deep Learning

Source | Zhihu author: sticky

Link | https://zhuanlan.zhihu.com/p/137940586

Editor | 深度学习这件小事 (WeChat official account)

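I recently came across a gem of a paper from CVPR 2019: Bag of Tricks for Image Classification with Convolutional Neural Networks. It presents effective tricks for improving an existing baseline (ResNet-50) from three angles:

1. Efficient training on new hardware

2. A few minor adjustments to the ResNet-50 architecture

3. Training refinements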

Bag of Tricks for Image Classification with Convolutional Neural Networks

Paper link:

http://openaccess.thecvf.com/content_CVPR_2019/papers/He_Bag_of_Tricks_for_Image_Classification_with_Convolutional_Neural_Networks_CVPR_2019_paper.pdf

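A brief walkthrough of the paper

1. Efficient training on new hardware

1.1 Background

When ResNet was first proposed, the hardware of the time forced many performance-related trade-offs. With the rapid progress of hardware (GPUs in particular) over the past few years, many of those trade-offs have changed. They include:

1. Using a larger batch size, e.g. going from 256 to 1024

2. Using lower numerical precision, e.g. going from FP32 to FP16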

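1.2 Using a larger batch size

A larger batch size can slow down training progress. For convex problems, the convergence rate decreases as the batch size increases. In other words, for the same number of epochs, training with a larger batch size may yield lower validation accuracy. The paper uses the following tricks to address this.

Linear scaling learning rate: for example, if the initial learning rate is 0.1 for a batch size of 256, then when the batch size is increased to b, the initial learning rate should be increased to 0.1×b/256.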

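Learning rate warmup: for example, use the first 5 epochs for warmup, increasing the learning rate linearly from 0 to the initial learning rate over those 5 epochs, and then start the normal decay.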
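The linear scaling rule and the warmup can be sketched in a few lines of plain Python. The function names and the per-step granularity of the warmup are assumptions made for illustration; only the 0.1 base rate, the 256 base batch size, and the linear ramp from 0 come from the text.

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * batch_size / base_batch_size


def warmup_lr(step, warmup_steps, target_lr):
    """Ramp linearly from 0 to target_lr over warmup_steps, then hold target_lr.

    The regular decay schedule (step or cosine) would take over after warmup;
    it is left out here to keep the sketch small.
    """
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr


if __name__ == "__main__":
    target = scaled_lr(1024)  # batch size 1024 gives 4x the base rate
    for step in (0, 250, 500, 1000):
        print(step, warmup_lr(step, warmup_steps=1000, target_lr=target))
```

In practice `warmup_steps` would be the number of batches in the warmup epochs (5 epochs in the example above).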

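Zero $\gamma$: each residual block ends with a batch normalization (BN) layer. BN first standardizes its input $x$ to obtain $\hat{x}$, and then applies the linear transformation $\gamma \hat{x} + \beta$, where $\gamma$ and $\beta$ are learnable parameters normally initialized to 1s and 0s respectively. Here, $\gamma = 0$ is used instead for the BN layer at the end of each residual block, so that every block initially returns just its input and the network is easier to train in the early stage.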
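To see why a zero γ turns a residual block into an identity mapping at initialization, here is a toy sketch in plain Python. The `branch` computation is a hypothetical stand-in for the block's conv layers, and the final ReLU of a real block is omitted; this is an illustration, not the paper's implementation.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Standardize values, then apply the affine map gamma * x_hat + beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def residual_block(x, gamma, beta=0.0):
    """Toy residual block: x + BN(branch(x)), with a made-up branch."""
    branch = [3.0 * v + 1.0 for v in x]  # hypothetical stand-in for conv layers
    return [xi + bi for xi, bi in zip(x, batch_norm(branch, gamma, beta))]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 4.0]
    print(residual_block(x, gamma=1.0))  # default init: the branch changes the output
    print(residual_block(x, gamma=0.0))  # zero gamma: the output equals the input
```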

No bias decay: to avoid overfitting, weight decay is usually applied to both the weights and the biases. Here, decay is applied only to the weights and not to the biases.
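One common way to express this is to split the parameters into two groups with different decay settings. The sketch below works on parameter names only; the `.bias` suffix convention, the example names, and the 1e-4 decay value are assumptions for illustration. The list-of-dicts layout loosely mirrors the per-group options that optimizers such as `torch.optim.SGD` accept, but nothing here depends on a particular framework.

```python
def split_decay_groups(param_names, weight_decay=1e-4):
    """Return two groups: weights (decayed) and biases (not decayed)."""
    decay, no_decay = [], []
    for name in param_names:
        if name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


if __name__ == "__main__":
    names = ["conv1.weight", "conv1.bias", "fc.weight", "fc.bias"]  # hypothetical names
    for group in split_decay_groups(names):
        print(group)
```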

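1.3 Using lower numerical precision

Neural networks have traditionally been trained with 32-bit floating point (FP32) precision. Newer hardware, however, has enhanced arithmetic logic units for lower-precision data types. For example, the Nvidia V100 offers 14 TFLOPS in FP32 but over 100 TFLOPS in FP16. As a result, switching to FP16 speeds up overall training by a factor of 2 to 3: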

Comparison of the training time and validation accuracy for ResNet-50 between the baseline (BS=256 with FP32) and a more hardware efficient setting (BS=1024 with FP16).

The breakdown effect for each effective training heuristic on ResNet-50.

2. Model tweaks

The architecture of ResNet-50. The convolution kernel size, output channel size and stride size (default is 1) are illustrated, similar for pooling layers.

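The changes mainly target the downsampling block and the input stem (the parts highlighted in the figure above):

1. The downsampling block is changed because a 1×1 conv with stride=2 ignores 3/4 of the input feature map. To fix this while keeping the output shape unchanged, the first two convs of path A become a 1×1 conv with stride=1 and a 3×3 conv with stride=2, which gives ResNet-B; path B is replaced with a 2×2 AvgPool with stride=2 followed by a 1×1 conv with stride=1, which gives ResNet-D.

2. The input stem is changed because a 7×7 conv costs 5.4 times as much to compute as a 3×3 conv. The 7×7 conv is therefore replaced with three consecutive 3×3 convs, which gives ResNet-C.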
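Both numbers quoted in this list can be checked with a little arithmetic. The sketch below counts which input positions a strided convolution actually reads, and compares kernel areas as a rough proxy for the cost per output value. The 56×56 input size and the padding of 1 for the 3×3 conv are assumptions chosen for illustration.

```python
def output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def covered_fraction(size, kernel, stride, padding=0):
    """Fraction of the size x size input positions read by at least one window."""
    covered = set()
    out = output_size(size, kernel, stride, padding)
    for oy in range(out):
        for ox in range(out):
            for ky in range(kernel):
                for kx in range(kernel):
                    y = oy * stride - padding + ky
                    x = ox * stride - padding + kx
                    if 0 <= y < size and 0 <= x < size:
                        covered.add((y, x))
    return len(covered) / (size * size)


def kernel_cost_ratio(big, small):
    """Cost ratio of two square kernels, taking cost as proportional to area."""
    return (big * big) / (small * small)


if __name__ == "__main__":
    print(covered_fraction(56, kernel=1, stride=2))             # 1x1, stride 2
    print(covered_fraction(56, kernel=3, stride=2, padding=1))  # 3x3, stride 2
    print(output_size(56, 1, 2), output_size(56, 3, 2, 1))      # same output size
    print(round(kernel_cost_ratio(7, 3), 1))                    # 49 / 9
```

The 1×1 stride-2 conv reads only a quarter of the positions, while the padded 3×3 stride-2 conv reads all of them and produces the same output size.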

Three ResNet tweaks. ResNet-B modifies the downsampling block of Resnet. ResNet-C further modifies the input stem. On top of that, ResNet-D again modifies the downsampling block.

Compare ResNet-50 with three model tweaks on model size, FLOPs and ImageNet validation accuracy.

3. Training refinements

3.1 Cosine Learning Rate Decay

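The usual learning rate decay strategy has been "step decay", where the learning rate is multiplied by a constant factor only once every fixed number of epochs. With cosine decay, the learning rate instead decreases continuously as training progresses:

$$\eta_t = \frac{1}{2}\left(1 + \cos\left(\frac{t\pi}{T}\right)\right)\eta$$

where $\eta$ is the initial learning rate, $t$ is the current batch, and $T$ is the total number of batches (ignoring warmup).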

Visualization of learning rate schedules with warm-up. Top: cosine and step schedules for batch size 1024. Bottom: Top-1 validation accuracy curve with regard to the two schedules.
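For comparison, here is a minimal sketch of the two schedules in plain Python. The step schedule's defaults (multiply by 0.1 every 30 epochs), the 120-epoch horizon in the demo, and the function names are assumptions for illustration, and warmup is left out.

```python
import math


def cosine_lr(t, total_steps, base_lr):
    """Cosine decay: 0.5 * (1 + cos(pi * t / T)) * base_lr, for t in [0, T]."""
    return 0.5 * (1.0 + math.cos(math.pi * t / total_steps)) * base_lr


def step_lr(epoch, base_lr, step_size=30, factor=0.1):
    """Step decay: multiply by `factor` once every `step_size` epochs."""
    return base_lr * factor ** (epoch // step_size)


if __name__ == "__main__":
    for epoch in (0, 30, 60, 90, 120):
        print(epoch, step_lr(epoch, 0.4), cosine_lr(epoch, 120, 0.4))
```

The cosine schedule starts at the initial rate, passes through half of it at the midpoint, and reaches 0 at the end, whereas the step schedule stays flat between drops.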

3.2 Label Smoothing

3.3 Knowledge Distillation

3.4 Mixup Training

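In mixup, two examples $(x_i, y_i)$ and $(x_j, y_j)$ are randomly sampled each time, and a new example is formed by a weighted linear interpolation of the two and used for training:

$$\hat{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \hat{y} = \lambda y_i + (1 - \lambda) y_j$$

where $\lambda \in [0, 1]$ is a random number drawn from the $\mathrm{Beta}(\alpha, \alpha)$ distribution.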
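A minimal sketch of the interpolation in plain Python, using flat lists for the inputs and one-hot labels. The function name, the toy data, and the default α = 0.2 are assumptions for illustration; `random.betavariate` from the standard library draws the mixing coefficient from a Beta distribution.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples with a coefficient lam drawn from Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    x, y, lam = mixup([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], rng=rng)
    print(lam, x, y)
```

Because the labels are blended with the same coefficient as the inputs, a pair of one-hot labels becomes a soft label that still sums to 1.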

3.5 Experiment Results

The validation accuracies on ImageNet for stacking training refinements one by one. We repeat each refinement on ResNet-50-D for 4 times with different initialization, and report the mean and standard deviation in the table.
