Training a model on Google Colab Pro and optimizing it with Distiller

Google Colab Pro
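Colab Pro basics
Downloading Distiller and setting up the environment (in Colab)
What Distiller provides, and training on your data with its built-in deep learning models
Uploading the dataset
Model training
Model evaluation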

Google Colab Pro

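Because the GPU and memory of my existing machine were not up to the task, I tried both Alibaba Cloud and Google Colab. Alibaba Cloud offers plenty of compute but is fairly expensive, so in the end I chose Colab and, through a US VPN, paid USD 9.9 for one month of the Pro version. It was well worth the money, both for uploading files to the cloud by dragging them into Google Drive and for the GPU and TPU compute. (This is not an ad.)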

Colab Pro basics

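In Colab you add code cell by cell (Linux shell commands need an exclamation mark ! in front). Ctrl+Enter runs the current cell, and its output is kept and displayed in the output area below it.
The small folder icon in the left sidebar opens the file browser, which links to Google Drive, so files can be uploaded and downloaded at any time. To avoid losing files, save whatever you run to the Drive folder, that is, to cloud storage. With Pro, I was given 32 GB of RAM and 200 GB of disk space, along with priority access to GPUs and TPUs, which was enough for essentially everything I needed.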
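Besides clicking the folder icon, Drive can also be mounted from a code cell. The following is a minimal sketch, assuming the standard google.colab package that Colab runtimes provide; the helper names in_colab and mount_drive are my own and not part of any library. Outside Colab the helper simply does nothing, so the cell stays runnable anywhere.

```python
def in_colab():
    """Return True when the google.colab package is importable."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
    except ImportError:
        return False
    return True


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    if not in_colab():
        return None
    from google.colab import drive
    drive.mount(mount_point)  # asks for authorization on first use
    return mount_point


mounted_at = mount_drive()
print("Drive mounted at:", mounted_at)
```

Once mounted, the contents of Drive appear as ordinary files under the mount point, which is what the cd commands in the next section rely on.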

Downloading Distiller and setting up the environment (in Colab)

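Distiller is a very handy framework for analyzing and optimizing deep learning models. It provides analysis and optimization tools covering many aspects of a network, and it ships with most of the commonly used network models. This article gives a brief walkthrough of using it to train squeezenet1_1 on tiny_imagenet_200, and of optimizing the model in two directions: sensitivity-based optimization and the AGP (Automated Gradual Pruning) method provided by Distiller.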

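First, clone the Distiller code with git into your own Google Drive folder. With Drive mounted, it appears under drive/MyDrive (older Colab sessions showed this folder as "My Drive"):

cd drive

cd MyDrive

Change into Drive before cloning. Do not clone directly into the runtime's root directory, because everything there is deleted when the runtime is recycled.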

!git clone https://github.com/NervanaSystems/distiller.git

Next, install virtualenv with pip3:

!pip3 install virtualenv

virtualenv is needed to create the virtual environment in the next step.

Once virtualenv is installed, create a virtual environment named env with:

!python3 -m virtualenv env

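Activate the virtual environment named env:

!source env/bin/activate

Note that each ! command in Colab runs in its own subshell, so this activation does not carry over to later cells. Unless the activation and the later commands are chained on one line, the pip3 install further down goes into Colab's default Python environment.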
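The activation step above is worth a closer look, because notebook ! commands each start a fresh shell. The following is a small sketch of that behavior using only the Python standard library; the variable name DISTILLER_SUBSHELL_DEMO and the helper run_in_fresh_shell are made up for illustration, and the sketch assumes a POSIX sh is available, as it is on Colab.

```python
import subprocess


def run_in_fresh_shell(command):
    """Run a command in a new sh process and return its trimmed output."""
    result = subprocess.run(
        ["sh", "-c", command], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# The variable is set and read inside the same shell, so it is visible.
same_shell = run_in_fresh_shell(
    "export DISTILLER_SUBSHELL_DEMO=1; echo ${DISTILLER_SUBSHELL_DEMO:-unset}"
)

# A second, separate shell does not inherit what the first one exported.
next_shell = run_in_fresh_shell("echo ${DISTILLER_SUBSHELL_DEMO:-unset}")

print("same shell:", same_shell)
print("next shell:", next_shell)
```

Activating a virtual environment works by changing shell variables such as PATH, so it is lost in the same way once the shell that ran source exits.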

Change into the distiller directory:

cd distiller

Install Distiller and its dependencies in editable mode (this must be run from inside the distiller directory):

!pip3 install -e .

If the install reports an error, run it once more; that was enough to make it succeed for me.
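To confirm that the install actually took effect, one option is to ask Python whether the packages can be found. This is only a sketch using the standard library; is_installed is a hypothetical helper name, and checking torch alongside distiller is my own choice, based on Distiller being built on PyTorch.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


for name in ("distiller", "torch"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If distiller shows up as missing right after the install, rerunning the install cell as described above is the first thing to try.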

What Distiller provides, and training on your data with its built-in deep learning models

With the environment set up, change into examples/classifier_compression/ (this is where Distiller's main functionality is run):

cd examples/classifier_compression/

List the options that Distiller provides:

!python compress_classifier.py -h

The output below shows the usage syntax:

usage: compress_classifier.py [-h] [--arch ARCH] [-j N] [--epochs N] [-b N]
[--lr LR] [--momentum M] [--weight-decay W]
[--print-freq N] [--verbose]
[--resume-from PATH | --exp-load-weights-from PATH]
[--pretrained] [--reset-optimizer] [-e]
[--activation-stats PHASE [PHASE ...]]
[--activation-histograms PORTION_OF_TEST_SET]
[--masks-sparsity] [--param-hist]
[--summary {sparsity,compute,model,modules,png,png_w_params}]
[--export-onnx [EXPORT_ONNX]]
[--compress [COMPRESS]]
[--sense {element,filter,channel}]
[--sense-range SENSITIVITY_RANGE SENSITIVITY_RANGE SENSITIVITY_RANGE]
[--deterministic] [--seed SEED] [--gpus DEV_ID]
[--cpu] [--name NAME] [--out-dir OUTPUT_DIR]
[--validation-split VALIDATION_SPLIT]
[--effective-train-size EFFECTIVE_TRAIN_SIZE]
[--effective-valid-size EFFECTIVE_VALID_SIZE]
[--effective-test-size EFFECTIVE_TEST_SIZE]
[--confusion]
[--num-best-scores NUM_BEST_SCORES]
[--load-serialized] [--thinnify]
[--quantize-eval] [--qe-mode QE_MODE]
[--qe-mode-acts QE_MODE_ACTS]
[--qe-mode-wts QE_MODE_WTS]
[--qe-bits-acts NUM_BITS]
[--qe-bits-wts NUM_BITS]
[--qe-bits-accum NUM_BITS]
[--qe-clip-acts QE_CLIP_ACTS]
[--qe-clip-n-stds QE_CLIP_N_STDS]
[--qe-no-clip-layers LAYER_NAME [LAYER_NAME ...]]
[--qe-no-quant-layers LAYER_NAME [LAYER_NAME ...]]
[--qe-per-channel]
[--qe-scale-approx-bits NUM_BITS]
[--qe-save-fp-weights] [--qe-convert-pytorch]
[--qe-pytorch-backend {fbgemm,qnnpack}]
[--qe-stats-file PATH | --qe-dynamic | --qe-calibration PORTION_OF_TEST_SET | --qe-config-file PATH]
[--qe-lapq] [--lapq-maxiter LAPQ_MAXITER]
[--lapq-maxfev LAPQ_MAXFEV]
[--lapq-method LAPQ_METHOD]
[--lapq-basinhopping]
[--lapq-basinhopping-niter LAPQ_BASINHOPPING_NITER]
[--lapq-init-mode LAPQ_INIT_MODE]
[--lapq-init-method LAPQ_INIT_METHOD]
[--lapq-eval-size LAPQ_EVAL_SIZE]
[--lapq-eval-memoize-dataloader]
[--lapq-search-clipping]
[--save-untrained-model]
[--earlyexit_lossweights [EARLYEXIT_LOSSWEIGHTS [EARLYEXIT_LOSSWEIGHTS ...]]]
[--earlyexit_thresholds [EARLYEXIT_THRESHOLDS [EARLYEXIT_THRESHOLDS ...]]]
[--kd-teacher ARCH] [--kd-pretrained]
[--kd-resume PATH] [--kd-temperature TEMP]
[--kd-distill-wt WEIGHT]
[--kd-student-wt WEIGHT]
[--kd-teacher-wt WEIGHT]
[--kd-start-epoch EPOCH_NUM] [--greedy]
[--greedy-ft-epochs GREEDY_FT_EPOCHS]
[--greedy-target-density GREEDY_TARGET_DENSITY]
[--greedy-pruning-step GREEDY_PRUNING_STEP]
[--greedy-finetuning-policy {constant,linear-grow}]
DATASET_DIR

Below are the network models that can be used directly in Distiller. As you can see, the list covers essentially all of the commonly used models.

optional arguments:
-h, --help show this help message and exit
--arch ARCH, -a ARCH model architecture: alexnet | alexnet_bn | bninception
| cafferesnet101 | densenet121 | densenet161 |
densenet169 | densenet201 | dpn107 | dpn131 | dpn68 |
dpn68b | dpn92 | dpn98 | fbresnet152 | googlenet |
inception_v3 | inceptionresnetv2 | inceptionv3 |
inceptionv4 | mnasnet0_5 | mnasnet0_75 | mnasnet1_0 |
mnasnet1_3 | mobilenet | mobilenet_025 | mobilenet_050
| mobilenet_075 | mobilenet_v1_dropout | mobilenet_v2
| nasnetalarge | nasnetamobile | plain20_cifar |
plain20_cifar_nobn | pnasnet5large | polynet |
preact_resnet101 | preact_resnet110_cifar |
preact_resnet110_cifar_conv_ds | preact_resnet152 |
preact_resnet18 | preact_resnet20_cifar |
preact_resnet20_cifar_conv_ds | preact_resnet32_cifar
| preact_resnet32_cifar_conv_ds | preact_resnet34 |
preact_resnet44_cifar | preact_resnet44_cifar_conv_ds
| preact_resnet50 | preact_resnet56_cifar |
preact_resnet56_cifar_conv_ds | resnet101 |
resnet110_cifar_earlyexit | resnet1202_cifar_earlyexit
| resnet152 | resnet18 | resnet20_cifar |
resnet20_cifar_earlyexit | resnet32_cifar |
resnet32_cifar_earlyexit | resnet34 | resnet44_cifar |
resnet44_cifar_earlyexit | resnet50 |
resnet50_earlyexit | resnet56_cifar |
resnet56_cifar_earlyexit | resnext101_32x4d |
resnext101_32x8d | resnext101_64x4d | resnext50_32x4d
| se_resnet101 | se_resnet152 | se_resnet50 |
se_resnext101_32x4d | se_resnext50_32x4d | senet154 |
shufflenet_v2_x0_5 | shufflenet_v2_x1_0 |
shufflenet_v2_x1_5 | shufflenet_v2_x2_0 |
simplenet_cifar | simplenet_mnist | simplenet_v2_mnist
| squeezenet1_0 | squeezenet1_1 | vgg11 | vgg11_bn |
vgg11_bn_cifar | vgg11_cifar | vgg13 | vgg13_bn |
vgg13_bn_cifar | vgg13_cifar | vgg16 | vgg16_bn |
vgg16_bn_cifar | vgg16_cifar | vgg19 | vgg19_bn |
vgg19_bn_cifar | vgg19_cifar | wide_resnet101_2 |
wide_resnet50_2 | xception (default: resnet18)
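With a list this long, it can be handy to check a model name programmatically before starting a run. The sketch below parses the "model architecture:" section of the help text into a Python list. The function name parse_arch_choices is hypothetical, and the sample string is a shortened excerpt of the output above, used only to keep the example small.

```python
def parse_arch_choices(help_text):
    """Extract the '|'-separated architecture names from the help text."""
    marker = "model architecture:"
    start = help_text.index(marker) + len(marker)
    end = help_text.index("(default:", start)
    # Collapse the line breaks, then split on the '|' separators.
    flat = " ".join(help_text[start:end].split())
    return [name.strip() for name in flat.split("|") if name.strip()]


# Shortened excerpt of the help output shown above (not the full list).
sample = """--arch ARCH, -a ARCH model architecture: alexnet | alexnet_bn | bninception
| squeezenet1_0 | squeezenet1_1 | vgg11 |
wide_resnet50_2 | xception (default: resnet18)"""

choices = parse_arch_choices(sample)
print("squeezenet1_1" in choices)
print(len(choices), "architectures in this excerpt")
```

In a notebook, the full help text could be captured from the -h command and passed to the same function.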

Uploading the dataset

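I used the tiny_imagenet_200 dataset, which is available at
http://www.image-net.org/image/tiny/tiny-imagenet-200.zip
After downloading it yourself, you can drag it straight into your Drive folder (the small folder icon on the left).

It must be placed in the same directory as the distiller folder, so that the relative path ../../../tiny-imagenet-200 used in the training command resolves correctly.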
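One preparation step that may be needed, depending on how the data is loaded: the tiny-imagenet-200 archive stores all validation images in val/images and lists their labels in val/val_annotations.txt, whereas loaders that expect an ImageNet-style layout read one subfolder per class. The sketch below regroups the validation images accordingly. It rests on the assumption that Distiller reads the dataset through such a folder-per-class loader, which is not confirmed in this article; reorganize_val is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def reorganize_val(dataset_root):
    """Move val/images/<file> into val/<class id>/<file>; return the count moved."""
    val_dir = Path(dataset_root) / "val"
    images_dir = val_dir / "images"
    moved = 0
    with (val_dir / "val_annotations.txt").open() as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            filename, class_id = fields[0], fields[1]
            source = images_dir / filename
            if not source.exists():
                continue  # already moved in an earlier run
            target_dir = val_dir / class_id
            target_dir.mkdir(exist_ok=True)
            shutil.move(str(source), str(target_dir / filename))
            moved += 1
    # Remove the now-empty images folder so it is not mistaken for a class.
    if images_dir.is_dir() and not any(images_dir.iterdir()):
        images_dir.rmdir()
    return moved


root = Path("tiny-imagenet-200")  # adjust to where the dataset was unzipped
if (root / "val" / "val_annotations.txt").exists():
    print("moved", reorganize_val(root), "validation images")
```

The function is safe to run twice: files that have already been moved are skipped.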

Model training

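Train with Distiller's built-in squeezenet1_1. The flags used here are:
--lr: learning rate
-p: print frequency
-j: number of data-loading workers (I used values from 1 to 4)
--epochs: number of training epochs
You can also replace squeezenet1_1 with your own network model file, or with any of the built-in networks listed above, such as resnet101.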

!python3 compress_classifier.py -a squeezenet1_1 ../../../tiny-imagenet-200 -p 10 -j=1  --lr=0.02 --epochs 10

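Running it shows that training is fast: one epoch, including the time for the results to be reported back, took a few tens of seconds in total.

The analysis data for the model is also saved to the log files.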
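When trying several learning rates or epoch counts, the training command above can also be assembled in Python instead of being edited by hand. This is a sketch only: build_train_command is a hypothetical helper, and it uses just the flags that appear in the command above (-a, -p, -j, --lr, --epochs).

```python
import shlex


def build_train_command(arch, data_dir, lr, epochs, print_freq=10, workers=1):
    """Return the training command as an argument list."""
    return [
        "python3", "compress_classifier.py",
        "-a", arch,
        data_dir,
        "-p", str(print_freq),
        "-j", str(workers),
        "--lr", str(lr),
        "--epochs", str(epochs),
    ]


command = build_train_command(
    "squeezenet1_1", "../../../tiny-imagenet-200", lr=0.02, epochs=10
)
# shlex.join (Python 3.8+) renders the list as a single shell-safe string.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, or the printed string can be pasted after a ! in a Colab cell.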

Model evaluation

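Training writes its output to a log directory. For evaluation, be sure to pass the checkpoint.pth.tar file from that log directory:
--resume-from ../../examples/classifier_compression/logs/2020.10.28-122858/checkpoint.pth.tar
This is where my checkpoint ended up after training; the path of your own checkpoint is shown in the output at the end of the training run above.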

!python3 compress_classifier.py -a squeezenet1_1 ../../../tiny-imagenet-200 -p 10 --resume-from ../../examples/classifier_compression/logs/2020.10.28-122858/checkpoint.pth.tar --evaluate

Run it and the evaluation results are printed.
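To avoid copying the timestamped checkpoint path by hand, the most recent one can be looked up in the logs folder. The sketch below assumes run folders are named by timestamp, as in the 2020.10.28-122858 example above, so that sorting by name also sorts by time; runs started with a custom name would need a different ordering. latest_checkpoint is a hypothetical helper name.

```python
from pathlib import Path


def latest_checkpoint(logs_dir="logs", filename="checkpoint.pth.tar"):
    """Return the checkpoint from the last run folder by name, or None."""
    candidates = sorted(Path(logs_dir).glob(f"*/{filename}"))
    return candidates[-1] if candidates else None


checkpoint = latest_checkpoint("logs")
print("latest checkpoint:", checkpoint)
```

The returned path can then be passed to --resume-from in the evaluation command.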