一、apex

What it is: NVIDIA's mixed-precision training library

What it is for: speeding up training on the GPU

GitHub: https://github.com/NVIDIA/apex

API docs: https://nvidia.github.io/apex

Requirements:

Python 3
CUDA 9 or newer
PyTorch 0.4 or newer. The CUDA and C++ extensions require PyTorch 1.0 or newer. The latest released version is recommended, see https://pytorch.org/. Apex is also tested against the latest main branch, obtainable from https://github.com/pytorch/pytorch.

It is often convenient to use Apex in a Docker container. Compatible options include:

NVIDIA PyTorch containers from NGC, which come with Apex preinstalled. To use the latest Amp API, you may need to pip uninstall apex then reinstall Apex using the Quick Start commands below.
Official PyTorch -devel Dockerfiles, e.g. docker pull pytorch/pytorch:nightly-devel-cuda10.0-cudnn7, in which you can install Apex using the Quick Start commands.

How to install:

Linux:

For performance and full functionality, install Apex with the CUDA and C++ extensions:

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Apex also supports a Python-only build (required with PyTorch 0.4) via:

$ pip install -v --no-cache-dir ./
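Whichever route succeeds, it is worth checking afterwards whether the compiled extensions actually made it in. A minimal sketch: `amp_C` is the compiled module apex's own fallback warning names when the `--cuda_ext` build is missing, so trying to import it distinguishes a full build from a Python-only one.

```python
def apex_cuda_ext_available():
    """Return True if apex's compiled CUDA extension module imports cleanly."""
    try:
        import amp_C  # built only when apex is installed with --cuda_ext
        return True
    except ImportError:
        return False

print(apex_cuda_ext_available())
```

If this prints False, amp still works, but fused kernels fall back to slower Python implementations.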

Windows:

Windows support is experimental; Linux is recommended. If you can build PyTorch from source on your system, pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . may work. pip install -v --no-cache-dir . (without the CUDA/C++ extensions) is more likely to succeed. If you installed PyTorch in a Conda environment, make sure to install Apex in that same environment.

Related link: https://github.com/NVIDIA/apex/tree/master/examples/docker

How to use after installation: see the docs at https://nvidia.github.io/apex/amp.html

Example:

from apex import amp

# Declare model and optimizer as usual, with default (FP32) precision
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Allow Amp to perform casts as required by the opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

二、My installation process:

1. $ git clone https://github.com/NVIDIA/apex — done
2. $ cd apex — done
3. $ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Step 3 failed with an error; many people have asked about this in the issues:

Cleaning up...Removing source in /tmp/pip-req-build-v0deounv
Removed build tracker '/tmp/pip-req-tracker-3n3fyj4o'
ERROR: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"'; __file__='"'"'/tmp/p
ip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' --cpp_e
xt --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.
Exception information:
Traceback (most recent call last):
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 153, in _main
    status = self.run(options, args)
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 455, in run
    use_user_site=options.use_user_site,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 62, in install_given_reqs
    **kwargs
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 888, in install
    cwd=self.unpacked_source_directory,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 275, in runner
    spinner=spinner,
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/pip/_internal/utils/subprocess.py", line 242, in call_subprocess
    raise InstallationError(exc_msg)
pip._internal.exceptions.InstallationError: Command errored out with exit status 1: /users4/zsun/anaconda3/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-v0deoun
v/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-v0deounv/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code
, __file__, '"'"'exec'"'"'))' --cpp_ext --cuda_ext install --record /tmp/pip-record-rce1cb4d/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

4. So I used

$ pip install -v --no-cache-dir ./

instead. The output was:

copying build/lib/apex/RNN/RNNBackend.py -> build/bdist.linux-x86_64/wheel/apex/RNN
copying build/lib/apex/RNN/cells.py -> build/bdist.linux-x86_64/wheel/apex/RNN
creating build/bdist.linux-x86_64/wheel/apex/normalization
copying build/lib/apex/normalization/__init__.py -> build/bdist.linux-x86_64/wheel/apex/normalization
copying build/lib/apex/normalization/fused_layer_norm.py -> build/bdist.linux-x86_64/wheel/apex/normalization
running install_egg_info
running egg_info
creating apex.egg-info
writing apex.egg-info/PKG-INFO
writing dependency_links to apex.egg-info/dependency_links.txt
writing top-level names to apex.egg-info/top_level.txt
writing manifest file 'apex.egg-info/SOURCES.txt'
reading manifest file 'apex.egg-info/SOURCES.txt'
writing manifest file 'apex.egg-info/SOURCES.txt'
Copying apex.egg-info to build/bdist.linux-x86_64/wheel/apex-0.1-py3.6.egg-info
running install_scripts
creating build/bdist.linux-x86_64/wheel/apex-0.1.dist-info/WHEEL
done
Created wheel for apex: filename=apex-0.1-cp36-none-any.whl size=136906 sha256=55830f559061fcb30ed616dd6879086c9b79926c3d3e0017a2dcf6c0e1aa8037
Stored in directory: /tmp/pip-ephem-wheel-cache-m4cxipvx/wheels/6c/91/1a/143cfe0f99d10c8c415d1594024d1de93c5f8c03f5edfad2ba
Removing source in /tmp/pip-req-build-yg8bljf6
Successfully built apex
Installing collected packages: apex
Found existing installation: apex 0.1
Uninstalling apex-0.1:
Created temporary directory: /users4/zsun/anaconda3/lib/python3.6/site-packages/~pex-0.1.dist-info
Removing file or directory /users4/zsun/anaconda3/lib/python3.6/site-packages/apex-0.1.dist-info/
Created temporary directory: /users4/zsun/anaconda3/lib/python3.6/site-packages/~pex
Removing file or directory /users4/zsun/anaconda3/lib/python3.6/site-packages/apex/
Successfully uninstalled apex-0.1
Successfully installed apex-0.1
Cleaning up...
Removed build tracker '/tmp/pip-req-tracker-nm6wywoj'

Although the run ended with the same error output as before, [Successfully installed apex-0.1] appeared partway through.

Tentatively treating that as a successful install, I continued.

5. Following the example above and the one below, I modified my code. It is simple, with only a few changes.

if args.apex:
    from apex import amp

# Declare model and optimizer as usual, with default (FP32) precision
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Allow Amp to perform casts as required by the opt_level
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
...
# loss.backward() becomes:
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
...
# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...
amp.load_state_dict(checkpoint['amp'])
...

6. First run to check the effect; the output:

2019-11-27 14:42:14,362 INFO: Loading vocab,train and val dataset.Wait a second,please
#Params: 73.7M
Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Warning:  multi_tensor_applier fused unscale kernel is unavailable, possibly because apex was installed without --cuda_ext --cpp_ext. Using Python fallback.  Original ImportError was: ModuleNotFoundError("No module named 'amp_C'",)
Traceback (most recent call last):
  File "main.py", line 582, in <module>
    train()
  File "main.py", line 388, in train
    with amp.scale_loss(loss, optimizer) as scaled_loss:
  File "/users4/zsun/anaconda3/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/apex/amp/handle.py", line 111, in scale_loss
    optimizer._prepare_amp_backward()
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 219, in prepare_backward_no_master_weights
    self._amp_lazy_init()
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 309, in _amp_lazy_init
    self._lazy_init_maybe_master_weights()
  File "/users4/zsun/anaconda3/lib/python3.6/site-packages/apex/amp/_process_optimizer.py", line 210, in lazy_init_no_master_weights
    "Received {}".format(param.type()))
TypeError: Optimizer's parameters must be either torch.cuda.FloatTensor or torch.cuda.HalfTensor. Received torch.FloatTensor

This shows that installing apex without the C/CUDA extensions in the earlier steps does cause some issues: the fused unscale kernel is unavailable (this program does not use it). But that is a warning, not an error, so after fixing the actual error we ran again, and the run succeeded.

(The error above occurred because my optimizer was built over CPU tensors, while apex requires them on the GPU; it is unrelated to this step.)
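The fix is purely about ordering: move the model to the GPU before constructing the optimizer, so the optimizer holds torch.cuda.FloatTensor parameters by the time amp.initialize sees them. A minimal sketch (the dependencies are passed in as parameters only so the snippet stands alone; in real code you would pass your model, a torch.optim constructor, and apex's amp.initialize):

```python
def build_for_amp(model, make_optimizer, amp_initialize, device="cuda"):
    """Move the model to the GPU *before* building the optimizer, so the
    optimizer's params are torch.cuda.FloatTensor when amp.initialize runs."""
    model = model.to(device)                        # 1. model on the GPU first
    optimizer = make_optimizer(model.parameters())  # 2. optimizer over CUDA params
    return amp_initialize(model, optimizer, opt_level="O1")  # 3. amp.initialize last

# Usage (hypothetical names):
#   model, optimizer = build_for_amp(
#       MyModel(),
#       lambda params: torch.optim.SGD(params, lr=1e-3),
#       amp.initialize)
```

Building the optimizer first and calling .cuda() afterwards is exactly what produces the "Received torch.FloatTensor" TypeError above.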

Before apex:
- batch=8: <3.5h per eval, three evals per epoch, 4911 / 12196 MB | zsun(4901M)
- batch=10: <2.8h per eval, three evals per epoch, 9723 / 12196 MB | zsun(9713M)
- batch=4: <7.25h per eval, three evals per epoch, 9863 / 16280 MB | zsun(9853M) (a different program, without the CVAE)
- batch=16: runs out of memory after a while

With apex:
- batch=16: <1.5h per eval, three evals per epoch, 11517 / 12196 MB | zsun(11507M), no more OOM

7. However, model quality degraded, and this warning appeared:

2019-11-28 06:44:23,244 INFO: eval
2019-11-28 06:49:59,922 INFO: Epoch:  4 fmax: 0.246901 cur_max_f: 0.183399
2019-11-28 06:49:59,923 INFO: Epoch:  4 Min_Val_Loss: 0.318247 Cur_Val_Loss: 0.427435
2019-11-28 06:49:59,930 INFO:   [0] Cur_fmax: 0.183399 Cur_bound: 0.139370
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125
2019-11-28 08:17:46,401 INFO: eval
2019-11-28 08:23:15,293 INFO: Epoch:  4 fmax: 0.246901 cur_max_f: 0.215748
2019-11-28 08:23:15,294 INFO: Epoch:  4 Min_Val_Loss: 0.318247 Cur_Val_Loss: 0.474844
2019-11-28 08:23:15,304 INFO:   [0] Cur_fmax: 0.215748 Cur_bound: 0.016865
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.000244140625

whereas previously my program produced:

2019-11-28 00:03:36,134 INFO: eval
2019-11-28 00:06:15,011 INFO: Epoch:  5 fmax: 0.437465 cur_max_f: 0.437465
2019-11-28 00:06:15,012 INFO: Epoch:  5 Min_Val_Loss: 0.251108 Cur_Val_Loss: 0.251108
2019-11-28 00:06:15,019 INFO:   [0] Cur_fmax: 0.437465 Cur_bound: 0.231224

The loss is larger overall and very unstable; results are worse. And what does this warning mean?

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 0.00048828125

It means the gradient overflowed. Many people have raised this in the issues; the author appears to still be collecting cases where it occurs, and it is not yet resolved.
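A few "Gradient overflow. Skipping step" messages early in training are normal dynamic-loss-scaling behavior; a scale that keeps shrinking, as here, is not. Given the amp.initialize parameters documented below, one hedge worth trying is overriding the loss scale, either with a fixed value or with a floor for dynamic scaling. A small helper that just builds the override kwargs (a sketch, not a guaranteed fix for the accuracy drop):

```python
def amp_overrides(static_scale=None, floor=None):
    """Keyword overrides for amp.initialize when dynamic loss scaling
    keeps shrinking due to repeated overflow."""
    kwargs = {"opt_level": "O1"}
    if static_scale is not None:
        kwargs["loss_scale"] = static_scale   # fixed scale, no dynamic adjustment
    elif floor is not None:
        kwargs["loss_scale"] = "dynamic"
        kwargs["min_loss_scale"] = floor      # dynamic, but never below this floor
    return kwargs

# e.g.: model, optimizer = amp.initialize(model, optimizer,
#                                         **amp_overrides(static_scale=128.0))
```

If even a modest static scale still overflows, the overflow is likely coming from the model itself (e.g. an op that should stay in FP32), not from the scaler.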

三、Other notes (all from the official documentation)

3.1. apex.amp

Note ⚠️:

Currently, the underlying properties that govern pure or mixed precision training are:

cast_model_type: casts the model's parameters and buffers to the desired type.
patch_torch_functions: patches all Torch functions and Tensor methods to perform Tensor-Core-friendly ops, like GEMMs and convolutions, in FP16, and any ops that benefit from FP32 precision in FP32.
keep_batchnorm_fp32: to improve accuracy and enable cudnn batchnorm (which improves performance), it is often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
master_weights: maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
loss_scale: if loss_scale is a float, use that value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adjust the loss scale adaptively over time. Dynamic loss scale adjustments are performed by Amp automatically.

Again, you usually do not need to specify these properties by hand. Instead, select an opt_level, which will set them up for you. After selecting an opt_level, you can optionally pass property kwargs as manual overrides. If you attempt an override that makes no sense for the chosen opt_level, Amp will raise an error with an explanation. For example, selecting opt_level="O1" combined with the override master_weights=True makes no sense: O1 inserts casts around Torch functions rather than model weights. Data, activations, and weights are recast on the fly as they flow through the patched functions, so the model weights themselves can (and should) remain FP32, and there is no need to maintain separate FP32 master weights.

opt_levels: recognized opt_levels are "O0", "O1", "O2", and "O3". O0 and O3 are not true mixed precision, but they are useful for establishing accuracy and speed baselines, respectively. O1 and O2 are different implementations of mixed precision. Try both, and see which gives the best speedup and accuracy for your model.

O0: FP32 training
Your incoming model should already be FP32, so this is likely a no-op. O0 can be useful for establishing an accuracy baseline. Default properties set by O0:
cast_model_type = torch.float32
patch_torch_functions = False
keep_batchnorm_fp32 = None (effectively "not applicable", everything is FP32)
master_weights = False
loss_scale = 1.0

O1: Mixed Precision (recommended for typical use)
Patches all Torch functions and Tensor methods to cast their inputs according to a whitelist-blacklist model. Whitelisted ops (e.g. Tensor-Core-friendly ops like GEMMs and convolutions) are performed in FP16. Blacklisted ops that benefit from FP32 precision (e.g. softmax) are performed in FP32. O1 also uses dynamic loss scaling, unless overridden. Default properties set by O1:
cast_model_type=None (not applicable)
patch_torch_functions=True
keep_batchnorm_fp32=None (again, not applicable, all model weights remain FP32)
master_weights=None (not applicable, model weights remain FP32)
loss_scale="dynamic"

O2: "Almost FP16" Mixed Precision
O2 casts the model weights to FP16, patches the model's forward method to cast input data to FP16, keeps batchnorms in FP32, maintains FP32 master weights, updates the optimizer's param_groups so that optimizer.step() acts directly on the FP32 weights (followed by an FP32 master weight -> FP16 model weight copy, if necessary), and implements dynamic loss scaling (unless overridden). Unlike O1, O2 does not patch Torch functions or Tensor methods. Default properties set by O2:
cast_model_type=torch.float16
patch_torch_functions=False
keep_batchnorm_fp32=True
master_weights=True
loss_scale="dynamic"

O3: FP16 training
O3 may not achieve the stability of the true mixed precision options O1 and O2. However, it can be useful for establishing a speed baseline for your model, against which the performance of O1 and O2 can be compared. If your model uses batch normalization, to establish "speed of light" you can try O3 with the additional property override keep_batchnorm_fp32=True (which, as noted above, enables cudnn batchnorm). Default properties set by O3:
cast_model_type=torch.float16
patch_torch_functions=False
keep_batchnorm_fp32=False
master_weights=False
loss_scale=1.0
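The four default tables above can be collapsed into one lookup for quick reference (a sketch; dtypes are written as strings so the snippet runs without torch installed):

```python
# Amp opt_level defaults, as documented above.
OPT_LEVEL_DEFAULTS = {
    "O0": dict(cast_model_type="torch.float32", patch_torch_functions=False,
               keep_batchnorm_fp32=None, master_weights=False, loss_scale=1.0),
    "O1": dict(cast_model_type=None, patch_torch_functions=True,
               keep_batchnorm_fp32=None, master_weights=None, loss_scale="dynamic"),
    "O2": dict(cast_model_type="torch.float16", patch_torch_functions=False,
               keep_batchnorm_fp32=True, master_weights=True, loss_scale="dynamic"),
    "O3": dict(cast_model_type="torch.float16", patch_torch_functions=False,
               keep_batchnorm_fp32=False, master_weights=False, loss_scale=1.0),
}
```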

Note ⚠️: amp.initialize should be called after you have finished constructing your model(s) and optimizer(s), but before you send your model through any DistributedDataParallel wrapper. Currently, amp.initialize should only be called once.

Parameters:

models (torch.nn.Module or list of torch.nn.Modules) – Models to modify/cast.
optimizers (optional, torch.optim.Optimizer or list of torch.optim.Optimizers) – Optimizers to modify/cast. REQUIRED for training, optional for inference.
enabled (bool, optional, default=True) – If False, renders all Amp calls no-ops, so your script should run as if Amp were not present.
opt_level (str, optional, default="O1") – Pure or mixed precision optimization level. Accepted values are "O0", "O1", "O2", and "O3", explained in detail above.
cast_model_type (torch.dtype, optional, default=None) – Optional property override, see above.
patch_torch_functions (bool, optional, default=None) – Optional property override.
keep_batchnorm_fp32 (bool or str, optional, default=None) – Optional property override. If passed as a string, must be the string "True" or "False".
master_weights (bool, optional, default=None) – Optional property override.
loss_scale (float or str, optional, default=None) – Optional property override. If passed as a string, must be a string representing a number, e.g., "128.0", or the string "dynamic".
cast_model_outputs (torch.dtype, optional, default=None) – Option to ensure that the outputs of your model(s) are always cast to a particular type regardless of opt_level.
num_losses (int, optional, default=1) – Option to tell Amp in advance how many losses/backward passes you plan to use. When used in conjunction with the loss_id argument to amp.scale_loss, enables Amp to use a different loss scale per loss/backward pass, which can improve stability. See "Multiple models/optimizers/losses" under Advanced Amp Usage for examples. If num_losses is left to 1, Amp will still support multiple losses/backward passes, but use a single global loss scale for all of them.
verbosity (int, default=1) – Set to 0 to suppress Amp-related output.
min_loss_scale (float, default=None) – Sets a floor for the loss scale values that can be chosen by dynamic loss scaling. The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
max_loss_scale (float, default=2.**24) – Sets a ceiling for the loss scale values that can be chosen by dynamic loss scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.

Returns:
Model(s) and optimizer(s) modified according to the opt_level. If either the models or optimizers args were lists, the corresponding return value will also be a list.
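The num_losses / loss_id pairing described above looks like this in use (a sketch; it assumes amp.initialize was called with num_losses=2, and `backward_two_losses`, `loss_g`, and `loss_d` are illustrative names; `amp` is passed as a parameter only so the pattern is visible without apex installed):

```python
def backward_two_losses(loss_g, loss_d, optimizer, amp):
    """Two backward passes in one iteration, each with its own loss scaler.
    Requires amp.initialize(..., num_losses=2) beforehand."""
    with amp.scale_loss(loss_g, optimizer, loss_id=0) as scaled:
        scaled.backward(retain_graph=True)  # keep the graph for the second pass
    with amp.scale_loss(loss_d, optimizer, loss_id=1) as scaled:
        scaled.backward()
```

Per the docs, leaving num_losses at 1 also works, but then both backward passes share a single global loss scale.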

checkpoint

To save and load Amp training correctly, amp.state_dict() is provided, which contains all loss_scalers and their corresponding unskipped steps, along with amp.load_state_dict() to restore these attributes.

Note that we recommend restoring the model with the same opt_level, and calling the load_state_dict methods after amp.initialize.

...
# Save checkpoint
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'amp': amp.state_dict()
}
torch.save(checkpoint, 'amp_checkpoint.pt')
...
amp.load_state_dict(checkpoint['amp'])
...
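The load side, spelled out as a helper (a sketch; per the note above, call it after amp.initialize and with the same opt_level; `load` is torch.load in real code and is a parameter here only so the snippet stands alone):

```python
def restore_amp_checkpoint(path, model, optimizer, amp, load):
    """Restore model, optimizer, and Amp state from a checkpoint
    saved with the structure shown above."""
    checkpoint = load(path)
    model.load_state_dict(checkpoint['model'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    amp.load_state_dict(checkpoint['amp'])
    return checkpoint

# Usage: restore_amp_checkpoint('amp_checkpoint.pt', model, optimizer, amp, torch.load)
```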

Advanced use cases:

The unified Amp API supports gradient accumulation across iterations, multiple backward passes per iteration, multiple models/optimizers, custom/user-defined autograd functions, and custom data batch classes. Gradient clipping and GANs also require special treatment, but this treatment does not need to vary across the different opt_levels.
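For gradient accumulation specifically, the pattern is to scale each iteration's loss and step the optimizer only every N iterations (a sketch with illustrative names; apex's docs additionally describe a delay_unscale option for amp.scale_loss, not shown here):

```python
ACCUM_STEPS = 4  # assumed accumulation factor

def is_update_step(iteration, accum_steps=ACCUM_STEPS):
    """True on iterations where optimizer.step() should run."""
    return (iteration + 1) % accum_steps == 0

def train_loop(batches, model, optimizer, amp):
    for i, batch in enumerate(batches):
        loss = model(batch) / ACCUM_STEPS  # average the loss over accumulated steps
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()         # gradients accumulate across iterations
        if is_update_step(i):
            optimizer.step()
            optimizer.zero_grad()
```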

Transition guide for old API users:

We strongly encourage moving to the new Amp API: it is more versatile, easier to use, and future-proof. The original FP16_Optimizer and the old "Amp" API are deprecated and may be removed at any time.

Functions previously exposed through amp_handle are now accessible through the amp module. Any existing calls to amp_handle = amp.init() should be removed.

See the documentation for details; they are not repeated here.

3.2. apex.optimizers

To be continued.
