An Empirical Study of Binary Neural Networks’ Optimisation


(1) ADAM for optimising the objective, (2) not using early stopping, (3) splitting the training into two stages, (4) removing gradient and weight clipping in the first stage and (5) reducing the averaging rate in Batch Normalisation layers in the second stage.

Gradient clipping: a gradient that falls outside a given range is discarded.
Weight clipping: weight values are kept within a given range.

forward path (and at the end of the training):

STE with gradient clipping provides an estimate for the gradient of this operation:
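
In the usual formulation, which is assumed here because the notes do not spell it out, write $w$ for a full-precision proxy weight and $w_b$ for its binary version (both symbols are illustrative, as is the convention that zero maps to $+1$). The forward binarisation and the clipped STE gradient then read:

```latex
w_b = \operatorname{sign}(w) =
\begin{cases}
+1 & \text{if } w \ge 0 \\
-1 & \text{otherwise}
\end{cases}
\qquad\qquad
\frac{\partial w_b}{\partial w} \approx \mathbf{1}_{|w| \le 1}
```

Under this estimate the gradient of the loss with respect to $w_b$ is passed to $w$ unchanged where $|w| \le 1$ and is set to zero elsewhere.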

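How is the binary convolution kernel in figure (a) above obtained? It is produced by binarising a full-precision proxy with the sign function, which corresponds to the forward pass in the right-hand figure. And where does the full-precision proxy come from? It is learned through the STE, which corresponds to the backward pass in the right-hand figure.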
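
The forward and backward roles described above can be illustrated with a minimal pure-Python sketch. The function names, the threshold of 1, the mapping of zero to +1 and the plain gradient-descent update are assumptions made for illustration and are not taken from the paper.

```python
def binarise(w):
    """Forward pass: sign of each full-precision proxy weight.
    Assumption: zero is mapped to +1."""
    return [1.0 if wi >= 0 else -1.0 for wi in w]


def ste_backward(w, grad_wb, clip=True):
    """Backward pass: pass the gradient straight through the sign.
    With gradient clipping, it is zeroed wherever |w| > 1."""
    if not clip:
        return list(grad_wb)  # vanilla STE: identity gradient
    return [g if abs(wi) <= 1.0 else 0.0 for wi, g in zip(w, grad_wb)]


def update_proxy(w, grad_w, lr, clip_weights=True):
    """Plain gradient step on the proxy (assumed optimiser for brevity),
    optionally followed by weight clipping to [-1, 1]."""
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    if clip_weights:
        new_w = [max(-1.0, min(1.0, wi)) for wi in new_w]
    return new_w


w = [0.3, -1.4, 0.9]               # full-precision proxy
w_b = binarise(w)                  # binary kernel used in the forward pass
grad_wb = [0.5, 0.5, -2.0]         # made-up gradient w.r.t. the binary kernel
grad_w = ste_backward(w, grad_wb)  # second entry is dropped since |-1.4| > 1
w_new = update_proxy(w, grad_w, lr=0.1)
```

Only the proxy `w` is updated; the binary kernel is recomputed from it on every forward pass.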

3.1 Impact of Optimiser

A possible hypothesis is that the early stages of training binary models require more averaging for the optimiser to proceed in the presence of the binarisation operation. On the other hand, in the late stages of training, we rely on noisier sources to increase the exploration power of the optimiser.

Overall, ADAM has the advantage.
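
For reference, the standard ADAM update keeps exponential running averages of the gradient and of its square, and scales each step by them. The sketch below is a minimal pure-Python version; the function name `adam_step` and the default hyperparameters are the commonly used ones and are assumptions here, not values reported in these notes.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update for a list of weights.

    w, g: weights and their gradients
    m, v: running averages of the gradient and the squared gradient
    t:    1-based step counter, used for bias correction
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction compensates for m and v starting at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v


w, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
w, m, v = adam_step(w, [2.0, -0.5], m, v, t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly `lr` regardless of the gradient's magnitude.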

3.2 Impact of gradient and weight clipping

It is a well-known observation that training a binary model is often notably slower than training its non-binary counterpart.

The slowdown is mainly caused by the commonly applied gradient and weight clipping, as they keep parameters within the [-1, 1] range at all times during training.

Weight and gradient clipping do, however, help achieve better accuracy.

We tested this hypothesis by training a binary model in two stages: (1) using vanilla STE, with higher learning rates, in the first stage and (2) turning clipping back on once reducing the learning rate no longer improves accuracy.
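
One way to picture this two-stage procedure is as a small controller that watches validation accuracy. The sketch below is hypothetical: the class name `TwoStageSchedule`, the patience rule, the decay factor and the starting learning rate are illustrative assumptions, not settings from the paper.

```python
from dataclasses import dataclass


@dataclass
class TwoStageSchedule:
    lr: float = 1e-2          # assumed (higher) stage-1 learning rate
    decay: float = 0.1        # assumed learning-rate decay factor
    patience: int = 2         # epochs without improvement before acting
    clipping: bool = False    # stage 1: vanilla STE, no clipping
    best: float = 0.0
    stale: int = 0
    decayed_without_gain: bool = False

    def update(self, val_acc: float) -> None:
        """Call once per epoch with the current validation accuracy."""
        if val_acc > self.best:
            self.best = val_acc
            self.stale = 0
            self.decayed_without_gain = False
            return
        self.stale += 1
        if self.stale < self.patience:
            return
        self.stale = 0
        if not self.clipping and self.decayed_without_gain:
            # The last learning-rate reduction did not help: enter stage 2.
            self.clipping = True
        else:
            self.lr *= self.decay
            self.decayed_without_gain = True


schedule = TwoStageSchedule()
for acc in [0.5, 0.6, 0.6, 0.6, 0.6, 0.6]:
    schedule.update(acc)
```

A training loop would read `schedule.lr` and `schedule.clipping` at the start of each epoch to decide the learning rate and whether to apply gradient and weight clipping.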


