Caffe Source Code - BatchNorm Layer & Scale Layer

Overview

The computations given in the Batch Normalization paper:

Forward:

$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$

Backward:

$\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i} \cdot \gamma, \quad \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \hat{x}_i, \quad \frac{\partial L}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}$

$\frac{\partial L}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} (x_i - \mu_B) \left(-\frac{1}{2}\right) (\sigma_B^2 + \epsilon)^{-\frac{3}{2}}, \quad \frac{\partial L}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{1}{m}\sum_{i=1}^{m} -2(x_i - \mu_B)$

$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial L}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial L}{\partial \mu_B} \cdot \frac{1}{m}$

BatchNorm does two things:

  • [1] Normalize the input: $x_{norm} = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and variance computed from the mini-batch — this corresponds to the Caffe BatchNorm layer.
  • [2] Scale and shift the normalized value to get the output $y = \gamma \cdot x_{norm} + \beta$ — this corresponds to the Caffe Scale layer.

Setting bias_term: true in the Scale layer provides the $\beta$ term.
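
To make the split concrete, here is a minimal standalone C++ sketch (not Caffe code; the single-channel data and the gamma/beta values are made up for illustration) of the two stages:

#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const float eps = 1e-5f;
  std::vector<float> x = {1.0f, 2.0f, 3.0f, 4.0f};  // one channel, 4 values
  const float gamma = 1.5f, beta = 0.5f;            // Scale-layer parameters

  // Stage 1 (BatchNorm layer): x_norm = (x - mu) / sqrt(var + eps)
  float mu = 0.f, var = 0.f;
  for (float v : x) mu += v;
  mu /= x.size();
  for (float v : x) var += (v - mu) * (v - mu);
  var /= x.size();
  std::vector<float> x_norm(x.size());
  for (size_t i = 0; i < x.size(); ++i)
    x_norm[i] = (x[i] - mu) / std::sqrt(var + eps);

  // Stage 2 (Scale layer with bias_term: true): y = gamma * x_norm + beta
  for (size_t i = 0; i < x.size(); ++i)
    std::printf("y[%zu] = %f\n", i, gamma * x_norm[i] + beta);
  return 0;
}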

When training the Caffe BatchNorm layer, forward passes are run repeatedly on mini-batches drawn from the full training set, and every pass folds the newly computed mean and variance into the statistics accumulated so far.

  • Forward pass

    In the Caffe implementation, the per-batch mean and variance are not simply summed up. Instead, a factor (normally smaller than 1) gradually decays the contribution of the previously accumulated mean and variance before the current batch's values are added, i.e. a moving average.

    The procedure is as follows:

    $S_{t-1}$ - the moving-average mean accumulated up to the previous mini-batch;

    $Y_t$ - the mean computed from the current mini-batch;

    $\lambda$ - the moving-average factor, moving_average_fraction.

    In Forward (a standalone sketch of these three updates follows after this list):

    [F1] - moving-average scale factor: $s_{new} = \lambda s_{old} + 1$

    [F2] - mean: $\mu_{new} = \lambda \mu_{old} + \mu$

    [F3] - variance: $\sigma_{new} = \lambda \sigma_{old} + \frac{m}{m-1}\sigma$ when $m > 1$ (otherwise the factor is 1), where $m = N \cdot H \cdot W$ is the number of elements per channel and $\frac{m}{m-1}$ is the unbiased-variance correction.

    The Caffe BatchNorm source does not include the parameters $\gamma$ and $\beta$.

  • Backward pass

    Only the gradient with respect to the input is computed; there are no $\gamma$ and $\beta$ parameters here.

    Gradient with respect to the variance:

    $\frac{\partial L}{\partial \sigma} = \sum_{i=0}^{n} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \sigma} = \sum_{i=0}^{n} \frac{\partial L}{\partial y_i} \cdot (x_i - \mu)\left(-\frac{1}{2}\right)(\sigma + eps)^{-\frac{3}{2}}$

    Gradient with respect to the mean:

    $\frac{\partial L}{\partial \mu} = \sum_{i=0}^{n} \frac{\partial L}{\partial y_i} \cdot \frac{\partial y_i}{\partial \mu} = \sum_{i=0}^{n} \frac{\partial L}{\partial y_i} \cdot \frac{-1}{\sqrt{\sigma + eps}}$

    Gradient with respect to the input $x$ (here $\sigma$ denotes the variance, as above):

    $\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{1}{\sqrt{\sigma + eps}} + \frac{\partial L}{\partial \sigma} \frac{\partial \sigma}{\partial x_i} + \frac{\partial L}{\partial \mu} \frac{\partial \mu}{\partial x_i}$

    $\qquad = \frac{\partial L}{\partial y_i} \frac{1}{\sqrt{\sigma + eps}} + \frac{\partial L}{\partial \sigma} \frac{2}{n} (x_i - \mu) + \frac{\partial L}{\partial \mu} \frac{1}{n}$

    $\qquad = \frac{1}{\sqrt{\sigma + eps}} \left( \frac{\partial L}{\partial y_i} - \frac{1}{n}\sum_{j=0}^{n} \frac{\partial L}{\partial y_j} - \left( \frac{1}{n}\sum_{j=0}^{n} \frac{\partial L}{\partial y_j} y_j \right) y_i \right)$

    $\qquad = \frac{1}{\sqrt{\sigma + eps}} \left( \frac{\partial L}{\partial y_i} - \mathrm{mean}\!\left(\frac{\partial L}{\partial y}\right) - \mathrm{mean}\!\left(\frac{\partial L}{\partial y} \cdot y\right) \cdot y_i \right)$
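
A minimal standalone C++ sketch of the moving-average bookkeeping in [F1]-[F3] (plain variables stand in for the layer's blobs_[2], blobs_[0] and blobs_[1]; the struct and its names are illustrative, not Caffe code):

#include <vector>

// Sketch only: s ~ blobs_[2], mean_ma ~ blobs_[0], var_ma ~ blobs_[1].
struct BNRunningStats {
  float s = 0.f;               // moving-average scale factor
  std::vector<float> mean_ma;  // accumulated mean, one entry per channel
  std::vector<float> var_ma;   // accumulated variance, one entry per channel

  void Update(const std::vector<float>& mean,  // current mini-batch mean
              const std::vector<float>& var,   // current mini-batch variance (1/m form)
              float lambda,                    // moving_average_fraction
              int m) {                         // m = N * H * W per channel
    if (mean_ma.empty()) {
      mean_ma.assign(mean.size(), 0.f);
      var_ma.assign(var.size(), 0.f);
    }
    s = lambda * s + 1.f;                                         // [F1]
    const float correction = (m > 1) ? float(m) / (m - 1) : 1.f;  // unbiased variance
    for (size_t c = 0; c < mean.size(); ++c) {
      mean_ma[c] = lambda * mean_ma[c] + mean[c];                   // [F2]
      var_ma[c]  = lambda * var_ma[c]  + correction * var[c];       // [F3]
    }
  }
};

At test time the stored values are divided by s to recover usable statistics, which is what the use_global_stats_ branch of Forward_cpu does below.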

The Caffe Scale layer handles the parameters $\gamma$ and $\beta$ (both are per-channel vectors).

  • Forward:

    $top = \gamma \cdot bottom + \beta$

    i.e. $y = \gamma \cdot x + \beta$

  • Backward (a sketch of these gradients follows after this list):

    $\frac{\partial y}{\partial x} = \gamma$

    $\frac{\partial y}{\partial \gamma} = x$

    $\frac{\partial y}{\partial \beta} = 1$
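
Accumulated over a whole N x C x H x W blob, those local derivatives give per-channel parameter gradients. A minimal standalone C++ sketch (the function and its layout are illustrative, not Caffe code):

#include <vector>

// Sketch only: dL/dgamma[c] = sum_{n,h,w} dL/dy * x, dL/dbeta[c] = sum_{n,h,w} dL/dy,
// dL/dx = dL/dy * gamma[c], for data stored as N x C x (H*W), row-major.
void ScaleBackward(const std::vector<float>& x, const std::vector<float>& dLdy,
                   const std::vector<float>& gamma,
                   int N, int C, int spatial,  // spatial = H * W
                   std::vector<float>* dLdx,
                   std::vector<float>* dgamma, std::vector<float>* dbeta) {
  dLdx->assign(x.size(), 0.f);
  dgamma->assign(C, 0.f);
  dbeta->assign(C, 0.f);
  for (int n = 0; n < N; ++n) {
    for (int c = 0; c < C; ++c) {
      for (int s = 0; s < spatial; ++s) {
        const int idx = (n * C + c) * spatial + s;
        (*dgamma)[c] += dLdy[idx] * x[idx];    // dy/dgamma = x
        (*dbeta)[c]  += dLdy[idx];             // dy/dbeta = 1
        (*dLdx)[idx]  = dLdy[idx] * gamma[c];  // dy/dx = gamma
      }
    }
  }
}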

1. Definition in prototxt

In Caffe, a BatchNorm layer is usually followed by a Scale layer, for example:

layer {
  bottom: "conv1"
  top: "conv1"
  name: "bn_conv1"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: true
  }
  param {
    name: "bn_conv1_0"
    lr_mult: 0
  }
  param {
    name: "bn_conv1_1"
    lr_mult: 0
  }
  param {
    name: "bn_conv1_2"
    lr_mult: 0
  }
}
layer {
  bottom: "conv1"
  top: "conv1"
  name: "scale_conv1"
  type: "Scale"
  scale_param {
    bias_term: true
  }
  param {
    name: "scale_conv1_0"
    lr_mult: 0
  }
  param {
    name: "scale_conv1_1"
    lr_mult: 0
  }
}

From train_voc_trainval_aug.prototxt

2. BatchNorm definition in caffe.proto

message LayerParameter {
  optional BatchNormParameter batch_norm_param = 139;
}

message BatchNormParameter {
  // If false, normalization is performed over the current mini-batch
  // and global statistics are accumulated (but not yet used) by a moving
  // average.
  // If true, those accumulated mean and variance values are used for the
  // normalization.
  // By default, it is set to false when the network is in the training
  // phase and true when the network is in the testing phase.
  optional bool use_global_stats = 1;
  // What fraction of the moving average remains each iteration?
  // Smaller values make the moving average decay faster, giving more
  // weight to the recent values.
  // Each iteration updates the moving average @f$S_{t-1}@f$ with the
  // current mean @f$ Y_t @f$ by
  // @f$ S_t = (1-\beta)Y_t + \beta \cdot S_{t-1} @f$, where @f$ \beta @f$
  // is the moving_average_fraction parameter.
  optional float moving_average_fraction = 2 [default = .999];
  // Small value to add to the variance estimate so that we don't divide by
  // zero, i.e. to keep the computation numerically stable.
  optional float eps = 3 [default = 1e-5];
}

optional float moving_average_fraction = 2 [default = .999]:

At each iteration, the moving average $S_{t-1}$ is updated with the current mean $Y_t$:

$S_t = (1-\beta)Y_t + \beta \cdot S_{t-1}$

where $\beta$ is the moving_average_fraction parameter.
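
Because the accumulated blobs hold decayed sums rather than plain averages, Forward_cpu divides them by the stored factor blobs_[2] before using them at test time. A minimal standalone C++ sketch of that step for one channel (illustrative, not Caffe code):

#include <cmath>
#include <vector>

// Sketch only: mean_ma ~ blobs_[0][c], var_ma ~ blobs_[1][c], s ~ blobs_[2][0].
void InferenceNormalize(std::vector<float>* x_channel,
                        float mean_ma, float var_ma, float s, float eps) {
  const float scale_factor = (s == 0.f) ? 0.f : 1.f / s;  // same guard as Forward_cpu
  const float mean = mean_ma * scale_factor;
  const float var  = var_ma * scale_factor;
  const float inv_std = 1.f / std::sqrt(var + eps);
  for (float& v : *x_channel) {
    v = (v - mean) * inv_std;
  }
}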

3. batch_norm_layer.hpp

#ifndef CAFFE_BATCHNORM_LAYER_HPP_
#define CAFFE_BATCHNORM_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

namespace caffe {

/**
 * @brief Normalizes the input to have 0-mean and/or unit (1) variance across
 *        the batch.
 *
 * This layer computes Batch Normalization as described in [1]. For each channel
 * in the data (i.e. axis 1), it subtracts the mean and divides by the variance,
 * where both statistics are computed across both spatial dimensions and across
 * the different examples in the batch.
 *
 * By default, during training time, the network is computing global
 * mean/variance statistics via a running average, which is then used at test
 * time to allow deterministic outputs for each input. You can manually toggle
 * whether the network is accumulating or using the statistics via the
 * use_global_stats option. For reference, these statistics are kept in the
 * layer's three blobs: (0) mean, (1) variance, and (2) moving average factor.
 *
 * Note that the original paper also included a per-channel learned bias and
 * scaling factor. To implement this in Caffe, define a `ScaleLayer` configured
 * with `bias_term: true` after each `BatchNormLayer` to handle both the bias
 * and scaling factor.
 *
 * [1] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network
 *     Training by Reducing Internal Covariate Shift." arXiv preprint
 *     arXiv:1502.03167 (2015).
 *
 * TODO(dox): thorough documentation for Forward, Backward, and proto params.
 */
template <typename Dtype>
class BatchNormLayer : public Layer<Dtype> {
 public:
  explicit BatchNormLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "BatchNorm"; }
  virtual inline int ExactNumBottomBlobs() const { return 1; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  Blob<Dtype> mean_, variance_, temp_, x_norm_;  // mean, variance, ...
  bool use_global_stats_;
  Dtype moving_average_fraction_;
  int channels_;
  Dtype eps_;

  // extra temporary variables are used to carry out sums/broadcasting
  // using BLAS
  Blob<Dtype> batch_sum_multiplier_;
  Blob<Dtype> num_by_chans_;
  Blob<Dtype> spatial_sum_multiplier_;
};

}  // namespace caffe

#endif  // CAFFE_BATCHNORM_LAYER_HPP_

4. batch_norm_layer.cpp

#include <algorithm>
#include <vector>

#include "caffe/layers/batch_norm_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  BatchNormParameter param = this->layer_param_.batch_norm_param();  // BatchNorm parameters
  moving_average_fraction_ = param.moving_average_fraction();  // moving-average fraction
  // During training, mean and variance are computed from the mini-batch;
  // during testing, the mean and variance of the whole dataset are used.
  use_global_stats_ = this->phase_ == TEST;
  if (param.has_use_global_stats())
    use_global_stats_ = param.use_global_stats();
  if (bottom[0]->num_axes() == 1)
    channels_ = 1;
  else
    channels_ = bottom[0]->shape(1);
  eps_ = param.eps();
  if (this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else {
    this->blobs_.resize(3);  // the stored "parameters" (statistics)
    vector<int> sz;
    sz.push_back(channels_);
    this->blobs_[0].reset(new Blob<Dtype>(sz));  // moving average of the mean, array of size channels_
    this->blobs_[1].reset(new Blob<Dtype>(sz));  // moving average of the variance, array of size channels_
    sz[0] = 1;
    this->blobs_[2].reset(new Blob<Dtype>(sz));  // moving-average factor, array of size 1
    for (int i = 0; i < 3; ++i) {
      caffe_set(this->blobs_[i]->count(), Dtype(0),
                this->blobs_[i]->mutable_cpu_data());  // initialize to 0
    }
  }
  // Mask statistics from optimization by setting local learning rates
  // for mean, variance, and the bias correction to zero.
  for (int i = 0; i < this->blobs_.size(); ++i) {
    if (this->layer_param_.param_size() == i) {
      ParamSpec* fixed_param_spec = this->layer_param_.add_param();
      fixed_param_spec->set_lr_mult(0.f);
    } else {
      CHECK_EQ(this->layer_param_.param(i).lr_mult(), 0.f)
          << "Cannot configure batch normalization statistics as layer "
          << "parameters.";
    }
  }
}

template <typename Dtype>
void BatchNormLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  // If bottom is one-dimensional there is a single mean/variance;
  // otherwise there is one per channel.
  if (bottom[0]->num_axes() >= 1)
    CHECK_EQ(bottom[0]->shape(1), channels_);
  top[0]->ReshapeLike(*bottom[0]);  // top[0] has the same shape as bottom[0]

  vector<int> sz;
  sz.push_back(channels_);
  mean_.Reshape(sz);      // holds the mean
  variance_.Reshape(sz);  // holds the variance
  temp_.ReshapeLike(*bottom[0]);  // holds the per-element squared deviation after the mean is subtracted
  x_norm_.ReshapeLike(*bottom[0]);
  sz[0] = bottom[0]->shape(0);
  batch_sum_multiplier_.Reshape(sz);  // batch size

  // spatial dimension, height*width
  int spatial_dim = bottom[0]->count()/(channels_*bottom[0]->shape(0));
  if (spatial_sum_multiplier_.num_axes() == 0 ||
      spatial_sum_multiplier_.shape(0) != spatial_dim) {
    sz[0] = spatial_dim;
    spatial_sum_multiplier_.Reshape(sz);
    Dtype* multiplier_data = spatial_sum_multiplier_.mutable_cpu_data();
    // spatial_sum_multiplier_ is filled with 1s; its size is height*width
    caffe_set(spatial_sum_multiplier_.count(), Dtype(1), multiplier_data);
  }

  int numbychans = channels_*bottom[0]->shape(0);  // channels * batchsize
  if (num_by_chans_.num_axes() == 0 ||
      num_by_chans_.shape(0) != numbychans) {
    sz[0] = numbychans;
    num_by_chans_.Reshape(sz);
    caffe_set(batch_sum_multiplier_.count(), Dtype(1),
        batch_sum_multiplier_.mutable_cpu_data());  // initialize to 1
  }
}

// Forward: the mean and variance are computed as matrix-vector products.
template <typename Dtype>
void BatchNormLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  int num = bottom[0]->shape(0);
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_);  // height*width

  // Check whether the layer's input and output are the same blob (in-place).
  if (bottom[0] != top[0]) {
    caffe_copy(bottom[0]->count(), bottom_data, top_data);
  }

  if (use_global_stats_) {
    // If use_global_stats_ = 1, use the stored mean/variance estimates.
    const Dtype scale_factor = this->blobs_[2]->cpu_data()[0] == 0 ?
        0 : 1 / this->blobs_[2]->cpu_data()[0];
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[0]->cpu_data(), mean_.mutable_cpu_data());  // multiply by the scale factor
    caffe_cpu_scale(variance_.count(), scale_factor,
        this->blobs_[1]->cpu_data(), variance_.mutable_cpu_data());
  } else {
    // If use_global_stats_ = 0, compute the mean.
    // num_by_chans_ = (1. / (num * spatial_dim)) * bottom_data * spatial_sum_multiplier_
    // channels_*num rows, spatial_dim columns -> channels_*num values in total
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), bottom_data,
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data());
    // mean_ = 1 * num_by_chans_^T * batch_sum_multiplier_
    // num rows, channels_ columns; summing over the batch gives channels_ values
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        mean_.mutable_cpu_data());
  }

  // subtract mean
  // num_by_chans_ = 1 * batch_sum_multiplier_ * mean_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // top_data = -1 * num_by_chans_ * spatial_sum_multiplier_ + 1.0 * top_data
  // i.e. subtract the mean mean_ from top_data
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, -1, num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., top_data);

  if (!use_global_stats_) {
    // If use_global_stats_ = 0, compute the variance using var(X) = E((X-EX)^2)
    // square each element; the result goes to temp_
    caffe_sqr<Dtype>(top[0]->count(), top_data, temp_.mutable_cpu_data());  // (X-EX)^2
    // num_by_chans_ = (1. / (num * spatial_dim)) * temp_ * spatial_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
        1. / (num * spatial_dim), temp_.cpu_data(),
        spatial_sum_multiplier_.cpu_data(), 0.,
        num_by_chans_.mutable_cpu_data());  // matrix-vector product
    // variance_ = 1.0 * num_by_chans_^T * batch_sum_multiplier_
    caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
        num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
        variance_.mutable_cpu_data());  // E((X-EX)^2)

    // compute and save moving average
    // step [F1] of the overview
    this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
    this->blobs_[2]->mutable_cpu_data()[0] += 1;
    // this->blobs_[0] = 1 * mean_ + moving_average_fraction_ * this->blobs_[0]
    // step [F2] of the overview
    caffe_cpu_axpby(mean_.count(), Dtype(1), mean_.cpu_data(),
        moving_average_fraction_, this->blobs_[0]->mutable_cpu_data());
    // m = num * height * width
    int m = bottom[0]->count()/channels_;
    Dtype bias_correction_factor = m > 1 ? Dtype(m)/(m-1) : 1;
    // this->blobs_[1] = bias_correction_factor * variance_ + moving_average_fraction_ * this->blobs_[1]
    // m/(m-1) is the unbiased-variance correction
    // step [F3] of the overview
    caffe_cpu_axpby(variance_.count(), bias_correction_factor,
        variance_.cpu_data(), moving_average_fraction_,
        this->blobs_[1]->mutable_cpu_data());
  }

  // normalize variance
  // variance_ = variance_ + eps_, add a small value for numerical stability
  caffe_add_scalar(variance_.count(), eps_, variance_.mutable_cpu_data());
  // take the square root of each element of variance_
  caffe_sqrt(variance_.count(), variance_.cpu_data(),
             variance_.mutable_cpu_data());

  // replicate variance to input size
  // the next two gemm calls broadcast the channels_ variance values to
  // channels_ * num * height * width
  // num_by_chans_ = 1 * batch_sum_multiplier_ * variance_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), variance_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  // temp_ = 1.0 * num_by_chans_ * spatial_sum_multiplier_
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., temp_.mutable_cpu_data());
  // elementwise: top_data[i] = top_data[i] / temp_[i]
  caffe_div(temp_.count(), top_data, temp_.cpu_data(), top_data);
  // TODO(cdoersch): The caching is only needed because later in-place layers
  //                 might clobber the data.  Can we skip this if they won't?
  // copy the result in top_data to x_norm_
  caffe_copy(x_norm_.count(), top_data, x_norm_.mutable_cpu_data());
}

// See the backward formulas in the overview.
template <typename Dtype>
void BatchNormLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  const Dtype* top_diff;  // gradient
  if (bottom[0] != top[0]) {
    top_diff = top[0]->cpu_diff();
  } else {
    caffe_copy(x_norm_.count(), top[0]->cpu_diff(), x_norm_.mutable_cpu_diff());
    top_diff = x_norm_.cpu_diff();
  }
  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
  if (use_global_stats_) {
    caffe_div(temp_.count(), top_diff, temp_.cpu_data(), bottom_diff);
    return;
  }
  const Dtype* top_data = x_norm_.cpu_data();
  int num = bottom[0]->shape()[0];
  int spatial_dim = bottom[0]->count()/(bottom[0]->shape(0)*channels_);
  // if Y = (X-mean(X))/(sqrt(var(X)+eps)), then
  //
  // dE(Y)/dX =
  //   (dE/dY - mean(dE/dY) - mean(dE/dY \cdot Y) \cdot Y)
  //     ./ sqrt(var(X) + eps)
  //
  // where \cdot and ./ are hadamard product and elementwise division,
  // respectively, dE/dY is the top diff, and mean/var/sum are all computed
  // along all dimensions except the channels dimension.  In the above
  // equation, the operations allow for expansion (i.e. broadcast) along all
  // dimensions except the channels dimension where required.

  // sum(dE/dY \cdot Y)
  caffe_mul(temp_.count(), top_data, top_diff, bottom_diff);
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      bottom_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());

  // reshape (broadcast) the above
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, channels_ * num,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 0., bottom_diff);

  // sum(dE/dY \cdot Y) \cdot Y
  caffe_mul(temp_.count(), top_data, bottom_diff, bottom_diff);

  // sum(dE/dY)-sum(dE/dY \cdot Y) \cdot Y
  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim, 1.,
      top_diff, spatial_sum_multiplier_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
      mean_.mutable_cpu_data());
  // reshape (broadcast) the above to make
  // sum(dE/dY)-sum(dE/dY \cdot Y) \cdot Y
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num, channels_, 1, 1,
      batch_sum_multiplier_.cpu_data(), mean_.cpu_data(), 0.,
      num_by_chans_.mutable_cpu_data());
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num * channels_,
      spatial_dim, 1, 1., num_by_chans_.cpu_data(),
      spatial_sum_multiplier_.cpu_data(), 1., bottom_diff);

  // dE/dY - mean(dE/dY)-mean(dE/dY \cdot Y) \cdot Y
  caffe_cpu_axpby(temp_.count(), Dtype(1), top_diff,
      Dtype(-1. / (num * spatial_dim)), bottom_diff);

  // note: temp_ still contains sqrt(var(X)+eps), computed during the forward
  // pass.
  caffe_div(temp_.count(), bottom_diff, temp_.cpu_data(), bottom_diff);
}

#ifdef CPU_ONLY
STUB_GPU(BatchNormLayer);
#endif

INSTANTIATE_CLASS(BatchNormLayer);
REGISTER_LAYER_CLASS(BatchNorm);
}  // namespace caffe
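
The BLAS-based Backward_cpu above is a vectorized form of the closed-form gradient from the overview. A minimal standalone C++ cross-check for a single channel (illustrative, not Caffe code), where y is the normalized output x_norm_ saved during the forward pass:

#include <cmath>
#include <vector>

// Sketch only: dL/dx = (dL/dy - mean(dL/dy) - mean(dL/dy .* y) .* y) / sqrt(var + eps).
std::vector<float> BatchNormBackward(const std::vector<float>& dLdy,
                                     const std::vector<float>& y,  // normalized output
                                     float var, float eps) {
  const float n = static_cast<float>(dLdy.size());
  float mean_dy = 0.f, mean_dy_y = 0.f;
  for (size_t i = 0; i < dLdy.size(); ++i) {
    mean_dy   += dLdy[i];
    mean_dy_y += dLdy[i] * y[i];
  }
  mean_dy /= n;
  mean_dy_y /= n;

  const float inv_std = 1.f / std::sqrt(var + eps);
  std::vector<float> dLdx(dLdy.size());
  for (size_t i = 0; i < dLdy.size(); ++i) {
    dLdx[i] = (dLdy[i] - mean_dy - mean_dy_y * y[i]) * inv_std;
  }
  return dLdx;
}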

5. Scale definition in caffe.proto

message LayerParameter {
  optional ScaleParameter scale_param = 142;
}

message ScaleParameter {
  // The first axis of bottom[0] (the first input Blob) along which to apply
  // bottom[1] (the second input Blob).  May be negative to index from the end
  // (e.g., -1 for the last axis).
  //
  // For example, if bottom[0] is 4D with shape 100x3x40x60, the output
  // top[0] will have the same shape, and bottom[1] may have any of the
  // following shapes (for the given value of axis):
  //    (axis == 0 == -4) 100; 100x3; 100x3x40; 100x3x40x60
  //    (axis == 1 == -3)          3;     3x40;     3x40x60
  //    (axis == 2 == -2)                   40;       40x60
  //    (axis == 3 == -1)                                60
  // Furthermore, bottom[1] may have the empty shape (regardless of the value of
  // "axis") -- a scalar multiplier.
  optional int32 axis = 1 [default = 1];

  // (num_axes is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.  Otherwise, num_axes is determined by the
  // number of axes by the second bottom.)
  // The number of axes of the input (bottom[0]) covered by the scale
  // parameter, or -1 to cover all axes of bottom[0] starting from `axis`.
  // Set num_axes := 0, to multiply with a zero-axis Blob: a scalar.
  optional int32 num_axes = 2 [default = 1];

  // (filler is ignored unless just one bottom is given and the scale is
  // a learned parameter of the layer.)
  // The initialization for the learned scale parameter.
  // Default is the unit (1) initialization, resulting in the ScaleLayer
  // initially performing the identity operation.
  optional FillerParameter filler = 3;

  // Whether to also learn a bias (equivalent to a ScaleLayer+BiasLayer, but
  // may be more efficient).  Initialized with bias_filler (defaults to 0).
  optional bool bias_term = 4 [default = false];
  optional FillerParameter bias_filler = 5;
}

That is, the layer computes the elementwise product of its two inputs, broadcasting the second input to match the shape of the first; equivalently, the second input is tiled to the first input's shape and the elementwise (Hadamard) product is then taken.
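
A minimal standalone C++ sketch of that broadcast (illustrative only; it mirrors the outer_dim_/scale_dim_/inner_dim_ loop structure used by ScaleLayer::Forward_cpu below):

#include <vector>

// Sketch only: a scale vector of length scale_dim is applied to a blob viewed
// as outer_dim x scale_dim x inner_dim; each scale[d] is tiled over inner_dim
// contiguous elements.
void BroadcastScale(const std::vector<float>& bottom,
                    const std::vector<float>& scale,
                    int outer_dim, int scale_dim, int inner_dim,
                    std::vector<float>* top) {
  top->resize(bottom.size());
  int offset = 0;
  for (int n = 0; n < outer_dim; ++n) {
    for (int d = 0; d < scale_dim; ++d) {
      for (int k = 0; k < inner_dim; ++k, ++offset) {
        (*top)[offset] = scale[d] * bottom[offset];
      }
    }
  }
}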

In short:

optional int32 axis [default = 1];               // the axis to start applying the scale from
optional int32 num_axes [default = 1];           // can be ignored for BN; mainly determines the shape covered by the second bottom
optional FillerParameter filler;                 // filler used to initialize alpha (gamma)
optional FillerParameter bias_filler;            // filler used to initialize beta
optional bool bias_term = 4 [default = false];   // whether to learn a bias; if not, the layer reduces to y = alpha * x

6. scale_layer.hpp

#ifndef CAFFE_SCALE_LAYER_HPP_
#define CAFFE_SCALE_LAYER_HPP_

#include <vector>

#include "caffe/blob.hpp"
#include "caffe/layer.hpp"
#include "caffe/proto/caffe.pb.h"

#include "caffe/layers/bias_layer.hpp"

namespace caffe {

/**
 * @brief Computes the elementwise product of two input Blobs, with the shape of
 *        the latter Blob "broadcast" to match the shape of the former.
 *        Equivalent to tiling the latter Blob, then computing the elementwise
 *        product. Note: for efficiency and convenience, this layer can
 *        additionally perform a "broadcast" sum too when `bias_term: true`
 *        is set.
 *
 * The latter, scale input may be omitted, in which case it's learned as
 * parameter of the layer (as is the bias, if it is included).
 */
template <typename Dtype>
class ScaleLayer: public Layer<Dtype> {
 public:
  explicit ScaleLayer(const LayerParameter& param)
      : Layer<Dtype>(param) {}
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);

  virtual inline const char* type() const { return "Scale"; }
  virtual inline int MinBottomBlobs() const { return 1; }
  virtual inline int MaxBottomBlobs() const { return 2; }
  virtual inline int ExactNumTopBlobs() const { return 1; }

 protected:
  /**
   * In the below shape specifications, @f$ i @f$ denotes the value of the
   * `axis` field given by `this->layer_param_.scale_param().axis()`, after
   * canonicalization (i.e., conversion from negative to positive index,
   * if applicable).
   *
   * @param bottom input Blob vector (length 2)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the first factor @f$ x @f$
   *   -# @f$ (d_i \times ... \times d_j) @f$
   *      the second factor @f$ y @f$
   * @param top output Blob vector (length 1)
   *   -# @f$ (d_0 \times ... \times
   *           d_i \times ... \times d_j \times ... \times d_n) @f$
   *      the product @f$ z = x y @f$ computed after "broadcasting" y.
   *      Equivalent to tiling @f$ y @f$ to have the same shape as @f$ x @f$,
   *      then computing the elementwise product.
   */
  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top);
  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);
  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  shared_ptr<Layer<Dtype> > bias_layer_;
  vector<Blob<Dtype>*> bias_bottom_vec_;
  vector<bool> bias_propagate_down_;
  int bias_param_id_;

  Blob<Dtype> sum_multiplier_;
  Blob<Dtype> sum_result_;
  Blob<Dtype> temp_;
  int axis_;
  int outer_dim_, scale_dim_, inner_dim_;
};

}  // namespace caffe

#endif  // CAFFE_SCALE_LAYER_HPP_

7. scale_layer.cpp

#include <algorithm>
#include <vector>

#include "caffe/filler.hpp"
#include "caffe/layer_factory.hpp"
#include "caffe/layers/scale_layer.hpp"
#include "caffe/util/math_functions.hpp"

namespace caffe {

template <typename Dtype>
void ScaleLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();  // scale parameters
  // check whether the layer's parameter blobs already hold values
  if (bottom.size() == 1 && this->blobs_.size() > 0) {
    LOG(INFO) << "Skipping parameter initialization";
  } else if (bottom.size() == 1) {
    // scale is a learned parameter; initialize it
    axis_ = bottom[0]->CanonicalAxisIndex(param.axis());
    const int num_axes = param.num_axes();
    CHECK_GE(num_axes, -1) << "num_axes must be non-negative, "
                           << "or -1 to extend to the end of bottom[0]";
    if (num_axes >= 0) {
      CHECK_GE(bottom[0]->num_axes(), axis_ + num_axes)
          << "scale blob's shape extends past bottom[0]'s shape when applied "
          << "starting with bottom[0] axis = " << axis_;
    }
    this->blobs_.resize(1);  // gamma
    const vector<int>::const_iterator& shape_start =
        bottom[0]->shape().begin() + axis_;
    const vector<int>::const_iterator& shape_end =
        (num_axes == -1) ? bottom[0]->shape().end() : (shape_start + num_axes);
    vector<int> scale_shape(shape_start, shape_end);
    this->blobs_[0].reset(new Blob<Dtype>(scale_shape));
    FillerParameter filler_param(param.filler());
    if (!param.has_filler()) {
      // If no filler is given, default to unit (1) filler for identity operation.
      filler_param.set_type("constant");
      filler_param.set_value(1);
    }
    shared_ptr<Filler<Dtype> > filler(GetFiller<Dtype>(filler_param));
    filler->Fill(this->blobs_[0].get());
  }
  if (param.bias_term()) {  // whether a bias term needs to be handled
    LayerParameter layer_param(this->layer_param_);
    layer_param.set_type("Bias");
    BiasParameter* bias_param = layer_param.mutable_bias_param();
    bias_param->set_axis(param.axis());
    if (bottom.size() > 1) {
      bias_param->set_num_axes(bottom[1]->num_axes());
    } else {
      bias_param->set_num_axes(param.num_axes());
    }
    bias_param->mutable_filler()->CopyFrom(param.bias_filler());
    bias_layer_ = LayerRegistry<Dtype>::CreateLayer(layer_param);
    bias_bottom_vec_.resize(1);
    bias_bottom_vec_[0] = bottom[0];
    bias_layer_->SetUp(bias_bottom_vec_, top);
    if (this->blobs_.size() + bottom.size() < 3) {
      // case: blobs.size == 1 && bottom.size == 1
      // or blobs.size == 0 && bottom.size == 2
      bias_param_id_ = this->blobs_.size();
      this->blobs_.resize(bias_param_id_ + 1);
      this->blobs_[bias_param_id_] = bias_layer_->blobs()[0];
    } else {
      // bias param already initialized
      bias_param_id_ = this->blobs_.size() - 1;
      bias_layer_->blobs()[0] = this->blobs_[bias_param_id_];
    }
    bias_propagate_down_.resize(1, false);
  }
  this->param_propagate_down_.resize(this->blobs_.size(), true);
}

template <typename Dtype>
void ScaleLayer<Dtype>::Reshape(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const ScaleParameter& param = this->layer_param_.scale_param();
  Blob<Dtype>* scale = (bottom.size() > 1) ? bottom[1] : this->blobs_[0].get();
  // Always set axis_ == 0 in special case where scale is a scalar
  // (num_axes == 0). Mathematically equivalent for any choice of axis_, so the
  // actual setting can be safely ignored; and computation is most efficient
  // with axis_ == 0 and (therefore) outer_dim_ == 1. (Setting axis_ to
  // bottom[0]->num_axes() - 1, giving inner_dim_ == 1, would be equally
  // performant.)
  axis_ = (scale->num_axes() == 0) ?
      0 : bottom[0]->CanonicalAxisIndex(param.axis());
  CHECK_GE(bottom[0]->num_axes(), axis_ + scale->num_axes())
      << "scale blob's shape extends past bottom[0]'s shape when applied "
      << "starting with bottom[0] axis = " << axis_;
  for (int i = 0; i < scale->num_axes(); ++i) {
    CHECK_EQ(bottom[0]->shape(axis_ + i), scale->shape(i))
        << "dimension mismatch between bottom[0]->shape(" << axis_ + i
        << ") and scale->shape(" << i << ")";
  }
  outer_dim_ = bottom[0]->count(0, axis_);
  scale_dim_ = scale->count();
  inner_dim_ = bottom[0]->count(axis_ + scale->num_axes());
  // If top and bottom are the same blob, compute in place.
  if (bottom[0] == top[0]) {  // in-place computation
    temp_.ReshapeLike(*bottom[0]);
  } else {
    top[0]->ReshapeLike(*bottom[0]);
  }
  sum_result_.Reshape(vector<int>(1, outer_dim_ * scale_dim_));
  const int sum_mult_size = std::max(outer_dim_, inner_dim_);
  sum_multiplier_.Reshape(vector<int>(1, sum_mult_size));
  if (sum_multiplier_.cpu_data()[sum_mult_size - 1] != Dtype(1)) {
    caffe_set(sum_mult_size, Dtype(1), sum_multiplier_.mutable_cpu_data());
  }
  if (bias_layer_) {
    bias_bottom_vec_[0] = top[0];
    bias_layer_->Reshape(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();
  if (bottom[0] == top[0]) {
    // In-place computation; need to store bottom data before overwriting it.
    // Note that this is only necessary for Backward; we could skip this if not
    // doing Backward, but Caffe currently provides no way of knowing whether
    // we'll need to do Backward at the time of the Forward call.
    // For in-place computation, first copy the input to temp_, then compute.
    caffe_copy(bottom[0]->count(), bottom[0]->cpu_data(),
               temp_.mutable_cpu_data());
  }
  const Dtype* scale_data =
      ((bottom.size() > 1) ? bottom[1] : this->blobs_[0].get())->cpu_data();
  Dtype* top_data = top[0]->mutable_cpu_data();
  for (int n = 0; n < outer_dim_; ++n) {
    for (int d = 0; d < scale_dim_; ++d) {
      const Dtype factor = scale_data[d];
      caffe_cpu_scale(inner_dim_, factor, bottom_data, top_data);
      bottom_data += inner_dim_;
      top_data += inner_dim_;
    }
  }
  if (bias_layer_) {
    bias_layer_->Forward(bias_bottom_vec_, top);
  }
}

template <typename Dtype>
void ScaleLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  if (bias_layer_ &&
      this->param_propagate_down_[this->param_propagate_down_.size() - 1]) {
    bias_layer_->Backward(top, bias_propagate_down_, bias_bottom_vec_);
  }
  const bool scale_param = (bottom.size() == 1);
  Blob<Dtype>* scale = scale_param ? this->blobs_[0].get() : bottom[1];
  if ((!scale_param && propagate_down[1]) ||
      (scale_param && this->param_propagate_down_[0])) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const bool in_place = (bottom[0] == top[0]);
    const Dtype* bottom_data = (in_place ? &temp_ : bottom[0])->cpu_data();
    // Hack: store big eltwise product in bottom[0] diff, except in the special
    // case where this layer itself does the eltwise product, in which case we
    // can store it directly in the scale diff, and we're done.
    // If we're computing in-place (and not doing eltwise computation), this
    // hack doesn't work and we store the product in temp_.
    const bool is_eltwise = (bottom[0]->count() == scale->count());
    Dtype* product = (is_eltwise ? scale->mutable_cpu_diff() :
        (in_place ? temp_.mutable_cpu_data() : bottom[0]->mutable_cpu_diff()));
    caffe_mul(top[0]->count(), top_diff, bottom_data, product);
    if (!is_eltwise) {
      Dtype* sum_result = NULL;
      if (inner_dim_ == 1) {
        sum_result = product;
      } else if (sum_result_.count() == 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_param) {
          Dtype result = caffe_cpu_dot(inner_dim_, product, sum_mult);
          *scale_diff += result;
        } else {
          *scale_diff = caffe_cpu_dot(inner_dim_, product, sum_mult);
        }
      } else {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        sum_result = (outer_dim_ == 1) ?
            scale->mutable_cpu_diff() : sum_result_.mutable_cpu_data();
        caffe_cpu_gemv(CblasNoTrans, sum_result_.count(), inner_dim_,
                       Dtype(1), product, sum_mult, Dtype(0), sum_result);
      }
      if (outer_dim_ != 1) {
        const Dtype* sum_mult = sum_multiplier_.cpu_data();
        Dtype* scale_diff = scale->mutable_cpu_diff();
        if (scale_dim_ == 1) {
          if (scale_param) {
            Dtype result = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
            *scale_diff += result;
          } else {
            *scale_diff = caffe_cpu_dot(outer_dim_, sum_mult, sum_result);
          }
        } else {
          caffe_cpu_gemv(CblasTrans, outer_dim_, scale_dim_,
                         Dtype(1), sum_result, sum_mult, Dtype(scale_param),
                         scale_diff);
        }
      }
    }
  }
  if (propagate_down[0]) {
    const Dtype* top_diff = top[0]->cpu_diff();
    const Dtype* scale_data = scale->cpu_data();
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    for (int n = 0; n < outer_dim_; ++n) {
      for (int d = 0; d < scale_dim_; ++d) {
        const Dtype factor = scale_data[d];
        caffe_cpu_scale(inner_dim_, factor, top_diff, bottom_diff);
        bottom_diff += inner_dim_;
        top_diff += inner_dim_;
      }
    }
  }
}

#ifdef CPU_ONLY
STUB_GPU(ScaleLayer);
#endif

INSTANTIATE_CLASS(ScaleLayer);
REGISTER_LAYER_CLASS(Scale);

}  // namespace caffe

