A guide to receptive field arithmetic for CNN

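The receptive field is one of the most important concepts in CNNs. Current state-of-the-art object recognition methods all design their model architectures around it. This guide introduces a new way of visualizing CNN feature maps that exposes their receptive field information, together with a complete receptive field calculation that can be used for any CNN architecture. The author also provides a simple program to demonstrate the calculation. Pre-reading: A guide to convolution arithmetic for deep learning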

The fixed-sized CNN feature map visualization

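Receptive field: the region in the input space that a particular CNN feature looks at. The receptive field of a feature is fully described by its center location and its size. Applying a convolution with kernel size k = 3x3, padding size p = 1x1, and stride s = 2x2 yields a 3x3 output feature map (green). Applying the same convolution on top of the 3x3 feature map yields a 2x2 feature map (orange). The number of output features in each dimension can be calculated with the following formula:

n_out = floor((n_in + 2p - k) / s) + 1

In this post, to keep things simple, the CNN architecture is assumed to be symmetric and the input image square, so every variable has the same value in both dimensions. If the CNN architecture or the input image is not symmetric, the feature map attributes can be calculated separately for each dimension.

In figure 1, the left column shows the common way of visualizing a CNN feature map. With that visualization we can see how many features a feature map contains, but it is hard to tell where each feature is looking (the center of its receptive field) and how large that region is (the size of its receptive field). The right column shows the fixed-sized CNN visualization, in which every feature map is drawn at the same size as the input image. Each feature is marked at the center of its receptive field. Because all features in a feature map have the same receptive field size, we can simply draw a bounding box around one feature to represent it. Since the feature map has the same size as the input layer, there is no need to map the bounding box all the way down to the input.

Figure 2 shows another example, using the same convolution on a larger input image (7x7). Again we can draw the fixed-sized CNN feature maps. The receptive field grows very quickly here: by the second feature layer it covers almost the entire input image. This is an important insight which was used to improve the design of a deep CNN.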
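The feature map sizes quoted for the two figures can be checked with the usual convolution output-size rule, n_out = floor((n_in + 2p - k) / s) + 1, which is the same expression the Python example in this guide uses. The sketch below is only an illustration: the helper name conv_output_size is hypothetical, and the 5x5 input for figure 1 is an assumption (it is consistent with the 3x3 output stated above), while the 7x7 input comes from figure 2.

```python
import math

def conv_output_size(n_in, k, s, p):
    """Number of output features along one dimension of a convolution."""
    return math.floor((n_in + 2 * p - k) / s) + 1

# Kernel from the text: k = 3, s = 2, p = 1, applied twice in a row.
# 5 is an assumed input size for figure 1; 7 is the input size of figure 2.
for n_input in (5, 7):
    n1 = conv_output_size(n_input, k=3, s=2, p=1)
    n2 = conv_output_size(n1, k=3, s=2, p=1)
    print(f"input {n_input}x{n_input} -> {n1}x{n1} -> {n2}x{n2}")
```

Under these assumptions the 5x5 input shrinks to 3x3 and then 2x2, and the 7x7 input shrinks to 4x4 and then 2x2.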

Receptive Field Arithmetic

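Note that the center coordinate of a feature is defined to be the center coordinate of its receptive field, as shown in the fixed-sized CNN feature map above. The attributes of the output layer are calculated as follows:

n_out = floor((n_in + 2p - k) / s) + 1
j_out = j_in * s
r_out = r_in + (k - 1) * j_in
start_out = start_in + ((k - 1)/2 - p) * j_in

Here n is the number of features, j is the jump (the distance between two adjacent features, measured in input pixels), r is the receptive field size, and start is the center coordinate of the first (top-left) feature. Coordinates are measured relative to the top-left feature of the input. Applying these four formulas recursively, starting from the input layer, gives the receptive field information for every feature map in the CNN.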
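As a small worked case of the recursion, the sketch below pushes the 7x7 example through two identical layers (k = 3, s = 2, p = 1). It assumes symmetric padding p, whereas the full Python example in this guide derives the left padding actually applied, and it starts from n = 7, j = 1, r = 1, start = 0.5 for the input layer, as that example does. The helper name next_layer is hypothetical.

```python
import math

def next_layer(layer, k, s, p):
    """Apply one conv/pool layer to (n, j, r, start) and return the new tuple."""
    n_in, j_in, r_in, start_in = layer
    n_out = math.floor((n_in + 2 * p - k) / s) + 1     # number of features
    j_out = j_in * s                                   # jump between adjacent features
    r_out = r_in + (k - 1) * j_in                      # receptive field size
    start_out = start_in + ((k - 1) / 2 - p) * j_in    # center of the first feature
    return n_out, j_out, r_out, start_out

# Input layer: n = image size, j = 1, r = 1, start = 0.5 (center of the first pixel).
layer = (7, 1, 1, 0.5)
history = [layer]
for _ in range(2):  # two identical layers: k = 3, s = 2, p = 1
    layer = next_layer(layer, k=3, s=2, p=1)
    history.append(layer)

for depth, (n, j, r, start) in enumerate(history):
    print(f"layer {depth}: n={n}, jump={j}, r={r}, start={start}")
```

With these settings the second layer ends up with r = 7, which equals the width of the 7x7 input and illustrates how quickly the receptive field grows.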

Python example

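Given the name of any feature map and the index of a feature in that map, the program returns the size and location of the corresponding receptive field.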
# Written for Python 3 (true division and input()).
# [filter size, stride, padding]
# Assume the two dimensions are the same
# Each kernel requires the following parameters:
# - k_i: kernel size
# - s_i: stride
# - p_i: padding (if padding is uneven, right padding will be higher than left padding; "SAME" option in tensorflow)
# Each layer i requires the following parameters to be fully represented:
# - n_i: number of features (data layer has n_1 = imagesize)
# - j_i: distance (projected to image pixel distance) between centers of two adjacent features
# - r_i: receptive field of a feature in layer i
# - start_i: position of the first feature's receptive field in layer i (idx starts from 0, negative means the center falls into padding)
import math

convnet = [[11, 4, 0], [3, 2, 0], [5, 1, 2], [3, 2, 0], [3, 1, 1],
           [3, 1, 1], [3, 1, 1], [3, 2, 0], [6, 1, 0], [1, 1, 0]]
layer_names = ['conv1', 'pool1', 'conv2', 'pool2', 'conv3',
               'conv4', 'conv5', 'pool5', 'fc6-conv', 'fc7-conv']
imsize = 227


def outFromIn(conv, layerIn):
    # Unpack the input layer description and the kernel parameters.
    n_in, j_in, r_in, start_in = layerIn
    k, s, p = conv

    n_out = math.floor((n_in - k + 2 * p) / s) + 1
    # Padding actually consumed by the convolution, split into right and left.
    actualP = (n_out - 1) * s - n_in + k
    pR = math.ceil(actualP / 2)
    pL = math.floor(actualP / 2)

    j_out = j_in * s
    r_out = r_in + (k - 1) * j_in
    start_out = start_in + ((k - 1) / 2 - pL) * j_in
    return n_out, j_out, r_out, start_out


def printLayer(layer, layer_name):
    print(layer_name + ":")
    print("\t n features: %s \n \t jump: %s \n \t receptive size: %s \t start: %s " % (layer[0], layer[1], layer[2], layer[3]))


layerInfos = []
if __name__ == '__main__':
    # first layer is the data layer (image) with n_0 = image size; j_0 = 1; r_0 = 1; and start_0 = 0.5
    print("-------Net summary------")
    currentLayer = [imsize, 1, 1, 0.5]
    printLayer(currentLayer, "input image")
    for i in range(len(convnet)):
        currentLayer = outFromIn(convnet[i], currentLayer)
        layerInfos.append(currentLayer)
        printLayer(currentLayer, layer_names[i])
    print("------------------------")

    layer_name = input("Layer name where the feature in: ")
    layer_idx = layer_names.index(layer_name)
    idx_x = int(input("index of the feature in x dimension (from 0)"))
    idx_y = int(input("index of the feature in y dimension (from 0)"))

    n, j, r, start = layerInfos[layer_idx]
    assert idx_x < n
    assert idx_y < n

    print("receptive field: (%s, %s)" % (r, r))
    print("center: (%s, %s)" % (start + idx_x * j, start + idx_y * j))
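The program above is interactive. For use inside other code, the same recurrence can be wrapped in a lookup that returns a receptive field as a bounding box. This is only a sketch: the names layer_table and receptive_box are hypothetical, the layer list is copied from the example above, and clipping the box to the image is a design choice made here, not something the original program does.

```python
import math

# [kernel size, stride, padding] per layer, copied from the example above.
convnet = [[11, 4, 0], [3, 2, 0], [5, 1, 2], [3, 2, 0], [3, 1, 1],
           [3, 1, 1], [3, 1, 1], [3, 2, 0], [6, 1, 0], [1, 1, 0]]
layer_names = ['conv1', 'pool1', 'conv2', 'pool2', 'conv3',
               'conv4', 'conv5', 'pool5', 'fc6-conv', 'fc7-conv']
imsize = 227

def layer_table(convnet, layer_names, imsize):
    """Map each layer name to its (n, j, r, start) tuple."""
    n, j, r, start = imsize, 1, 1, 0.5
    table = {}
    for name, (k, s, p) in zip(layer_names, convnet):
        n_out = math.floor((n - k + 2 * p) / s) + 1
        # Left padding actually consumed by this layer.
        p_left = math.floor(((n_out - 1) * s - n + k) / 2)
        # start and r use the jump of the input layer, so update j last.
        start = start + ((k - 1) / 2 - p_left) * j
        r = r + (k - 1) * j
        j = j * s
        n = n_out
        table[name] = (n, j, r, start)
    return table

def receptive_box(table, name, idx_x, idx_y, imsize):
    """Return (x_min, y_min, x_max, y_max) of a receptive field, clipped to the image."""
    n, j, r, start = table[name]
    if not (0 <= idx_x < n and 0 <= idx_y < n):
        raise IndexError("feature index out of range")
    cx, cy = start + idx_x * j, start + idx_y * j
    half = r / 2
    return (max(0, cx - half), max(0, cy - half),
            min(imsize, cx + half), min(imsize, cy + half))

table = layer_table(convnet, layer_names, imsize)
print(table['pool5'])
print(receptive_box(table, 'pool5', 0, 0, imsize))
```

Clipping matters for features near the border, whose receptive fields extend into the padding. Without it, the box for the first pool5 feature would start at a negative coordinate.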
