SPP、RFB和ASPP

CNN中“金字塔”系列的结构块在深度学习诸多模型中发挥着重要的作用，个人认为有大类：“横向”并联金字塔结构(Google的Inception系列)和“纵向”特征金字塔（FPN）。前者输出的特征图融合了具有多种尺度感受野的特征；后者更侧重于是深层语义特征与浅层边缘特征的融合，这里主要学习一下前者的几个典型结构（SPP、RFB和ASPP）,这几种结构或许可以作为模型即插即用的组件，对于优化模型有帮助，更重要的是学到如何根据实际需求设计合适的模型。

1、SPP(Spatial Pyramid Pooling)

空间金字塔池化在何凯明的SPPNet[1]中具有详尽的描述，解决的是如何在具有全连接层的网络中实现多尺度图像的输入，全连接层的输入是固定的，使用文中提及的SPP结构则可以实现无论输入特征图多大均能通过空间金字塔池化得到固定维度的神经元输入。其结构如下图所示：

如图所示，此处最后一层卷积conv5的输出维度256，按照图中三个并行的池化层计算则可得到256*（1+4+16）维的特征向量。在目标检测和语义分割的全卷积网络中，同样用到所谓的金字塔池化结构，但是结构上却有所不同，由于不需要展开成一个固定大小的一维向量，池化的Size则不一定是Sliding windows的形式（每个池化分支的对特征图划分的bins是固定不变的）。因此后续所谓的SPP都是修改后的结构。比如yolov3-spp[6]和yolov4[7]中的SPP结构仅仅是并联的池化层kernel size不用，输出特征图的大小则是相同的，如下图所示：

另外在语义分割任务的PSPNet[8]中，则是普通的最大池化并联最后接一个上采样恢复到同样尺寸的特征图。如图所示输入特征图大小为60x60，池化的kernel size分别10，20,30,60，如下图所示。

2、RFB(Receptive Field Block)

论文[2]提出RFBNet目标检测网络，RFBNet是在SSD网络中引入Receptive Field Block (RFB) ，引入RFB的出发点通过模拟人类视觉的感受野加强网络的特征提取能力，在结构上RFB借鉴了Inception的思想，主要是在Inception的基础上加入了dilated卷积层（dilated convolution），从而有效增大了感受野（receptive field）。RFB结构对于增大轻量级目标检测网络的感受野很有帮助，并且不会引入过多的计算量和粗暴增加网络深度。原理部分论文[2]给出了一个比较形象的示意图，描述RFB中的dilated卷积层如何加持感受野金字塔的，一三五卷积核配合空洞卷积实现了感受野的增大及不同感受野特征图的融合。补充一下空洞卷积的卷积层感受野大小的计算式：

$k+(k-1)(rate-1)$

论文[2]中的给出了RFB网络结构：

论文[2]官方开源实现：https://github.com/ruinmessi/RFBNet

另外 Github上的Ultra-Light-Fast-Generic-Face-Detector-1MB（https://github.com/Linzaer/Ultra-Light-Fast-Generic-Face-Detector-1MB）应该是一个应用RFB结构的较为火的的项目，超轻量超快速效果还不赖，其结构：

3、ASPP(atrous spatial pyramid pooling)[3,4,5]

RFBNet结构在Google的Incepetion系列的结构中引入了空洞卷积。同样用到了空洞卷积，其实 Google自家的ASPP（空洞空间卷积池化金字塔）也是能打的。ASPP最早是出现在语义分割网络DeepLabv2中，ASPP对所给定的输入以不同采样率的空洞卷积并行采样，相当于以多个比例捕捉图像的上下文，结构示意图如下

论文[3]（DeepLabv2）中从语义分割的角度阐述了使用这个一结构的好处：空洞卷积可以在不增加太多计算量的情况下增大滤波器的视野；空洞空间卷积池化金字塔可以增强模型对不同尺度分割目标的感知能力。

DeepLabv3[4]的论文中通过实验作者发现：当膨胀率越大，卷积核中的有效权重越少，当膨胀率足够大时，只有卷积核最中间的权重有效，即退化成了1x1卷积核，并不能获取到全局的context信息；解决方案是在最后一个特征上使用了全局平均池化（global average pooling）（包含1x1卷积核，输出256个通道，正则化，通过bilinear上采样还原到对应尺度）,最后和其他并行的分支concat.

（已经用大老师的onnxsim merge了BN）的结构如下：

1）Pytorch实现的不含BN版本代码：

class ASPPNOBN(nn.Module):def __init__(self, in_channel=512, depth=256):super(ASPPNOBN, self).__init__()self.mean = nn.AdaptiveAvgPool2d((1, 1))  # (1,1)means ouput_dimself.conv = nn.Conv2d(in_channel, depth, 1, 1)self.atrous_block1 = nn.Conv2d(in_channel, depth, 1, 1)self.atrous_block6 = nn.Conv2d(in_channel, depth, 3, 1, padding=6, dilation=6)self.atrous_block12 = nn.Conv2d(in_channel, depth, 3, 1, padding=12, dilation=12)self.atrous_block18 = nn.Conv2d(in_channel, depth, 3, 1, padding=18, dilation=18)self.conv_1x1_output = nn.Conv2d(depth * 5, depth, 1, 1)def forward(self, x):size = x.shape[2:]image_features = self.mean(x)image_features = self.conv(image_features)image_features = F.upsample(image_features, size=size, mode='bilinear')atrous_block1 = self.atrous_block1(x)atrous_block6 = self.atrous_block6(x)atrous_block12 = self.atrous_block12(x)atrous_block18 = self.atrous_block18(x)net = self.conv_1x1_output(torch.cat([image_features, atrous_block1, atrous_block6,atrous_block12, atrous_block18], dim=1))return net

2）Pytorch实现的BN版本代码(来自https://github.com/fregu856/deeplabv3)：

class ASPPBN(nn.Module):def __init__(self, in_planes=512,out_planes=256):super(ASPPBN, self).__init__()self.conv_1x1_1 = nn.Conv2d(in_planes, out_planes, kernel_size=1)self.bn_conv_1x1_1 = nn.BatchNorm2d(out_planes)self.conv_3x3_1 = nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=1, padding=6, dilation=6)self.bn_conv_3x3_1 = nn.BatchNorm2d(out_planes)self.conv_3x3_2 = nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=1, padding=12, dilation=12)self.bn_conv_3x3_2 = nn.BatchNorm2d(out_planes)self.conv_3x3_3 = nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=1, padding=18, dilation=18)self.bn_conv_3x3_3 = nn.BatchNorm2d(out_planes)self.avg_pool = nn.AdaptiveAvgPool2d(1)self.conv_1x1_2 = nn.Conv2d(in_planes, out_planes, kernel_size=1,stride=1)self.bn_conv_1x1_2 = nn.BatchNorm2d(out_planes)self.conv_1x1_3 = nn.Conv2d(out_planes*5, out_planes, kernel_size=1) # (1280 = 5*256)self.bn_conv_1x1_3 = nn.BatchNorm2d(out_planes)def forward(self, feature_map):feature_map_h = feature_map.size()[2] # (== h/16)feature_map_w = feature_map.size()[3] # (== w/16)out_1x1 = F.relu(self.bn_conv_1x1_1(self.conv_1x1_1(feature_map))) # (shape: (batch_size, 256, h/16, w/16))out_3x3_1 = F.relu(self.bn_conv_3x3_1(self.conv_3x3_1(feature_map))) # (shape: (batch_size, 256, h/16, w/16))out_3x3_2 = F.relu(self.bn_conv_3x3_2(self.conv_3x3_2(feature_map))) # (shape: (batch_size, 256, h/16, w/16))out_3x3_3 = F.relu(self.bn_conv_3x3_3(self.conv_3x3_3(feature_map))) # (shape: (batch_size, 256, h/16, w/16))out_img = self.avg_pool(feature_map) # (shape: (batch_size, 512, 1, 1))out_img = F.relu(self.bn_conv_1x1_2(self.conv_1x1_2(out_img))) # (shape: (batch_size, 256, 1, 1))out_img = F.upsample(out_img, size=(feature_map_h, feature_map_w), mode="bilinear") # (shape: (batch_size, 256, h/16, w/16))out = torch.cat([out_1x1, out_3x3_1, out_3x3_2, out_3x3_3,out_img], 1) # (shape: (batch_size, 1280, h/16, w/16))out = F.relu(self.bn_conv_1x1_3(self.conv_1x1_3(out))) # (shape: (batch_size, 256, h/16, w/16))return out

参考资料：

[1] He K , Zhang X , Ren S , et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition[J]. 2014.

[2] Liu S , Huang D , Wang Y . Receptive Field Block Net for Accurate and Fast Object Detection[J]. 2017.

[3] Chen L C , Papandreou G , Kokkinos I , et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(4):834-848.

[4] Chen L C , Papandreou G , Schroff F , et al. Rethinking Atrous Convolution for Semantic Image Segmentation[J]. 2017.

[5] Ma X , Liu K , Ding C , et al. Encoder-decoder with multi-scale information fusion for semantic image segmentation[C]// Eleventh International Conference on Graphics and Image Processing. 2020.

[6] Redmon J , Farhadi A . YOLOv3: An Incremental Improvement[J]. 2018.

[7] Bochkovskiy A , Wang C Y , Liao H Y M . YOLOv4: Optimal Speed and Accuracy of Object Detection[J]. 2020.

[8] Zhao H , Shi J , Qi X , et al. Pyramid Scene Parsing Network[J]. 2016.

9、https://blog.csdn.net/qq_36530992/article/details/102628455

10、https://www.jianshu.com/p/edbaa56d250d

11、https://blog.csdn.net/u014380165/article/details/81556769

SPP、RFB和ASPP相关推荐

空间金字塔池化改进 SPP / SPPF / SimSPPF / ASPP / RFB / SPPCSPC / SPPFCSPC
大家好,我是迪菲赫尔曼
YOLOv4全文阅读（全文中文翻译）
YOLOv4全文阅读(全文中文翻译) YOLOv4: Optimal Speed and Accuracy of Object Detection 论文链接: https://arxiv.org/pd ...
重磅更新！YoloV4最新论文与源码!权重！结构！翻译!
今年初 YOLO 之父Joseph Redmon宣布推出CV界,引起轩然大波,大家纷纷猜测YOLO是否不会再出v4版,退出历史舞台. 今天,YOLOv4重磅发布,作者为俄罗斯开发者 Alexey Bo ...
对象检测目标小用什么模型好_[目标检测] YOLO4论文中文版
点击上方蓝字关注我呀! [目标检测] YOLO4论文中文版文章目录 YOLO4论文中文版摘要 1.介绍 2.相关工作 2.1.目标检测模型 2.2.Bag of freebies 2.3.Ba ...
深度学习目标检测之 YOLO v4
论文原文:https://arxiv.org/abs/2004.10934 代码原版c++: https://github.com/AlexeyAB/darknet keras:https://gi ...
YOLOv4论文阅读（附原文翻译）
YOLOv4论文阅读(附原文翻译) 论文阅读论文翻译 Abstract摘要 1.Introduction 引言 2.Related work相关工作 2.1.Object detection mod ...
YOLOv4论文(中文版)
摘要据说有大量的特征可以提高卷积神经网络(CNN)的准确性.需要在大数据集上对这些特征的组合进行实际测试,并对结果进行理论验证.有些特征专门针对某些模型和某些问题,或者只针对小规模数据集;而一些特性 ...
【全文翻译】YOLOv4：目标检测的最佳速度和准确性
论文连接:https://arxiv.org/abs/2004.10934 翻译的很多都是直译的,不准的地方请欢迎大佬指正摘要有许多功能可以提高卷积神经网络(CNN)的准确性.需要在大型数据集 ...
提升目标检测模型性能的tricks
提升目标检测模型性能的tricks bag of freebies pixel-wise调整几何畸变光照变化遮挡 Random Erase CutOut Hide-and-Seek Grid M ...
YOLOv4论文中文翻译
YOLOv4: 目标检测最优速度和精度论文地址:https://arxiv.org/abs/2004.10934 参考文献和最后一大串表见原论文,机翻加手翻,错误请指正,讨论请留言. 摘要: 有大量 ...

SPP、RFB和ASPP

SPP、RFB和ASPP

1、SPP(Spatial Pyramid Pooling)

2、RFB(Receptive Field Block)

3、ASPP(atrous spatial pyramid pooling)[3,4,5]

SPP、RFB和ASPP相关推荐

最新文章

热门文章