1. MobileNet V1

1.1 Abstract

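We present MobileNets, a class of efficient models for mobile and embedded vision applications. The models use depthwise separable convolutions to build lightweight deep neural networks. We also introduce two hyperparameters that control the trade-off between a model's latency (its running time) and its accuracy.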

1.2 Introduction

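The general trend in computer vision has been to build deeper and more complicated networks in order to achieve higher accuracy. Real-world applications, however, need to carry out recognition tasks with low latency on computationally limited platforms.

Many recent approaches focus only on model size and do not consider speed; they mostly either compress a pretrained network or train a small network directly.

This paper describes an efficient network architecture (based on depthwise separable convolutions) and a set of two hyperparameters for building very small, low-latency models that can be easily matched to the design requirements of mobile and embedded vision applications.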

1.3 MobileNet Architecture

1.3.1 Depthwise Separable Convolution

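MobileNet is built on depthwise separable convolutions, which factorize a standard convolution into a depthwise convolution (depthwise conv) and a 1x1 pointwise convolution (pointwise conv).

depthwise conv: applies a single filter to each input channel
pointwise conv: uses a 1x1 convolution to combine the outputs of the depthwise conv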
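The two steps can be sketched in plain Python on nested lists. This is a minimal illustration, not the implementation used by any library: it assumes stride 1 and no padding, and the function names (`depthwise_conv`, `pointwise_conv`, `depthwise_separable_conv`) and the toy sizes are made up for this example.

```python
def depthwise_conv(x, dw_kernels):
    """Depthwise step: one K x K filter per input channel.

    x is indexed [channel][row][col]; dw_kernels is indexed [channel][row][col].
    Assumes stride 1 and no padding.
    """
    m, h, w = len(x), len(x[0]), len(x[0][0])
    k = len(dw_kernels[0])
    out_h, out_w = h - k + 1, w - k + 1
    return [
        [
            [
                sum(x[c][i + di][j + dj] * dw_kernels[c][di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(out_w)
            ]
            for i in range(out_h)
        ]
        for c in range(m)
    ]


def pointwise_conv(x, pw_weights):
    """Pointwise step: a 1x1 convolution that mixes channels.

    pw_weights is indexed [output_channel][input_channel].
    """
    m, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(pw_weights[n][c] * x[c][i][j] for c in range(m)) for j in range(w)]
            for i in range(h)
        ]
        for n in range(len(pw_weights))
    ]


def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Depthwise filtering followed by pointwise channel mixing."""
    return pointwise_conv(depthwise_conv(x, dw_kernels), pw_weights)


# Toy sizes (assumed): M = 3 input channels, N = 4 output channels,
# a 3 x 3 kernel and a 5 x 5 input filled with ones.
M, N, K, D_F = 3, 4, 3, 5
x = [[[1.0] * D_F for _ in range(D_F)] for _ in range(M)]
dw = [[[1.0] * K for _ in range(K)] for _ in range(M)]
pw = [[1.0] * M for _ in range(N)]

y = depthwise_separable_conv(x, dw, pw)
print(len(y), len(y[0]), len(y[0][0]))  # 4 3 3
print(y[0][0][0])                       # 27.0 = (3 * 3 ones per channel) * 3 channels
```

In PyTorch the depthwise step corresponds to an `nn.Conv2d` whose `groups` equals its number of input channels, which is how the implementations later in this post express it.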

Worked example:

Computational cost of a standard convolution versus a depthwise separable convolution:

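input feature map: $D_F \times D_F \times M$

output feature map: $D_G \times D_G \times N$

conv kernel size: $D_K \times D_K \times M \times N$

$D_F$: spatial width and height of the input feature map
$D_G$: spatial width and height of the output feature map
$M$: number of input channels (input feature maps)
$N$: number of output channels (output feature maps)
$D_K$: spatial size of the convolution kernel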

Cost of the standard convolution:

Each output value costs $D_K \times D_K \times M$ multiply-adds, and there are $D_G \times D_G \times N$ output values, so the total cost is $D_K \times D_K \times M \times N \times D_G \times D_G$.

Cost of the depthwise separable convolution:

depthwise conv: each output value costs $D_K \times D_K$ multiply-adds, and there are $D_G \times D_G \times M$ output values, so the total cost is $D_K \times D_K \times M \times D_G \times D_G$
pointwise conv: the cost is $1 \times 1 \times M \times N \times D_G \times D_G$

$$\frac{\text{cost of depthwise separable convolution}}{\text{cost of standard convolution}} = \frac{D_K \times D_K \times M \times D_G \times D_G + M \times N \times D_G \times D_G}{D_K \times D_K \times M \times N \times D_G \times D_G} = \frac{1}{N} + \frac{1}{D_K^2}$$

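MobileNet uses 3x3 depthwise separable convolutions, which need 8 to 9 times less computation than standard convolutions at the cost of only a small reduction in accuracy.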
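The 8 to 9 times figure can be checked by plugging numbers into the two cost formulas. The sketch below does this for one layer; the layer sizes ($M = 128$, $N = 256$, $D_G = 28$) are assumed values chosen for illustration, and the function names are hypothetical.

```python
def standard_conv_cost(d_k, m, n, d_g):
    """Multiply-adds of a standard convolution: D_K * D_K * M * N * D_G * D_G."""
    return d_k * d_k * m * n * d_g * d_g


def separable_conv_cost(d_k, m, n, d_g):
    """Multiply-adds of a depthwise conv plus a 1x1 pointwise conv."""
    depthwise = d_k * d_k * m * d_g * d_g
    pointwise = m * n * d_g * d_g
    return depthwise + pointwise


# Assumed layer: 3x3 kernel, 128 input channels, 256 output channels, 28x28 output.
d_k, m, n, d_g = 3, 128, 256, 28
ratio = separable_conv_cost(d_k, m, n, d_g) / standard_conv_cost(d_k, m, n, d_g)

print(f"cost ratio:        {ratio:.4f}")
print(f"1/N + 1/D_K^2:     {1 / n + 1 / d_k ** 2:.4f}")
print(f"reduction factor:  {1 / ratio:.2f}x")
```

Because $1/N$ is small for wide layers, the ratio is dominated by $1/D_K^2 = 1/9$, which is where the 8 to 9 times saving for 3x3 kernels comes from.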

1.3.2 Network Structure and Training

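MobileNet is built from depthwise separable convolutions; every layer except the first, which is a standard convolution, is a depthwise separable convolution. Each convolution layer is followed by a BatchNorm layer and a ReLU layer, and the network ends with a fully connected layer and a softmax layer.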

1.3.3 Width Multiplier: Thinner Models

The width multiplier $\alpha$ thins the network uniformly at every layer. It changes the number of input channels from $M$ to $\alpha M$ and the number of output channels from $N$ to $\alpha N$, so the cost becomes: $D_K \times D_K \times \alpha M \times D_G \times D_G + \alpha M \times \alpha N \times D_G \times D_G$
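How strongly $\alpha$ reduces the cost becomes clearer when the scaled cost is divided by the unscaled one. The short derivation below uses only the symbols already defined; the final approximation assumes $\alpha N \gg D_K^2$, an assumption added here that holds for wide layers.

```latex
\frac{D_K^2 \,\alpha M\, D_G^2 + \alpha M \,\alpha N\, D_G^2}
     {D_K^2 \,M\, D_G^2 + M\,N\, D_G^2}
= \alpha \cdot \frac{D_K^2 + \alpha N}{D_K^2 + N}
\approx \alpha^2
\qquad (\alpha N \gg D_K^2)
```

For example, with $\alpha = 0.5$, $N = 512$ and $D_K = 3$ the exact factor is $0.5 \cdot 265 / 521 \approx 0.254$, close to $\alpha^2 = 0.25$.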

1.3.4 Resolution Multiplier: Reduced Representation

The resolution multiplier $\rho$ changes the size of the input image and thereby the size of the feature map at every layer. It changes the feature map size from $D_G$ to $\rho D_G$, so the cost becomes: $D_K \times D_K \times M \times \rho D_G \times \rho D_G + M \times N \times \rho D_G \times \rho D_G$
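The two multipliers can be combined in one cost function. The sketch below is a hypothetical helper, not code from the paper; the layer sizes are assumed, and channel counts and resolutions are kept as floats rather than rounded to integers, which a real network would have to do.

```python
def scaled_separable_cost(d_k, m, n, d_g, alpha=1.0, rho=1.0):
    """Cost of one depthwise separable layer with width multiplier alpha
    and resolution multiplier rho applied."""
    am, an = alpha * m, alpha * n   # thinned input and output channels
    rd = rho * d_g                  # reduced feature map size
    return d_k * d_k * am * rd * rd + am * an * rd * rd


# Assumed layer: 3x3 kernel, 512 -> 512 channels, 14x14 feature map.
base = scaled_separable_cost(3, 512, 512, 14)
half_width = scaled_separable_cost(3, 512, 512, 14, alpha=0.5)
half_res = scaled_separable_cost(3, 512, 512, 14, rho=0.5)
both = scaled_separable_cost(3, 512, 512, 14, alpha=0.5, rho=0.5)

for name, cost in [("alpha=0.5", half_width), ("rho=0.5", half_res), ("both", both)]:
    print(f"{name:10s} relative cost: {cost / base:.3f}")
```

Since $\rho$ multiplies both spatial dimensions, it scales the cost by exactly $\rho^2$, while $\alpha$ scales it by roughly $\alpha^2$; applying both gives roughly $\alpha^2 \rho^2$.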

1.4 Experiments

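Table 4 shows that using depthwise separable convolutions costs only about 1% in accuracy, while requiring about 8.5 times less computation and shrinking the model to roughly one seventh of its size.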

2. MobileNet V2

2.1 Abstract

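We present MobileNetV2, an architecture based on an inverted residual structure in which the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses depthwise convolutions. We also find that removing the nonlinearities in the narrow layers is necessary to maintain the representational power of the model.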

2.2 Introduction

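Our main contribution is a novel layer module: the inverted residual with linear bottleneck. The module takes a low-dimensional compressed representation as input, first expands it to a high dimension and filters it with a depthwise convolution, and then projects it back to a low-dimensional representation with a linear layer.

This convolutional module is particularly suitable for mobile designs, because it significantly reduces the memory needed during inference by never fully materializing large intermediate tensors. This reduces the need for main memory access in many embedded hardware designs, which provide only small amounts of very fast software-controlled cache memory.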

2.3 Preliminaries, discussion and intuition

2.3.1 Depthwise Separable Convolutions

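Depthwise separable convolutions are a key building block of many efficient network architectures. The basic idea is to split a full convolution into two separate parts. The first part is a depthwise convolution, which performs lightweight filtering by applying a single convolutional filter to each input channel. The second part is a 1x1 convolution, called a pointwise convolution, which builds new features by computing linear combinations of the input channels.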

2.3.2 Linear Bottlenecks

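As the figure above shows, ReLU loses feature information when it is applied in low dimensions. For this reason, after the inverted residual block has expanded the features and projected them back down to a low dimension, it no longer applies the nonlinear ReLU as the activation function. This is the so-called linear bottleneck.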
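One way to get a feel for this loss is to embed 2D points into $n$ dimensions with a random matrix $T$, apply ReLU, and count how many points are mapped to the all-zero vector and so become indistinguishable. The sketch below is a rough stdlib-only illustration under these assumptions; it is not the experiment from the paper, and the point set, the seed and the function name `collapsed_fraction` are made up for this example.

```python
import math
import random


def collapsed_fraction(rows, points):
    """Fraction of 2D points x for which ReLU(T x) is the all-zero vector.

    rows holds the rows of T. A point collapses when every row gives a
    non-positive pre-activation, so ReLU erases it completely.
    """
    collapsed = sum(
        1 for (px, py) in points
        if all(a * px + b * py <= 0.0 for (a, b) in rows)
    )
    return collapsed / len(points)


rng = random.Random(0)  # fixed seed so the run is repeatable

# 1000 points evenly spaced on the unit circle stand in for a low-dimensional input.
points = [(math.cos(2 * math.pi * i / 1000), math.sin(2 * math.pi * i / 1000))
          for i in range(1000)]

# Rows of a random embedding matrix T; the first n rows give an n-dimensional embedding.
rows = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(30)]

fractions = {n: collapsed_fraction(rows[:n], points) for n in (2, 5, 30)}
for n, frac in fractions.items():
    print(f"embedding dim {n:2d}: {frac:.3f} of the points are mapped to zero")
```

Every extra row of $T$ can only rescue points, never collapse new ones, so the collapsed fraction cannot grow with the embedding dimension. That is the intuition for applying ReLU in the expanded high-dimensional space and keeping the narrow projection linear.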

2.3.3 Inverted residuals

2.4 Model Architecture

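We use ReLU6 as the nonlinearity because it is robust when used with low-precision computation.

Except for the first layer, we use a constant expansion rate throughout the network; in our main experiments the expansion factor is set to 6.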
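To see where the computation goes inside one inverted residual block, the multiply-adds of its three convolutions can be counted directly. This is a back-of-the-envelope sketch assuming stride 1 (so the spatial size stays $h \times w$); the function name and the example sizes are hypothetical.

```python
def inverted_residual_madds(h, w, c_in, c_out, t=6, k=3):
    """Multiply-adds of the three convolutions in one inverted residual block.

    t is the expansion factor and k the depthwise kernel size. Stride 1 is
    assumed, so every layer works on an h x w feature map.
    """
    hidden = t * c_in  # channels after the 1x1 expansion
    return {
        "expand_1x1": h * w * c_in * hidden,
        "depthwise_kxk": h * w * hidden * k * k,
        "project_1x1": h * w * hidden * c_out,
    }


# Assumed block: 56x56 feature map, 24 -> 24 channels, expansion factor 6.
costs = inverted_residual_madds(h=56, w=56, c_in=24, c_out=24)
total = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:14s} {cost:>12,d}  ({cost / total:.1%})")
print(f"{'total':14s} {total:>12,d}")
```

The three terms add up to $h \cdot w \cdot c_{in} \cdot t \cdot (c_{in} + k^2 + c_{out})$: the depthwise convolution on the wide hidden tensor is cheap, and most of the cost sits in the two 1x1 convolutions.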

2.5 PyTorch Implementation

import torch
import torch.nn as nn


class ConvBNReLu(nn.Sequential):
    """Conv2d -> BatchNorm2d -> ReLU6, padded so that stride 1 keeps the spatial size."""

    def __init__(self, in_channel, out_channel, kernel_size=3, stride=1, groups=1):
        padding = (kernel_size - 1) // 2
        super(ConvBNReLu, self).__init__(
            nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=kernel_size,
                      stride=stride, padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True),
        )


class InvertedResidual(nn.Module):
    def __init__(self, in_channel, out_channel, stride, expand_ratio):
        super(InvertedResidual, self).__init__()
        hidden_channel = in_channel * expand_ratio
        # The shortcut is only used when input and output have the same shape
        self.use_shortcut = stride == 1 and in_channel == out_channel

        layers = []
        if expand_ratio != 1:
            # 1x1 pointwise conv (expansion)
            layers.append(ConvBNReLu(in_channel, hidden_channel, kernel_size=1))
        layers.extend([
            # 3x3 depthwise conv: groups equals the number of channels
            ConvBNReLu(hidden_channel, hidden_channel, stride=stride, groups=hidden_channel),
            # 1x1 pointwise conv (linear): no activation after the projection
            nn.Conv2d(in_channels=hidden_channel, out_channels=out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        ])
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        if self.use_shortcut:
            return x + self.conv(x)
        else:
            return self.conv(x)


class MobileNetV2(nn.Module):
    def __init__(self, num_classes=1000):
        super(MobileNetV2, self).__init__()
        block = InvertedResidual
        input_channel = 32
        last_channel = 1280

        # t: expansion factor, c: output channels, n: number of repeats, s: stride of the first repeat
        inverted_residual_setting = [
            # t, c, n, s
            [1, 16, 1, 1],
            [6, 24, 2, 2],
            [6, 32, 3, 2],
            [6, 64, 4, 2],
            [6, 96, 3, 1],
            [6, 160, 3, 2],  # was 169 in the original listing; the MobileNetV2 setting is 160
            [6, 320, 1, 1],
        ]

        features = []
        features.append(ConvBNReLu(in_channel=3, out_channel=input_channel, stride=2))
        for t, c, n, s in inverted_residual_setting:
            for i in range(n):
                # Only the first block of each group downsamples
                stride = s if i == 0 else 1
                features.append(block(in_channel=input_channel, out_channel=c, stride=stride, expand_ratio=t))
                input_channel = c
        features.append(ConvBNReLu(in_channel=input_channel, out_channel=last_channel, kernel_size=1))

        self.features = nn.Sequential(*features)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),
            nn.Linear(last_channel, num_classes),
        )

        # weight init
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x

3. MobileNet V3

3.1 Efficient Mobile Building Blocks

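Mobile models have been built on increasingly efficient building blocks. MobileNetV1 introduced depthwise separable convolutions as an efficient replacement for traditional convolution layers. Depthwise separable convolutions effectively factorize a traditional convolution by separating spatial filtering from the feature generation mechanism. They are defined by two separate layers: a lightweight depthwise convolution for spatial filtering and a 1x1 pointwise convolution for feature generation.

MobileNetV2 introduced the linear bottleneck and inverted residual structure to make layer structures more efficient by exploiting the low-rank nature of the problem. This structure, shown in the figure below, is defined by a 1x1 expansion convolution followed by a depthwise convolution and a 1x1 projection layer. The input and output are connected with a residual connection if and only if they have the same number of channels. The structure keeps a compact representation at the input and output while expanding internally to a higher-dimensional feature space to increase the expressiveness of the nonlinear per-channel transformations.

MnasNet builds on the MobileNetV2 structure by introducing lightweight attention modules based on squeeze-and-excitation (SE) into the bottleneck structure. Note that the SE module is integrated in a different location than in the ResNet-based modules proposed in SENet. The module is placed after the depthwise filters in the expansion so that attention is applied to the largest representation, as shown in the figure below.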

[Figure: structure of the SE module]

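For MobileNetV3, we use a combination of these layers as building blocks in order to build the most effective models. The layers are also upgraded with modified swish nonlinearities. Both SE and the swish nonlinearity use the sigmoid, which is inefficient to compute and hard to keep accurate in fixed-point arithmetic, so we replace it with the hard sigmoid.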

3.2 Network Improvements

3.2.1 Redesigning Expensive Layers

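The first modification reworks how the last few layers of the network interact in order to produce the final features more efficiently. Current models based on MobileNetV2's inverted bottleneck structure and its variants use a 1x1 convolution as a final layer to expand to a higher-dimensional feature space. This layer is critically important for having rich features for prediction, but it comes at the cost of extra latency. To reduce latency and preserve the high-dimensional features, we move this layer past the final average pooling. The final set of features is now computed at 1x1 spatial resolution instead of 7x7. As a result of this design choice, computing the features becomes nearly free in terms of computation and latency.

Once the cost of this feature generation layer has been mitigated, the previous bottleneck projection layer is no longer needed to reduce computation. This observation lets us remove the projection and filtering layers in the previous bottleneck layer, further reducing computational complexity. The original and optimized last stages can be seen in the figure below. The efficient last stage reduces latency by 7 milliseconds, which is 11% of the running time, and reduces the number of operations by 30 million MAdds with almost no loss of accuracy.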
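The effect of moving a 1x1 convolution past the average pooling can be seen with a one-line cost formula. In the sketch below the channel counts (960 to 1280) are borrowed from the large model's last stage in the implementation in section 3.4 and serve only as an illustration; the helper name is hypothetical, and the numbers are not meant to reproduce the 30 million MAdds quoted above, which also include the removed projection and filtering layers.

```python
def conv1x1_madds(h, w, c_in, c_out):
    """Multiply-adds of a 1x1 convolution on an h x w feature map."""
    return h * w * c_in * c_out


# Same layer, evaluated before pooling (7x7) and after pooling (1x1).
before_pool = conv1x1_madds(7, 7, 960, 1280)
after_pool = conv1x1_madds(1, 1, 960, 1280)

print(f"at 7x7: {before_pool:,d} multiply-adds")
print(f"at 1x1: {after_pool:,d} multiply-adds")
print(f"saving factor: {before_pool // after_pool}x")
```

The cost of a 1x1 convolution is proportional to the number of spatial positions, so running it on the pooled 1x1 map instead of the 7x7 map divides its cost by 49.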

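Another expensive layer is the initial set of filters. Current mobile models tend to use 32 filters in a full 3x3 convolution to build the initial filter bank for edge detection. These filters are often mirror images of each other. We experimented with reducing the number of filters and using different nonlinearities to try to reduce redundancy. We settled on the hard swish nonlinearity for this layer, as it performed as well as the other nonlinearities tested. We were able to reduce the number of filters to 16 while maintaining the same accuracy as 32 filters using either ReLU or swish. This saves an additional 2 milliseconds and 10 million MAdds.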

3.2.2 Nonlinearities

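A nonlinearity called swish was introduced that significantly improves the accuracy of neural networks when used as a drop-in replacement for ReLU. The nonlinearity is defined as $\text{swish}(x) = x \cdot \sigma(x)$

While this nonlinearity improves accuracy, it comes with a non-zero cost in embedded environments, because the sigmoid function is much more expensive to compute on mobile devices. We deal with this problem in two ways:

First, we replace the sigmoid function with its piecewise linear hard analog $\frac{\text{ReLU6}(x+3)}{6}$; the minor difference is that we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes $\text{h-swish}[x] = x \frac{\text{ReLU6}(x+3)}{6}$
Second, the cost of applying a nonlinearity decreases as we go deeper into the network, because the activation memory of each layer typically halves every time the resolution drops. Incidentally, we find that most of the benefit of swish is realized by using it only in the deeper layers. In our architectures we therefore use h-swish only in the second half of the model.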
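The definitions above translate directly into scalar Python, which makes it easy to compare swish with its hard version. This is a minimal sketch using only the standard library; the function names and the sampling grid are chosen for this example.

```python
import math


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_sigmoid(x):
    """Piecewise linear stand-in for the sigmoid: ReLU6(x + 3) / 6."""
    return relu6(x + 3.0) / 6.0


def swish(x):
    return x * sigmoid(x)


def h_swish(x):
    return x * hard_sigmoid(x)


# Compare the two on a grid from -6 to 6 in steps of 0.1.
xs = [i / 10.0 for i in range(-60, 61)]
max_gap = max(abs(swish(x) - h_swish(x)) for x in xs)

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h-swish={h_swish(x):+.4f}")
print(f"largest gap on the grid: {max_gap:.4f}")
```

On this grid the two functions never differ by more than roughly 0.14 (reached at $x = \pm 3$, where the clipping starts), while h-swish needs only an addition, a clamp and a multiplication instead of an exponential.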

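3.2.3 Large squeeze-and-excite

In MnasNet, the size of the squeeze-and-excite bottleneck was relative to the size of the convolutional bottleneck. Instead, we fix them all to 1/4 of the number of channels in the expansion layer. We find that doing so increases accuracy at the cost of a modest increase in the number of parameters and with no discernible latency cost.

3.2.4 MobileNetV3 Definitions: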

3.4 PyTorch Implementation

import torch
import torch.nn as nn


class HardSwish(nn.Module):
    """h-swish(x) = x * ReLU6(x + 3) / 6"""

    def __init__(self, inplace=True):
        super(HardSwish, self).__init__()
        self.relu6 = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        # x + 3 creates a new tensor, so the in-place ReLU6 does not modify x
        return x * self.relu6(x + 3) / 6


class ConvBNActivation(nn.Sequential):
    """Conv2d -> BatchNorm2d -> activation ('relu' selects ReLU6, anything else h-swish)."""

    def __init__(self, in_channel, out_channel, kernel_size, stride, groups, activate):
        padding = (kernel_size - 1) // 2
        super(ConvBNActivation, self).__init__(
            nn.Conv2d(in_channels=in_channel, out_channels=out_channel, kernel_size=kernel_size,
                      stride=stride, padding=padding, groups=groups, bias=False),
            nn.BatchNorm2d(out_channel),
            nn.ReLU6(inplace=True) if activate == 'relu' else HardSwish(),
        )


class SqueezeAndExcite(nn.Module):
    def __init__(self, in_channel, out_channel, divide=4):
        super(SqueezeAndExcite, self).__init__()
        # The SE bottleneck is fixed to 1/4 of the expansion channels
        mid_channel = in_channel // divide
        self.pool = nn.AdaptiveAvgPool2d((1, 1))
        self.SEblock = nn.Sequential(
            nn.Linear(in_features=in_channel, out_features=mid_channel),
            nn.ReLU6(inplace=True),
            nn.Linear(in_features=mid_channel, out_features=out_channel),
            # The gate is a hard sigmoid, ReLU6(x + 3) / 6 (the original listing used HardSwish here)
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        b, c, h, w = x.size()
        out = self.pool(x)                     # squeeze: global average pooling
        out = torch.flatten(out, start_dim=1)
        out = self.SEblock(out)                # excite: per-channel weights in [0, 1]
        out = out.view(b, c, 1, 1)
        return out * x


class SEInverteBottleneck(nn.Module):
    def __init__(self, in_channel, mid_channel, out_channel, kernel_size, use_se, activate, stride):
        super(SEInverteBottleneck, self).__init__()
        self.use_shortcut = stride == 1 and in_channel == out_channel
        self.use_se = use_se

        # 1x1 expansion conv
        self.conv = ConvBNActivation(in_channel=in_channel, out_channel=mid_channel, kernel_size=1,
                                     stride=1, groups=1, activate=activate)
        # depthwise conv
        self.depth_conv = ConvBNActivation(in_channel=mid_channel, out_channel=mid_channel,
                                           kernel_size=kernel_size, stride=stride, groups=mid_channel,
                                           activate=activate)
        if self.use_se:
            self.SEblock = SqueezeAndExcite(in_channel=mid_channel, out_channel=mid_channel)
        # 1x1 projection conv, kept linear (no activation) as in the linear bottleneck;
        # the original listing applied the block's activation here
        self.point_conv = nn.Sequential(
            nn.Conv2d(in_channels=mid_channel, out_channels=out_channel, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channel),
        )

    def forward(self, x):
        out = self.conv(x)
        out = self.depth_conv(out)
        if self.use_se:
            out = self.SEblock(out)
        out = self.point_conv(out)
        if self.use_shortcut:
            return x + out
        return out


class MobileNetV3(nn.Module):
    def __init__(self, num_classes=1000, type='large'):
        super(MobileNetV3, self).__init__()
        self.type = type

        self.first_conv = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(16),
            HardSwish(),
        )

        if self.type == 'large':
            self.large_bottleneck = nn.Sequential(
                SEInverteBottleneck(in_channel=16, mid_channel=16, out_channel=16, kernel_size=3,
                                    use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=16, mid_channel=64, out_channel=24, kernel_size=3,
                                    use_se=False, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=24, mid_channel=72, out_channel=24, kernel_size=3,
                                    use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=24, mid_channel=72, out_channel=40, kernel_size=5,
                                    use_se=True, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=40, kernel_size=5,
                                    use_se=True, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=40, kernel_size=5,
                                    use_se=True, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=80, kernel_size=3,
                                    use_se=False, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=80, mid_channel=200, out_channel=80, kernel_size=3,
                                    use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=184, out_channel=80, kernel_size=3,
                                    use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=184, out_channel=80, kernel_size=3,
                                    use_se=False, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=80, mid_channel=480, out_channel=112, kernel_size=3,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=112, mid_channel=672, out_channel=112, kernel_size=3,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=112, mid_channel=672, out_channel=160, kernel_size=5,
                                    use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=160, mid_channel=960, out_channel=160, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=160, mid_channel=960, out_channel=160, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
            )
            # Efficient last stage: the 1x1 conv to 1280 channels runs after the average pooling
            self.large_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=160, out_channels=960, kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(960),
                HardSwish(),
                nn.AdaptiveAvgPool2d((1, 1)),
                nn.Conv2d(in_channels=960, out_channels=1280, kernel_size=1, stride=1, bias=False),
                HardSwish(),
            )
        else:
            self.small_bottleneck = nn.Sequential(
                SEInverteBottleneck(in_channel=16, mid_channel=16, out_channel=16, kernel_size=3,
                                    use_se=True, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=16, mid_channel=72, out_channel=24, kernel_size=3,
                                    use_se=False, activate='relu', stride=2),
                SEInverteBottleneck(in_channel=24, mid_channel=88, out_channel=24, kernel_size=3,
                                    use_se=False, activate='relu', stride=1),
                SEInverteBottleneck(in_channel=24, mid_channel=96, out_channel=40, kernel_size=5,
                                    use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=40, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=240, out_channel=40, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=40, mid_channel=120, out_channel=48, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=48, mid_channel=144, out_channel=48, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=48, mid_channel=288, out_channel=96, kernel_size=5,
                                    use_se=True, activate='hswish', stride=2),
                SEInverteBottleneck(in_channel=96, mid_channel=576, out_channel=96, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
                SEInverteBottleneck(in_channel=96, mid_channel=576, out_channel=96, kernel_size=5,
                                    use_se=True, activate='hswish', stride=1),
            )
            # Note: the paper uses 1024 channels here for the small model; this code keeps 1280
            # so that both variants share the same classifier
            self.small_last_stage = nn.Sequential(
                nn.Conv2d(in_channels=96, out_channels=576, kernel_size=1, stride=1, bias=False),
                nn.BatchNorm2d(576),
                HardSwish(),
                nn.AdaptiveAvgPool2d((1, 1)),
                nn.Conv2d(in_channels=576, out_channels=1280, kernel_size=1, stride=1, bias=False),
                HardSwish(),
            )

        self.classifier = nn.Sequential(
            nn.Dropout(p=0.2),
            nn.Linear(in_features=1280, out_features=num_classes),
        )

        # weight init
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.ones_(m.weight)
                nn.init.zeros_(m.bias)
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, mean=0, std=0.01)
                nn.init.zeros_(m.bias)

    def forward(self, x):
        x = self.first_conv(x)
        if self.type == 'large':
            x = self.large_bottleneck(x)
            x = self.large_last_stage(x)
        else:
            x = self.small_bottleneck(x)
            x = self.small_last_stage(x)
        x = torch.flatten(x, start_dim=1)
        x = self.classifier(x)
        return x
