0. 引言

0.1 代码来源

代码来源：https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Detection/SSD

0.2 代码改动

NVIDIA复现的代码中有很多前沿的新技术（tricks），比如NVIDIA DALI模块，该模块可以加速数据的读取和预处理。注意，虽然NVIDIA复现了该代码，但相关人员对代码进行了修改。

The SSD300 v1.1 model is based on the SSD: Single Shot MultiBox Detector paper, which describes SSD as “a method for detecting objects in images using a single deep neural network". The input size is fixed to 300x300.

输入依然是300×300

The main difference between this model and the one described in the paper is in the backbone. Specifically, the VGG model is obsolete and is replaced by the ResNet-50 model.

没有继续使用论文中VGG-16作为backbone，而是使用ResNet-50。

From the Speed/accuracy trade-offs for modern convolutional object detectors paper, the following enhancements were made to the backbone:

The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
All strides in conv4_x are set to 1x1.

移除ResNet-50最后一个残差结构(conv5_x)以及后面的结构

对于conv4_x的第一个残差结构，将其stride都修改为1×1大小 —— 这个残差结构里面有6个block，只有第一个block会对shape进行下采样，而我们只需要修改第一个残差结构的卷积核大小和步距为1 —— 那么特征图在经过这个残差结构后，特征图的shape不会发生变化（不影响通道数，因为通道数=卷积核的个数）

之前在SSD理论上说过，经过backbone之后会让特征图经过一系列的卷积从而生成不同感受野的预测特征图。
输入图像经过backbone之后会生成第一个预测特征图

Detector heads are similar to the ones referenced in the paper, however, they are enhanced by additional BatchNorm layers after each convolution.

检测器和原论文基本上一样，不同的是在每一个卷积层都引入了额外的BN层（在原论文的中，生成预测特征图的卷积是没有使用BN的）

Additionally, we removed weight decay on every bias parameter and all the BatchNorm layer parameters as described in the Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes paper.

此外，我们删除了每个偏差参数和所有 BatchNorm 层参数的权重衰减，如具有混合精度的高度可扩展深度学习训练系统：四分钟内训练 ImageNet 论文中所述。

Training of SSD requires computational costly augmentations. To fully utilize GPUs during training we are using the NVIDIA DALI library to accelerate data preparation pipelines.

SSD 的训练需要计算成本高昂的增强。为了在训练期间充分利用 GPU，我们正在使用 NVIDIA DALI 库来加速数据预处理流程。
我们训练网络的时候基本上使用CPU对数据进行读取和预处理，然后再将数据传入GPUs进行模型的训练。
但这种方法有一个问题：训练模型的时候GPUs的利用率很低。出现这种问题的主要原因是数据的读取和预处理太慢了。CPU处理数据太慢了，而GPUs处理数据又太快了——GPUs处理好了这个batch的数据，而CPU还没准备好下一个batch的数据 —— GPUs空闲 -> 导致GPUs利用率低
针对这个问题，NVIDIA提供了NVIDIA DALI包 —— 让GPUs处理数据

This model is trained with mixed precision using Tensor Cores on Volta, Turing, and the NVIDIA Ampere GPU architectures. Therefore, researchers can get results 2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

该模型使用 Volta、Turing 和 NVIDIA Ampere GPU 架构上的 Tensor Cores 以混合精度进行训练。因此，研究人员可以获得比没有 Tensor Cores 的训练快 2 倍的结果，同时体验混合精度训练的好处。该模型针对每个 NGC 每月容器版本进行测试，以确保随着时间的推移保持一致的准确性和性能。

0.3 代码使用注意事项

0.3.1 环境配置：

Python 3.6/3.7/3.8
Pytorch 1.7.1
pycocotools(Linux:pip install pycocotools; Windows:pip install pycocotools-windows(不需要额外安装vs))
Ubuntu或Centos(不建议Windows)
最好使用GPU训练

0.3.2 文件结构：

├── src: 实现SSD模型的相关模块
│     ├── resnet50_backbone.py   使用resnet50网络作为SSD的backbone
│     ├── ssd_model.py           SSD网络结构文件
│     └── utils.py               训练过程中使用到的一些功能实现
├── train_utils: 训练验证相关模块（包括cocotools）
├── my_dataset.py: 自定义dataset用于读取VOC数据集
├── train_ssd300.py: 以resnet50做为backbone的SSD网络进行训练
├── train_multi_GPU.py: 针对使用多GPU的用户使用
├── predict_test.py: 简易的预测脚本，使用训练好的权重进行预测测试
├── pascal_voc_classes.json: pascal_voc标签文件
├── plot_curve.py: 用于绘制训练过程的损失以及验证集的mAP
└── validation.py: 利用训练好的权重验证/测试数据的COCO指标，并生成record_mAP.txt文件

0.3.3 预训练权重下载地址（下载后放入src文件夹中）：

ResNet50+SSD: https://ngc.nvidia.com/catalog/models
搜索ssd -> 找到SSD for PyTorch(FP32) -> download FP32 -> 解压文件
如果找不到可通过百度网盘下载，链接:https://pan.baidu.com/s/1byOnoNuqmBLZMDA0-lbCMQ 提取码:iggj

0.3.4 数据集，本例程使用的是PASCAL VOC2012数据集(下载后放入项目当前文件夹中)

Pascal VOC2012 train/val数据集下载地址：http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
Pascal VOC2007 test数据集请参考：http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar

Q：为什么不用MS COCO数据集
A：因为COCO数据集太大了，有几十个G -> 训练时间太长太昂贵

0.3.5 训练方法

确保提前准备好数据集
确保提前下载好对应预训练模型权重
单GPU训练或CPU，直接使用train_ssd300.py训练脚本
若要使用多GPU训练，使用 python -m torch.distributed.launch --nproc_per_node=8 --use_env train_multi_GPU.py 指令

nproc_per_node参数为使用GPU数量

0.3.6 训练结果展示

因为多次训练花费的时间太多（训练集为VOC 2012），所以这里只训练了5个epoch，结果如下：

Accumulating evaluation results...
DONE (t=2.50s).
IoU metric: bboxAverage Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.295Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.567Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.273Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.051Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.158Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.355Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.323Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.449Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.459Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.112Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.311Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.520
successful save loss curve!
successful save mAP curve!

SSD算法和Faster R-CNN检测精度差不多。对于小的数据集，Faster R-CNN的检测精度应该比SSD高；当数据集很大时，Faster R-CNN和SSD的检测精度是差不多的。
SSD的检测速度比Faster R-CNN要快很多

Faster R-CNN在单张GPU上大概每秒能检测6~7张图片

SSD在单张GPU上大概每秒能检测50~60张图片

Note：

相比使用了FPN（特征金字塔）的Faster R-CNN而言，SSD的检测精度要差很多

但后面有一个模型是SSD+FPN，名为RetinaNet -> 使用ResNet和FPN作为检测模型的backbone，相比使用了FPN的Faster R-CNN而言，检测精度差不多，但速度提升很多

对于我们真正的生产环境，如果真的要使用目标检测模型其实还需要做很多后续的工作。
+ 模式使用GPU进行训练，需将模型转换为TensorRT的格式（检测速度还会提升）

1. SSD代码

1.1 框架示意图

1.2 SSD网络的搭建 —— `ssd_model.py`

1.2.1 修改backbone（ResNet-50）

class Backbone(nn.Module):def __init__(self, pretrain_path=None):super(Backbone, self).__init__()net = resnet50()self.out_channels = [1024, 512, 512, 256, 256, 256]  # 对应每一个预测特征图的channelsif pretrain_path is not None:net.load_state_dict(torch.load(pretrain_path))# 构建backbone（截止到conv4_x）"""net.children():net: 实例化的ResNet-50网络children()：网络下一层所有的子模块（简单理解为，只要是`nn.xxx`构建的层结构都可以认为是它的子模块）*list(net.children())[:7]先将net.children()得到的子模块使用list接收对list进行切片，只取前7个子模块（0, 1, 2, ..., 6）① conv1; ② bn1; ③ relu; ④ maxpool; ⑤ layer1; ⑥ layer2; ⑦ layer3再将list进行解包送给nn.Sequential完成网络的重构（后面的就都不要了）"""self.feature_extractor = nn.Sequential(*list(net.children())[:7])# 修改Conv4_x中的第一个blockconv4_block1 = self.feature_extractor[-1][0]  # -1为最后一个子模块， 0为第一个block# 修改conv4_block1的步距，从2->1conv4_block1.conv1.stride = (1, 1)  # 1×1卷积的步距  对于ResNet-50没必要，但对于ResNet-18/34是有必要的conv4_block1.conv2.stride = (1, 1)  # 3×3卷积的步距conv4_block1.downsample[0].stride = (1, 1)  # 捷径路径上的步距def forward(self, x):x = self.feature_extractor(x)return x

1.2.2 搭建SSD网络

class SSD300(nn.Module):def __init__(self, backbone=None, num_classes=21):super(SSD300, self).__init__()if backbone is None:raise Exception("backbone is None")if not hasattr(backbone, "out_channels"):raise Exception("the backbone not has attribute: out_channel")self.feature_extractor = backboneself.num_classes = num_classes# 构建后面一系列的特征提取层（为了生成不同的预测特征图）# out_channels = [1024, 512, 512, 256, 256, 256] for resnet50self._build_additional_features(self.feature_extractor.out_channels)"""定义预测器：① box回归的预测参数； ② 预测的confidence分数"""self.num_defaults = [4, 6, 6, 6, 4, 4]  # 每一个Default box的scale（即box的生成数量） —— 在每个预测特征图上生成多少个Default Boxlocation_extractors = []  # 存储回归预测器confidence_extractors = []  # 存储置信度预测器# out_channels = [1024, 512, 512, 256, 256, 256] for resnet50for nd, oc in zip(self.num_defaults, self.feature_extractor.out_channels):# nd is number_default_boxes, oc is output_channellocation_extractors.append(nn.Conv2d(oc, nd * 4, kernel_size=3, padding=1))  # ① box回归的预测参数confidence_extractors.append(nn.Conv2d(oc, nd * self.num_classes, kernel_size=3, padding=1))  # ② 预测的confidence分数self.loc = nn.ModuleList(location_extractors)self.conf = nn.ModuleList(confidence_extractors)# 权重初始化self._init_weights()default_box = dboxes300_coco()self.compute_loss = Loss(default_box)self.encoder = Encoder(default_box)self.postprocess = PostProcess(default_box)def _build_additional_features(self, input_size):"""为backbone(resnet50)添加额外的一系列卷积层，得到相应的一系列特征提取器:param input_size::return:"""additional_blocks = []# input_size = [1024, 512, 512, 256, 256, 256] for resnet50middle_channels = [256, 256, 128, 128, 128]  # 对应5个额外添加层结构的输入通道数"""input_size[:-1]: [1024, 512, 512, 256, 256]  左闭右开，最后一个索引不会取到input_size[1:]：[512, 512, 256, 256, 256]  第一个不会取到"""for i, (input_ch, output_ch, middle_ch) in enumerate(zip(input_size[:-1], input_size[1:], middle_channels)):padding, stride = (1, 2) if i < 3 else (0, 1)  # 前三个索引,padding=1, stride=2; 最后两个索引：padding=0, stride=1layer = nn.Sequential(nn.Conv2d(input_ch, middle_ch, kernel_size=1, bias=False),  # 因为使用了BN层，所以将bias设置为Falsenn.BatchNorm2d(middle_ch),nn.ReLU(inplace=True),# 迭代结果不同就是这个卷积的stride和paddingnn.Conv2d(middle_ch, output_ch, kernel_size=3, padding=padding, stride=stride, bias=False),  # 因为使用了BN层，所以将bias设置为Falsenn.BatchNorm2d(output_ch),nn.ReLU(inplace=True),)additional_blocks.append(layer)self.additional_blocks = nn.ModuleList(additional_blocks)def _init_weights(self):layers = [*self.additional_blocks, *self.loc, *self.conf]for layer in layers:for param in layer.parameters():if param.dim() > 1:nn.init.xavier_uniform_(param)# Shape the classifier to the view of bboxesdef bbox_view(self, features, loc_extractor, conf_extractor):"""通过bbox_view()这个方法得到所有预测特征图上预测的回归参数和置信度得分参数：detection_features：预测特征图的listself.loc: 回归预测器self.conf: 置信度预测器返回值：locs： 每个box的回归参数confs：每个box的类别（置信度）分数"""locs = []confs = []for f, l, c in zip(features, loc_extractor, conf_extractor):# [batch, n*4, feat_size, feat_size] -> [batch, 4, -1] = [BS, 4, n*feat_size*feat_size]# [BS, 4, n*feat_size*feat_size] = [BS, 4个回归参数, 当前预测特征图所有的Default box数量]# l(f) -> 得到location回归参数locs.append(l(f).view(f.size(0), 4, -1))# [batch, n*classes, feat_size, feat_size] -> [batch, classes, -1] -> [BS, classes, n*feat_size*feat_size]# c(f) -> 得到confidence参数confs.append(c(f).view(f.size(0), self.num_classes, -1))"""使用view()方法时并不会改变原始数据存储的方式，所以需要调用contiguous()来将数据调整为存储连续的tensor"""locs, confs = torch.cat(locs, 2).contiguous(), torch.cat(confs, 2).contiguous()return locs, confsdef forward(self, image, targets=None):# image：打包好的一批图片数据x = self.feature_extractor(image)  # [1024, 38, 38]# Feature Map 38x38x1024, 19x19x512, 10x10x512, 5x5x256, 3x3x256, 1x1x256detection_features = torch.jit.annotate(List[Tensor], [])  # [x]  # 存储每一个预测特征图的listdetection_features.append(x)  # Feature Map1添加到list中# 遍历得到Feature Map2~6for layer in self.additional_blocks:x = layer(x)detection_features.append(x)"""通过bbox_view()这个方法得到所有预测特征图上预测的回归参数和置信度得分参数：detection_features：预测特征图的listself.loc: 回归预测器self.conf: 置信度预测器返回值：locs： 每个box的回归参数confs：每个box的类别（置信度）分数"""# Feature Map 38x38x4, 19x19x6, 10x10x6, 5x5x6, 3x3x4, 1x1x4locs, confs = self.bbox_view(detection_features, self.loc, self.conf)# For SSD 300, shall return nbatch x 8732 x {nlabels, nlocs} results# 38x38x4 + 19x19x6 + 10x10x6 + 5x5x6 + 3x3x4 + 1x1x4 = 8732"""如果是训练模式，则会进一步计算损失（不进行后处理）如果是eval模式，直接进行后处理（不进行损失计算），得到最终的结果"""if self.training:if targets is None:raise ValueError("In training mode, targets should be passed")# bboxes_out (Tensor 8732 x 4), labels_out (Tensor 8732)bboxes_out = targets['boxes']bboxes_out = bboxes_out.transpose(1, 2).contiguous()# print(bboxes_out.is_contiguous())labels_out = targets['labels']# print(labels_out.is_contiguous())# ploc, plabel, gloc, glabelloss = self.compute_loss(locs, confs, bboxes_out, labels_out)return {"total_losses": loss}# 将预测回归参数叠加到default box上得到最终预测box，并执行非极大值抑制虑除重叠框# results = self.encoder.decode_batch(locs, confs)results = self.postprocess(locs, confs)return results

1.3 Default Box的生成

def dboxes300_coco():figsize = 300  # 输入网络的图像大小feat_size = [38, 19, 10, 5, 3, 1]   # 每个预测层的feature map尺寸steps = [8, 16, 32, 64, 100, 300]   # 每个特征层上的一个cell在原图上的跨度# use the scales here: https://github.com/amdegroot/ssd.pytorch/blob/master/data/config.pyscales = [21, 45, 99, 153, 207, 261, 315]  # 每个特征层上预测的default box的scaleaspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]  # 每个预测特征层上预测的default box的ratiosdboxes = DefaultBoxes(figsize, feat_size, steps, scales, aspect_ratios)return dboxesclass DefaultBoxes(object):"""这个类专门用来生成Default Box参数：fig_size：输入网络的图片大小 -> 300feat_size：每个预测特征图的尺寸(6个预测特征图 -> [38, 19, 10, 5, 3, 1])step: 每个在预测特征图上的cell在原图上的跨度 -> [8, 16, 32, 64, 100, 300]scales: 就是理论部分讲的scale -> [21, 45, 99, 153, 207, 261, 315]aspect_ratio: 每个预测特征图所使用的比例 -> [[2], [2, 3], [2, 3], [2, 3], [2], [2]]有6个预测特征图，所以这个大的list有6个元素为了方便就没有把1:1这个比例写进去scale_xy, scale_wh: 在论文中没有提到，但在源码是有这两个参数的。 可以理解为是一个trick，在损失函数部分会进一步讲解"""def __init__(self, fig_size, feat_size, steps, scales, aspect_ratios, scale_xy=0.1, scale_wh=0.2):self.fig_size = fig_size   # 输入网络的图像大小 300# [38, 19, 10, 5, 3, 1]self.feat_size = feat_size  # 每个预测层的feature map尺寸self.scale_xy_ = scale_xyself.scale_wh_ = scale_wh# According to https://github.com/weiliu89/caffe# Calculation method slightly different from paper# [8, 16, 32, 64, 100, 300]self.steps = steps    # 每个特征层上的一个cell在原图上的跨度# [21, 45, 99, 153, 207, 261, 315]self.scales = scales  # 每个特征层上预测的default box的scalefk = fig_size / np.array(steps)     # 计算每层特征层的fk# [[2], [2, 3], [2, 3], [2, 3], [2], [2]]self.aspect_ratios = aspect_ratios  # 每个预测特征层上预测的default box的ratiosself.default_boxes = []  # 存储后面生成的Default box的坐标信息# size of feature and number of feature# 遍历每层特征层，计算default boxfor idx, sfeat in enumerate(self.feat_size):sk1 = scales[idx] / fig_size  # scale转为相对值[0-1]sk2 = scales[idx + 1] / fig_size  # scale转为相对值[0-1]sk3 = sqrt(sk1 * sk2)# 先添加两个1:1比例的default box宽和高"""添加了2个尺度的Default box尺度为sk1，比例为sk1:sk1 -> 1:1的Default box尺度为sk3，比例为sk3:sk3 -> 1:1的Default box"""all_sizes = [(sk1, sk1), (sk3, sk3)]# 再将剩下不同比例的default box宽和高添加到all_sizes中for alpha in aspect_ratios[idx]:w, h = sk1 * sqrt(alpha), sk1 / sqrt(alpha)  # sk1 * 根号2 / sk1 /根号2 = 2all_sizes.append((w, h))  # 2: 1all_sizes.append((h, w))  # 1: 2# 计算当前特征层对应原图上的所有default boxfor w, h in all_sizes:""">>> import itertools>>> for i, j in itertools.product(range(3), repeat=2):\... print((i, j))... (0, 0)  # 第0行的第0个cell(0, 1)  # 第0行的第1个cell(0, 2)  # 第0行的第2个cell(1, 0)  # 第1行的第0个cell(1, 1)  # 第1行的第1个cell(1, 2)(2, 0)(2, 1)(2, 2)"""# 遍历每一个预测特征图的坐标for i, j in itertools.product(range(sfeat), repeat=2):  # i -> 行（y）， j -> 列（x）# 计算每个default box的中心坐标（范围是在0-1之间）"""Q：为什么i和j要加0.5A：这是因为i和j说白了就是预测特征图上每一个像素点的坐标，像素点长为1宽也为1（这就是一个cell），这个没有什么问题我们想要知道这个cell的中心坐标，而在图像中，坐标原点是在左上角，横坐标向右为正，纵坐标向下为正所以i+0.5就是这个cell的行坐标，j+0.5就是这个cell的纵坐标"""# fk是每个预测特征图的feature sizecx, cy = (j + 0.5) / fk[idx], (i + 0.5) / fk[idx]self.default_boxes.append((cx, cy, w, h))  # 中心点坐标、宽高都是相对坐标"""为什么要转换为相对值？因为转换为相对值之后，这个坐标就可以应用到不同的预测特征图，或者原图上都行"""# 将default_boxes转为tensor格式self.dboxes = torch.as_tensor(self.default_boxes, dtype=torch.float32)  # 这里不转类型会报错self.dboxes.clamp_(min=0, max=1)  # 将坐标（x, y, w, h）都限制在0-1之间# For IoU calculation# ltrb is left top coordinate and right bottom coordinate# 将(x, y, w, h)转换成(xmin, ymin, xmax, ymax)，方便后续计算IoU(匹配正负样本时)self.dboxes_ltrb = self.dboxes.clone()  # torch.Size([8732, 4]) -> 生成8732个Default box，每个Default box有4个坐标(x,y,w,h)self.dboxes_ltrb[:, 0] = self.dboxes[:, 0] - 0.5 * self.dboxes[:, 2]   # xmin <- x - w/2 = 左上角的x坐标self.dboxes_ltrb[:, 1] = self.dboxes[:, 1] - 0.5 * self.dboxes[:, 3]   # ymin <- y - h/2 = 左上角的y坐标self.dboxes_ltrb[:, 2] = self.dboxes[:, 0] + 0.5 * self.dboxes[:, 2]   # xmax <- w + w/2 = 右下角的x坐标self.dboxes_ltrb[:, 3] = self.dboxes[:, 1] + 0.5 * self.dboxes[:, 3]   # ymax <- h + h/2 = 右下角的y坐标"""self.dboxes: x,y,w,hself.dboxes_ltrb: x1,y1,x2,y2"""@propertydef scale_xy(self):return self.scale_xy_@propertydef scale_wh(self):return self.scale_wh_def __call__(self, order='ltrb'):# 根据需求返回对应格式的default boxif order == 'ltrb':return self.dboxes_ltrbif order == 'xywh':return self.dboxes

1.4 正负样本匹配

class AssignGTtoDefaultBox(object):"""将DefaultBox与GT进行匹配"""def __init__(self):self.default_box = dboxes300_coco()  # 创建DBox(最原生的8732）self.encoder = Encoder(self.default_box)def __call__(self, image, target):boxes = target['boxes']  # GTBox的坐标信息labels = target["labels"]  # GTBox对应的labels# bboxes_out (Tensor 8732 x 4), labels_out (Tensor 8732)# 通过self.encoder.encode()方法将DBox与GTBox进行匹配bboxes_out, labels_out = self.encoder.encode(boxes, labels)target['boxes'] = bboxes_outtarget['labels'] = labels_outreturn image, target# This function is from https://github.com/kuangliu/pytorch-ssd.
class Encoder(object):"""Inspired by https://github.com/kuangliu/pytorch-srcTransform between (bboxes, lables) <-> SSD outputdboxes: default boxes in size 8732 x 4,encoder: input ltrb format, output xywh formatdecoder: input xywh format, output ltrb formatencode:input  : bboxes_in (Tensor nboxes x 4), labels_in (Tensor nboxes)output : bboxes_out (Tensor 8732 x 4), labels_out (Tensor 8732)criteria : IoU threshold of bboexesdecode:input  : bboxes_in (Tensor 8732 x 4), scores_in (Tensor 8732 x nitems)output : bboxes_out (Tensor nboxes x 4), labels_out (Tensor nboxes)criteria : IoU threshold of bboexesmax_output : maximum number of output bboxes"""def __init__(self, dboxes):self.dboxes = dboxes(order='ltrb')self.dboxes_xywh = dboxes(order='xywh').unsqueeze(dim=0)self.nboxes = self.dboxes.size(0)  # default boxes的数量self.scale_xy = dboxes.scale_xyself.scale_wh = dboxes.scale_whdef encode(self, bboxes_in, labels_in, criteria=0.5):"""encode:input  : bboxes_in (Tensor nboxes x 4), labels_in (Tensor nboxes)output : bboxes_out (Tensor 8732 x 4), labels_out (Tensor 8732)criteria : IoU threshold of bboexesbboxes_in: GTBox的坐标信息labels_in： GTBox的标签信息"""# 计算每个GT与default box的iou数值ious = calc_iou_tensor(bboxes_in, self.dboxes)  # [GTBox, 8732]# 寻找每个default box匹配到的最大IoU: ① 值； ② 对应的索引best_dbox_ious, best_dbox_idx = ious.max(dim=0)  # [8732,]# 寻找每个GT匹配到的最大IoUbest_bbox_ious, best_bbox_idx = ious.max(dim=1)  # [GTBox,]# 将每个GT匹配到的最佳default box设置为正样本（对应论文中Matching strategy的第一条）"""Matching strategy：1. 用每一个GTBox与DBox求IoU（会将GTBox分配给与它IoU最大的DBox）2. 将IoU大于0.5的DBox设置为正样本"""# set best ious 2.0# 将每个GTBox匹配到的最大IoU的DBox设置为正样本"""tensor.index_fill_(0, best_bbox_idx, 2.0):第一个参数： dim第二个参数： 需要填充的索引第三个参数： 需要填充的数值填充的数只要大于0.5都是可以的"""best_dbox_ious.index_fill_(0, best_bbox_idx, 2.0)  # dim, index, value# 将相应default box匹配最大IOU的GT索引进行替换  -> best_bbox_idx.size(0): GTBox的数量idx = torch.arange(0, best_bbox_idx.size(0), dtype=torch.int64)  # [0, 1, 2, ..., GTBox的数量-1]best_dbox_idx[best_bbox_idx[idx]] = idx"""做完上面这两个操作：best_dbox_idx和best_bbox_idx的对应关系就一致了"""# filter IoU > 0.5# 寻找与GT iou大于0.5的default box,对应论文中Matching strategy的第二条(这里包括了第一条匹配到的信息)masks = best_dbox_ious > criteria  # [Boolean, Boolean, ...]"""针对每一个Default Box而言，都已经被划分为正负样本了。接下来我们需要对划分为正样本的DBox所对应的GT的 标签和GTBox信息 提取出来，方法后续计算self.nboxes: DBox的数量labels_in：针对每一个GTBox的类别标签best_dbox_idx[masks]：将mask所有为True的数据取出来 -> 将划分为正样本的所有DBox对应的GTBox的索引取出来labels_in[best_dbox_idx[masks]]：再将对应GTBox的索引传去label_in中 -> 所有DBox对应GTBox的标签取出来在labels_in中，所有正样本的数值>0，所有负样本的数值=0（背景）这里找正样本DBox对应GTBox的标签，目的是找到DBox的标签（二者是一样的）"""# [8732,]labels_out = torch.zeros(self.nboxes, dtype=torch.int64)  # nboxes = DBoxlabels_out[masks] = labels_in[best_dbox_idx[masks]]"""上面两步将所有划分为正样本的DBox的标签取出来了，接下来我们还需要DBox对应GTBox的信息提出来（坐标信息）best_dbox_idx[masks]：拿到所有被划分到正样本的GTBox的索引bboxes_in[best_dbox_idx[masks]: 拿到针对每一个划分为正样本的DBox所对应GTBox的坐标信息(bboxes_in是GTBox的坐标信息)对于每一个DBox而言，如果它被划分为正样本，那么它的标签和坐标信息都是对应GTBox的标签和坐标的如果它被划分为负样本的话，它的标签为0，它的坐标信息还是它原来的DBox坐标信息"""# 将default box匹配到正样本的位置设置成对应GT的box信息bboxes_out = self.dboxes.clone()bboxes_out[masks, :] = bboxes_in[best_dbox_idx[masks], :]# Transform format to xywh formatx = 0.5 * (bboxes_out[:, 0] + bboxes_out[:, 2])  # xy = 0.5 * (bboxes_out[:, 1] + bboxes_out[:, 3])  # yw = bboxes_out[:, 2] - bboxes_out[:, 0]  # wh = bboxes_out[:, 3] - bboxes_out[:, 1]  # hbboxes_out[:, 0] = xbboxes_out[:, 1] = ybboxes_out[:, 2] = wbboxes_out[:, 3] = hreturn bboxes_out, labels_outdef scale_back_batch(self, bboxes_in, scores_in):"""将box格式从xywh转换回ltrb, 将预测目标score通过softmax处理Do scale and transform from xywh to ltrbsuppose input N x 4 x num_bbox | N x label_num x num_bboxbboxes_in: 是网络预测的xywh回归参数scores_in: 是预测的每个default box的各目标概率"""if bboxes_in.device == torch.device("cpu"):self.dboxes = self.dboxes.cpu()self.dboxes_xywh = self.dboxes_xywh.cpu()else:self.dboxes = self.dboxes.cuda()self.dboxes_xywh = self.dboxes_xywh.cuda()# Returns a view of the original tensor with its dimensions permuted.bboxes_in = bboxes_in.permute(0, 2, 1)scores_in = scores_in.permute(0, 2, 1)# print(bboxes_in.is_contiguous())bboxes_in[:, :, :2] = self.scale_xy * bboxes_in[:, :, :2]   # 预测的x, y回归参数bboxes_in[:, :, 2:] = self.scale_wh * bboxes_in[:, :, 2:]   # 预测的w, h回归参数# 将预测的回归参数叠加到default box上得到最终的预测边界框bboxes_in[:, :, :2] = bboxes_in[:, :, :2] * self.dboxes_xywh[:, :, 2:] + self.dboxes_xywh[:, :, :2]bboxes_in[:, :, 2:] = bboxes_in[:, :, 2:].exp() * self.dboxes_xywh[:, :, 2:]# transform format to ltrbl = bboxes_in[:, :, 0] - 0.5 * bboxes_in[:, :, 2]t = bboxes_in[:, :, 1] - 0.5 * bboxes_in[:, :, 3]r = bboxes_in[:, :, 0] + 0.5 * bboxes_in[:, :, 2]b = bboxes_in[:, :, 1] + 0.5 * bboxes_in[:, :, 3]bboxes_in[:, :, 0] = l  # xminbboxes_in[:, :, 1] = t  # yminbboxes_in[:, :, 2] = r  # xmaxbboxes_in[:, :, 3] = b  # ymaxreturn bboxes_in, F.softmax(scores_in, dim=-1)def decode_batch(self, bboxes_in, scores_in, criteria=0.45, max_output=200):# 将box格式从xywh转换回ltrb（方便后面非极大值抑制时求iou）, 将预测目标score通过softmax处理bboxes, probs = self.scale_back_batch(bboxes_in, scores_in)outputs = []# 遍历一个batch中的每张image数据for bbox, prob in zip(bboxes.split(1, 0), probs.split(1, 0)):bbox = bbox.squeeze(0)prob = prob.squeeze(0)outputs.append(self.decode_single_new(bbox, prob, criteria, max_output))return outputsdef decode_single_new(self, bboxes_in, scores_in, criteria, num_output=200):"""decode:input  : bboxes_in (Tensor 8732 x 4), scores_in (Tensor 8732 x nitems)output : bboxes_out (Tensor nboxes x 4), labels_out (Tensor nboxes)criteria : IoU threshold of bboexesmax_output : maximum number of output bboxes"""device = bboxes_in.devicenum_classes = scores_in.shape[-1]# 对越界的bbox进行裁剪bboxes_in = bboxes_in.clamp(min=0, max=1)# [8732, 4] -> [8732, 21, 4]bboxes_in = bboxes_in.repeat(1, num_classes).reshape(scores_in.shape[0], -1, 4)# create labels for each predictionlabels = torch.arange(num_classes, device=device)labels = labels.view(1, -1).expand_as(scores_in)# remove prediction with the background label# 移除归为背景类别的概率信息bboxes_in = bboxes_in[:, 1:, :]scores_in = scores_in[:, 1:]labels = labels[:, 1:]# batch everything, by making every class prediction be a separate instancebboxes_in = bboxes_in.reshape(-1, 4)scores_in = scores_in.reshape(-1)labels = labels.reshape(-1)# remove low scoring boxes# 移除低概率目标，self.scores_thresh=0.05inds = torch.nonzero(scores_in > 0.05, as_tuple=False).squeeze(1)bboxes_in, scores_in, labels = bboxes_in[inds], scores_in[inds], labels[inds]# remove empty boxesws, hs = bboxes_in[:, 2] - bboxes_in[:, 0], bboxes_in[:, 3] - bboxes_in[:, 1]keep = (ws >= 0.1 / 300) & (hs >= 0.1 / 300)keep = keep.nonzero(as_tuple=False).squeeze(1)bboxes_in, scores_in, labels = bboxes_in[keep], scores_in[keep], labels[keep]# non-maximum suppressionkeep = batched_nms(bboxes_in, scores_in, labels, iou_threshold=criteria)# keep only topk scoring predictionskeep = keep[:num_output]bboxes_out = bboxes_in[keep, :]scores_out = scores_in[keep]labels_out = labels[keep]return bboxes_out, labels_out, scores_out# perform non-maximum suppressiondef decode_single(self, bboxes_in, scores_in, criteria, max_output, max_num=200):"""decode:input  : bboxes_in (Tensor 8732 x 4), scores_in (Tensor 8732 x nitems)output : bboxes_out (Tensor nboxes x 4), labels_out (Tensor nboxes)criteria : IoU threshold of bboexesmax_output : maximum number of output bboxes"""# Reference to https://github.com/amdegroot/ssd.pytorchbboxes_out = []scores_out = []labels_out = []# 非极大值抑制算法# scores_in (Tensor 8732 x nitems), 遍历返回每一列数据，即8732个目标的同一类别的概率for i, score in enumerate(scores_in.split(1, 1)):# skip backgroundif i == 0:continue# [8732, 1] -> [8732]score = score.squeeze(1)# 虑除预测概率小于0.05的目标mask = score > 0.05bboxes, score = bboxes_in[mask, :], score[mask]if score.size(0) == 0:continue# 按照分数从小到大排序score_sorted, score_idx_sorted = score.sort(dim=0)# select max_output indicesscore_idx_sorted = score_idx_sorted[-max_num:]candidates = []while score_idx_sorted.numel() > 0:idx = score_idx_sorted[-1].item()# 获取排名前score_idx_sorted名的bboxes信息 Tensor:[score_idx_sorted, 4]bboxes_sorted = bboxes[score_idx_sorted, :]# 获取排名第一的bboxes信息 Tensor:[4]bboxes_idx = bboxes[idx, :].unsqueeze(dim=0)# 计算前score_idx_sorted名的bboxes与第一名的bboxes的iouiou_sorted = calc_iou_tensor(bboxes_sorted, bboxes_idx).squeeze()# we only need iou < criteria# 丢弃与第一名iou > criteria的所有目标(包括自己本身)score_idx_sorted = score_idx_sorted[iou_sorted < criteria]# 保存第一名的索引信息candidates.append(idx)# 保存该类别通过非极大值抑制后的目标信息bboxes_out.append(bboxes[candidates, :])   # bbox坐标信息scores_out.append(score[candidates])       # score信息labels_out.extend([i] * len(candidates))   # 标签信息if not bboxes_out:  # 如果为空的话，返回空tensor，注意boxes对应的空tensor size，防止验证时出错return [torch.empty(size=(0, 4)), torch.empty(size=(0,), dtype=torch.int64), torch.empty(size=(0,))]bboxes_out = torch.cat(bboxes_out, dim=0).contiguous()scores_out = torch.cat(scores_out, dim=0).contiguous()labels_out = torch.as_tensor(labels_out, dtype=torch.long)# 对所有目标的概率进行排序（无论是什 么类别）,取前max_num个目标_, max_ids = scores_out.sort(dim=0)max_ids = max_ids[-max_output:]return bboxes_out[max_ids, :], labels_out[max_ids], scores_out[max_ids]

1.5 Loss的计算

class Loss(nn.Module):"""Implements the loss as the sum of the followings:1. Confidence Loss: All labels, with hard negative mining2. Localization Loss: Only on positive labelsSuppose input dboxes has the shape 8732x4"""def __init__(self, dboxes):"""Args:dboxes:  utils中的 dboxes = DefaultBoxes(figsize, feat_size, steps, scales, aspect_ratios)即实例化的Default Boxdboxes.scale_xy = 0.1dboxes.scale_wh = 0.2"""super(Loss, self).__init__()# Two factor are from following links# http://jany.st/post/2017-11-05-single-shot-detector-ssd-from-scratch-in-tensorflow.htmlself.scale_xy = 1.0 / dboxes.scale_xy  # 10self.scale_wh = 1.0 / dboxes.scale_wh  # 5# 定义计算定位的损失器 -> SmoothL1self.location_loss = nn.SmoothL1Loss(reduction='none')"""nn.Parameter：转化为PyTorch的参数参数：① data (Tensor) – parameter tensor.② requires_grad (bool, optional)    -> Default: Trueif the parameter requires gradient. See Locally disabling gradient computation for more details. """# [num_anchors, 4] -> [4, num_anchors] -> [1, 4, num_anchors]self.dboxes = nn.Parameter(dboxes(order="xywh").transpose(0, 1).unsqueeze(dim=0), requires_grad=False)# 定义计算分类的损失器 -> Softmax + 交叉熵self.confidence_loss = nn.CrossEntropyLoss(reduction='none')def _location_vec(self, loc):# type: (Tensor) -> Tensorr"""- Generate Location Vectors —— 计算ground truth相对anchors的回归参数:param loc: anchor匹配到的对应GTBOX Nx4x8732计算公式：.. math::\begin{aligned}& x_{回归} = \frac{x_{gtbox} - x_{dbox}}{w_{dbox}} \\& y_{回归} = \frac{y_{gtbox} - y_{dbox}}{h_{dbox}} \\& w_{回归} = \ln(\frac{w_{gtbox}}{w_{dbox}}) \\& h_{回归} = \ln(\frac{h_{gtbox}}{h_{dbox}}) \\\end{aligned}.. note::- 这里我觉得是计算 `DBox的默认位置` 和 `所匹配到GTBox位置之间` 的差距，拿这个差距和DBox的预测回归参数进行比较，得到损失- 也就是说，这里计算的是DBox和GTBox的实际差距（偏离量），网络预测的回归参数是DBox应该位移的量，那这两个数去进行损失函数的计算是比较合理的- self.scale_xy（5）和self.scale_wh（10）是两个缩放因子- 这么做的原因可以理解为是一个trick -> 加速模型收敛:return: 返回DBox与GT的差距，shape为[BS, 4, 8732]"""gxy = self.scale_xy * (loc[:, :2, :] - self.dboxes[:, :2, :]) / self.dboxes[:, 2:, :]  # Nx2x8732gwh = self.scale_wh * (loc[:, 2:, :] / self.dboxes[:, 2:, :]).log()  # Nx2x8732return torch.cat((gxy, gwh), dim=1).contiguous()def forward(self, ploc, plabel, gloc, glabel):# type: (Tensor, Tensor, Tensor, Tensor) -> Tensor"""ploc, plabel: Nx4x8732, Nxlabel_numx8732predicted location and labels① ploc：预测的回归参数② plabel：预测的标签值gloc, glabel: Nx4x8732, Nx8732ground truth location and labels① gloc：预处理过程中每个Default Box（这里应该叫anchor）所匹配到的GT box对应的坐标② glabel：预处理过程中每个Default Box（这里应该叫anchor）所匹配到的GT box对应的标签""""""在预处理过程中会对每一个Dbox匹配和它对应的GTBox以及标签如果该DBox有匹配到GTBox -> 它的glabel肯定是一个>0的数如果该DBox没有匹配到GTBox -> 它的glabel被置为0（0对应的是背景）mask = torch.gt(glabel, 0)通过这个运算之后，就能得到所有匹配到GTBox的DBoxmask是一个蒙版"""# 获取正样本的mask  Tensor: [N, 8732]  N=BSmask = torch.gt(glabel, 0)  # (gt: >)# mask1 = torch.nonzero(glabel)# 计算一个batch中的每张图片的正样本个数 Tensor: [N]pos_num = mask.sum(dim=1)  # shape: [BS]：表明是求batch中每张图片正样本的数量"""[ 2, 10, 52, 27]  这里我们的BS=4，所以只有4张图片第一张图片匹配到的正样本个数为2第二张图片匹配到的正样本个数为10...""""""计算每一个DBox所匹配到GTBox的location回归参数gloc：预处理过程中每个Default Box所匹配到的GT box对应的坐标shape为[BS, 4, 8732] = [BS, 4个坐标, 所有的DBox]"""# 计算gt的location回归参数 -> 预测框的值 Tensor: [N, 4, 8732]vec_gd = self._location_vec(gloc)  # [BS, 4, 8732]# sum on four coordinates, and mask# 计算定位损失(只有正样本)"""ploc： 预测的回归参数（DBox应该如何移动以更好的匹配GTBox）    [BS, 4, 8732]vec_gd：DBox一开始和匹配的的GTBox之间的偏移量              [BS, 4, 8732]vec_gd计算的是DBox和GTBox的实际差距（偏离量）网络预测的回归参数是DBox应该位移的量拿这两个数去进行损失函数的计算是比较合理的Note：定位损失应该只计算正样本的定义损失，负样本不应该参与，而loc_loss = self.location_loss(ploc, vec_gd).sum(dim=1)计算了所有样本的损失，我们应该对其进行筛选 -> 使用刚才得到mask进行筛选import torchmask = [True, False, True, True, False, False, False, False, True]mask = torch.tensor(mask)mask.float()  # tensor([1., 0., 1., 1., 0., 0., 0., 0., 1.])"""loc_loss = self.location_loss(ploc, vec_gd).sum(dim=1)  # Tensor: [N, 8732]loc_loss = (mask.float() * loc_loss).sum(dim=1)  # Tenosr: [N]  -> 对所有的location损失求和 -> 每张图片的location loss"""计算分类损失① plabel： DBox的分类值      torch.Size([4, 21, 8732]) = [BS, classes+1, DBox数量]② glabel： DBox所匹配到的GT的真实分类值     torch.Size([4, 8732]) = [BS, DBox数量]如果该DBox没有匹配到GTBox，则glabel=0（负样本对应的就是背景）返回值：con: 每一个DBox的分类损失       torch.Size([4, 8732]) = [BS, DBox数量]Note：我们不会用到这8732个分类loss的，因为我们在loss时需要考虑到正负样本平衡的问题（SSD论文中，正负样本比例为3:1）而这8732个DBox只有很少一部分是正样本，绝大多数都是负样本如果直接这样计算的话 -> 正负样本不平衡 -> 给网络训练带来很大的困难所以我们需要分别获取它们的正负样本的loss正样本我们一开始就得到了（用mask），现在需要计算负样本"""# hard negative mining Tenosr: [N, 8732]con = self.confidence_loss(plabel, glabel)# positive mask will never selected# 获取负样本con_neg = con.clone()con_neg[mask] = 0.0  # 将正样本的分类损失全部置为0 -> 得到负样本的分类损失# 按照confidence_loss降序排列 con_idx(Tensor: [N, 8732])_, con_idx = con_neg.sort(dim=1, descending=True)  # ① 对负样本的分类损失进行降序排列，不要返回值，只要索引 -> 损失越来越小_, con_rank = con_idx.sort(dim=1)  # 这个步骤比较巧妙  # ② 对负样本损失的索引进行升序排列# number of negative three times positive# 用于损失计算的负样本数是正样本的3倍（在原论文Hard negative mining部分），# 但不能超过总样本数8732  # mask.size(1)=8732neg_num = torch.clamp(3 * pos_num, max=mask.size(1)).unsqueeze(-1)  # 对负样本个数进行限制neg_mask = torch.lt(con_rank, neg_num)  # (lt: <) Tensor [N, 8732]  # ③ 进行取值"""经过这①②③三步之后，neg_mask中为True的个数为neg_num，且正好对应分类损失最大的位置""""""在得到正负样本的mask之后就可以计算最后的分类损失所有的正样本都会用到所有总的样本mask = 正样本的mask + 选取的负样本mask -> mask.float() + neg_mask.float()"""# confidence最终loss使用选取的正样本loss+选取的负样本losscon_loss = (con * (mask.float() + neg_mask.float())).sum(dim=1)  # Tensor [N]"""总的损失 = (定位损失 + 分类损失) / NNote:N为正样本的个数（而不是总的样本数）"""total_loss = loc_loss + con_loss# avoid no object detected# 避免出现图像中没有GTBOX的情况# eg. [15, 3, 5, 0] -> [1.0, 1.0, 1.0, 0.0]"""pos_num的确是正样本的个数    data: tensor([18, 61,  7,  6], device='cuda:0')但是不一定这个batch中每一张图片都有正样本数，因为最后的总损失需要/正样本个数，这里总损失是针对一张图片的，而不是一个batch，对于一张图片来说，它的pos_num有可能为0（正样本数为0）比如pos_num=tensor([0, 0,  7,  6], device='cuda:0')那么在计算时就会出现除零错误，因此需对其下限进行1e-6的限制"""num_mask = torch.gt(pos_num, 0).float()  # 统计一个batch中的每张图像中是否存在正样本， 如果没有则该张图片为负样本，不计算损失pos_num = pos_num.float().clamp(min=1e-6)  # 防止出现分母为零的情况ret = (total_loss * num_mask / pos_num).mean(dim=0)  # 只计算存在正样本的图像损失# ret： tensor(19.5028, device='cuda:0')return ret

1.6 后处理 —— 不训练（不计算损失）而是eval模式直接对输出进行后处理

class PostProcess(nn.Module):def __init__(self, dboxes):"""如果不在训练，那么就对结果进行后处理Args:dboxes: 根据预测特征图生成的Default Box（原生的，8732个）dboxes_xywh：先对DBox进行调用（格式为xywh），之后再添加一个维度，并且声明它不需要梯度（因为是正向传播不训练），最后将其转换为PyTorch的参数dboxes.scale_xy和dboxes.scale_wh是两个超参数self.criteria：IoU阈值的标准self.max_output: 输出目标的最大数量"""super(PostProcess, self).__init__()# [num_anchors, 4] -> [1, num_anchors, 4]self.dboxes_xywh = nn.Parameter(dboxes(order='xywh').unsqueeze(dim=0), requires_grad=False)self.scale_xy = dboxes.scale_xy  # 0.1self.scale_wh = dboxes.scale_wh  # 0.2self.criteria = 0.5self.max_output = 100def scale_back_batch(self, bboxes_in, scores_in):# type: (Tensor, Tensor) -> Tuple[Tensor, Tensor]"""1）通过预测的boxes回归参数得到最终预测坐标2）将box格式从xywh转换回ltrb（VOC的形式）3）将预测目标score通过softmax处理Do scale and transform from xywh to ltrbsuppose input N x 4 x num_bbox | N x label_num x num_bboxbboxes_in: [N, 4, 8732]是网络预测的xywh回归参数scores_in: [N, label_num, 8732]是预测的每个default box的各目标概率"""# Returns a view of the original tensor with its dimensions permuted.# [batch, 4, 8732] -> [batch, 8732, 4]bboxes_in = bboxes_in.permute(0, 2, 1)# [batch, label_num, 8732] -> [batch, 8732, label_num]  label_num = 20 +1 = 21scores_in = scores_in.permute(0, 2, 1)# print(bboxes_in.is_contiguous())bboxes_in[:, :, :2] = self.scale_xy * bboxes_in[:, :, :2]   # 预测的x, y回归参数bboxes_in[:, :, 2:] = self.scale_wh * bboxes_in[:, :, 2:]   # 预测的w, h回归参数"""回归参数的计算如下：回归参数x = (GTBox的x - 预测DBox的x) / 预测DBox的w回归参数y = (GTBox的y - 预测DBox的y) / 预测DBox的h回归参数w = ln(GTBox的w / 预测DBox的w)回归参数h = ln(GTBox的h / 预测DBox的h)现在我们知道回归参数了，去算GTbox：GTBox(最终的预测边界框)的x = (回归参数x)*预测DBox的w + 预测DBox的xGTBox(最终的预测边界框)的y = (回归参数y)*预测DBox的w + 预测DBox的hGTBox(最终的预测边界框)的w = ln(回归参数w) * 预测DBox的wGTBox(最终的预测边界框)的h = ln(回归参数h) * 预测DBox的h"""# 将预测的回归参数叠加到default box上得到最终的预测边界框bboxes_in[:, :, :2] = bboxes_in[:, :, :2] * self.dboxes_xywh[:, :, 2:] + self.dboxes_xywh[:, :, :2]  # xybboxes_in[:, :, 2:] = bboxes_in[:, :, 2:].exp() * self.dboxes_xywh[:, :, 2:]  # wh# transform format to ltrbl = bboxes_in[:, :, 0] - 0.5 * bboxes_in[:, :, 2]t = bboxes_in[:, :, 1] - 0.5 * bboxes_in[:, :, 3]r = bboxes_in[:, :, 0] + 0.5 * bboxes_in[:, :, 2]b = bboxes_in[:, :, 1] + 0.5 * bboxes_in[:, :, 3]bboxes_in[:, :, 0] = l  # xminbboxes_in[:, :, 1] = t  # yminbboxes_in[:, :, 2] = r  # xmaxbboxes_in[:, :, 3] = b  # ymax# scores_in: [batch, 8732, label_num]return bboxes_in, F.softmax(scores_in, dim=-1)def decode_single_new(self, bboxes_in, scores_in, criteria, num_output):# type: (Tensor, Tensor, float, int) -> Tuple[Tensor, Tensor, Tensor]"""decode:input  : bboxes_in (Tensor 8732 x 4), scores_in (Tensor 8732 x nitems)output : bboxes_out (Tensor nboxes x 4), labels_out (Tensor nboxes)criteria : IoU threshold of bboexesmax_output : maximum number of output bboxes"""device = bboxes_in.device  # 获取变量的设备信息num_classes = scores_in.shape[-1]  # 类别个数（20 + 1 = 21）# 对越界的bbox进行裁剪"""Note: 这里bboxes_in里面的都是相对值，所以将其限制在[0, 1]就可以了"""bboxes_in = bboxes_in.clamp(min=0, max=1)"""SSD算法每一个DBox只预测一组回归参数，并不像Faster R-CNN那样，针对每一个类别都去预测一组回归参数。为了使用和Faster R-CNN相同的算法，这里将DBox的信息复制了num_classes次（包括背景）"""# [8732, 4] -> [8732, 21, 4]bboxes_in = bboxes_in.repeat(1, num_classes).reshape(scores_in.shape[0], -1, 4)# create labels for each prediction"""为每一个DBox创建一个labelimport torchtorch.arange(21)Res:tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,18, 19, 20])"""labels = torch.arange(num_classes, device=device)  # [21, ]# [num_classes] -> [8732, num_classes]labels = labels.view(1, -1).expand_as(scores_in)# remove prediction with the background label# 移除归为背景类别的概率信息bboxes_in = bboxes_in[:, 1:, :]  # [8732, 21, 4] -> [8732, 20, 4]scores_in = scores_in[:, 1:]  # [8732, 21] -> [8732, 20]labels = labels[:, 1:]  # [8732, 21] -> [8732, 20]# batch everything, by making every class prediction be a separate instancebboxes_in = bboxes_in.reshape(-1, 4)  # [8732, 20, 4] -> [8732x20, 4] -> 行向量scores_in = scores_in.reshape(-1)  # [8732, 20] -> [8732x20] -> 行向量labels = labels.reshape(-1)  # [8732, 20] -> [8732x20] -> 行向量# remove low scoring boxes# 移除低概率目标，self.scores_thresh=0.05# inds = torch.nonzero(scores_in > 0.05).squeeze(1)inds = torch.where(torch.gt(scores_in, 0.05))[0]  # 获取概率大于0.05的元素索引# 获取BBox预测概率、label在对应位置上的数值bboxes_in, scores_in, labels = bboxes_in[inds, :], scores_in[inds], labels[inds]# remove empty boxes，移除面积很小的DBox"""右下角x - 左上角的x -> ws右下角y - 左上角的y -> hs"""ws, hs = bboxes_in[:, 2] - bboxes_in[:, 0], bboxes_in[:, 3] - bboxes_in[:, 1]keep = (ws >= 1 / 300) & (hs >= 1 / 300)  # 宽度 >= 1个像素；高度 >= 1个像素# keep = keep.nonzero().squeeze(1)keep = torch.where(keep)[0]  # 滤除小面积目标bboxes_in, scores_in, labels = bboxes_in[keep], scores_in[keep], labels[keep]# non-maximum suppression -> 重叠目标的过滤keep = batched_nms(bboxes_in, scores_in, labels, iou_threshold=criteria)# keep only topk scoring predictionskeep = keep[:num_output]  # 控制输出box的最大数量bboxes_out = bboxes_in[keep, :]  # 根据索引取box坐标scores_out = scores_in[keep]  # 根据索引取box分数labels_out = labels[keep]  # 根据索引取box类别return bboxes_out, labels_out, scores_outdef forward(self, bboxes_in, scores_in):"""Args:bboxes_in: 预测得到的DBox的坐标偏移量（坐标回归参数）scores_in: 预测每个DBox对应的类别Returns:"""# 通过预测的boxes回归参数得到最终预测坐标, 将预测目标score通过softmax处理"""boxes: 最终的预测框坐标（x1,y1,x2,y2）probs: 每个预测框的概率"""bboxes, probs = self.scale_back_batch(bboxes_in, scores_in)"""先定义一个空listlist中的每一个元素是一个tupletuple中的每一个元素是一个Tensor"""outputs = torch.jit.annotate(List[Tuple[Tensor, Tensor, Tensor]], [])# 遍历一个batch中的每张image数据"""tensor.split(split_size, dim)split_size： 分割的大小dim: 分割的维度"""# bboxes: [batch, 8732, 4]for bbox, prob in zip(bboxes.split(1, 0), probs.split(1, 0)):  # split_size, split_dim# bbox: [1, 8732, 4]bbox = bbox.squeeze(0)  # 压缩BS维度prob = prob.squeeze(0)  # 压缩BS维度"""bbox: 每张图片的预测边界框坐标prob: 每张图片的预测边界框的概率信息self.criteria: NMS的IoU阈值self.max_output: 输出的最大目标个数得到每张图片的预测边界框相对坐标、分数（该标签对应的概率）和它的标签"""outputs.append(self.decode_single_new(bbox, prob, self.criteria, self.max_output))return outputs

参考

https://www.bilibili.com/video/BV1vK411H771?spm_id_from=333.999.0.0

SSD PyTorch源码解析相关推荐

yolov3之pytorch源码解析_springmvc源码架构解析之view
说在前面前期回顾 sharding-jdbc源码解析更新完毕 spring源码解析更新完毕 spring-mvc源码解析更新完毕 spring-tx源码解析更新完毕 spring-boot源 ...
pytorch源码解析2——数据处理torch.utils.data
迭代器理解 Python 的迭代器是解读 PyTorch 中 torch.utils.data 模块的关键. 在 Dataset, Sampler 和 DataLoader 这三个类中都会用到 py ...
pytorch源码解析：Python层 pytorchmodule源码
尝试使用了pytorch,相比其他深度学习框架,pytorch显得简洁易懂.花时间读了部分源码,主要结合简单例子带着问题阅读,不涉及源码中C拓展库的实现. 一个简单例子实现单层softmax二分类, ...
PyTorch源码解析--torchvision.transforms（数据预处理、数据增强）
PyTorch框架中有一个很常用的包:torchvision torchvision主要由3个子包构成:torchvision.datasets.torchvision.models.torchvis ...
pytorch 测试每一类_DeepFM全方面解析（附pytorch源码）
写在前面最近看了DeepFM这个模型.把我学习的思路和总结放上来给大家和未来的自己做个参考和借鉴.文章主要希望能串起学习DeepFM的各个环节,梳理整个学习思路.以"我"的角度浅 ...
[源码解析] PyTorch 流水线并行实现 (1)--基础知识
[源码解析] PyTorch 流水线并行实现 (1)–基础知识文章目录 [源码解析] PyTorch 流水线并行实现 (1)--基础知识 0x00 摘要 0x01 历史 1.1 GPipe 1.2 ...
[源码解析] PyTorch 分布式(2) ----- DataParallel(上)
[源码解析] PyTorch 分布式(2) ----- DataParallel(上) 文章目录 [源码解析] PyTorch 分布式(2) ----- DataParallel(上) 0x00 摘要 ...
[源码解析] PyTorch 流水线并行实现 (6)--并行计算
[源码解析] PyTorch 流水线并行实现 (6)–并行计算文章目录 [源码解析] PyTorch 流水线并行实现 (6)--并行计算 0x00 摘要 0x01 总体架构 1.1 使用 1.2 前 ...
[源码解析] Pytorch 如何实现后向传播 (1)---- 调用引擎
[源码解析] Pytorch 如何实现后向传播 (1)---- 调用引擎文章目录 [源码解析] Pytorch 如何实现后向传播 (1)---- 调用引擎 0x00 摘要 0x01 前文回顾 1.1 ...

SSD PyTorch源码解析

0. 引言

0.1 代码来源

0.2 代码改动

0.3 代码使用注意事项

0.3.1 环境配置：

0.3.2 文件结构：

0.3.3 预训练权重下载地址（下载后放入src文件夹中）：

0.3.4 数据集，本例程使用的是PASCAL VOC2012数据集(下载后放入项目当前文件夹中)

0.3.5 训练方法

0.3.6 训练结果展示

1. SSD代码

1.1 框架示意图

1.2 SSD网络的搭建 —— `ssd_model.py`

1.2.1 修改backbone（ResNet-50）

1.2.2 搭建SSD网络

1.3 Default Box的生成

1.4 正负样本匹配

1.5 Loss的计算

1.6 后处理 —— 不训练（不计算损失）而是eval模式直接对输出进行后处理

参考

SSD PyTorch源码解析相关推荐

最新文章

热门文章

SSD PyTorch源码解析

0. 引言

0.1 代码来源

0.2 代码改动

0.3 代码使用注意事项

0.3.1 环境配置：

0.3.2 文件结构：

0.3.3 预训练权重下载地址（下载后放入src文件夹中）：

0.3.4 数据集，本例程使用的是PASCAL VOC2012数据集(下载后放入项目当前文件夹中)

0.3.5 训练方法

0.3.6 训练结果展示

1. SSD代码

1.1 框架示意图

1.2 SSD网络的搭建 —— ssd_model.py

1.2.1 修改backbone（ResNet-50）

1.2.2 搭建SSD网络

1.3 Default Box的生成

1.4 正负样本匹配

1.5 Loss的计算

1.6 后处理 —— 不训练（不计算损失）而是eval模式直接对输出进行后处理

参考

SSD PyTorch源码解析相关推荐

最新文章

热门文章

1.2 SSD网络的搭建 —— `ssd_model.py`