再读目标检测--ssd深度解析

[目标检测 – R-CNN，Fast R-CNN，Faster R-CNN] https://www.cnblogs.com/yanghailin/p/14767995.html
[目标检测—SSD] https://www.cnblogs.com/yanghailin/p/14769384.html
[ssd 的anchor生成详解] https://www.cnblogs.com/yanghailin/p/14868575.html
[ssd 网络详解] https://www.cnblogs.com/yanghailin/p/14871296.html
[ssd loss详解] https://www.cnblogs.com/yanghailin/p/14882807.html
其实前阵子读过ssd源码，并且也写出了上面的博客，然后再隔一阵子我就完全不记得了。。。然后最近又再读了一下，但是这次读源码的速度明显快多了，
特别是一些之前似懂非懂的地方都有进一步理解了。

文章目录

ssd网络部分
ssd loss部分 -- ssd编解码
关于如何算交并比，没看到映射到原图的操作。
先验框匹配规则
难例挖掘部分：

ssd网络部分

我又手写了一遍。。。

上面我根据代码写的图就和论文中这个图一致：

我上面最后的LOC [bn,8732,4]
conf [bn,8732,num]
就是网络的输出。

ssd loss部分 – ssd编解码

关于编解码，我一开始看的时候也是看的稀里糊涂的，你说这么整就这样呗。ssd论文里面公式就是如下：

然后我看到faster rcnn里面有详细的介绍。
https://blog.csdn.net/weixin_42782150/article/details/110522380

预测的框为了和gt靠拢，需要把anchor框中心点x，y加上一个偏移量，宽高乘以一个系数不为0系数！向gt靠拢，所以产生了这种数学公式。

在faster rcnn论文里面：

可以看到是有两组t和t*.
一开始看看不懂，后来荒原大悟！
下面的t公式是gt与anchor之间的偏移。而网络学习就是学习的这种偏移t！看上面loss的公式，就是t和t。为了就是让网络向t*靠拢。
t就是编码之后的，这种编码关系就是如上式所示。

关于如何算交并比，没看到映射到原图的操作。

计算交并比，没有看到一处for循环。

def point_form(boxes):""" Convert prior_boxes to (xmin, ymin, xmax, ymax)representation for comparison to point form ground truth data.Args:boxes: (tensor) center-size default boxes from priorbox layers.Return:boxes: (tensor) Converted xmin, ymin, xmax, ymax form of boxes."""return torch.cat((boxes[:, :2] - boxes[:, 2:]/2,     # xmin, yminboxes[:, :2] + boxes[:, 2:]/2), 1)  # xmax, ymax
def intersect(box_a, box_b):""" We resize both tensors to [A,B,2] without new malloc:[A,2] -> [A,1,2] -> [A,B,2][B,2] -> [1,B,2] -> [A,B,2]Then we compute the area of intersect between box_a and box_b.Args:box_a: (tensor) bounding boxes, Shape: [A,4].box_b: (tensor) bounding boxes, Shape: [B,4].Return:(tensor) intersection area, Shape: [A,B]."""A = box_a.size(0)B = box_b.size(0)# n1 = box_a[:, 2:]# n1_1 = box_a[:, 2:].unsqueeze(1)# n1_2 = box_a[:, 2:].unsqueeze(1).expand(A, B, 2)## n2 = box_b[:, 2:]# n2_1 = box_b[:, 2:].unsqueeze(0)# n2_2 = box_b[:, 2:].unsqueeze(0).expand(A, B, 2)## n3 = torch.min(n1_2, n2_2)max_xy = torch.min(box_a[:, 2:].unsqueeze(1).expand(A, B, 2),box_b[:, 2:].unsqueeze(0).expand(A, B, 2))min_xy = torch.max(box_a[:, :2].unsqueeze(1).expand(A, B, 2),box_b[:, :2].unsqueeze(0).expand(A, B, 2))# sub_ = max_xy - min_xyinter = torch.clamp((max_xy - min_xy), min=0)return inter[:, :, 0] * inter[:, :, 1]def jaccard(box_a, box_b):"""Compute the jaccard overlap of two sets of boxes.  The jaccard overlapis simply the intersection over union of two boxes.  Here we operate onground truth boxes and default boxes.E.g.:A ∩ B / A ∪ B = A ∩ B / (area(A) + area(B) - A ∩ B)Args:box_a: (tensor) Ground truth bounding boxes, Shape: [num_objects,4]box_b: (tensor) Prior boxes from priorbox layers, Shape: [num_priors,4]Return:jaccard overlap: (tensor) Shape: [box_a.size(0), box_b.size(0)]"""inter = intersect(box_a, box_b)area_a = ((box_a[:, 2]-box_a[:, 0]) *(box_a[:, 3]-box_a[:, 1])).unsqueeze(1).expand_as(inter)  # [A,B]area_b = ((box_b[:, 2]-box_b[:, 0]) *(box_b[:, 3]-box_b[:, 1])).unsqueeze(0).expand_as(inter)  # [A,B]union = area_a + area_b - interreturn inter / union  # [A,B]overlaps = jaccard(truths,point_form(priors))

看了代码，所有的操作都是相对值，0-1之间的数值。这个数值乘以原图大小就能映射到原图，所以代码中都是相对值，和在原图上操作一致。

先验框匹配规则

在算gt与anchor交并比的时候。一张图的gt很少，1,2个目标，anchor有8732个。
代码中的策略是保证一个gt与最大交并比的anchor匹配，然后每个anchor匹配一个与其交并比最大的gt！

看这张图，先横向的找到gt与anchor最大的那个，这个是最大优先级且后续有逻辑确保这个始终满足。

 best_truth_overlap.index_fill_(0, best_prior_idx, 2)  # ensure best prior# TODO refactor: index  best_prior_idx with long tensor# ensure every gt matches with its prior of max overlapfor j in range(best_prior_idx.size(0)):best_truth_idx[best_prior_idx[j]] = j

再纵向的找最大，就是每个anchor对应最大交并比的gt。然后把8732个anchor与匹配的gt编码。
把anchor与gt交并比小于一定阈值的置位背景类。
之前我的博客有分析：
https://www.cnblogs.com/yanghailin/p/14882807.html
这个知乎有关于先验框匹配的一段讲解：https://zhuanlan.zhihu.com/p/33544892

难例挖掘部分：

挺难的，有一段还是不理解。但是感觉设计的很巧妙！两次sort。

 #loss_c[26196,1]    #https://zhuanlan.zhihu.com/p/153535799
loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

这里放出我详细注释的代码

# -*- coding: utf-8 -*-
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import coco as cfg
from ..box_utils import match, log_sum_exp# criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5,
#                          False, args.cuda)class MultiBoxLoss(nn.Module):"""SSD Weighted Loss FunctionCompute Targets:1) Produce Confidence Target Indices by matching  ground truth boxeswith (default) 'priorboxes' that have jaccard index > threshold parameter(default threshold: 0.5).2) Produce localization target by 'encoding' variance into offsets of groundtruth boxes and their matched  'priorboxes'.3) Hard negative mining to filter the excessive number of negative examplesthat comes with using a large number of default bounding boxes.(default negative:positive ratio 3:1)Objective Loss:L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / NWhere, Lconf is the CrossEntropy Loss and Lloc is the SmoothL1 Lossweighted by α which is set to 1 by cross val.Args:c: class confidences,l: predicted boxes,g: ground truth boxesN: number of matched default boxesSee: https://arxiv.org/pdf/1512.02325.pdf for more details."""def __init__(self, num_classes, overlap_thresh, prior_for_matching,bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,use_gpu=True):super(MultiBoxLoss, self).__init__()self.use_gpu = use_gpuself.num_classes = num_classesself.threshold = overlap_threshself.background_label = bkg_labelself.encode_target = encode_targetself.use_prior_for_matching = prior_for_matchingself.do_neg_mining = neg_miningself.negpos_ratio = neg_posself.neg_overlap = neg_overlapself.variance = cfg['variance']def forward(self, predictions, targets):"""Multibox LossArgs:predictions (tuple): A tuple containing loc preds, conf preds,and prior boxes from SSD net.conf shape: torch.size(batch_size,num_priors,num_classes)loc shape: torch.size(batch_size,num_priors,4)priors shape: torch.size(num_priors,4)targets (tensor): Ground truth boxes and labels for a batch,shape: [batch_size,num_objs,5] (last idx is the label).loc_data [3,8732,4]conf_data [3,8732,21]priors [8732,4]"""loc_data, conf_data, priors = predictionsnum = loc_data.size(0) # num就是batchsizepriors = priors[:loc_data.size(1), :] #priors [8732,4]num_priors = (priors.size(0)) #num_priors=8732num_classes = self.num_classes #21# match priors (default boxes) and ground truth boxesloc_t = torch.Tensor(num, num_priors, 4) #[3,8732,4]conf_t = torch.LongTensor(num, num_priors) #[3,8732]for idx in range(num):truths = targets[idx][:, :-1].datalabels = targets[idx][:, -1].datadefaults = priors.datamatch(self.threshold, truths, defaults, self.variance, labels,loc_t, conf_t, idx)if self.use_gpu:loc_t = loc_t.cuda()conf_t = conf_t.cuda()#后缀_t的变量是代表truth#这里经过match出来得到的loc_t[3,8732,4]， conf_t[3,8732]#loc_t[3,8732,4] 是每个anchor与gt做了编码之后的值#conf_t[3, 8732] 是label，0-21之间的值，大部分地方为0，只有anchor与gt交并比大于阈值的地方才有值# wrap targetsloc_t = Variable(loc_t, requires_grad=False) #[3,8732,4]conf_t = Variable(conf_t, requires_grad=False) #[3,8732]# pos [3,8732]  False,True 重要！ 挑选出正样本位置，正样本位置上是true，其余为false# num_pos shape[3,1]   | [23],[4],[9]   每张图正样本个数pos = conf_t > 0 #pos [3,8732]  False,Truenum_pos = pos.sum(dim=1, keepdim=True) #num_pos shape[3,1]   | [23],[4],[9]# Localization Loss (Smooth L1)#pos [3,8732]#pos.dim()=2#pos.unsqueeze(pos.dim()) [3,8732,1]# loc_data [3,8732,4]# pos_idx [3,8732,4] True  False   || 就是把pos里面的True和False放大到[3,8732,4]pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)#loc_data[3, 8732, 4]   aa[144]aa = loc_data[pos_idx]#[3,8732,4]   bb[144]bb = loc_t[pos_idx]#pos_idx [3,8732,4] True  False#loc_data [3,8732,4]  网络预测输出值#loc_data[pos_idx] shape[144]  (23+4+9)×4=144# loc_t [3,8732,4]#loc_t[pos_idx] shape[144]  (23+4+9)×4=144loc_p = loc_data[pos_idx].view(-1, 4)loc_t = loc_t[pos_idx].view(-1, 4)#loss_l tensor(14.0165, grad_fn=<SmoothL1LossBackward>)loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)# Compute max conf across batch for hard negative mining#conf_data [3,8732,21]  网络预测值# batch_conf[3*8732,21]  [26196,21]# tt_1 = batch_conf.max() ## 35# tt_2 = batch_conf.min() ## -13#这里取出了最大最小值方便直观的感受一下里面值都是啥，可以看到这个值其实是为了预测label的。没有经过softmaxbatch_conf = conf_data.view(-1, self.num_classes)b1 = log_sum_exp(batch_conf) #[26196,1]# conf_t[3,8732]# conf_t.view(-1, 1) [26196,1]b2 = batch_conf.gather(1, conf_t.view(-1, 1)) #[26196,1]#https://blog.csdn.net/liyu0611/article/details/100547145#https://zhuanlan.zhihu.com/p/35709485# https://zhuanlan.zhihu.com/p/153535799#loss_c[26196,1]loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))# Hard Negative Mining#loss_c[pos] = 0  # filter out pos boxes for now#loss_c = loss_c.view(num, -1)#loss_c [3,8732]loss_c = loss_c.view(num, -1)#pos [3,8732]# loss_c [3,8732]loss_c[pos] = 0 #把正样本的loss置为0#两次sort https://blog.csdn.net/laizi_laizi/article/details/103482634#loss_idx [3,8732]tmp1, loss_idx = loss_c.sort(1, descending=True) ## _, loss_idx = loss_c.sort(1, descending=True)#idx_rank [3,8732]tmp2, idx_rank = loss_idx.sort(1) ## _, idx_rank = loss_idx.sort(1)# pos [3,8732]# num_pos shape[3,1]   | [23],[4],[9]num_pos = pos.long().sum(1, keepdim=True)# aaaaa shape[3,1]   | [23*3],[4*3],[9*3]#aaaaa = self.negpos_ratio * num_pos# num_neg shape[3,1]   | [23*3],[4*3],[9*3]num_neg = torch.clamp(self.negpos_ratio*num_pos, max=pos.size(1)-1)#num_pos shape[3,1]   | [69],[12],[27]# num_neg shape[3,1]# idx_rank[3,8732]#neg [3,8732]  True,False  给出的是loss_c对应坐标的True与False 排序的从大到小neg = idx_rank < num_neg.expand_as(idx_rank)# Confidence Loss Including Positive and Negative Examples# pos[3,8732]# pos.unsqueeze(2)  [3,8732,1]#conf_data[3, 8732, 21]#pos_idx [3, 8732, 21]pos_idx = pos.unsqueeze(2).expand_as(conf_data)#pos[3,8732]  conf_data[3,8732,21]# neg [3,8732]# neg_idx [3, 8732, 21]neg_idx = neg.unsqueeze(2).expand_as(conf_data)##neg [3,8732]  conf_data[3,8732,21]# tmp_1111 = pos_idx + neg_idx #[3,8732,21]  True  False   ##其他的 pos_idx+neg_idx  这两者的形状都是相同的[3,8732,21] 值都是True或者False  加运算相当执行了或运算，只要有一个True就是True# tmp_2222 = (pos_idx+neg_idx).gt(0) #[3,8732,21]  True  False# conf_data [3,8732,21]# tmp_333 = conf_data[(pos_idx + neg_idx).gt(0)] #[3696]   #conf_p [176,21]          #其他的 conf_p [144,21] -->  这里面的144就是上面两个pos_idx和neg_idx里面True数量之和 69+12+27+23+4+9=144conf_p = conf_data[(pos_idx+neg_idx).gt(0)].view(-1, self.num_classes)# pos [3,8732]# neg [3,8732]# conf_t [3,8732]# targets_weighted  [176]   ||[144]targets_weighted = conf_t[(pos+neg).gt(0)]#loss_c tensor(58.0656, grad_fn=<NllLossBackward>)loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)# Sum of losses: L(x,c,l,g) = (Lconf(x, c) + αLloc(x,l,g)) / NN = num_pos.data.sum() ##N=36  就是num_pos之和[23] + [4] + [9]loss_l /= Nloss_c /= Nreturn loss_l, loss_c

有两个地方需要说一下：
第一个地方就是

loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))

这个其实感觉就是交叉熵，log_sum_exp就是下面公式后面减号部分，batch_conf.gather(1, conf_t.view(-1, 1))就是xj

公式来源于下面博客。
https://blog.csdn.net/liyu0611/article/details/100547145
之所以要log_sum_exp，是因为e的1000次方会超数值表达范围。具体的看https://zhuanlan.zhihu.com/p/153535799
就是搞不明白如下：

loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
loss_c = F.cross_entropy(conf_p, targets_weighted, size_average=False)

有啥区别，为啥第一个要这么写。

还有就是两次sort
https://blog.csdn.net/laizi_laizi/article/details/103482634

这个图片就显示的很清楚了，两次sort就是为了能够方便取出排名前多少的数值，并且得到其位置。很厉害！