YOLO V3 Loss Function Improvements: YOLO V3 In-Depth Analysis (Part 2)

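1. Preface

It has been a long time since YOLO V3 In-Depth Analysis (Part 1). I was busy with my own paper, a patent, and study tasks, and I admit I slacked off on writing. But a job should not be left half done (and some readers have been asking for an update), so here is the continuation of the YOLO V3 analysis. The code is based on TensorFlow.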

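The previous part covered building the YOLO V3 model, which produces three feature maps: feature_map_1, feature_map_2, and feature_map_3. Building on that model, this part mainly covers:

The YOLO V3 prior boxes (anchors)
The YOLO V3 loss function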

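2. YOLO V3 Anchors

YOLO V3 inherits the anchors of YOLO V2 (an anchor can be understood as a base box or prior box; bounding boxes are later fine-tuned relative to these boxes), but uses them somewhat differently.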

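YOLO V2 uses 5 anchors with different aspect ratios (obtained by clustering), and each cell (one grid unit of the feature map) is responsible for all 5. YOLO V3 uses 9 anchors of different widths and heights (also obtained by clustering), and each cell gets 9/3 = 3 anchors, because YOLO V3 has 3 feature maps. Feature maps of different sizes have different receptive fields: a smaller feature map has a larger receptive field and is therefore responsible for detecting larger objects, while a larger feature map has a smaller receptive field and is responsible for detecting smaller objects.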
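As a rough illustration of the 9/3 = 3 split described above, the sketch below groups nine anchors by feature-map size. The anchor values are the default COCO anchors from the YOLO V3 paper; the name anchor_groups and the exact group-to-map assignment are assumptions that follow the receptive-field reasoning above, not code from this article.

```python
# The nine default COCO anchors from the YOLO V3 paper, as (width, height)
# in pixels, ordered from the smallest area to the largest.
anchors = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

# Assumed grouping: the coarsest map (13x13) has the largest receptive
# field, so it gets the three largest anchors.
anchor_groups = {
    13: anchors[6:9],  # large objects
    26: anchors[3:6],  # medium objects
    52: anchors[0:3],  # small objects
}

for grid, group in anchor_groups.items():
    print(f"{grid}x{grid} feature map -> anchors {group}")
```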

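So before training, we need to cluster our own dataset to generate anchors that suit it, which speeds up training. For example, the anchors for face detection and pedestrian detection differ quite a lot: face anchors usually have a width-to-height ratio of about 1, whereas pedestrian anchors mostly have a ratio of about 0.5.

Here we cluster the ground-truth boxes of your own dataset. In the TensorFlow code, open get_kmeans.py, change annotation_path to the file that holds your image annotations, and run the script.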

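    # input size that the boxes are rescaled to before clustering
    target_size = [416, 416]
    # path to your own annotation file
    annotation_path = r"datamy_datatrain.txt"
    anno_result = parse_anno(annotation_path, target_size=target_size)
    # cluster into 9 anchors; also returns the average IOU
    anchors, ave_iou = get_kmeans(anno_result, 9)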
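To give a feel for what anchor clustering does, here is a minimal pure-Python sketch of k-means on (width, height) pairs with the distance 1 - IOU. It is not the get_kmeans implementation from the repository; the names wh_iou and kmeans_anchors and the toy boxes are made up for illustration.

```python
import random

def wh_iou(box, centroid):
    """IOU of two boxes given only as (w, h), aligned at the same corner."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs, assigning each box to the centroid with the highest IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda c: wh_iou(box, centroids[c]))
            clusters[best].append(box)
        new_centroids = []
        for c, members in enumerate(clusters):
            if not members:  # an empty cluster keeps its old centroid
                new_centroids.append(centroids[c])
                continue
            mean_w = sum(w for w, _ in members) / len(members)
            mean_h = sum(h for _, h in members) / len(members)
            new_centroids.append((mean_w, mean_h))
        if new_centroids == centroids:  # assignments no longer change
            break
        centroids = new_centroids
    avg_iou = sum(max(wh_iou(b, c) for c in centroids) for b in boxes) / len(boxes)
    # return anchors sorted by area, plus the average best IOU
    return sorted(centroids, key=lambda a: a[0] * a[1]), avg_iou

# Toy data with two obvious size groups.
toy_boxes = [(10, 12), (12, 10), (11, 11), (100, 90), (95, 105), (110, 100)]
toy_anchors, toy_avg_iou = kmeans_anchors(toy_boxes, k=2)
print(toy_anchors, toy_avg_iou)
```

Using 1 - IOU instead of Euclidean distance keeps large boxes from dominating the clustering, which is why the average IOU is the natural quality measure to report.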

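The figure below shows the anchors obtained on my own dataset. It is fairly clear that most ground-truth boxes in my dataset are close to square.

Figure: anchors for my own dataset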

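3. YOLO V3 Forward Pass and Loss Function

3.1 YOLO V3 Forward Pass

We reuse the earlier figure (below); the previous part also explained the structure of YOLO V3 in detail.

Figure: YOLO V3 structure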

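As the figure shows, the YOLO V3 model takes a 416x416x3 image and returns three feature maps of different sizes: y1 (13x13x255), y2 (26x26x255), and y3 (52x52x255). The 255 here is the result for the COCO dataset: COCO has 80 object classes and each feature-map cell is responsible for three anchors, so

255 = 3 x [80 (class probability distribution) + 1 (confidence) + 4 (predicted box)]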
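The arithmetic above can be checked with a few lines of Python. The helper names head_channels and grid_sizes are hypothetical; the strides 32, 16, and 8 are the downsampling factors that turn a 416x416 input into the 13, 26, and 52 grids mentioned above.

```python
def head_channels(num_classes, anchors_per_cell=3):
    """Output channels per cell: anchors x (classes + confidence + 4 box values)."""
    return anchors_per_cell * (num_classes + 1 + 4)

def grid_sizes(img_size=416, strides=(32, 16, 8)):
    """Side length of each feature map for a square input image."""
    return [img_size // s for s in strides]

print(head_channels(80))  # 255
print(grid_sizes())       # [13, 26, 52]
```

For a custom dataset with, say, 20 classes, the same formula gives 3 x 25 = 75 channels.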

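Early in training, because the network parameters have only just been initialized, y1, y2, and y3 differ greatly from the ground-truth boxes, classes, and confidences, so the loss is large. The parameters are then optimized by backpropagating that loss, so next we explain the YOLO V3 loss function.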

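3.2 YOLO V3 Loss Function

When reading other blogs or Zhihu articles, you have probably often seen a very long YOLO V3 loss function, as in the figure below. It appears to have five parts, but it really has three, namely

position loss of the predicted boxes + class loss of the predicted boxes + confidence loss of the predicted boxes

Figure: YOLO V3 loss function

To follow along in the code, go to model.py, which defines the yolov3 class, and then to the method loss_layer. At first glance it looks long, but don't worry, I have annotated all of it, as shown below. (If you would rather not read it, skip the code and go straight to my analysis after it.)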

    def loss_layer(self, feature_map_i, y_true, anchors):
        '''
        calc loss function from a certain scale
        input:
            feature_map_i: feature maps of a certain scale. shape: [N, 13, 13, 3*(5 + num_class)] etc.
            y_true: y_true from a certain scale. shape: [N, 13, 13, 3, 4 (x, y, w, h) + 1 + num_class + 1 (default weight)] etc. The last entry is the mix_up weight.
            anchors: shape [3, 2]
        '''
        # size in [h, w] format! don't get messed up!
        grid_size = tf.shape(feature_map_i)[1:3]  # height and width of the output feature map
        # the downscale ratio in height and width
        ratio = tf.cast(self.img_size / grid_size, tf.float32)  # scale of the original image relative to this feature map
        # N: batch_size
        N = tf.cast(tf.shape(feature_map_i)[0], tf.float32)

        # use the anchors to map the predicted box positions and sizes to the original image scale;
        # pred_boxes is at original image scale, x_y_offset holds the cell indices on the feature map
        x_y_offset, pred_boxes, pred_conf_logits, pred_prob_logits = self.reorg_layer(feature_map_i, anchors)

        ###########
        # get mask
        ###########
        # shape: take 416x416 input image and 13*13 feature_map for example:
        # [N, 13, 13, 3, 1]
        object_mask = y_true[..., 4:5]  # 1 for a cell responsible for detecting an object (all three positions are 1), otherwise 0

        # the calculation of ignore mask if referred from
        # loop over the batch, i.e. analyze one image at a time
        ignore_mask = tf.TensorArray(tf.float32, size=0, dynamic_size=True)

        def loop_cond(idx, ignore_mask):
            """Keep looping over the images in the batch while idx < N."""
            return tf.less(idx, tf.cast(N, tf.int32))

        def loop_body(idx, ignore_mask):
            """For one image, compute the IOU between all predicted boxes and all ground-truth boxes and flag the poor predictions."""
            # shape: [13, 13, 3, 4] & [13, 13, 3]  ==>  [V, 4]
            # V: num of true gt box of each image in a batch
            valid_true_boxes = tf.boolean_mask(y_true[idx, ..., 0:4], tf.cast(object_mask[idx, ..., 0], 'bool'))
            # shape: [13, 13, 3, 4] & [V, 4] ==> [13, 13, 3, V]
            iou = self.box_iou(pred_boxes[idx], valid_true_boxes)
            # shape: [13, 13, 3]
            best_iou = tf.reduce_max(iou, axis=-1)
            # shape: [13, 13, 3]
            ignore_mask_tmp = tf.cast(best_iou < 0.5, tf.float32)
            # finally will be shape: [N, 13, 13, 3]
            ignore_mask = ignore_mask.write(idx, ignore_mask_tmp)
            return idx + 1, ignore_mask

        _, ignore_mask = tf.while_loop(cond=loop_cond, body=loop_body, loop_vars=[0, ignore_mask])
        # [N, 13, 13, 3]
        ignore_mask = ignore_mask.stack()
        # shape: [N, 13, 13, 3, 1]
        ignore_mask = tf.expand_dims(ignore_mask, -1)

        # shape: [N, 13, 13, 3, 2]
        # the following are all at original image scale
        pred_box_xy = pred_boxes[..., 0:2]  # predicted center coordinates
        pred_box_wh = pred_boxes[..., 2:4]  # predicted width and height

        # get xy coordinates in one cell from the feature_map
        # numerical range: 0 ~ 1 (normalized)
        # shape: [N, 13, 13, 3, 2]
        true_xy = y_true[..., 0:2] / ratio[::-1] - x_y_offset  # ground-truth center mapped to the feature map, as an offset inside its cell
        pred_xy = pred_box_xy / ratio[::-1] - x_y_offset  # predicted center mapped to the feature map, as an offset inside its cell

        # get_tw_th
        # numerical range: 0 ~ 1
        # shape: [N, 13, 13, 3, 2]
        true_tw_th = y_true[..., 2:4] / anchors  # ratio of the ground-truth width/height to the anchor
        pred_tw_th = pred_box_wh / anchors  # ratio of the predicted width/height to the anchor
        # for numerical stability
        # tf.where returns x where the condition is true, otherwise y
        true_tw_th = tf.where(condition=tf.equal(true_tw_th, 0),
                              x=tf.ones_like(true_tw_th), y=true_tw_th)  # a ratio of 0 is not allowed; replace it with 1
        pred_tw_th = tf.where(condition=tf.equal(pred_tw_th, 0),
                              x=tf.ones_like(pred_tw_th), y=pred_tw_th)
        # applying exp to true_tw_th and multiplying by the anchor restores the original size
        # taking the log also keeps the loss from growing too large
        true_tw_th = tf.log(tf.clip_by_value(true_tw_th, 1e-9, 1e9))  # log of the ground-truth width/height ratio
        pred_tw_th = tf.log(tf.clip_by_value(pred_tw_th, 1e-9, 1e9))

        # box size punishment: the weight in front of the coordinate loss, smaller for large boxes and larger for small boxes
        # box with smaller area has bigger weight. This is taken from the yolo darknet C source code.
        # shape: [N, 13, 13, 3, 1]
        # the subtracted term is the ground-truth box area as a fraction of the original image area
        box_loss_scale = 2. - (y_true[..., 2:3] / tf.cast(self.img_size[1], tf.float32)) * (y_true[..., 3:4] / tf.cast(self.img_size[0], tf.float32))

        ############
        # loss_part
        ############
        # mix_up weight
        # box loss follows
        # [N, 13, 13, 3, 1]
        mix_w = y_true[..., -1:]
        # shape: [N, 13, 13, 3, 1]
        xy_loss = tf.reduce_sum(tf.square(true_xy - pred_xy) * object_mask * box_loss_scale * mix_w) / N  # squared error
        wh_loss = tf.reduce_sum(tf.square(true_tw_th - pred_tw_th) * object_mask * box_loss_scale * mix_w) / N  # squared error

        # confidence loss follows
        # shape: [N, 13, 13, 3, 1]
        conf_pos_mask = object_mask  # 1 for the three slots of a cell responsible for an object, otherwise 0
        conf_neg_mask = (1 - object_mask) * ignore_mask  # not every non-responsible box is a negative sample: only those with IOU < 0.5 are, the rest are ignored
        conf_loss_pos = conf_pos_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=object_mask, logits=pred_conf_logits)
        conf_loss_neg = conf_neg_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=object_mask, logits=pred_conf_logits)
        # TODO: may need to balance the pos-neg by multiplying some weights
        conf_loss = conf_loss_pos + conf_loss_neg
        if self.use_focal_loss:
            alpha = 1.0
            gamma = 2.0
            # TODO: alpha should be a mask array if needed
            focal_mask = alpha * tf.pow(tf.abs(object_mask - tf.sigmoid(pred_conf_logits)), gamma)
            conf_loss *= focal_mask
        conf_loss = tf.reduce_sum(conf_loss * mix_w) / N

        # shape: [N, 13, 13, 3, 1]
        # whether to use label smooth
        if self.use_label_smooth:
            delta = 0.01
            label_target = (1 - delta) * y_true[..., 5:-1] + delta * 1. / self.class_num
        else:
            label_target = y_true[..., 5:-1]

        # classification loss
        class_loss = object_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=label_target, logits=pred_prob_logits) * mix_w
        class_loss = tf.reduce_sum(class_loss) / N

        return xy_loss, wh_loss, conf_loss, class_loss

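The inputs of this function are a feature map (that is, one output tensor of YOLO V3), y_true (the image annotation information), and the anchors obtained by clustering (different feature maps use different anchors).

Let us look at the structure of y_true in detail, using the 13 x 13 feature map as an example:

The shape of y_true is [N, 13, 13, 3, 4 + 1 + num_classes + 1]

where

N is the batch_size
13 is the height and width of the feature map
3 is the number of anchors (each feature map has a fixed set of 3 anchors)
4 is the box information (center x, center y, width, height, which is the order the code slices them in: y_true[..., 0:2] and y_true[..., 2:4])
1 is the confidence
num_classes is a one-hot class label: the position of the true class is 1 and the others are 0
the final 1 is mix_w, whose default value is 1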
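The layout of the last dimension can be illustrated with a plain Python list standing in for one anchor slot of y_true. The numbers are made up, and num_classes = 3 is chosen only to keep the vector short; the slices mirror the ones used in loss_layer.

```python
num_classes = 3

# One anchor slot of y_true:
# [center_x, center_y, w, h, confidence, one-hot class labels..., mix_w]
slot = [208.0, 104.0, 60.0, 120.0, 1.0, 0.0, 1.0, 0.0, 1.0]
assert len(slot) == 4 + 1 + num_classes + 1

box_xy = slot[0:2]        # like y_true[..., 0:2]
box_wh = slot[2:4]        # like y_true[..., 2:4]
confidence = slot[4:5]    # like y_true[..., 4:5], i.e. object_mask
class_onehot = slot[5:-1] # like y_true[..., 5:-1]
mix_w = slot[-1:]         # like y_true[..., -1:]

print(box_xy, box_wh, confidence, class_onehot, mix_w)
```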

Now let us start the analysis.

    # size in [h, w] format! don't get messed up!
    grid_size = tf.shape(feature_map_i)[1:3]  # height and width of the output feature map
    # the downscale ratio in height and width
    ratio = tf.cast(self.img_size / grid_size, tf.float32)  # scale of the original image relative to this feature map
    # N: batch_size
    N = tf.cast(tf.shape(feature_map_i)[0], tf.float32)

    # use the anchors to map the predicted box positions and sizes to the original image scale;
    # pred_boxes is at original image scale, x_y_offset holds the cell indices on the feature map
    x_y_offset, pred_boxes, pred_conf_logits, pred_prob_logits = self.reorg_layer(feature_map_i, anchors)

The part of this snippet most likely to cause confusion is the function self.reorg_layer, so let us step into it:

    def reorg_layer(self, feature_map, anchors):
        '''
        Use the anchors to map the box positions and sizes predicted on the
        feature map to the original image scale, so that the IOU with the
        ground-truth boxes can be computed later.
        '''
        # NOTE: size in [h, w] format! don't get messed up!
        grid_size = feature_map.get_shape().as_list()[1:3] if self.use_static_shape else tf.shape(feature_map)[1:3]  # [13, 13]
        # the downscale ratio in height and width
        ratio = tf.cast(self.img_size / grid_size, tf.float32)
        # rescale the anchors to the feature_map
        # NOTE: the anchor is in [w, h] format!
        # the anchors are now expressed relative to the feature map size
        rescaled_anchors = [(anchor[0] / ratio[1], anchor[1] / ratio[0]) for anchor in anchors]

        feature_map = tf.reshape(feature_map, [-1, grid_size[0], grid_size[1], 3, 5 + self.class_num])  # split out a separate num_anchors dimension

        # split the feature_map along the last dimension
        # shape info: take 416x416 input image and the 13*13 feature_map for example:
        # box_centers: [N, 13, 13, 3, 2] last_dimension: [center_x, center_y]
        # box_sizes: [N, 13, 13, 3, 2] last_dimension: [width, height]
        # conf_logits: [N, 13, 13, 3, 1]
        # prob_logits: [N, 13, 13, 3, class_num]
        box_centers, box_sizes, conf_logits, prob_logits = tf.split(feature_map, [2, 2, 1, self.class_num], axis=-1)
        box_centers = tf.nn.sigmoid(box_centers)  # pass box_centers through a sigmoid to get values in (0, 1)

        # use some broadcast tricks to get the mesh coordinates
        # build the offsets that shift each center into the cell it belongs to
        grid_x = tf.range(grid_size[1], dtype=tf.int32)
        grid_y = tf.range(grid_size[0], dtype=tf.int32)
        grid_x, grid_y = tf.meshgrid(grid_x, grid_y)  # both have shape (grid_size[0], grid_size[1])
        x_offset = tf.reshape(grid_x, (-1, 1))
        y_offset = tf.reshape(grid_y, (-1, 1))
        x_y_offset = tf.concat([x_offset, y_offset], axis=-1)
        # shape: [13, 13, 1, 2]
        x_y_offset = tf.cast(tf.reshape(x_y_offset, [grid_size[0], grid_size[1], 1, 2]), tf.float32)

        # get the absolute box coordinates on the feature_map, not relative coordinates
        box_centers = box_centers + x_y_offset
        # rescale to the original image scale
        box_centers = box_centers * ratio[::-1]

        # avoid getting possible nan value with tf.clip_by_value
        box_sizes = tf.exp(box_sizes) * rescaled_anchors
        # box_sizes = tf.clip_by_value(tf.exp(box_sizes), 1e-9, 100) * rescaled_anchors
        # rescale to the original image scale
        box_sizes = box_sizes * ratio[::-1]

        # shape: [N, 13, 13, 3, 4]
        # last dimension: (center_x, center_y, w, h)
        boxes = tf.concat([box_centers, box_sizes], axis=-1)  # predicted boxes mapped back to the original image

        # shape:
        # x_y_offset: [13, 13, 1, 2]
        # boxes: [N, 13, 13, 3, 4], rescaled to the original image scale
        # conf_logits: [N, 13, 13, 3, 1]
        # prob_logits: [N, 13, 13, 3, class_num]
        return x_y_offset, boxes, conf_logits, prob_logits

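With the annotations, the function is easy to follow: it takes the box positions and sizes predicted on the feature map and, using the anchors, maps them to the original image scale, which makes it more convenient to compute the IOU with the ground-truth boxes later.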
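For a single anchor in a single cell, the decoding done by reorg_layer can be sketched in plain Python. The function decode_box and its parameter names are hypothetical; it collapses the "rescale the anchor, then rescale back" steps into one multiplication by the original anchor size, which is mathematically the same.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor_wh, stride):
    """Map the raw outputs of one anchor slot to a box at original image scale."""
    center_x = (sigmoid(t_x) + cell_x) * stride  # offset inside the cell + cell index
    center_y = (sigmoid(t_y) + cell_y) * stride
    width = anchor_wh[0] * math.exp(t_w)         # anchor size scaled by exp(t)
    height = anchor_wh[1] * math.exp(t_h)
    return center_x, center_y, width, height

# All-zero outputs in cell (6, 6) of the 13x13 map (stride 32) give a box
# centered in that cell with exactly the anchor's size.
box = decode_box(0.0, 0.0, 0.0, 0.0, cell_x=6, cell_y=6, anchor_wh=(116, 90), stride=32)
print(box)  # (208.0, 208.0, 116.0, 90.0)
```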

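Next we look at two quantities in the loss function, $\mathbb{1}_{ij}^{obj}$ and $\mathbb{1}_{ij}^{noobj}$. Both are produced by the code below, where object_mask corresponds to $\mathbb{1}_{ij}^{obj}$ and ignore_mask corresponds to $\mathbb{1}_{ij}^{noobj}$.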
    ###########
    # get mask
    ###########
    # shape: take 416x416 input image and 13*13 feature_map for example:
    # [N, 13, 13, 3, 1]
    object_mask = y_true[..., 4:5]  # 1 for a cell responsible for detecting an object (all three positions are 1), otherwise 0

    # the calculation of ignore mask if referred from
    # loop over the batch, i.e. analyze one image at a time
    ignore_mask = tf.TensorArray(tf.float32, size=0, dynamic_size=True)

    def loop_cond(idx, ignore_mask):
        """Keep looping over the images in the batch while idx < N."""
        return tf.less(idx, tf.cast(N, tf.int32))

    def loop_body(idx, ignore_mask):
        """For one image, compute the IOU between all predicted boxes and all ground-truth boxes and flag the poor predictions."""
        # shape: [13, 13, 3, 4] & [13, 13, 3]  ==>  [V, 4]
        # V: num of true gt box of each image in a batch
        valid_true_boxes = tf.boolean_mask(y_true[idx, ..., 0:4], tf.cast(object_mask[idx, ..., 0], 'bool'))
        # shape: [13, 13, 3, 4] & [V, 4] ==> [13, 13, 3, V]
        iou = self.box_iou(pred_boxes[idx], valid_true_boxes)
        # shape: [13, 13, 3]
        best_iou = tf.reduce_max(iou, axis=-1)
        # shape: [13, 13, 3]
        ignore_mask_tmp = tf.cast(best_iou < 0.5, tf.float32)
        # finally will be shape: [N, 13, 13, 3]
        ignore_mask = ignore_mask.write(idx, ignore_mask_tmp)
        return idx + 1, ignore_mask

    _, ignore_mask = tf.while_loop(cond=loop_cond, body=loop_body, loop_vars=[0, ignore_mask])
    # [N, 13, 13, 3]
    ignore_mask = ignore_mask.stack()
    # shape: [N, 13, 13, 3, 1]
    ignore_mask = tf.expand_dims(ignore_mask, -1)

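We can see that whichever cell of the feature map is responsible for detecting a ground-truth box, the confidences of the 3 anchors in that cell are all 1, and 0 otherwise. This is $\mathbb{1}_{ij}^{obj}$.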

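What the author does next is not to take the complement of $\mathbb{1}_{ij}^{obj}$ as $\mathbb{1}_{ij}^{noobj}$. Instead, only predictions of poor quality (IOU < 0.5 with every ground-truth box) are set as $\mathbb{1}_{ij}^{noobj}$, which is reflected in the code above.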
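The per-box decision inside loop_body can be sketched without TensorFlow. The helpers box_iou and ignore_flag below are illustrative stand-ins written for this example, not the repository's self.box_iou; boxes are assumed to be in (center_x, center_y, w, h) format at original image scale.

```python
def box_iou(box_a, box_b):
    """IOU of two boxes in (center_x, center_y, w, h) format."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def ignore_flag(pred_box, true_boxes, threshold=0.5):
    """1.0 if the best IOU with any ground-truth box is below the threshold, else 0.0."""
    best_iou = max((box_iou(pred_box, t) for t in true_boxes), default=0.0)
    return 1.0 if best_iou < threshold else 0.0

true_boxes = [(100.0, 100.0, 50.0, 50.0)]
print(ignore_flag((102.0, 101.0, 48.0, 52.0), true_boxes))  # 0.0: overlaps well, not a negative
print(ignore_flag((300.0, 300.0, 40.0, 40.0), true_boxes))  # 1.0: poor overlap, usable as a negative
```

Note the naming: a value of 1 in ignore_mask marks a prediction that is kept as a negative sample, while 0 marks one that overlaps a ground-truth box well enough to be left out of the negative term.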

Next, the author applies two transformations to the position information of the predicted and ground-truth boxes:

    # shape: [N, 13, 13, 3, 2]
    # the following are all at original image scale
    pred_box_xy = pred_boxes[..., 0:2]  # predicted center coordinates
    pred_box_wh = pred_boxes[..., 2:4]  # predicted width and height

    # get xy coordinates in one cell from the feature_map
    # numerical range: 0 ~ 1 (normalized)
    # shape: [N, 13, 13, 3, 2]
    true_xy = y_true[..., 0:2] / ratio[::-1] - x_y_offset  # ground-truth center mapped to the feature map, as an offset inside its cell
    pred_xy = pred_box_xy / ratio[::-1] - x_y_offset  # predicted center mapped to the feature map, as an offset inside its cell

    # get_tw_th
    # numerical range: 0 ~ 1
    # shape: [N, 13, 13, 3, 2]
    true_tw_th = y_true[..., 2:4] / anchors  # ratio of the ground-truth width/height to the anchor
    pred_tw_th = pred_box_wh / anchors  # ratio of the predicted width/height to the anchor
    # for numerical stability
    # tf.where returns x where the condition is true, otherwise y
    true_tw_th = tf.where(condition=tf.equal(true_tw_th, 0),
                          x=tf.ones_like(true_tw_th), y=true_tw_th)  # a ratio of 0 is not allowed; replace it with 1
    pred_tw_th = tf.where(condition=tf.equal(pred_tw_th, 0),
                          x=tf.ones_like(pred_tw_th), y=pred_tw_th)
    # applying exp to true_tw_th and multiplying by the anchor restores the original size
    # taking the log also keeps the loss from growing too large
    true_tw_th = tf.log(tf.clip_by_value(true_tw_th, 1e-9, 1e9))  # log of the ground-truth width/height ratio
    pred_tw_th = tf.log(tf.clip_by_value(pred_tw_th, 1e-9, 1e9))

These two transformations are quite neat, and their purpose is of course to make the loss easier to compute.

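The true_xy and pred_xy above convert the center coordinates (at original image scale) of the ground-truth and predicted boxes into $\sigma(t_x)$ and $\sigma(t_y)$ in the figure below.

Figure: from the YOLO V3 paper

The true_tw_th and pred_tw_th above convert the widths and heights (at original image scale) of the ground-truth and predicted boxes into $t_w$ and $t_h$ in the figure above.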
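The two transformations can be sketched for one ground-truth box in plain Python. The function encode_box and the sample numbers are hypothetical; it mirrors true_xy = xy / ratio - x_y_offset and true_tw_th = log(wh / anchor) from the snippet above.

```python
import math

def encode_box(center_x, center_y, width, height, cell_x, cell_y, anchor_wh, stride):
    """Turn an image-scale box into the regression targets used by the loss."""
    sig_tx = center_x / stride - cell_x    # offset inside the cell, in [0, 1)
    sig_ty = center_y / stride - cell_y
    t_w = math.log(width / anchor_wh[0])   # log ratio to the anchor width
    t_h = math.log(height / anchor_wh[1])  # log ratio to the anchor height
    return sig_tx, sig_ty, t_w, t_h

# A box centered at (216, 200) falls into cell (6, 6) of the 13x13 map.
targets = encode_box(216.0, 200.0, 116.0, 180.0,
                     cell_x=6, cell_y=6, anchor_wh=(116, 90), stride=32)
print(targets)
```

Here the width matches the anchor exactly, so t_w is 0, while the height is twice the anchor height, so t_h is log 2.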

Then the following line appears in the code:

    # box size punishment: the weight in front of the coordinate loss, smaller for large boxes and larger for small boxes
    # box with smaller area has bigger weight. This is taken from the yolo darknet C source code.
    # shape: [N, 13, 13, 3, 1]
    # the subtracted term is the ground-truth box area as a fraction of the original image area
    box_loss_scale = 2. - (y_true[..., 2:3] / tf.cast(self.img_size[1], tf.float32)) * (y_true[..., 3:4] / tf.cast(self.img_size[0], tf.float32))

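The role of box_loss_scale is also touched on in the original paper: for a large detection box, the position loss is far larger than for a small box. So the intent is clear: reduce the sensitivity to position errors of large boxes and increase it for small boxes.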
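A quick numeric check makes the effect of this weight concrete. The standalone function below is a hypothetical scalar version of the box_loss_scale line, assuming a 416x416 input.

```python
def box_loss_scale(box_w, box_h, img_w=416, img_h=416):
    """Weight between 1 and 2: close to 2 for tiny boxes, 1 for a full-image box."""
    return 2.0 - (box_w / img_w) * (box_h / img_h)

print(box_loss_scale(416, 416))    # 1.0 for a box covering the whole image
print(box_loss_scale(20.8, 20.8))  # about 1.9975 for a box covering 0.25% of it
```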

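With all that prepared, we reach the main stage of computing the loss. First we compute the position loss according to the loss formula above, namely

Figure: position loss

The code is: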

    # box loss follows
    # [N, 13, 13, 3, 1]
    mix_w = y_true[..., -1:]
    # shape: [N, 13, 13, 3, 1]
    xy_loss = tf.reduce_sum(tf.square(true_xy - pred_xy) * object_mask * box_loss_scale * mix_w) / N  # squared error
    wh_loss = tf.reduce_sum(tf.square(true_tw_th - pred_tw_th) * object_mask * box_loss_scale * mix_w) / N  # squared error

This corresponds one-to-one with the formula and is very intuitive.
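Written out directly from the two code lines, the position loss looks roughly as follows. This is a transcription of the code, not the formula image from the original post. The notation is an assumption: $i$ runs over all cells of all images in the batch, $j$ over the 3 anchors, hats mark predictions, $\mathbb{1}_{ij}^{obj}$ is object_mask, $m_{ij}$ is mix_w, $s_{ij}$ is box_loss_scale, $w_{ij}, h_{ij}$ are the ground-truth width and height, and $W, H$ are the image width and height.

```latex
\begin{aligned}
L_{xy} &= \frac{1}{N}\sum_{i}\sum_{j} \mathbb{1}_{ij}^{obj}\, m_{ij}\, s_{ij}
  \left[\bigl(\sigma(t_x)_{ij} - \sigma(\hat{t}_x)_{ij}\bigr)^2
      + \bigl(\sigma(t_y)_{ij} - \sigma(\hat{t}_y)_{ij}\bigr)^2\right] \\
L_{wh} &= \frac{1}{N}\sum_{i}\sum_{j} \mathbb{1}_{ij}^{obj}\, m_{ij}\, s_{ij}
  \left[\bigl(t_{w,ij} - \hat{t}_{w,ij}\bigr)^2
      + \bigl(t_{h,ij} - \hat{t}_{h,ij}\bigr)^2\right] \\
s_{ij} &= 2 - \frac{w_{ij}}{W}\cdot\frac{h_{ij}}{H}
\end{aligned}
```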

Next we compute the confidence loss. In the formula it is

Figure: confidence loss

The code is:

    # confidence loss follows
    # shape: [N, 13, 13, 3, 1]
    conf_pos_mask = object_mask  # 1 for the three slots of a cell responsible for an object, otherwise 0
    conf_neg_mask = (1 - object_mask) * ignore_mask  # not every non-responsible box is a negative sample: only those with IOU < 0.5 are, the rest are ignored
    conf_loss_pos = conf_pos_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=object_mask, logits=pred_conf_logits)
    conf_loss_neg = conf_neg_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=object_mask, logits=pred_conf_logits)
    # TODO: may need to balance the pos-neg by multiplying some weights
    conf_loss = conf_loss_pos + conf_loss_neg
    if self.use_focal_loss:
        alpha = 1.0
        gamma = 2.0
        # TODO: alpha should be a mask array if needed
        focal_mask = alpha * tf.pow(tf.abs(object_mask - tf.sigmoid(pred_conf_logits)), gamma)
        conf_loss *= focal_mask
    conf_loss = tf.reduce_sum(conf_loss * mix_w) / N

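The code distinguishes positive and negative confidence terms; it seems to me these are no different from $\mathbb{1}_{ij}^{obj}$ and $\mathbb{1}_{ij}^{noobj}$ above.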
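To see how the positive mask, the negative mask, and the optional focal weight interact, here is a scalar sketch for one anchor slot. The functions are hypothetical and use the textbook cross-entropy formula, which is less numerically robust than tf.nn.sigmoid_cross_entropy_with_logits for extreme logits.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_cross_entropy(label, logit):
    """Textbook binary cross-entropy on a logit (fine for moderate logits)."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def conf_loss(object_mask, ignore_mask, logit, use_focal=False, alpha=1.0, gamma=2.0):
    """Confidence loss of one anchor slot, mirroring the masks in the code above."""
    pos_mask = object_mask
    neg_mask = (1.0 - object_mask) * ignore_mask
    loss = (pos_mask + neg_mask) * sigmoid_cross_entropy(object_mask, logit)
    if use_focal:
        # down-weight slots whose predicted confidence is already close to the label
        loss *= alpha * abs(object_mask - sigmoid(logit)) ** gamma
    return loss

print(conf_loss(1.0, 0.0, 0.0))  # positive slot: about 0.693
print(conf_loss(0.0, 1.0, 0.0))  # negative slot (best IOU < 0.5): about 0.693
print(conf_loss(0.0, 0.0, 0.0))  # ignored slot (best IOU >= 0.5): 0.0
print(conf_loss(1.0, 0.0, 0.0, use_focal=True))  # about 0.173
```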

The last loss is the classification loss, which is very common! In the formula it is:

Figure: classification loss

In the code it is:

    # shape: [N, 13, 13, 3, 1]
    # whether to use label smooth
    if self.use_label_smooth:
        delta = 0.01
        label_target = (1 - delta) * y_true[..., 5:-1] + delta * 1. / self.class_num
    else:
        label_target = y_true[..., 5:-1]

    # classification loss
    class_loss = object_mask * tf.nn.sigmoid_cross_entropy_with_logits(labels=label_target, logits=pred_prob_logits) * mix_w
    class_loss = tf.reduce_sum(class_loss) / N
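The label-smoothing branch is easy to check by hand. The helper smooth_labels below is a hypothetical list-based version of the label_target line, with the same delta = 0.01.

```python
def smooth_labels(onehot, delta=0.01):
    """Move a little probability mass from the true class to all classes."""
    num_classes = len(onehot)
    return [(1.0 - delta) * y + delta / num_classes for y in onehot]

smoothed = smooth_labels([0.0, 1.0, 0.0, 0.0])
print(smoothed)  # roughly [0.0025, 0.9925, 0.0025, 0.0025]
```

With 4 classes the true class drops from 1 to about 0.9925 and every other class rises from 0 to about 0.0025, so the targets are never exactly 0 or 1.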

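That concludes the loss function. With it, backpropagation lets us happily train YOLO V3!

4. Summary

With that, the YOLO V3 analysis that dragged on for nearly half a year is complete. YOLO V4 came out not long ago, and if I have time later we can talk about EfficientDet and YOLO V4!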