Using `pytorch` to understand in depth the relationship between `CELoss`, `BCELoss`, and `NLLLoss`

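A loss function measures the difference between the predicted values and the true values. The smaller the loss, the smaller that difference, which indicates the network is fitting better. The loss function determines the direction in which a neural network learns, so it is critically important.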

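Most loss functions in pytorch accept `reduction='mean'` or `reduction='sum'` to choose between the mean and the total of the per-element losses; `reduction='none'` returns the unreduced per-element losses.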
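As a small illustration of the `reduction` argument, the sketch below runs `nn.L1Loss` on made-up values (the tensors `pred` and `true` are illustrative, not from the article) under each of the three settings.

```python
import torch

pred = torch.tensor([1.0, 2.0, 4.0])   # illustrative predictions
true = torch.tensor([1.5, 2.0, 2.0])   # illustrative targets

# Per-element absolute errors: |1.0-1.5|, |2.0-2.0|, |4.0-2.0|
per_element = torch.nn.L1Loss(reduction='none')(pred, true)
total = torch.nn.L1Loss(reduction='sum')(pred, true)
mean = torch.nn.L1Loss(reduction='mean')(pred, true)

print(per_element)  # tensor([0.5000, 0.0000, 2.0000])
print(total)        # tensor(2.5000)
print(mean)         # tensor(0.8333)
```

`'sum'` and `'mean'` are just the sum and the mean of what `'none'` returns.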

`L1 Loss`

L1 Loss is the absolute-value loss: the absolute value of the error between the predicted and true values.

$$L1(x, y) = \frac{1}{N} \sum_{i=1}^{N} |x_i - y_i|$$

or

$$L1(x, y) = \sum_{i=1}^{N} |x_i - y_i|$$

`L2 Loss`

L2 Loss is usually also called MSE Loss; in pytorch it is nn.MSELoss. It is the mean squared error loss: the square of the error between the predicted and true values.

$$L2(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$

or

$$L2(x, y) = \sum_{i=1}^{N} (x_i - y_i)^2$$
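To connect the L1 and L2 formulas to the library, here is a minimal sketch that evaluates both by hand and compares them with `nn.L1Loss` and `nn.MSELoss`. The input values are made up for illustration.

```python
import torch

x = torch.tensor([0.5, 1.5, -2.0, 3.0])   # illustrative predictions
y = torch.tensor([1.0, 1.0, 0.0, 3.0])    # illustrative targets

# Hand-written versions of the two mean-reduced formulas
l1_manual = (x - y).abs().mean()
l2_manual = ((x - y) ** 2).mean()

l1_builtin = torch.nn.L1Loss()(x, y)
l2_builtin = torch.nn.MSELoss()(x, y)

print(l1_manual, l1_builtin)  # tensor(0.7500) tensor(0.7500)
print(l2_manual, l2_builtin)  # tensor(1.1250) tensor(1.1250)
```

The single large error (2.0) contributes 2.0 to the L1 sum but 4.0 to the L2 sum, which is why the squared loss reacts more strongly to outliers.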

`Smooth L1 Loss`

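Smooth L1 Loss is a smoothed version of L1 Loss. L2 Loss is sensitive to outliers because it squares large errors, while the absolute value in L1 Loss is not differentiable at 0 and its gradient does not shrink as the error approaches 0. Smooth L1 Loss is quadratic (strongly convex) near 0 and linear elsewhere, so it combines the advantages of the squared loss and the absolute loss.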

$$SmoothL1(x, y) = \frac{1}{N} \sum_{i=1}^{N} z_i$$

$$z_i = \begin{cases} 0.5 (x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - 0.5, & \text{otherwise} \end{cases}$$
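The piecewise definition can be checked with a few lines of code. This is a sketch under the assumption that `nn.SmoothL1Loss` is used with its default settings (threshold 1, mean reduction); `smooth_l1_manual` is a hypothetical helper name and the inputs are made up.

```python
import torch

def smooth_l1_manual(x, y):
    """Hypothetical helper: the piecewise formula, averaged over all elements."""
    diff = (x - y).abs()
    # Quadratic branch for small errors, linear branch for large ones
    z = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)
    return z.mean()

x = torch.tensor([0.2, 3.0, -1.0])   # illustrative predictions
y = torch.tensor([0.0, 0.5, 1.0])    # illustrative targets

manual = smooth_l1_manual(x, y)
builtin = torch.nn.SmoothL1Loss()(x, y)
print(manual, builtin)  # tensor(1.1733) tensor(1.1733)
```

The errors are 0.2, 2.5 and 2.0, giving z = 0.02, 2.0 and 1.5; only the first falls in the quadratic branch.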

Cross-entropy loss

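Cross-entropy measures how far one probability distribution is from another; here, how far the predicted distribution is from the true one. The smaller the cross-entropy, the closer the two distributions are.

Cross-entropy formula: $H(p, q) = -\sum_{k=1}^{n} p_k \log(q_k)$, where $p_k$ is the true probability of class $k$ (with one-hot labels it is 1 for the correct class and 0 otherwise) and $q_k$ is the predicted probability of class $k$.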
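A short worked example may make the formula concrete. The sketch assumes $p$ is a one-hot true distribution and $q$ a predicted distribution; the function name `cross_entropy` and all numbers are illustrative.

```python
import torch

p = torch.tensor([0.0, 0.0, 1.0])        # true distribution: one-hot, class 2
q_good = torch.tensor([0.1, 0.1, 0.8])   # prediction close to p
q_bad = torch.tensor([0.6, 0.3, 0.1])    # prediction far from p

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k)"""
    return -(p * torch.log(q)).sum()

h_good = cross_entropy(p, q_good)
h_bad = cross_entropy(p, q_bad)
print(h_good)  # tensor(0.2231)
print(h_bad)   # tensor(2.3026)
```

With a one-hot $p$, only the term of the correct class survives, so the value is simply $-\log$ of the probability assigned to that class.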

The cross-entropy losses in torch.nn all accept a `weight` argument, so classes can be weighted, for example according to their sample counts.
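As a sketch of what `weight` does in `nn.CrossEntropyLoss`, the example below uses hypothetical per-class weights and reproduces the weighted mean by hand: each sample's loss is scaled by the weight of its target class, and the sum is divided by the sum of those weights.

```python
import torch

logits = torch.tensor([[0.5796, 0.4403, 0.9087],
                       [-1.5673, -0.3150, 1.6660]])
target = torch.tensor([0, 2])
weight = torch.tensor([2.0, 1.0, 0.5])   # hypothetical per-class weights

weighted = torch.nn.CrossEntropyLoss(weight=weight)(logits, target)

# Manual check: unweighted per-sample losses, scaled by the weight of each
# sample's target class, normalised by the sum of the weights that were used.
per_sample = torch.nn.CrossEntropyLoss(reduction='none')(logits, target)
w = weight[target]
manual = (w * per_sample).sum() / w.sum()

print(weighted, manual)
```

Note that with `reduction='mean'` the normaliser is the sum of the applied weights, not the number of samples.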

`nn.NLLLoss`

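NLLLoss: negative log likelihood loss.

Formula: $\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = -x_{n,y_n}$

At first glance this formula has nothing to do with a logarithm: it simply picks, for each row, the value in the column of the target class and negates it. The log comes from the input: nn.NLLLoss expects `x` to already contain log-probabilities, for example the output of log_softmax.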

The code below reproduces it:

import torch
# Predictions (raw values, used here only to show the indexing)
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],[-1.5673, -0.3150, 1.6660]])
# True classes
target = torch.tensor([0, 2])
result = 0
for i in range(target.shape[0]):
    # Picks 0.5796 and 1.6660,
    # i.e. predict[0][0] and predict[1][2]
    result -= predict[i][target[i]]
print(result / target.shape[0])
# tensor(-1.1228)
loss = torch.nn.NLLLoss()
print(loss(predict, target))
# tensor(-1.1228)
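The row-wise lookup can also be written without a Python loop. The sketch below uses integer-array indexing to pick one entry per row and compares the result with `nn.NLLLoss(reduction='none')`; the variable names `rows` and `per_sample` are illustrative.

```python
import torch

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
target = torch.tensor([0, 2])

rows = torch.arange(target.shape[0])
per_sample = -predict[rows, target]     # l_n = -x[n, y_n]
print(per_sample)                       # tensor([-0.5796, -1.6660])
print(per_sample.mean())                # tensor(-1.1228)

builtin = torch.nn.NLLLoss(reduction='none')(predict, target)
```

Because the inputs here are raw values rather than log-probabilities, the result is negative; it only becomes a meaningful loss once the input has gone through log_softmax.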

`nn.CrossEntropyLoss`

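This is CELoss, the cross-entropy loss. It is equivalent to applying log_softmax to predict and then running nn.NLLLoss.

Formula: $CELoss(x, y) = -\sum_i y_i \log(x_i)$, where $x_i$ is the softmax probability of class $i$ and $y_i$ is the one-hot label.

The computation proceeds as follows:

1. Apply softmax to the predictions to get a probability distribution for each sample.
2. Take the log of the probabilities, which turns products into sums and makes the computation cheaper.
3. For each row, pick the log value at the target class and negate it, then sum or average.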

import torch
# Predictions
# predict has shape [2, 3]: two samples, each with scores for three classes
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],[-1.5673, -0.3150, 1.6660]])
# True classes
# len(target) equals predict.shape[0]; its largest value is predict.shape[1] - 1
# i.e. row 0 takes index 0 and row 1 takes index 2
target = torch.tensor([0, 2])
ce_loss = torch.nn.CrossEntropyLoss()
# The input here is the raw predictions
print(ce_loss(predict, target))
# tensor(0.6725)
soft_max = torch.nn.Softmax(dim=-1)
soft_out = soft_max(predict)
# tensor([[0.3068, 0.2669, 0.4263],
#         [0.0335, 0.1172, 0.8494]])
log_soft_out = torch.log(soft_max(predict))
# tensor([[-1.1816, -1.3209, -0.8525],
#         [-3.3966, -2.1443, -0.1633]])
nll_loss = torch.nn.NLLLoss()
# The input here has already gone through log_softmax
print(nll_loss(log_soft_out, target))
# tensor(0.6725)

`nn.BCELoss`

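Binary cross-entropy loss. Formula: $BCELoss(x, y) = -(y \log(x) + (1 - y) \log(1 - x))$

The formula shows that, compared with CELoss, BCELoss also includes the complementary term $(1 - y)\log(1 - x)$, so it scores the negative case as well.

For a binary (0-1) problem, $y$ is either 0 or 1, so one of the two terms always vanishes. The formula then starts to look a lot like CELoss.

BCELoss places two requirements on its input:

1. predict and target must have the same shape.
2. The values of predict must lie in the range 0~1.

Requirement 1 says predict and target must match in shape, so how do we construct target when using BCELoss on a multi-class problem? This is where the one-hot format comes in.

And for requirement 2, how do we control the value range? The Softmax mentioned above is a perfectly good mapping into 0~1.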
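A minimal sketch of how both requirements can be met in practice, assuming class-index targets like the ones used for CELoss: `torch.nn.functional.one_hot` builds the one-hot target, and softmax maps the predictions into (0, 1). The `.float()` cast is needed because `one_hot` returns integers while BCELoss expects floating-point targets.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0, 2])                        # class indices, as used by CELoss
one_hot = F.one_hot(target, num_classes=3).float()   # same shape as predict
print(one_hot)
# tensor([[1., 0., 0.],
#         [0., 0., 1.]])

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
prob = torch.softmax(predict, dim=-1)                # every entry lies in (0, 1)
loss = torch.nn.BCELoss()(prob, one_hot)
print(loss)
```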

Binary classification

From the above, CELoss is computed as Softmax + log + NLLLoss on the predictions.

For a binary problem, each row of the predictions sums to 1 after Softmax (with Softmax applied over the last dimension).

That is, soft_out[:, 0] = 1 - soft_out[:, 1].

So let $soft\_out[:, 0] = x_0$ and $soft\_out[:, 1] = x_1$; then $x_0 = 1 - x_1$, and

$$\log(x_1) = \log(1 - x_0)$$

In other words, for a binary problem each row of the predictions becomes $\log(x_0)$ and $\log(1 - x_0)$ after Softmax + log, and y is always either 0 or 1.

The features are therefore $[x_0, 1 - x_0]$, and the label is either [1, 0] or [0, 1].

Substituting into the BCELoss formula, the two elements of each row are:

either $[-\log(x_0), -\log(1 - (1 - x_0))]$, i.e. $[-\log(x_0), -\log(x_0)]$ (label [1, 0]),

or $[-\log(1 - x_0), -\log(1 - x_0)]$ (label [0, 1]).

So on a binary problem, BCELoss after Softmax yields the same value for both elements of each row!

Taking the mean then amounts to picking one value per row, summing them, and dividing by the number of rows.

predict = torch.tensor([[0.9346, 0.8287],[0.5189, 0.3842],[0.8615, 0.8318],[0.6799, 0.4911]])
soft_max = torch.nn.Softmax(dim=-1)
soft_out = soft_max(predict)
bce_target = torch.Tensor([[0, 1],[1, 0],[0, 1],[1, 0]])
bce_result = - bce_target * torch.log(soft_out) - (1.0 - bce_target) * torch.log(1.0 - soft_out)
# tensor([[0.7475, 0.7475],
#         [0.6281, 0.6281],
#         [0.7081, 0.7081],
#         [0.6032, 0.6032]])

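Now look at CELoss again. As said before, it is equivalent to applying log_softmax to predict and then running nn.NLLLoss: the predictions go through Softmax, then log, then the value at the index of each row's true class is picked out and averaged.

Run it on the same predictions: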

predict = torch.tensor([[0.9346, 0.8287],[0.5189, 0.3842],[0.8615, 0.8318],[0.6799, 0.4911]])
soft_max = torch.nn.Softmax(dim=-1)
soft_out = soft_max(predict)
log_soft_out = - torch.log(soft_out)
# tensor([[0.6416, 0.7475],
#         [0.6281, 0.7628],
#         [0.6784, 0.7081],
#         [0.6032, 0.7920]])

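To get the same target as for BCELoss, set ce_target to [1, 0, 1, 0].

Now we see that the value CELoss picks from each row equals the value in the corresponding row of bce_result!

(Substituting into the formulas shows the same thing: on a binary problem, once predict has gone through Softmax, CELoss and BCELoss coincide. Working through the matrices by hand makes this obvious.)

Conclusion: on a binary problem, CELoss is Softmax + BCELoss.

Let's verify: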
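For readers who prefer to see the substitution written out, here is a sketch of the derivation for a single row. It assumes the softmax output is $(x_0, x_1)$ with $x_0 + x_1 = 1$ and the one-hot target is $(y_0, y_1)$ with $y_0 + y_1 = 1$; these symbols extend the $x_0, x_1$ notation used above.

```latex
\begin{aligned}
\text{BCE}_0 &= -\bigl[y_0 \log x_0 + (1 - y_0)\log(1 - x_0)\bigr]
              = -\bigl[y_0 \log x_0 + y_1 \log x_1\bigr] \\
\text{BCE}_1 &= -\bigl[y_1 \log x_1 + (1 - y_1)\log(1 - x_1)\bigr]
              = -\bigl[y_1 \log x_1 + y_0 \log x_0\bigr] \\
\tfrac{1}{2}\bigl(\text{BCE}_0 + \text{BCE}_1\bigr)
             &= -\sum_{k=0}^{1} y_k \log x_k = \text{CE}
\end{aligned}
```

Both elements of the row equal the cross-entropy of that row, so averaging over the two elements, and then over the rows, gives the same number as CELoss.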

import torch
# Predictions
predict = torch.rand([2, 2])
# True classes as indices
ce_target = torch.tensor([1, 0])
# 1. CELoss
ce_loss = torch.nn.CrossEntropyLoss()
print(ce_loss(predict, ce_target))
# 2. Softmax + BCELoss
soft_max = torch.nn.Softmax(dim=-1)
soft_out = soft_max(predict)
# One-hot version of ce_target
bce_target = torch.Tensor([[0, 1],[1, 0]])
bce_loss = torch.nn.BCELoss()
print(bce_loss(soft_out, bce_target))
# 3. Hand-written BCELoss
bce_result = - bce_target * torch.log(soft_out) - (1.0 - bce_target) * torch.log(1.0 - soft_out)
print(bce_result.mean())
# 4. Softmax + log + NLLLoss
log_soft_out = torch.log(soft_out)
nll_loss = torch.nn.NLLLoss()
print(nll_loss(log_soft_out, ce_target))

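Woohoo~ it all ties together.

Note that ce_target and bce_target must carry the same meaning: compare the comments on target in the nn.CrossEntropyLoss code above with bce_target to see how they correspond.

Multi-class classification

Now let's see how BCELoss handles multi-class problems. If it couldn't handle them, BCELoss wouldn't be used so often in object detection networks.

First, compare CELoss and BCELoss on a multi-class problem to see whether they differ: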

import torch
# Predictions
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],[-1.5673, -0.3150, 1.6660]])
# True classes as indices
ce_target = torch.tensor([2, 0])
# 1. CELoss
ce_loss = torch.nn.CrossEntropyLoss()
print('ce_loss:', ce_loss(predict, ce_target)) # ce_loss: tensor(2.1246)
# 2. Softmax + BCELoss
soft_input = torch.nn.Softmax(dim=-1)
soft_out = soft_input(predict)
# One-hot version of ce_target
bce_target = torch.Tensor([[0, 0, 1],[1, 0, 0]])
bce_loss = torch.nn.BCELoss()
print('bce_loss:', bce_loss(soft_out, bce_target)) # bce_loss: tensor(1.1572)
# 3. Softmax + log + NLLLoss
log_soft_out = torch.log(soft_out)
nll_loss = torch.nn.NLLLoss()
print('nll_loss:', nll_loss(log_soft_out, ce_target)) # nll_loss: tensor(2.1246)

As we can see, on a multi-class problem CELoss and BCELoss no longer give the same result.

The following code compares the binary and three-class cases:

import torch
# Binary predictions (random, so the values below are from one sample run)
predict_2 = torch.rand([3, 2])
# tensor([[0.6718, 0.8155],
#         [0.6771, 0.1240],
#         [0.7621, 0.3166]])
soft_input = torch.nn.Softmax(dim=-1)
# Softmax result for the binary case
soft_out_2 = soft_input(predict_2)
# tensor([[0.4641, 0.5359],
#         [0.6349, 0.3651],
#         [0.6096, 0.3904]])
# Three-class predictions
predict_3 = torch.rand([2, 3])
# tensor([[0.0098, 0.5813, 0.9645],
#         [0.4855, 0.5245, 0.4162]])
# Softmax result for the three-class case
soft_out_3 = soft_input(predict_3)
# tensor([[0.1863, 0.3299, 0.4839],
#         [0.3364, 0.3498, 0.3139]])

For the binary problem, each row of soft_out_2 has only two elements and they sum to 1, i.e. soft_out_2[:, 0] + soft_out_2[:, 1] = 1.

Suppose the one-hot target of the first row is [0, 1], so its first element is 0. Then in the BCELoss formula $BCELoss(x, y) = -(y \log(x) + (1 - y) \log(1 - x))$,

$$BCELoss(soft\_out\_2[0][0], 0) = -\log(1 - soft\_out\_2[0][0]) = -\log(soft\_out\_2[0][1])$$

$$BCELoss(soft\_out\_2[0][1], 1) = -\log(soft\_out\_2[0][1])$$

The two are identical. In other words, on a binary problem every element in a row of the BCELoss result is the same, so the mean over a row is simply the value of any single element in that row.

For the three-class problem, however, each row of soft_out_3 has three elements that sum to 1.

Again supposing the first element of the target row is 0, the elements in each row of the BCELoss result are no longer equal, so the final result differs too.

This is where BCELoss shows its advantage over CELoss on multi-class problems: CELoss only picks the value at the target class of each row, whereas BCELoss takes every element of the row into account.
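The difference in the three-class case can be made visible by asking both losses for unreduced values. This is a sketch reusing the article's three-class predictions; the names `bce_elem` and `ce_row` are illustrative.

```python
import torch

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])
prob = torch.softmax(predict, dim=-1)
bce_target = torch.tensor([[0., 0., 1.],
                           [1., 0., 0.]])
ce_target = bce_target.argmax(dim=-1)      # class indices: tensor([2, 0])

# One BCE value per element, one CE value per row
bce_elem = torch.nn.BCELoss(reduction='none')(prob, bce_target)
ce_row = torch.nn.CrossEntropyLoss(reduction='none')(predict, ce_target)

print(bce_elem)
print(ce_row)

# The BCE entry in the target column is exactly the CE value of that row...
rows = torch.arange(ce_target.shape[0])
print(torch.allclose(bce_elem[rows, ce_target], ce_row))  # True
```

The remaining entries of `bce_elem` are the $-\log(1 - x)$ penalties on the non-target classes; CELoss never looks at them directly, which is why the two means differ.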

`BCEWithLogitsLoss`

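As said above, BCELoss places two requirements on its input:

1. predict and target must have the same shape.
2. The values of predict must lie in the range 0~1.

The shape issue between predict and target was solved by building target in one-hot form.

So how do we bring the values of predict into the range 0~1?

The Softmax mentioned earlier is one way, and sigmoid is another.

**The outputs of Softmax sum to 1, so the elements within a row depend on one another, whereas the outputs of sigmoid are mutually independent.**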
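The contrast between the two mappings can be checked numerically. In this sketch a single logit is raised by an arbitrary amount (5.0, chosen only for illustration) and the other entries of the same row are compared before and after.

```python
import torch

predict = torch.tensor([[0.5796, 0.4403, 0.9087],
                        [-1.5673, -0.3150, 1.6660]])

soft = torch.softmax(predict, dim=-1)
sig = torch.sigmoid(predict)

changed = predict.clone()
changed[0, 0] += 5.0      # raise a single logit in row 0

# Sigmoid: the untouched entries of row 0 keep their values.
sig_same = torch.allclose(torch.sigmoid(changed)[0, 1:], sig[0, 1:])
# Softmax: the untouched entries shrink so that the row still sums to 1.
soft_same = torch.allclose(torch.softmax(changed, dim=-1)[0, 1:], soft[0, 1:])

print(sig_same, soft_same)  # True False
```

This independence is what makes sigmoid a natural fit for targets where several classes can be active at once.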

BCEWithLogitsLoss applies sigmoid to the data first and then BCELoss. That is, BCEWithLogitsLoss = sigmoid + BCELoss.

import torch
# Predictions
predict = torch.Tensor([[0.5796, 0.4403, 0.9087],[-1.5673, -0.3150, 1.6660]])
# One-hot targets
bce_target = torch.Tensor([[0, 0, 1],[1, 0, 0]])
# BCEWithLogitsLoss takes the raw predictions
bce_logits_loss = torch.nn.BCEWithLogitsLoss()
print(bce_logits_loss(predict, bce_target))
# BCELoss takes the predictions after sigmoid; both prints show the same value
sigmoid_out = torch.sigmoid(predict)
bce_loss = torch.nn.BCELoss()
print(bce_loss(sigmoid_out, bce_target))

`Focal Loss`

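Focal Loss was proposed by Kaiming He and his co-authors to deal with a large imbalance between positive and negative samples. pytorch itself does not ship a corresponding function yet; I'll add a note on it later. For now, see this blog: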

https://blog.csdn.net/cp1314971/article/details/105559545/
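Until that note exists, here is a minimal sketch of a binary focal loss built on top of the sigmoid + BCE relationship discussed above. It follows the commonly cited form $FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$; the function name `binary_focal_loss` and its parameters are hypothetical and not part of pytorch, and this is an illustration rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Hypothetical helper: mean binary focal loss computed from raw logits."""
    # Per-element BCE equals -log(p_t), where p_t is the probability of the true label
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)
    loss = (1 - p_t) ** gamma * bce              # down-weight well-classified elements
    if alpha is not None:
        # Optional class-balancing factor: alpha for positives, 1 - alpha for negatives
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        loss = alpha_t * loss
    return loss.mean()

logits = torch.tensor([[0.5796, 0.4403, 0.9087],
                       [-1.5673, -0.3150, 1.6660]])
targets = torch.tensor([[0., 0., 1.],
                        [1., 0., 0.]])

plain = binary_focal_loss(logits, targets, gamma=0.0)    # reduces to BCEWithLogitsLoss
focal = binary_focal_loss(logits, targets, gamma=2.0)    # easy elements count less
print(plain, focal)
```

With `gamma=0` and no `alpha`, the modulating factor is 1 and the result matches BCEWithLogitsLoss, which is a convenient sanity check.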