Focal loss: a summary

Original paper

https://arxiv.org/pdf/1708.02002.pdf

The part of the paper that explains the idea behind focal loss

As our experiments will show, the large class imbalance encountered during training of dense detectors overwhelms the cross entropy loss. Easily classified negatives comprise the majority of the loss and dominate the gradient. While α balances the importance of positive/negative examples, it does not differentiate between easy/hard examples. Instead, we propose to reshape the loss function to down-weight easy examples and thus focus training on hard negatives.

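In plain terms: when training a dense detector, the severe class imbalance overwhelms the cross entropy loss. Easily classified negative samples make up most of the loss and dominate the gradient. α balances the importance of positive and negative examples, but it does not distinguish easy examples from hard ones. The paper therefore reshapes the loss function so that easy examples are down-weighted and training concentrates on hard negatives.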

More formally, we propose to add a modulating factor (1-pt)^γ to the cross entropy loss, with tunable focusing parameter γ ≥ 0. We define the focal loss as:
$$FL(p_t) = -(1-p_t)^{\gamma} \log(p_t)$$

The above is the definition of the FL(pt) loss function.
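As a rough illustration of the definition, here is a minimal sketch for a single binary example. It assumes the paper's convention that pt is the probability the model assigns to the true class (p for a positive, 1 − p for a negative). The function names `focal_loss` and `cross_entropy` are made up for this note and are not taken from any library.

```python
import math

def cross_entropy(p, y):
    """Cross entropy for one binary example; p is the predicted probability of class 1."""
    # p_t is the probability assigned to the true class (assumed convention).
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t) for one binary example."""
    p_t = p if y == 1 else 1.0 - p
    # The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With `gamma=0.0` the modulating factor is 1, so `focal_loss(p, y, gamma=0.0)` should match `cross_entropy(p, y)`.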

The focal loss is visualized for several values of γ ∈ [0, 5] in Figure 1. We note two properties of the focal loss.
(1) When an example is misclassified and pt is small, the modulating factor is near 1 and the loss is unaffected. As pt → 1, the factor goes to 0 and the loss for well-classified examples is down-weighted. (2) The focusing parameter γ smoothly adjusts the rate at which easy examples are down-weighted. When γ = 0, FL is equivalent to CE, and as γ is increased the effect of the modulating factor is likewise increased (we found γ = 2 to work best in our experiments).

Intuitively, the modulating factor reduces the loss contribution from easy examples and extends the range in which an example receives low loss. For instance, with γ = 2, an example classified with pt = 0.9 would have 100× lower loss compared with CE and with pt ≈ 0.968 it would have 1000× lower loss. This in turn increases the importance of correcting misclassified examples (whose loss is scaled down by at most 4× for pt ≤ .5 and γ = 2).

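In plain terms: Figure 1 of the paper plots the focal loss for several values of γ ∈ [0, 5], and two properties stand out.
First, when a sample is misclassified and pt is small, the modulating factor is close to 1 and the loss is essentially unchanged. As pt → 1 the factor goes to 0, so the loss of well-classified samples is down-weighted. Second, the focusing parameter γ smoothly controls how quickly easy samples are down-weighted. With γ = 0, FL is the same as CE, and the effect of the modulating factor grows as γ increases (the authors found γ = 2 to work best in their experiments).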

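Intuitively, the modulating factor reduces the loss contributed by easy samples and widens the range of pt over which a sample receives a low loss. For example, with γ = 2, a sample with pt = 0.9 has a loss 100 times lower than under CE, and a sample with pt ≈ 0.968 has a loss about 1000 times lower. This in turn makes correcting misclassified samples relatively more important, since for pt ≤ 0.5 and γ = 2 their loss is reduced by at most a factor of 4.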
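The figures quoted above follow directly from the modulating factor (1 − pt)^γ: the loss is lower than CE by a factor of 1 / (1 − pt)^γ. A small sketch to check the arithmetic (the helper name `modulating_factor` is made up for this note):

```python
def modulating_factor(p_t, gamma=2.0):
    """The factor (1 - p_t)^gamma that multiplies the cross entropy loss."""
    return (1.0 - p_t) ** gamma

# How many times lower the focal loss is than CE for the paper's example values.
for p_t in (0.9, 0.968, 0.5):
    reduction = 1.0 / modulating_factor(p_t)
    print(f"p_t = {p_t}: loss is about {reduction:.1f}x lower than CE")
```

With γ = 2 this gives roughly 100 for pt = 0.9, about 977 (which the paper rounds to 1000) for pt = 0.968, and exactly 4 for pt = 0.5.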

In practice we use an α-balanced variant of the focal loss:
$$FL(p_t) = -\alpha_t (1-p_t)^{\gamma} \log(p_t)$$

We adopt this form in our experiments as it yields slightly improved accuracy over the non-α-balanced form. Finally, we note that the implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.

While in our main experimental results we use the focal loss definition above, its precise form is not crucial. In the appendix we consider other instantiations of the focal loss and demonstrate that these can be equally effective.

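In plain terms: the authors use the α-balanced form in their experiments because it is slightly more accurate than the form without α. They also note that the loss layer is implemented so that the sigmoid that computes p is combined with the loss computation, which gives better numerical stability.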
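One way to picture "combining the sigmoid with the loss computation" is to compute log(pt) directly from the raw logit instead of first computing p and then taking its logarithm. The sketch below is an assumption about how this could be done, not the authors' actual loss layer. It assumes αt = α for positives and 1 − α for negatives, as in the paper, and uses α = 0.25 with γ = 2 as defaults (the setting the paper reports working best). The names `softplus` and `focal_loss_from_logit` are made up for this note.

```python
import math

def softplus(z):
    """log(1 + exp(z)), written so that exp never overflows for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def focal_loss_from_logit(x, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary example, computed from the logit x."""
    # log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x),
    # so log(p_t) is obtained without ever forming sigmoid(x) explicitly.
    log_p_t = -softplus(-x) if y == 1 else -softplus(x)
    p_t = math.exp(log_p_t)
    # alpha_t mirrors p_t: alpha for positives, 1 - alpha for negatives (assumed convention).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t
```

A naive version that evaluates `1 / (1 + math.exp(-x))` and then takes `math.log` raises an overflow error for a logit such as x = -1000, while the form above returns a finite loss.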

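The main experimental results use the focal loss definition above, but its exact form is not crucial. In the appendix the authors consider other instantiations of the focal loss and show that they can be equally effective.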