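Contents

Label Smoothing
- What problem does it solve
- Implementation 1: torch.nn.functional.cross_entropy
  - The label_smoothing argument
- Implementation 2: Hugging Face transformers
  - Computing the log_softmax of the logits
  - Counting the valid samples
  - nll_loss
  - smoothed_loss
  - Putting it together
- Implementation 3: modifying the label values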

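Label Smoothing

What problem does it solve

When a model is trained with cross-entropy on hard (one-hot) labels, it is pushed to make the gap between the logit of the correct label and the logits of the incorrect labels arbitrarily large. An overly large logit gap makes the model overconfident and less adaptable. In practice some annotations are simply wrong, and their impact is larger when training data is scarce. Label smoothing can mitigate this.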
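To make the claim concrete, here is a minimal sketch (the helper `loss_at_gap` and the chosen gap values are illustrative assumptions, not from the original post). It evaluates the loss for one sample whose correct-class logit exceeds the other four by a given gap. With hard labels the loss keeps shrinking as the gap grows; with `label_smoothing=0.1` and 5 classes the loss is smallest near a gap of ln 46, where the softmax output equals the smoothed target [0.92, 0.02, 0.02, 0.02, 0.02], and it grows again beyond that.

```python
import math

import torch
import torch.nn.functional as F


def loss_at_gap(gap: float, label_smoothing: float = 0.0, num_classes: int = 5) -> float:
    """Loss for one sample whose correct-class logit exceeds all others by `gap`.

    Hypothetical helper for illustration only.
    """
    logits = torch.zeros(1, num_classes)
    logits[0, 0] = gap              # class 0 is the correct class
    target = torch.tensor([0])
    return F.cross_entropy(logits, target, label_smoothing=label_smoothing).item()


# ln(46) is the gap at which softmax gives 0.92 to the correct class when K = 5
gaps = [1.0, math.log(46.0), 10.0, 20.0]
hard = [loss_at_gap(g) for g in gaps]
smooth = [loss_at_gap(g, label_smoothing=0.1) for g in gaps]

for g, h, s in zip(gaps, hard, smooth):
    print(f"gap={g:6.3f}  hard={h:.6f}  smoothed={s:.6f}")
```

The hard-label column never stops rewarding a larger gap, while the smoothed column penalizes gaps beyond the point where the prediction matches the smoothed target.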

Implementation 1: torch.nn.functional.cross_entropy

See the official PyTorch documentation.

import torch
import torch.nn as nn
import torch.nn.functional as F

Build some example data:

logits = torch.randn(3, 5, requires_grad=True)  # 3 samples, 5 classes
labels = torch.randint(5, (3,), dtype=torch.int64)
print(logits)
print(labels)
loss = F.cross_entropy(logits, labels)
print(loss)

Output:

tensor([[ 1.2215, -0.9342, -0.1349, -0.5035,  0.4492],
        [-0.1989,  0.2589,  2.1556, -0.5734,  0.8668],
        [ 0.1215,  0.5060,  0.5376, -1.7609, -0.6566]], requires_grad=True)
tensor([2, 3, 2])
tensor(2.1185, grad_fn=<NllLossBackward0>)

Check: the cross-entropy loss is equivalent to nll_loss(log_softmax(logits), labels).

# dim is passed explicitly; omitting it relies on a deprecated implicit choice
print(F.nll_loss(F.log_softmax(logits, dim=-1), labels))  # tensor(2.1185, grad_fn=<NllLossBackward0>)

The label_smoothing argument

F.cross_entropy(logits, labels, label_smoothing=0.1)

Result:

tensor(2.1038, grad_fn=<AddBackward0>)
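The same option is also available on the module form, `nn.CrossEntropyLoss`, which is often more convenient inside a training loop. The short sketch below uses its own seeded random data (an assumption, so the numbers differ from the ones printed above) and checks that the module and functional forms agree, and that `label_smoothing=0.0` reproduces the plain cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)                          # 3 samples, 5 classes
labels = torch.randint(5, (3,), dtype=torch.int64)

# Module form: the smoothing factor is fixed when the criterion is built
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss_module = criterion(logits, labels)

# Functional form: the smoothing factor is passed on every call
loss_functional = F.cross_entropy(logits, labels, label_smoothing=0.1)

# A factor of 0.0 disables smoothing
loss_no_smoothing = F.cross_entropy(logits, labels, label_smoothing=0.0)
loss_plain = F.cross_entropy(logits, labels)

print(loss_module, loss_functional, loss_plain)
```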

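Implementation 2: Hugging Face transformers

In the Hugging Face transformers source, the Trainer computes the loss with loss = self.label_smoother(outputs, labels) when label smoothing is enabled. The implementation is analyzed step by step below.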

# from huggingface transformers/trainer_pt_utils.py
class LabelSmoother:
    """
    Adds label-smoothing on a pre-computed output from a Transformers model.

    Args:
        epsilon (`float`, *optional*, defaults to 0.1):
            The label smoothing factor.
        ignore_index (`int`, *optional*, defaults to -100):
            The index in the labels to ignore when computing the loss.
    """

    epsilon: float = 0.1
    ignore_index: int = -100

    def __call__(self, model_output, labels):
        logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
        log_probs = -nn.functional.log_softmax(logits, dim=-1)
        if labels.dim() == log_probs.dim() - 1:
            labels = labels.unsqueeze(-1)

        padding_mask = labels.eq(self.ignore_index)
        # In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask
        # will ignore them in any case.
        labels = torch.clamp(labels, min=0)
        nll_loss = log_probs.gather(dim=-1, index=labels)
        # works for fp16 input tensor too, by internally upcasting it to fp32
        smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

        nll_loss.masked_fill_(padding_mask, 0.0)
        smoothed_loss.masked_fill_(padding_mask, 0.0)

        # Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):
        num_active_elements = padding_mask.numel() - padding_mask.long().sum()
        nll_loss = nll_loss.sum() / num_active_elements
        smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
        return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss

Computing the log_softmax of the logits

log_probs = -F.log_softmax(logits, dim=-1)  # note the minus sign: these are negative log-probabilities
if labels.dim() == log_probs.dim() - 1:
    labels = labels.unsqueeze(-1)
print(log_probs)
print(labels)

Result:

tensor([[0.6999, 2.8556, 2.0563, 2.4249, 1.4721],
        [2.8157, 2.3578, 0.4612, 3.1901, 1.7499],
        [1.5253, 1.1407, 1.1092, 3.4077, 2.3033]], grad_fn=<NegBackward0>)
tensor([[2],
        [3],
        [2]])

Counting the valid samples

padding_mask = labels.eq(-100)
labels = torch.clamp(labels, min=0)
print(padding_mask)
print(labels)

Result:

tensor([[False],
        [False],
        [False]])
tensor([[2],
        [3],
        [2]])
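The example data contains no ignored positions, so the mask is all False and the clamp changes nothing. The sketch below uses a hypothetical batch with one label set to -100 (the constant name `IGNORE_INDEX` and the label values are assumptions for illustration) to show what the two lines do when padding is present: the mask records which positions to skip, and the clamp replaces -100 by 0 so that a later index lookup does not fail.

```python
import torch

IGNORE_INDEX = -100  # the default ignore_index used by LabelSmoother

# Hypothetical labels, already unsqueezed to shape (3, 1); the middle one is padding
labels = torch.tensor([[2], [IGNORE_INDEX], [4]])

padding_mask = labels.eq(IGNORE_INDEX)       # True where the position must be skipped
safe_labels = torch.clamp(labels, min=0)     # -100 becomes 0, a valid class index
num_active = padding_mask.numel() - padding_mask.long().sum()

print(padding_mask)
print(safe_labels)
print(num_active)
```

The value 0 substituted for the ignored label is arbitrary: whatever is looked up at that position is zeroed out by the mask before the loss is averaged.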

nll_loss

# gather picks values by index: https://pytorch.org/docs/stable/generated/torch.gather.html#torch.gather
nll_loss = log_probs.gather(dim=-1, index=labels)  # pick the entry at each sample's label index
print(nll_loss)

tensor([[2.0563],
        [3.1901],
        [1.1092]], grad_fn=<GatherBackward0>)
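For readers unfamiliar with `torch.gather`, a small standalone sketch with made-up numbers may help. Along the last dimension, row i of the result is the element of row i of the input at the column given by row i of the index. The same selection can be written with integer-array indexing, which is shown for comparison.

```python
import torch

# Made-up negative log-probabilities: 2 samples, 3 classes
neg_log_probs = torch.tensor([[0.1, 0.2, 0.3],
                              [1.0, 2.0, 3.0]])
labels = torch.tensor([[2], [0]])  # shape (2, 1), one class index per sample

# Row 0 takes column 2, row 1 takes column 0
picked = neg_log_probs.gather(dim=-1, index=labels)

# Equivalent selection using advanced indexing
rows = torch.arange(neg_log_probs.shape[0])
picked_indexing = neg_log_probs[rows, labels.squeeze(-1)].unsqueeze(-1)

print(picked)
print(picked_indexing)
```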

smoothed_loss

$$\text{smoothed\_loss} = \frac{\operatorname{sum}(\text{log\_probs})}{\text{number of samples} \times \text{number of labels}}$$

# For each sample, sum the negative log-probabilities over all classes
smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)
print(smoothed_loss)

tensor([[ 9.5088],
        [10.5747],
        [ 9.4863]], grad_fn=<SumBackward1>)
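One way to read this term: dividing the per-sample sum by the number of classes gives the cross-entropy between a uniform distribution and the prediction, which equals KL(uniform || prediction) plus the constant log K. The sketch below checks that identity numerically on its own seeded random logits (an assumption; these are not the article's example values).

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(3, 5)
num_classes = logits.shape[-1]

log_p = F.log_softmax(logits, dim=-1)
uniform = torch.full_like(log_p, 1.0 / num_classes)

# Cross-entropy against the uniform distribution: mean over classes of -log p
ce_uniform = (-log_p).mean(dim=-1)

# KL(u || p) = sum_k u_k * (log u_k - log p_k)
kl_uniform = (uniform * (uniform.log() - log_p)).sum(dim=-1)

print(ce_uniform)
print(kl_uniform + math.log(num_classes))
```

Since log K does not depend on the model, minimizing this term pulls the prediction toward the uniform distribution.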

Putting it together

# Number of samples that contribute to the loss
num_active_elements = padding_mask.numel() - padding_mask.long().sum()
print(num_active_elements)  # tensor(3)
nll_loss = nll_loss.sum() / num_active_elements
print(nll_loss)  # tensor(2.1185, grad_fn=<DivBackward0>)
smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
print(smoothed_loss)  # tensor(1.9713, grad_fn=<DivBackward0>)
# One way to read this: the loss penalizes both the gap between the predicted and true
# distributions and the gap between the predicted distribution and a uniform prior.
print((1 - 0.1) * nll_loss + 0.1 * smoothed_loss)  # tensor(2.1038, grad_fn=<AddBackward0>)

The final result matches the one obtained directly from cross_entropy with label_smoothing.
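The steps above can be collected into one function. The following is a sketch under stated assumptions: `label_smoothed_nll` is a hypothetical helper name, not part of transformers or PyTorch, and the batch with one ignored label is made up. To avoid depending on how a given PyTorch version combines `ignore_index` with `label_smoothing`, the reference value is computed by `F.cross_entropy` on the non-ignored samples only.

```python
import torch
import torch.nn.functional as F


def label_smoothed_nll(logits, labels, epsilon=0.1, ignore_index=-100):
    """Hypothetical helper mirroring the walkthrough; not a library function."""
    neg_log_probs = -F.log_softmax(logits, dim=-1)
    if labels.dim() == neg_log_probs.dim() - 1:
        labels = labels.unsqueeze(-1)

    padding_mask = labels.eq(ignore_index)
    labels = torch.clamp(labels, min=0)  # make ignored labels valid indices

    # Per-sample terms, zeroed at ignored positions
    nll = neg_log_probs.gather(dim=-1, index=labels).masked_fill(padding_mask, 0.0)
    smooth = neg_log_probs.sum(dim=-1, keepdim=True).masked_fill(padding_mask, 0.0)

    # Average over the samples that are not ignored
    num_active = padding_mask.numel() - padding_mask.long().sum()
    nll = nll.sum() / num_active
    smooth = smooth.sum() / (num_active * neg_log_probs.shape[-1])
    return (1 - epsilon) * nll + epsilon * smooth


torch.manual_seed(0)
logits = torch.randn(4, 5)
labels = torch.tensor([2, -100, 0, 4])  # the second sample is ignored

loss = label_smoothed_nll(logits, labels)

# Reference: built-in label smoothing applied to the kept samples only
keep = labels != -100
reference = F.cross_entropy(logits[keep], labels[keep], label_smoothing=0.1)

print(loss, reference)
```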

Implementation 3: modifying the label values

$$y_k^{LS} = y_k (1 - \alpha) + \alpha / K$$

where K is the number of labels and \alpha is the epsilon used in the implementations above.

Suppose a sample's label is 2, i.e. [0, 0, 1, 0, 0]. With \alpha = 0.1 and K = 5 it becomes [0.02, 0.02, 0.92, 0.02, 0.02].

# labels here is the original 1-D tensor of class indices, shape (3,)
print(labels)
labels_onehot = torch.zeros(3, 5).scatter_(1, labels.unsqueeze(-1), 1)
print(labels_onehot)
labels_smoothed = labels_onehot * (1 - 0.1) + 0.1 / labels_onehot.shape[-1]
print(labels_smoothed)
# cross_entropy also accepts class probabilities as the target
print(F.cross_entropy(logits, labels_smoothed))

Output:

tensor([2, 3, 2])
tensor([[0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 1., 0., 0.]])
tensor([[0.0200, 0.0200, 0.9200, 0.0200, 0.0200],
        [0.0200, 0.0200, 0.0200, 0.9200, 0.0200],
        [0.0200, 0.0200, 0.9200, 0.0200, 0.0200]])
tensor(2.1038, grad_fn=<DivBackward1>)
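Why the three implementations agree can be seen from a short derivation. The notation is an assumption for this sketch: p_k is the softmax probability of class k, c is the true class of the sample, and y_k is the one-hot label. Substituting the smoothed label into the cross-entropy of one sample splits it into the two terms that the Hugging Face implementation computes separately.

```latex
\begin{aligned}
\mathcal{L}
  &= -\sum_{k=1}^{K} y_k^{LS} \log p_k \\
  &= -\sum_{k=1}^{K} \left( y_k (1 - \alpha) + \frac{\alpha}{K} \right) \log p_k \\
  &= (1 - \alpha) \underbrace{\left( -\log p_c \right)}_{\text{nll\_loss term}}
     + \alpha \underbrace{\frac{1}{K} \sum_{k=1}^{K} \left( -\log p_k \right)}_{\text{smoothed\_loss term}}
\end{aligned}
```

Averaging both terms over the non-ignored samples gives (1 - epsilon) * nll_loss + epsilon * smoothed_loss with \alpha = epsilon, which is the value 2.1038 obtained in all three cases.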
