1. About the paper

Title: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Authors: Victor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF
Date: 2019
Venue: arXiv preprint

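2. Summary

The paper describes how to apply knowledge distillation to BERT [1] and trains a student model, DistilBERT, that has 40% fewer parameters than BERT while performing close to BERT on a range of natural language understanding and classification tasks.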

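3. Key techniques

3.1 Knowledge distillation

Knowledge distillation is a widely used model compression technique: a lighter model is trained to reach nearly the same performance as the original model.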

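As NLP models have evolved (see the figure above), they have kept growing in size, and in practice their use can be limited by latency, storage, and hardware constraints. The paper therefore uses knowledge distillation to compress BERT. In knowledge distillation, the original large model is called the teacher model and the small model being trained is called the student model. Training combines two parts: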

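Supervised learning: the student is trained on hard labels. In this paper that is BERT's masked language modeling (MLM) loss $L_{MLM}$, where the labels are the original tokens at the masked positions.
Learning from the teacher: the student learns to imitate the teacher's behavior through the loss $L_{ce} = -\sum_i t_i \log (s_i)$, where $t_i$ and $s_i$ are the probabilities that the teacher and the student assign to class $i$ for the current input. The probabilities are computed with a distillation temperature $T$ that smooths the distribution: $t_i = \frac {\exp (z_i^{(t)} / T)}{\sum_j \exp (z_j^{(t)} / T)}$ and $s_i = \frac {\exp (z_i^{(s)} / T)}{\sum_j \exp (z_j^{(s)} / T)}$, where $z_i^{(t)}$ and $z_i^{(s)}$ are the scores that the teacher and the student output for class $i$. For a fixed teacher distribution, the loss is minimized if and only if the student assigns the same probability as the teacher to every class, so this loss pushes the student toward the teacher's output distribution. The larger $T$ is, the smaller the gaps between classes and the flatter the distribution; raising the temperature moderately helps the student pick up the information the teacher carries in low-probability classes.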
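The effect of the temperature and the behavior of $L_{ce}$ can be illustrated with a small sketch. This is a pure-Python illustration, not the paper's training code: the function names and the logits are made up for the example, and $L_{ce}$ is written with an explicit minus sign so that smaller is better.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_ce(teacher_logits, student_logits, temperature=1.0):
    """L_ce = -sum_i t_i * log(s_i), with both distributions softened by the same T."""
    t = softmax_with_temperature(teacher_logits, temperature)
    s = softmax_with_temperature(student_logits, temperature)
    return -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))


# Made-up scores for three classes (illustrative values only).
teacher_logits = [4.0, 1.0, 0.5]

# A higher temperature flattens the distribution: the top probability drops
# and the smallest probability rises.
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

# The loss is lowest when the student reproduces the teacher's distribution.
matched = distillation_ce(teacher_logits, teacher_logits, temperature=2.0)
mismatched = distillation_ce(teacher_logits, [0.5, 1.0, 4.0], temperature=2.0)
```

In this sketch `matched` equals the entropy of the softened teacher distribution, which is the lowest value $L_{ce}$ can take for that teacher; any other student distribution, such as the one behind `mismatched`, gives a larger value.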
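Finally, the two losses above are combined into the training loss. The paper also adds a third term, a cosine loss $L_{cos}$ between the hidden-state vectors output by the student and the teacher, which encourages the two to point in the same direction; the paper reports that this term is beneficial.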
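The cosine term and the way the three losses are combined can be sketched as follows. The `1 - cos` form of the loss, the function names, and the weights `w_ce`, `w_mlm`, `w_cos` are assumptions made for illustration; these notes do not give the weights used in the paper.

```python
import math


def cosine_loss(h_student, h_teacher):
    """1 - cos(h_student, h_teacher): zero when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)


def total_loss(l_ce, l_mlm, l_cos, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return w_ce * l_ce + w_mlm * l_mlm + w_cos * l_cos


# Parallel vectors give a loss of (almost exactly) 0, orthogonal vectors give 1.
aligned = cosine_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_loss([1.0, 0.0], [0.0, 1.0])

# Illustrative loss values only; in training they come from the model outputs.
loss = total_loss(l_ce=0.8, l_mlm=2.0, l_cos=orthogonal)
```

The cosine term depends only on the direction of the hidden-state vectors and not on their length, which is why the scaled copy in `aligned` incurs no penalty.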

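3.2 DistilBERT

Building on the distillation setup above, the paper trains the student model DistilBERT with the following design:

Architecture: essentially the same as BERT, with the token-type embeddings and the pooler removed
Size: half as many layers as BERT
Hidden size: the hidden dimension (the last dimension of the tensors) is kept the same as in BERT, since the paper finds that it affects computation efficiency less than the number of layers does
Initialization: because the hidden size is unchanged, the student can be initialized directly from BERT's weights; the paper takes one layer out of two
Distillation: following the analysis in RoBERTa [2], distillation is run with large batches (up to 4K examples per batch)
Training objective: MLM, with the NSP (Next Sentence Prediction) objective removed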
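The initialization step can be pictured with a toy sketch. Each layer is represented by a plain dictionary rather than real weights, and keeping the even-indexed layers is an assumption of this example: the notes only say that one layer out of two is taken. The 12 layers and hidden size of 768 correspond to BERT-base.

```python
def init_student_from_teacher(teacher_layers, stride=2):
    """Copy every `stride`-th teacher layer into the student, starting at layer 0."""
    return [dict(layer) for layer in teacher_layers[::stride]]


# Toy stand-in for a 12-layer teacher; no real weights are involved.
teacher_layers = [
    {"name": f"teacher_layer_{i}", "hidden_size": 768} for i in range(12)
]

student_layers = init_student_from_teacher(teacher_layers)
kept = [layer["name"] for layer in student_layers]
```

Because the hidden size is identical in teacher and student, each copied layer fits the student without any reshaping or projection.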

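4. Highlights

The paper trains DistilBERT, a distilled version of BERT, with a combination of three losses, $L_{ce}, L_{MLM}, L_{cos}$. In the reported experiments DistilBERT retains 97% of BERT's performance on the GLUE benchmark and stays within roughly 3 points of BERT on downstream tasks.
DistilBERT has 40% fewer parameters than BERT and is 60% faster at inference, which makes it suitable for resource-constrained settings such as online inference.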
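The size comparison is easy to misread, so here is the arithmetic with approximate parameter counts (about 66M for DistilBERT and 110M for BERT-base). The variable names are only for illustration.

```python
bert_base_params = 110e6   # approximate parameter count of BERT-base
distilbert_params = 66e6   # approximate parameter count of DistilBERT

fraction_kept = distilbert_params / bert_base_params
reduction = 1.0 - fraction_kept

print(f"kept: {fraction_kept:.0%}, reduction: {reduction:.0%}")
# prints "kept: 60%, reduction: 40%"
```

So DistilBERT keeps about 60% of BERT's parameters, which is a 40% reduction in size.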

5. Link to the original paper

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

6. References

[1] Paper notes on BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] Paper notes on RoBERTa: A Robustly Optimized BERT Pretraining Approach
