DistilBERT, a distilled version of BERT

2024-06-07 06:00:53

1 Introduction

This post is a translated summary of the 2020 paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".

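DistilBERT is a pre-trained model. It reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and running 60% faster.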

Training uses a triple loss that combines a distillation loss (L_ce), a supervised training loss (in this case the masked language modeling loss, L_mlm), and a cosine embedding loss (L_cos).

2 Knowledge distillation

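Knowledge distillation [Bucila et al., 2006, Hinton et al., 2015] is a compression technique in which a compact model (the student) is trained to reproduce the behaviour of a larger model (the teacher) or an ensemble of models.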

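The final training objective is a linear combination of the distillation loss L_ce and the supervised training loss (in this case the masked language modeling loss, L_mlm). The authors found that adding a cosine embedding loss (L_cos) helps align the student's and teacher's hidden-state vectors.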
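The combination described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the weights `w_ce`, `w_mlm`, `w_cos`, the temperature value, and the toy inputs are all assumptions made for the example. It assumes L_ce is the cross-entropy between temperature-softened teacher and student distributions, and L_cos is one minus the cosine similarity of the hidden-state vectors.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature):
    """L_ce: cross-entropy of the student's soft predictions against the teacher's."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def mlm_loss(student_logits, target_index):
    """L_mlm: cross-entropy against the true token at one masked position."""
    return -math.log(softmax(student_logits)[target_index])


def cosine_loss(student_hidden, teacher_hidden):
    """L_cos: 1 - cosine similarity (assumes neither vector is all zeros)."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)


def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                temperature=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    """Linear combination of the three terms; the weights are hypothetical."""
    return (w_ce * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))


# Toy values for a single masked position over a 3-token vocabulary.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [3.0, 1.0, -2.0]
loss = triple_loss(student_logits, teacher_logits, target_index=0,
                   student_hidden=[0.2, 0.9, -0.1],
                   teacher_hidden=[0.3, 1.0, 0.0])
print(f"triple loss: {loss:.4f}")
```

In a real training loop each term would be averaged over all masked positions in a batch, and the weights would be tuned; the sketch only shows how the three terms fit together.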

3 DistilBERT: a distilled version of BERT

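Student architecture: DistilBERT has the same general architecture as BERT. The token-type embeddings and the pooler are removed, and the number of layers is reduced by a factor of 2.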

Student initialization: the student is initialized from the teacher by taking one layer out of every two.
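As a rough sketch of this initialization, the snippet below copies every second teacher layer into the student. The helper names and the dictionary stand-ins for layers are hypothetical, and starting at index 0 is an assumption: the summary only says one layer out of two is taken, not which ones.

```python
import copy


def select_teacher_layers(num_teacher_layers, stride=2):
    """Indices of the teacher layers kept for the student (assumed to start at 0)."""
    return list(range(0, num_teacher_layers, stride))


def init_student_from_teacher(teacher_layers, stride=2):
    """Deep-copy every stride-th teacher layer so the student owns its weights."""
    kept = select_teacher_layers(len(teacher_layers), stride)
    return [copy.deepcopy(teacher_layers[i]) for i in kept]


# Stand-in for a 12-layer teacher; each "layer" is a dict of toy weights.
teacher_layers = [
    {"name": f"teacher_layer_{i}", "weights": [float(i)] * 4}
    for i in range(12)
]

student_layers = init_student_from_teacher(teacher_layers)
print(len(student_layers))
print([layer["name"] for layer in student_layers])
```

The deep copy matters: the student's weights are updated during distillation, and sharing objects with the teacher would modify the teacher as well.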

4 Experimental results

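Table 1 shows that DistilBERT retains 97% of BERT's score. Table 2 shows that it also performs well on downstream tasks. Table 3 shows that it has fewer parameters and a shorter inference time.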
