Distill-BERT: Using BERT for Smarter Text Generation

The field of natural language processing is now in the age of large-scale pretrained models being the first thing to try for almost any new task. Models like BERT, RoBERTa and ALBERT are so large and have been trained on so much data that they are able to generalize their pre-trained knowledge to understand almost any downstream task you use them for. But that’s all they can do — understand. If you wanted to answer a question that wasn’t multiple choice, write a story or an essay, or do anything that required free-form writing, you’d be out of luck.

Now don’t get me wrong, just because BERT-like models can’t write stories, that doesn’t mean that there aren’t models out there that can. Introducing the Sequence-to-Sequence (Seq2Seq) model. When we write a story, we write the next word, sentence or even paragraph based on what we’ve written so far. This is exactly what Seq2Seq models are designed to do. They predict the most likely next word based on all the words they’ve seen so far by modeling them as a time series, i.e. the order of the previous words matters.

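To make that loop concrete, here is a minimal sketch of greedy autoregressive generation in Python; the tiny lookup table standing in for a real trained model, and the `next_word_scores` helper, are illustrative assumptions rather than part of any actual Seq2Seq implementation.

```python
# Toy illustration of autoregressive generation: at every step the "model"
# scores candidate next words given everything written so far, and the
# highest-scoring word is appended to the output.

def next_word_scores(prefix):
    """Hypothetical stand-in for a trained Seq2Seq model.

    A real model would return a probability for every vocabulary word,
    conditioned on the whole prefix; here a lookup table fakes it.
    """
    toy_model = {
        ("once",): {"upon": 0.9, "there": 0.1},
        ("once", "upon"): {"a": 0.95, "the": 0.05},
        ("once", "upon", "a"): {"time": 0.8, "hill": 0.2},
    }
    return toy_model.get(tuple(prefix), {"<eos>": 1.0})

def generate(prompt, max_len=10):
    words = list(prompt)
    for _ in range(max_len):
        scores = next_word_scores(words)
        best = max(scores, key=scores.get)  # greedy: pick the most likely next word
        if best == "<eos>":                 # end-of-sequence token stops generation
            break
        words.append(best)
    return " ".join(words)

print(generate(["once"]))  # -> "once upon a time"
```

Real Seq2Seq models replace the lookup table with a neural network, but the word-by-word loop is the same.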

Seq2Seq models have been around for a while and there are several variants that are used for text generation tasks like summarization and translation from one language to another. The exploration of Seq2Seq models has culminated in the development of models like GPT-2 and GPT-3, which can complete news snippets, stories, essays and even investment strategies — all from a few sentences of context! Fair warning though, not all of these generated pieces of text make very much sense when you read them — probability distributions over words can only take you so far.

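If you want to see this behaviour for yourself, a few lines with the Hugging Face `transformers` library will let GPT-2 continue a prompt; the library, the model choice and the generation parameters here are assumptions for illustration, not something the original work prescribes.

```python
from transformers import pipeline

# GPT-2 continues a prompt one token at a time, sampling from its
# predicted distribution over the next token at every step.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "In a surprising turn of events, scientists announced that",
    max_length=40,           # total length in tokens, prompt included
    num_return_sequences=2,  # two alternative continuations
    do_sample=True,          # sample instead of always taking the argmax
)

for out in outputs:
    print(out["generated_text"])
    print("---")
```

As the paragraph above warns, the continuations are fluent but not necessarily sensible.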

A few of the fundamental units used in designing these models are Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks and Transformers (a combination of an encoder and a decoder that learns numeric representations of words) — which also form the backbone of BERT-like models and GPT-2/3.

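As a rough picture of how an encoder and a decoder fit together, here is a hedged PyTorch sketch of a tiny encoder-decoder Transformer; the framework, the layer counts and the dimensions are all illustrative assumptions, not details taken from any of the models mentioned above.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, D_MODEL = 10_000, 512  # illustrative sizes

class TinySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        # nn.Transformer bundles the encoder (which reads the source sequence)
        # and the decoder (which generates the target one position at a time).
        # Positional encodings are omitted to keep the sketch short.
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(D_MODEL, VOCAB_SIZE)  # scores over the vocabulary

    def forward(self, src_ids, tgt_ids):
        src, tgt = self.embed(src_ids), self.embed(tgt_ids)
        # The causal mask stops each target position from peeking at future
        # words, which is what makes the decoder autoregressive.
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=causal)
        return self.out(hidden)

model = TinySeq2Seq()
logits = model(torch.randint(0, VOCAB_SIZE, (1, 7)),   # source: 7 tokens
               torch.randint(0, VOCAB_SIZE, (1, 5)))   # target prefix: 5 tokens
print(logits.shape)  # torch.Size([1, 5, 10000])
```

Swapping the Transformer for an LSTM encoder and decoder gives the RNN flavour of the same idea.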

A natural question now arises — if the same Seq2Seq building blocks are used as the backbone of both BERT-like models and GPT, why can the BERT-like models not generate text? It’s because they’re trained in a way that considers both the future and the past context. During training, these models are given sentences with a few missing words as input and they’re expected to predict these missing words. To predict a missing word, they need to know what the words before it mean as well as the words after it. In this spirit, there has been work on trying to get BERT-like models to work for text generation, such as Yang et al.’s CT-NMT.

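You can poke at that fill-in-the-blank objective directly. The snippet below uses the Hugging Face `transformers` fill-mask pipeline; the library and the particular checkpoint are assumptions for illustration, not something the post prescribes.

```python
from transformers import pipeline

# BERT is trained to fill in blanks, so it gets to see the words on *both*
# sides of the gap -- unlike a left-to-right generator.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The chef [MASK] a delicious meal for the guests."):
    print(f'{prediction["token_str"]:>12}  {prediction["score"]:.3f}')
```

The top guesses tend to be words like "prepared" or "cooked", and BERT can only rank them that well because it also reads the words after the blank; what it lacks is a head for producing free-form continuations word by word.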

Another line of thought on BERT-like models for text generation is based on this question: can the knowledge of future words that these models gain from their training help Seq2Seq models formulate more coherent sentences, instead of just predicting the next word? This is exactly the question that the researchers at Microsoft Dynamics 365 AI Research try to answer with Distill-BERT.

They use knowledge distillation to transfer knowledge from a teacher BERT model to a student Seq2Seq model, while also maintaining the original Seq2Seq goal of predicting the most likely next word. This way, the student model retains the best of both worlds. A more formal explanation of this technique is given by the equations below.

The objective used to train the student Seq2Seq model combines two parts: the BERT (knowledge-distillation) part of the objective and the original Seq2Seq objective. Here, y_t is the vector of probabilities that the teacher BERT predicts for every vocabulary word at position t of the generated text.
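Written out in standard knowledge-distillation notation, a hedged reconstruction of these objectives looks like this; the interpolation weight $\alpha$ and the exact symbols are illustrative rather than the paper's own notation.

$$
\mathcal{L}_{\text{BERT}} = -\sum_{t=1}^{T}\sum_{w \in V} y_t(w)\,\log p_\theta\big(w \mid y^{*}_{<t},\, x\big),
\qquad
\mathcal{L}_{\text{Seq2Seq}} = -\sum_{t=1}^{T} \log p_\theta\big(y^{*}_{t} \mid y^{*}_{<t},\, x\big)
$$

$$
\mathcal{L} = \alpha\,\mathcal{L}_{\text{BERT}} + (1-\alpha)\,\mathcal{L}_{\text{Seq2Seq}}
$$

Here $x$ is the source sequence, $y^{*}$ the ground-truth target, $V$ the vocabulary, $y_t(w)$ the probability the teacher BERT assigns to word $w$ at position $t$, and $p_\theta$ the student's predicted distribution.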

Once the student model is trained, the teacher BERT model is no longer needed and only the student model is used to generate text. This means that at generation time, Distill-BERT requires no additional resources. The technique is also teacher-agnostic: any BERT-like model, such as RoBERTa, ALBERT or BERT itself, can be used to distil knowledge into the student.

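Putting the two loss terms together, a minimal sketch of a single training step might look like the following, assuming a PyTorch student that returns logits and a frozen teacher wrapper that returns per-position probabilities over the vocabulary; every name here, including the weight `alpha`, is illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, optimizer, src_ids, tgt_ids,
                      alpha=0.5, pad_id=0):
    """One training step: soft targets from the teacher plus the usual next-word loss.

    Target shifting and padding masks for the soft loss are omitted for brevity.
    """
    # The teacher only supplies soft labels; it is frozen and never updated,
    # which is why it can be discarded once training is finished.
    with torch.no_grad():
        teacher_probs = teacher(src_ids, tgt_ids)       # (batch, tgt_len, vocab)

    student_logits = student(src_ids, tgt_ids)          # (batch, tgt_len, vocab)
    log_probs = F.log_softmax(student_logits, dim=-1)

    # Knowledge-distillation term: cross-entropy against the teacher's full
    # distribution at every target position.
    kd_loss = -(teacher_probs * log_probs).sum(dim=-1).mean()

    # Original Seq2Seq term: cross-entropy against the ground-truth next word.
    ce_loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        tgt_ids.reshape(-1),
        ignore_index=pad_id,
    )

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the teacher is only ever called under `torch.no_grad()`, it adds cost during training but nothing at generation time, which is exactly the point made above.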

To prove their method works, the researchers distil BERT’s knowledge to train a student transformer and use it for German-to-English translation, English-to-German translation and summarization. The student transformer shows significant improvement over a regular transformer trained without BERT and even achieves state-of-the-art performance on German-to-English translation.

They also distil knowledge into a student RNN, showing that the technique is student-agnostic as well. This RNN is applied to English-to-Vietnamese translation and also shows improvement.

Here’s a link to the paper if you want to know more about Distill-BERT, a link to the code if you want to try to train your own Seq2Seq model, and click here to see more of our publications and other work.

  1. Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever, Language models are unsupervised multitask learners, OpenAI Blog 1, no. 8 (2019): 9.

  2. Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).

  3. Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil, Model compression, In KDD (2006).

  4. Yang, Jiacheng, Mingxuan Wang, Hao Zhou, Chengqi Zhao, Yong Yu, Weinan Zhang, and Lei Li, Towards making the most of BERT in neural machine translation, arXiv preprint arXiv:1908.05672 (2019).

  5. Chen, Yen-Chun, Zhe Gan, Yu Cheng, Jingzhou Liu, and Jingjing Liu, Distilling Knowledge Learned in BERT for Text Generation, In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7893–7905, 2020.

Source: https://medium.com/swlh/distill-bert-using-bert-for-smarter-text-generation-aa65ba6facf0
