Original article: Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters

For background on the transformer and attention mechanisms, I recommend The Illustrated Transformer [translated].

BERT architecture explained: extracting 6 patterns from 100 million parameters

The year 2018 marked a turning point for the field of Natural Language Processing, with a series of deep-learning models achieving state-of-the-art results on NLP tasks ranging from question answering to sentiment classification. Most recently, Google’s BERT algorithm has emerged as a sort of “one model to rule them all,” based on its superior performance over a wide variety of tasks.

  • 2018 was a turning point for natural language processing: deep-learning models achieved state-of-the-art results on NLP tasks ranging from question answering to sentiment classification.
  • Most recently, Google's BERT emerged as a sort of "one model to rule them all", delivering strong performance across a wide variety of tasks.

BERT builds on two key ideas that have been responsible for many of the recent advances in NLP: (1) the transformer architecture and (2) unsupervised pre-training. The transformer is a sequence model that forgoes the recurrent structure of RNNs for a fully attention-based approach, as described in the classic Attention Is All You Need. BERT is also pre-trained; its weights are learned in advance through two unsupervised tasks: masked language modeling (predicting a missing word given the left and right context) and next sentence prediction (predicting whether one sentence follows another). Thus BERT doesn't need to be trained from scratch for each new task; rather, its weights are fine-tuned. For more details about BERT, check out The Illustrated Bert.

  • BERT builds on two key ideas behind many of the recent advances in NLP:

    • The transformer architecture: a sequence model that drops the recurrent structure of RNNs in favor of a fully attention-based approach; see Attention Is All You Need.
    • Unsupervised pre-training: the weights are learned in advance through two unsupervised tasks, masked language modeling (predicting a missing word from its left and right context) and next sentence prediction (predicting whether one sentence follows another); a rough sketch of the masking step follows this list.
  • BERT therefore does not need to be trained from scratch for each new task; instead its weights are fine-tuned. For more detail, see The Illustrated Bert.
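To make the masked-language-modeling objective a bit more concrete, here is a minimal PyTorch sketch of BERT-style token masking. It is not the original pre-training code; the function name and the -100 ignore-index convention are my own choices, while the 15% / 80% / 10% / 10% split follows the BERT paper.

    import torch

    def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
        # Simplified BERT-style masking: ~15% of positions become prediction targets;
        # of those, 80% are replaced by [MASK], 10% by a random token, 10% left unchanged.
        # (Real pre-training also avoids masking [CLS], [SEP] and padding.)
        labels = input_ids.clone()
        is_target = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
        labels[~is_target] = -100                               # ignored by the LM loss

        to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & is_target
        input_ids[to_mask] = mask_token_id                      # 80% -> [MASK]

        to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & is_target & ~to_mask
        random_ids = torch.randint(vocab_size, input_ids.shape)
        input_ids[to_random] = random_ids[to_random]            # 10% -> random token
        return input_ids, labels                                # the rest keep the original token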

BERT is a (multi-headed) beast

BERT is not like traditional attention models that use a flat attention structure over the hidden states of an RNN. Instead, BERT uses multiple layers of attention (12 or 24 depending on the model), and also incorporates multiple attention "heads" in every layer (12 or 16). Since model weights are not shared between layers, a single BERT model effectively has up to 24 x 16 = 384 different attention mechanisms.

  • Unlike traditional attention models, which use a flat attention structure over the hidden states of an RNN, BERT uses multiple layers of attention (12 or 24, depending on the model).
  • Each layer also contains multiple attention "heads" (12 or 16).
  • Since weights are not shared between layers, a single BERT model effectively has up to 24 x 16 = 384 distinct attention mechanisms; the sketch below shows the computation each head performs.
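To make the head count concrete, the sketch below (my own illustration, not BERT's actual code) shows the scaled dot-product attention that every head computes; BERT-Base has 12 layers x 12 heads = 144 of these attention maps, BERT-Large up to 24 x 16 = 384.

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        # q, k, v: [batch, heads, seq_len, head_dim].
        # Returns the attended values and the [batch, heads, seq_len, seq_len]
        # attention weights that the visualization tool draws as lines.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = torch.softmax(scores, dim=-1)   # each row sums to 1
        return weights @ v, weights

    # BERT-Base dimensions: 12 heads, hidden size 768 -> head_dim 64
    q = k = v = torch.randn(1, 12, 20, 64)
    _, attn_weights = scaled_dot_product_attention(q, k, v)
    print(attn_weights.shape)   # torch.Size([1, 12, 20, 20])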

Visualizing BERT

Because of BERT’s complexity, it can be difficult to intuit the meaning of its learned weights. Deep-learning models in general are notoriously opaque, and various visualization tools have been developed to help make sense of them. However, I hadn’t found one that could shed light on the attention patterns that BERT was learning. Fortunately, Tensor2Tensor has an excellent tool for visualizing attention in encoder-decoder transformer models, so I modified this to work with BERT’s architecture, using a PyTorch implementation of BERT. The adapted interface is shown below, and you can run it yourself using the notebooks on Github.

The tool visualizes attention as lines connecting the position being updated (left) with the position being attended to (right). Colors identify the corresponding attention head(s), while line thickness reflects the attention score. At the top of the tool, the user can select the model layer, as well as one or more attention heads (by clicking on the color patches at the top, representing the 12 heads).

  • Because of BERT's complexity, it is hard to intuit the meaning of its learned weights.
  • Deep-learning models are notoriously opaque, and various visualization tools have been developed to make sense of them.
  • Tensor2Tensor has an excellent tool for visualizing attention in encoder-decoder transformer models; the author adapted it to BERT's architecture using a PyTorch implementation of BERT (impressive engineering), and you can run it yourself with the notebooks on Github.
  • The tool draws attention as lines connecting the position being updated (left) with the position being attended to (right).
  • Colors identify the corresponding attention head(s), and line thickness reflects the attention score.
  • At the top of the tool you can select the model layer as well as one or more heads; an illustrative way to pull the underlying attention weights out of a pre-trained BERT is sketched below.
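The original tool wraps a PyTorch implementation of BERT; as an illustrative stand-in (not the author's code), the same per-layer, per-head attention weights can be pulled out of a pre-trained model with the Hugging Face transformers library, assuming it is installed:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
    model.eval()

    inputs = tokenizer("I went to the store.",
                       "At the store, I bought fresh strawberries.",
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions is a tuple with one tensor per layer,
    # each of shape [batch, heads, seq_len, seq_len]
    attn = torch.stack(outputs.attentions).squeeze(1)   # [layers, heads, seq_len, seq_len]
    print(attn.shape)

Selecting a layer and a head in the interface corresponds to picking one attn[layer, head] slice and drawing each weight as a line whose thickness is the attention score.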

What does BERT actually learn? (These are the six patterns from the title.)

I used the tool to explore the attention patterns of various layers / heads of the pre-trained BERT model (the BERT-Base, uncased version). I experimented with different input values, but for demonstration purposes, I just use the following inputs:

Sentence A: I went to the store.

Sentence B: At the store, I bought fresh strawberries.

BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is: [CLS] i went to the store . [SEP] at the store , i bought fresh straw ##berries . [SEP]

I found some fairly distinctive and surprisingly intuitive attention patterns. Below I identify six key patterns and for each one I show visualizations for a particular layer / head that exhibited the pattern.

  • The tool was used to explore the attention patterns of the various layers / heads of the pre-trained BERT model (BERT-Base, uncased). For demonstration purposes, the following inputs are used:

    • Sentence A: I went to the store.
    • Sentence B: At the store, I bought fresh strawberries.
  • BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is (a tokenization sketch follows this list):
    • [CLS] i went to the store . [SEP] at the store , i bought fresh straw ##berries . [SEP]
  • The attention patterns turn out to be fairly distinctive and surprisingly intuitive; six key patterns are identified below, each visualized for a particular layer / head that exhibits it.
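For reference, the sequence above can be reproduced with a BERT WordPiece tokenizer (shown here with the Hugging Face tokenizer as a stand-in for whichever implementation the tool uses):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    ids = tokenizer.encode("I went to the store.",
                           "At the store, I bought fresh strawberries.")
    print(tokenizer.convert_ids_to_tokens(ids))
    # ['[CLS]', 'i', 'went', 'to', 'the', 'store', '.', '[SEP]',
    #  'at', 'the', 'store', ',', 'i', 'bought', 'fresh', 'straw', '##berries', '.', '[SEP]']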

Pattern 1: Attention to next word

In this pattern, most of the attention at a particular position is directed to the next token in the sequence. Below we see an example of this for layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) The figure on the left shows the attention for all tokens, while the one on the right shows the attention for one selected token (“i”). In this example, virtually all of the attention is directed to “went,” the next token in the sequence.

  • In this pattern, most of the attention at a given position is directed to the next token in the sequence.
  • The left figure shows the attention for all tokens; the right figure shows the attention for the selected token "i".
  • Here, virtually all of the attention from "i" is directed to "went", the next token in the sequence.

On the left, we can see that the [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] rather than the next token. Thus this pattern appears to operate primarily within each sentence.

This pattern is related to the backward RNN, where state updates are made sequentially from right to left. Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN.

  • The left figure shows that the [SEP] token disrupts this pattern: most of the attention from [SEP] goes to [CLS] rather than to the next token, so the pattern operates primarily within each sentence.
  • The pattern is related to a backward RNN, where state updates are made sequentially from right to left.
  • Pattern 1 appears over multiple layers of the model, in some sense emulating the recurrent updates of an RNN; a rough way to quantify it is sketched below.
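One rough way to quantify this pattern on the attention tensor extracted earlier (the helper below is mine, not part of the original tool):

    import torch

    def next_token_score(head_attn):
        # head_attn: [seq_len, seq_len] attention weights for a single layer/head.
        # Average attention that each position puts on the position right after it.
        idx = torch.arange(head_attn.size(0) - 1)
        return head_attn[idx, idx + 1].mean().item()

    # e.g. layer 2, head 0 (0-indexed, as in the tool):
    # print(next_token_score(attn[2, 0]))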

Pattern 2: Attention to previous word

In this pattern, much of the attention is directed to the previous token in the sentence. For example, most of the attention for “went” is directed to the previous word “i” in the figure below. The pattern is not as distinct as the last one; some attention is also dispersed to other tokens, especially the [SEP] tokens. Like Pattern 1, this is loosely related to a sequential RNN, in this case the forward RNN.

  • In this pattern, much of the attention is directed to the previous token in the sentence.
  • Most of the attention for "went", for example, goes to the previous word "i".
  • The pattern is not as distinct as Pattern 1; some attention is also dispersed to other tokens, especially the [SEP] tokens.
  • Like Pattern 1, it is loosely related to a sequential RNN, in this case a forward RNN.

Pattern 3: Attention to identical/related words

In this pattern, attention is paid to identical or related words, including the source word itself. In the example below, most of the attention for the first occurrence of “store” is directed to itself and to the second occurrence of “store”. This pattern is not as distinct as some of the others, with attention dispersed over many different words.

  • In this pattern, attention is paid to identical or related words, including the source word itself.
  • Most of the attention for the first occurrence of "store" is directed to itself and to the second occurrence of "store".
  • This pattern is not as distinct as some of the others, with attention dispersed over many different words.

Pattern 4: Attention to identical/related words in other sentence

In this pattern, attention is paid to identical or related words in the other sentence. For example, most of the attention for "store" in the second sentence is directed to "store" in the first sentence. One can imagine this being particularly helpful for the next sentence prediction task (part of BERT's pre-training), because it helps identify relationships between sentences.

  • In this pattern, attention is paid to identical or related words in the other sentence.
  • Most of the attention for "store" in the second sentence is directed to "store" in the first sentence.
  • This is likely to be especially helpful for the next sentence prediction task (part of BERT's pre-training), because it helps identify relationships between sentences (and is part of why BERT transfers to so many tasks); a small sketch for inspecting it follows.
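This can be inspected programmatically with the tensor and tokens from the extraction sketch above; the helper and its name are my own:

    def attention_to_same_token_other_sentence(head_attn, tokens, segment_ids):
        # head_attn: [seq_len, seq_len]; tokens: WordPiece strings;
        # segment_ids: 0 for sentence A, 1 for sentence B (BERT's token_type_ids).
        # Sums the attention each token sends to identical tokens in the other sentence.
        scores = {}
        for i, tok in enumerate(tokens):
            targets = [j for j, t in enumerate(tokens)
                       if t == tok and segment_ids[j] != segment_ids[i]]
            if targets:
                scores[(i, tok)] = head_attn[i, targets].sum().item()
        return scores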

Pattern 5: Attention to other words predictive of word

In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself. In the example below, most of the attention from “straw” is directed to “##berries”, and most of the attention from “##berries” is focused on “straw”.

This pattern isn’t as distinct as some of the others. For example, much of the attention is directed to a delimiter token ([CLS]), which is the defining characteristic of Pattern 6 discussed next.

  • In this pattern, attention is directed to other words that are predictive of the source word, excluding the source word itself.
  • Most of the attention from "straw" is directed to "##berries" (the piece most likely to follow "straw"), and most of the attention from "##berries" to "straw".
  • This pattern is not as distinct as some of the others; much of the attention also falls on the delimiter token [CLS], which is the defining characteristic of Pattern 6.

Pattern 6: Attention to delimiter tokens

In this pattern, most of the attention is directed to the delimiter tokens, either the [CLS] token or the [SEP] tokens. In the example below, most of the attention is directed to the two [SEP] tokens. As discussed in this paper, this pattern serves as a kind of “no-op”: an attention head focuses on the [SEP] tokens when it can’t find anything meaningful in the input sentence to focus on.

  • In this pattern, most of the attention is directed to the delimiter tokens ([CLS] or [SEP]); in the example, it goes mostly to the two [SEP] tokens.
  • This pattern appears to serve as a kind of "no-op": a head focuses on the delimiters when it cannot find anything meaningful in the input to attend to; a way to measure this is sketched below.
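The amount of this "no-op" attention can be measured directly on the extracted tensor; again, a small sketch of mine rather than part of the tool:

    def delimiter_attention_share(head_attn, tokens):
        # Fraction of a head's total attention mass that lands on [CLS] / [SEP].
        delim = [i for i, t in enumerate(tokens) if t in ("[CLS]", "[SEP]")]
        return (head_attn[:, delim].sum() / head_attn.sum()).item()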

Notes

It has been said that data visualizations are a bit like Rorschach tests: our interpretations may be colored by our own beliefs and expectations. While some of the patterns above are quite distinct, others are somewhat subjective, so these interpretations should only be taken as preliminary observations.

Also, the above 6 patterns describe the coarse attentional structure of BERT and do not attempt to describe the linguistic patterns that attention may capture. For example, there are many different types of “relatedness” that could manifest in Patterns 3 and 4, e.g., synonymy, coreference, etc. It would be interesting to see if different attention heads specialize in different types of semantic and syntactic relationships.

  • Data visualizations are a bit like Rorschach tests: our interpretations may be colored by our own beliefs and expectations.
  • While some of the patterns above are quite distinct, others are somewhat subjective, so these interpretations should be taken only as preliminary observations.
  • The six patterns describe BERT's coarse attentional structure and do not attempt to capture the linguistic relationships attention may encode (synonymy, coreference, and so on); it would be interesting to see whether different heads specialize in different semantic and syntactic relationships.

Try it out!

You can check out the visualization tool on Github. Please play with it and share what you find!

  • The visualization tool is on Github; play with it and share what you find.

For further reading

In Part 2, I extend the visualization tool to show how BERT is able to form its distinctive attention patterns. In my most recent article, I explore OpenAI’s new text generator, GPT-2.

  • Part 2 extends the visualization tool to show how BERT forms these distinctive attention patterns; the author's most recent article explores OpenAI's text generator, GPT-2.

Source: https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77

Next post: NLP models: BERT architecture explained (attention + transformer), translated from Deconstructing BERT, Part 2
