In the previous post, we looked at attention – a method that is ubiquitous in modern deep learning models. Attention is a concept that helped improve the performance of neural machine translation applications. In this post, we will look at The Transformer – a model that uses attention to boost the speed with which these models can be trained. The Transformer outperforms the Google Neural Machine Translation model on specific tasks. The biggest benefit, however, comes from how The Transformer lends itself to parallelization. In fact, Google Cloud recommends using The Transformer as a reference model for their Cloud TPU offering. So let's try to break the model apart and look at how it functions.

The Transformer was proposed in the paper Attention is All You Need. A TensorFlow implementation of it is available as part of the Tensor2Tensor package. Harvard's NLP group created a guide annotating the paper with a PyTorch implementation. In this post, we will attempt to oversimplify things a bit and introduce the concepts one by one, to hopefully make them easier to understand for people without in-depth knowledge of the subject matter.

A High-Level Look

Let's begin by looking at the model as a single black box. In a machine translation application, it would take a sentence in one language and output its translation in another.

Popping open that Optimus Prime goodness, we see an encoding component, a decoding component, and connections between them.


The encoding component is a stack of encoders (the paper stacks six of them on top of each other – there’s nothing magical about the number six, one can definitely experiment with other arrangements). The decoding component is a stack of decoders of the same number.

The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers:


The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.

The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.

The decoder has both those layers, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).
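
To make this layer structure concrete, here is a minimal sketch of one encoder layer, using PyTorch's built-in `nn.MultiheadAttention` as a stand-in for a hand-written attention module. The dimensions (512 for the model, 2048 for the inner feed-forward layer) follow the paper; the rest is illustrative rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

# A minimal sketch of one encoder layer: self-attention followed by a
# position-wise feed-forward network. d_model=512 and d_ff=2048 follow the paper;
# the built-in MultiheadAttention stands in for a hand-written attention module.
class EncoderLayerSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)   # each position attends to every position
        x = x + attn_out                        # residuals and layer-norm are covered later in the post
        return x + self.ffn(x)                  # the FFN is applied to each position independently
```

A decoder layer would add an encoder-decoder attention sub-layer between these two.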

Bringing The Tensors Into The Picture

Now that we’ve seen the major components of the model, let’s start to look at the various vectors/tensors and how they flow between these components to turn the input of a trained model into an output.

As is the case in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm.

The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors each of the size 512 – in the bottom encoder that would be the word embeddings, but in other encoders, it would be the output of the encoder that's directly below. The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset.
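
As a small illustration of the shapes involved, here is a hedged sketch using PyTorch's `nn.Embedding`; the vocabulary size and token ids below are made up purely to show that each word becomes a vector of size 512:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary size and token ids, just to show the shapes involved.
vocab_size, d_model = 10000, 512
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[5, 213, 77]])   # a 3-word input sentence (ids are made up)
x = embedding(token_ids)                   # shape: (1, 3, 512)
print(x.shape)                             # each word is now a 512-dimensional vector
```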

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.


Here we begin to see one key property of the Transformer, which is that the word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Next, we'll switch up the example to a shorter sentence and we'll look at what happens in each sub-layer of the encoder.

Now We’re Encoding!

As we’ve mentioned already, an encoder receives a list of vectors as input. It processes this list by passing these vectors into a ‘self-attention’ layer, then into a feed-forward neural network, then sends out the output upwards to the next encoder.

Self-Attention at a High Level

Don't be fooled by me throwing around the word "self-attention" like it's a concept everyone should be familiar with. I had personally never come across the concept until reading the Attention is All You Need paper. Let us distill how it works.

Say the following sentence is an input sentence we want to translate:

”The animal didn't cross the street because it was too tired”

What does "it" in this sentence refer to? Is it referring to the street or to the animal? It's a simple question to a human, but not as simple to an algorithm.

When the model is processing the word "it", self-attention allows it to associate "it" with "animal".

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

If you're familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it's processing. Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.


Be sure to check out the Tensor2Tensor notebook where you can load a Transformer model, and examine it using this interactive visualization.

Self-Attention in Detail

Let’s first look at how to calculate self-attention using vectors, then proceed to look at how it’s actually implemented – using matrices.

The first step in calculating self-attention is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that we trained during the training process.

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64, while the embedding and encoder input/output vectors have dimensionality of 512. They don't HAVE to be smaller; this is an architecture choice to make the computation of multiheaded attention (mostly) constant.
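
A minimal numeric sketch of this first step, with WQ, WK, and WV initialized randomly here purely to stand in for the matrices learned during training:

```python
import torch

d_model, d_k = 512, 64

# Trained in the real model; randomly initialized here just to show the shapes.
WQ = torch.randn(d_model, d_k)
WK = torch.randn(d_model, d_k)
WV = torch.randn(d_model, d_k)

x1 = torch.randn(d_model)                   # embedding of the first word, e.g. "Thinking"
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV      # each is a 64-dimensional vector
```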


What are the “query”, “key”, and “value” vectors?

They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.

The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.

The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper – 64. This leads to having more stable gradients. There could be other possible values here, but this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so they're all positive and add up to 1.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
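
Putting steps two through six together for a hypothetical two-word input ("Thinking", "Machines"), with random embeddings and random weight matrices standing in for trained ones:

```python
import math
import torch

d_model, d_k = 512, 64
WQ, WK, WV = (torch.randn(d_model, d_k) for _ in range(3))

# Two-word example ("Thinking", "Machines"); embeddings are random stand-ins.
x1, x2 = torch.randn(d_model), torch.randn(d_model)
q1 = x1 @ WQ
k1, k2 = x1 @ WK, x2 @ WK
v1, v2 = x1 @ WV, x2 @ WV

# Step 2: score each word of the input against the word in position #1.
scores = torch.stack([q1 @ k1, q1 @ k2])

# Steps 3 and 4: divide by sqrt(d_k) = 8, then softmax.
weights = torch.softmax(scores / math.sqrt(d_k), dim=0)

# Steps 5 and 6: weight the value vectors and sum them up.
z1 = weights[0] * v1 + weights[1] * v2      # self-attention output for position #1
```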

Matrix Calculation of Self-Attention

The first step is to calculate the Query, Key, and Value matrices. We do that by packing our embeddings into a matrix X, and multiplying it by the weight matrices we’ve trained (WQ, WK, WV).

Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
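
That one formula, written as a short sketch (the inputs and weight matrices are again random stand-ins for trained values):

```python
import math
import torch

def self_attention(X, WQ, WK, WV):
    # X: (seq_len, d_model); each W: (d_model, d_k)
    Q, K, V = X @ WQ, X @ WK, X @ WV
    scores = Q @ K.T / math.sqrt(K.shape[-1])   # steps 2-4 for every position at once
    return torch.softmax(scores, dim=-1) @ V    # Z: (seq_len, d_k)

X = torch.randn(3, 512)                         # three word embeddings packed into a matrix
Z = self_attention(X, *(torch.randn(512, 64) for _ in range(3)))
```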

The Beast With Many Heads

The paper further refined the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:

  1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. It would be useful if we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, we would want to know which word “it” refers to.

  2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.

    If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.

    This leaves us with a bit of a challenge. The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.

How do we do that? We concat the matrices then multiply them by an additional weights matrix WO.

That’s pretty much all there is to multi-headed self-attention. It’s quite a handful of matrices, I realize. Let me try to put them all in one visual so we can look at them in one place.
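
Here is a hedged sketch of the whole multi-headed calculation, condensing the eight Z matrices with the additional matrix WO; all weights are random stand-ins for trained ones:

```python
import math
import torch

def self_attention(X, WQ, WK, WV):              # same single-head helper as in the previous sketch
    Q, K, V = X @ WQ, X @ WK, X @ WV
    return torch.softmax(Q @ K.T / math.sqrt(K.shape[-1]), dim=-1) @ V

d_model, n_heads = 512, 8
d_k = d_model // n_heads                        # 64 per head
X = torch.randn(3, d_model)                     # three word embeddings

# One Q/K/V set per head (random stand-ins for trained weights), plus the extra
# output matrix WO that condenses the concatenated heads back to d_model.
heads_W = [tuple(torch.randn(d_model, d_k) for _ in range(3)) for _ in range(n_heads)]
WO = torch.randn(n_heads * d_k, d_model)

Zs = [self_attention(X, WQ, WK, WV) for (WQ, WK, WV) in heads_W]  # eight (3, 64) matrices
Z = torch.cat(Zs, dim=-1) @ WO                  # concat -> (3, 512): one matrix per position for the FFN
```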

Now that we have touched upon attention heads, let’s revisit our example from before to see where the different attention heads are focusing as we encode the word “it” in our example sentence:

If we add all the attention heads to the picture, however, things can be harder to interpret:

Representing The Order of The Sequence Using Positional Encoding

One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.

To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.

If we assumed the embedding has a dimensionality of 4, the actual positional encodings would look like this:

What might this pattern look like?

In the following figure, each row corresponds to a positional encoding of a vector. So the first row would be the vector we’d add to the embedding of the first word in an input sequence. Each row contains 512 values – each with a value between 1 and -1. We’ve color-coded them so the pattern is visible.

The formula for positional encoding is described in the paper (section 3.5). You can see the code for generating positional encodings in get_timing_signal_1d(). This is not the only possible method for positional encoding. It, however, gives the advantage of being able to scale to unseen lengths of sequences (e.g. if our trained model is asked to translate a sentence longer than any of those in our training set).
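
A common way to sketch the section 3.5 formula in code is shown below. Note that this interleaves the sine and cosine channels, whereas the actual `get_timing_signal_1d()` in Tensor2Tensor arranges them slightly differently (concatenated halves), so treat this as an illustration of the pattern rather than a byte-for-byte reproduction:

```python
import math
import torch

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe                                   # row i is added to the embedding at position i

pe = positional_encoding(max_len=50, d_model=512)   # values lie between -1 and 1
```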

The Residuals

One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

If we’re to visualize the vectors and the layer-norm operation associated with self attention, it would look like this:
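
In code, the wrapping around each sub-layer amounts to `LayerNorm(x + SubLayer(x))`. A minimal sketch, again using PyTorch's built-in modules as stand-ins:

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
self_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

x = torch.randn(1, 3, d_model)                  # (batch, seq_len, d_model)

# Residual connection around the sub-layer, followed by layer normalization;
# the same wrapping is applied around the feed-forward sub-layer.
attn_out, _ = self_attn(x, x, x)
x = layer_norm(x + attn_out)
```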

This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:

The Decoder Side

Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.


The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.

The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.
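
A minimal sketch of that masking step, assuming a toy 4-position sequence with random queries and keys:

```python
import math
import torch

seq_len, d_k = 4, 64
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)

# Mask future positions by setting them to -inf before the softmax,
# so each position can only attend to itself and earlier positions.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float('-inf'))

weights = torch.softmax(scores, dim=-1)         # future positions end up with weight 0
```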

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
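
A hedged sketch of where the Queries, Keys, and Values come from in that layer; random tensors stand in for the real encoder output and decoder hidden states:

```python
import math
import torch

d_model, d_k = 512, 64
enc_output = torch.randn(5, d_model)            # output of the top encoder (5 source words)
dec_hidden = torch.randn(3, d_model)            # output of the decoder layer below (3 target words so far)

WQ, WK, WV = (torch.randn(d_model, d_k) for _ in range(3))

Q = dec_hidden @ WQ                             # Queries come from the decoder side
K = enc_output @ WK                             # Keys and Values come from the encoder stack's output
V = enc_output @ WV

attn = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1) @ V   # shape: (3, 64)
```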

The Final Linear and Softmax Layer

The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
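
A minimal sketch of these last two steps, assuming the 10,000-word vocabulary from the example above:

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 10000
final_linear = nn.Linear(d_model, vocab_size)   # projects to a logits vector: one score per vocabulary word

dec_output = torch.randn(1, d_model)            # vector produced by the decoder stack at this time step
logits = final_linear(dec_output)               # shape: (1, 10000)
probs = torch.softmax(logits, dim=-1)           # all positive, sums to 1.0
next_word_id = probs.argmax(dim=-1)             # index of the most probable word in the vocabulary
```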

Recap Of Training

Now that we’ve covered the entire forward-pass process through a trained Transformer, it would be useful to glance at the intuition of training the model.

During training, an untrained model would go through the exact same forward pass. But since we are training it on a labeled training dataset, we can compare its output with the actual correct output.

To visualize this, let’s assume our output vocabulary only contains six words (“a”, “am”, “i”, “thanks”, “student”, and “<eos>” (short for ‘end of sentence’)).

Once we define our output vocabulary, we can use a vector of the same width to indicate each word in our vocabulary. This is also known as one-hot encoding. So for example, we can indicate the word “am” using the following vector:
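
For instance, with the six-word vocabulary above, the one-hot vector for “am” could be built like this:

```python
import torch

output_vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

# One-hot vector for "am": a 1 in its vocabulary slot, 0 everywhere else.
one_hot_am = torch.zeros(len(output_vocab))
one_hot_am[output_vocab.index("am")] = 1.0
print(one_hot_am)                               # tensor([0., 1., 0., 0., 0., 0.])
```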

Following this recap, let’s discuss the model’s loss function – the metric we are optimizing during the training phase to lead up to a trained and hopefully amazingly accurate model.

The Loss Function

Say we are training our model. Say it’s our first step in the training phase, and we’re training it on a simple example – translating “merci” into “thanks”.

What this means, is that we want the output to be a probability distribution indicating the word “thanks”. But since this model is not yet trained, that’s unlikely to happen just yet.

How do you compare two probability distributions? We simply subtract one from the other. For more details, look at cross-entropy and Kullback–Leibler divergence.
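
In practice the comparison is done with a loss such as cross-entropy. A minimal sketch for the “merci” → “thanks” example, with random logits standing in for the untrained model's scores:

```python
import torch
import torch.nn.functional as F

output_vocab = ["a", "am", "i", "thanks", "student", "<eos>"]

target = torch.tensor([output_vocab.index("thanks")])   # the correct output word: "thanks"
logits = torch.randn(1, len(output_vocab))              # untrained model's raw scores (random stand-in)

# Cross-entropy compares the model's probability distribution with the target
# distribution (all of the mass on "thanks") and is what training minimizes.
loss = F.cross_entropy(logits, target)
print(loss.item())
```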

But note that this is an oversimplified example. More realistically, we’ll use a sentence longer than one word. For example – input: “je suis étudiant” and expected output: “i am a student”. What this really means, is that we want our model to successively output probability distributions where:

Go Forth And Transform

I hope you’ve found this a useful place to start to break the ice with the major concepts of the Transformer. If you want to go deeper, good next steps are reading the Attention is All You Need paper itself, exploring the Tensor2Tensor implementation, and working through the annotated PyTorch guide from Harvard's NLP group mentioned earlier.
