Beyond Classification with Transformers and Hugging Face

In this post, I plan to explore aspects of cutting-edge architectures in NLP like BERT/Transformers. I assume that readers are familiar with the Transformer architecture. To learn more, refer to Jay Alammar's posts here and here. I am also going to explore a couple of BERT variants, the BERT-base and RoBERTa-base models, but these techniques can very easily be extended to more recent architectures, thanks to Hugging Face!

As I started diving into the world of Transformers, and eventually into BERT and its siblings, a common theme I came across was the Hugging Face library (link). It reminds me of scikit-learn, which gives practitioners easy access to almost every algorithm through a consistent interface. The Hugging Face library has accomplished the same kind of consistent and easy-to-use interface, but this time for deep-learning-based algorithms/architectures in the NLP world. We will dig into the architectures with the help of the interfaces provided by this library.

Exploratory Model Analysis (EMA)

One of the main components of any Machine Learning (ML) project is Exploratory Data Analysis (EDA). Every ML project I have been part of in the past years has started with me and my team doing EDA, iteratively, which helped us understand and formulate a concrete problem statement. Figure 1 below demonstrates the typical ML process with an iterative EDA phase, which aims at answering questions about the data to help make decisions, typically about methods to leverage the data to solve specific business problems (say via modeling).

With the advent of Deep Learning (DL), especially with the option of transfer learning, the exploration phase now extends beyond looking at the data. It also entails another cycle of exploration for models, let’s call that EMA (Exploratory Model Analysis), which involves understanding the model architectures, their pre-training process, the data and assumptions that went into the pre-training, the architecture’s limitations (e.g. input size, model bias, types of problems they cannot solve), and the extent to which they can be fine-tuned for a downstream task. In other words, analyze where they lie on the spectrum from full re-training, through few-shot fine-tuning, to zero-shot use (as they say in the GPT-3 world).

Figure 2: EDA + EMA: A Typical Deep Learning flow with both EDA (Exploratory Data Analysis) and EMA (Exploratory Model Analysis) .. or should we just call it EDMA? :)

In this article, I would like to focus more on EMA for BERT (and the likes), to understand what it can provide beyond fine-tuning for classification or Q&A. As I stated earlier, the Hugging Face library can provide us with the tools necessary to peek into a model and explore its various aspects. More specifically, I would like to use the library to answer the following questions:

  1. How can I peek into the pre-trained model architecture and attempt to interpret the model results given the weights? If you have heard of attention weights and how they could be used to interpret these models, we will explore how to access them using the Hugging Face library, and also visualize them.

  2. How can I access outputs from the various layers of BERT-like models? What quality of output can I expect from these models if I had to go completely unsupervised? We will extract word and sentence vectors, and visualize them to analyze similarity. Further, we will examine the impact on the quality of these vectors and metrics when we fine-tune these models on a dataset from a completely different domain. To put it simply, if we did not have domain-specific datasets to fine-tune on, how do we move forward?!

These questions spawn from two pain points: limited availability of labelled data, and interpretability. Most real-world projects I have worked on, unlike Kaggle competitions, do not come with a nice labelled dataset, and the challenge is to justify the cost of creating labelled data. The second challenge, in some of those projects, is the ability to provide an explanation of the model's behavior, to hedge some flavor of risk.

Without further ado, let’s dive in! :)

Easy access to attention weights: a step towards interpretation

All the transformer based architectures today are based on attention mechanisms. I found that understanding the basics of how attention works helped me explore how that could be used as a tool for interpretation. I plan to describe the layers at a high level in this post, and focus more on how to extract them using the Transformers library from Hugging Face. If you need to understand the concept of attention in depth, I would suggest you go through Jay Alammar’s blog (link provided earlier) or watch this playlist by Chris McCormick and Nick Ryan here.

The Hugging Face library provides us with a way to access the attention values across all attention heads in all hidden layers. In the BERT base model, we have 12 hidden layers, each with 12 attention heads. Each attention head has an attention weight matrix of size NxN (where N is the number of tokens from the tokenization process). In other words, we have a total of 144 matrices (12x12), each of size NxN. The final embedding size of each token at every layer's input or output is 768 (which comes from the 64-dimensional vectors produced by each attention head, i.e. 64x12 = 768). This will be clear as you move to Figure 4 below.

Figure 3 provides the architecture of an encoder layer. Figure 4 below drills into the Attention block from Figure 3, and provides a simplified, high-level flow of one sentence through one attention layer of the BERT base model (ignoring the batch_size for simplicity). These diagrams hopefully provide clarity on what matrix will be returned when you set the output_attentions flag to true via the library.

Figure 3: A simplified diagram of the encoder stack. As you can see, we get a vector for each token after each encoder layer. The next diagram (Fig 4 below) drills into the attention block in one of these encoder blocks
Figure 4: A simplified, high-level flow of one sentence (batch dim ignored for simplicity) through one self-attention layer of the BERT base model. The input matrix, Nx768 (N rows, one for each token, each embedded into 768 dimensions) flows through the attention layer (the box in the center). When we set output_attentions=True in the BertConfig, it returns the matrix ‘A’ for each attention head.

Note: I found tools here and here, which enable us to visualize attentions. These tools are either deprecated or do not implement all the latest architectures. Further, instead of leveraging well-maintained APIs like Hugging Face's, one of them re-implements the architectures internally, which makes it harder to run them on newer architectures.

Let’s quickly walk through the code (the full notebook can be found here). All the code here, except fine-tuning, can be run without a GPU.

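A minimal sketch of that walkthrough (variable names and the example sentence are illustrative; the linked notebook is the authoritative version, and the tuple indices assume only output_attentions is switched on):

```python
import torch
from transformers import BertConfig, BertModel, BertTokenizer

# Ask the model to return the attention matrices along with its usual outputs
config = BertConfig.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()  # forward pass only, no gradients needed

sentence = "The animal didn't cross the street because it was too tired"
input_ids = torch.tensor([tokenizer.encode(sentence)])  # adds [CLS]/[SEP]

with torch.no_grad():
    outputs = model(input_ids)

last_hidden_state = outputs[0]  # (batch, N, 768)
pooled_output = outputs[1]      # (batch, 768), used for fine-tuning
attentions = outputs[2]         # tuple with one tensor per hidden layer

print(len(attentions))            # 12 layers
print(attentions[0].shape)        # (batch, 12 heads, N, N)
print(attentions[8][0, 9].shape)  # one layer, one head -> (N, N)
```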

The code above creates a Bert config with output_attentions=True, and uses this config to initialize the Bert base model. We put the model into eval mode since we just care about doing a forward pass through the architecture for this task. The code then goes on to tokenize the input and do a forward pass. The shape of the output is based on the config passed, as described in the documentation here. The first two items are the last_hidden_state for the last layer and the pooled_output that can be used for fine-tuning. The next item is what we are interested in: the attentions. As you can see from the last three statements, we can reach any layer and any attention head, each of which gives us an NxN matrix we are interested in.

We can quickly plot a heatmap for any of the 144 matrices, like below:

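The two helpers referenced below could look roughly like this (a sketch reusing the tokenizer, input_ids and attentions from above; seaborn for the heatmap is my assumption, not necessarily what the notebook uses):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def get_attentions(attentions, layer, head):
    """Navigate to one layer and one attention head; return its NxN matrix."""
    return attentions[layer][0, head].detach().numpy()  # batch index 0

def plt_attentions(att_matrix, tokens, title=""):
    """Plot an attention matrix as a heat map, with tokens on both axes."""
    fig, ax = plt.subplots(figsize=(8, 8))
    sns.heatmap(att_matrix, xticklabels=tokens, yticklabels=tokens,
                cmap="viridis", square=True, ax=ax)
    ax.set_title(title)
    plt.show()

tokens = tokenizer.convert_ids_to_tokens(input_ids[0].tolist())
plt_attentions(get_attentions(attentions, layer=8, head=9), tokens,
               title="Layer 9, head 10")
```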

The code above has two simple functions:

“get_attentions” navigates to the particular layer and attention head, and grabs the NxN matrix to be visualized

“plt_attentions” plots the matrix passed as a heat map

As part of my EDA in the full notebook here, I plotted all 144 heatmaps as a grid, and skimmed through them to spot some that had a good variation of attention weights. One of them in particular, shown in Figure 5 below, shows the relation between the words ‘it’ and ‘animal’ in the sentence “The animal didn’t cross the street because it was too tired”.

Figure 5: The attention heatmap for the sentence “The animal didn’t cross the street because it was too tired” from layer 9 and attention head 10. We can see that the word “it” has a large weight for “animal”.

As you can see, with some basic understanding of the architecture, the transformers library by Hugging Face makes it extremely easy to pull out raw weights from any attention head. Now that we have discussed how to pull out raw weights, let's talk a bit about whether we should use them directly to interpret what the model has learnt. A recent paper, “Quantifying Attention Flow in Transformers”, discusses exactly this aspect. The authors state that “across layers of the Transformer, information originating from different tokens gets increasingly mixed”.

This means reading too much into these weights to interpret how the model deconstructs the input text may not be very useful. They go on to devise a strategy to help interpret the impact of inputs on outputs. I won’t dive into the full paper here, but in short, they discuss building a Directed Acyclic Graph (DAG) on top of the architecture, which helps track paths and information flow between pairs of inputs and the hidden tokens. They discuss two approaches, “attention rollout” and “attention flow”, that can be used to interpret attention weights as the relative relevance of the input tokens. In simple terms, instead of looking only at the raw attentions in a particular layer, you should consider a weighted flow of information all the way from the input embedding to the particular hidden output. If you are interested to know more, you can also refer to this article, which explains the paper with examples.

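To make the rollout idea concrete, here is a simplified sketch of the computation as I understand it from the paper (not the authors' code): average the heads in each layer, mix in the residual connection, re-normalize, and multiply the resulting matrices layer by layer.

```python
import numpy as np

def attention_rollout(attentions):
    """Roll attention out across layers to approximate token-to-token relevance.

    attentions: tuple with one tensor per layer, each of shape (batch, heads, N, N).
    Returns an NxN matrix whose row i gives the relevance of every input token
    to position i after the final layer.
    """
    rollout = None
    for layer_att in attentions:
        att = layer_att[0].mean(dim=0).detach().numpy()  # average over heads -> (N, N)
        att = 0.5 * att + 0.5 * np.eye(att.shape[0])     # account for the residual connection
        att = att / att.sum(axis=-1, keepdims=True)      # re-normalize each row
        rollout = att if rollout is None else att @ rollout
    return rollout

relevance = attention_rollout(attentions)  # reuses the attentions from earlier
```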

In summary, to interpret the effect of specific inputs, instead of looking at only the raw attentions independently in each layer, we should take it a step further by using them to track the contribution all the way from the input embedding to specific outputs.

Access to word and sentence vectors: paths to similarity (and clustering, classification, etc.)

As we discussed, it is quite easy to access the attention layers and the corresponding weights. The Hugging Face library also provides us with easy access to outputs from each layer. This allows us to generate word vectors, and potentially sentence vectors.

Word Vectors

Figure 6 below shows a few different ways we can extract word-level vectors. We could average/sum/concatenate the last few layers to get a vector. The hypothesis would be that the initial layers (closer to the inputs) learn low-level features (like in a CNN), and the final few layers (closer to the output) have a much richer representation of the words. We could also just extract the last or the second-to-last layer as a word/token vector. There is no consensus on which one should be used; it really depends on the requirements of the downstream task, and it may not even matter much if the task we are trying to run downstream is simple enough.

Figure 6 Word Vectors: Ways we can extract vectors for each token. On the left, it shows how we could either average, sum or concatenate the last 4 layers to get one vector for Token-1. On the right, it shows how we could access a vector for Token-N from the last or second-to-last layer.

Word Similarity

Once we have word vectors, we are ready to use them in a downstream task like similarity. We could directly visualize them using dimensionality reduction techniques like PCA or t-SNE. Instead, I generally try to find distances between these vectors first, in the higher dimension, and then use techniques like Multi-Dimensional Scaling (MDS; ref link) to visualize the distance matrix. Figure 7 below summarizes this approach (we can do this for sentences as well, as long as we have access to sentence vectors directly; more on this later in the post):

Figure 7: Flow of how to visualize word vectors using a cosine distance + MDS (Multi dimensional Scaling) for 4 words

Implementation: Word Similarity

Let’s start looking through some code that implements the above flow. The full colab notebook can be found here.

We will follow the same process as we used to initialize and visualize attention weights, except this time we use “output_hidden_states=True” while initializing the model.

We also switch to encode_plus instead of encode, which helps us add the CLS/SEP/PAD tokens with a lot less effort. We build some additional logic around it to get attention masks that mask out the PADs and also the CLS/SEPs. Notice the return values in the function below.

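A sketch of what that preprocessing function might look like (function and variable names are my own; the notebook's version may differ):

```python
import torch

def tokenize_sentences(texts, tokenizer, max_len=32):
    """Tokenize with encode_plus, pad to max_len, and build a mask that keeps
    only 'real' word tokens (i.e. no CLS/SEP/PAD)."""
    input_ids, attention_masks, real_token_masks, token_lists = [], [], [], []
    for text in texts:
        enc = tokenizer.encode_plus(text,
                                    add_special_tokens=True,
                                    max_length=max_len,
                                    padding="max_length",
                                    truncation=True,
                                    return_attention_mask=True,
                                    return_special_tokens_mask=True)
        input_ids.append(enc["input_ids"])
        attention_masks.append(enc["attention_mask"])
        # 1 for real word tokens, 0 for CLS/SEP/PAD
        real_token_masks.append([a * (1 - s) for a, s in
                                 zip(enc["attention_mask"], enc["special_tokens_mask"])])
        token_lists.append(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
    return (torch.tensor(input_ids), torch.tensor(attention_masks),
            torch.tensor(real_token_masks), token_lists)
```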

Let’s define the model and call it on the sample sentences.

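A sketch of that step, reusing the helper above (the sample sentences are the ones used later in the post; with only output_hidden_states switched on, index 2 of the outputs holds the hidden states):

```python
texts = ["Joe took Alexandria out on a date.",
         "What is your date of birth?"]

config = BertConfig.from_pretrained("bert-base-uncased", output_hidden_states=True)
model = BertModel.from_pretrained("bert-base-uncased", config=config)
model.eval()

input_ids, attention_masks, real_token_masks, tokenized_sents = \
    tokenize_sentences(texts, tokenizer)

with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_masks)

hidden_states = outputs[2]  # tuple of 13 tensors: embedding output + 12 layers
print(len(hidden_states), hidden_states[-1].shape)  # 13, (batch, max_len, 768)
```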

Let’s define a few helper functions (sketched below):

  • get_vector: Extract vectors as needed (concat/sum across multiple layers as discussed in the diagrams above etc.)

  • plt_dists: Plots the distance matrix passed in. This computes an MDS embedding of the distance matrix and plots it.

  • eval_vecs: Ties get_vector and plt_dists together to get and plot word vectors for the sentences

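Sketches of these helpers, under the same assumptions as the snippets above (scikit-learn's MDS and cosine_distances are used here; the notebook's implementations may differ):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import cosine_distances

def get_vector(hidden_states, sent_idx, token_idx, mode="concat", last_n=4):
    """Extract one token's vector: concat/sum the last `last_n` layers,
    or take the last or second-to-last layer directly."""
    layers = [hidden_states[-i][sent_idx, token_idx] for i in range(1, last_n + 1)]
    if mode == "concat":
        return torch.cat(layers).numpy()
    if mode == "sum":
        return torch.stack(layers).sum(dim=0).numpy()
    if mode == "last":
        return hidden_states[-1][sent_idx, token_idx].numpy()
    return hidden_states[-2][sent_idx, token_idx].numpy()  # second-to-last

def plt_dists(dist_matrix, labels):
    """Project a precomputed distance matrix to 2D with MDS and scatter-plot it."""
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dist_matrix)
    plt.scatter(coords[:, 0], coords[:, 1])
    for (x, y), label in zip(coords, labels):
        plt.annotate(label, (x, y))
    plt.show()

def eval_vecs(hidden_states, tokenized_sents, mode="concat"):
    """Build a vector for every real token, compute cosine distances, and plot."""
    vectors, labels = [], []
    for s, tokens in enumerate(tokenized_sents):
        for t, token in enumerate(tokens):
            if real_token_masks[s][t]:  # skip CLS/SEP/PAD (mask from earlier)
                vectors.append(get_vector(hidden_states, s, t, mode=mode))
                labels.append(f"{token}_{s}")
    plt_dists(cosine_distances(np.vstack(vectors)), labels)
```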

That's it! We are now ready to visualize these vectors in different configurations. Let's look at one of them, concatenating the word vectors from the last 4 layers, and visualizing. We pass two sentences:

texts = ["Joe took Alexandria out on a date.","What is your date of birth?",]

And then call the function to plot the similarity

MODE = 'concat'
eval_vecs(hidden_states, tokenized_sents, mode='concat')

Figure 8: As you can see, date_0 and date_1 are different vectors, and are closer to other words in the respective sentence. This allows us to now use them based on the sentence they occur in (i.e. “contextualized”)

Sentence Vectors

The encoder-based models provide us with a couple of options to get sentence vectors from the architecture. Figure 9 shows us these options:

  1. Sentence vectors could be extracted using the last layer (or averaged over last n layers) for the CLS token. I tried extracting and visualizing sentence similarity based on the CLS token, but it did not give me good results.

A quick note from the authors of BERT: This output is usually not a good summary of the semantic content of the input (source).

2. Average the token vectors across the sentence. The token vectors themselves, as we discussed above, could come from concatenating/averaging over the last N layers or directly from a single layer (a short sketch follows Figure 9 below).

Figure 9: Sentence vectors could be extracted using the last-layer CLS token directly, OR could be averaged over all the tokens in the sentence, which in turn could come from the last layer, the second-to-last layer, or an average over a few layers, as we saw in Figure 6.
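
A sketch of the averaging option, reusing get_vector and the real-token mask from the snippets above (the names are assumptions carried over from those sketches):

```python
def get_sentence_vector(hidden_states, sent_idx, mode="concat"):
    """Average the vectors of a sentence's real tokens into one sentence vector."""
    vecs = [get_vector(hidden_states, sent_idx, t, mode=mode)
            for t in range(len(tokenized_sents[sent_idx]))
            if real_token_masks[sent_idx][t]]
    return np.mean(vecs, axis=0)
```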

Sentence Similarity

Similarity of two sentences is very subjective. Two sentences could be very similar in one context, and could be treated as opposites in another. For example, two sentences could be called similar because they are talking about a certain topic, while discussing both positive and negative aspects of that topic. Those sentences could be considered similar because they are talking about a common topic, but would be considered opposites if the focus is the polarity of the sentences. Most of these architectures are pre-trained on objectives independent of the downstream task, like Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), and on large, varied datasets. What we get out of the box in terms of similarity may or may not be relevant, depending on the task at hand.

Regardless, it is important to look at our options for measuring similarity. Figure 11 below summarizes our options for sentence similarity in a flow diagram.

Figure 11: Different ways of calculating Sentence Similarity using BERT-like models

Options #1 and #2 above extract and try to create sentence vectors, which can then go through the same pipeline we built for word-vector similarity.

Option #3 tries to compute similarity between two sentences directly from word vectors, instead of attempting to create a sentence vector explicitly. This uses a special metric called Word Movers Distance (WMD).

Figure 10: WMD calculation: How the sentence on the left is “moved” to the sentence on the right

You can read the original paper on WMD here, but in short, it is based on EMD (Earth Movers Distance) and tries to move the words from one sentence to the other using the word vectors. An example taken directly from the paper is shown in Figure 10. There is a nice implementation of this here, and an awesome explanation here.

Implementation: Sentence Similarity

The code to run through the sentence similarity options is available as a colab notebook here.

Most of the code is very similar to the word embeddings piece we discussed earlier, except for calculating the Word Movers Distance. The class below calculates the Word Movers Distance using the wmd-relax library. It needs access to an embedding lookup that yields a vector when a word is passed.

Class defined to calculate WMD on vectors extracted from BERT. This needs an embedding lookup dictionary. Internally, it calculates n-bow, which is essentially a distribution of words based on their count in the sentence.
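
Since the notebook's wmd-relax-based class isn't reproduced here, the sketch below illustrates the same computation directly as a small optimal-transport problem with SciPy; the weights are the n-bow distributions mentioned in the caption (word counts divided by sentence length). It is an illustration of the idea, not the notebook's class.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.metrics.pairwise import cosine_distances

def word_movers_distance(vecs_a, weights_a, vecs_b, weights_b):
    """Minimum cost of 'moving' sentence A's word distribution onto sentence B's.

    vecs_*: (n, d) arrays of word vectors; weights_*: n-bow weights summing to 1."""
    cost = cosine_distances(vecs_a, vecs_b)  # ground distances between word vectors
    n_a, n_b = cost.shape
    # Flow variable x[i, j]: how much of word i (from A) moves to word j (from B).
    A_eq, b_eq = [], []
    for i in range(n_a):                     # each row of flow sums to weights_a[i]
        row = np.zeros(n_a * n_b)
        row[i * n_b:(i + 1) * n_b] = 1.0
        A_eq.append(row); b_eq.append(weights_a[i])
    for j in range(n_b):                     # each column of flow sums to weights_b[j]
        col = np.zeros(n_a * n_b)
        col[j::n_b] = 1.0
        A_eq.append(col); b_eq.append(weights_b[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```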

The potential for transfer learning:

I came across two levels of transfer learning for Transformer-based models:

  • The default learning that comes baked into pre-trained models. Models like BERT/RoBERTa etc. come pre-trained on large corpora, and give us a starting point
  • Fine-tuning the architecture for task specific learning, which is typically how these architectures are used today (e.g. building classifiers/Q&A systems with a dataset from the domain)

The problems I work on generally do not have domain-specific labeled datasets. This led me to explore a third option:

  • What if I fine-tune the models, but on a dataset that comes from a different domain? If I make sure that the objective is similar, can I still transfer-learn, even though I know there will be a covariate shift? (Covariate shift is a change in the distribution of the input variables between the training and the test/real-time data.)

To understand if this method of transfer learning works, let’s do a quick experiment. Let’s pick sentiment/polarity as the objective for our experimentation.

We take the pre-trained RoBERTa model, fine-tune it on the training set from the IMDB 50K movie reviews, and then pick our evaluation dataset as follows:

  • 2 sentences from the IMDB 50k movie reviews TEST dataset

Positive: 'This is a good film. This is very funny. Yet after this film there were no good Ernest films!'
Negative: 'Hated it with all my being. Worst movie ever. Mentally- scarred. Help me. It was that bad. TRUST ME!!!'
  • 2 sentences from Amazon fine food reviews, which is a dataset from a completely different domain. We will use this only for evaluation

Positive: 'This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!'
Negative: "This oatmeal is not good. Its mushy, soft, I don't like it. Quaker Oats is the way to go."

We then extract and visualize sentence similarity on both the base pre-trained model (the RoBERTa architecture in this case) and the fine-tuned model on the same architecture.

The code to fine-tune BERT is available here.

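For reference, a minimal fine-tuning sketch using the Hugging Face Trainer (hyperparameters are illustrative, and the two example reviews stand in for the full IMDB training split; the linked notebook is the authoritative version):

```python
import torch
from transformers import (RobertaTokenizer, RobertaForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

class ReviewDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs, e.g. the IMDB 50K training reviews."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Stand-ins for the IMDB training split (1 = positive, 0 = negative)
train_texts = ["This is a good film. This is very funny.",
               "Hated it with all my being. Worst movie ever."]
train_labels = [1, 0]

args = TrainingArguments(output_dir="roberta-imdb-sentiment",
                         num_train_epochs=2, per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=ReviewDataset(train_texts, train_labels)).train()
```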

Comparison:

  1. Let's do our first comparison by visualizing cosine similarity between sentence vectors obtained by averaging word vectors, which are themselves averaged over the last four layers (I know this is a mouthful, but if you need clarification on any of these, feel free to scroll up to where we discussed each of them in detail).

Legend: The yellow circles are positive, while the pink circles are negative

Figure 11 below shows the visualization run on the pre-trained model:

The yellow circles are positive, while the pink circles are negative

Figure 12 below shows the same plot as Figure 11 but zoomed in (notice the axes have a different range). This shows that while the distance between the vectors is really small, after zooming in, we see that the model does attempt to place the sentences discussing the subject (food) closer to each other, favoring the subject of discussion more than the polarity of the sentence (see the bottom two circles, one yellow and one pink).

Figure 12 Pre-trained model: After zooming in, we see that it does attempt to place the sentences discussing the subject (food) closer to each other, favoring the subject of discussion more than the polarity of the sentence. The yellow circles are positive, while the pink circles are negative

Now let's look at the same sentences passed through the fine-tuned model in Figure 13 below. Notice how the sentences are now grouped much more closely by polarity. As you can notice, the model has moved its focus from the subject of discussion (food/movie) to polarity (positive vs negative): the positive sentence from the movie review data is closer to the positive sentence from the food review dataset, as is the case with the negative examples. This is exactly what we needed! We wanted to use a dataset from a completely different domain, but with a similar labeling objective, to transfer-learn!

Figure 13 Fine-tuned model: As you can notice, it has moved its focus from the subject of discussion (food vs movie) to polarity (positive vs negative). The yellow circles are positive, while the pink circles are negative

2. Let’s do our second round of comparison by visualizing the Word Movers Distance (WMD) calculated on word vectors averaged from the last four layers.

Figure 14 below shows characteristics similar to the plots above. The subject of discussion (food) still seems to be the focus. One main difference here is that Word Movers Distance (WMD) is able to tease out the distances better than the earlier method (applying cosine distance to word vectors averaged across sentences), even on the base model, i.e. there was no need to zoom in here. This shows that averaging word vectors may have unintended consequences for a downstream task.

Figure 14: The distance metric, Word Movers Distance (WMD), shows similar characteristics, but can help differentiate sentences better than averaging methods. The yellow circles are positive, while the pink circles are negative

Figure 15 below shows the same WMD metric on the four sentences using the Fine-tuned model, which again shows that we were able to “shift” the focus of the model towards polarity.

Figure 15: WMD metric on the four sentences using the Fine-tuned model. The yellow circles are positive, while the pink circles are negative

In summary, the ability to transfer-learn from unrelated-domain datasets will open more avenues for projects that struggle with data today. The power of transformers can now enable companies to leverage previously unusable datasets, without having to spend time creating labelled ones.

I hope this article was helpful. There are a few other topics, like data augmentation in NLP and model interpretation via techniques like LIME (Local Interpretable Model-agnostic Explanations), that interest me, and I do plan to explore them in future posts. Until then, thanks for reading!

https://arxiv.org/pdf/1906.05714.pdf

Originally published at: https://towardsdatascience.com/beyond-classification-with-transformers-and-hugging-face-d38c75f574fb
