TENER: Adapting Transformer Encoder for Named Entity Recognition

  • Abstract
  • 1 Introduction
  • 2 Related Work
    • 2.1 Neural Architecture for NER
    • 2.2 Transformer
      • 2.2.1 Transformer Encoder Architecture
      • 2.2.2 Position Embedding
  • 3 Proposed Model
    • 3.1 Embedding Layer
    • 3.2 Encoding Layer with Adapted Transformer
      • 3.2.1 Direction- and Distance-Aware Attention
      • 3.2.2 Un-scaled Dot-Product Attention
    • 3.3 CRF Layer
  • 4 Experiment
    • 4.1 Data
  • Conclusion
  • TENER Execution Flow
    • Overall Flow
    • TENER Model Training Flow
      • Transformer Encoder Network Structure


Abstract

What the paper sets out to do:
Bidirectional long short-term memory networks (BiLSTMs) have been widely used as the encoder for the named entity recognition (NER) task. Recently, the fully-connected self-attention architecture (a.k.a. the Transformer) has been broadly adopted in various natural language processing (NLP) tasks owing to its parallelism and its advantage in modeling long-range context. Nevertheless, the performance of the vanilla Transformer in NER is not as good as in other NLP tasks. In this paper, we propose TENER, a NER architecture that adopts an adapted Transformer encoder to model character-level and word-level features.

What the paper does:
By incorporating direction-aware, distance-aware and un-scaled attention, we show that a Transformer-like encoder is just as effective for NER as for other NLP tasks.

Results:
Experiments on six NER datasets show that TENER outperforms the prevailing BiLSTM-based models.

1 Introduction

Background:
Named entity recognition (NER) is the task of finding the start and end of an entity in a sentence and assigning a class to this entity. NER has been widely studied in the field of natural language processing (NLP) because of its potential to assist question generation (Zhou et al., 2017), relation extraction (Miwa and Bansal, 2016), and coreference resolution (Fragkou, 2017). Since Collobert et al. (2011), various neural models have been introduced to avoid hand-crafted features (Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016).

NER is usually viewed as a sequence labeling task, and the corresponding neural models usually contain three components: a word embedding layer, a context encoder layer, and a decoder layer (Huang et al., 2015; Ma and Hovy, 2016; Lample et al., 2016; Chiu and Nichols, 2016; Chen et al., 2019; Zhang et al., 2018; Gui et al., 2019b). The difference between various NER models mainly lies in the variation of these components.

Recurrent Neural Networks (RNNs) are widely employed in NLP tasks due to their sequential nature, which aligns well with language. In particular, the bidirectional long short-term memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) is one of the most widely used RNN structures. Huang et al. (2015) were the first to apply the BiLSTM and Conditional Random Fields (CRF) (Lafferty et al., 2001) to sequence labeling tasks. Owing to BiLSTM's strong ability to learn contextual representations of words, it has been adopted by the majority of NER models as the encoder (Ma and Hovy, 2016; Lample et al., 2016; Zhang et al., 2018; Gui et al., 2019b).

Limitations of prior methods:
Recently, the Transformer (Vaswani et al., 2017) began to prevail in various NLP tasks, such as machine translation (Vaswani et al., 2017), language modeling (Radford et al., 2018), and pre-training models (Devlin et al., 2018). The Transformer encoder adopts a fully-connected self-attention structure to model long-range context, which is the weakness of RNNs. Moreover, the Transformer has better parallelism than RNNs. However, in the NER task, the Transformer encoder has been reported to perform poorly (Guo et al., 2019), and our experiments confirm this result. Therefore, it is intriguing to explore why the Transformer does not work well in the NER task.

Proposed solution:
In this paper, we analyze the properties of the Transformer and propose two specific improvements for NER.

The first is that the sinusoidal position embedding used in the vanilla Transformer is aware of distance but unaware of direction. Moreover, even this distance awareness is lost once the embeddings are used inside the vanilla Transformer's attention. However, both direction and distance information are important in the NER task. For example, in Fig 1, words after "in" are more likely to be a location or time than words before it, and words before "Inc." are most likely to be of the entity type "ORG". Besides, an entity is a continuous span of words, so distance awareness may help a word better recognize its neighbors. To endow the Transformer with direction- and distance-awareness, we adopt relative positional encoding (Shaw et al., 2018; Huang et al., 2019; Dai et al., 2019) instead of absolute position encoding. We propose a revised relative positional encoding that uses fewer parameters and performs better.

The second is an empirical finding. The attention distribution of the vanilla Transformer is scaled and smooth. But for NER, a sparse attention is more suitable, since not all words need to be attended to: given a current word, a few contextual words are enough to judge its label, and smooth attention may include noisy information. Therefore, we abandon the scaling factor of the dot-product attention and use an un-scaled, sharper attention.

Results:
With the above improvements, we can greatly boost the performance of the Transformer encoder for NER.

Other than only using the Transformer to model the word-level context, we also apply it as a character encoder to model word representations with character-level information. Previous work has shown that a character encoder is necessary to capture character-level features and alleviate the out-of-vocabulary (OOV) problem (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Xin et al., 2018). In NER, CNNs are commonly used as the character encoder. However, we argue that a CNN is not perfect for representing character-level information either, because its receptive field is limited and the kernel size of the CNN character encoder is usually 3, which means it cannot correctly recognize 2-gram or 4-gram patterns. Although we can deliberately design different kernels, a CNN still cannot handle patterns with discontinuous characters, such as "un…ily" in "unhappily" and "unnecessarily". In contrast, a Transformer-based character encoder can not only fully exploit the parallelism of GPUs, but also has the potential to recognize different n-grams and even discontinuous patterns. Therefore, in this paper, we also try to use the Transformer as the character encoder, and we compare four kinds of character encoders.
Note on n-gram patterns: in an n-gram model, the current word depends only on the preceding n − 1 words. The choice of n involves a trade-off:
(1) Computational complexity: the larger n is, the higher the cost (it grows exponentially with n).
(2) Model quality: in theory a larger n is better, but the marginal gain shrinks as n grows.
In practice, n = 3 is therefore a common choice.

Main contributions:
In summary, to improve the performance of Transformer-based models on the NER task, we explicitly utilize directional relative positional encoding, reduce the number of parameters, and sharpen the attention distribution. After this adaptation, the performance improves considerably, and our model even outperforms BiLSTM-based models. Furthermore, on the six NER datasets, we achieve state-of-the-art performance among models that do not use pre-trained language models or hand-designed features.

2 Related Work

2.1 Neural Architecture for NER

Summary of existing work:
Collobert et al. (2011) utilized the Multi-Layer Perceptron (MLP) and CNN to avoid task-specific features when tackling different sequence labeling tasks, such as Chunking, Part-of-Speech (POS) tagging, and NER. In Huang et al. (2015), BiLSTM-CRF was introduced to solve sequence labeling problems. Since then, the BiLSTM has been extensively used in the field of NER (Chiu and Nichols, 2016; Dong et al., 2016; Yang et al., 2018; Ma and Hovy, 2016).

Despite BiLSTM's great success in the NER task, it has to compute token representations one by one, which massively hinders full exploitation of the GPU's parallelism. Therefore, CNNs have been proposed (Strubell et al., 2017; Gui et al., 2019a) to encode words concurrently. In order to enlarge the receptive field of CNNs, Strubell et al. (2017) used iterated dilated CNNs (IDCNN).

Since word shape information, such as capitalization and n-grams, is important for recognizing named entities, CNNs and BiLSTMs have been used to extract character-level information (Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016; Strubell et al., 2017; Chen et al., 2019).

Almost all neural NER models use pre-trained word embeddings, such as Word2vec and GloVe (Pennington et al., 2014; Mikolov et al., 2013). When contextual word embeddings are combined, the performance of NER models improves a lot (Peters et al., 2017, 2018; Akbik et al., 2018). ELMo, introduced by Peters et al. (2018), used a CNN character encoder and BiLSTM language models to obtain contextualized word representations. Apart from these BiLSTM-based pre-trained models, BERT is based on the Transformer (Devlin et al., 2018).

2.2 Transformer

Summary of existing work:
The Transformer was introduced by Vaswani et al. (2017) and is mainly based on self-attention. It has achieved great success in various NLP tasks. Since the self-attention mechanism used in the Transformer is unaware of positions, position embeddings were introduced to remedy this shortcoming (Vaswani et al., 2017; Devlin et al., 2018). Instead of using the sinusoidal position embedding (Vaswani et al., 2017) or a learned absolute position embedding, Shaw et al. (2018) argued that the distance between two tokens should be considered when calculating their attention score. Huang et al. (2019) reduced the computational complexity of relative positional encoding from O(l²d) to O(ld), where l is the length of the sequence and d is the hidden size. Dai et al. (2019) derived a new form of relative positional encoding, so that the relative relation could be better considered.

2.2.1 Transformer Encoder Architecture

Background:
We first introduce the Transformer encoder proposed in Vaswani et al. (2017). The Transformer encoder takes as input a matrix H ∈ R^{l×d}, where l is the sequence length and d is the input dimension. Three learnable matrices W_q, W_k, W_v are then used to project H into different spaces. Usually, all three matrices are of size R^{d×d_k}, where d_k is a hyper-parameter. After that, the scaled dot-product attention can be calculated by the following equations,
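The equations themselves did not survive extraction; a reconstruction consistent with the surrounding notation is given below, keeping only the two equation numbers, (1) and (3), that the rest of the note refers to:

    Q, K, V = H W_q, \; H W_k, \; H W_v                                                      \quad (1)
    \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V      \quad (3)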

where Q_t is the query vector of the t-th token and j indexes a token that the t-th token attends to; K_j is the key vector of the j-th token. The softmax is taken along the last dimension. Instead of using one group of W_q, W_k, W_v, using several groups enhances the ability of self-attention. When several groups are used, it is called multi-head self-attention, and the calculation can be formulated as follows,
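These multi-head equations are likewise reconstructed; only Eq.(6), which is referenced later in the note, is numbered here:

    \mathrm{head}^{(h)} = \mathrm{Attn}\left(Q^{(h)}, K^{(h)}, V^{(h)}\right), \qquad Q^{(h)}, K^{(h)}, V^{(h)} = H W_q^{(h)}, \; H W_k^{(h)}, \; H W_v^{(h)}
    \mathrm{MultiHead}(H) = \left[\mathrm{head}^{(1)}; \ldots; \mathrm{head}^{(n)}\right] W_o     \quad (6)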

where n is the number of heads and the superscript h is the head index. [head^{(1)}; …; head^{(n)}] denotes concatenation along the last dimension. Usually d_k × n = d, which means the output of [head^{(1)}; …; head^{(n)}] is of size R^{l×d}. W_o is a learnable parameter of size R^{d×d}.

The output of the multi-head attention is further processed by a position-wise feed-forward network, which can be represented as follows,
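The referenced feed-forward equation, reconstructed to match the parameter shapes listed below, is:

    \mathrm{FFN}(x) = \max(0, \; x W_1 + b_1) \, W_2 + b_2     \quad (7)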

where W_1, W_2, b_1, b_2 are learnable parameters, with W_1 ∈ R^{d×d_ff}, W_2 ∈ R^{d_ff×d}, b_1 ∈ R^{d_ff}, b_2 ∈ R^{d}; d_ff is a hyper-parameter. Other components of the Transformer encoder include layer normalization and residual connections; we use them in the same way as Vaswani et al. (2017).

2.2.2 Position Embedding

Summary of existing work:
Self-attention is not aware of the positions of different tokens, making it unable to capture the sequential nature of language. To solve this problem, Vaswani et al. (2017) suggested using position embeddings generated by sinusoids of varying frequency. The t-th token's position embedding can be represented by the following equations,
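The referenced sinusoidal position-embedding equations, reconstructed in the form assumed by the proofs below, are:

    PE_{t,2i}   = \sin\left(t / 10000^{2i/d}\right)      \quad (8)
    PE_{t,2i+1} = \cos\left(t / 10000^{2i/d}\right)      \quad (9)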

where i is in the range [0, d/2] and d is the input dimension. This sinusoid-based position embedding gives the Transformer the ability to model the position of a token and the distance between any two tokens. For any fixed offset k, PE_{t+k} can be represented by a linear transformation of PE_t (Vaswani et al., 2017).

3 Proposed Model

Overall architecture:
In this paper, we utilize the Transformer encoder to model the long-range and complicated interactions within a sentence for NER. The structure of the proposed model is shown in Fig 2. We detail each part in the following sections.

3.1 Embedding Layer

Details:
To alleviate the problems of data sparsity and out-of-vocabulary (OOV) words, most NER models adopt a CNN character encoder (Ma and Hovy, 2016; Ye and Ling, 2018; Chen et al., 2019) to represent words. Compared with a BiLSTM-based character encoder (Lample et al., 2016; Ghaddar and Langlais, 2018), a CNN is more efficient. Since the Transformer can also fully exploit the GPU's parallelism, it is interesting to use the Transformer as the character encoder. A potential benefit of a Transformer-based character encoder is that it can extract different n-grams and even discontinuous character patterns, such as "un…ily" in "unhappily" and "uneasily". For the model's uniformity, we use "adapted Transformer" to refer to the Transformer introduced in the next subsection.

The final word embedding is the concatenation of the character features extracted by the character encoder and the pre-trained word embedding.

As shown in the following code (the data-loading and embedding setup from the TENER training script):

# @cache_results is a decorator that caches the processed data to a local file.
# On the second run, if _refresh is False and _cache_fp is not None, the cached data is
# read from disk and returned instead of re-running load_data().
@cache_results(name, _refresh=False)
def load_data():
    # replace the paths as needed
    if dataset == 'conll2003':
        # for conll2003 the lr should not exceed 0.002
        paths = {'test': "./datasets/conll2003/eng.testa",
                 'train': "./datasets/conll2003/eng.train",
                 'dev': "./datasets/conll2003/eng.testb"}
        data = Conll2003NERPipe(encoding_type=encoding_type).process_from_file(paths)
    elif dataset == 'en-ontonotes':
        # uses the train.txt, test.txt, dev.txt files under this folder
        paths = './datasets/en-ontonotes/english'
        data = OntoNotesNERPipe(encoding_type=encoding_type).process_from_file(paths)

    char_embed = None
    # The char representation is obtained by concatenating the char embedding with the
    # corresponding word embedding. The char embedding treats the characters of a word as a
    # sequence and passes it through a CNN or an RNN.
    # Three character-embedding types are configured here; the CNN character encoder
    # corresponds to the one discussed in Section 3.1 of the paper.
    if char_type == 'cnn':
        char_embed = CNNCharEmbedding(vocab=data.get_vocab('words'), embed_size=30, char_emb_size=30,
                                      filter_nums=[30], kernel_sizes=[3], word_dropout=0, dropout=0.3,
                                      pool_method='max', include_word_start_end=False, min_char_freq=2)
    elif char_type in ['adatrans', 'naive']:
        char_embed = TransformerCharEmbed(vocab=data.get_vocab('words'), embed_size=30, char_emb_size=30,
                                          word_dropout=0, dropout=0.3, pool_method='max', activation='relu',
                                          min_char_freq=2, requires_grad=True, include_word_start_end=False,
                                          char_attn_type=char_type, char_n_head=3, char_dim_ffn=60,
                                          char_scale=char_type == 'naive',
                                          char_dropout=0.15, char_after_norm=True)
    elif char_type == 'lstm':
        char_embed = LSTMCharEmbedding(vocab=data.get_vocab('words'), embed_size=30, char_emb_size=30,
                                       word_dropout=0, dropout=0.3, hidden_size=100, pool_method='max',
                                       activation='relu', min_char_freq=2, bidirectional=True,
                                       requires_grad=True, include_word_start_end=False)

    word_embed = StaticEmbedding(vocab=data.get_vocab('words'),
                                 model_dir_or_name='en-glove-6b-100d',
                                 requires_grad=True, lower=True, word_dropout=0, dropout=0.5,
                                 only_norm_found_vector=normalize_embed)
    if char_embed is not None:
        embed = StackEmbedding([word_embed, char_embed], dropout=0, word_dropout=0.02)
    else:
        word_embed.word_drop = 0.02
        embed = word_embed

    data.rename_field('words', 'chars')
    return data, embed

3.2 Encoding Layer with Adapted Transformer

Details:
Although the Transformer encoder has a potential advantage in modeling long-range context, it does not work well for the NER task. In this paper, we propose an adapted Transformer for the NER task with two improvements.

3.2.1 Direction- and Distance-Aware Attention

Inspired by the success of BiLSTM in NER tasks, we consider what properties the Transformer lacks compared with BiLSTM-based models. One observation is that a BiLSTM can discriminatively collect the context information of a token from its left and right sides, while it is not easy for the Transformer to distinguish which side the context information comes from.

Although the dot product between two sinusoidal position embeddings is able to reflect their distance, it lacks directionality, and even this property is broken by the vanilla Transformer attention. To illustrate this, we first prove two properties of the sinusoidal position embeddings.

Property 1. For an offset k and a position t, PE_{t+k}^T PE_t depends only on k, which means the dot product of two sinusoidal position embeddings can reflect the distance between two tokens.

Proof. Based on the definitions of Eq.(8) and Eq.(9), the position embedding of the t-th token is
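The embedding vector is missing from the extracted text; spelled out with the constants c_i defined just below, it reads:

    PE_t = \left[\sin(c_0 t), \; \cos(c_0 t), \; \sin(c_1 t), \; \cos(c_1 t), \; \ldots, \; \sin(c_{d/2-1} t), \; \cos(c_{d/2-1} t)\right]^\top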

where d is the dimension of the position embedding and c_i is a constant determined by i, with value c_i = 1/10000^{2i/d}.
Therefore,
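The two derivation steps (Eq.(11)–(12)), reconstructed using the trigonometric identity cited right after them, are:

    PE_{t+k}^\top PE_t = \sum_{i=0}^{d/2-1} \left[\sin(c_i (t+k)) \sin(c_i t) + \cos(c_i (t+k)) \cos(c_i t)\right]     \quad (11)
                       = \sum_{i=0}^{d/2-1} \cos(c_i k)                                                                \quad (12)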

where the step from Eq.(11) to Eq.(12) uses the identity cos(x − y) = sin(x)sin(y) + cos(x)cos(y). The result depends only on k, which proves the property.

Property 2. For an offset k and a position t, PE_t^T PE_{t−k} = PE_t^T PE_{t+k}, which means the sinusoidal position embeddings are unaware of directionality.

Proof. Let j = t − k. According to Property 1, we have
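A reconstruction of the missing step, combining Property 1 with the symmetry of the dot product:

    PE_t^\top PE_{t-k} = PE_{j+k}^\top PE_j = \sum_{i=0}^{d/2-1} \cos(c_i k) = PE_{t+k}^\top PE_t = PE_t^\top PE_{t+k}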

The relation between d, k, and PE_t^T PE_{t+k} is displayed in Fig 3. The sinusoidal position embeddings are distance-aware but lack directionality.

However, the distance-awareness also disappears when PE_t is projected into the query and key spaces of self-attention, since in the vanilla Transformer the interaction between PE_t and PE_{t+k} is actually PE_t^T W_q^T W_k PE_{t+k}, where W_q, W_k are the parameters in Eq.(1). Mathematically, this can be viewed as PE_t^T W PE_{t+k} with a single matrix W. The relation between PE_t^T PE_{t+k} and PE_t^T W PE_{t+k} is depicted in Fig 4.

Therefore, to endow the Transformer with direction- and distance-aware characteristics, we calculate the attention scores using the equations below:
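The equations were lost in extraction; a reconstruction matching the equation numbers (16)–(18) referenced in the following paragraph (the final softmax step is left unnumbered) is:

    Q, K, V = H W_q, \; H_{d_k}, \; H W_v                                                                           \quad (16)
    R_{t-j} = \left[\ldots \; \sin(c_i (t-j)) \;\; \cos(c_i (t-j)) \; \ldots\right]^\top, \qquad c_i = 1/10000^{2i/d_k}   \quad (17)
    A^{rel}_{t,j} = Q_t^\top K_j + Q_t^\top R_{t-j} + u^\top K_j + v^\top R_{t-j}                                    \quad (18)
    \mathrm{Attn}(Q, K, V) = \mathrm{softmax}(A^{rel}) \, V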

where t is the index of the target token, j is the index of the context token, Q_t and K_j are the query vector and key vector of tokens t and j respectively, and W_q, W_v ∈ R^{d×d_k}. To get H_{d_k} ∈ R^{l×d_k}, we first split H into d/d_k partitions along the second dimension, and each head uses one partition. u ∈ R^{d_k} and v ∈ R^{d_k} are learnable parameters, R_{t−j} ∈ R^{d_k} is the relative positional encoding, and i in Eq.(17) ranges over [0, d_k/2]. Q_t^T K_j in Eq.(18) is the attention score between the two tokens; Q_t^T R_{t−j} is the t-th token's bias towards a certain relative distance; u^T K_j is a bias on the j-th token; and v^T R_{t−j} is a bias term for a certain distance and direction.

As shown in the following code (relative multi-head attention; only the forward pass is shown):

    def forward(self, x, mask):
        """
        :param x: batch_size x max_len x d_model
        :param mask: batch_size x max_len
        :return:
        """
        batch_size, max_len, d_model = x.size()
        pos_embed = self.pos_embed(mask)  # l x head_dim

        qkv = self.qkv_linear(x)  # batch_size x max_len x d_model*3
        q, k, v = torch.chunk(qkv, chunks=3, dim=-1)
        q = q.view(batch_size, max_len, self.n_head, -1).transpose(1, 2)
        k = k.view(batch_size, max_len, self.n_head, -1).transpose(1, 2)
        v = v.view(batch_size, max_len, self.n_head, -1).transpose(1, 2)  # b x n x l x d

        rw_head_q = q + self.r_r_bias[:, None]
        # einsum makes these multi-dimensional matrix products very convenient
        AC = torch.einsum('bnqd,bnkd->bnqk', [rw_head_q, k])  # b x n x l x l, n is the number of heads
        D_ = torch.einsum('nd,ld->nl', self.r_w_bias, pos_embed)[None, :, None]  # head x 2max_len, each head's bias for each position
        B_ = torch.einsum('bnqd,ld->bnql', q, pos_embed)  # bsz x head x max_len x 2max_len, each query's bias for each shift
        E_ = torch.einsum('bnqd,ld->bnql', k, pos_embed)  # bsz x head x max_len x 2max_len, the key's bias for the relative positions
        BD = B_ + D_  # bsz x head x max_len x 2max_len, to be converted to bsz x head x max_len x max_len
        BDE = self._shift(BD) + self._transpose_shift(E_)
        attn = AC + BDE

        attn = attn / self.scale
        attn = attn.masked_fill(mask[:, None, None, :].eq(0), float('-inf'))
        attn = F.softmax(attn, dim=-1)
        attn = self.dropout_layer(attn)
        v = torch.matmul(attn, v).transpose(1, 2).reshape(batch_size, max_len, d_model)  # batch_size x max_len x d_model

        return v

Based on Eq.(17), we have
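The relation referenced here follows directly from the definition of R in Eq.(17) and is written out as a reconstruction:

    R_t    = \left[\ldots, \; \sin(c_i t), \; \cos(c_i t), \; \ldots\right]^\top
    R_{-t} = \left[\ldots, \; -\sin(c_i t), \; \cos(c_i t), \; \ldots\right]^\top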

because sin(−x) = −sin(x) and cos(x) = cos(−x). This means that for an offset t, the forward and backward relative positional encodings are the same with respect to the cos(c_i t) terms but opposite with respect to the sin(c_i t) terms. Therefore, by using R_{t−j}, the attention score can distinguish different directions and distances.

The above improvement is based on the work of Shaw et al. (2018) and Dai et al. (2019). Since NER datasets are usually small, we avoid the direct multiplication of two learnable matrices, because they can be represented by a single learnable matrix. Therefore we do not use W_k in Eq.(16). The multi-head version is the same as Eq.(6), but we discard W_o since it is directly multiplied by W_1 in Eq.(7).

3.2.2 Un-scaled Dot-Product Attention

The vanilla Transformer uses scaled dot-product attention to smooth the output of the softmax function: in Eq.(3), the dot product of the query and key matrices is divided by the scaling factor √d_k.

We empirically found that the model performs better without the scaling factor √d_k. We presume this is because, without the scaling factor, the attention is sharper, and sharper attention might be beneficial in the NER task since only a few words in the sentence are named entities.
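Concretely, the only difference is whether the attention logits are divided by √d_k before the softmax:

    \text{scaled: } \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \qquad \text{un-scaled: } \mathrm{softmax}\left(Q K^\top\right) V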

3.3 CRF Layer

Details:
In order to take advantage of the dependencies between different tags, a Conditional Random Field (CRF) layer is used in all of our models. Given a sequence s = [s_1, s_2, …, s_T], the corresponding gold label sequence is y = [y_1, y_2, …, y_T], and Y(s) represents all valid label sequences. The probability of y is calculated by the following equation,
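The equation itself is missing from the extracted text; the standard linear-chain CRF formulation implied by the description is:

    P(y \mid s) = \frac{\exp\left(\sum_{t=1}^{T} f(y_{t-1}, y_t, s)\right)}{\sum_{y' \in Y(s)} \exp\left(\sum_{t=1}^{T} f(y'_{t-1}, y'_t, s)\right)}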

where f(y_{t−1}, y_t, s) computes the transition score from y_{t−1} to y_t plus the score for y_t. The optimization target is to maximize P(y|s). When decoding, the Viterbi algorithm is used to find the label sequence with the maximum probability.
The Viterbi algorithm is a dynamic-programming algorithm for finding the most likely sequence of hidden states (the Viterbi path) that produces a sequence of observed events, especially in the context of Markov information sources and hidden Markov models.

As shown in the following code:

# Conditional Random Field. tag_vocab: the tag vocabulary (its length gives the number of tags);
# include_start_end_trans: whether to score each tag appearing at the start or end of a sequence.
self.crf = ConditionalRandomField(len(tag_vocab), include_start_end_trans=True, allowed_transitions=trans)

4 Experiment

4.1 Data


We evaluate our model on two English NER datasets and four Chinese NER datasets.
(1) CoNLL2003 is one of the most evaluated English NER datasets; it contains four different named entity types: PERSON, LOCATION, ORGANIZATION, and MISC (Sang and Meulder, 2003).
(2) OntoNotes 5.0 is an English NER dataset whose corpus comes from different domains, such as telephone conversations and newswire. We exclude the New Testament portion since it contains no named entities (Chen et al., 2019; Chiu and Nichols, 2016). This dataset has eleven entity types and seven value types, such as CARDINAL, MONEY, LOC.
(3) Weischedel (2011) released OntoNotes 4.0. In this paper, we use the Chinese part. We adopt the same pre-processing as Che et al. (2013).
(4) The corpus of the Chinese NER dataset MSRA comes from the news domain (Levow, 2006).
(5) Weibo NER was built from text on the Chinese social-media platform Sina Weibo (Peng and Dredze, 2015), and it contains 4 kinds of entities.
(6) Resume NER was annotated by Zhang and Yang (2018).

Their statistics are listed in Table 1. For all datasets, we replace all digits with "0" and use the BIOES tag scheme. For English, we use the GloVe 100d pre-trained embeddings (Pennington et al., 2014). For the character encoder, we use 30-dimensional randomly initialized character embeddings. More details on the models' hyper-parameters can be found in the supplementary material. For Chinese, we use the character embeddings and bigram embeddings released by Zhang and Yang (2018). All pre-trained embeddings are fine-tuned during training. In order to reduce the impact of randomness, we ran each of our experiments at least three times and report the average F1 score and its standard deviation.

We used random search to find the optimal hyper-parameters; the hyper-parameters and their ranges are listed in the supplementary material. We use SGD with momentum 0.9 to optimize the model. We run 100 epochs with a batch size of 16. During optimization, we use the triangular learning-rate schedule (Smith, 2017), where the learning rate rises to the pre-set value during the first 1% of the steps and decays to 0 over the remaining 99%. The model that achieves the highest development performance is used to evaluate the test set. The hyper-parameter search ranges and other settings can be found in the supplementary material. Code is available at https://github.com/fastnlp/TENER.

Conclusion

In this paper, we propose TENER, a model that adopts a Transformer encoder with specific customizations for the NER task. The Transformer encoder has a powerful ability to capture long-range context. In order to make the Transformer more suitable for the NER task, we introduce direction-aware, distance-aware and un-scaled attention. Experiments on two English NER tasks and four Chinese NER tasks show that the performance can be greatly improved. With the same pre-trained embeddings and external knowledge, our proposed modifications outperform previous models on the six datasets. Meanwhile, we also find that the adapted Transformer is suitable as an English character encoder, because it has the potential to extract intricate patterns from characters. Experiments on two English NER datasets show that the adapted Transformer character encoder performs better than the BiLSTM and CNN character encoders.

TENER Execution Flow

Overall Flow

Read the data with xxxNERPipe() -> build character embeddings with xxxCharEmbedding -> build word embeddings with StaticEmbedding -> combine the two with StackEmbedding

TENER Model Training Flow

TENER model -> input fully-connected layer -> Transformer encoder -> dropout layer -> output fully-connected layer -> log_softmax normalization -> return the loss

Transformer Encoder Network Structure

num_layers encoder layers (i.e., the following is repeated num_layers times):
update the positional encoding, using sinusoidal embeddings or position embeddings of a fixed maximum size -> multi-head attention / relative multi-head attention -> normalization layer -> feed-forward layer -> normalization layer

Relative multi-head attention:
relative sinusoidal position embedding -> attention computation -> softmax -> dropout layer

Feed-forward layer:
input fully-connected layer -> ReLU -> dropout layer -> output fully-connected layer -> dropout layer
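To make the layer structure above concrete, here is a minimal PyTorch sketch of one adapted encoder layer. It is not the repository's actual code; the class name, constructor arguments, and the injected rel_attn module (e.g. the relative multi-head attention shown earlier) are illustrative assumptions.

from torch import nn

class AdaptedEncoderLayer(nn.Module):
    """One encoder layer following the flow described above:
    (relative) multi-head attention -> LayerNorm -> feed-forward -> LayerNorm."""

    def __init__(self, d_model, dim_ffn, rel_attn, dropout=0.15):
        super().__init__()
        self.attn = rel_attn                   # multi-head / relative multi-head attention module
        self.norm1 = nn.LayerNorm(d_model)     # normalization after the attention sub-layer
        self.norm2 = nn.LayerNorm(d_model)     # normalization after the feed-forward sub-layer
        self.ffn = nn.Sequential(              # feed-forward layer from the flow above
            nn.Linear(d_model, dim_ffn),       # input fully-connected layer
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(dim_ffn, d_model),       # output fully-connected layer
            nn.Dropout(dropout),
        )

    def forward(self, x, mask):
        # residual connection + normalization around the attention sub-layer
        x = self.norm1(x + self.attn(x, mask))
        # residual connection + normalization around the feed-forward sub-layer
        x = self.norm2(x + self.ffn(x))
        return x

Stacking num_layers of these layers, with the positional information supplied inside the attention module, reproduces the encoder flow described above.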
