Table of Contents

  • Abstract
  • Introduction
  • Related Work
  • Methods
  • Experiments
    • Experimental Results

Abstract

Sentence matching is widely used in various natural language tasks such as natural language inference, paraphrase identification, and question answering. For these tasks, understanding the logical and semantic relationship between two sentences is required, but it remains challenging. Although the attention mechanism is useful for capturing the semantic relationship and properly aligning the elements of two sentences, previous attention-based methods simply use a summation operation, which does not sufficiently retain the original features. Inspired by DenseNet, a densely connected convolutional network, we propose a densely-connected co-attentive recurrent neural network, each layer of which uses concatenated information of attentive features as well as hidden features of all the preceding recurrent layers. It enables preserving the original and the co-attentive feature information from the bottommost word embedding layer to the uppermost recurrent layer. To alleviate the problem of an ever-increasing size of feature vectors due to dense concatenation operations, we also propose to use an autoencoder after dense concatenation. We evaluate our proposed architecture on highly competitive benchmark datasets related to sentence matching. Experimental results show that our architecture, which retains recurrent and attentive features, achieves state-of-the-art performance on most of the tasks.

Introduction

Semantic sentence matching, a fundamental technology in natural language processing, requires lexical and compositional semantics. In paraphrase identification, sentence matching is utilized to identify whether two sentences have identical meaning or not. In natural language inference, also known as recognizing textual entailment, it determines whether a hypothesis sentence can reasonably be inferred from a given premise sentence. In question answering, sentence matching is required to determine the degree of matching 1) between a query and a question for question retrieval, and 2) between a question and an answer for answer selection. However, identifying the logical and semantic relationship between two sentences is not trivial due to the problem of the semantic gap (Liu et al. 2016).

Recent advances in deep neural networks have enabled learning textual semantics for sentence matching. Large amounts of annotated data such as Quora (Csernai 2017), SNLI (Bowman et al. 2015), and MultiNLI (Williams, Nangia, and Bowman 2017) have also contributed significantly to learning semantics. In conventional approaches, a matching model can be trained in two different ways (Gong, Luo, and Zhang 2018). The first are sentence-encoding-based methods, where each sentence is encoded to a fixed-sized vector in a completely isolated manner and the two vectors for the corresponding sentences are used in predicting the degree of matching. The others are joint methods that utilize interactive features, such as attentive information between the sentences.

In the former paradigm, because the two sentences have no interaction, interactive information cannot be utilized during the encoding procedure. In our work, we adopted a joint method which enables capturing interactive information for performance improvements. Furthermore, we employ a substantially deeper recurrent network for sentence matching, like the deep neural machine translation (NMT) model (Wu et al. 2016). Deep recurrent models are more advantageous for learning long sequences and outperform shallower architectures. However, the attention mechanism is unstable in deeper models owing to the well-known vanishing gradient problem. Though GNMT (Wu et al. 2016) uses residual connections between recurrent layers to allow better information and gradient flow, there are some limitations. The recurrent hidden or attentive features are not preserved intact through residual connections because the summation operation may impede the information flow in deep networks.

Inspired by DenseNet (Huang et al. 2017), we propose a densely-connected recurrent network in which the recurrent hidden features are retained up to the uppermost layer. In addition, instead of the conventional summation operation, the concatenation operation is used in combination with the attention mechanism to better preserve co-attentive information. The proposed architecture shown in Figure 1 is called DRCN, which is an abbreviation for Densely-connected Recurrent and Co-attentive neural Network. The proposed DRCN can utilize the increased representational power of deeper recurrent networks and attentive information. Furthermore, to alleviate the problem of an ever-increasing feature vector size due to concatenation operations, we adopted an autoencoder and forwarded a fixed-length vector to the higher-layer recurrent module as shown in the figure. DRCN is, to the best of our knowledge, the first generalized version of DenseRNN, which is expandable to deeper layers with controllable feature sizes through the use of an autoencoder. We evaluate our model on three sentence matching tasks: natural language inference, paraphrase identification and answer sentence selection. Experimental results on five highly competitive benchmark datasets (SNLI, MultiNLI, QUORA, TrecQA and SelQA) show that our model significantly outperforms the current state-of-the-art results on most of the tasks.

Related Work

Earlier approaches to sentence matching mainly relied on conventional methods such as syntactic features, transformations, or relation extraction (Romano et al. 2006; Wang, Smith, and Mitamura 2007). These are restrictive in that they work only on very specific tasks.

The development of large-scale annotated datasets (Bowman et al. 2015; Williams, Nangia, and Bowman 2017) and deep learning algorithms has led to great progress in matching natural language sentences. Furthermore, well-established attention mechanisms provide richer information for sentence matching by capturing alignment and dependency relationships between two sentences. The release of the large-scale datasets has also encouraged the development of learning-centered approaches to semantic representation. The first type of these approaches is sentence-encoding-based methods (Conneau et al. 2017; Choi, Yoo, and goo Lee 2017; Nie and Bansal 2017; Shen et al. 2018), where sentences are encoded into their own sentence representations without any cross-interaction. Then, a classifier such as a neural network is applied to decide the relationship based on these independent sentence representations. These sentence-encoding-based methods make it simple to extract sentence representations and can be used for transfer learning to other natural language tasks (Conneau et al. 2017). On the other hand, the joint methods, which make up for the lack of interaction in the former methods, use cross-features as an attention mechanism to express word- or phrase-level alignments for performance improvements (Wang, Hamza, and Florian 2017; Chen et al. 2017b; Gong, Luo, and Zhang 2018; Yang et al. 2016).

Recently, architectural developments using deeper layers have led to further progress in performance. The residual connection is widely used to increase the depth of a network stably (He et al. 2016; Wu et al. 2016). More recently, Huang et al. (Huang et al. 2017) enabled features to be connected from lower to upper layers using the concatenation operation, without any loss of information on lower-layer features.

External resources are also used for sentence matching. Chen et al. (Chen et al. 2017a; Chen et al. 2017b) used syntactic parse trees or lexical databases like WordNet to measure the semantic relationship among words, and Pavlick et al. (Pavlick et al. 2015) added interpretable semantics to the paraphrase database.

Unlike these, in this paper, we do not use any such external resources. Our work belongs to the joint approaches, which use densely-connected recurrent and co-attentive information to enhance representation power for semantic sentence matching.

Methods

In this section, we describe our sentence matching architecture DRCN, which is composed of the following three components: (1) word representation layer, (2) attentively connected RNN, and (3) interaction and prediction layer. We denote two input sentences as $P = \{p_1, p_2, \dots, p_I\}$ and $Q = \{q_1, q_2, \dots, q_J\}$, where $p_i/q_j$ is the $i^{th}/j^{th}$ word of the sentence $P/Q$ and $I/J$ is the word length of $P/Q$. The overall architecture of the proposed DRCN is shown in Fig. 1.

Word Representation Layer
To construct the word representation layer, we concatenate the word embedding, the character representation and the exact match flag used in (Gong, Luo, and Zhang 2018).

In word embedding, each word is represented as a $d$-dimensional vector by using a pre-trained word embedding method such as GloVe (Pennington, Socher, and Manning 2014) or Word2vec (Mikolov et al. 2013). In our model, a word embedding vector can be either updated or fixed during training. The strategy of whether to make the pre-trained word embeddings trainable is heavily task-dependent. Trainable word embeddings capture the characteristics of the training data well but can result in overfitting. On the other hand, fixed (non-trainable) word embeddings lack flexibility on task-specific data, while being more robust to overfitting, especially for less frequent words. We use both the trainable embedding $e^{tr}_{p_i}$ and the fixed (non-trainable) embedding $e^{fix}_{p_i}$ to let them play complementary roles in enhancing the performance of our model. This technique of mixing trainable and non-trainable word embeddings is simple yet effective.

The character representation $c_{p_i}$ is calculated by feeding randomly initialized character embeddings into a convolutional neural network with a max-pooling operation. The character embeddings and convolutional weights are jointly learned during training.

Like (Gong, Luo, and Zhang 2018), the exact match flag $f_{p_i}$ is activated if the same word is found in the other sentence.

Our final word representational feature $p^w_i$ for the word $p_i$ is composed of four components as follows:
$$
\begin{aligned}
e^{tr}_{p_i} &= E^{tr}(p_i) \\
e^{fix}_{p_i} &= E^{fix}(p_i) \\
c_{p_i} &= \text{Char-Conv}(p_i) \\
p^w_i &= [e^{tr}_{p_i};\, e^{fix}_{p_i};\, c_{p_i};\, f_{p_i}]
\end{aligned} \tag{1}
$$

Here, $E^{tr}$ and $E^{fix}$ are the trainable and non-trainable (fixed) word embeddings respectively. Char-Conv is the character-level convolutional operation and $[\cdot\,;\cdot]$ is the concatenation operator. For each word in both sentences, the same procedure described above is used to extract word features.
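
As a concrete illustration of Eq. (1), the sketch below assembles the word representation in PyTorch; the module and parameter names, the 3-gram character convolution, and the tensor shapes are our own illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordRepresentation(nn.Module):
    """Builds p_i^w = [e^tr; e^fix; c; f] as in Eq. (1)."""
    def __init__(self, pretrained, char_vocab_size, char_dim=16, char_channels=32):
        super().__init__()
        # Trainable and frozen copies of the same pre-trained embeddings.
        self.emb_tr = nn.Embedding.from_pretrained(pretrained, freeze=False)
        self.emb_fix = nn.Embedding.from_pretrained(pretrained, freeze=True)
        # Randomly initialized character embeddings fed into a 1-D CNN.
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.char_conv = nn.Conv1d(char_dim, char_channels, kernel_size=3, padding=1)

    def forward(self, words, chars, exact_match):
        # words: (B, T)   chars: (B, T, C)   exact_match: (B, T) with values in {0, 1}
        e_tr, e_fix = self.emb_tr(words), self.emb_fix(words)
        B, T, C = chars.size()
        c = self.char_emb(chars.view(B * T, C)).transpose(1, 2)   # (B*T, char_dim, C)
        c = F.relu(self.char_conv(c)).max(dim=2).values           # max-pool over characters
        c = c.view(B, T, -1)                                      # (B, T, char_channels)
        f = exact_match.float().unsqueeze(-1)                     # exact match flag
        return torch.cat([e_tr, e_fix, c, f], dim=-1)             # p_i^w
```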

Densely connected Recurrent Networks
The ordinary stacked RNNs (recurrent neural networks) are composed of multiple RNN layers on top of each other, with the output sequence of the previous layer forming the input sequence for the next. More concretely, let $H_l$ be the $l^{th}$ RNN layer in a stacked RNN. Note that in our implementation, we employ the bidirectional LSTM (BiLSTM) as the base block of $H_l$. At time step $t$, an ordinary stacked RNN is expressed as follows:
$$
\begin{aligned}
h^l_t &= H_l(x^l_t, h^l_{t-1}) \\
x^l_t &= h^{l-1}_t
\end{aligned} \tag{2}
$$

While this architecture enables us to build up higher-level representations, deeper networks have difficulty in training due to the exploding or vanishing gradient problem.

To encourage gradients to flow in the backward pass, residual connections (He et al. 2016) are introduced, which bypass the non-linear transformations with an identity mapping. Incorporating this into Eq. (2), it becomes
$$
\begin{aligned}
h^l_t &= H_l(x^l_t, h^l_{t-1}) \\
x^l_t &= h^{l-1}_t + x^{l-1}_t
\end{aligned} \tag{3}
$$
However, the summation operation in the residual connection may impede the information flow in the network (Huang et al. 2017). Motivated by DenseNet (Huang et al. 2017), we employ direct connections using the concatenation operation from any layer to all subsequent layers, so that the features of previous layers are not modified but retained as they are, as depicted in Figure 1. The densely connected recurrent neural networks can be described as
$$
\begin{aligned}
h^l_t &= H_l(x^l_t, h^l_{t-1}) \\
x^l_t &= [h^{l-1}_t;\, x^{l-1}_t]
\end{aligned} \tag{4}
$$

The concatenation operation enables the hidden features to be preserved until they reach the uppermost layer, and all the previous features work for prediction as collective knowledge (Huang et al. 2017).
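
A minimal sketch of this densely connected recurrent stack (Eq. 4), assuming PyTorch BiLSTM blocks; the default layer count and hidden size are placeholders taken from the implementation details later in the paper.

```python
import torch
import torch.nn as nn

class DenseRNN(nn.Module):
    """Stacked BiLSTMs where each layer's input is the concatenation of the
    previous layer's output and the previous layer's input (Eq. 4)."""
    def __init__(self, input_size, hidden_size=100, num_layers=5):
        super().__init__()
        self.layers = nn.ModuleList()
        in_size = input_size
        for _ in range(num_layers):
            self.layers.append(nn.LSTM(in_size, hidden_size,
                                       batch_first=True, bidirectional=True))
            # Dense connection: the next layer's input grows by 2 * hidden_size.
            in_size += 2 * hidden_size

    def forward(self, x):
        # x: (B, T, input_size)
        for lstm in self.layers:
            h, _ = lstm(x)                   # h: (B, T, 2 * hidden_size)
            x = torch.cat([h, x], dim=-1)    # retain all preceding features intact
        return x
```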

Densely-connected Co-attentive networks
The attention mechanism, which has been highly successful in many domains (Wu et al. 2016; Vaswani et al. 2017), is a technique for learning effectively, where a context vector is matched conditioned on a specific sequence.

Given two sentences, a context vector is calculated based on an attention mechanism focusing on the relevant parts of the two sentences at each RNN layer. The calculated attentive information represents a soft alignment between the two sentences. In this work, we also use an attention mechanism. We incorporate co-attentive information into the densely connected recurrent features using the concatenation operation, so as not to lose any information (Fig. 1). These concatenated recurrent and co-attentive features, obtained by densely connecting the features from the lowermost to the uppermost layers, enrich the collective knowledge for lexical and compositional semantics.

The attentive information $a_{p_i}$ of the $i^{th}$ word $p_i \in P$ against the sentence $Q$ is calculated as a weighted sum of the $h_{q_j}$'s, which are weighted by the softmax weights as follows:
iii个单词pi∈Pp_i \in PpiP对句子QQQ的注意信息被计算为hqjh_{q_j}hqj的加权和,其由softmax权重加权如下:
$$
\begin{aligned}
a_{p_i} &= \sum^{J}_{j=1} a_{i,j}\, h_{q_j} \\
a_{i,j} &= \frac{\exp(e_{i,j})}{\sum^{J}_{k=1}\exp(e_{i,k})}
\end{aligned} \tag{5}
$$
Similar to the densely connected RNN hidden features, we concatenate the attentive context vector $a_{p_i}$ with the triggered vector $h_{p_i}$ so as to retain attentive information as an input to the next layer:
$$
\begin{aligned}
h^l_t &= H_l(x^l_t, h^l_{t-1}) \\
x^l_t &= [h^{l-1}_t;\, a^{l-1}_t;\, x^{l-1}_t]
\end{aligned} \tag{6}
$$
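
The following sketch computes the co-attentive context of Eq. (5) for both sentences; the alignment score $e_{i,j}$ is taken here to be the cosine similarity between hidden states, which is an illustrative assumption since the score function is not specified in this excerpt. The dense co-attentive connection of Eq. (6) then simply concatenates $a^{l-1}_t$ with $h^{l-1}_t$ and $x^{l-1}_t$ before the next recurrent layer.

```python
import torch
import torch.nn.functional as F

def co_attention(h_p, h_q):
    """Attentive context a_{p_i} = sum_j softmax_j(e_{i,j}) * h_{q_j}   (Eq. 5).
    h_p: (B, I, D), h_q: (B, J, D); e_{i,j} is assumed to be cosine similarity."""
    p_norm = F.normalize(h_p, p=2, dim=-1)
    q_norm = F.normalize(h_q, p=2, dim=-1)
    e = torch.bmm(p_norm, q_norm.transpose(1, 2))               # (B, I, J) alignment scores
    a_p = torch.bmm(F.softmax(e, dim=-1), h_q)                  # context for each word of P
    a_q = torch.bmm(F.softmax(e.transpose(1, 2), dim=-1), h_p)  # and symmetrically for Q
    return a_p, a_q
```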

Bottleneck component
Our network uses all layers' outputs as a community of semantic knowledge. However, the input features of this network keep growing as layers get deeper, and it has a large number of parameters, especially in the fully-connected layer. To address this issue, we employ an autoencoder as a bottleneck component. An autoencoder is a compression technique that reduces the number of features while retaining the original information, which serves as distilled semantic knowledge in our model. Furthermore, this component improved the test performance by acting as a regularizer in our experiments.
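
A minimal sketch of such a bottleneck autoencoder; the 200-unit code size follows the implementation details below, while the single-linear-layer encoder/decoder and the MSE reconstruction loss are our own assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Compresses the ever-growing concatenated features to a fixed size;
    the reconstruction loss is added to the matching loss during training."""
    def __init__(self, in_features, code_size=200):
        super().__init__()
        self.encoder = nn.Linear(in_features, code_size)
        self.decoder = nn.Linear(code_size, in_features)

    def forward(self, x):
        code = self.encoder(x)                          # forwarded to the next recurrent block
        recon_loss = F.mse_loss(self.decoder(code), x)  # keeps the code informative
        return code, recon_loss
```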

Interaction and Prediction Layer
To extract a proper representation for each sentence, we apply the step-wise max-pooling operation over the densely connected recurrent and co-attentive features (pooling in Fig. 1). More specifically, if the output of the final RNN layer is a 100d vector for a sentence with 30 words, a 30 × 100 matrix is obtained, which is max-pooled column-wise such that the size of the resultant vector $p$ or $q$ is 100. Then, we aggregate these representations $p$ and $q$ for the two sentences $P$ and $Q$ in various ways in the interaction layer, and the final feature vector $v$ for semantic sentence matching is obtained as follows:
$$
v = [p;\, q;\, p+q;\, p-q;\, |p-q|] \tag{7}
$$

Here, the operations $+$, $-$ and $|\cdot|$ are performed element-wise to infer the relationship between two sentences. The element-wise subtraction $p - q$ is an asymmetric operator for one-way type tasks such as natural language inference or answer sentence selection. Finally, based on the previously aggregated features $v$, we use two fully-connected layers with ReLU activation followed by one fully-connected output layer. Then, the softmax function is applied to obtain a probability distribution over the classes. The model is trained end-to-end by minimizing the multi-class cross entropy loss and the reconstruction loss of the autoencoders.
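
A sketch of the pooling, the interaction features of Eq. (7), and the classifier head; the hidden size follows the implementation details below, while the three-class output (e.g. for NLI) and the module layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InteractionClassifier(nn.Module):
    def __init__(self, feat_size, hidden=1000, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(5 * feat_size, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes))

    def forward(self, feats_p, feats_q):
        # feats_*: (B, T, feat_size) densely connected recurrent/co-attentive features
        p = feats_p.max(dim=1).values          # step-wise (column-wise) max pooling
        q = feats_q.max(dim=1).values
        v = torch.cat([p, q, p + q, p - q, (p - q).abs()], dim=-1)   # Eq. (7)
        return self.mlp(v)                     # logits; softmax + cross entropy at training time
```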

Experiments

We evaluate our matching model on five popular and well-studied benchmark datasets for three challenging sentence matching tasks: (i) SNLI and MultiNLI for natural language inference; (ii) Quora Question Pair for paraphrase identification; and (iii) TrecQA and SelQA for answer sentence selection in question answering. Additional details about the above datasets can be found in the supplementary materials.

Implementation Details
We initialized word embedding with 300d GloVe vectors pre-trained from the 840B Common Crawl corpus (Pennington, Socher, and Manning 2014), while the word embeddings for out-of-vocabulary words were initialized randomly. We also randomly initialized character embedding with a 16d vector and extracted 32d character representation with a convolutional network. For the densely-connected recurrent layers, we stacked 5 layers, each of which has 100 hidden units. We set 1000 hidden units for the fully-connected layers. Dropout was applied after the word and character embedding layers with a keep rate of 0.5. It was also applied before the fully-connected layers with a keep rate of 0.8. For the bottleneck component, we set 200 hidden units as the encoded features of the autoencoder with a dropout rate of 0.2. Batch normalization was applied on the fully-connected layers, only for the one-way type datasets. The RMSProp optimizer with an initial learning rate of 0.001 was applied. The learning rate was decreased by a factor of 0.85 when the dev accuracy did not improve. All weights except embedding matrices are constrained by L2 regularization with a regularization constant $\lambda = 10^{-6}$. The maximum sequence length differs for each dataset: 35 for SNLI, 55 for MultiNLI, 25 for the Quora question pair and 50 for TrecQA. The learning parameters were selected based on the best performance on the dev set. We employed 8 different randomly initialized sets of parameters with the same model for our ensemble approach.
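
For reference, the hyperparameters reported in this paragraph can be collected into a single configuration; the values are copied from the text above and the field names are our own.

```python
# Hyperparameters as reported above; field names are illustrative.
DRCN_CONFIG = dict(
    word_dim=300,              # 840B GloVe vectors
    char_emb_dim=16,
    char_repr_dim=32,
    rnn_layers=5,
    rnn_hidden=100,
    fc_hidden=1000,
    emb_keep_rate=0.5,         # dropout after word/char embedding layers
    fc_keep_rate=0.8,          # dropout before fully-connected layers
    ae_code_size=200,
    ae_dropout=0.2,
    optimizer="RMSProp",
    learning_rate=1e-3,
    lr_decay=0.85,             # applied when dev accuracy stops improving
    l2_lambda=1e-6,            # on all weights except embedding matrices
    max_len={"SNLI": 35, "MultiNLI": 55, "Quora": 25, "TrecQA": 50},
    ensemble_size=8,
)
```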

Experimental Results

SNLI and MultiNLI

We evaluated our model on the natural language inference task over the SNLI and MultiNLI datasets. Table 2 shows the results of our model on the SNLI dataset along with other published models. Among them, ESIM+ELMo and LM-Transformer are the current state-of-the-art models. However, they use additional contextualized word representations from language models as external knowledge. The proposed DRCN obtains an accuracy of 88.9%, which is a competitive score, although we do not use any external knowledge like ESIM+ELMo and LM-Transformer do. The ensemble model achieves an accuracy of 90.1%, which sets a new state-of-the-art performance. Our ensemble model with 53m parameters (6.7m×8) outperforms the LM-Transformer, which has 85m parameters. Furthermore, in the case of the encoding-based method, we obtain the best performance of 86.5% without the co-attention and exact match flag.

Table 3 shows the results on the MATCHED and MISMATCHED problems of the MultiNLI dataset. Our plain DRCN achieves competitive performance without any contextualized knowledge. Moreover, by combining DRCN with ELMo, one of the contextualized embeddings from language models, our model outperforms the LM-Transformer, which has 85m parameters, while using fewer parameters (61m). From this point of view, combining our model with contextualized knowledge is a good option to enhance performance.

Quora Question Pair
Table 4 shows our results on the Quora question pair dataset. BiMPM, which uses a multi-perspective matching technique between two sentences, reports the baseline performance of an L.D.C. network and basic multi-perspective models (Wang, Hamza, and Florian 2017). We obtained accuracies of 90.15% and 91.30% with the single and ensemble methods, respectively, surpassing the previous state-of-the-art model, DIIN.

TrecQA and SelQA
Table 5 shows the performance of different models on the TrecQA and SelQA datasets for the answer sentence selection task, which aims to select a set of candidate answer sentences given a question. Most competitive models (Shen, Yang, and Deng 2017; Bian et al. 2017; Wang, Hamza, and Florian 2017; Shen et al. 2017) also use attention methods for word alignment between the question and candidate answer sentences. However, the proposed DRCN, using collective attention over multiple layers, achieves new state-of-the-art performance, significantly exceeding the previous best results on both datasets.
