
Introduction

In parallel, the concept of “attention” has gained popularity recently in training neural networks, allowing models to learn alignments between different modalities, e.g., between image objects and agent actions in the dynamic control problem (Mnih et al., 2014), between speech frames and text in the speech recognition task, or between visual features of a picture and its text description in the image caption generation task (Xu et al., 2015). In the context of NMT, Bahdanau et al. (2015) have successfully applied such an attentional mechanism to jointly translate and align words.


In this work, we design, with simplicity and effectiveness in mind, two novel types of attention-based models: a global approach in which all source words are attended and a local one whereby only a subset of source words are considered at a time.


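  • Global attention: all source words are attended to. This is similar to the original attention model, but simpler.
  • Local attention: only a subset of source words is attended to. This is the more novel of the two; the authors regard it as a blend of ***soft and hard attention***.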

Attention-based Models

While these models differ in how the context vector c_t is derived, they share the same subsequent steps.
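The shared steps can be sketched as follows. This is a minimal illustration, not the authors' code: as I recall from the paper, the context vector c_t and the decoder state h_t are concatenated and projected into an attentional hidden state h̃_t = tanh(W_c [c_t; h_t]), which is then passed through a softmax layer. The names W_c, W_s and the helper functions are assumptions introduced for this sketch and do not appear in the notes above.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def attentional_state(W_c, c_t, h_t):
    """h~_t = tanh(W_c [c_t; h_t]): concatenate, project, squash."""
    return [math.tanh(v) for v in matvec(W_c, c_t + h_t)]

def predictive_distribution(W_s, h_tilde):
    """softmax(W_s h~_t): a distribution over a toy vocabulary."""
    return softmax(matvec(W_s, h_tilde))
```

Whichever attention variant produces c_t, only the first argument to `attentional_state` changes; everything downstream stays the same.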


Global Attention
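A rough sketch of the global variant, assuming the simplest (dot-product) scoring function: every source hidden state gets a score against the current decoder state, the scores are normalized with a softmax, and the context vector is the resulting weighted average. The function names are hypothetical, and the paper also describes other scoring functions that are not shown here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def global_attention(h_t, source_states):
    """Attend over ALL source states; returns (a_t, c_t)."""
    # One score per source position (dot-product scoring is an assumption).
    scores = [dot(h_t, h_s) for h_s in source_states]
    a_t = softmax(scores)  # length equals the source sentence length
    dim = len(h_t)
    # Context vector: weighted average of the source hidden states.
    c_t = [sum(a * h_s[i] for a, h_s in zip(a_t, source_states))
           for i in range(dim)]
    return a_t, c_t
```

Note that the length of a_t varies with the source sentence, which is the cost the local variant tries to avoid.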


Local Attention

Our local attention mechanism selectively focuses on a small window of context and is differentiable. This approach has an advantage of avoiding the expensive computation incurred in the soft attention and at the same time, is easier to train than the hard attention approach. In concrete details, the model first generates an aligned position p_t for each target word at time t. The context vector c_t is then derived as a weighted average over the set of source hidden states within the window [p_t − D, p_t + D]; D is empirically selected. Unlike the global approach, the local alignment vector a_t is now fixed-dimensional, i.e., a_t ∈ R^(2D+1).
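To make the window concrete, here is a minimal sketch that attends only to positions p_t − D through p_t + D. Two things are assumptions of this sketch rather than statements from the notes: dot-product scoring, and clipping the window at the sentence boundaries (so the alignment vector is shorter than 2D+1 near the edges). The function names are hypothetical, and p_t is taken to be an integer position here.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def local_window(p_t, D, src_len):
    """Inclusive bounds of [p_t - D, p_t + D], clipped to the sentence."""
    lo = max(0, p_t - D)
    hi = min(src_len - 1, p_t + D)
    return lo, hi

def local_attention(h_t, source_states, p_t, D):
    """Attend only inside the window; returns ((lo, hi), a_t, c_t)."""
    lo, hi = local_window(p_t, D, len(source_states))
    window = source_states[lo:hi + 1]
    a_t = softmax([dot(h_t, h_s) for h_s in window])
    dim = len(h_t)
    c_t = [sum(a * h_s[i] for a, h_s in zip(a_t, window))
           for i in range(dim)]
    return (lo, hi), a_t, c_t
```

With a source sentence of 10 states, p_t = 5 and D = 2, the window covers positions 3 to 7 and a_t has 2D+1 = 5 entries, regardless of how long the sentence is.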



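  • Naive assumption: the t-th word of the translation is aligned with the t-th word of the source (clearly, this assumption is even less reasonable).
  • Predictive alignment: build a model that predicts the aligned position (method below; this does not strike me as very reliable either).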
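The two ways of choosing p_t can be sketched as below. The predictive formula is written from my recollection of the paper, p_t = S · sigmoid(v_p^T tanh(W_p h_t)) with S the source length, followed by a Gaussian centered on p_t with standard deviation σ = D/2 that favors positions near the predicted one. The names W_p, v_p and the function names are assumptions of this sketch and are not given in the notes.

```python
import math

def monotonic_position(t):
    """Naive assumption: target step t aligns with source position t."""
    return t

def predicted_position(h_t, W_p, v_p, S):
    """p_t = S * sigmoid(v_p^T tanh(W_p h_t)), a real number in (0, S)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_t))) for row in W_p]
    z = sum(v * h for v, h in zip(v_p, hidden))
    return S / (1.0 + math.exp(-z))

def gaussian_weight(s, p_t, D):
    """Down-weight source position s by its distance from p_t (sigma = D/2)."""
    sigma = D / 2.0
    return math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
```

In the predictive case, each alignment weight inside the window would be multiplied by `gaussian_weight(s, p_t, D)`. Because p_t is a smooth function of h_t, the position predictor can be trained with ordinary backpropagation.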


Alignment Coverage: Input-feeding Approach

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. Whereas, in standard MT, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMTs, alignment decisions should be made jointly taking into account past alignment information. To address that, we propose an input-feeding approach in which attentional vectors h̃_t are concatenated with inputs at the next time steps as illustrated in Figure 4. The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices and (b) we create a very deep network spanning both horizontally and vertically.

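  • The authors note here that traditional machine translation maintains a ***coverage set*** that tells the model which source words have already been translated (as a latecomer to the field, I had certainly never heard of this).
  • They therefore want the attention model to also know, when aligning, which words have already been aligned. To that end they propose a dedicated input scheme: when the decoder computes the hidden state for the next time step, the previous step's attentional vector h̃_t, which carries the alignment information, is concatenated with the regular input and fed in alongside the previous hidden state.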
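The input-feeding loop can be sketched as follows. This is a toy illustration under several assumptions: a plain tanh recurrent cell stands in for the decoder, `attend` is a caller-supplied function that returns a context vector, and the initial attentional vector is all zeros. The names `rnn_step`, `decode_with_input_feeding`, W_x, W_h and W_c are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rnn_step(W_x, W_h, x, h_prev):
    """Toy recurrent cell: h = tanh(W_x x + W_h h_prev)."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]

def decode_with_input_feeding(embeddings, attend, W_x, W_h, W_c, hidden_size):
    """Run the decoder, feeding h~_{t-1} back in as part of the input at t."""
    h = [0.0] * hidden_size
    h_tilde = [0.0] * hidden_size  # assumption: zero vector before step 1
    outputs = []
    for emb in embeddings:
        x = emb + h_tilde              # input feeding: [embedding; h~_{t-1}]
        h = rnn_step(W_x, W_h, x, h)   # new decoder hidden state
        c = attend(h)                  # context vector from any attention variant
        h_tilde = [math.tanh(v) for v in matvec(W_c, c + h)]
        outputs.append(h_tilde)
    return outputs
```

Because h̃_{t-1} depends on the previous alignment weights, concatenating it with the input gives the recurrent cell a direct view of what was attended to at the previous step.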



Analysis



In this paper, we propose two simple and effective attentional mechanisms for neural machine translation: the global approach which always looks at all source positions and the local one that only attends to a subset of source positions at a time.


