We will now implement the model from Neural Machine Translation by Jointly Learning to Align and Translate.

I. The General Encoder-Decoder Model

    As a reminder, the figure below shows the general encoder-decoder model:

II. The Previous Model

    In the previous model, the architecture explicitly passed the context vector, $z$, to the decoder at every time-step (the pink block in the figure above). Then the context vector $z$, the embedded input word $d(y_t)$ (the brown blocks in the figure represent the embedding operation) and the hidden state $s_t$ were passed to a linear layer (the purple block in the figure) to make the prediction.

III. Drawbacks of the Previous Model

    Even though we reduced some of this compression, the context vector $z$ still has to carry all of the information about the source sentence in a single vector. In other words, in the previous model the context vector is a bottleneck: in practice it cannot faithfully hold everything about the source sentence.

IV. How Do We Improve on the Previous Model?

    The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses attention.

    The attention mechanism works by first computing an attention vector, $a$, whose length equals the length of the source sentence. The attention vector has the property that each element is between 0 and 1 and the whole vector sums to 1. We then compute a weighted sum of the source sentence hidden states, $H$, to get a weighted source vector, $w$.
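To make this concrete, with the notation used later in this post ($H=\{h_1,...,h_T\}$ for the encoder hidden states), the weighted source vector at decoding step $t$ is:

$$w_t = \sum_{i=1}^{T} a_{t,i}\,h_i, \qquad 0 \le a_{t,i} \le 1, \qquad \sum_{i=1}^{T} a_{t,i} = 1$$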

    We compute a new weighted source vector at every decoding time-step and use it as input to the decoder RNN as well as to the linear layer that makes the prediction. This tutorial explains how to do all of this.

V. Preparing the Data

1. Required libraries
'''
refer: https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import spacy
import numpy as np
import random
import math
import time
from torchtext.data import Field,BucketIterator
from torchtext.datasets import Multi30k
2. Set the random seeds for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
3. Load the German and English spaCy models

Note that de is shorthand for de_core_news_sm and en is shorthand for en_core_web_sm. For example, after running:

python3 -m spacy download de

    a file such as de_core_news_sm-2.3.0.tar.gz may be downloaded. Extract it, enter the directory and run:

sudo python3 setup.py install

As shown in the screenshot below:

Then run the pip3 list | grep de command:

pip3 list | grep de

The output looks like this:

We can see the package name: de-core-news-sm.
However, if we import it directly in python3 under that name, we get an error:

    This is because the module name is written in the wrong format in the import statement; changing it to import de_core_news_sm fixes it. In other words, the hyphens become underscores.

The import then succeeds without errors.
The en model is handled the same way, so we won't repeat the steps here.
Once the steps above are complete, we can load both models and use them to process the data.
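As a minimal illustration of the hyphen-versus-underscore point (a sketch assuming the manual installation described above; details may vary with the model package version):

# pip3 list shows the package as "de-core-news-sm",
# but the importable module name uses underscores:
import de_core_news_sm

nlp_de = de_core_news_sm.load()   # equivalent to spacy.load('de_core_news_sm')
print(nlp_de.meta['lang'])        # 'de'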

#! python3 -m spacy download de
#! python3 -m spacy download en
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
4. Create the tokenizers
def tokenize_de(text):
    # Tokenizes German text from a string into a list of strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Tokenizes English text from a string into a list of strings
    return [tok.text for tok in spacy_en.tokenizer(text)]
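A quick sanity check of the tokenizers (the example sentences are made up here, and the exact token lists depend on the spaCy model version):

print(tokenize_de("Zwei junge Männer spielen Fußball."))
# expected along the lines of: ['Zwei', 'junge', 'Männer', 'spielen', 'Fußball', '.']
print(tokenize_en("Two young men are playing football."))
# expected along the lines of: ['Two', 'young', 'men', 'are', 'playing', 'football', '.']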
The fields are the same as before.
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token='<eos>',lower = True)
TRG = Field(tokenize = tokenize_en,init_token='<sos>',eos_token='<eos>',lower=True)

/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py:150: UserWarning: Field class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

5. Load the data
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))

/usr/local/lib/python3.6/dist-packages/torchtext/data/example.py:78: UserWarning: Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.', UserWarning)

Let's print some information to inspect the data:
print(type(train_data))
print(len(train_data))
print(train_data.shape) # Dataset has no real shape attribute; torchtext's __getattr__ returns a generator here (see the output below)
print(len(valid_data))
print(len(test_data))
print(test_data[0])
print(type(test_data[0]))
print(test_data[0])

<class 'torchtext.datasets.translation.Multi30k'>
29000
<generator object Dataset.__getattr__ at 0x7f86b9858f68>
1014
1000
<torchtext.data.example.Example object at 0x7f86b9bfdb70>
<class 'torchtext.data.example.Example'>
<torchtext.data.example.Example object at 0x7f86b9bfdb70>

6. Build the vocabulary
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
print(len(SRC.vocab))
print(len(TRG.vocab))
print(SRC.vocab[1000])
print(type(SRC.vocab))
print(type(SRC.vocab[3]))

7854
5893
0
<class 'torchtext.vocab.Vocab'>
<class 'int'>
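The 0 printed for print(SRC.vocab[1000]) deserves a note: in legacy torchtext, Vocab.__getitem__ looks its argument up in the string-to-index dictionary stoi, so the integer 1000 is simply treated as an unknown token and falls back to index 0, i.e. <unk>. A small sketch of the more usual lookups (the word 'hund' is just an illustrative guess at a vocabulary entry):

print(SRC.vocab.itos[:4])       # ['<unk>', '<pad>', '<sos>', '<eos>']
print(SRC.vocab.stoi['<pad>'])  # 1 -- which is why the padded batches later are full of 1s
print(SRC.vocab.stoi['hund'])   # index of a real German token, if it occurred at least min_freq times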

7. Define the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda

8. Create the iterators
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), batch_size = BATCH_SIZE,device = device)

/usr/local/lib/python3.6/dist-packages/torchtext/data/iterator.py:48: UserWarning: BucketIterator class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

print(train_iterator)
print(len(train_iterator))
for i in train_iterator:
    print(i)

<torchtext.data.iterator.BucketIterator object at 0x7f86b9812cf8>
227
/usr/local/lib/python3.6/dist-packages/torchtext/data/batch.py:23: UserWarning: Batch class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

[torchtext.data.batch.Batch of size 128 from MULTI30K]
[.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
[.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]

VI. Building the Seq2Seq Model

1. Encoder

    First, we build the encoder. Similar to the previous model, we only use a single-layer GRU, but we now use a bidirectional RNN. With a bidirectional RNN, we have two RNNs in each layer. A forward RNN goes over the embedded sentence from left to right (shown in green in the figure below), and a backward RNN goes over the embedded sentence from right to left (teal).
    All we need to do in code is set bidirectional = True and then pass the embedded sentence to the RNN as before.

We now have, for the forward and backward RNNs respectively:

$$h_t^{\rightarrow} = \text{EncoderGRU}^{\rightarrow}(e(x_t^{\rightarrow}), h_{t-1}^{\rightarrow})$$

$$h_t^{\leftarrow} = \text{EncoderGRU}^{\leftarrow}(e(x_t^{\leftarrow}), h_{t-1}^{\leftarrow})$$

where $x^{\rightarrow}$ is the source sentence read left-to-right and $x^{\leftarrow}$ is the same sentence read right-to-left.


    As before, we only pass the embedded input to the RNN; PyTorch then initializes the forward and backward initial hidden states ($h_0^{\rightarrow}$ and $h_0^{\leftarrow}$, respectively) to tensors of all zeros. We also get two context vectors: one from the forward RNN after it has seen the last word in the sentence, $z^{\rightarrow}=h_T^{\rightarrow}$, and one from the backward RNN after it has seen the first word in the sentence, $z^{\leftarrow}=h_T^{\leftarrow}$.

The RNN returns outputs and hidden. For the GRU's parameters and return values, see: https://blog.csdn.net/jiuweideqixu/article/details/109492863

    outputs has size [src_len, batch_size, hid_dim * num_directions], where num_directions is 2 for a bidirectional GRU. The first hid_dim elements of the third axis are the hidden states from the top layer of the forward RNN, and the last hid_dim elements are the hidden states from the top layer of the backward RNN. We can therefore view the third axis as the forward and backward hidden states concatenated together, e.g. $h_1=[h_1^{\rightarrow};h_T^{\leftarrow}]$, $h_2=[h_2^{\rightarrow};h_{T-1}^{\leftarrow}]$, and we denote all of the encoder hidden states as $H=\{h_1,h_2,...,h_T\}$.

    hidden has size [n_layers * num_directions, batch_size, hid_dim], where [-2, :, :] is the top-layer hidden state of the forward RNN after the final time-step (i.e. after it has seen the last word in the sentence), and [-1, :, :] is the top-layer hidden state of the backward RNN after its final time-step (i.e. after it has seen the first word in the sentence).
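As a quick sanity check of these two layouts (a standalone sketch with made-up sizes, separate from the model code), for a single-layer bidirectional GRU the final forward state sits at the last time-step of outputs and the final backward state at the first time-step:

rnn = nn.GRU(input_size=4, hidden_size=3, bidirectional=True)
x = torch.randn(5, 2, 4)            # [src_len=5, batch_size=2, emb_dim=4]
outputs, hidden = rnn(x)            # outputs: [5, 2, 6], hidden: [2, 2, 3]
# forward RNN: final hidden state == last time-step, first hid_dim channels
assert torch.allclose(outputs[-1, :, :3], hidden[-2])
# backward RNN: final hidden state == first time-step, last hid_dim channels
assert torch.allclose(outputs[0, :, 3:], hidden[-1])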

    The decoder also uses an RNN, but its GRU is unidirectional, so it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$. The encoder, however, gives us two final hidden states, a forward one and a backward one ($z^{\rightarrow}=h_T^{\rightarrow}$ and $z^{\leftarrow}=h_T^{\leftarrow}$). We solve this by concatenating the two, passing them through a linear layer, $g$, and applying the $\tanh$ activation function:

$$z = \tanh(g(h_T^{\rightarrow}, h_T^{\leftarrow}))$$

This $z$ is the initial decoder hidden state that the encoder returns.

    Note: this is actually a deviation from the paper. In the paper, only the backward RNN's hidden state (for the first source token) is passed through the linear layer $g$ to obtain the context vector / initial decoder hidden state.

    Because we want the model to look back over the whole source sentence, we return outputs, which stores the forward and backward hidden states for every token in the source sentence. We also return hidden, which acts as the initial hidden state of the decoder.

    The encoder code is as follows:

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim) # input_dim is the number of distinct words (source vocabulary size)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        print('【Encoder中打印的内容】------------------------------------------')
        '''
        src = [src_len, batch_size] # src_len is the number of tokens in the sentence
        '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        print('src.shape:',src.shape) # this prints the shape after the transpose
        # after embedding and dropout the size is [batch_size, src_len, emb_dim]; we then transpose the first two dimensions
        # embedded = [src_len, batch_size, emb_dim]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1)
        print(type(embedded))
        print('embbedded.shape:',embedded.shape) # embbedded.shape: torch.Size([3, 2, 256])
        # enc_output = [src_len, batch_size, hid_dim * num_directions]; num_directions is 2 for a bidirectional RNN, otherwise 1
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to all zeros
        print('enc_output.shape:',enc_output.shape) # enc_output.shape: torch.Size([3, 2, 1024])
        print('enc_hidden.shape:',enc_hidden.shape) # enc_hidden.shape: torch.Size([2, 2, 512])
        print('enc_hidden[0].shape: ',enc_hidden[0].shape)
        print('enc_hidden[1].shape: ',enc_hidden[1].shape)
        cc = torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1) # this turns the 3-D tensor into a 2-D one
        print('torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: ',cc.shape) # torch.Size([2, 1024])
        #print(enc_hidden[2].shape)
        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output are always from the last layer
        # enc_hidden [-2, :, : ] is the last of the forwards RNN
        # enc_hidden [-1, :, : ] is the last of the backwards RNN
        # initial decoder hidden is final hidden state of the forwards and backwards
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(cc)) # effectively [2 x 1024] times an [enc_hid_dim * 2, dec_hid_dim] weight, so s ends up with shape [2, 512]
        print('s.shape:',s.shape) # s.shape: torch.Size([2, 512])
        print('【Encoder中forward函数执行完毕】!------------------------')
        return enc_output, s
A quick test of the Encoder:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
enc = Encoder(INPUT_DIM,ENC_EMB_DIM,ENC_HID_DIM,DEC_HID_DIM,ENC_DROPOUT)
enc.forward(torch.LongTensor([[0,2],[3,5],[8,9]]))

【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([2, 3])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([3, 2, 256])
enc_output.shape: torch.Size([3, 2, 1024])
enc_hidden.shape: torch.Size([2, 2, 512])
enc_hidden[0].shape: torch.Size([2, 512])
enc_hidden[1].shape: torch.Size([2, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([2, 1024])
s.shape: torch.Size([2, 512])
【Encoder中forward函数执行完毕】!------------------------
(tensor([[[ 0.3615, 0.2286, 0.3451, …, -0.2103, -0.1843, 0.0026],
[-0.2185, 0.2574, -0.3948, …, -0.5293, -0.3931, -0.1950]],

[[ 0.2123, 0.1713, 0.2521, …, -0.1816, -0.0419, 0.0432],
[-0.5082, -0.0661, 0.1635, …, 0.1175, 0.0220, -0.0448]],

[[ 0.1594, 0.0015, 0.4782, …, -0.1525, 0.0692, 0.1618],
[-0.2713, -0.2825, -0.4277, …, 0.1342, 0.2105, 0.3363]]],
grad_fn=),
tensor([[-0.0471, -0.1929, 0.0166, …, -0.2596, -0.0213, -0.1967],
[-0.1615, 0.0246, -0.2823, …, 0.0091, 0.2008, -0.0206]],
grad_fn=))

2. Attention

To aid understanding, the figure below illustrates the attention computation for the first decoding step, where $s_{t-1}=s_0=z$. The green blocks represent all of the hidden states of the forward and backward encoder RNNs, and the pink block is where the attention computation happens.

    In the figure above, the pink box in the top-right corner is our attention module; as the arrows show, it takes two inputs, the encoder's output and hidden. The four square boxes in the middle row are the GRU units of the encoder module described above.

    As we can see, attention takes the hidden state returned by the encoder, which we denote $s_{t-1}$, together with the encoder's outputs, denoted $H$. The attention layer outputs an attention vector, $a_t$; as the top right of the figure shows, each element of $a_t$ is between 0 and 1 and the elements sum to 1. To predict the next word more accurately, $a_t$ tells us which words in the source sentence should receive the most attention.

    First, we use the $s_{t-1}$ and $H$ just mentioned to compute the energy. If you know the GRU well and look only at the figure above, the $z$ and $h_4$ marked there are tensors, and these two tensors are identical. In the attention module we effectively concatenate output and hidden. However, although $s_{t-1}$ (the $z$ in the figure) and $h_{t-1}$ (the $h_4$ in the figure) have the same dimensionality, $s_{t-1}$ and the output returned by the encoder do not, so the attention module repeats $s_{t-1}$ src_len times before concatenating (see the code).

$$E_t = \tanh(\text{attn}(\text{repeat}(s_{t-1}), H))$$

    This can be interpreted as measuring how well each encoder hidden state stored in enc_output matches the previous decoder hidden state $s_{t-1}$.

    In the code, $E_t$ has shape [batch_size, src_len, dec_hid_dim]. We then apply a linear transformation to $E_t$: multiplying it by a [dec_hid_dim, 1] matrix $v$:

$$\hat{a}_t = v E_t$$

    We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. $v$ is initialized randomly but learned, along with the rest of the model, via backpropagation. Note that $v$ is not dependent on time: the same $v$ is used for every time-step of decoding. We implement $v$ as a linear layer without a bias.

    Finally, to ensure that all elements of the attention vector are between 0 and 1 and that they sum to 1, we apply the softmax function:

$$a_t = \text{softmax}(\hat{a}_t)$$

    This gives us the attention over the entire source sentence.

    The attention code is as follows:

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False) # linear transformation
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

    def forward(self, s, enc_output):
        print('【Attention中forward打印的信息】------------------------------')
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]
        # repeat decoder hidden state src_len times
        s = s.unsqueeze(1).repeat(1, src_len, 1) # s = [batch_size, src_len, dec_hid_dim]
        enc_output = enc_output.transpose(0, 1) # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        print('s.shape:',s.shape)
        print('enc_output.shape:',enc_output.shape)
        # cc = [batch_size, src_len, dec_hid_dim + enc_hid_dim * 2] (= [batch_size, src_len, 3 * dec_hid_dim] here, since enc_hid_dim == dec_hid_dim)
        cc = torch.cat((s,enc_output),dim=2)
        print('torch.cat((s,enc_output),dim=2).shape:',cc.shape)
        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(cc)) # the attn layer is just a linear transformation
        print('energy.shape:',energy.shape)
        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2) # after the linear layer, squeeze(2) removes the third dimension of size 1
        print('attention.shape:',attention.shape)
        atten_output = F.softmax(attention, dim=1)
        print('atten_output.shape:',atten_output.shape)
        print('【Attention中forward方法执行完毕】!-------------------------------')
        return F.softmax(attention, dim=1) # [batch_size, src_len]
Likewise, a quick test of the Attention module:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
enc = Encoder(INPUT_DIM,ENC_EMB_DIM,ENC_HID_DIM,DEC_HID_DIM,ENC_DROPOUT)
print('--------------Encoder--------------------------')
enc_output,s = enc.forward(torch.LongTensor([[0,2],[3,5],[8,9]]))
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
print('-------------------Attention------------------')
attn.forward(s,enc_output)

--------------Encoder--------------------------
【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([2, 3])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([3, 2, 256])
enc_output.shape: torch.Size([3, 2, 1024])
enc_hidden.shape: torch.Size([2, 2, 512])
enc_hidden[0].shape: torch.Size([2, 512])
enc_hidden[1].shape: torch.Size([2, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([2, 1024])
s.shape: torch.Size([2, 512])
【Encoder中forward函数执行完毕】!------------------------
-------------------Attention------------------
【Attention中forward打印的信息】------------------------------
s.shape: torch.Size([2, 3, 512])
enc_output.shape: torch.Size([2, 3, 1024])
torch.cat((s,enc_output),dim=2).shape: torch.Size([2, 3, 1536])
energy.shape: torch.Size([2, 3, 512])
attention.shape: torch.Size([2, 3])
atten_output.shape: torch.Size([2, 3])
【Attention中forward方法执行完毕】!-------------------------------
tensor([[0.3180, 0.3446, 0.3374],
[0.3199, 0.3413, 0.3388]], grad_fn=)
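One extra check on the returned attention weights (a small follow-up to the test above, reusing attn, s and enc_output from it): thanks to the softmax, every row should sum to 1.

a = attn(s, enc_output)   # [batch_size, src_len]
print(a.sum(dim=1))       # tensor([1.0000, 1.0000], grad_fn=...), one weight distribution per example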

3. Decoder

    The decoder contains the attention layer, which takes all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

    We then use the attention vector to create a weighted source vector, $w_t$, which is a weighted sum of the encoder hidden states, $H$, computed as:

$$w_t = a_t H$$
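In code this weighted sum becomes a batched matrix multiplication; a minimal sketch of the shapes involved (made-up sizes, mirroring the torch.bmm call in the Decoder below):

batch_size, src_len, enc_hid_dim = 2, 3, 4
a = F.softmax(torch.randn(batch_size, 1, src_len), dim=2)   # attention weights, each row sums to 1
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)       # encoder outputs, batch-first
w = torch.bmm(a, H)       # [batch_size, 1, enc_hid_dim * 2]: one weighted source vector per example
print(w.shape)            # torch.Size([2, 1, 8])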

    The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are all passed into the decoder RNN, which corresponds to the blue block in the figure below.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

    We then pass $d(y_t)$, $w_t$ and $s_t$ into the linear layer, $f$, to predict the next word in the target sentence, $\hat{y}_{t+1}$:

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

    The figure below shows decoding the first word:

    The green blocks represent the forward/backward encoder RNNs that produce $H$; the pink block is the context vector, $z = h_T = \tanh(g(h_T^{\rightarrow}, h_T^{\leftarrow})) = s_{t-1} = s_0$; the blue block is the decoder RNN, which produces $s_t$; the purple block is the linear layer, $f$, which produces $\hat{y}_{t+1}$; and the orange block computes the weighted sum over $H$ using $a_t$ and returns $w_t$.

    The decoder code is as follows:

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, dec_input, s, enc_output):
        print('【Decoder中forward函数打印的信息】---------------------------')
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        # [batch_size, 1, emb_dim]; after transpose: [1, batch_size, emb_dim]
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
        print('embeded.shape: ',embedded.shape) # embeded.shape:  torch.Size([1, 128, 256])
        # a = [batch_size, 1, src_len]
        a = self.attention(s, enc_output).unsqueeze(1)
        print('a.shape after attention(s,enc_output).unsqueeze(1): ',a.shape) # torch.Size([128, 1, 31])
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)
        print('enc_output.shape after enc_output.transpose(0,1): ',enc_output.shape)
        # c = [1, batch_size, enc_hid_dim * 2]
        # effectively a matrix multiplication over the last two dimensions, batch by batch
        c = torch.bmm(a, enc_output).transpose(0, 1)
        # torch.Size([128, 1, 31]) * torch.Size([128, 31, 1024])
        # so c has shape [128, 1, 1024], and [1, 128, 1024] after the transpose
        print('enc_output.shape: ',enc_output.shape) # torch.Size([128, 31, 1024])
        print('c.shape after torch.bmm(a,enc_output).transpose(0,1): ',c.shape) # torch.Size([1, 128, 1024])
        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)
        print('rnn_input.shape: ',rnn_input.shape)
        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0)) # dec_input covers a single time-step, so dec_output and dec_hidden have the same shape
        print('------------after rnn--------------')
        print('dec_output.shape: ',dec_output.shape)
        print('dec_hidden.shape: ',dec_hidden.shape)
        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)
        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        print('pred.shape: ',pred.shape)
        print('【Decoder中forward函数执行完毕】!-------------------------')
        return pred, dec_hidden.squeeze(0)
A quick test of the Decoder

    Since the decoder is driven from within the Seq2Seq class, we will test it with real data after introducing Seq2Seq below.

4. Seq2Seq

    In this model, the encoder RNN and the decoder RNN do not need to have the same hidden dimensions.

    What exactly does Seq2Seq need to do? The encoder returns both the final hidden state (built from the forward and backward encoder RNNs and passed through a linear layer), which is used as the initial hidden state of the decoder, and every hidden state (the stacked forward and backward hidden states in outputs). The steps below make sure these are handled correctly by the decoder:

  • The outputs tensor is created to hold all of the predictions, $\hat{Y}$.
  • The source sequence, $X$, is fed into the encoder to receive $z$ and $H$.
  • The initial decoder hidden state is set to the context vector, $s_0 = z$.
  • We use a batch of <sos> tokens as the first input, $y_1$.
  • We then decode within a loop, at each step:
  1. passing the input token, $y_t$, the previous hidden state, $s_{t-1}$, and the encoder hidden states, $H$ (from which the attention layer produces $w_t$), into the decoder;
  2. receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$;
  3. deciding whether to use teacher forcing when setting the next input for the following iteration.

    The code follows. Because this laptop's GPU has only 2 GB of memory, an out-of-memory error occurs after a single loop iteration, so the code has been manually modified here to run only one iteration of the decoding loop (a sketch of the full loop is given first).
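For reference, here is a sketch of what the full decoding loop would look like with the early break and the commented-out lines restored, wrapped in a hypothetical helper function (decode_with_teacher_forcing is not part of the original code; it follows the loop in the referenced tutorial):

def decode_with_teacher_forcing(decoder, dec_input, s, enc_output, trg, outputs, teacher_forcing_ratio=0.5):
    trg_len = trg.shape[0]
    for t in range(1, trg_len):
        dec_output, s = decoder(dec_input, s, enc_output)   # one step: prediction + new hidden state
        outputs[t] = dec_output                             # store the prediction for target position t
        teacher_force = random.random() < teacher_forcing_ratio
        top1 = dec_output.argmax(1)                         # the model's own most likely token
        dec_input = trg[t] if teacher_force else top1       # ground truth vs. own prediction
    return outputs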

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        print('【Seq2Seq中forward函数打印的信息】-----------------------')
        # src = [src_len, batch_size], e.g. src.shape: torch.Size([31, 128])
        # trg = [trg_len, batch_size], e.g. trg.shape: torch.Size([37, 128])
        # teacher_forcing_ratio is probability to use teacher forcing
        batch_size = src.shape[1] # 128
        print('batch_size: ',batch_size)
        trg_len = trg.shape[0]
        print('trg_len: ',trg_len)
        trg_vocab_size = self.decoder.output_dim # 5893
        print('trg_vocab_size: ',trg_vocab_size)
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        print('outputs.shape: ',outputs.shape)
        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src) # calls Encoder.forward
        # first input to the decoder is the <sos> tokens
        dec_input = trg[0,:]
        print('dec_input.shape: ',dec_input.shape)
        for t in range(1, trg_len):
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            #dec_output, s = self.decoder(dec_input, s, enc_output)
            #print('dec_output.shape: ',dec_output.shape)
            #print('s.shape: ',s.shape)
            self.decoder(dec_input, s, enc_output) # as we can see, the decoder processes one time-step at a time
            break
            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1)
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1
        print('【Seq2Seq中forwar函数执行完毕】---------------------------')
        return outputs
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
print('INPUT_DIM:',INPUT_DIM)
print('OUTPUT_DIM:',OUTPUT_DIM)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
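A note on the ignore_index argument above: passing TRG_PAD_IDX means that padded target positions contribute nothing to the loss. A tiny standalone sketch with made-up numbers (not part of the model code):

logits = torch.randn(4, 5)                      # 4 target positions, 5-word vocabulary
target = torch.tensor([2, 3, 1, 1])             # the two trailing 1s play the role of <pad>
loss_fn = nn.CrossEntropyLoss(ignore_index=1)
print(loss_fn(logits, target))                  # averaged over the 2 non-pad positions only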

INPUT_DIM: 7854
OUTPUT_DIM: 5893

for i,batch in enumerate(train_iterator): # when enumerate is run repeatedly, it keeps fetching the next elements
    print(i,batch)
    src = batch.src # each batch holds one src and one trg; a src sentence may have m tokens and the corresponding trg sentence n tokens, and m and n are generally not equal
    trg = batch.trg
    print('---------src----------\n',src)
    print('src.shape: ',src.shape)
    print('----------trg---------\n',trg)
    print('trg.shape: ',trg.shape)
    pred = model(src,trg)
    break

0
[torchtext.data.batch.Batch of size 128 from MULTI30K]
[.src]:[torch.cuda.LongTensor of size 31x128 (GPU 0)]
[.trg]:[torch.cuda.LongTensor of size 37x128 (GPU 0)]
---------src----------
tensor([[ 2, 2, 2, …, 2, 2, 2],
[ 8, 73, 5, …, 3455, 5, 18],
[ 16, 52, 13, …, 65, 1010, 30],
…,
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1]], device=‘cuda:0’)
src.shape: torch.Size([31, 128])
----------trg---------
tensor([[ 2, 2, 2, …, 2, 2, 2],
[ 4, 19, 4, …, 63, 4, 16],
[ 14, 17, 9, …, 17, 1623, 30],
…,
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1]], device=‘cuda:0’)
trg.shape: torch.Size([37, 128])
【Seq2Seq中forward函数打印的信息】-----------------------
batch_size: 128
trg_len: 37
trg_vocab_size: 5893
outputs.shape: torch.Size([37, 128, 5893])
【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([128, 31])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([31, 128, 256])
enc_output.shape: torch.Size([31, 128, 1024])
enc_hidden.shape: torch.Size([2, 128, 512])
enc_hidden[0].shape: torch.Size([128, 512])
enc_hidden[1].shape: torch.Size([128, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([128, 1024])
s.shape: torch.Size([128, 512])
【Encoder中forward函数执行完毕】!------------------------
dec_input.shape: torch.Size([128])
【Decoder中forward函数打印的信息】---------------------------
embeded.shape: torch.Size([1, 128, 256])
【Attention中forward打印的信息】------------------------------
s.shape: torch.Size([128, 31, 512])
enc_output.shape: torch.Size([128, 31, 1024])
torch.cat((s,enc_output),dim=2).shape: torch.Size([128, 31, 1536])
energy.shape: torch.Size([128, 31, 512])
attention.shape: torch.Size([128, 31])
atten_output.shape: torch.Size([128, 31])
【Attention中forward方法执行完毕】!-------------------------------
a.shape after attention(s,enc_output).unsqueeze(1): torch.Size([128, 1, 31])
enc_output.shape after enc_output.transpose(0,1): torch.Size([128, 31, 1024])
enc_output.shape: torch.Size([128, 31, 1024])
c.shape after torch.bmm(a,enc_output).transpose(0,1): torch.Size([1, 128, 1024])
rnn_input.shape: torch.Size([1, 128, 1280])
------------after rnn--------------
dec_output.shape: torch.Size([1, 128, 512])
dec_hidden.shape: torch.Size([1, 128, 512])
pred.shape: torch.Size([128, 5893])
【Decoder中forward函数执行完毕】!-------------------------
【Seq2Seq中forwar函数执行完毕】---------------------------

    Because this machine's GPU does not have enough memory, the full training and evaluation below was not run.

5. Train and evaluate
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

def train(model, iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg # trg = [trg_len, batch_size]
        # pred = [trg_len, batch_size, pred_dim]
        pred = model(src, trg)
        pred_dim = pred.shape[-1]
        # trg = [(trg_len - 1) * batch_size]
        # pred = [(trg_len - 1) * batch_size, pred_dim]
        trg = trg[1:].view(-1)
        pred = pred[1:].view(-1, pred_dim)
        loss = criterion(pred, trg)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg # trg = [trg_len, batch_size]
            # output = [trg_len, batch_size, output_dim]
            output = model(src, trg, 0) # turn off teacher forcing
            output_dim = output.shape[-1]
            # trg = [(trg_len - 1) * batch_size]
            # output = [(trg_len - 1) * batch_size, output_dim]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

best_valid_loss = float('inf')

for epoch in range(10):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

model.load_state_dict(torch.load('tut3-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

    
