We will now implement the model from Neural Machine Translation by Jointly Learning to Align and Translate.

I. The General Encoder-Decoder Model

    As a reminder, the figure below shows the general encoder-decoder model:

II. The Previous Model

    In the previous model, the architecture explicitly passed the context vector, $z$, to the decoder at every time-step (the pink block in the figure above). Then the context vector $z$, the embedded input word $d(y_t)$ (the brown blocks in the figure represent the embedding operation) and the hidden state $s_t$ were passed to a linear layer (the purple block in the figure) to make the prediction.

III. Drawbacks of the Previous Model

    Even though we reduced some of this compression, the context vector $z$ still has to carry all of the information about the source sentence in a single vector. In other words, in the previous model the context vector is a bottleneck: in practice it cannot faithfully hold everything about the source sentence.

IV. How Do We Improve on the Previous Model?

    The model implemented in this notebook avoids this compression by allowing the decoder to look at the entire source sentence (via its hidden states) at each decoding step! How does it do this? It uses attention.

    The attention mechanism works by first computing an attention vector, $a$, whose length equals the length of the source sentence. The attention vector has the property that each element is between 0 and 1 and the whole vector sums to 1. We then compute a weighted sum of the source sentence hidden states, $H$, to get a weighted source vector, $w$.
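To make this concrete, with the notation used later in this post ($H=\{h_1,...,h_T\}$ for the encoder hidden states), the weighted source vector at decoding step $t$ is:

$$w_t = \sum_{i=1}^{T} a_{t,i}\,h_i, \qquad 0 \le a_{t,i} \le 1, \qquad \sum_{i=1}^{T} a_{t,i} = 1$$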

    We compute a new weighted source vector at every decoding time-step and use it as input to the decoder RNN as well as to the linear layer that makes the prediction. This tutorial explains how to do all of this.

V. Preparing the Data

1. Required libraries
'''
refer: https://github.com/bentrevett/pytorch-seq2seq/blob/master/3%20-%20Neural%20Machine%20Translation%20by%20Jointly%20Learning%20to%20Align%20and%20Translate.ipynb
'''
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import spacy
import numpy as np
import random
import math
import time
from torchtext.data import Field,BucketIterator
from torchtext.datasets import Multi30k
2. Set the random seeds for reproducibility
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
3. Load the German and English spaCy models

Note that de is shorthand for de_core_news_sm and en is shorthand for en_core_web_sm. For example, after running:

python3 -m spacy download de

    a file such as de_core_news_sm-2.3.0.tar.gz may be downloaded. Extract it, enter the directory and run:

sudo python3 setup.py install

As shown in the screenshot below:

Then run the pip3 list | grep de command:

pip3 list | grep de

The output looks like this:

We can see the package name: de-core-news-sm.
However, if we import it directly in python3 under that name, we get an error:

    This is because the module name is written in the wrong format in the import statement; changing it to import de_core_news_sm fixes it. In other words, the hyphens become underscores.

The import then succeeds without errors.
The en model is handled the same way, so we won't repeat the steps here.
Once the steps above are complete, we can load both models and use them to process the data.
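As a minimal illustration of the hyphen-versus-underscore point (a sketch assuming the manual installation described above; details may vary with the model package version):

# pip3 list shows the package as "de-core-news-sm",
# but the importable module name uses underscores:
import de_core_news_sm

nlp_de = de_core_news_sm.load()   # equivalent to spacy.load('de_core_news_sm')
print(nlp_de.meta['lang'])        # 'de'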

#! python3 -m spacy download de
#! python3 -m spacy download en
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
4. Create the tokenizers
def tokenize_de(text):
    # Tokenizes German text from a string into a list of strings
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    # Tokenizes English text from a string into a list of strings
    return [tok.text for tok in spacy_en.tokenizer(text)]
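A quick sanity check of the tokenizers (the example sentences are made up here, and the exact token lists depend on the spaCy model version):

print(tokenize_de("Zwei junge Männer spielen Fußball."))
# expected along the lines of: ['Zwei', 'junge', 'Männer', 'spielen', 'Fußball', '.']
print(tokenize_en("Two young men are playing football."))
# expected along the lines of: ['Two', 'young', 'men', 'are', 'playing', 'football', '.']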
The fields are the same as before.
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token='<eos>',lower = True)
TRG = Field(tokenize = tokenize_en,init_token='<sos>',eos_token='<eos>',lower=True)

/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py:150: UserWarning: Field class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

5. Load the data
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))

/usr/local/lib/python3.6/dist-packages/torchtext/data/example.py:78: UserWarning: Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('Example class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.', UserWarning)

Let's print some information to inspect the data:
print(type(train_data))
print(len(train_data))
print(train_data.shape) # Dataset has no real shape attribute; torchtext's __getattr__ returns a generator here (see the output below)
print(len(valid_data))
print(len(test_data))
print(test_data[0])
print(type(test_data[0]))
print(test_data[0])

<class 'torchtext.datasets.translation.Multi30k'>
29000
<generator object Dataset.__getattr__ at 0x7f86b9858f68>
1014
1000
<torchtext.data.example.Example object at 0x7f86b9bfdb70>
<class 'torchtext.data.example.Example'>
<torchtext.data.example.Example object at 0x7f86b9bfdb70>

6. Build the vocabulary
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
print(len(SRC.vocab))
print(len(TRG.vocab))
print(SRC.vocab[1000])
print(type(SRC.vocab))
print(type(SRC.vocab[3]))

7854
5893
0
<class 'torchtext.vocab.Vocab'>
<class 'int'>
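The 0 printed for print(SRC.vocab[1000]) deserves a note: in legacy torchtext, Vocab.__getitem__ looks its argument up in the string-to-index dictionary stoi, so the integer 1000 is simply treated as an unknown token and falls back to index 0, i.e. <unk>. A small sketch of the more usual lookups (the word 'hund' is just an illustrative guess at a vocabulary entry):

print(SRC.vocab.itos[:4])       # ['<unk>', '<pad>', '<sos>', '<eos>']
print(SRC.vocab.stoi['<pad>'])  # 1 -- which is why the padded batches later are full of 1s
print(SRC.vocab.stoi['hund'])   # index of a real German token, if it occurred at least min_freq times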

7. Define the device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda

8. Create the iterators
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), batch_size = BATCH_SIZE,device = device)

/usr/local/lib/python3.6/dist-packages/torchtext/data/iterator.py:48: UserWarning: BucketIterator class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

print(train_iterator)
print(len(train_iterator))
for i in train_iterator:
    print(i)

<torchtext.data.iterator.BucketIterator object at 0x7f86b9812cf8>
227
/usr/local/lib/python3.6/dist-packages/torchtext/data/batch.py:23: UserWarning: Batch class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.
warnings.warn('{} class will be retired soon and moved to torchtext.legacy. Please see the most recent release notes for further information.'.format(self.__class__.__name__), UserWarning)

[torchtext.data.batch.Batch of size 128 from MULTI30K]
[.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
[.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]

VI. Building the Seq2Seq Model

1. Encoder

    First, we build the encoder. Similar to the previous model, we only use a single-layer GRU, but we now use a bidirectional RNN. With a bidirectional RNN, we have two RNNs in each layer. A forward RNN goes over the embedded sentence from left to right (shown in green in the figure below), and a backward RNN goes over the embedded sentence from right to left (teal).
    All we need to do in code is set bidirectional = True and then pass the embedded sentence to the RNN as before.

We now have, for the forward and backward RNNs respectively:

$$h_t^{\rightarrow} = \text{EncoderGRU}^{\rightarrow}(e(x_t^{\rightarrow}), h_{t-1}^{\rightarrow})$$

$$h_t^{\leftarrow} = \text{EncoderGRU}^{\leftarrow}(e(x_t^{\leftarrow}), h_{t-1}^{\leftarrow})$$

where $x^{\rightarrow}$ is the source sentence read left-to-right and $x^{\leftarrow}$ is the same sentence read right-to-left.


    As before, we only pass the embedded input to the RNN; PyTorch then initializes the forward and backward initial hidden states ($h_0^{\rightarrow}$ and $h_0^{\leftarrow}$, respectively) to tensors of all zeros. We also get two context vectors: one from the forward RNN after it has seen the last word in the sentence, $z^{\rightarrow}=h_T^{\rightarrow}$, and one from the backward RNN after it has seen the first word in the sentence, $z^{\leftarrow}=h_T^{\leftarrow}$.

The RNN returns outputs and hidden. For the GRU's parameters and return values, see: https://blog.csdn.net/jiuweideqixu/article/details/109492863

    outputs has size [src_len, batch_size, hid_dim * num_directions], where num_directions is 2 for a bidirectional GRU. The first hid_dim elements of the third axis are the hidden states from the top layer of the forward RNN, and the last hid_dim elements are the hidden states from the top layer of the backward RNN. We can therefore view the third axis as the forward and backward hidden states concatenated together, e.g. $h_1=[h_1^{\rightarrow};h_T^{\leftarrow}]$, $h_2=[h_2^{\rightarrow};h_{T-1}^{\leftarrow}]$, and we denote all of the encoder hidden states as $H=\{h_1,h_2,...,h_T\}$.

    hidden has size [n_layers * num_directions, batch_size, hid_dim], where [-2, :, :] is the top-layer hidden state of the forward RNN after the final time-step (i.e. after it has seen the last word in the sentence), and [-1, :, :] is the top-layer hidden state of the backward RNN after its final time-step (i.e. after it has seen the first word in the sentence).
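As a quick sanity check of these two layouts (a standalone sketch with made-up sizes, separate from the model code), for a single-layer bidirectional GRU the final forward state sits at the last time-step of outputs and the final backward state at the first time-step:

rnn = nn.GRU(input_size=4, hidden_size=3, bidirectional=True)
x = torch.randn(5, 2, 4)            # [src_len=5, batch_size=2, emb_dim=4]
outputs, hidden = rnn(x)            # outputs: [5, 2, 6], hidden: [2, 2, 3]
# forward RNN: final hidden state == last time-step, first hid_dim channels
assert torch.allclose(outputs[-1, :, :3], hidden[-2])
# backward RNN: final hidden state == first time-step, last hid_dim channels
assert torch.allclose(outputs[0, :, 3:], hidden[-1])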

    The decoder also uses an RNN, but its GRU is unidirectional, so it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$. The encoder, however, gives us two final hidden states, a forward one and a backward one ($z^{\rightarrow}=h_T^{\rightarrow}$ and $z^{\leftarrow}=h_T^{\leftarrow}$). We solve this by concatenating the two, passing them through a linear layer, $g$, and applying the $\tanh$ activation function:

$$z = \tanh(g(h_T^{\rightarrow}, h_T^{\leftarrow}))$$

This $z$ is the initial decoder hidden state that the encoder returns.

    Note: this is actually a deviation from the paper. In the paper, only the backward RNN's hidden state (for the first source token) is passed through the linear layer $g$ to obtain the context vector / initial decoder hidden state.

    Because we want the model to look back over the whole source sentence, we return outputs, which stores the forward and backward hidden states for every token in the source sentence. We also return hidden, which acts as the initial hidden state of the decoder.

    The encoder code is as follows:

class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim) # input_dim is the number of distinct words (source vocabulary size)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        print('【Encoder中打印的内容】------------------------------------------')
        '''
        src = [src_len, batch_size] # src_len is the number of tokens in the sentence
        '''
        src = src.transpose(0, 1) # src = [batch_size, src_len]
        print('src.shape:',src.shape) # this prints the shape after the transpose
        # after embedding and dropout the size is [batch_size, src_len, emb_dim]; we then transpose the first two dimensions
        # embedded = [src_len, batch_size, emb_dim]
        embedded = self.dropout(self.embedding(src)).transpose(0, 1)
        print(type(embedded))
        print('embbedded.shape:',embedded.shape) # embbedded.shape: torch.Size([3, 2, 256])
        # enc_output = [src_len, batch_size, hid_dim * num_directions]; num_directions is 2 for a bidirectional RNN, otherwise 1
        # enc_hidden = [n_layers * num_directions, batch_size, hid_dim]
        enc_output, enc_hidden = self.rnn(embedded) # if h_0 is not given, it defaults to all zeros
        print('enc_output.shape:',enc_output.shape) # enc_output.shape: torch.Size([3, 2, 1024])
        print('enc_hidden.shape:',enc_hidden.shape) # enc_hidden.shape: torch.Size([2, 2, 512])
        print('enc_hidden[0].shape: ',enc_hidden[0].shape)
        print('enc_hidden[1].shape: ',enc_hidden[1].shape)
        cc = torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1) # this turns the 3-D tensor into a 2-D one
        print('torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: ',cc.shape) # torch.Size([2, 1024])
        #print(enc_hidden[2].shape)
        # enc_hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # enc_output are always from the last layer
        # enc_hidden [-2, :, : ] is the last of the forwards RNN
        # enc_hidden [-1, :, : ] is the last of the backwards RNN
        # initial decoder hidden is final hidden state of the forwards and backwards
        # encoder RNNs fed through a linear layer
        # s = [batch_size, dec_hid_dim]
        s = torch.tanh(self.fc(cc)) # effectively [2 x 1024] times an [enc_hid_dim * 2, dec_hid_dim] weight, so s ends up with shape [2, 512]
        print('s.shape:',s.shape) # s.shape: torch.Size([2, 512])
        print('【Encoder中forward函数执行完毕】!------------------------')
        return enc_output, s
A quick test of the Encoder:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
enc = Encoder(INPUT_DIM,ENC_EMB_DIM,ENC_HID_DIM,DEC_HID_DIM,ENC_DROPOUT)
enc.forward(torch.LongTensor([[0,2],[3,5],[8,9]]))

【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([2, 3])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([3, 2, 256])
enc_output.shape: torch.Size([3, 2, 1024])
enc_hidden.shape: torch.Size([2, 2, 512])
enc_hidden[0].shape: torch.Size([2, 512])
enc_hidden[1].shape: torch.Size([2, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([2, 1024])
s.shape: torch.Size([2, 512])
【Encoder中forward函数执行完毕】!------------------------
(tensor([[[ 0.3615, 0.2286, 0.3451, …, -0.2103, -0.1843, 0.0026],
[-0.2185, 0.2574, -0.3948, …, -0.5293, -0.3931, -0.1950]],

[[ 0.2123, 0.1713, 0.2521, …, -0.1816, -0.0419, 0.0432],
[-0.5082, -0.0661, 0.1635, …, 0.1175, 0.0220, -0.0448]],

[[ 0.1594, 0.0015, 0.4782, …, -0.1525, 0.0692, 0.1618],
[-0.2713, -0.2825, -0.4277, …, 0.1342, 0.2105, 0.3363]]],
grad_fn=),
tensor([[-0.0471, -0.1929, 0.0166, …, -0.2596, -0.0213, -0.1967],
[-0.1615, 0.0246, -0.2823, …, 0.0091, 0.2008, -0.0206]],
grad_fn=))

2. Attention

To aid understanding, the figure below illustrates the attention computation for the first decoding step, where $s_{t-1}=s_0=z$. The green blocks represent all of the hidden states of the forward and backward encoder RNNs, and the pink block is where the attention computation happens.

    In the figure above, the pink box in the top-right corner is our attention module; as the arrows show, it takes two inputs, the encoder's output and hidden. The four square boxes in the middle row are the GRU units of the encoder module described above.

    As we can see, attention takes the hidden state returned by the encoder, which we denote $s_{t-1}$, together with the encoder's outputs, denoted $H$. The attention layer outputs an attention vector, $a_t$; as the top right of the figure shows, each element of $a_t$ is between 0 and 1 and the elements sum to 1. To predict the next word more accurately, $a_t$ tells us which words in the source sentence should receive the most attention.

    First, we use the $s_{t-1}$ and $H$ just mentioned to compute the energy. If you know the GRU well and look only at the figure above, the $z$ and $h_4$ marked there are tensors, and these two tensors are identical. In the attention module we effectively concatenate output and hidden. However, although $s_{t-1}$ (the $z$ in the figure) and $h_{t-1}$ (the $h_4$ in the figure) have the same dimensionality, $s_{t-1}$ and the output returned by the encoder do not, so the attention module repeats $s_{t-1}$ src_len times before concatenating (see the code).

$$E_t = \tanh(\text{attn}(\text{repeat}(s_{t-1}), H))$$

    This can be interpreted as measuring how well each encoder hidden state stored in enc_output matches the previous decoder hidden state $s_{t-1}$.

    In the code, $E_t$ has shape [batch_size, src_len, dec_hid_dim]. We then apply a linear transformation to $E_t$: multiplying it by a [dec_hid_dim, 1] matrix $v$:

$$\hat{a}_t = v E_t$$

    We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. $v$ is initialized randomly but learned, along with the rest of the model, via backpropagation. Note that $v$ is not dependent on time: the same $v$ is used for every time-step of decoding. We implement $v$ as a linear layer without a bias.

    Finally, to ensure that all elements of the attention vector are between 0 and 1 and that they sum to 1, we apply the softmax function:

$$a_t = \text{softmax}(\hat{a}_t)$$

    This gives us the attention over the entire source sentence.

    The attention code is as follows:

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim, bias=False) # linear transformation
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)

    def forward(self, s, enc_output):
        print('【Attention中forward打印的信息】------------------------------')
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        batch_size = enc_output.shape[1]
        src_len = enc_output.shape[0]
        # repeat decoder hidden state src_len times
        s = s.unsqueeze(1).repeat(1, src_len, 1) # s = [batch_size, src_len, dec_hid_dim]
        enc_output = enc_output.transpose(0, 1) # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        print('s.shape:',s.shape)
        print('enc_output.shape:',enc_output.shape)
        # cc = [batch_size, src_len, dec_hid_dim + enc_hid_dim * 2] (= [batch_size, src_len, 3 * dec_hid_dim] here, since enc_hid_dim == dec_hid_dim)
        cc = torch.cat((s,enc_output),dim=2)
        print('torch.cat((s,enc_output),dim=2).shape:',cc.shape)
        # energy = [batch_size, src_len, dec_hid_dim]
        energy = torch.tanh(self.attn(cc)) # the attn layer is just a linear transformation
        print('energy.shape:',energy.shape)
        # attention = [batch_size, src_len]
        attention = self.v(energy).squeeze(2) # after the linear layer, squeeze(2) removes the third dimension of size 1
        print('attention.shape:',attention.shape)
        atten_output = F.softmax(attention, dim=1)
        print('atten_output.shape:',atten_output.shape)
        print('【Attention中forward方法执行完毕】!-------------------------------')
        return F.softmax(attention, dim=1) # [batch_size, src_len]
Likewise, a quick test of the Attention module:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
enc = Encoder(INPUT_DIM,ENC_EMB_DIM,ENC_HID_DIM,DEC_HID_DIM,ENC_DROPOUT)
print('--------------Encoder--------------------------')
enc_output,s = enc.forward(torch.LongTensor([[0,2],[3,5],[8,9]]))
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
print('-------------------Attention------------------')
attn.forward(s,enc_output)

--------------Encoder--------------------------
【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([2, 3])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([3, 2, 256])
enc_output.shape: torch.Size([3, 2, 1024])
enc_hidden.shape: torch.Size([2, 2, 512])
enc_hidden[0].shape: torch.Size([2, 512])
enc_hidden[1].shape: torch.Size([2, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([2, 1024])
s.shape: torch.Size([2, 512])
【Encoder中forward函数执行完毕】!------------------------
-------------------Attention------------------
【Attention中forward打印的信息】------------------------------
s.shape: torch.Size([2, 3, 512])
enc_output.shape: torch.Size([2, 3, 1024])
torch.cat((s,enc_output),dim=2).shape: torch.Size([2, 3, 1536])
energy.shape: torch.Size([2, 3, 512])
attention.shape: torch.Size([2, 3])
atten_output.shape: torch.Size([2, 3])
【Attention中forward方法执行完毕】!-------------------------------
tensor([[0.3180, 0.3446, 0.3374],
[0.3199, 0.3413, 0.3388]], grad_fn=)
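One extra check on the returned attention weights (a small follow-up to the test above, reusing attn, s and enc_output from it): thanks to the softmax, every row should sum to 1.

a = attn(s, enc_output)   # [batch_size, src_len]
print(a.sum(dim=1))       # tensor([1.0000, 1.0000], grad_fn=...), one weight distribution per example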

3. Decoder

    The decoder contains the attention layer, which takes all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

    We then use the attention vector to create a weighted source vector, $w_t$, which is a weighted sum of the encoder hidden states, $H$, computed as:

$$w_t = a_t H$$
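In code this weighted sum becomes a batched matrix multiplication; a minimal sketch of the shapes involved (made-up sizes, mirroring the torch.bmm call in the Decoder below):

batch_size, src_len, enc_hid_dim = 2, 3, 4
a = F.softmax(torch.randn(batch_size, 1, src_len), dim=2)   # attention weights, each row sums to 1
H = torch.randn(batch_size, src_len, enc_hid_dim * 2)       # encoder outputs, batch-first
w = torch.bmm(a, H)       # [batch_size, 1, enc_hid_dim * 2]: one weighted source vector per example
print(w.shape)            # torch.Size([2, 1, 8])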

    The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are all passed into the decoder RNN, which corresponds to the blue block in the figure below.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

    We then pass $d(y_t)$, $w_t$ and $s_t$ into the linear layer, $f$, to predict the next word in the target sentence, $\hat{y}_{t+1}$:

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

    The figure below shows decoding the first word:

    The green blocks represent the forward/backward encoder RNNs that produce $H$; the pink block is the context vector, $z = h_T = \tanh(g(h_T^{\rightarrow}, h_T^{\leftarrow})) = s_{t-1} = s_0$; the blue block is the decoder RNN, which produces $s_t$; the purple block is the linear layer, $f$, which produces $\hat{y}_{t+1}$; and the orange block computes the weighted sum over $H$ using $a_t$ and returns $w_t$.

    The decoder code is as follows:

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, dec_input, s, enc_output):
        print('【Decoder中forward函数打印的信息】---------------------------')
        # dec_input = [batch_size]
        # s = [batch_size, dec_hid_dim]
        # enc_output = [src_len, batch_size, enc_hid_dim * 2]
        dec_input = dec_input.unsqueeze(1) # dec_input = [batch_size, 1]
        # [batch_size, 1, emb_dim]; after transpose: [1, batch_size, emb_dim]
        embedded = self.dropout(self.embedding(dec_input)).transpose(0, 1) # embedded = [1, batch_size, emb_dim]
        print('embeded.shape: ',embedded.shape) # embeded.shape:  torch.Size([1, 128, 256])
        # a = [batch_size, 1, src_len]
        a = self.attention(s, enc_output).unsqueeze(1)
        print('a.shape after attention(s,enc_output).unsqueeze(1): ',a.shape) # torch.Size([128, 1, 31])
        # enc_output = [batch_size, src_len, enc_hid_dim * 2]
        enc_output = enc_output.transpose(0, 1)
        print('enc_output.shape after enc_output.transpose(0,1): ',enc_output.shape)
        # c = [1, batch_size, enc_hid_dim * 2]
        # effectively a matrix multiplication over the last two dimensions, batch by batch
        c = torch.bmm(a, enc_output).transpose(0, 1)
        # torch.Size([128, 1, 31]) * torch.Size([128, 31, 1024])
        # so c has shape [128, 1, 1024], and [1, 128, 1024] after the transpose
        print('enc_output.shape: ',enc_output.shape) # torch.Size([128, 31, 1024])
        print('c.shape after torch.bmm(a,enc_output).transpose(0,1): ',c.shape) # torch.Size([1, 128, 1024])
        # rnn_input = [1, batch_size, (enc_hid_dim * 2) + emb_dim]
        rnn_input = torch.cat((embedded, c), dim = 2)
        print('rnn_input.shape: ',rnn_input.shape)
        # dec_output = [src_len(=1), batch_size, dec_hid_dim]
        # dec_hidden = [n_layers * num_directions, batch_size, dec_hid_dim]
        dec_output, dec_hidden = self.rnn(rnn_input, s.unsqueeze(0)) # dec_input covers a single time-step, so dec_output and dec_hidden have the same shape
        print('------------after rnn--------------')
        print('dec_output.shape: ',dec_output.shape)
        print('dec_hidden.shape: ',dec_hidden.shape)
        # embedded = [batch_size, emb_dim]
        # dec_output = [batch_size, dec_hid_dim]
        # c = [batch_size, enc_hid_dim * 2]
        embedded = embedded.squeeze(0)
        dec_output = dec_output.squeeze(0)
        c = c.squeeze(0)
        # pred = [batch_size, output_dim]
        pred = self.fc_out(torch.cat((dec_output, c, embedded), dim = 1))
        print('pred.shape: ',pred.shape)
        print('【Decoder中forward函数执行完毕】!-------------------------')
        return pred, dec_hidden.squeeze(0)
A quick test of the Decoder

    Since the decoder is driven from within the Seq2Seq class, we will test it with real data after introducing Seq2Seq below.

4. Seq2Seq

    In this model, the encoder RNN and the decoder RNN do not need to have the same hidden dimensions.

    What exactly does Seq2Seq need to do? The encoder returns both the final hidden state (built from the forward and backward encoder RNNs and passed through a linear layer), which is used as the initial hidden state of the decoder, and every hidden state (the stacked forward and backward hidden states in outputs). The steps below make sure these are handled correctly by the decoder:

  • The outputs tensor is created to hold all of the predictions, $\hat{Y}$.
  • The source sequence, $X$, is fed into the encoder to receive $z$ and $H$.
  • The initial decoder hidden state is set to the context vector, $s_0 = z$.
  • We use a batch of <sos> tokens as the first input, $y_1$.
  • We then decode within a loop, at each step:
  1. passing the input token, $y_t$, the previous hidden state, $s_{t-1}$, and the encoder hidden states, $H$ (from which the attention layer produces $w_t$), into the decoder;
  2. receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$;
  3. deciding whether to use teacher forcing when setting the next input for the following iteration.

    The code follows. Because this laptop's GPU has only 2 GB of memory, an out-of-memory error occurs after a single loop iteration, so the code has been manually modified here to run only one iteration of the decoding loop (a sketch of the full loop is given first).
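For reference, here is a sketch of what the full decoding loop would look like with the early break and the commented-out lines restored, wrapped in a hypothetical helper function (decode_with_teacher_forcing is not part of the original code; it follows the loop in the referenced tutorial):

def decode_with_teacher_forcing(decoder, dec_input, s, enc_output, trg, outputs, teacher_forcing_ratio=0.5):
    trg_len = trg.shape[0]
    for t in range(1, trg_len):
        dec_output, s = decoder(dec_input, s, enc_output)   # one step: prediction + new hidden state
        outputs[t] = dec_output                             # store the prediction for target position t
        teacher_force = random.random() < teacher_forcing_ratio
        top1 = dec_output.argmax(1)                         # the model's own most likely token
        dec_input = trg[t] if teacher_force else top1       # ground truth vs. own prediction
    return outputs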

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        print('【Seq2Seq中forward函数打印的信息】-----------------------')
        # src = [src_len, batch_size], e.g. src.shape: torch.Size([31, 128])
        # trg = [trg_len, batch_size], e.g. trg.shape: torch.Size([37, 128])
        # teacher_forcing_ratio is probability to use teacher forcing
        batch_size = src.shape[1] # 128
        print('batch_size: ',batch_size)
        trg_len = trg.shape[0]
        print('trg_len: ',trg_len)
        trg_vocab_size = self.decoder.output_dim # 5893
        print('trg_vocab_size: ',trg_vocab_size)
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        print('outputs.shape: ',outputs.shape)
        # enc_output is all hidden states of the input sequence, back and forwards
        # s is the final forward and backward hidden states, passed through a linear layer
        enc_output, s = self.encoder(src) # calls Encoder.forward
        # first input to the decoder is the <sos> tokens
        dec_input = trg[0,:]
        print('dec_input.shape: ',dec_input.shape)
        for t in range(1, trg_len):
            # insert dec_input token embedding, previous hidden state and all encoder hidden states
            # receive output tensor (predictions) and new hidden state
            #dec_output, s = self.decoder(dec_input, s, enc_output)
            #print('dec_output.shape: ',dec_output.shape)
            #print('s.shape: ',s.shape)
            self.decoder(dec_input, s, enc_output) # as we can see, the decoder processes one time-step at a time
            break
            # place predictions in a tensor holding predictions for each token
            outputs[t] = dec_output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = dec_output.argmax(1)
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            dec_input = trg[t] if teacher_force else top1
        print('【Seq2Seq中forwar函数执行完毕】---------------------------')
        return outputs
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
print('INPUT_DIM:',INPUT_DIM)
print('OUTPUT_DIM:',OUTPUT_DIM)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
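A note on the ignore_index argument above: passing TRG_PAD_IDX means that padded target positions contribute nothing to the loss. A tiny standalone sketch with made-up numbers (not part of the model code):

logits = torch.randn(4, 5)                      # 4 target positions, 5-word vocabulary
target = torch.tensor([2, 3, 1, 1])             # the two trailing 1s play the role of <pad>
loss_fn = nn.CrossEntropyLoss(ignore_index=1)
print(loss_fn(logits, target))                  # averaged over the 2 non-pad positions only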

INPUT_DIM: 7854
OUTPUT_DIM: 5893

for i,batch in enumerate(train_iterator): # when enumerate is run repeatedly, it keeps fetching the next elements
    print(i,batch)
    src = batch.src # each batch holds one src and one trg; a src sentence may have m tokens and the corresponding trg sentence n tokens, and m and n are generally not equal
    trg = batch.trg
    print('---------src----------\n',src)
    print('src.shape: ',src.shape)
    print('----------trg---------\n',trg)
    print('trg.shape: ',trg.shape)
    pred = model(src,trg)
    break

0
[torchtext.data.batch.Batch of size 128 from MULTI30K]
[.src]:[torch.cuda.LongTensor of size 31x128 (GPU 0)]
[.trg]:[torch.cuda.LongTensor of size 37x128 (GPU 0)]
---------src----------
tensor([[ 2, 2, 2, …, 2, 2, 2],
[ 8, 73, 5, …, 3455, 5, 18],
[ 16, 52, 13, …, 65, 1010, 30],
…,
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1]], device=‘cuda:0’)
src.shape: torch.Size([31, 128])
----------trg---------
tensor([[ 2, 2, 2, …, 2, 2, 2],
[ 4, 19, 4, …, 63, 4, 16],
[ 14, 17, 9, …, 17, 1623, 30],
…,
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1],
[ 1, 1, 1, …, 1, 1, 1]], device=‘cuda:0’)
trg.shape: torch.Size([37, 128])
【Seq2Seq中forward函数打印的信息】-----------------------
batch_size: 128
trg_len: 37
trg_vocab_size: 5893
outputs.shape: torch.Size([37, 128, 5893])
【Encoder中打印的内容】------------------------------------------
src.shape: torch.Size([128, 31])
<class 'torch.Tensor'>
embbedded.shape: torch.Size([31, 128, 256])
enc_output.shape: torch.Size([31, 128, 1024])
enc_hidden.shape: torch.Size([2, 128, 512])
enc_hidden[0].shape: torch.Size([128, 512])
enc_hidden[1].shape: torch.Size([128, 512])
torch.cat((enc_hidden[-2,:,:],enc_hidden[-1,:,:]),dim=1).shape: torch.Size([128, 1024])
s.shape: torch.Size([128, 512])
【Encoder中forward函数执行完毕】!------------------------
dec_input.shape: torch.Size([128])
【Decoder中forward函数打印的信息】---------------------------
embeded.shape: torch.Size([1, 128, 256])
【Attention中forward打印的信息】------------------------------
s.shape: torch.Size([128, 31, 512])
enc_output.shape: torch.Size([128, 31, 1024])
torch.cat((s,enc_output),dim=2).shape: torch.Size([128, 31, 1536])
energy.shape: torch.Size([128, 31, 512])
attention.shape: torch.Size([128, 31])
atten_output.shape: torch.Size([128, 31])
【Attention中forward方法执行完毕】!-------------------------------
a.shape after attention(s,enc_output).unsqueeze(1): torch.Size([128, 1, 31])
enc_output.shape after enc_output.transpose(0,1): torch.Size([128, 31, 1024])
enc_output.shape: torch.Size([128, 31, 1024])
c.shape after torch.bmm(a,enc_output).transpose(0,1): torch.Size([1, 128, 1024])
rnn_input.shape: torch.Size([1, 128, 1280])
------------after rnn--------------
dec_output.shape: torch.Size([1, 128, 512])
dec_hidden.shape: torch.Size([1, 128, 512])
pred.shape: torch.Size([128, 5893])
【Decoder中forward函数执行完毕】!-------------------------
【Seq2Seq中forwar函数执行完毕】---------------------------

    Because this machine's GPU does not have enough memory, the full training and evaluation below was not run.

5. Train and evaluate
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)
model = Seq2Seq(enc, dec, device).to(device)
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)

def train(model, iterator, optimizer, criterion):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.src
        trg = batch.trg # trg = [trg_len, batch_size]
        # pred = [trg_len, batch_size, pred_dim]
        pred = model(src, trg)
        pred_dim = pred.shape[-1]
        # trg = [(trg_len - 1) * batch_size]
        # pred = [(trg_len - 1) * batch_size, pred_dim]
        trg = trg[1:].view(-1)
        pred = pred[1:].view(-1, pred_dim)
        loss = criterion(pred, trg)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg # trg = [trg_len, batch_size]
            # output = [trg_len, batch_size, output_dim]
            output = model(src, trg, 0) # turn off teacher forcing
            output_dim = output.shape[-1]
            # trg = [(trg_len - 1) * batch_size]
            # output = [(trg_len - 1) * batch_size, output_dim]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

best_valid_loss = float('inf')

for epoch in range(10):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut3-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

model.load_state_dict(torch.load('tut3-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

    
