Table of Contents

  • Introduction
  • Data Preprocessing
  • Building the Model
    • Encoder
    • Attention
    • Decoder
    • Seq2Seq
  • Training the Seq2Seq Model
  • Inference
  • BLEU
  • Complete Code

Introduction

In this notebook we will add a few improvements to the previous model: packed padded sequences and masking. Packed padded sequences are used to tell the RNN to skip over padding tokens in the encoder. Masking explicitly forces the model to ignore certain values, such as attention over padded elements. Both of these techniques are commonly used in NLP.

We will also look at how to use our model for inference: giving it a sentence, seeing what it translates it as, and seeing where it actually pays attention when translating each word.

Finally, we will use the BLEU metric to measure the quality of our translations.

Data Preprocessing

  • First, we import all the modules as before, adding matplotlib, which we will use to view the attention.
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time
  • Next, we set the random seeds for reproducibility.
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
  • As before, we load spaCy and define the German and English tokenizers.
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')
  • We use the tokenizers to split text into lists of tokens.
def tokenize_de(text):
    """Tokenizes German text from a string into a list of strings"""
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """Tokenizes English text from a string into a list of strings"""
    return [tok.text for tok in spacy_en.tokenizer(text)]
  • When using packed padded sequences, we need to tell PyTorch how long the actual (non-padded) sequences are. Luckily, TorchText's Field objects allow us to use the include_lengths argument, which makes our batch.src a tuple. The first element of the tuple is the same as before: a batch of numericalized source sentences as a tensor. The second element is the non-padded length of each source sentence within the batch.
SRC = Field(tokenize = tokenize_de,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            include_lengths = True)

TRG = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)
  • Then we load the data
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
  • and build the vocabularies
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
  • Next, we handle the iterators.
    (See also: pack_padded_sequence() and pad_packed_sequence() for RNNs in PyTorch.)
    One quirk about packed padded sequences is that all elements in the batch need to be sorted by their non-padded length in descending order, i.e. the first sentence in the batch needs to be the longest. We use two arguments of the iterator to handle this: sort_within_batch, which tells the iterator that the contents of the batch need to be sorted, and sort_key, a function which tells the iterator how to sort the elements in the batch. Here, we sort by the length of the src sentence (see the short aside after the iterator code below for the unsorted case).
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,          # sort the contents of each batch
    sort_key = lambda x : len(x.src),  # sort by the length of the source sentence
    device = device)
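As an aside (not something this tutorial relies on): if you ever cannot guarantee the sorting, recent PyTorch versions let pack_padded_sequence accept unsorted lengths via enforce_sorted=False, and the original batch order is restored when unpacking. A minimal standalone sketch with toy tensors:

import torch
import torch.nn as nn

# toy embedded batch, [seq len, batch size, features]; lengths deliberately unsorted
embedded = torch.randn(6, 3, 4)
lengths = torch.tensor([4, 6, 3])

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths, enforce_sorted=False)
outputs, out_lengths = nn.utils.rnn.pad_packed_sequence(packed)

print(out_lengths)  # tensor([4, 6, 3]) -- the original batch order is restored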

Building the Model

Encoder

Next, we define the encoder.

Compared to the previous model, the changes are all within the forward method. It now takes two arguments: the source sentence itself (src) and the lengths of the source sentences (src_len).

After the source sentence (padded automatically within the iterator) has been embedded, we can use pack_padded_sequence on it together with the sentence lengths. packed_embedded is then our packed padded sequence. It can be fed to our RNN as normal, which returns packed_outputs, a packed tensor containing all of the hidden states from the sequence, and hidden, which is simply the final hidden state from the sequence. hidden is a standard tensor and is not packed in any way; the only difference is that, because the input was a packed sequence, this tensor comes from the final non-padded element in each sequence.

We then unpack packed_outputs using pad_packed_sequence, which returns the outputs and the lengths of each output (which we don't need).

The first dimension of outputs is the padded sequence length. However, because we used a packed padded sequence, the values of this tensor are all zeros at positions where the input was a padding token.
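To make this concrete, here is a minimal, self-contained sketch (toy tensors, not the tutorial's model) showing that the unpacked outputs are zero at padded positions, while hidden comes from the last real token:

import torch
import torch.nn as nn

# toy batch: 2 sequences padded to length 5, feature size 4 -> [src len, batch size, emb dim]
embedded = torch.randn(5, 2, 4)
src_len = torch.tensor([5, 3])  # the second sequence has two padding positions

rnn = nn.GRU(4, 8)

packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len)
packed_outputs, hidden = rnn(packed_embedded)
outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)

print(outputs.shape)    # torch.Size([5, 2, 8])
print(outputs[3:, 1])   # all zeros: time-steps 3 and 4 were padding for the second sequence
print(torch.equal(hidden[0, 1], outputs[2, 1]))  # True: hidden is from the last non-padded step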

  • Encoder
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_len):
        # src = [src len, batch size]
        # src_len = [batch size]
        embedded = self.dropout(self.embedding(src))
        # embedded = [src len, batch size, emb dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len)
        packed_outputs, hidden = self.rnn(packed_embedded)
        # packed_outputs is a packed sequence containing all hidden states
        # hidden is now from the final non-padded element in the batch
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
        # outputs is now a non-packed sequence, all hidden states obtained
        #  when the input is a pad token are all zeros
        # outputs = [src len, batch size, hid dim * num directions]
        # hidden = [n layers * num directions, batch size, hid dim]
        # hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # outputs are always from the last layer
        # hidden [-2, :, :] is the last of the forwards RNN
        # hidden [-1, :, :] is the last of the backwards RNN
        # initial decoder hidden is final hidden state of the forwards and backwards
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        # outputs = [src len, batch size, enc hid dim * 2]
        # hidden = [batch size, dec hid dim]
        return outputs, hidden

Attention

The attention module is where we calculate the attention values over the source sentence.

Previously, we allowed this module to "pay attention" to padding tokens within the source sentence. However, using a mask, we can force the attention to only be over non-padding elements.

The forward method now takes a mask input. This is a [batch_size, src_len] tensor that is 1 when the source sentence token is not a padding token, and 0 when it is a padding token. For example, if the source sentence is ["hello", "how", "are", "you", "?", <pad>, <pad>], then the mask would be [1, 1, 1, 1, 1, 0, 0].

We apply the mask after the attention has been calculated, but before it has been normalized by the softmax function. It is applied using masked_fill, which fills the tensor at each element where the first argument (mask == 0) is true with the value given by the second argument (-1e10). In other words, it takes the un-normalized attention values and changes the attention values over padding elements to -1e10. As these numbers are minuscule compared to the other values, they become zero when passed through the softmax layer, ensuring no attention is paid to padding tokens in the source sentence.
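As a quick standalone illustration (toy numbers, not taken from the model), masked_fill followed by softmax pushes the masked positions to effectively zero probability:

import torch
import torch.nn.functional as F

attention = torch.tensor([[1.2, 0.3, -0.5, 0.8, 0.1]])  # un-normalized scores, [batch size = 1, src len = 5]
mask      = torch.tensor([[1, 1, 1, 0, 0]])             # the last two source tokens are padding

attention = attention.masked_fill(mask == 0, -1e10)
print(F.softmax(attention, dim=1))
# approximately tensor([[0.6292, 0.2558, 0.1150, 0.0000, 0.0000]])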

  • Attention
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention, self).__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)

    def forward(self, hidden, enc_outputs, mask):
        # hidden = [batch_size, dec_hid_dim]
        # enc_outputs = [src_len, batch_size, enc_hid_dim * 2]
        batch_size = enc_outputs.shape[1]
        src_len = enc_outputs.shape[0]
        # repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        enc_outputs = enc_outputs.permute(1, 0, 2)
        # hidden = [batch_size, src_len, dec_hid_dim]
        # enc_outputs = [batch_size, src_len, enc_hid_dim * 2]
        energy = torch.tanh(self.attn(torch.cat((hidden, enc_outputs), dim=2)))
        # energy = [batch_size, src_len, dec_hid_dim]
        attention = self.v(energy).squeeze(2)
        # attention = [batch_size, src_len]
        attention = attention.masked_fill(mask == 0, -1e10)
        return F.softmax(attention, dim=1)

Decoder

The decoder only needs a few small changes. It needs to accept a mask over the source sentence and pass it on to the attention module. As we want to view the attention values during inference, we also return the attention tensor.

  • Decoder
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, enc_outputs, mask):
        # input = [batch_size]
        # hidden = [batch_size, dec_hid_dim]
        # enc_outputs = [src_len, batch_size, enc_hid_dim * 2]
        # mask = [batch_size, src_len]
        input = input.unsqueeze(0)
        # input = [1, batch_size]
        embedded = self.dropout(self.embedding(input))
        # embedded = [1, batch_size, emb_dim]
        a = self.attention(hidden, enc_outputs, mask)
        # a = [batch_size, src_len]
        a = a.unsqueeze(1)
        # a = [batch_size, 1, src_len]
        enc_outputs = enc_outputs.permute(1, 0, 2)
        # enc_outputs = [batch_size, src_len, enc_hid_dim * 2]
        weighted = torch.bmm(a, enc_outputs)
        # weighted = [batch_size, 1, enc_hid_dim * 2]
        weighted = weighted.permute(1, 0, 2)
        # weighted = [1, batch_size, enc_hid_dim * 2]
        rnn_input = torch.cat((weighted, embedded), dim=2)
        # rnn_input = [1, batch_size, enc_hid_dim * 2 + emb_dim]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        # output = [seq len, batch size, dec hid dim * n directions]
        # hidden = [n layers * n directions, batch size, dec hid dim]
        # seq len, n layers and n directions will always be 1 in this decoder, therefore:
        # output = [1, batch_size, dec_hid_dim]
        # hidden = [1, batch_size, dec_hid_dim]
        # this also means that output == hidden
        assert (output == hidden).all()
        embedded = embedded.squeeze(0)
        # embedded = [batch_size, emb_dim]
        output = output.squeeze(0)
        # output = [batch_size, dec_hid_dim]
        weighted = weighted.squeeze(0)
        # weighted = [batch_size, enc_hid_dim * 2]
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
        # prediction = [batch_size, output_dim]
        return prediction, hidden.squeeze(0), a.squeeze(1)

Seq2Seq

The overall Seq2Seq model also needs a few changes for packed padded sequences, masking and inference.

We need to tell it what the index of the pad token is, and we also pass the source sentence lengths as input to the forward method.

We use the pad token index to create the masks: the mask tensor is 1 wherever the source sentence element is not a pad token. This is all done within create_mask.

The sequence lengths are needed by the encoder in order to use packed padded sequences, so they are passed along to it.

The decoder also returns the attention values for each time-step; we discard them during training and evaluation, but store them later during inference.
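For intuition, here is a minimal sketch of what create_mask computes, using made-up token indices and assuming a pad index of 1:

import torch

src_pad_idx = 1  # hypothetical <pad> index for this toy example

# toy numericalized batch, [src len = 5, batch size = 2]; the second sentence has 3 real tokens
src = torch.tensor([[ 4,  9],
                    [12,  7],
                    [ 5,  1],
                    [ 8,  1],
                    [ 3,  1]])

mask = (src != src_pad_idx).permute(1, 0)
print(mask)
# tensor([[ True,  True,  True,  True,  True],
#         [ True,  True, False, False, False]])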

  • Seq2Seq
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device

    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask

    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        # src = [src_len, batch_size]
        # src_len = [batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        # encoder_outputs is all hidden states of the input sequence, back and forwards
        # hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
        # first input to the decoder is the <sos> tokens
        input = trg[0, :]
        mask = self.create_mask(src)
        # mask = [batch_size, src_len]
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden state, all encoder hidden states
            #  and mask
            # receive output tensor (predictions) and new hidden state
            output, hidden, _ = self.decoder(input, hidden, encoder_outputs, mask)
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = output.argmax(1)
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

Training the Seq2Seq Model

Next, we initialize the model and place it on the GPU.


INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

Then, we initialize the model parameters.

def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

model.apply(init_weights)

The model structure is as follows:

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

We print out the number of trainable parameters in the model, noticing that it is exactly the same as in the model without these improvements.

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 20,518,405 trainable parameters.

Then we define the optimizer and the criterion.

The ignore_index for the criterion needs to be the index of the pad token for the target language, not the source language.
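To see what ignore_index does, here is a tiny standalone example with random logits and a hypothetical pad index of 1; the padded target positions contribute nothing to the loss:

import torch
import torch.nn as nn

TRG_PAD_IDX = 1  # hypothetical pad index for this toy example

logits  = torch.randn(4, 10)                               # 4 target positions, vocabulary of 10
targets = torch.tensor([5, 2, TRG_PAD_IDX, TRG_PAD_IDX])   # the last two positions are padding

loss_ignoring_pad = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)(logits, targets)
loss_first_two    = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(torch.isclose(loss_ignoring_pad, loss_first_two))  # True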

optimizer = optim.Adam(model.parameters())
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

Next, we define our training and evaluation loops.

As we use include_lengths = True for our source field, batch.src is now a tuple: the first element is the numericalized tensor representing the sentences and the second element is the length of each sentence within the batch.

Our model also returns the attention vectors over the batch of source sentences for each decoding time-step. We won't use these during training/evaluation, but we will later, during inference.

  • Train
def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src, src_len = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, src_len, trg)
        # trg = [trg_len, batch_size]
        # output = [trg_len, batch_size, output_dim]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        # output = [(trg_len - 1) * batch_size, output_dim]
        trg = trg[1:].view(-1)
        # trg = [(trg_len - 1) * batch_size]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)
  • Evaluate
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_len = batch.src
            trg = batch.trg
            output = model(src, src_len, trg, 0)  # turn off teacher forcing
            # trg = [trg len, batch size]
            # output = [trg len, batch size, output dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            # trg = [(trg len - 1) * batch size]
            # output = [(trg len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

Then, we define a useful function for timing how long an epoch takes.

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

Next, we train our model. Notice how each epoch takes almost half the time of the previous model without these improvements.

N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut4-model.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Output:

Epoch: 01 | Time: 0m 32s
	Train Loss: 5.062 | Train PPL: 157.888
	 Val. Loss: 4.809 |  Val. PPL: 122.606
Epoch: 02 | Time: 0m 32s
	Train Loss: 4.084 | Train PPL:  59.374
	 Val. Loss: 4.108 |  Val. PPL:  60.819
Epoch: 03 | Time: 0m 32s
	Train Loss: 3.293 | Train PPL:  26.919
	 Val. Loss: 3.541 |  Val. PPL:  34.504
Epoch: 04 | Time: 0m 33s
	Train Loss: 2.808 | Train PPL:  16.583
	 Val. Loss: 3.320 |  Val. PPL:  27.670
Epoch: 05 | Time: 0m 33s
	Train Loss: 2.436 | Train PPL:  11.427
	 Val. Loss: 3.242 |  Val. PPL:  25.575
Epoch: 06 | Time: 0m 34s
	Train Loss: 2.159 | Train PPL:   8.659
	 Val. Loss: 3.273 |  Val. PPL:  26.389
Epoch: 07 | Time: 0m 32s
	Train Loss: 1.937 | Train PPL:   6.937
	 Val. Loss: 3.172 |  Val. PPL:  23.856
Epoch: 08 | Time: 0m 31s
	Train Loss: 1.732 | Train PPL:   5.651
	 Val. Loss: 3.231 |  Val. PPL:  25.297
Epoch: 09 | Time: 0m 31s
	Train Loss: 1.601 | Train PPL:   4.960
	 Val. Loss: 3.294 |  Val. PPL:  26.957
Epoch: 10 | Time: 0m 31s
	Train Loss: 1.491 | Train PPL:   4.441
	 Val. Loss: 3.278 |  Val. PPL:  26.535

Finally, we load the parameters from the checkpoint with the best validation loss and get our results on the test set.

We get an improved test perplexity whilst being almost twice as fast!

model.load_state_dict(torch.load('tut4-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

Output:

| Test Loss: 3.154 | Test PPL:  23.441 |

Inference

Now we can use our trained model to generate translations.

Note: these translations will be poor compared to the examples shown in the paper, which uses a hidden dimension of 1000 and trains for 4 days! The examples below have been cherry-picked to show what attention should look like on a sufficiently large model.

Our translate_sentence will do the following:

  • ensure our model is in evaluation mode, which it should always be for inference

  • tokenize the source sentence if it has not been tokenized (i.e. it is a string)

  • numericalize the source sentence

  • convert it to a tensor and add a batch dimension

  • get the length of the source sentence and convert it to a tensor

  • feed the source sentence into the encoder

  • create the mask for the source sentence

  • create a list to hold the output sentence, initialized with an <sos> token

  • create a tensor to hold the attention values

  • while we have not hit a maximum length

    • get the input tensor, which should be either <sos> or the last predicted token
    • feed the input, all encoder outputs, the hidden state and the mask into the decoder
    • store the attention values
    • get the predicted next token
    • add the prediction to the current output sentence prediction
    • break if the prediction was an <eos> token
  • convert the output sentence from indexes to tokens

  • return the output sentence (with the <sos> token removed) and the attention values over the sequence

def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):
    model.eval()
    if isinstance(sentence, str):
        nlp = spacy.load('de_core_news_sm')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)
    src_len = torch.LongTensor([len(src_indexes)]).to(device)
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)
    mask = model.create_mask(src_tensor)
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
        with torch.no_grad():
            output, hidden, attention = model.decoder(trg_tensor, hidden, encoder_outputs, mask)
        attentions[i] = attention
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:], attentions[:len(trg_tokens)-1]

Next, we make a function that displays the model's attention over the source sentence for each generated target token.

def display_attention(sentence, translation, attention):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111)
    attention = attention.squeeze(1).cpu().detach().numpy()
    cax = ax.matshow(attention, cmap='bone')
    ax.tick_params(labelsize=15)
    ax.set_xticklabels([''] + ['<sos>'] + [t.lower() for t in sentence] + ['<eos>'],
                       rotation=45)
    ax.set_yticklabels([''] + translation)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()
    plt.close()

Now we grab some translations from our dataset and see how well the model did. Note that we cherry-pick examples here so there is something interesting to look at, but feel free to change the example_idx value to look at different examples.

First, we get a source and target from the training set.

example_idx = 12

src = vars(train_data.examples[example_idx])['src']
trg = vars(train_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Output:

src = ['ein', 'schwarzer', 'hund', 'und', 'ein', 'gefleckter', 'hund', 'kämpfen', '.']
trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'are', 'fighting']

Then we use our translate_sentence function to get the predicted translation and the attention values. We show these graphically by placing the source sentence on the x-axis and the predicted translation on the y-axis. The lighter the square at the intersection of two words, the more attention the model paid to that source word when translating that target word.

Below is an example the model attempted to translate: it gets the translation correct except for changing are fighting to just fighting.

translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

Output:

predicted trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'fighting', '.', '<eos>']

Visualize the attention:

display_attention(src, translation, attention)


The model could simply have memorized translations from the training set, so we should also look at translations from the validation and test sets.

Starting with the validation set, let's look at an example.

example_idx = 14

src = vars(valid_data.examples[example_idx])['src']
trg = vars(valid_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Output:

src = ['eine', 'frau', 'spielt', 'ein', 'lied', 'auf', 'ihrer', 'geige', '.']
trg = ['a', 'female', 'playing', 'a', 'song', 'on', 'her', 'violin', '.']

Then let's generate our translation and view the attention.

Here, we can see the translation is identical except for swapping female for woman.

translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

display_attention(src, translation, attention)

Output:

predicted trg = ['a', 'woman', 'playing', 'a', 'song', 'on', 'her', 'violin', '.', '<eos>']


Finally, let's look at an example from the test set.

example_idx = 18

src = vars(test_data.examples[example_idx])['src']
trg = vars(test_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Output:

src = ['die', 'person', 'im', 'gestreiften', 'shirt', 'klettert', 'auf', 'einen', 'berg', '.']
trg = ['the', 'person', 'in', 'the', 'striped', 'shirt', 'is', 'mountain', 'climbing', '.']

Again, it produces a slightly different translation from the target, a more literal version of the source sentence. It swaps mountain climbing for climbing on a mountain.

translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

display_attention(src, translation, attention)

Output:

predicted trg = ['the', 'person', 'in', 'the', 'striped', 'shirt', 'is', 'climbing', 'on', 'a', 'mountain', '.', '<eos>']

BLEU

Previously we only cared about the loss/perplexity of the model. However, there are metrics designed specifically for measuring translation quality, the most popular of which is BLEU. Without going into too much detail, BLEU looks at the overlap of n-grams between the predicted sequence and the actual target sequence. It gives us a number between 0 and 1 for each sequence, where 1 means perfect overlap, i.e. a perfect translation, although it is usually shown scaled to between 0 and 100. BLEU was designed to work with multiple candidate translations per source sequence, but in this dataset we only have one candidate per source.
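As a quick sanity check of the metric itself (a toy example with made-up token lists, separate from the tutorial), torchtext's bleu_score takes a list of candidate token lists and a list of reference lists, where each candidate may have several references:

from torchtext.data.metrics import bleu_score

candidates = [['a', 'black', 'dog', 'is', 'running'],
              ['two', 'people', 'are', 'walking', 'outside']]
references = [[['a', 'black', 'dog', 'is', 'running']],
              [['two', 'people', 'are', 'walking', 'outside']]]

print(bleu_score(candidates, references))  # 1.0, since every candidate matches its reference exactly

A partial overlap gives a value below 1, and scores are usually reported multiplied by 100, which is what the code further below does.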

We define a calculate_bleu function which computes the BLEU score over a provided TorchText dataset. This function builds a corpus of actual and predicted translations for each source sentence and then calculates the BLEU score.

from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len = 50):
    trgs = []
    pred_trgs = []
    for datum in data:
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        pred_trg, _ = translate_sentence(src, src_field, trg_field, model, device, max_len)
        # cut off <eos> token
        pred_trg = pred_trg[:-1]
        pred_trgs.append(pred_trg)
        trgs.append([trg])
    return bleu_score(pred_trgs, trgs)

We get a BLEU score of around 29. If we compare this to the paper that the attention model is attempting to replicate, which reports a BLEU score of 26.75, our score is similar. However, they use a completely different dataset and a far larger model (1000 hidden dimensions, trained for 4 days!), so we cannot really compare against it either.

This number is not interpretable on its own; we really cannot say much about it. The most useful part of a BLEU score is that it can be used to compare different models on the same dataset, where the model with the higher BLEU score is "better".

bleu_score = calculate_bleu(test_data, SRC, TRG, model, device)

print(f'BLEU score = {bleu_score*100:.2f}')

Output:

BLEU score = 29.04

In the next tutorial we will move away from recurrent neural networks and start looking at other ways of building sequence-to-sequence models. Specifically, we will use a convolutional sequence-to-sequence model.

Complete Code

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Data preprocessing
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

def tokenize_de(text):
    """Tokenizes German text from a string into a list of strings"""
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """Tokenizes English text from a string into a list of strings"""
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize = tokenize_de,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True,
            include_lengths = True)

TRG = Field(tokenize = tokenize_en,
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))

SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,          # sort the contents of each batch
    sort_key = lambda x : len(x.src),  # sort by the length of the source sentence
    device = device)

# Build the model
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_len):
        # src = [src len, batch size]
        # src_len = [batch size]
        embedded = self.dropout(self.embedding(src))
        # embedded = [src len, batch size, emb dim]
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len)
        packed_outputs, hidden = self.rnn(packed_embedded)
        # packed_outputs is a packed sequence containing all hidden states
        # hidden is now from the final non-padded element in the batch
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
        # outputs is now a non-packed sequence, all hidden states obtained
        #  when the input is a pad token are all zeros
        # outputs = [src len, batch size, hid dim * num directions]
        # hidden = [n layers * num directions, batch size, hid dim]
        # hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # outputs are always from the last layer
        # hidden [-2, :, :] is the last of the forwards RNN
        # hidden [-1, :, :] is the last of the backwards RNN
        # initial decoder hidden is final hidden state of the forwards and backwards
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
        # outputs = [src len, batch size, enc hid dim * 2]
        # hidden = [batch size, dec hid dim]
        return outputs, hidden

class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super(Attention, self).__init__()
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias=False)

    def forward(self, hidden, enc_outputs, mask):
        # hidden = [batch_size, dec_hid_dim]
        # enc_outputs = [src_len, batch_size, enc_hid_dim * 2]
        batch_size = enc_outputs.shape[1]
        src_len = enc_outputs.shape[0]
        # repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        enc_outputs = enc_outputs.permute(1, 0, 2)
        # hidden = [batch_size, src_len, dec_hid_dim]
        # enc_outputs = [batch_size, src_len, enc_hid_dim * 2]
        energy = torch.tanh(self.attn(torch.cat((hidden, enc_outputs), dim=2)))
        # energy = [batch_size, src_len, dec_hid_dim]
        attention = self.v(energy).squeeze(2)
        # attention = [batch_size, src_len]
        attention = attention.masked_fill(mask == 0, -1e10)
        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.attention = attention
        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, enc_outputs, mask):
        # input = [batch_size]
        # hidden = [batch_size, dec_hid_dim]
        # enc_outputs = [src_len, batch_size, enc_hid_dim * 2]
        # mask = [batch_size, src_len]
        input = input.unsqueeze(0)
        # input = [1, batch_size]
        embedded = self.dropout(self.embedding(input))
        # embedded = [1, batch_size, emb_dim]
        a = self.attention(hidden, enc_outputs, mask)
        # a = [batch_size, src_len]
        a = a.unsqueeze(1)
        # a = [batch_size, 1, src_len]
        enc_outputs = enc_outputs.permute(1, 0, 2)
        # enc_outputs = [batch_size, src_len, enc_hid_dim * 2]
        weighted = torch.bmm(a, enc_outputs)
        # weighted = [batch_size, 1, enc_hid_dim * 2]
        weighted = weighted.permute(1, 0, 2)
        # weighted = [1, batch_size, enc_hid_dim * 2]
        rnn_input = torch.cat((weighted, embedded), dim=2)
        # rnn_input = [1, batch_size, enc_hid_dim * 2 + emb_dim]
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        # output = [seq len, batch size, dec hid dim * n directions]
        # hidden = [n layers * n directions, batch size, dec hid dim]
        # seq len, n layers and n directions will always be 1 in this decoder, therefore:
        # output = [1, batch_size, dec_hid_dim]
        # hidden = [1, batch_size, dec_hid_dim]
        # this also means that output == hidden
        assert (output == hidden).all()
        embedded = embedded.squeeze(0)
        # embedded = [batch_size, emb_dim]
        output = output.squeeze(0)
        # output = [batch_size, dec_hid_dim]
        weighted = weighted.squeeze(0)
        # weighted = [batch_size, enc_hid_dim * 2]
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim=1))
        # prediction = [batch_size, output_dim]
        return prediction, hidden.squeeze(0), a.squeeze(1)

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device

    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask

    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        # src = [src_len, batch_size]
        # src_len = [batch_size]
        # trg = [trg_len, batch_size]
        # teacher_forcing_ratio is probability to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        # encoder_outputs is all hidden states of the input sequence, back and forwards
        # hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
        # first input to the decoder is the <sos> tokens
        input = trg[0, :]
        mask = self.create_mask(src)
        # mask = [batch_size, src_len]
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden state, all encoder hidden states
            #  and mask
            # receive output tensor (predictions) and new hidden state
            output, hidden, _ = self.decoder(input, hidden, encoder_outputs, mask)
            # place predictions in a tensor holding predictions for each token
            outputs[t] = output
            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            # get the highest predicted token from our predictions
            top1 = output.argmax(1)
            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1
        return outputs

# Training
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

# initialize the model parameters
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)

model.apply(init_weights)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

# define our optimizer and criterion
optimizer = optim.Adam(model.parameters())

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

def train(model, iterator, optimizer, criterion, clip):
    model.train()
    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src, src_len = batch.src
        trg = batch.trg
        optimizer.zero_grad()
        output = model(src, src_len, trg)
        # trg = [trg_len, batch_size]
        # output = [trg_len, batch_size, output_dim]
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim)
        # output = [(trg_len - 1) * batch_size, output_dim]
        trg = trg[1:].view(-1)
        # trg = [(trg_len - 1) * batch_size]
        loss = criterion(output, trg)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src, src_len = batch.src
            trg = batch.trg
            output = model(src, src_len, trg, 0)  # turn off teacher forcing
            # trg = [trg len, batch size]
            # output = [trg len, batch size, output dim]
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            # trg = [(trg len - 1) * batch size]
            # output = [(trg len - 1) * batch size, output dim]
            loss = criterion(output, trg)
            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut4-model.pt')
    print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

# load the parameters from our best validation loss and get our results on the test set
model.load_state_dict(torch.load('tut4-model.pt'))
test_loss = evaluate(model, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):
    model.eval()
    if isinstance(sentence, str):
        nlp = spacy.load('de_core_news_sm')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)
    src_len = torch.LongTensor([len(src_indexes)]).to(device)
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)
    mask = model.create_mask(src_tensor)
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    for i in range(max_len):
        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
        with torch.no_grad():
            output, hidden, attention = model.decoder(trg_tensor, hidden, encoder_outputs, mask)
        attentions[i] = attention
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)
        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    return trg_tokens[1:], attentions[:len(trg_tokens) - 1]

def display_attention(sentence, translation, attention):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(111)
    attention = attention.squeeze(1).cpu().detach().numpy()
    cax = ax.matshow(attention, cmap='bone')
    ax.tick_params(labelsize=15)
    ax.set_xticklabels([''] + ['<sos>'] + [t.lower() for t in sentence] + ['<eos>'],
                       rotation=45)
    ax.set_yticklabels([''] + translation)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()
    plt.close()

# get a source and target from our dataset
example_idx = 12

src = vars(train_data.examples[example_idx])['src']
trg = vars(train_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')
# src = ['ein', 'schwarzer', 'hund', 'und', 'ein', 'gefleckter', 'hund', 'kämpfen', '.']
# trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'are', 'fighting']

translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(f'predicted trg = {translation}')
# predicted trg = ['a', 'black', 'dog', 'and', 'a', 'spotted', 'dog', 'fighting', '.', '<eos>']

display_attention(src, translation, attention)

# validation set
example_idx = 14

src = vars(valid_data.examples[example_idx])['src']
trg = vars(valid_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')
# src = ['eine', 'frau', 'spielt', 'ein', 'lied', 'auf', 'ihrer', 'geige', '.']
# trg = ['a', 'female', 'playing', 'a', 'song', 'on', 'her', 'violin', '.']

translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(f'predicted trg = {translation}')
# predicted trg = ['a', 'woman', 'playing', 'a', 'song', 'on', 'her', 'violin', '.', '<eos>']

display_attention(src, translation, attention)

# test set
example_idx = 18

src = vars(test_data.examples[example_idx])['src']
trg = vars(test_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')
# src = ['die', 'person', 'im', 'gestreiften', 'shirt', 'klettert', 'auf', 'einen', 'berg', '.']
# trg = ['the', 'person', 'in', 'the', 'striped', 'shirt', 'is', 'mountain', 'climbing', '.']

translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(f'predicted trg = {translation}')
# predicted trg = ['the', 'person', 'in', 'the', 'striped', 'shirt', 'is', 'climbing', 'on', 'a', 'mountain', '.', '<eos>']

display_attention(src, translation, attention)

from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len=50):
    trgs = []
    pred_trgs = []
    for datum in data:
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        pred_trg, _ = translate_sentence(src, src_field, trg_field, model, device, max_len)
        # cut off <eos> token
        pred_trg = pred_trg[:-1]
        pred_trgs.append(pred_trg)
        trgs.append([trg])
    return bleu_score(pred_trgs, trgs)

bleu_score = calculate_bleu(test_data, SRC, TRG, model, device)
print(f'BLEU score = {bleu_score*100:.2f}')
# BLEU score = 29.04
