基于seq2seq做一个机器翻译

我们将使用PyTorch和TorchText构建一个机器学习模型，从一个序列到另一个序列。将德语到英语翻译成英语

该模型是《Sequence to Sequence Learning with Neural Networks》这篇论文的Pytorch实现

使用Encoder生成上下文向量

使用Decoder预测目标语言句子

步骤：

1)准备数据

2)创建Seq2Seq模型

3)训练模型

4)验证模型

准备数据

import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIteratorimport spacy
import random
import math
import time#1、preparing data
#设置一个随机种子
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  #使用cuda保证每次结果一样。将这个 flag 置为True的话，每次返回的卷积算法将是确定的，即默认算法。如果配合上设置 Torch 的随机种子为固定值的话，应该可以保证每次运行网络的时候相同输入的输出是固定的#创建tokenizer
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')#把tokenizer从一串字符转成一个list，同时做一个reverse取反
#在原论文中，作者发现颠倒源语言的输入的顺序可以取得不错的翻译效果，例如，一句话为“good morning!”，颠倒顺序分词后变为"!", “morning”, 和"good"。#将德语进行分词并颠倒顺序
def tokenize_de(text):return [tok.text for tok in spacy_de.tokenizer(text)][::-1]
#将英语进行分词，不颠倒顺序
def tokenize_en(text):return [tok.text for tok in spacy_en.tokenizer(text)]#我们创建SRC和TRG两个Field对象，tokenize为我们刚才定义的分词器函数，在每句话的开头加入字符SOS，结尾加入字符EOS，将所有单词转换为小写。#TorchText的Field定义数据应该如何被处理
#SRC即source，是德语
#TRG即target，是英语
#sos是start of sequence, eos是end of sequence
#lower=True是将所有单词转换为小写
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True)TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True)#使用torchtext自带的Multi30k数据集，这是一个包含约30000个平行的英语、德语和法语句子的数据集，每个句子包含约12个单词。
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))#查看一下加载完的数据集
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")#看一下生成的第一个训练样本，可以看到源语言的顺序已经颠倒了
print(vars(train_data.examples[0]))#构建词表
#所谓构建词表，即需要给每个单词编码，也就是用数字表示每个单词，这样才能传入模型。
#可以使用dataset类中的build_vocab()方法传入用于构建词表的数据集。
#注意，源语言和目标语言的词表是不同的，而且词表应该只从训练集构建，而不是验证/测试集，这可以防止“信息泄漏”到模型中。
SRC.build_vocab(train_data, min_freq = 2) #设置最小词频为2，当一个单词在数据集中出现次数小于2时会被转换为<unk>字符。
TRG.build_vocab(train_data, min_freq = 2)#查看一下生成的词表大小
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")#指定GPU还是CPU进行训练
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128#创建迭代器
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),batch_size = BATCH_SIZE,device = device)#查看一下生成的batch
batch = next(iter(train_iterator))
print(batch)
#输出:
# [torchtext.data.batch.Batch of size 128 from MULTI30K]
#   [.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
#   [.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]

建立Seq2Seq模型

我们将分别创建编码器（Encoder）、解码器（Eecoder）和seq2seq模型。

encoder

原论文使用了一个4层的单向LSTM，出于训练时间的考虑，我们将其缩减到了2层。结构如图所示

参数说明：

input_dim：输入编码器的one-hot向量的维度，等于源语言词汇表的大小。
emb_dim：embedding层的维度
hid_dim：隐藏层h和c的维度
n_layers：LSTM网络的层数
dropout：如果非零的话，将会在LSTM的输出上加个dropout，最后一层除外。

在forward函数中，我们传入源语言src，经过embedding层将其转换为密集向量，然后应用dropout，然后将这些词嵌入传递到LSTM。读者可能注意到，我们没有将初始隐藏状态h_0和单元格状态c_0传递给LSTM。这是因为如果没有向LSTM传递隐藏/单元格状态，它将自动创建一个全0的张量作为初始状态。

decoder

decoder网络同样是一个2层的LSTM（原论文中为4层），结构如图所示：

Decoder网络的参数和初始化类似于Encoder，不同的地方在于：

output_dim：输入解码器的one-hot向量维度，等于目标语言词汇表的大小。
添加了Linear层，用于预测最终输出。

在forward函数中，我们接受目标语言trg作为输入数据，由于目标语言每次是输入一个词（源语言每次输入一句话），因此用unsqueeze()方法给为其添加一个句子长度为1的维度（即将一维变为二维，以便能够作为embedding层的输入）。

然后，与编码器类似，我们通过一个embedding层并应用dropout。然后，将这些
嵌入与Encoder层生成的隐藏状态h_n和单元格状态c_n一起传递到LSTM。注意：在Encoder中，我们使用了一个全0的张量作为初始隐藏状态h_0和单元格状态c_0，在Decoder中，我们使用的是Encoder生成的h_n和c_n作为初始的隐藏状态和单元格状态，这就相当于我们在翻译时使用了源语言的上下文信息。

Seq2Seq网络

Seq2Seq网络将Encoder和Decoder网络组合在，实现以下功能：

使用源语言句子作为输入
使用Encoder生成上下文向量
使用Decoder预测目标语言句子

参数说明：

device ：把张量放到GPU上。新版的Pytorch使用to方法可以容易地将对象移动到不同的设备（代替以前的cpu()或cuda()方法）。
outputs：存储Decoder所有输出的张量
teacher_forcing_ratio：该参数的作用是，当使用teacher force时，decoder网络的下一个input是目标语言的下一个字符，当不使用时，网络的下一个input是其预测出的那个字符。

在该网络中，编码器和解码器的层数（n_layers）和隐藏层维度（hid_dim）是相等。但是，在其他的Seq2seq模型中不一定总是需要相同的层数或相同的隐藏维度大小。例如，编码器有2层，解码器只有1层，这就需要进行相应的处理，如对编码器输出的两个上下文向量求平均值，或者只使用最后一层的上下文向量作为解码器的输入等。

import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIteratorimport spacy
import random
import math
import time#1、preparing data
#设置一个随机种子
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  #使用cuda保证每次结果一样。将这个 flag 置为True的话，每次返回的卷积算法将是确定的，即默认算法。如果配合上设置 Torch 的随机种子为固定值的话，应该可以保证每次运行网络的时候相同输入的输出是固定的#创建tokenizer
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')#把tokenizer从一串字符转成一个list，同时做一个reverse取反
#在原论文中，作者发现颠倒源语言的输入的顺序可以取得不错的翻译效果，例如，一句话为“good morning!”，颠倒顺序分词后变为"!", “morning”, 和"good"。#将德语进行分词并颠倒顺序
def tokenize_de(text):return [tok.text for tok in spacy_de.tokenizer(text)][::-1]
#将英语进行分词，不颠倒顺序
def tokenize_en(text):return [tok.text for tok in spacy_en.tokenizer(text)]#我们创建SRC和TRG两个Field对象，tokenize为我们刚才定义的分词器函数，在每句话的开头加入字符SOS，结尾加入字符EOS，将所有单词转换为小写。#TorchText的Field定义数据应该如何被处理
#SRC即source，是德语
#TRG即target，是英语
#sos是start of sequence, eos是end of sequence
#lower=True是将所有单词转换为小写
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True)TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True)#使用torchtext自带的Multi30k数据集，这是一个包含约30000个平行的英语、德语和法语句子的数据集，每个句子包含约12个单词。
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))#查看一下加载完的数据集
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")#看一下生成的第一个训练样本，可以看到源语言的顺序已经颠倒了
print(vars(train_data.examples[0]))#构建词表
#所谓构建词表，即需要给每个单词编码，也就是用数字表示每个单词，这样才能传入模型。
#可以使用dataset类中的build_vocab()方法传入用于构建词表的数据集。
#注意，源语言和目标语言的词表是不同的，而且词表应该只从训练集构建，而不是验证/测试集，这可以防止“信息泄漏”到模型中。
SRC.build_vocab(train_data, min_freq = 2) #设置最小词频为2，当一个单词在数据集中出现次数小于2时会被转换为<unk>字符。
TRG.build_vocab(train_data, min_freq = 2)#查看一下生成的词表大小
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")#指定GPU还是CPU进行训练
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128#创建迭代器
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),batch_size = BATCH_SIZE,device = device)#查看一下生成的batch
batch = next(iter(train_iterator))
print(batch)
#输出:
# [torchtext.data.batch.Batch of size 128 from MULTI30K]
#   [.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
#   [.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]#2、创建Seq2Seq模型#我们将分别创建编码器（Encoder）、解码器（Eecoder）和seq2seq模型。
#原论文使用了一个4层的单向LSTM，出于训练时间的考虑，我们将其缩减到了2层。结构如图所示class Encoder(nn.Module):def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.input_dim = input_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(input_dim, emb_dim)#encoder部分self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.dropout = nn.Dropout(dropout)def forward(self, src):#src:(sent_len, batch_size)embedded = self.dropout(self.embedding(src))#embedded:(sent_len, batch_size, emb_dim)outputs, (hidden, cell) = self.rnn(embedded)#outputs:(sent_len, batch_size, hid_dim)#hidden:(n_layers, batch_size, hid_dim)#cell:(n_layers, batch_size, hid_dim)return hidden, cellclass Decoder(nn.Module):def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.output_dim = output_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(output_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.out = nn.Linear(hid_dim, output_dim)self.dropout = nn.Dropout(dropout)def forward(self, input, hidden, cell):# input: (batch_size) -> input: (1, batch_size)input = input.unsqueeze(0)# embedded: (1, batch_size, emb_dim)embedded = self.dropout(self.embedding(input))# hidden: (n_layers, batch size, hid_dim)# cell: (n_layers, batch size, hid_dim)# output(1, batch_size, hid_dim)output, (hidden, cell) = self.rnn(embedded, (hidden, cell))# prediction: (batch_size, output_dim)prediction = self.out(output.squeeze(0))return prediction, hidden, cellclass Seq2Seq(nn.Module):def __init__(self, encoder, decoder, device):super().__init__()self.encoder = encoderself.decoder = decoderself.device = deviceassert encoder.hid_dim == decoder.hid_dim, \"Hidden dimensions of encoder and decoder must be equal!"assert encoder.n_layers == decoder.n_layers, \"Encoder and decoder must have equal number of layers!"def forward(self, src, trg, teacher_forcing_ratio=0.5):# src = [src sent len, batch size]# trg = [trg sent len, batch size]# teacher_forcing_ratio is probability to use teacher forcing# e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the timebatch_size = trg.shape[1]max_len = trg.shape[0]trg_vocab_size = self.decoder.output_dim# tensor to store decoder outputs#创建outputs张量存储Decoder的输出outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)# last hidden state of the encoder is used as the initial hidden state of the decoderhidden, cell = self.encoder(src)#输入到Decoder网络的第一个字符是<sos>（句子开始标记）input = trg[0, :]for t in range(1, max_len):# insert input token embedding, previous hidden and previous cell states# receive output tensor (predictions) and new hidden and cell statesoutput, hidden, cell = self.decoder(input, hidden, cell)# place predictions in a tensor holding predictions for each tokenoutputs[t] = output# decide if we are going to use teacher forcing or notteacher_force = random.random() < teacher_forcing_ratio# get the highest predicted token from our predictionstop1 = output.argmax(1)# if teacher forcing, use actual next token as next input# if not, use predicted tokeninput = trg[t] if teacher_force else top1return outputs

训练模型

初始化模型的参数。在原论文中，作者将所有参数初始化为-0.08和+0.08之间的均匀分布。我们通过创建一个函数来初始化模型中的参数权重。当使用apply方法时，模型中的每个模块和子模块都会调用init_weights函数。

在定义训练函数中

model.train() : 让model变为训练模式，启用 batch normalization（本模型未使用）和 Dropout。
clip_grad_norm: 进行梯度裁剪，防止梯度爆炸。clip：梯度阈值
view函数: 减少output和trg的维度以便进行loss计算。由于trg每句话的开头都是标记符sos，为了提高准确度，output和trg的第一列将不参与计算损失。

在定义测试函数中

model.eval()：开启测试模式，关闭batch normalization（本模型未使用）和 dropout。
torch.no_grad()：关闭autograd 引擎（不会进行反向传播计算），这样的好处是减少内存的使用并且加速计算。
teacher_forcing_ratio = 0：在测试阶段须要关闭teacher forcing，保证模型使用预测的结果作为下一步的输入。

import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIteratorimport spacy
import random
import math
import time#1、preparing data
#设置一个随机种子
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  #使用cuda保证每次结果一样。将这个 flag 置为True的话，每次返回的卷积算法将是确定的，即默认算法。如果配合上设置 Torch 的随机种子为固定值的话，应该可以保证每次运行网络的时候相同输入的输出是固定的#创建tokenizer
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')#把tokenizer从一串字符转成一个list，同时做一个reverse取反
#在原论文中，作者发现颠倒源语言的输入的顺序可以取得不错的翻译效果，例如，一句话为“good morning!”，颠倒顺序分词后变为"!", “morning”, 和"good"。#将德语进行分词并颠倒顺序
def tokenize_de(text):return [tok.text for tok in spacy_de.tokenizer(text)][::-1]
#将英语进行分词，不颠倒顺序
def tokenize_en(text):return [tok.text for tok in spacy_en.tokenizer(text)]#我们创建SRC和TRG两个Field对象，tokenize为我们刚才定义的分词器函数，在每句话的开头加入字符SOS，结尾加入字符EOS，将所有单词转换为小写。#TorchText的Field定义数据应该如何被处理
#SRC即source，是德语
#TRG即target，是英语
#sos是start of sequence, eos是end of sequence
#lower=True是将所有单词转换为小写
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True)TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True)#使用torchtext自带的Multi30k数据集，这是一个包含约30000个平行的英语、德语和法语句子的数据集，每个句子包含约12个单词。
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))#查看一下加载完的数据集
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")#看一下生成的第一个训练样本，可以看到源语言的顺序已经颠倒了
print(vars(train_data.examples[0]))#构建词表
#所谓构建词表，即需要给每个单词编码，也就是用数字表示每个单词，这样才能传入模型。
#可以使用dataset类中的build_vocab()方法传入用于构建词表的数据集。
#注意，源语言和目标语言的词表是不同的，而且词表应该只从训练集构建，而不是验证/测试集，这可以防止“信息泄漏”到模型中。
SRC.build_vocab(train_data, min_freq = 2) #设置最小词频为2，当一个单词在数据集中出现次数小于2时会被转换为<unk>字符。
TRG.build_vocab(train_data, min_freq = 2)#查看一下生成的词表大小
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")#指定GPU还是CPU进行训练
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128#创建迭代器
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),batch_size = BATCH_SIZE,device = device)#查看一下生成的batch
batch = next(iter(train_iterator))
print(batch)
#输出:
# [torchtext.data.batch.Batch of size 128 from MULTI30K]
#   [.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
#   [.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]#2、创建Seq2Seq模型#我们将分别创建编码器（Encoder）、解码器（Eecoder）和seq2seq模型。
#原论文使用了一个4层的单向LSTM，出于训练时间的考虑，我们将其缩减到了2层。结构如图所示class Encoder(nn.Module):def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.input_dim = input_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(input_dim, emb_dim)#encoder部分self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.dropout = nn.Dropout(dropout)def forward(self, src):#src:(sent_len, batch_size)embedded = self.dropout(self.embedding(src))#embedded:(sent_len, batch_size, emb_dim)outputs, (hidden, cell) = self.rnn(embedded)#outputs:(sent_len, batch_size, hid_dim)#hidden:(n_layers, batch_size, hid_dim)#cell:(n_layers, batch_size, hid_dim)return hidden, cellclass Decoder(nn.Module):def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.output_dim = output_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(output_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.out = nn.Linear(hid_dim, output_dim)self.dropout = nn.Dropout(dropout)def forward(self, input, hidden, cell):# input: (batch_size) -> input: (1, batch_size)input = input.unsqueeze(0)# embedded: (1, batch_size, emb_dim)embedded = self.dropout(self.embedding(input))# hidden: (n_layers, batch size, hid_dim)# cell: (n_layers, batch size, hid_dim)# output(1, batch_size, hid_dim)output, (hidden, cell) = self.rnn(embedded, (hidden, cell))# prediction: (batch_size, output_dim)prediction = self.out(output.squeeze(0))return prediction, hidden, cellclass Seq2Seq(nn.Module):def __init__(self, encoder, decoder, device):super().__init__()self.encoder = encoderself.decoder = decoderself.device = deviceassert encoder.hid_dim == decoder.hid_dim, \"Hidden dimensions of encoder and decoder must be equal!"assert encoder.n_layers == decoder.n_layers, \"Encoder and decoder must have equal number of layers!"def forward(self, src, trg, teacher_forcing_ratio=0.5):# src = [src sent len, batch size]# trg = [trg sent len, batch size]# teacher_forcing_ratio is probability to use teacher forcing# e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the timebatch_size = trg.shape[1]max_len = trg.shape[0]trg_vocab_size = self.decoder.output_dim# tensor to store decoder outputs#创建outputs张量存储Decoder的输出outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)# last hidden state of the encoder is used as the initial hidden state of the decoderhidden, cell = self.encoder(src)#输入到Decoder网络的第一个字符是<sos>（句子开始标记）input = trg[0, :]for t in range(1, max_len):# insert input token embedding, previous hidden and previous cell states# receive output tensor (predictions) and new hidden and cell statesoutput, hidden, cell = self.decoder(input, hidden, cell)# place predictions in a tensor holding predictions for each tokenoutputs[t] = output# decide if we are going to use teacher forcing or notteacher_force = random.random() < teacher_forcing_ratio# get the highest predicted token from our predictionstop1 = output.argmax(1)# if teacher forcing, use actual next token as next input# if not, use predicted tokeninput = trg[t] if teacher_force else top1return outputs#3、训练模型#定义模型参数
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5#编码器和解码器的嵌入层维度（emb_dim）和dropout可以不同，但是层数和隐藏层维度必须相同。
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)  #7855 256 512 2 0.5
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT) #5893 256 512 2 0.5model = Seq2Seq(enc, dec, device).to(device)#初始化模型参数
#在原论文中，作者将所有参数初始化为-0.08和+0.08之间的均匀分布。我们通过创建一个函数来初始化模型中的参数权重。当使用apply方法时，模型中的每个模块和子模块都会调用init_weights函数。
def init_weights(m):for name, param in m.named_parameters():nn.init.uniform_(param.data, -0.08, 0.08)model.apply(init_weights)#看一下模型中可训练参数的总数量
def count_parameters(model):return sum(p.numel() for p in model.parameters() if p.requires_grad)print(f'The model has {count_parameters(model):,} trainable parameters')#使用Adam作为优化器
optimizer = optim.Adam(model.parameters())#使用交叉熵损失作为损失函数
#使用交叉熵损失作为损失函数，由于Pytorch在计算交叉熵损失时在一个batch内求平均，因此需要忽略target为的值（在数据处理阶段，一个batch里的所有句子都padding到了相同的长度，不足的用补齐），否则将影响梯度的计算
PAD_IDX = TRG.vocab.stoi['<pad>']criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)#定义训练函数
def train(model, iterator, optimizer, criterion, clip):   #criterion是损失函数model.train()epoch_loss = 0for i, batch in enumerate(iterator):#这里的src和trg都是tensor的形式了src = batch.srctrg = batch.trgoptimizer.zero_grad()output = model(src, trg)# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)loss.backward()torch.nn.utils.clip_grad_norm_(model.parameters(), clip)optimizer.step()epoch_loss += loss.item()return epoch_loss / len(iterator)#定义验证函数，即val的
#评估阶段和训练阶段的区别是不需要更新任何参数
def evaluate(model, iterator, criterion):model.eval()epoch_loss = 0with torch.no_grad():for i, batch in enumerate(iterator):src = batch.srctrg = batch.trgoutput = model(src, trg, 0)  # turn off teacher forcing# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)epoch_loss += loss.item()return epoch_loss / len(iterator)def epoch_time(start_time, end_time):elapsed_time = end_time - start_timeelapsed_mins = int(elapsed_time / 60)elapsed_secs = int(elapsed_time - (elapsed_mins * 60))return elapsed_mins, elapsed_secs#训练模型
N_EPOCHS = 10
CLIP = 1best_valid_loss = float('inf')for epoch in range(N_EPOCHS):start_time = time.time()train_loss = train(model, train_iterator, optimizer, criterion, CLIP)valid_loss = evaluate(model, valid_iterator, criterion)end_time = time.time()epoch_mins, epoch_secs = epoch_time(start_time, end_time)if valid_loss < best_valid_loss:best_valid_loss = valid_losstorch.save(model.state_dict(), 'tut1-model.pt')   #保存最佳验证损失的epoch参数作为模型的最终参数print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s')print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')#math.exp()：使用一个batch内的平均损失计算困惑度

验证模型

import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIteratorimport spacy
import random
import math
import time#1、preparing data
#设置一个随机种子
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  #使用cuda保证每次结果一样。将这个 flag 置为True的话，每次返回的卷积算法将是确定的，即默认算法。如果配合上设置 Torch 的随机种子为固定值的话，应该可以保证每次运行网络的时候相同输入的输出是固定的#创建tokenizer
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')#把tokenizer从一串字符转成一个list，同时做一个reverse取反
#在原论文中，作者发现颠倒源语言的输入的顺序可以取得不错的翻译效果，例如，一句话为“good morning!”，颠倒顺序分词后变为"!", “morning”, 和"good"。#将德语进行分词并颠倒顺序
def tokenize_de(text):return [tok.text for tok in spacy_de.tokenizer(text)][::-1]
#将英语进行分词，不颠倒顺序
def tokenize_en(text):return [tok.text for tok in spacy_en.tokenizer(text)]#我们创建SRC和TRG两个Field对象，tokenize为我们刚才定义的分词器函数，在每句话的开头加入字符SOS，结尾加入字符EOS，将所有单词转换为小写。#TorchText的Field定义数据应该如何被处理
#SRC即source，是德语
#TRG即target，是英语
#sos是start of sequence, eos是end of sequence
#lower=True是将所有单词转换为小写
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True)TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True)#使用torchtext自带的Multi30k数据集，这是一个包含约30000个平行的英语、德语和法语句子的数据集，每个句子包含约12个单词。
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))#查看一下加载完的数据集
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")#看一下生成的第一个训练样本，可以看到源语言的顺序已经颠倒了
print(vars(train_data.examples[0]))#构建词表
#所谓构建词表，即需要给每个单词编码，也就是用数字表示每个单词，这样才能传入模型。
#可以使用dataset类中的build_vocab()方法传入用于构建词表的数据集。
#注意，源语言和目标语言的词表是不同的，而且词表应该只从训练集构建，而不是验证/测试集，这可以防止“信息泄漏”到模型中。
SRC.build_vocab(train_data, min_freq = 2) #设置最小词频为2，当一个单词在数据集中出现次数小于2时会被转换为<unk>字符。
TRG.build_vocab(train_data, min_freq = 2)#查看一下生成的词表大小
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")#指定GPU还是CPU进行训练
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128#创建迭代器
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),batch_size = BATCH_SIZE,device = device)#查看一下生成的batch
batch = next(iter(train_iterator))
print(batch)
#输出:
# [torchtext.data.batch.Batch of size 128 from MULTI30K]
#   [.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
#   [.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]#2、创建Seq2Seq模型#我们将分别创建编码器（Encoder）、解码器（Eecoder）和seq2seq模型。
#原论文使用了一个4层的单向LSTM，出于训练时间的考虑，我们将其缩减到了2层。结构如图所示class Encoder(nn.Module):def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.input_dim = input_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(input_dim, emb_dim)#encoder部分self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.dropout = nn.Dropout(dropout)def forward(self, src):#src:(sent_len, batch_size)embedded = self.dropout(self.embedding(src))#embedded:(sent_len, batch_size, emb_dim)outputs, (hidden, cell) = self.rnn(embedded)#outputs:(sent_len, batch_size, hid_dim)#hidden:(n_layers, batch_size, hid_dim)#cell:(n_layers, batch_size, hid_dim)return hidden, cellclass Decoder(nn.Module):def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.output_dim = output_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(output_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.out = nn.Linear(hid_dim, output_dim)self.dropout = nn.Dropout(dropout)def forward(self, input, hidden, cell):# input: (batch_size) -> input: (1, batch_size)input = input.unsqueeze(0)# embedded: (1, batch_size, emb_dim)embedded = self.dropout(self.embedding(input))# hidden: (n_layers, batch size, hid_dim)# cell: (n_layers, batch size, hid_dim)# output(1, batch_size, hid_dim)output, (hidden, cell) = self.rnn(embedded, (hidden, cell))# prediction: (batch_size, output_dim)prediction = self.out(output.squeeze(0))return prediction, hidden, cellclass Seq2Seq(nn.Module):def __init__(self, encoder, decoder, device):super().__init__()self.encoder = encoderself.decoder = decoderself.device = deviceassert encoder.hid_dim == decoder.hid_dim, \"Hidden dimensions of encoder and decoder must be equal!"assert encoder.n_layers == decoder.n_layers, \"Encoder and decoder must have equal number of layers!"def forward(self, src, trg, teacher_forcing_ratio=0.5):# src = [src sent len, batch size]# trg = [trg sent len, batch size]# teacher_forcing_ratio is probability to use teacher forcing# e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the timebatch_size = trg.shape[1]max_len = trg.shape[0]trg_vocab_size = self.decoder.output_dim# tensor to store decoder outputs#创建outputs张量存储Decoder的输出outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)# last hidden state of the encoder is used as the initial hidden state of the decoderhidden, cell = self.encoder(src)#输入到Decoder网络的第一个字符是<sos>（句子开始标记）input = trg[0, :]for t in range(1, max_len):# insert input token embedding, previous hidden and previous cell states# receive output tensor (predictions) and new hidden and cell statesoutput, hidden, cell = self.decoder(input, hidden, cell)# place predictions in a tensor holding predictions for each tokenoutputs[t] = output# decide if we are going to use teacher forcing or notteacher_force = random.random() < teacher_forcing_ratio# get the highest predicted token from our predictionstop1 = output.argmax(1)# if teacher forcing, use actual next token as next input# if not, use predicted tokeninput = trg[t] if teacher_force else top1return outputs#3、训练模型#定义模型参数
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5#编码器和解码器的嵌入层维度（emb_dim）和dropout可以不同，但是层数和隐藏层维度必须相同。
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)  #7855 256 512 2 0.5
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT) #5893 256 512 2 0.5model = Seq2Seq(enc, dec, device).to(device)#初始化模型参数
#在原论文中，作者将所有参数初始化为-0.08和+0.08之间的均匀分布。我们通过创建一个函数来初始化模型中的参数权重。当使用apply方法时，模型中的每个模块和子模块都会调用init_weights函数。
def init_weights(m):for name, param in m.named_parameters():nn.init.uniform_(param.data, -0.08, 0.08)model.apply(init_weights)#看一下模型中可训练参数的总数量
def count_parameters(model):return sum(p.numel() for p in model.parameters() if p.requires_grad)print(f'The model has {count_parameters(model):,} trainable parameters')#使用Adam作为优化器
optimizer = optim.Adam(model.parameters())#使用交叉熵损失作为损失函数
#使用交叉熵损失作为损失函数，由于Pytorch在计算交叉熵损失时在一个batch内求平均，因此需要忽略target为的值（在数据处理阶段，一个batch里的所有句子都padding到了相同的长度，不足的用补齐），否则将影响梯度的计算
PAD_IDX = TRG.vocab.stoi['<pad>']criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)#定义训练函数
def train(model, iterator, optimizer, criterion, clip):   #criterion是损失函数model.train()epoch_loss = 0for i, batch in enumerate(iterator):#这里的src和trg都是tensor的形式了src = batch.srctrg = batch.trgoptimizer.zero_grad()output = model(src, trg)# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)loss.backward()torch.nn.utils.clip_grad_norm_(model.parameters(), clip)optimizer.step()epoch_loss += loss.item()return epoch_loss / len(iterator)#定义验证函数，即val的
#评估阶段和训练阶段的区别是不需要更新任何参数
def evaluate(model, iterator, criterion):model.eval()epoch_loss = 0with torch.no_grad():for i, batch in enumerate(iterator):src = batch.srctrg = batch.trgoutput = model(src, trg, 0)  # turn off teacher forcing# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)epoch_loss += loss.item()return epoch_loss / len(iterator)def epoch_time(start_time, end_time):elapsed_time = end_time - start_timeelapsed_mins = int(elapsed_time / 60)elapsed_secs = int(elapsed_time - (elapsed_mins * 60))return elapsed_mins, elapsed_secs#训练模型
N_EPOCHS = 10
CLIP = 1best_valid_loss = float('inf')for epoch in range(N_EPOCHS):start_time = time.time()train_loss = train(model, train_iterator, optimizer, criterion, CLIP)valid_loss = evaluate(model, valid_iterator, criterion)end_time = time.time()epoch_mins, epoch_secs = epoch_time(start_time, end_time)if valid_loss < best_valid_loss:best_valid_loss = valid_losstorch.save(model.state_dict(), 'tut1-model.pt')   #保存最佳验证损失的epoch参数作为模型的最终参数print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s')print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')#math.exp()：使用一个batch内的平均损失计算困惑度#4、验证模型
model.load_state_dict(torch.load('tut1-model.pt'))test_loss = evaluate(model, test_iterator, criterion)print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

最终

import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIteratorimport spacy
import random
import math
import time#1、preparing data
#设置一个随机种子
SEED = 1234
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True  #使用cuda保证每次结果一样。将这个 flag 置为True的话，每次返回的卷积算法将是确定的，即默认算法。如果配合上设置 Torch 的随机种子为固定值的话，应该可以保证每次运行网络的时候相同输入的输出是固定的#创建tokenizer
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')#把tokenizer从一串字符转成一个list，同时做一个reverse取反
#在原论文中，作者发现颠倒源语言的输入的顺序可以取得不错的翻译效果，例如，一句话为“good morning!”，颠倒顺序分词后变为"!", “morning”, 和"good"。#将德语进行分词并颠倒顺序
def tokenize_de(text):return [tok.text for tok in spacy_de.tokenizer(text)][::-1]
#将英语进行分词，不颠倒顺序
def tokenize_en(text):return [tok.text for tok in spacy_en.tokenizer(text)]#我们创建SRC和TRG两个Field对象，tokenize为我们刚才定义的分词器函数，在每句话的开头加入字符SOS，结尾加入字符EOS，将所有单词转换为小写。#TorchText的Field定义数据应该如何被处理
#SRC即source，是德语
#TRG即target，是英语
#sos是start of sequence, eos是end of sequence
#lower=True是将所有单词转换为小写
SRC = Field(tokenize = tokenize_de,init_token = '<sos>',eos_token = '<eos>',lower = True)TRG = Field(tokenize = tokenize_en,init_token = '<sos>',eos_token = '<eos>',lower = True)#使用torchtext自带的Multi30k数据集，这是一个包含约30000个平行的英语、德语和法语句子的数据集，每个句子包含约12个单词。
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),fields = (SRC, TRG))#查看一下加载完的数据集
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")#看一下生成的第一个训练样本，可以看到源语言的顺序已经颠倒了
print(vars(train_data.examples[0]))#构建词表
#所谓构建词表，即需要给每个单词编码，也就是用数字表示每个单词，这样才能传入模型。
#可以使用dataset类中的build_vocab()方法传入用于构建词表的数据集。
#注意，源语言和目标语言的词表是不同的，而且词表应该只从训练集构建，而不是验证/测试集，这可以防止“信息泄漏”到模型中。
SRC.build_vocab(train_data, min_freq = 2) #设置最小词频为2，当一个单词在数据集中出现次数小于2时会被转换为<unk>字符。
TRG.build_vocab(train_data, min_freq = 2)#查看一下生成的词表大小
print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")#指定GPU还是CPU进行训练
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128#创建迭代器
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data),batch_size = BATCH_SIZE,device = device)#查看一下生成的batch
batch = next(iter(train_iterator))
print(batch)
#输出:
# [torchtext.data.batch.Batch of size 128 from MULTI30K]
#   [.src]:[torch.cuda.LongTensor of size 23x128 (GPU 0)]
#   [.trg]:[torch.cuda.LongTensor of size 21x128 (GPU 0)]#2、创建Seq2Seq模型#我们将分别创建编码器（Encoder）、解码器（Eecoder）和seq2seq模型。
#原论文使用了一个4层的单向LSTM，出于训练时间的考虑，我们将其缩减到了2层。结构如图所示class Encoder(nn.Module):def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.input_dim = input_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(input_dim, emb_dim)#encoder部分self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.dropout = nn.Dropout(dropout)def forward(self, src):#src:(sent_len, batch_size)embedded = self.dropout(self.embedding(src))#embedded:(sent_len, batch_size, emb_dim)outputs, (hidden, cell) = self.rnn(embedded)#outputs:(sent_len, batch_size, hid_dim)#hidden:(n_layers, batch_size, hid_dim)#cell:(n_layers, batch_size, hid_dim)return hidden, cellclass Decoder(nn.Module):def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.output_dim = output_dimself.emb_dim = emb_dimself.hid_dim = hid_dimself.n_layers = n_layersself.dropout = dropoutself.embedding = nn.Embedding(output_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.out = nn.Linear(hid_dim, output_dim)self.dropout = nn.Dropout(dropout)def forward(self, input, hidden, cell):# input: (batch_size) -> input: (1, batch_size)input = input.unsqueeze(0)# embedded: (1, batch_size, emb_dim)embedded = self.dropout(self.embedding(input))# hidden: (n_layers, batch size, hid_dim)# cell: (n_layers, batch size, hid_dim)# output(1, batch_size, hid_dim)output, (hidden, cell) = self.rnn(embedded, (hidden, cell))# prediction: (batch_size, output_dim)prediction = self.out(output.squeeze(0))return prediction, hidden, cellclass Seq2Seq(nn.Module):def __init__(self, encoder, decoder, device):super().__init__()self.encoder = encoderself.decoder = decoderself.device = deviceassert encoder.hid_dim == decoder.hid_dim, \"Hidden dimensions of encoder and decoder must be equal!"assert encoder.n_layers == decoder.n_layers, \"Encoder and decoder must have equal number of layers!"def forward(self, src, trg, teacher_forcing_ratio=0.5):# src = [src sent len, batch size]# trg = [trg sent len, batch size]# teacher_forcing_ratio is probability to use teacher forcing# e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the timebatch_size = trg.shape[1]max_len = trg.shape[0]trg_vocab_size = self.decoder.output_dim# tensor to store decoder outputs#创建outputs张量存储Decoder的输出outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)# last hidden state of the encoder is used as the initial hidden state of the decoderhidden, cell = self.encoder(src)#输入到Decoder网络的第一个字符是<sos>（句子开始标记）input = trg[0, :]for t in range(1, max_len):# insert input token embedding, previous hidden and previous cell states# receive output tensor (predictions) and new hidden and cell statesoutput, hidden, cell = self.decoder(input, hidden, cell)# place predictions in a tensor holding predictions for each tokenoutputs[t] = output# decide if we are going to use teacher forcing or notteacher_force = random.random() < teacher_forcing_ratio# get the highest predicted token from our predictionstop1 = output.argmax(1)# if teacher forcing, use actual next token as next input# if not, use predicted tokeninput = trg[t] if teacher_force else top1return outputs#3、训练模型#定义模型参数
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5#编码器和解码器的嵌入层维度（emb_dim）和dropout可以不同，但是层数和隐藏层维度必须相同。
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)  #7855 256 512 2 0.5
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT) #5893 256 512 2 0.5model = Seq2Seq(enc, dec, device).to(device)#初始化模型参数
#在原论文中，作者将所有参数初始化为-0.08和+0.08之间的均匀分布。我们通过创建一个函数来初始化模型中的参数权重。当使用apply方法时，模型中的每个模块和子模块都会调用init_weights函数。
def init_weights(m):for name, param in m.named_parameters():nn.init.uniform_(param.data, -0.08, 0.08)model.apply(init_weights)#看一下模型中可训练参数的总数量
def count_parameters(model):return sum(p.numel() for p in model.parameters() if p.requires_grad)print(f'The model has {count_parameters(model):,} trainable parameters')#使用Adam作为优化器
optimizer = optim.Adam(model.parameters())#使用交叉熵损失作为损失函数
#使用交叉熵损失作为损失函数，由于Pytorch在计算交叉熵损失时在一个batch内求平均，因此需要忽略target为的值（在数据处理阶段，一个batch里的所有句子都padding到了相同的长度，不足的用补齐），否则将影响梯度的计算
PAD_IDX = TRG.vocab.stoi['<pad>']criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)#定义训练函数
def train(model, iterator, optimizer, criterion, clip):   #criterion是损失函数model.train()epoch_loss = 0for i, batch in enumerate(iterator):#这里的src和trg都是tensor的形式了src = batch.srctrg = batch.trgoptimizer.zero_grad()output = model(src, trg)# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)loss.backward()torch.nn.utils.clip_grad_norm_(model.parameters(), clip)optimizer.step()epoch_loss += loss.item()return epoch_loss / len(iterator)#定义验证函数，即val的
#评估阶段和训练阶段的区别是不需要更新任何参数
def evaluate(model, iterator, criterion):model.eval()epoch_loss = 0with torch.no_grad():for i, batch in enumerate(iterator):src = batch.srctrg = batch.trgoutput = model(src, trg, 0)  # turn off teacher forcing# trg = [trg sent len, batch size]# output = [trg sent len, batch size, output dim]output = output[1:].view(-1, output.shape[-1])trg = trg[1:].view(-1)# trg = [(trg sent len - 1) * batch size]# output = [(trg sent len - 1) * batch size, output dim]loss = criterion(output, trg)epoch_loss += loss.item()return epoch_loss / len(iterator)def epoch_time(start_time, end_time):elapsed_time = end_time - start_timeelapsed_mins = int(elapsed_time / 60)elapsed_secs = int(elapsed_time - (elapsed_mins * 60))return elapsed_mins, elapsed_secs#训练模型
N_EPOCHS = 10
CLIP = 1best_valid_loss = float('inf')for epoch in range(N_EPOCHS):start_time = time.time()train_loss = train(model, train_iterator, optimizer, criterion, CLIP)valid_loss = evaluate(model, valid_iterator, criterion)end_time = time.time()epoch_mins, epoch_secs = epoch_time(start_time, end_time)if valid_loss < best_valid_loss:best_valid_loss = valid_losstorch.save(model.state_dict(), 'tut1-model.pt')   #保存最佳验证损失的epoch参数作为模型的最终参数print(f'Epoch: {epoch + 1:02} | Time: {epoch_mins}m {epoch_secs}s')print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')#math.exp()：使用一个batch内的平均损失计算困惑度#4、验证模型
model.load_state_dict(torch.load('tut1-model.pt'))test_loss = evaluate(model, test_iterator, criterion)print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

参考：

https://blog.csdn.net/weixin_43632501/article/details/98731800

Seq2Seq实战——机器翻译相关推荐

基于seq2seq的机器翻译系统
目录前言 1.模型结构 1.1 encoder 1.2 decoder 2.数据处理 2.1 数据集 2.2 字典构建 2.3 特殊符号 3.参数加载 3.1 库的导入 3.2 词典导入 4.准备训 ...
PyTorch: 序列到序列模型(Seq2Seq)实现机器翻译实战
版权声明:博客文章都是作者辛苦整理的,转载请注明出处,谢谢!http://blog.csdn.net/m0_37306360/article/details/79318644 简介在这个项目中,我们 ...
[转载] PyTorch: 序列到序列模型(Seq2Seq)实现机器翻译实战
参考链接: Python机器学习中的seq2seq模型简介在这个项目中,我们将使用PyTorch框架实现一个神经网络,这个网络实现法文翻译成英文.这个项目是Sean Robertson写的稍微复杂 ...
【一起入门NLP】中科院自然语言处理作业四：RNN+Attention实现Seq2Seq中英文机器翻译（Pytorch）【代码+报告】
这里是国科大自然语言处理的第四次作业,同样也是从小白的视角解读程序和代码,现在我们开始吧(今天也是花里胡哨的一天呢
一图看懂TensorFlow2.0系列（十一）如何用TensorFlow2.0实现seq2seq的机器翻译？
【笔记3-7】CS224N课程笔记 - 神经网络机器翻译seq2seq注意力机制
CS224N(七)Neural Machine Translation, Seq2seq and Attention seq2seq神经网络机器翻译历史方法 seq2seq基础 seq2seq - ...
4个可以写进简历的京东 NLP 项目实战
01 京东AI项目实战课程安排覆盖了从经典的机器学习.文本处理技术.序列模型.深度学习.预训练模型.知识图谱.图神经网络所有必要的技术. 项目一.京东健康智能分诊项目第一周:文本处理与特征工程 | ...
第十二课.Seq2Seq与Attention
目录 Seq2Seq,机器翻译Encoder-Decoder 机器翻译数据集与数据预处理 Encoder-Decoder模型损失函数训练与测试结合Luong Attention Luong At ...
机器翻译之Facebook的CNN与Google的Attention
传统的seq2seq facebook的cnn 结构特点 position embedding 卷积的引入 GLU控制信息的流动 attention google的attention 结构特点 K ...

Seq2Seq实战——机器翻译