Translation with a seq2seq attention model in PyTorch

Below are my notes on the PyTorch 1.0 seq2seq + attention model for French-to-English translation (the code also runs correctly on PyTorch 0.4):
# -*- coding: utf-8 -*-
"""
Translation with a Sequence to Sequence Network and Attention
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro/practical-pytorch>`_

In this project we will be teaching a neural network to translate from
French to English.

::

    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .

... to varying degrees of success.

This is made possible by the simple but powerful idea of the `sequence
to sequence network <http://arxiv.org/abs/1409.3215>`__, in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

To improve upon this model we'll use an `attention
mechanism <https://arxiv.org/abs/1409.0473>`__, which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user


It would also be useful to know about Sequence to Sequence networks and
how they work:

-  `Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__
-  `Sequence to Sequence Learning with Neural
   Networks <http://arxiv.org/abs/1409.3215>`__
-  `Neural Machine Translation by Jointly Learning to Align and
   Translate <https://arxiv.org/abs/1409.0473>`__
-  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__

You will also find the previous tutorials on
:doc:`/intermediate/char_rnn_classification_tutorial`
and :doc:`/intermediate/char_rnn_generation_tutorial`
helpful as those concepts are very similar to the Encoder and Decoder
models, respectively.


**Requirements**
"""
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

######################################################################
# Loading data files
# ==================
#
# The data for this project is a set of many thousands of English to
# French translation pairs.
#
# `This question on Open Data Stack
# Exchange <http://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages>`__
# pointed me to the open translation site http://tatoeba.org/ which has
# downloads available at http://tatoeba.org/eng/downloads - and better
# yet, someone did the extra work of splitting language pairs into
# individual text files here: http://www.manythings.org/anki/
#
# The English to French pairs are too big to include in the repo, so
# download to ``data/eng-fra.txt`` before continuing. The file is a tab
# separated list of translation pairs:
#
# ::
#
#     I am cold.    J'ai froid.
#
# .. Note::
#    Download the data from
#    `here <https://download.pytorch.org/tutorial/data.zip>`_
#    and extract it to the current directory.

######################################################################
# Similar to the character encoding used in the character-level RNN
# tutorials, we will be representing each word in a language as a one-hot
# vector, or giant vector of zeros except for a single one (at the index
# of the word). Compared to the dozens of characters that might exist in a
# language, there are many many more words, so the encoding vector is much
# larger. We will however cheat a bit and trim the data to only use a few
# thousand words per language.
#
# .. figure:: /_static/img/seq-seq-images/word-encoding.png
#    :alt:
#
#

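######################################################################
# A minimal sketch of the one-hot encoding described above, in plain
# Python. The toy vocabulary ``toy_word2index`` and the helper ``one_hot``
# are made up for illustration and are not part of the tutorial, which
# works with word indexes rather than explicit one-hot vectors.

```python
# Hypothetical five-word vocabulary (the real one is built by the Lang class)
toy_word2index = {"SOS": 0, "EOS": 1, "i": 2, "am": 3, "cold": 4}


def one_hot(word, word2index):
    """Return a list of zeros with a single 1 at the word's index."""
    vec = [0] * len(word2index)
    vec[word2index[word]] = 1
    return vec


toy_vec = one_hot("am", toy_word2index)
print(toy_vec)  # [0, 0, 0, 1, 0]
```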
######################################################################
# We'll need a unique index per word to use as the inputs and targets of
# the networks later. To keep track of all this we will use a helper class
# called ``Lang`` which has word → index (``word2index``) and index → word
# (``index2word``) dictionaries, as well as a count of each word
# ``word2count`` to use to later replace rare words.
#

SOS_token = 0
EOS_token = 1


# The Lang objects created below represent the source and target languages.
# Each holds three parts: word2index (word -> id), index2word (id -> word)
# and word2count (word frequency). word2count is meant for filtering out
# low-frequency words (mapping them to an "unknown" token), although this
# listing never actually applies that filter.

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)  # register each word of the sentence

    def addWord(self, word):
        if word not in self.word2index:  # is this a new word?
            # Not in word2index yet, so create its dictionary entries
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1  # the next new word gets the next index
        else:
            self.word2count[word] += 1  # count one more occurrence


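######################################################################
# The rare-word replacement that ``word2count`` is intended for is not
# implemented in this listing. One possible sketch is shown below, using
# ``collections.Counter`` in place of ``Lang``. The ``<unk>`` token, the
# ``min_count`` threshold and the function name are assumptions made for
# illustration only.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token, not defined by the tutorial


def replace_rare_words(sentences, min_count=2):
    """Replace every word seen fewer than min_count times with UNK."""
    counts = Counter(w for s in sentences for w in s.split(' '))
    return [
        ' '.join(w if counts[w] >= min_count else UNK for w in s.split(' '))
        for s in sentences
    ]


demo = replace_rare_words(["i am cold .", "i am tired ."])
print(demo)  # ['i am <unk> .', 'i am <unk> .']
```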
######################################################################
# The files are all in Unicode, to simplify we will turn Unicode
# characters to ASCII, make everything lowercase, and trim most
# punctuation.
#

# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
# (the source data file is Unicode-encoded)
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)  # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


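######################################################################
# A worked example of the normalization steps may help. The sketch below
# re-implements them under the made-up names ``strip_accents`` and
# ``normalize_demo`` so that it runs on its own. It shows how accents are
# dropped and how an apostrophe turns into a space, which is why the
# prefix filter further down lists forms such as "i m".

```python
import re
import unicodedata


def strip_accents(s):
    # NFD splits an accented letter into base letter + combining mark;
    # dropping category 'Mn' (nonspacing marks) removes the accent.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def normalize_demo(s):
    s = strip_accents(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)      # separate end punctuation
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # everything else becomes a space
    return s


print(normalize_demo("Vous êtes trop maigre."))  # vous etes trop maigre .
print(normalize_demo("J'ai froid."))             # j ai froid .
```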
######################################################################
# To read the data file we will split the file into lines, and then split
# lines into pairs. The files are all English → Other Language, so if we
# want to translate from Other Language → English I added the ``reverse``
# flag to reverse the pairs.
#

def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    lines = open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8').\
        read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs


######################################################################
# Since there are a *lot* of example sentences and we want to train
# something quickly, we'll trim the data set to only relatively short and
# simple sentences. Here the maximum length is 10 words (that includes
# ending punctuation) and we're filtering to sentences that translate to
# the form "I am" or "He is" etc. (accounting for apostrophes replaced
# earlier).
#

MAX_LENGTH = 10

eng_prefixes = (
    "i am", "i m",
    "he is", "he s",
    "she is", "she s",
    "you are", "you re",
    "we are", "we re",
    "they are", "they re"
)


def filterPair(p):
    # Keep the pair only if both sentences are short enough and the
    # English side starts with one of the prefixes
    return len(p[0].split(' ')) < MAX_LENGTH and \
        len(p[1].split(' ')) < MAX_LENGTH and \
        p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


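######################################################################
# A small sketch of how the filter behaves on a few hand-written pairs.
# The names ``keep_pair``, ``demo_prefixes`` and ``demo_pairs`` are made up
# for illustration and the pairs are not taken from the dataset. Note that
# the ``<`` comparison admits at most 9 tokens per sentence, and that
# ``str.startswith`` accepts a tuple of prefixes.

```python
DEMO_MAX_LENGTH = 10
demo_prefixes = ("i am", "i m", "he is", "he s")  # shortened prefix list


def keep_pair(pair):
    src, tgt = pair
    return (len(src.split(' ')) < DEMO_MAX_LENGTH
            and len(tgt.split(' ')) < DEMO_MAX_LENGTH
            and tgt.startswith(demo_prefixes))


demo_pairs = [
    ["j ai froid .", "i m cold ."],                   # kept
    ["va !", "go !"],                                 # dropped: no prefix match
    ["il est en train de peindre un tableau .",
     "he is painting a picture ."],                   # kept: 9 source tokens
    ["a b c d e f g h i j", "i am a b c d e f g h"],  # dropped: 10 filler tokens
]
kept = [p for p in demo_pairs if keep_pair(p)]
print(len(kept))  # 2
```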
######################################################################
# The full process for preparing the data is:
#
# -  Read text file and split into lines, split lines into pairs
# -  Normalize text, filter by length and content
# -  Make word lists from sentences in pairs
#

def prepareData(lang1, lang2, reverse=False):
    # Read the lang1/lang2 data, reversing the pairs if requested
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    # Total number of pairs read
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    # Number of pairs that pass the filter
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


# Preprocess the data
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))  # show one random pair


######################################################################
# The Seq2Seq Model
# =================
#
# A Recurrent Neural Network, or RNN, is a network that operates on a
# sequence and uses its own output as input for subsequent steps.
#
# A `Sequence to Sequence network <http://arxiv.org/abs/1409.3215>`__, or
# seq2seq network, or `Encoder Decoder
# network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model
# consisting of two RNNs called the encoder and decoder. The encoder reads
# an input sequence and outputs a single vector, and the decoder reads
# that vector to produce an output sequence.
#
# .. figure:: /_static/img/seq-seq-images/seq2seq.png
#    :alt:
#
# Unlike sequence prediction with a single RNN, where every input
# corresponds to an output, the seq2seq model frees us from sequence
# length and order, which makes it ideal for translation between two
# languages.
#
# Consider the sentence "Je ne suis pas le chat noir" → "I am not the
# black cat". Most of the words in the input sentence have a direct
# translation in the output sentence, but are in slightly different
# orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"
# construction there is also one more word in the input sentence. It would
# be difficult to produce a correct translation directly from the sequence
# of input words.
#
# With a seq2seq model the encoder creates a single vector which, in the
# ideal case, encodes the "meaning" of the input sequence: a single
# point in some N dimensional space of sentences.
#

######################################################################
# The Encoder
# -----------
#
# The encoder of a seq2seq network is an RNN that outputs some value for
# every word from the input sentence. For every input word the encoder
# outputs a vector and a hidden state, and uses the hidden state for the
# next input word.
#
# .. figure:: /_static/img/seq-seq-images/encoder-network.png
#    :alt:
#
#

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size  # size of the hidden state

        # nn.Embedding(num_words, dim) is a num_words x dim lookup table:
        # nn.Embedding(2, 4) holds 2 words of 4 dimensions each (a 2x4
        # matrix), and 100 words of 10 dimensions each would be
        # nn.Embedding(100, 10). The vectors are only initial values that
        # have not been optimized yet; training the network updates them
        # so that each vector comes to represent its word.
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)  # GRU recurrent layer

    def forward(self, input, hidden):
        # view works like reshape; -1 lets PyTorch infer that dimension,
        # giving shape (1, 1, hidden_size)
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):  # all-zero initial hidden state
        return torch.zeros(1, 1, self.hidden_size, device=device)


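######################################################################
# To make the tensor shapes in the encoder concrete, here is a minimal
# sketch of one encoder step with toy sizes. The sizes (5 words, 4 hidden
# units) and the variable names are assumptions for illustration; only
# ``nn.Embedding`` and ``nn.GRU`` are used, as in ``EncoderRNN``.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, hidden = 5, 4                    # toy sizes, not the tutorial's
embedding = nn.Embedding(vocab_size, hidden)
gru = nn.GRU(hidden, hidden)

word = torch.tensor([3])                     # index of a single word
embedded = embedding(word).view(1, 1, -1)    # (seq_len=1, batch=1, hidden)
h0 = torch.zeros(1, 1, hidden)               # (num_layers=1, batch=1, hidden)
output, h1 = gru(embedded, h0)

# For one time step of a single-layer GRU, the output and the new hidden
# state hold the same values.
print(embedded.shape, output.shape, h1.shape)
```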
######################################################################
# The Decoder
# -----------
#
# The decoder is another RNN that takes the encoder output vector(s) and
# outputs a sequence of words to create the translation.
#


######################################################################
# Simple Decoder
# ^^^^^^^^^^^^^^
#
# In the simplest seq2seq decoder we use only the last output of the encoder.
# This last output is sometimes called the *context vector* as it encodes
# context from the entire sequence. This context vector is used as the
# initial hidden state of the decoder.
#
# At every step of decoding, the decoder is given an input token and
# hidden state. The initial input token is the start-of-string ``<SOS>``
# token, and the first hidden state is the context vector (the encoder's
# last hidden state).
#
# .. figure:: /_static/img/seq-seq-images/decoder-network.png
#    :alt:
#
#

class DecoderRNN(nn.Module):
    # DecoderRNN mirrors EncoderRNN; the figure above makes the flow clear
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # view works like reshape; -1 lets PyTorch infer that dimension
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)  # one GRU step
        # Project to the vocabulary size and apply log-softmax (the
        # second-to-last box on the left of the figure)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# I encourage you to train and observe the results of this model, but to
# save space we'll be going straight for the gold and introducing the
# Attention Mechanism.
#


######################################################################
# Attention Decoder
# ^^^^^^^^^^^^^^^^^
#
# If only the context vector is passed between the encoder and decoder,
# that single vector carries the burden of encoding the entire sentence.
#
# Attention allows the decoder network to "focus" on a different part of
# the encoder's outputs for every step of the decoder's own outputs. First
# we calculate a set of *attention weights*. These will be multiplied by
# the encoder output vectors to create a weighted combination. The result
# (called ``attn_applied`` in the code) should contain information about
# that specific part of the input sequence, and thus help the decoder
# choose the right output words.
#
# .. figure:: https://i.imgur.com/1152PYf.png
#    :alt:
#
# Calculating the attention weights is done with another feed-forward
# layer ``attn``, using the decoder's input and hidden state as inputs.
# Because there are sentences of all sizes in the training data, to
# actually create and train this layer we have to choose a maximum
# sentence length (input length, for encoder outputs) that it can apply
# to. Sentences of the maximum length will use all the attention weights,
# while shorter sentences will only use the first few.
#
# .. figure:: /_static/img/seq-seq-images/attention-decoder-network.png
#    :alt:
#
#

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # Embed the input token and apply dropout, which randomly zeroes
        # some activations during training
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Compute the attention weights from the embedded input and the
        # hidden state. torch.cat concatenates along an existing dimension,
        # whereas torch.stack creates a new dimension and joins along it.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

        # Apply the attention weights to encoder_outputs. torch.bmm is a
        # batched matrix multiply of two 3-D tensors that hold the same
        # number of matrices: (b x n x m) times (b x m x p) gives (b x n x p).
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))

        # Concatenate embedded and attn_applied
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # unsqueeze(0) returns a new tensor with a size-1 dimension
        # inserted at position 0
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


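######################################################################
# The weighted combination computed by ``torch.bmm`` can be checked by
# hand. In this sketch random scores stand in for the output of the
# ``attn`` layer, and the sizes and variable names are assumptions for
# illustration. It confirms that ``attn_applied`` is the sum of the
# encoder output rows, each scaled by its attention weight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
max_length, hidden = 10, 4                    # toy sizes
scores = torch.randn(1, max_length)           # stand-in for self.attn(...)
attn_weights = F.softmax(scores, dim=1)       # (1, max_length), sums to 1
encoder_outputs = torch.randn(max_length, hidden)

attn_applied = torch.bmm(attn_weights.unsqueeze(0),     # (1, 1, max_length)
                         encoder_outputs.unsqueeze(0))  # (1, max_length, hidden)

# Same result written as an explicit weighted sum over input positions
manual = (attn_weights.t() * encoder_outputs).sum(dim=0)  # (hidden,)

print(attn_applied.shape)  # torch.Size([1, 1, 4])
```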
######################################################################
# .. note:: There are other forms of attention that work around the length
#   limitation by using a relative position approach. Read about "local
#   attention" in `Effective Approaches to Attention-based Neural Machine
#   Translation <https://arxiv.org/abs/1508.04025>`__.
#
# Training
# ========
#
# Preparing Training Data
# -----------------------
#
# To train, for each pair we will need an input tensor (indexes of the
# words in the input sentence) and target tensor (indexes of the words in
# the target sentence). While creating these vectors we will append the
# EOS token to both sequences.
#

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    # Look up the index of every word
    indexes = indexesFromSentence(lang, sentence)
    # Append the EOS token to the sequence
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    # Each pair yields an input tensor (indexes of the words in the input
    # sentence) and a target tensor (indexes of the words in the target
    # sentence)
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)


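######################################################################
# As a plain-Python sketch of what the tensor preparation produces, the
# snippet below maps a sentence to word indexes, appends EOS, and lays the
# result out as one index per row, mirroring ``.view(-1, 1)``. The toy
# vocabulary and the helper name ``indexes_with_eos`` are made up for
# illustration.

```python
EOS = 1  # same value as EOS_token in the listing
toy_vocab = {"i": 2, "am": 3, "cold": 4, ".": 5}  # hypothetical word2index


def indexes_with_eos(word2index, sentence):
    return [word2index[w] for w in sentence.split(' ')] + [EOS]


# One row per token: the list analogue of a (length, 1) tensor
column = [[i] for i in indexes_with_eos(toy_vocab, "i am cold .")]
print(column)  # [[2], [3], [4], [5], [1]]
```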
######################################################################
# Training the Model
# ------------------
#
# To train we run the input sentence through the encoder, and keep track
# of every output and the latest hidden state. Then the decoder is given
# the ``<SOS>`` token as its first input, and the last hidden state of the
# encoder as its first hidden state.
#
# "Teacher forcing" is the concept of using the real target outputs as
# each next input, instead of using the decoder's guess as the next input.
# Using teacher forcing causes it to converge faster but `when the trained
# network is exploited, it may exhibit
# instability <http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf>`__.
#
# You can observe outputs of teacher-forced networks that read with
# coherent grammar but wander far from the correct translation -
# intuitively it has learned to represent the output grammar and can "pick
# up" the meaning once the teacher tells it the first few words, but it
# has not properly learned how to create the sentence from the translation
# in the first place.
#
# Because of the freedom PyTorch's autograd gives us, we can randomly
# choose to use teacher forcing or not with a simple if statement. Turn
# ``teacher_forcing_ratio`` up to use more of it.
#

teacher_forcing_ratio = 0.5


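######################################################################
# The effect of ``teacher_forcing_ratio`` can be sketched without any
# model. The helper ``count_teacher_forced`` and its seeded generator are
# assumptions for illustration; it draws one random number per training
# example and counts how often teacher forcing would be selected.

```python
import random


def count_teacher_forced(n_iters, ratio, seed=0):
    """Count how many of n_iters examples would use teacher forcing."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    return sum(rng.random() < ratio for _ in range(n_iters))


never = count_teacher_forced(1000, 0.0)   # ratio 0: never selected
always = count_teacher_forced(1000, 1.0)  # ratio 1: always selected
half = count_teacher_forced(1000, 0.5)    # roughly half of the examples
print(never, always)  # 0 1000
```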
# Teacher forcing feeds the real target tokens to the decoder so that
# training converges faster, but the trained network may then behave
# unstably when it is actually used.
# teacher_forcing_ratio is the probability of using teacher forcing for a
# given training example.

# The training function
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    # encoder is EncoderRNN(input_lang.n_words, hidden_size)
    # decoder is AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1)
    # hidden_size = 256
    encoder_hidden = encoder.initHidden()

    # encoder_optimizer is optim.SGD(encoder.parameters(), lr=learning_rate)
    # decoder_optimizer is optim.SGD(decoder.parameters(), lr=learning_rate)
    # nn.Parameter is a Tensor subclass (a Variable subclass before
    # PyTorch 0.4) used for module parameters. Assigning a Parameter to a
    # Module attribute automatically adds it to the module's parameter list
    # (so it shows up in the parameters() iterator); assigning a plain
    # tensor does not. This lets a model cache temporary state, such as the
    # last hidden state of an RNN, without registering it as a parameter.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Sequence lengths
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # Buffer for the encoder outputs, zero-padded up to max_length
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Run the encoder over the input, one token at a time
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        # encoder_output has shape (1, 1, hidden_size), so [0, 0] picks
        # out the hidden_size vector for this step
        encoder_outputs[ei] = encoder_output[0, 0]

    # The decoder's first input is the SOS token
    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            # Run one decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            # Accumulate the loss for this step
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            # Run one decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)

            # topk(k) returns the k largest values along the last dimension
            # together with their indices, as a (values, indices) pair
            # sorted from largest to smallest by default; largest=False
            # returns the k smallest instead. topk(1) is the most likely
            # word.
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break

    # Backpropagate
    loss.backward()

    # Update the parameters
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


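######################################################################
# Two details of the decoding loop can be checked in isolation: that
# ``nn.NLLLoss`` applied to log-softmax outputs equals the cross-entropy
# of the raw scores, and that ``topk(1)`` picks the most likely word. The
# toy scores and variable names below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0, 0.1]])  # (1, vocab) toy decoder scores
log_probs = F.log_softmax(logits, dim=1)        # what the decoder returns
target = torch.tensor([0])                      # shape (1,), like target_tensor[di]

nll = nn.NLLLoss()(log_probs, target)           # loss used in the listing
ce = F.cross_entropy(logits, target)            # same quantity from raw scores

topv, topi = log_probs.topk(1)                  # greedy choice: best word index
next_input = topi.squeeze().detach()            # 0-dim tensor fed back as input
print(topi.item())  # 0
```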
######################################################################
# This is a helper function to print time elapsed and estimated time
# remaining given the current time and progress %.
#

import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since     # elapsed seconds
    es = s / (percent)  # estimated total seconds
    rs = es - s         # estimated remaining seconds
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


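######################################################################
# A worked example of the estimate: after 120 seconds at 25% progress the
# projected total is 120 / 0.25 = 480 seconds, leaving 360 seconds. The
# sketch below takes the elapsed time as an argument instead of reading
# the clock so that the result is deterministic; the names ``as_minutes``
# and ``eta_string`` are made up for illustration.

```python
import math


def as_minutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def eta_string(elapsed, fraction_done):
    total = elapsed / fraction_done  # projected total run time
    remaining = total - elapsed      # time still to go
    return '%s (- %s)' % (as_minutes(elapsed), as_minutes(remaining))


eta = eta_string(120.0, 0.25)
print(eta)  # 2m 0s (- 6m 0s)
```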
######################################################################
# The whole training process looks like this:
#
# -  Start a timer
# -  Initialize optimizers and criterion
# -  Create set of training pairs
# -  Start empty losses array for plotting
#
# Then we call ``train`` many times and occasionally print the progress (%
# of examples, time so far, estimated time) and average loss.
#

def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

    # Draw n_iters random training pairs and convert them to tensors
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    # Loss function: negative log-likelihood
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        # Run one training step and collect its loss
        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            # Print the progress (% of examples, time so far, estimated
            # time) and the average loss
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    # Plot the loss curve
    showPlot(plot_losses)


######################################################################
# Plotting results
# ----------------
#
# Plotting is done with matplotlib, using the array of loss values
# ``plot_losses`` saved while training.
#

import matplotlib.pyplot as plt

plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


######################################################################
# Evaluation
# ==========
#
# Evaluation is mostly the same as training, but there are no targets so
# we simply feed the decoder's predictions back to itself for each step.
# Every time it predicts a word we add it to the output string, and if it
# predicts the EOS token we stop there. We also store the decoder's
# attention outputs for display later.
#

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        # Convert the sentence to a tensor of word indexes
        input_tensor = tensorFromSentence(input_lang, sentence)
        input_length = input_tensor.size()[0]

        # encoder is EncoderRNN(input_lang.n_words, hidden_size)
        # decoder is AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1)
        # hidden_size = 256
        encoder_hidden = encoder.initHidden()

        # Buffer for the encoder outputs, zero-padded up to max_length
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        # Run the encoder over the input, one token at a time
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        # The decoder's first input is the SOS token
        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        # The decoder starts from the encoder's last hidden state
        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            # Run one decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)

            # Store the attention weights for this step
            decoder_attentions[di] = decoder_attention.data
            # Pick the most likely word
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                # Append the predicted word to the output
                decoded_words.append(output_lang.index2word[topi.item()])

            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]


######################################################################
# We can evaluate random sentences from the training set and print out the
# input, target, and output to make some subjective quality judgements:
#

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')


######################################################################
# Training and Evaluating
# =======================
#
# With all these helper functions in place (it looks like extra work, but
# it makes it easier to run multiple experiments) we can actually
# initialize a network and start training.
#
# Remember that the input sentences were heavily filtered. For this small
# dataset we can use relatively small networks of 256 hidden nodes and a
# single GRU layer. After about 40 minutes on a MacBook CPU we'll get some
# reasonable results.
#
# .. Note::
#    If you run this notebook you can train, interrupt the kernel,
#    evaluate, and continue training later. Comment out the lines where the
#    encoder and decoder are initialized and run ``trainIters`` again.
#

hidden_size = 256
# Encoder
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# Decoder with the attention mechanism
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
# Train for 75000 iterations, printing progress every 5000
trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

######################################################################
# Translate a few randomly chosen training pairs
evaluateRandomly(encoder1, attn_decoder1)

######################################################################
# Visualizing Attention
# ---------------------
#
# A useful property of the attention mechanism is its highly interpretable
# outputs. Because it is used to weight specific encoder outputs of the
# input sequence, we can imagine looking where the network is focused most
# at each time step.
#
# You could simply run ``plt.matshow(attentions)`` to see attention output
# displayed as a matrix, with the columns being input steps and rows being
# output steps:
#

output_words, attentions = evaluate(encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())


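To make the row and column convention concrete, here is a small sketch that prints an attention matrix as text, one row per output word and one column per input word. The weights are made up for illustration and do not come from a trained model; format_attention and most_attended are hypothetical helpers, not part of the tutorial.

```python
def format_attention(input_words, output_words, attentions):
    """Render attention weights as a text table.

    attentions[i][j] is the weight that output step i puts on input step j,
    so rows are output steps and columns are input steps.
    """
    width = max(len(w) for w in input_words + output_words) + 2
    header = " " * width + "".join(w.rjust(width) for w in input_words)
    lines = [header]
    for word, row in zip(output_words, attentions):
        cells = "".join(f"{a:.2f}".rjust(width) for a in row)
        lines.append(word.rjust(width) + cells)
    return "\n".join(lines)


def most_attended(input_words, output_words, attentions):
    """For each output word, return the input word with the largest weight."""
    return [
        (out, input_words[max(range(len(row)), key=row.__getitem__)])
        for out, row in zip(output_words, attentions)
    ]


# Made-up weights, for illustration only.
input_words = ["je", "suis", "froid", "<EOS>"]
output_words = ["i", "am", "cold", "<EOS>"]
attentions = [
    [0.70, 0.20, 0.05, 0.05],
    [0.15, 0.70, 0.10, 0.05],
    [0.05, 0.10, 0.80, 0.05],
    [0.05, 0.05, 0.10, 0.80],
]

print(format_attention(input_words, output_words, attentions))
print(most_attended(input_words, output_words, attentions))
```

Reading a row from left to right shows how one output word distributes its attention over the input sentence, which is the same information the matshow plot encodes as colour.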
######################################################################
# For a better viewing experience we will do the extra work of adding axes
# and labels:

def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")
evaluateAndShowAttention("elle est trop petit .")
evaluateAndShowAttention("je ne crains pas de mourir .")
evaluateAndShowAttention("c est un jeune directeur plein de talent .")

######################################################################
# Exercises
# =========
#
# -  Try with a different dataset
#
#    -  Another language pair
#    -  Human → Machine (e.g. IoT commands)
#    -  Chat → Response
#    -  Question → Answer
#
# -  Replace the embeddings with pre-trained word embeddings such as word2vec or
#    GloVe
# -  Try with more layers, more hidden units, and more sentences. Compare
#    the training time and results.
# -  If you use a translation file where pairs have two of the same phrase
#    (``I am test \t I am test``), you can use this as an autoencoder. Try
#    this:
#
#    -  Train as an autoencoder
#    -  Save only the Encoder network
#    -  Train a new Decoder for translation from there
#
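
For the autoencoder exercise, the data file needs each phrase paired with itself, separated by a tab. A minimal sketch of producing such lines, assuming the sentences are already available as a Python list; to_autoencoder_lines is a hypothetical helper, not part of the tutorial:

```python
def to_autoencoder_lines(sentences):
    """Return tab-separated lines in which source and target are identical."""
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip empty entries
            lines.append(f"{sentence}\t{sentence}")
    return lines


lines = to_autoencoder_lines(["I am test", "  he is painting a picture .  ", ""])
for line in lines:
    print(line)
```

Writing these lines to a text file gives the two-identical-phrases layout the exercise describes; whether it can be loaded as-is depends on the data-reading code earlier in the script.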
Reposted from: https://www.cnblogs.com/www-caiyin-com/p/10123346.html