Recurrent Neural Networks Tutorial, Part 2: Implementing an RNN with Python, NumPy and Theano

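Author: 徐志强
Link: https://zhuanlan.zhihu.com/p/22289383
Source: Zhihu (知乎)
Copyright belongs to the author. For commercial reuse, contact the author for authorization; for non-commercial reuse, please credit the source.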

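In this part we will implement a full recurrent neural network from scratch in Python and then optimize the implementation with Theano, a library that can run computations on a GPU. I will skip some code that is not essential to understanding recurrent neural networks, but the complete code is available on Github.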

Language Modeling

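Our goal is to build a language model using an RNN, so let me first explain what a language model is. Given a sentence of $m$ words, a language model lets us predict the probability of observing that sentence (in a given dataset) as:
$P(w_{1},...,w_{m})=\prod_{i=1}^{m} P(w_{i}|w_{1},...,w_{i-1})$
In other words, the probability of a sentence is the product of the probabilities of each word given the words that came before it. So the probability of the sentence "He went to buy some chocolate" is the probability of "chocolate" given "He went to buy some", multiplied by the probability of "some" given "He went to buy", and so on.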
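The product rule above can be made concrete with a few lines of Python. This is only a sketch: the conditional probabilities below are made-up numbers chosen for illustration, not values estimated from any dataset, and the helper name sentence_probability is hypothetical.

```python
import math

# Made-up conditional probabilities P(w_i | w_1 .. w_{i-1}) for one sentence.
# They are illustrative only and do not come from a trained model.
toy_conditionals = [
    ("He", 0.10),         # P("He")
    ("went", 0.20),       # P("went" | "He")
    ("to", 0.50),         # P("to" | "He went")
    ("buy", 0.10),        # P("buy" | "He went to")
    ("some", 0.30),       # P("some" | "He went to buy")
    ("chocolate", 0.05),  # P("chocolate" | "He went to buy some")
]


def sentence_probability(conditionals):
    """Multiply the per-word conditional probabilities (chain rule)."""
    prob = 1.0
    for _word, p in conditionals:
        prob *= p
    return prob


sentence_prob = sentence_probability(toy_conditionals)

# The same quantity in log space: a sum of logs instead of a product.
log_prob = sum(math.log(p) for _word, p in toy_conditionals)
```

Because a product of many small numbers shrinks quickly, the log-space form is usually the one worth working with for longer sentences.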

Why is that useful? Why would we want to assign a probability to a sentence?

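First, such a model can serve as a scoring mechanism. For example, a machine translation system typically generates multiple candidates for an input sentence, and you can use a language model to pick the most probable one. Intuitively, the most probable sentence is also more likely to be grammatically correct. Similar scoring happens in speech recognition systems.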

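But solving the language modeling problem also has a very useful by-product. Because we can predict the probability of a word given all the words before it, we can generate new text. It is a generative model. Given an existing sequence of words, we sample the next word from the predicted probabilities and repeat the process until we have a complete sentence. Andrej Karpathy has a great post that demonstrates what language models are capable of. His models are trained on single characters rather than full words and can generate anything from Shakespeare to Linux code.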

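Note that in the equation above, the probability of each word is conditioned on all previous words. In practice, many models have a hard time representing such long-term dependencies because of computational or memory constraints, so they are typically limited to looking at only the last few words. RNNs can, in theory, capture such long-term dependencies, but in practice it is more complicated. We will explore that in a later post.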

Training Data and Preprocessing

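To train our language model we need text to learn from. Fortunately, training a language model requires no labels, just raw text. I downloaded 15,000 longish reddit comments from a dataset available on Google's BigQuery. Text generated by our model will (hopefully) sound like reddit comments. But as with most machine learning projects, we first need to do some preprocessing to get our data into the right format.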
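1. Tokenize the text
We have raw text, but we want to make predictions on a per-word basis. This means we must split our comments into sentences, and sentences into words. We could simply split each comment on spaces, but that would not handle punctuation properly. The sentence "He left!" should be 3 tokens: "He", "left", "!". We will use NLTK's word_tokenize and sent_tokenize methods, which do most of the hard work for us.
2. Remove infrequent words
Most words in our text appear only once or twice, and it is a good idea to remove them. A huge vocabulary makes our model slow to train (we will discuss why later), and because we do not have many contextual examples for such words, we could not learn to use them correctly anyway. This is quite similar to how humans learn: to really understand how to use a word appropriately, you need to have seen it in several different contexts.

In the code, I limit the vocabulary to the vocabulary_size most common words (set to 8000 here, but feel free to change it) and replace every word not in the vocabulary with UNKNOWN_TOKEN. For example, if "nonlinearities" is not in our vocabulary, the sentence "nonlinearities are important in neural networks" becomes "UNKNOWN_TOKEN are important in neural networks". UNKNOWN_TOKEN is part of our vocabulary and we predict it just like any other word. When we generate new text we can replace UNKNOWN_TOKEN again, for example with a randomly sampled word that is not in our vocabulary, or we can keep generating sentences until we get one that contains no unknown token.
3. Prepend special start and end tokens
We also want to learn which words tend to start and end a sentence. To do this, I prepend a special SENTENCE_START token and append a special SENTENCE_END token to each sentence. This lets us ask: given that the first token is SENTENCE_START, what is the likely next word (the actual first word of the sentence)?
4. Build the training data matrices
The input to our RNN is vectors, not strings. So we create a mapping between words and indices, index_to_word and word_to_index. For example, the word "friendly" may be at index 2001. A training example $x$ may look like $[0, 179, 341, 416]$, where 0 corresponds to SENTENCE_START. The corresponding label $y$ would be $[179, 341, 416, 1]$. Remember that our goal is to predict the next word, so $y$ is just the $x$ vector shifted by one position, with SENTENCE_END as its last element. In other words, the correct prediction for word 179 above is 341, the word that actually follows it.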

import csv
import itertools

import nltk
import numpy as np

vocabulary_size = 8000
unknown_token = "UNKNOWN_TOKEN"
sentence_start_token = "SENTENCE_START"
sentence_end_token = "SENTENCE_END"

# Read the data and append SENTENCE_START and SENTENCE_END tokens
print("Reading CSV file...")
with open('data/reddit-comments-2015-08.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f, skipinitialspace=True)
    # Skip the first row of the file
    next(reader)
    # Split full comments into sentences
    sentences = itertools.chain(*[nltk.sent_tokenize(x[0].lower()) for x in reader])
    # Append SENTENCE_START and SENTENCE_END
    sentences = ["%s %s %s" % (sentence_start_token, x, sentence_end_token) for x in sentences]
print("Parsed %d sentences." % (len(sentences)))

# Tokenize the sentences into words
tokenized_sentences = [nltk.word_tokenize(sent) for sent in sentences]

# Count the word frequencies
word_freq = nltk.FreqDist(itertools.chain(*tokenized_sentences))
print("Found %d unique words tokens." % len(word_freq.items()))

# Get the most common words and build index_to_word and word_to_index vectors
vocab = word_freq.most_common(vocabulary_size-1)
index_to_word = [x[0] for x in vocab]
index_to_word.append(unknown_token)
word_to_index = dict([(w,i) for i,w in enumerate(index_to_word)])

print("Using vocabulary size %d." % vocabulary_size)
print("The least frequent word in our vocabulary is '%s' and appeared %d times." % (vocab[-1][0], vocab[-1][1]))

# Replace all words not in our vocabulary with the unknown token
for i, sent in enumerate(tokenized_sentences):
    tokenized_sentences[i] = [w if w in word_to_index else unknown_token for w in sent]

print("\nExample sentence: '%s'" % sentences[0])
print("\nExample sentence after Pre-processing: '%s'" % tokenized_sentences[0])

# Create the training data.
# dtype=object because the sentences have different lengths (ragged lists).
X_train = np.asarray([[word_to_index[w] for w in sent[:-1]] for sent in tokenized_sentences], dtype=object)
y_train = np.asarray([[word_to_index[w] for w in sent[1:]] for sent in tokenized_sentences], dtype=object)

Here is an actual training example from our text:

x:
SENTENCE_START what are n't you understanding about this ? !
[0, 51, 27, 16, 10, 856, 53, 25, 34, 69]
y:
what are n't you understanding about this ? ! SENTENCE_END
[51, 27, 16, 10, 856, 53, 25, 34, 69, 1]
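To see the whole preprocessing pipeline on something small enough to inspect by hand, here is a sketch that uses only the standard library. The three-sentence corpus, the toy_ prefixed names and the encode helper are all hypothetical stand-ins; the real code above uses NLTK and the reddit data instead.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized, standing in for the reddit comments.
toy_sentences = [
    ["SENTENCE_START", "the", "cat", "sat", "SENTENCE_END"],
    ["SENTENCE_START", "the", "dog", "sat", "SENTENCE_END"],
    ["SENTENCE_START", "the", "cat", "ran", "SENTENCE_END"],
]
toy_vocabulary_size = 6
unknown = "UNKNOWN_TOKEN"

# Count word frequencies across the whole corpus.
counts = Counter(w for sent in toy_sentences for w in sent)

# Keep the (vocabulary_size - 1) most frequent words, then add the unknown token.
toy_index_to_word = [w for w, _count in counts.most_common(toy_vocabulary_size - 1)]
toy_index_to_word.append(unknown)
toy_word_to_index = {w: i for i, w in enumerate(toy_index_to_word)}


def encode(sent):
    """Map words to indices (rare words become the unknown token) and build x, y."""
    ids = [toy_word_to_index.get(w, toy_word_to_index[unknown]) for w in sent]
    # x drops the last token, y drops the first: y is x shifted by one position.
    return ids[:-1], ids[1:]


x_ids, y_ids = encode(toy_sentences[1])
```

Here "dog" and "ran" each occur only once, so they fall outside the six-word vocabulary and are encoded with the index of UNKNOWN_TOKEN.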

Building the RNN

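For a general overview of RNNs, please refer to the first part of the tutorial.
Let's look concretely at what the RNN for our language model looks like. The input $x$ is a sequence of words (as in the example above), and each $x_{t}$ is a single word. There is one more thing to note, though: because of how matrix multiplication works, we cannot simply use a word index (like 36) as input. Instead, we represent each word as a one-hot vector of size vocabulary_size. For example, the word with index 36 is the vector that is 0 everywhere except for a 1 at position 36. So each $x_{t}$ is a vector, and $x$ is a matrix in which each row represents a word. We will perform this transformation in our neural network code rather than in the preprocessing. The output of the network $o$ has a similar format: each $o_{t}$ is a vector of vocabulary_size elements, and each element is the probability of the corresponding word being the next word in the sentence.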

Recall the RNN equations from the first part of the tutorial:
$s_{t}=\tanh(Ux_{t}+Ws_{t-1})$
$o_{t}=\mathrm{softmax}(Vs_{t})$
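I find it useful to write down the dimensions of the matrices and vectors. Let's assume we pick a vocabulary size $C=8000$ and a hidden layer size $H=100$. You can think of the hidden layer as the memory of the network. Making it bigger lets us learn more complex patterns, but it also results in additional computation. Then we have:
$x_{t} \in \mathbb{R}^{8000}$
$o_{t} \in \mathbb{R}^{8000}$
$s_{t} \in \mathbb{R}^{100}$
$U \in \mathbb{R}^{100\times 8000}$
$V \in \mathbb{R}^{8000\times 100}$
$W \in \mathbb{R}^{100\times 100}$
This is valuable information. Remember that $U, V, W$ are the parameters of the network that we want to learn from data. So we need to learn a total of $2HC+H^{2}$ parameters. With $C=8000, H=100$ that is 1,610,000 parameters. The dimensions also tell us where the bottleneck of our model is. Note that because $x_{t}$ is a one-hot vector, multiplying it with $U$ is the same as selecting a column of $U$, so we do not need to perform the full multiplication. The largest matrix multiplication in our network is then $Vs_{t}$, which is why we want to keep our vocabulary as small as possible.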
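Both claims in the paragraph above are easy to check numerically. The sketch below recomputes the parameter count for $C=8000, H=100$ and then, on a deliberately tiny matrix whose sizes are arbitrary stand-ins, confirms that multiplying by a one-hot vector gives the same result as picking out one column.

```python
import numpy as np

# Sizes quoted in the text.
C, H = 8000, 100
# U is H x C, V is C x H, W is H x H.
num_params = 2 * H * C + H ** 2

# Small stand-in sizes (arbitrary) so the dense product is cheap to inspect.
rng = np.random.default_rng(0)
small_C, small_H, word_index = 12, 4, 7
U_small = rng.uniform(-1.0, 1.0, (small_H, small_C))

# One-hot encoding of the word with index `word_index`.
one_hot = np.zeros(small_C)
one_hot[word_index] = 1.0

full_product = U_small.dot(one_hot)      # touches every entry of U_small
column_lookup = U_small[:, word_index]   # reads a single column
```

The lookup reads $H$ numbers instead of performing $H \times C$ multiplications, which is why the forward pass can index into $U$ directly.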
Armed with this, let's start the implementation.

Initialization

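We start by declaring an RNN class and initializing our parameters. I am calling this class RNNNumpy because we will implement a Theano version later. Initializing the parameters $U, V, W$ is a bit tricky. We cannot just initialize them to 0, because that would result in symmetric calculations in all our layers. We must initialize them randomly. Because proper initialization seems to have an impact on training results, there has been a lot of research in this area. It turns out that the best initialization depends on the activation function (tanh in our case), and one recommended approach is to initialize the weights randomly in the interval $[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}]$, where $n$ is the number of incoming connections from the previous layer. This may sound overly complicated, but don't worry too much about it. As long as you initialize your parameters to small random values, it typically works out fine.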

class RNNNumpy:

    def __init__(self, word_dim, hidden_dim=100, bptt_truncate=4):
        # Assign instance variables
        self.word_dim = word_dim
        self.hidden_dim = hidden_dim
        self.bptt_truncate = bptt_truncate
        # Randomly initialize the network parameters
        self.U = np.random.uniform(-np.sqrt(1./word_dim), np.sqrt(1./word_dim), (hidden_dim, word_dim))
        self.V = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (word_dim, hidden_dim))
        self.W = np.random.uniform(-np.sqrt(1./hidden_dim), np.sqrt(1./hidden_dim), (hidden_dim, hidden_dim))

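Above, word_dim is the size of our vocabulary and hidden_dim is the size of our hidden layer. Don't worry about the bptt_truncate parameter for now; we will explain it later.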

Forward Propagation

Next, let's implement the forward propagation defined by the equations above:

def forward_propagation(self, x):
    # The total number of time steps
    T = len(x)
    # During forward propagation we save all hidden states in s because we need them later.
    # We add one additional element for the initial hidden state, which we set to 0
    s = np.zeros((T + 1, self.hidden_dim))
    s[-1] = np.zeros(self.hidden_dim)
    # The outputs at each time step. Again, we save them for later.
    o = np.zeros((T, self.word_dim))
    # For each time step...
    for t in np.arange(T):
        # Note that we are indexing U by x[t]. This is the same as multiplying U with a one-hot vector.
        s[t] = np.tanh(self.U[:,x[t]] + self.W.dot(s[t-1]))
        o[t] = softmax(self.V.dot(s[t]))
    return [o, s]

RNNNumpy.forward_propagation = forward_propagation
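The forward pass calls a softmax helper whose definition is not shown in this section. A minimal sketch of such a helper is given below; subtracting the maximum before exponentiating is a standard trick for numerical stability and is an assumption on my part, since the helper actually used by the accompanying code may be written differently.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D vector of scores."""
    # Subtracting the maximum does not change the result but avoids overflow in exp.
    shifted = x - np.max(x)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)


probs = softmax(np.array([1.0, 2.0, 3.0]))
# Adding a constant to every score leaves the probabilities unchanged.
large = softmax(np.array([1000.0, 1001.0, 1002.0]))
```

The second call would overflow with a naive exp-and-normalize implementation, yet it returns the same probabilities as the first.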

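We return not only the calculated outputs but also the hidden states. We will use them later to calculate the gradients, and returning them here avoids duplicate computation. Each $o_{t}$ is a vector of probabilities over all the words in our vocabulary, but sometimes, for example when evaluating the model, all we want is the next word with the highest probability. We call this function predict: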

def predict(self, x):
    # Perform forward propagation and return the index of the highest score
    o, s = self.forward_propagation(x)
    return np.argmax(o, axis=1)

RNNNumpy.predict = predict

Let's try our newly implemented methods and look at an example output:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
o, s = model.forward_propagation(X_train[10])
print(o.shape)
print(o)

(45, 8000)
[[ 0.00012408  0.0001244   0.00012603 ...,  0.00012515  0.00012488
   0.00012508]
 [ 0.00012536  0.00012582  0.00012436 ...,  0.00012482  0.00012456
   0.00012451]
 [ 0.00012387  0.0001252   0.00012474 ...,  0.00012559  0.00012588
   0.00012551]
 ...,
 [ 0.00012414  0.00012455  0.0001252  ...,  0.00012487  0.00012494
   0.0001263 ]
 [ 0.0001252   0.00012393  0.00012509 ...,  0.00012407  0.00012578
   0.00012502]
 [ 0.00012472  0.0001253   0.00012487 ...,  0.00012463  0.00012536
   0.00012665]]

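For each word in the sentence (45 in the example above), our model made 8000 predictions representing the probabilities of the next word. Note that because we initialized $U, V, W$ to random values, these predictions are completely random right now. The following gives the index of the highest-probability prediction for each word: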

predictions = model.predict(X_train[10])
print(predictions.shape)
print(predictions)

(45,)
[1284 5221 7653 7430 1013 3562 7366 4860 2212 6601 7299 4556 2481 238 2539
 221 6548 261 1780 2005 1810 5376 4146 477 7051 4832 4991 897 3485 21
 7291 2007 6006 760 4864 2182 6569 2800 2752 6821 4437 7021 7875 6912 3575]

Calculating the Loss

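To train our network we need a way to measure the errors it makes. We call this the loss function $L$, and our goal is to find the parameters $U, V, W$ that minimize the loss function on our training data. A common choice is the cross-entropy loss. If we have $N$ training examples (words in our text) and $C$ classes (the size of our vocabulary), then the loss with respect to our predictions $o$ and the true labels $y$ is given by:
$L(y, o)=-\frac{1}{N}\sum_{n\in N}{y_{n}\log o_{n}}$
The formula looks a bit complicated, but all it does is sum over our training examples and add to the loss based on how far off our predictions are. The further apart $y$ (the correct words) and $o$ (our predictions) are, the greater the loss. We implement the function calculate_loss as follows: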

def calculate_total_loss(self, x, y):
    L = 0
    # For each sentence...
    for i in np.arange(len(y)):
        o, s = self.forward_propagation(x[i])
        # We only care about our prediction of the "correct" words
        correct_word_predictions = o[np.arange(len(y[i])), y[i]]
        # Add to the loss based on how off we were
        L += -1 * np.sum(np.log(correct_word_predictions))
    return L

def calculate_loss(self, x, y):
    # Divide the total loss by the number of training examples (words)
    N = sum(len(y_i) for y_i in y)
    return self.calculate_total_loss(x,y)/N

RNNNumpy.calculate_total_loss = calculate_total_loss
RNNNumpy.calculate_loss = calculate_loss

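Let's take a step back and think about what the loss should be for random predictions. That gives us a baseline and helps make sure our implementation is correct. We have $C$ words in our vocabulary, so each word should be predicted, on average, with probability $1/C$, which yields a loss of $L=-\frac{1}{N}N\log\frac{1}{C}=\log C$: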

# Limit to 1000 examples to save time
print("Expected Loss for random predictions: %f" % np.log(vocabulary_size))
print("Actual loss: %f" % model.calculate_loss(X_train[:1000], y_train[:1000]))

Expected Loss for random predictions: 8.987197
Actual loss: 8.987440

Pretty close! Keep in mind that evaluating the loss on the full dataset is an expensive operation and can take hours if you have a lot of data.
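The baseline argument can also be reproduced without the model at all. In the sketch below, the cross_entropy helper and the target indices are hypothetical; the point is only that a perfectly uniform prediction over $C=8000$ classes gives an average cross-entropy of exactly $\log C$, whichever words happen to be correct.

```python
import math


def cross_entropy(predicted_probs, targets):
    """Average negative log-probability assigned to the correct classes.

    predicted_probs: one list of class probabilities per example
    targets: the index of the correct class for each example
    """
    total = 0.0
    for probs, target in zip(predicted_probs, targets):
        total += -math.log(probs[target])
    return total / len(targets)


toy_C = 8000
uniform = [1.0 / toy_C] * toy_C          # every word gets probability 1/C
toy_targets = [3, 17, 4096, 7999]        # arbitrary "correct" word indices
baseline_loss = cross_entropy([uniform] * len(toy_targets), toy_targets)
```

The freshly initialized network is not exactly uniform, which is why its measured loss is close to, but not identical with, this value.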

Training the RNN with SGD and BPTT

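Remember that we want to find the parameters $U, V, W$ that minimize the loss on the training data. The most common way to do this is SGD, stochastic gradient descent. The idea behind SGD is simple: we iterate over all our training examples, and during each iteration we nudge the parameters in a direction that reduces the error. These directions are given by the gradients of the loss: $\frac{\partial L}{\partial U} , \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$. SGD also needs a learning rate, which defines how big a step we take in each iteration. SGD is the most popular optimization method not only for neural networks but also for many other machine learning algorithms. As a result, there has been a lot of research on how to optimize SGD using batching, parallelism and adaptive learning rates. Even though the basic idea is simple, implementing SGD in a really efficient way can become very complex. Because SGD is so widely used, there is plenty of good material about it online, so I won't repeat it here. I will implement a simple version of SGD that should be understandable even without a background in optimization.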
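Before applying the idea to the RNN, it may help to watch SGD work on a problem with a single parameter. The sketch below is a toy of my own choosing, not part of the tutorial's code: it fits one weight $w$ so that $w \cdot x$ matches $y = 3x$, using the per-example squared error and the same update rule, parameter minus learning rate times gradient.

```python
import random

# Hypothetical toy data generated from y = 3 * x, so the ideal weight is 3.
toy_data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, -1.0, -0.5]]


def sgd_fit(data, learning_rate=0.05, nepoch=50, seed=0):
    """Fit a single weight w with plain SGD, one example per update."""
    rng = random.Random(seed)
    w = 0.0
    for _epoch in range(nepoch):
        examples = list(data)
        rng.shuffle(examples)  # visit the examples in a random order each epoch
        for x, y in examples:
            # Loss for one example: (w*x - y)^2; its gradient w.r.t. w: 2*(w*x - y)*x
            grad = 2.0 * (w * x - y) * x
            # Step against the gradient, scaled by the learning rate.
            w -= learning_rate * grad
    return w


fitted_w = sgd_fit(toy_data)
```

With a learning rate that is too large the same loop diverges instead of converging, which is one reason the training loop later in this post lowers the learning rate when the loss goes up.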

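But how do we calculate the gradients mentioned above? In a traditional neural network we do this with the backpropagation algorithm. In RNNs we use a slightly modified version of this algorithm called Backpropagation Through Time (BPTT). Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on those of all previous time steps. If you know calculus, it really is just an application of the chain rule. The next part of the tutorial is all about BPTT, so I won't go into a detailed derivation here. Detailed introductions to backpropagation in general are easy to find if you want more background. For now you can treat BPTT as a black box that takes a training example $(x, y)$ as input and returns the gradients $\frac{\partial L}{\partial U} , \frac{\partial L}{\partial V}, \frac{\partial L}{\partial W}$.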

def bptt(self, x, y):
    T = len(y)
    # Perform forward propagation
    o, s = self.forward_propagation(x)
    # We accumulate the gradients in these variables
    dLdU = np.zeros(self.U.shape)
    dLdV = np.zeros(self.V.shape)
    dLdW = np.zeros(self.W.shape)
    # delta_o = o - one_hot(y): gradient of the loss w.r.t. the softmax input.
    # Note that this modifies o in place.
    delta_o = o
    delta_o[np.arange(len(y)), y] -= 1.
    # For each output backwards...
    for t in np.arange(T)[::-1]:
        dLdV += np.outer(delta_o[t], s[t].T)
        # Initial delta calculation
        delta_t = self.V.T.dot(delta_o[t]) * (1 - (s[t] ** 2))
        # Backpropagation through time (for at most self.bptt_truncate steps)
        for bptt_step in np.arange(max(0, t-self.bptt_truncate), t+1)[::-1]:
            # print("Backpropagation step t=%d bptt step=%d " % (t, bptt_step))
            dLdW += np.outer(delta_t, s[bptt_step-1])
            dLdU[:,x[bptt_step]] += delta_t
            # Update delta for next step
            delta_t = self.W.T.dot(delta_t) * (1 - s[bptt_step-1] ** 2)
    return [dLdU, dLdV, dLdW]

RNNNumpy.bptt = bptt

Gradient Checking

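Whenever you implement backpropagation it is a good idea to also implement gradient checking, which is a way of verifying that your implementation is correct. The idea behind gradient checking is that the derivative with respect to a parameter equals the slope of the loss at that point, which we can approximate by changing the parameter by a small amount $h$ and dividing the resulting change in the loss by the size of the change:
$\frac{\partial L}{\partial \theta} = \lim_{h\rightarrow 0}{\frac{L(\theta +h)-L(\theta -h)}{2h}}$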

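We then compare the gradient calculated using backpropagation with the gradient estimated by the method above. If there is no large difference, the gradient is correct. The approximation needs to calculate the total loss for every single parameter, so gradient checking is very expensive (remember, we have more than a million parameters in the example above). It is therefore best to perform it on a model with a smaller vocabulary.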
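The procedure is easier to follow on a function whose gradient can be written down by hand. The sketch below uses a made-up three-parameter loss, not the RNN loss, and compares its analytic gradient with the central-difference estimate using the same relative error measure, $|a-b|/(|a|+|b|)$, that the RNN gradient check uses.

```python
def loss(theta):
    """Toy loss: sum of squares plus one cross term between the first two parameters."""
    return sum(t * t for t in theta) + theta[0] * theta[1]


def analytic_gradient(theta):
    """Gradient of the toy loss, derived by hand."""
    grad = [2.0 * t for t in theta]
    grad[0] += theta[1]
    grad[1] += theta[0]
    return grad


def numerical_gradient(f, theta, h=1e-4):
    """Central-difference estimate (f(theta+h) - f(theta-h)) / (2h), one parameter at a time."""
    grad = []
    for i in range(len(theta)):
        plus = list(theta)
        plus[i] += h
        minus = list(theta)
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2.0 * h))
    return grad


theta0 = [0.5, -1.5, 2.0]
exact = analytic_gradient(theta0)
estimate = numerical_gradient(loss, theta0)
relative_errors = [abs(a - b) / (abs(a) + abs(b)) for a, b in zip(exact, estimate)]
```

Each parameter costs two evaluations of the loss, which is the reason the check becomes so expensive for a model with more than a million parameters.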

import operator

def gradient_check(self, x, y, h=0.001, error_threshold=0.01):
    # Calculate the gradients using backpropagation. We want to check if these are correct.
    bptt_gradients = self.bptt(x, y)
    # List of all parameters we want to check.
    model_parameters = ['U', 'V', 'W']
    # Gradient check for each parameter
    for pidx, pname in enumerate(model_parameters):
        # Get the actual parameter value from the model, e.g. model.W
        parameter = operator.attrgetter(pname)(self)
        print("Performing gradient check for parameter %s with size %d." % (pname, np.prod(parameter.shape)))
        # Iterate over each element of the parameter matrix, e.g. (0,0), (0,1), ...
        it = np.nditer(parameter, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            ix = it.multi_index
            # Save the original value so we can reset it later
            original_value = parameter[ix]
            # Estimate the gradient using (f(x+h) - f(x-h))/(2*h)
            parameter[ix] = original_value + h
            gradplus = self.calculate_total_loss([x],[y])
            parameter[ix] = original_value - h
            gradminus = self.calculate_total_loss([x],[y])
            estimated_gradient = (gradplus - gradminus)/(2*h)
            # Reset parameter to original value
            parameter[ix] = original_value
            # The gradient for this parameter calculated using backpropagation
            backprop_gradient = bptt_gradients[pidx][ix]
            # Calculate the relative error: (|x - y|/(|x| + |y|))
            relative_error = np.abs(backprop_gradient - estimated_gradient)/(np.abs(backprop_gradient) + np.abs(estimated_gradient))
            # If the error is too large, fail the gradient check
            if relative_error > error_threshold:
                print("Gradient Check ERROR: parameter=%s ix=%s" % (pname, ix))
                print("+h Loss: %f" % gradplus)
                print("-h Loss: %f" % gradminus)
                print("Estimated_gradient: %f" % estimated_gradient)
                print("Backpropagation gradient: %f" % backprop_gradient)
                print("Relative Error: %f" % relative_error)
                return
            it.iternext()
        print("Gradient check for parameter %s passed." % (pname))

RNNNumpy.gradient_check = gradient_check

# To avoid performing millions of expensive calculations we use a smaller vocabulary size for checking.
grad_check_vocab_size = 100
np.random.seed(10)
model = RNNNumpy(grad_check_vocab_size, 10, bptt_truncate=1000)
model.gradient_check([0,1,2,3], [1,2,3,4])

SGD Implementation

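Now that we are able to calculate the gradients for our parameters, we can implement SGD. I like to do this in two steps: 1. a function sgd_step that calculates the gradients and performs the update for one batch; 2. an outer loop that iterates through the training set and adjusts the learning rate.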

import sys
from datetime import datetime

# Performs one step of SGD.
def numpy_sgd_step(self, x, y, learning_rate):
    # Calculate the gradients
    dLdU, dLdV, dLdW = self.bptt(x, y)
    # Change parameters according to gradients and learning rate
    self.U -= learning_rate * dLdU
    self.V -= learning_rate * dLdV
    self.W -= learning_rate * dLdW

RNNNumpy.sgd_step = numpy_sgd_step

# Outer SGD Loop
# - model: The RNN model instance
# - X_train: The training data set
# - y_train: The training data labels
# - learning_rate: Initial learning rate for SGD
# - nepoch: Number of times to iterate through the complete dataset
# - evaluate_loss_after: Evaluate the loss after this many epochs
def train_with_sgd(model, X_train, y_train, learning_rate=0.005, nepoch=100, evaluate_loss_after=5):
    # We keep track of the losses so we can plot them later
    losses = []
    num_examples_seen = 0
    for epoch in range(nepoch):
        # Optionally evaluate the loss
        if (epoch % evaluate_loss_after == 0):
            loss = model.calculate_loss(X_train, y_train)
            losses.append((num_examples_seen, loss))
            time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            print("%s: Loss after num_examples_seen=%d epoch=%d: %f" % (time, num_examples_seen, epoch, loss))
            # Adjust the learning rate if loss increases
            if (len(losses) > 1 and losses[-1][1] > losses[-2][1]):
                learning_rate = learning_rate * 0.5
                print("Setting learning rate to %f" % learning_rate)
            sys.stdout.flush()
        # For each training example...
        for i in range(len(y_train)):
            # One SGD step
            model.sgd_step(X_train[i], y_train[i], learning_rate)
            num_examples_seen += 1
    # Return the recorded losses so that callers can inspect or plot them
    return losses

Done! Let's run an experiment to get a sense of how long it would take to train our network:

np.random.seed(10)
model = RNNNumpy(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

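On my laptop, one step of SGD takes approximately 350 milliseconds. Our training set has 80,000 examples, so one epoch (an iteration over the whole dataset) would take several hours, and multiple epochs would take days or even weeks. And we are still working with a small dataset compared to what many companies and researchers use. What now?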

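Fortunately there are many ways to speed up our code. We could stick with the same model and make our code run faster, or we could modify our model to be less computationally expensive. Researchers have identified many ways to make models less computationally expensive, for example by using a hierarchical softmax or by adding projection layers to avoid the large matrix multiplications. But I want to keep our model simple, so I choose the first route: making our implementation run faster using a GPU. Before doing that, though, let's run SGD on a small dataset and check that the loss actually decreases: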

np.random.seed(10)
# Train on a small subset of the data to see what happens
model = RNNNumpy(vocabulary_size)
losses = train_with_sgd(model, X_train[:100], y_train[:100], nepoch=10, evaluate_loss_after=1)

2015-09-30 10:08:19: Loss after num_examples_seen=0 epoch=0: 8.987425
2015-09-30 10:08:35: Loss after num_examples_seen=100 epoch=1: 8.976270
2015-09-30 10:08:50: Loss after num_examples_seen=200 epoch=2: 8.960212
2015-09-30 10:09:06: Loss after num_examples_seen=300 epoch=3: 8.930430
2015-09-30 10:09:22: Loss after num_examples_seen=400 epoch=4: 8.862264
2015-09-30 10:09:38: Loss after num_examples_seen=500 epoch=5: 6.913570
2015-09-30 10:09:53: Loss after num_examples_seen=600 epoch=6: 6.302493
2015-09-30 10:10:07: Loss after num_examples_seen=700 epoch=7: 6.014995
2015-09-30 10:10:24: Loss after num_examples_seen=800 epoch=8: 5.833877
2015-09-30 10:10:39: Loss after num_examples_seen=900 epoch=9: 5.710718

Good, it seems our implementation is at least doing something useful and decreasing the loss, just as we wanted.

Training Our Network with Theano and the GPU

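I have previously written a tutorial on Theano, and since the logic of our code stays exactly the same, I won't go through the optimized code again here. I defined an RNNTheano class that replaces the numpy calculations with the corresponding calculations in Theano. Just like the rest of this post, the code is available on Github.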

np.random.seed(10)
model = RNNTheano(vocabulary_size)
%timeit model.sgd_step(X_train[10], y_train[10], 0.005)

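This time, one SGD step takes 70ms on my Mac (without a GPU) and 23ms on an Amazon EC2 instance with a GPU. That is a 15x improvement over our initial implementation and means we can train our model in hours or days. There are still many optimizations we could make, but this is good enough for now.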
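The timings quoted in this post can be turned into rough per-epoch costs with a little arithmetic. The sketch below assumes the 80,000 training examples mentioned earlier and treats every SGD step as taking the quoted time, which ignores differences in sentence length and the cost of evaluating the loss; the dictionary keys are labels I made up.

```python
num_examples = 80000  # training set size quoted in the text

# Seconds per SGD step for the three setups mentioned in the post.
step_seconds = {"numpy": 0.350, "theano_cpu": 0.070, "theano_gpu": 0.023}

# Hours needed for one pass over the training set at each per-step cost.
epoch_hours = {name: num_examples * sec / 3600.0 for name, sec in step_seconds.items()}

# Speed-up of each Theano timing relative to the NumPy implementation.
speedup_cpu = step_seconds["numpy"] / step_seconds["theano_cpu"]
speedup_gpu = step_seconds["numpy"] / step_seconds["theano_gpu"]
```

Under these assumptions one epoch takes roughly 7.8 hours with the NumPy version, about 1.6 hours with Theano on the CPU and about half an hour on the GPU, and the GPU figure corresponds to the 15x speed-up mentioned above.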

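To spare you from spending days training a model, I have pre-trained a Theano model with a hidden layer dimensionality of 50 and a vocabulary size of 8000. I trained it for 50 epochs in about 20 hours. The loss was still decreasing, and training longer would probably have resulted in a better model, but I was running out of time and wanted to publish this post. Feel free to train for longer yourself. You can find the model parameters in the file data/trained-model-theano.npz on Github and load them using the load_model_parameters_theano method: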

from utils import load_model_parameters_theano, save_model_parameters_theano

model = RNNTheano(vocabulary_size, hidden_dim=50)
# losses = train_with_sgd(model, X_train, y_train, nepoch=50)
# save_model_parameters_theano('./data/trained-model-theano.npz', model)
load_model_parameters_theano('./data/trained-model-theano.npz', model)
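The two helpers imported from utils are not shown in this post. As a rough illustration of what storing parameters in an .npz archive involves, the sketch below round-trips three small random matrices through NumPy's own savez and load functions. The key names U, V, W, the matrix sizes and the in-memory buffer are assumptions made for the example; the real helpers may organize the archive differently.

```python
import io

import numpy as np

# Small random stand-ins for the model parameters (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
toy_params = {
    "U": rng.uniform(-1, 1, (5, 20)),
    "V": rng.uniform(-1, 1, (20, 5)),
    "W": rng.uniform(-1, 1, (5, 5)),
}

# An in-memory buffer stands in for a file on disk such as 'model.npz'.
buffer = io.BytesIO()
np.savez(buffer, **toy_params)   # each keyword becomes a named array in the archive
buffer.seek(0)                   # rewind before reading the archive back

with np.load(buffer) as archive:
    restored = {name: archive[name] for name in archive.files}
```

Passing a file path instead of the buffer to np.savez and np.load works the same way.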

Generating Text

Now that we have our model we can ask it to generate new text for us. Let's implement a helper function to generate new sentences:

def generate_sentence(model):
    # We start the sentence with the start token
    new_sentence = [word_to_index[sentence_start_token]]
    # Repeat until we get an end token
    while not new_sentence[-1] == word_to_index[sentence_end_token]:
        # Assumes forward_propagation returns only the output probabilities here
        # (RNNNumpy above returns [o, s] instead)
        next_word_probs = model.forward_propagation(new_sentence)
        sampled_word = word_to_index[unknown_token]
        # We don't want to sample unknown words
        while sampled_word == word_to_index[unknown_token]:
            samples = np.random.multinomial(1, next_word_probs[-1])
            sampled_word = np.argmax(samples)
        new_sentence.append(sampled_word)
    sentence_str = [index_to_word[x] for x in new_sentence[1:-1]]
    return sentence_str

num_sentences = 10
senten_min_length = 7

for i in range(num_sentences):
    sent = []
    # We want long sentences, not sentences with one or two words
    while len(sent) < senten_min_length:
        sent = generate_sentence(model)
    print(" ".join(sent))
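The sampling step inside generate_sentence can be studied in isolation. The sketch below draws from a made-up distribution over four words and rejects one forbidden index, which plays the role of UNKNOWN_TOKEN. The function name and the probabilities are hypothetical, and it uses NumPy's newer Generator API rather than the np.random.multinomial call in the code above.

```python
import numpy as np


def sample_next_word(probs, forbidden_index, rng):
    """Draw one index from probs, redrawing while the forbidden index comes up."""
    while True:
        # A single multinomial trial returns a one-hot count vector, e.g. [0, 0, 1, 0].
        one_hot_draw = rng.multinomial(1, probs)
        index = int(np.argmax(one_hot_draw))
        if index != forbidden_index:
            return index


rng = np.random.default_rng(42)
toy_probs = np.array([0.1, 0.2, 0.3, 0.4])  # made-up distribution over 4 words
draws = [sample_next_word(toy_probs, forbidden_index=3, rng=rng) for _ in range(2000)]

# Empirical frequency of each index among the accepted draws.
frequencies = np.bincount(draws, minlength=4) / len(draws)
```

Rejecting an index in this way is equivalent to sampling from the remaining probabilities after renormalizing them, so the three allowed words keep their relative proportions.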

Here are a few selected sentences (with capitalization added):

Anyway, to the city scene you’re an idiot teenager.
What ? ! ! ! ! ignore!
Screw fitness, you’re saying: https
Thanks for the advice to keep my thoughts around girls.
Yep, please disappear with the terrible generation.
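Looking at the generated sentences, there are a few interesting things to note. The model successfully learned syntax: it places commas properly and ends sentences with punctuation. Sometimes it mimics internet speech, such as multiple exclamation marks or smileys.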

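However, the vast majority of the generated sentences don't make sense or have grammatical errors. One reason could be that we did not train our network long enough. That may be true, but it is most likely not the main reason. Our vanilla RNN cannot generate meaningful text because it is unable to learn dependencies between words that are several steps apart. That is also why RNNs failed to gain popularity when they were first invented. They were beautiful in theory but did not work well in practice, and it was not immediately clear why.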

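Fortunately, the difficulties in training RNNs are much better understood now. In the next part of this tutorial I will explore the BPTT algorithm in more detail and explain what is called the vanishing gradient problem. This will motivate our move to more sophisticated RNN models, such as LSTMs, which currently achieve the best results on many NLP tasks. Everything in this tutorial also applies to LSTMs and other RNN models, so don't feel discouraged if the results of a vanilla RNN are worse than you expected.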

That's it for now. Please leave questions or feedback in the comments, and don't forget to check out the code.

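PS: This second part of the tutorial is quite long and took me a good while to translate. Corrections to any passages that are not translated well are welcome.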