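Preface

This article follows 6_lstm.ipynb, the LSTM tutorial code in the TensorFlow GitHub repository. The code both implements an LSTM and exercises TensorFlow itself, so it kills two birds with one stone.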

Source code: https://github.com/Salon-sai/learning-tensorflow/tree/master/lesson4

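A small aside

I originally wanted to try NLP, but I have been busy with project work lately (I badly need a new computer, and I joke about taking on disreputable side jobs to pay for it), and I have been weighed down with worries, restless, without appetite, and losing sleep. That is why this article took so long to finish.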

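LSTM theory

There is a very good article on Jianshu whose figures and formulas are worth consulting: [Translation] Understanding LSTM Networks. The LSTM paper: https://arxiv.org/pdf/1402.1128v1.pdf

In essence, an LSTM forgets part of the earlier text and remembers the current input. An LSTM is not a complete RNN by itself; it only improves the hidden layer of the RNN. That hidden layer is carefully designed around three gates (forget, input, output) and a cell state.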

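Traditional RNN

As the gap between the relevant information and the place where it is needed grows, an RNN loses the ability to learn connections that far apart. (I believe this is related to the vanishing gradient problem: in a very deep network the gradient shrinks step by step, so the earlier cells receive almost no learning signal from later content and can only learn from nearby information.) LSTM is designed to avoid this problem.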

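LSTM hidden layer

(I have not worked through a proof of the formulas, so what follows is my own reading.) An LSTM cell passes two things on as input to the next step: the state and h_t.

Hands-on code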

config.py


# config.py
# -*- coding: utf-8 -*-
import string


class ModelConfig(object):
    def __init__(self):
        self.num_unrollings = 10  # number of characters unrolled per training string
        self.batch_size = 64  # number of strings per batch
        # number of distinct characters (26 lowercase letters plus the space)
        self.vocabulary_size = len(string.ascii_lowercase) + 1
        self.summary_frequency = 100  # report progress every this many steps
        self.num_steps = 7001  # number of training steps
        self.num_nodes = 64  # number of hidden units


config = ModelConfig()

As shown above, config.py stores the shared settings.

handle_data.py


# -*- coding: utf-8 -*-
import tensorflow as tf
import string
import zipfile
import numpy as np

first_letter = ord(string.ascii_lowercase[0])


class LoadData(object):
    def __init__(self, valid_size=1000):
        self.text = self._read_data()
        self.valid_text = self.text[:valid_size]
        self.train_text = self.text[valid_size:]

    def _read_data(self, filename='text8.zip'):
        with zipfile.ZipFile(filename) as f:
            # read the first file in the archive
            name = f.namelist()[0]
            print('file name : %s ' % name)
            data = tf.compat.as_str(f.read(name))
        return data


def char2id(char):
    # map a character to its id: space -> 0, 'a'..'z' -> 1..26
    if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
    elif char == ' ':
        return 0
    else:
        print("Unexpected character: %s " % char)
        return 0


def id2char(dictid):
    # map an id back to its character
    if dictid > 0:
        return chr(dictid + first_letter - 1)
    else:
        return ' '


def characters(probabilities):
    # pick the most likely character for each row of probabilities
    return [id2char(c) for c in np.argmax(probabilities, 1)]


def batches2string(batches):
    # turn a list of batches back into strings, to check them against the original text
    s = [''] * batches[0].shape[0]
    for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
    return s

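A reminder: the data I use is text8.zip, which you can download yourself. LoadData extracts the text from the archive and splits it into train_text and valid_text. The char2id and id2char helpers defined here are used later on.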
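To make the character mapping concrete, here is a minimal standalone sketch of the same id scheme (space is 0, 'a' to 'z' are 1 to 26). The encode and decode helpers are hypothetical names added only for this illustration; they are not part of the original code.

```python
import string

FIRST_LETTER = ord(string.ascii_lowercase[0])


def char2id(char):
    # space -> 0, 'a'..'z' -> 1..26, anything else falls back to 0
    if char in string.ascii_lowercase:
        return ord(char) - FIRST_LETTER + 1
    return 0


def id2char(dictid):
    # inverse mapping: 0 -> space, 1..26 -> 'a'..'z'
    if dictid > 0:
        return chr(dictid + FIRST_LETTER - 1)
    return ' '


def encode(text):
    # hypothetical helper: string -> list of ids
    return [char2id(c) for c in text]


def decode(ids):
    # hypothetical helper: list of ids -> string
    return ''.join(id2char(i) for i in ids)


ids = encode("ab z")
print(ids)          # [1, 2, 0, 26]
print(decode(ids))  # ab z
```

Because unknown characters collapse to 0, the round trip is only lossless for text made of lowercase letters and spaces, which is what text8 contains.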

BatchGenerator.py


# -*- coding: utf-8 -*-
import numpy as np
# handleData is the module listed above as handle_data.py; match this to your file name
from handleData import char2id
from config import config


class BatchGenerator(object):
    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        # distance between the starting points of neighbouring strings
        segment = self._text_size // self._batch_size
        # current position of each string in the text
        self._cursor = [offset * segment for offset in range(self._batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        """Generate one batch from the current cursor positions; its shape is (batch_size, 27)."""
        batch = np.zeros(shape=(self._batch_size, config.vocabulary_size), dtype=np.float64)
        for b in range(self._batch_size):
            # one-hot encode the character under cursor b, then advance the cursor
            batch[b, char2id(self._text[self._cursor[b]])] = 1.0
            self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch

    def next(self):
        # The last batch of the previous call is prepended, so each string
        # produced by this call is num_unrollings + 1 characters long.
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches

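This is a batch generator. Given batch_size and num_unrollings, each call to next yields batch_size strings of num_unrollings + 1 characters. The class can look convoluted; running the batches2string function from handleData on its output makes it much easier to follow.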


train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)

print(batches2string(train_batches.next()))
print(batches2string(train_batches.next()))
print(batches2string(valid_batches.next()))
print(batches2string(valid_batches.next()))

It prints something like this:

['ons anarchi', 'when milita', 'lleria arch', ' abbeys and', 'married urr', 'hel and ric', 'y and litur', 'ay opened f', 'tion from t', 'migration t', 'new york ot', 'he boeing s', 'e listed wi', 'eber has pr', 'o be made t', 'yer who rec', 'ore signifi', 'a fierce cr', ' two six ei', 'aristotle s', 'ity can be ', ' and intrac', 'tion of the', 'dy to pass ', 'f certain d', 'at it will ', 'e convince ', 'ent told hi', 'ampaign and', 'rver side s', 'ious texts ', 'o capitaliz', 'a duplicate', 'gh ann es d', 'ine january', 'ross zero t', 'cal theorie', 'ast instanc', ' dimensiona', 'most holy m', 't s support', 'u is still ', 'e oscillati', 'o eight sub', 'of italy la', 's the tower', 'klahoma pre', 'erprise lin', 'ws becomes ', 'et in a naz', 'the fabian ', 'etchy to re', ' sharman ne', 'ised empero', 'ting in pol', 'd neo latin', 'th risky ri', 'encyclopedi', 'fense the a', 'duating fro', 'treet grid ', 'ations more', 'appeal of d', 'si have mad']
['ists advoca', 'ary governm', 'hes nationa', 'd monasteri', 'raca prince', 'chard baer ', 'rgical lang', 'for passeng', 'the nationa', 'took place ', 'ther well k', 'seven six s', 'ith a gloss', 'robably bee', 'to recogniz', 'ceived the ', 'icant than ', 'ritic of th', 'ight in sig', 's uncaused ', ' lost as in', 'cellular ic', 'e size of t', ' him a stic', 'drugs confu', ' take to co', ' the priest', 'im to name ', 'd barred at', 'standard fo', ' such as es', 'ze on the g', 'e of the or', 'd hiver one', 'y eight mar', 'the lead ch', 'es classica', 'ce the non ', 'al analysis', 'mormons bel', 't or at lea', ' disagreed ', 'ing system ', 'btypes base', 'anguages th', 'r commissio', 'ess one nin', 'nux suse li', ' the first ', 'zi concentr', ' society ne', 'elatively s', 'etworks sha', 'or hirohito', 'litical ini', 'n most of t', 'iskerdoo ri', 'ic overview', 'air compone', 'om acnm acc', ' centerline', 'e than any ', 'devotional ', 'de such dev']
[' a']
['an']

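You get a list of batch_size strings, each num_unrollings + 1 characters long. Look closely and you will also notice that neighbouring strings start segment characters apart in the text, where segment = text_size // batch_size. The _next_batch function produces an array holding one character (as a one-hot row) for each of the batch_size positions, with the positions spaced segment apart. The next function then calls it num_unrollings times in order and prepends the last array from the previous call. Joining these arrays gives the output printed above. In effect we now have an object that generates the dataset iteratively. The code of this class is well worth studying (stepping through it in a debugger helps).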
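The cursor logic is easier to see on a tiny input. The sketch below is a simplified stand-in, not the original class: ToyBatchGenerator and to_strings are hypothetical names, and it yields raw characters instead of one-hot rows so the result can be read directly.

```python
class ToyBatchGenerator:
    """Same cursor logic as BatchGenerator, but with plain characters."""

    def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = len(text) // batch_size
        # one cursor per string, spaced `segment` characters apart
        self._cursor = [offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()

    def _next_batch(self):
        # take one character at every cursor, then advance each cursor
        batch = []
        for b in range(self._batch_size):
            batch.append(self._text[self._cursor[b]])
            self._cursor[b] = (self._cursor[b] + 1) % len(self._text)
        return batch

    def next(self):
        # prepend the last batch of the previous call
        batches = [self._last_batch]
        for _ in range(self._num_unrollings):
            batches.append(self._next_batch())
        self._last_batch = batches[-1]
        return batches


def to_strings(batches):
    # join the i-th character of every batch into the i-th string
    return [''.join(chars) for chars in zip(*batches)]


gen = ToyBatchGenerator("abcdefghijklmnopqrstuvwx", batch_size=4, num_unrollings=3)
first = to_strings(gen.next())
second = to_strings(gen.next())
print(first)   # ['abcd', 'ghij', 'mnop', 'stuv']
print(second)  # ['defg', 'jklm', 'pqrs', 'vwxa']
```

Each string in the second call starts with the last character of the matching string in the first call, and the last cursor wraps around to the start of the text.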

sample.py


# -*- coding: utf-8 -*-
import random
import numpy as np
from config import config


def sample_distribution(distribution):
    # draw one index at random according to a probability distribution
    r = random.uniform(0, 1)
    s = 0
    for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
            return i
    return len(distribution) - 1


def sample(prediction):
    # sample one character from the prediction and return it as a one-hot vector
    p = np.zeros(shape=[1, config.vocabulary_size], dtype=np.float64)
    p[0, sample_distribution(prediction[0])] = 1.0
    return p


def random_distribution():
    # generate a random probability vector of shape 1 x 27
    b = np.random.uniform(0.0, 1.0, size=[1, config.vocabulary_size])
    return b / np.sum(b, 1)[:, None]

lstm_model.py


# -*- coding: utf-8 -*-
# Uses the TensorFlow 1.x graph API (tf.variable_scope, tf.get_variable).
import tensorflow as tf
from config import config


class LSTM_Cell(object):
    def __init__(self, train_data, train_label, num_nodes=64):
        # input gate: input weights, previous-output weights, bias
        with tf.variable_scope("input", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as input_layer:
            self.ix, self.im, self.ib = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # memory cell: the candidate update of the state
        with tf.variable_scope("memory", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as update_layer:
            self.cx, self.cm, self.cb = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # forget gate
        with tf.variable_scope("forget", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as forget_layer:
            self.fx, self.fm, self.fb = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])
        # output gate
        with tf.variable_scope("output", initializer=tf.truncated_normal_initializer(-0.1, 0.1)) as output_layer:
            self.ox, self.om, self.ob = self._generate_w_b(
                x_weights_size=[config.vocabulary_size, num_nodes],
                m_weights_size=[num_nodes, num_nodes],
                biases_size=[1, num_nodes])

        # classifier weights and biases
        self.w = tf.Variable(tf.truncated_normal([num_nodes, config.vocabulary_size], -0.1, 0.1))
        self.b = tf.Variable(tf.zeros([config.vocabulary_size]))

        # output and state carried over from one run of the unrolled graph to the next
        self.saved_output = tf.Variable(tf.zeros([config.batch_size, num_nodes]), trainable=False)
        self.saved_state = tf.Variable(tf.zeros([config.batch_size, num_nodes]), trainable=False)

        self.train_data = train_data
        self.train_label = train_label

    def _generate_w_b(self, x_weights_size, m_weights_size, biases_size):
        x_w = tf.get_variable("x_weights", x_weights_size)
        m_w = tf.get_variable("m_weights", m_weights_size)
        b = tf.get_variable("biases", biases_size, initializer=tf.constant_initializer(0.0))
        return x_w, m_w, b

    def _run(self, input, output, state):
        forget_gate = tf.sigmoid(tf.matmul(input, self.fx) + tf.matmul(output, self.fm) + self.fb)
        input_gate = tf.sigmoid(tf.matmul(input, self.ix) + tf.matmul(output, self.im) + self.ib)
        update = tf.matmul(input, self.cx) + tf.matmul(output, self.cm) + self.cb
        # element-wise products, not matrix multiplications
        state = state * forget_gate + tf.tanh(update) * input_gate
        output_gate = tf.sigmoid(tf.matmul(input, self.ox) + tf.matmul(output, self.om) + self.ob)
        return output_gate * tf.tanh(state), state

    def loss_func(self):
        outputs = list()
        output = self.saved_output
        state = self.saved_state
        for i in self.train_data:
            output, state = self._run(i, output, state)
            outputs.append(output)
        # at this point outputs holds num_unrollings tensors
        # make sure the final output and state are saved before the loss is computed
        with tf.control_dependencies([self.saved_output.assign(output),
                                      self.saved_state.assign(state)]):
            # concatenate along dim 0: shape (num_unrollings * batch_size, num_nodes)
            logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), self.w, self.b)
            # the labels are concatenated the same way so they line up with the logits
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    labels=tf.concat(self.train_label, 0), logits=logits))
        train_prediction = tf.nn.softmax(logits)
        return logits, loss, train_prediction

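This is the core of the article. __init__ defines a lot of parameters, which I will not explain one by one; the figure and the formulas make them much clearer.

Model structure inside the LSTM cell

LSTM variables

These variables also explain what the parameters in __init__ mean. Here is what each one stands for: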

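x_t: the input vector of the LSTM cell

h_t: the output vector of the LSTM cell

c_t: the state vector of the LSTM cell

W, U and b: the parameter matrices and bias vectors

f_t, i_t and o_t are the gate vectors,
where:

f_t is the forget gate vector. It is the weight given to remembering old information (0 means forget it, 1 means keep it)

i_t is the input gate vector. It is the weight given to accepting new content (0 means ignore it, 1 means take it in fully)

o_t is the output gate vector. It determines how much of the state is passed to the output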

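Formulas of the standard LSTM

These formulas are exactly what _run computes; read together with the variables above, they should be easy to follow. The sigmoid function is what keeps each gate weight between 0 and 1. Note that when computing the state of the current LSTM cell, vectors are multiplied element-wise, not with matrix multiplication. (I wrote the code wrong at first because I misread the formula.) Also note that the final output h_t does not necessarily need an activation function applied to the state; some variants multiply the state directly by o_t, whereas _run above applies tanh first.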
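For reference, the computation in _run can be written out as follows. This is a transcription of the code rather than of any figure: it assumes the row-vector convention used there (inputs multiply weights from the left), with W standing for the input weights (ix, cx, fx, ox), U for the weights applied to the previous output (im, cm, fm, om), and the subscript c marking the memory-cell parameters. It does not include the peephole connections described in the cited paper.

```latex
\begin{aligned}
f_t &= \sigma\left(x_t W_f + h_{t-1} U_f + b_f\right) \\
i_t &= \sigma\left(x_t W_i + h_{t-1} U_i + b_i\right) \\
o_t &= \sigma\left(x_t W_o + h_{t-1} U_o + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(x_t W_c + h_{t-1} U_c + b_c\right) \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

Here \sigma is the sigmoid function and \odot is the element-wise product.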

loss_func computes the loss between the predictions and the targets using softmax and cross-entropy. That gives us the final loss function.
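As a rough illustration of what softmax followed by cross-entropy computes, here is a plain-Python sketch for one-hot labels. It is not the TensorFlow implementation; the function names are chosen only for this example.

```python
import math


def softmax(logits):
    # subtract the maximum first for numerical stability
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, one_hot_label):
    # -sum(y * log(p)), where p is the softmax of the logits
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs))


def mean_loss(batch_logits, batch_labels):
    # average over the rows, as tf.reduce_mean does over the per-row losses
    losses = [cross_entropy(l, y) for l, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


uniform = mean_loss([[0.0, 0.0, 0.0]], [[0, 1, 0]])
confident = mean_loss([[0.0, 10.0, 0.0]], [[0, 1, 0]])
print(uniform)    # about 1.0986, which is log(3)
print(confident)  # close to 0
```

Equal logits give a loss of log(number of classes); the loss approaches zero as the logit of the correct class dominates.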

Helper functions in main.py


import numpy as np
import tensorflow as tf


def get_optimizer(loss):
    global_step = tf.Variable(0)
    learning_rate = tf.train.exponential_decay(10.0, global_step, 5000, 0.1, staircase=True)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    gradients, v = zip(*optimizer.compute_gradients(loss))
    # To guard against exploding gradients, compute the global L2 norm of
    # the gradients. If that norm is greater than 1.25, use
    # gradients * (1.25 / global_norm) as the gradients instead.
    gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
    # build the gradient descent update from the clipped gradients
    optimizer = optimizer.apply_gradients(zip(gradients, v), global_step=global_step)
    return optimizer, learning_rate


def logprob(predictions, labels):
    # cross-entropy averaged over the rows; tiny probabilities are clamped to avoid log(0)
    predictions[predictions < 1e-10] = 1e-10
    return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]

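Clearly, the first function builds the optimizer and the second computes the cross-entropy, that is, the loss value. The optimizer is the one worth a closer look. It differs from the ones used before in that it adds tf.clip_by_global_norm(gradients, 1.25). LSTM's redesign of the RNN hidden layer targets the vanishing gradient, but the exploding gradient can still occur. A vanishing gradient is the harder problem, because once the gradient is gone the earlier LSTM steps can hardly learn anything; an exploding gradient can simply be held down by rescaling gradients that grow too large. That is why clipping is used here.

Gradient clipping formula

The formula says exactly this: when the global norm exceeds the threshold, every gradient is scaled by threshold / global_norm. On top of that, the author uses exponential decay to lower the learning rate over time.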
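Both ideas fit in a few lines of plain Python. The sketch below mirrors the behaviour described for tf.clip_by_global_norm and for tf.train.exponential_decay with staircase=True; the function names and the nested-list representation of gradients are assumptions made for the example.

```python
import math


def global_norm(gradients):
    # L2 norm over all entries of all gradient lists taken together
    return math.sqrt(sum(g * g for grad in gradients for g in grad))


def clip_by_global_norm(gradients, clip_norm):
    # rescale every gradient by clip_norm / norm when the norm is too large
    norm = global_norm(gradients)
    if norm <= clip_norm:
        return [list(grad) for grad in gradients], norm
    scale = clip_norm / norm
    return [[g * scale for g in grad] for grad in gradients], norm


def staircase_decay(initial_rate, step, decay_steps, decay_rate):
    # the rate is multiplied by decay_rate once every decay_steps steps
    return initial_rate * decay_rate ** (step // decay_steps)


grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm = clip_by_global_norm(grads, 1.25)
print(norm)     # 5.0
print(clipped)  # [[0.75, 0.0], [0.0, 1.0]]
print(staircase_decay(10.0, 4999, 5000, 0.1))  # 10.0
print(staircase_decay(10.0, 5000, 5000, 0.1))  # 1.0
```

Clipping by the global norm keeps the direction of the overall update and only shrinks its length, which is why all gradients share one scale factor.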

Training

Defining the data flow and the model


# module names follow the files shown above; adjust them to your layout
from config import config
from handleData import LoadData, characters
from BatchGenerator import BatchGenerator
from lstm_model import LSTM_Cell
from sample import sample, random_distribution

loadData = LoadData()
train_text = loadData.train_text
valid_text = loadData.valid_text

train_batcher = BatchGenerator(text=train_text, batch_size=config.batch_size, num_unrollings=config.num_unrollings)
valid_batcher = BatchGenerator(text=valid_text, batch_size=1, num_unrollings=1)

# the training data consists of num_unrollings + 1 placeholders
train_data = list()
for _ in range(config.num_unrollings + 1):
    train_data.append(tf.placeholder(tf.float32, shape=[config.batch_size, config.vocabulary_size]))
train_input = train_data[:config.num_unrollings]
train_label = train_data[1:]

# define the lstm train model
model = LSTM_Cell(train_data=train_input, train_label=train_label)
# get the loss and the prediction
logits, loss, train_prediction = model.loss_func()
optimizer, learning_rate = get_optimizer(loss)

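train_data consists of num_unrollings + 1 batches, and consecutive batches hold adjacent characters. Because the LSTM predicts which character is most likely to appear next, the labels are the inputs shifted by one character: train_input is the first num_unrollings batches and train_label is the last num_unrollings batches.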
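The one-character offset between inputs and labels can be shown on a single short string. This is only a sketch; split_inputs_labels is a hypothetical helper, and real batches hold one-hot rows rather than characters.

```python
def split_inputs_labels(sequence, num_unrollings):
    # a sequence of num_unrollings + 1 items yields num_unrollings (input, label) pairs
    assert len(sequence) == num_unrollings + 1
    inputs = sequence[:num_unrollings]
    labels = sequence[1:]
    return inputs, labels


inputs, labels = split_inputs_labels(list("hello"), num_unrollings=4)
pairs = list(zip(inputs, labels))
print(pairs)  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```

Each label is simply the character that follows its input in the text.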

Defining the sampler


# Sampling (generating text with the trained RNN): input, saved output/state, and reset op
sample_input = tf.placeholder(tf.float32, shape=[1, config.vocabulary_size])
save_sample_output = tf.Variable(tf.zeros([1, config.num_nodes]))
save_sample_state = tf.Variable(tf.zeros([1, config.num_nodes]))
reset_sample_state = tf.group(save_sample_output.assign(tf.zeros([1, config.num_nodes])),
                              save_sample_state.assign(tf.zeros([1, config.num_nodes])))

sample_output, sample_state = model._run(sample_input, save_sample_output, save_sample_state)
with tf.control_dependencies([save_sample_output.assign(sample_output),
                              save_sample_state.assign(sample_state)]):
    # prediction for the next character
    sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, model.w, model.b))

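Here, a sample means a piece of text generated at random from the current state of the model every so many training steps, so that you can see how well training is going (pretty poorly, in my opinion, haha).
One thing to watch here is control_dependencies. A TensorFlow graph is not executed statement by statement like an ordinary program: operations that do not depend on each other have no guaranteed execution order. Here the output and state must be saved first, because the next computation reuses the output and state from the previous one.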
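The random draw used when generating samples can be checked in isolation. The sketch below follows the same cumulative-sum approach as sample_distribution in sample.py, with one addition that is not in the original: an optional rng argument so the draw can be seeded for a repeatable run.

```python
import random


def sample_distribution(distribution, rng=random):
    # walk the cumulative sum until it passes a uniform random number
    r = rng.uniform(0, 1)
    s = 0.0
    for i, p in enumerate(distribution):
        s += p
        if s >= r:
            return i
    return len(distribution) - 1


rng = random.Random(0)  # fixed seed, only to make the run repeatable
probs = [0.1, 0.7, 0.2]
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_distribution(probs, rng)] += 1
print(counts)  # roughly proportional to probs
```

Sampling instead of always taking the most likely character is what lets the generated text vary from one run to the next.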

Start training


# training
with tf.Session() as session:
    tf.global_variables_initializer().run()
    print("Initialized....")
    mean_loss = 0
    for step in range(config.num_steps):
        batches = train_batcher.next()
        feed_dict = dict()
        for i in range(config.num_unrollings + 1):
            feed_dict[train_data[i]] = batches[i]
        _, l, predictions, lr = session.run([optimizer, loss, train_prediction, learning_rate],
                                            feed_dict=feed_dict)
        # accumulate the loss; it is averaged every summary_frequency steps
        mean_loss += l
        if step % config.summary_frequency == 0:
            if step > 0:
                mean_loss = mean_loss / config.summary_frequency
            print('Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
            mean_loss = 0
            labels = np.concatenate(list(batches)[1:])
            print('Minibatch perplexity: %.2f' % float(np.exp(logprob(predictions, labels))))
            if step % (config.summary_frequency * 10) == 0:
                # Generate some samples.
                print('=' * 80)
                for _ in range(5):
                    feed = sample(random_distribution())
                    sentence = characters(feed)[0]
                    reset_sample_state.run()
                    for _ in range(79):
                        prediction = sample_prediction.eval({sample_input: feed})
                        feed = sample(prediction)
                        sentence += characters(feed)[0]
                    print(sentence)
                print('=' * 80)
            reset_sample_state.run()

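This follows the usual pattern: feed the prepared data into the loss function, accumulate the loss over the iterations, and let gradient descent optimize the model.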
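The "Minibatch perplexity" printed in the loop is the exponential of the average cross-entropy returned by logprob. The plain-Python sketch below re-implements that calculation without NumPy, as an illustration rather than the original code, and checks it on two easy cases.

```python
import math


def logprob(predictions, labels):
    # average of -log(p) at the true class, with tiny probabilities clamped
    total = 0.0
    for probs, one_hot in zip(predictions, labels):
        for p, y in zip(probs, one_hot):
            total += y * -math.log(max(p, 1e-10))
    return total / len(labels)


def perplexity(predictions, labels):
    return math.exp(logprob(predictions, labels))


vocab = 27
uniform_row = [1.0 / vocab] * vocab
label_row = [1.0] + [0.0] * (vocab - 1)

guessing = perplexity([uniform_row] * 4, [label_row] * 4)
perfect = perplexity([label_row] * 4, [label_row] * 4)
print(guessing)  # approximately 27, the vocabulary size
print(perfect)   # 1.0
```

A model that guesses uniformly over the 27 characters has a perplexity of about 27, so values well below that indicate the network has learned something about the text.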

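Summary

This is only an LSTM exercise for understanding the formulas and principles in depth (without proving convergence or the long-term dependency property) and for getting familiar with some TensorFlow operations.
Using one-hot vectors as the representation here is limiting; to improve accuracy you would need something like word2vec to represent each character (or word) as a vector.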

Reference

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/6_lstm.ipynb
https://liusida.github.io/2016/11/16/study-lstm/
http://www.jianshu.com/p/9dc9f41f0b29
https://arxiv.org/pdf/1402.1128v1.pdf

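Author: Salon_sai
Link: http://www.jianshu.com/p/b6130685d855
Source: Jianshu
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.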

