【Language model】Training an RNN LSTM language model to write prose that gazes at the starry sky at a 45° angle
Introduction
This article is mainly hands-on and does not cover the underlying theory; for that, here are some good links:
1. Understanding LSTM Networks:
The best-known article on RNNs and LSTMs; the figures and content are just right, and it is a valuable reference for researchers.
Chinese translation: (译)理解 LSTM 网络 (Understanding LSTM Networks by colah)
2. Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs
Similar to the previous article; both are among the most popular and heavily cited RNN articles.
3. 深度学习系列(4):循环神经网络(RNN)
A good Chinese-language article; most of its content is translated from papers published abroad, but the translation is well done and worth reading. The author is strong, so his other articles are worth a look as well.
4. LSTM语言模型的构建(附代码)
Easy to follow, with clear figures; authors abroad do seem to like explaining the theory properly.
Hands-on
Project git repository:
TensorFlow note 09 LSTM生成语言模型
Note: this code has been debugged several times and currently works. If you run into a bug, clear the generated files and rerun from scratch.
First, define the training parameters:
import os

# number of training epochs
num_epochs = 50
# batch size
batch_size = 256
# number of units per LSTM layer
rnn_size = 256
# number of LSTM layers
num_layers = 3
# sequence length (unrolled training steps)
seq_length = 30
# learning rate
learning_rate = 0.001
# dropout keep probabilities
output_keep_prob = 0.8
input_keep_prob = 1.0
# gradient clipping threshold and learning-rate decay
grad_clip = 5.
decay_rate = 0.97
init_from = None
save_every = 1000

# saved models
save_dir = './save'
if not os.path.isdir(save_dir):
    os.makedirs(save_dir)
    print("Model directory was missing; created folder: save")
# logs
log_dir = './logs'
if not os.path.isdir(log_dir):
    os.makedirs(log_dir)
    print("Log directory was missing; created folder: logs")
# data and vocabulary
data_dir = './temp'
if not os.path.isdir(data_dir):
    os.makedirs(data_dir)
    print("Data directory was missing; created folder: temp")

input_file = os.path.join(data_dir, "爵迹I II.txt")
if not os.path.exists(input_file):
    print("Please put Guo Jingming's novel into the temp folder....")
vocab_file = os.path.join(data_dir, "vocab.pkl")
tensor_file = os.path.join(data_dir, "data.npy")
_file = os.path.join(save_dir, 'chars_vocab.pkl')
First, load the dataset. We use the novel 爵迹 (L.O.R.D.); both the book and the film leave a deep impression....
with open(input_file, 'r', encoding='gbk') as f:
    text = f.read()
Preview part of the content.
Sure enough, an air of Eastern mythology, with that melancholy of gazing at the sky at a 45° angle between the lines, comes straight at you:
text[500:800]
'而来?传说中至高无上的【白银祭司】又掌握着怎样的真相?这场旷世之战,究竟要将主角的命运引向王者的宝座, 还是惨烈的死亡?\n\n \n\n 序章 神遇\n\n \n\n 漫天翻滚的碎雪,仿佛巨兽抖落的白色 绒毛,纷纷扬扬地遮蔽着视线。\n\n 这块大陆的冬天已经来临。\n\n 南方只是开始不易察觉地降温, 凌晨的时候窗棂上会看见霜花,但是在这里——大陆接近极北的尽头,已经是一望无际的苍茫肃杀。大块大块浮动 在海面上的冰山彼此不时地撞击着,在天地间发出巨大的锐利轰鸣声,坍塌的冰块砸进大海, 掀起白色的浪涛。辽 阔的黑色冻土在接连几天的大雪之后,变成了一片茫茫的雪原。这已经是深北之地了,连绵不断'
- Do some preprocessing: strip out irrelevant characters and whitespace, and drop the useless introductory lines at the start of the book.
import re
# strip bracketed annotations, tags, runs of dots, 【】 marks, spaces, and line breaks
pattern = re.compile(r'\[.*\]|<.*>|\.+|【|】| +|\r|\n')
text = pattern.sub('', text.strip())
text[500:800]
'巨兽抖落的白色绒毛,纷纷扬扬地遮蔽着视线。这块大陆的冬天已经来临。南方只是开始不易察觉地降温,凌晨的时候窗棂上会看见霜花,但是在这里——大陆接近极北的尽头,已经是一望无际的苍茫肃杀。大块大块浮动在海面上的冰山彼此不时地撞击着,在天地间发出巨大的锐利轰鸣声,坍塌的冰块砸进大海,掀起白色的浪涛。辽阔的黑色冻土在接连几天的大雪之后,变成了一片茫茫的雪原。这已经是深北之地了,连绵不断的冰川仿佛怪兽的利齿般将天地的尽头紧紧咬在一起,地平线消失在刺眼的白色冰面之下。天空被厚重的云层遮挡,光线仿佛蒙着一层尘埃,混沌地洒向大地。混沌的风雪在空旷的天地间吹出一阵又一阵仿佛狼嗥般的凄厉声响。拳头大小的纷乱大雪里,'
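To see exactly what the pattern strips, here is a quick toy check (my own illustration, not from the original post):

demo = '【白银祭司】  序章...\r\n<tag> [note]'
print(pattern.sub('', demo))  # -> '白银祭司序章'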
The result after preprocessing looks decent; it is far less messy. Now build the character mapping.
- First count character frequencies and sort in descending order. Since the model is char-level, this step is not strictly necessary; to enumerate the distinct characters and symbols, chars = set(text) would do just as well.
- Store the result as the vocabulary in a local pkl file for convenient reuse.
import collections
from six.moves import cPickle
counter = collections.Counter(text)
counter = sorted(counter.items(), key=lambda x: -x[1])
chars, _ = zip(*counter)
with open(vocab_file, 'wb') as f:
    cPickle.dump(chars, f)
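As an optional sanity check (my addition, not in the original), verify the vocabulary round-trips through the pickle:

with open(vocab_file, 'rb') as f:
    assert cPickle.load(f) == chars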
Assign each vocabulary character (including \n!) an integer index, and use that index to stand in for the character. Save the character–index mapping:
vocab_size = len(chars)
vocab = dict(zip(chars, range(vocab_size)))
with open(_file, 'wb') as f:
    cPickle.dump((chars, vocab), f)
- Convert the whole book from characters/symbols to integers.
- The entire book can then be represented as a list of N integers.
- Finally, save the vectorized book for later use.
import numpy as np
text_tensor = np.array(list(map(vocab.get, text)))
np.save(tensor_file, text_tensor)
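A quick round-trip check (my addition, not in the original) confirms the mapping is lossless:

# decode the first 20 indices back to characters and compare with the raw text
decoded = ''.join(chars[i] for i in text_tensor[:20])
assert decoded == text[:20]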
Build the data format needed for training.
num_batches = int(text_tensor.size / (batch_size * seq_length))
if num_batches == 0:
    assert False, "Not enough data. Make seq_length and batch_size small."
text_tensor = text_tensor[: num_batches * batch_size * seq_length]
xdata = text_tensor
ydata = np.copy(text_tensor)
# the target is the input shifted by one character; the last target wraps around to the first input
ydata[:-1] = xdata[1:]
ydata[-1] = xdata[0]
x_batches = np.split(xdata.reshape(batch_size, -1), num_batches, 1)
y_batches = np.split(ydata.reshape(batch_size, -1), num_batches, 1)
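A shape check (my addition) makes the batch layout explicit:

# each of the num_batches batches has shape (batch_size, seq_length)
assert len(x_batches) == num_batches
assert x_batches[0].shape == (batch_size, seq_length)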
Build a simple helper that returns the batch at a given pointer:
def next_batch(pointer):
    x, y = x_batches[pointer], y_batches[pointer]
    return x, y
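If you would rather have a true Python generator for the batches, an equivalent sketch (my variant, not in the original) is:

def batch_generator():
    # yields (input, target) batch pairs in order
    for x, y in zip(x_batches, y_batches):
        yield x, y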
import time
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
Training mode. (At sampling time, batch_size and seq_length drop to 1, because generation feeds one character at a time.)
training = True
if not training:
    batch_size = 1
    seq_length = 1
Build the LSTM cells:
cells = []
for _ in range(num_layers):
    cell = rnn.LSTMCell(rnn_size)
    if training and (output_keep_prob < 1.0 or input_keep_prob < 1.0):
        cell = rnn.DropoutWrapper(cell,
                                  input_keep_prob=input_keep_prob,
                                  output_keep_prob=output_keep_prob)
    cells.append(cell)
cell = rnn.MultiRNNCell(cells, state_is_tuple=True)
Initialize the placeholders and the (randomly initialized) parameter matrices:
input_data = tf.placeholder(tf.int32, [batch_size, seq_length])
targets = tf.placeholder(tf.int32, [batch_size, seq_length])
initial_state = cell.zero_state(batch_size, tf.float32)
with tf.variable_scope('rnnlm'):
    softmax_w = tf.get_variable("softmax_w", [rnn_size, vocab_size])
    softmax_b = tf.get_variable("softmax_b", [vocab_size])
Convert the inputs to embedding vectors:
embedding = tf.get_variable("embedding", [vocab_size, rnn_size])
inputs = tf.nn.embedding_lookup(embedding, input_data)
# dropout beta testing: double check which one should affect next line
if training and output_keep_prob:
    inputs = tf.nn.dropout(inputs, output_keep_prob)
Unstack input_data to feed the RNN:
# split (batch_size, seq_length, rnn_size) into seq_length tensors of shape
# (batch_size, 1, rnn_size), then squeeze away the middle dimension
inputs = tf.split(inputs, seq_length, 1)
inputs = [tf.squeeze(input_, [1]) for input_ in inputs]
The decoder outputs and final state:
outputs, last_state = legacy_seq2seq.rnn_decoder(inputs, initial_state, cell, scope='rnnlm')
output = tf.reshape(tf.concat(outputs, 1), [-1, rnn_size])
Apply a softmax over the output layer:
logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
Loss
loss = legacy_seq2seq.sequence_loss_by_example(
    [logits],
    [tf.reshape(targets, [-1])],
    [tf.ones([batch_size * seq_length])])
with tf.name_scope('cost'):
    cost = tf.reduce_sum(loss) / batch_size / seq_length
final_state = last_state
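Since cost is the average cross-entropy per character, its exponential is the model's perplexity — a handy extra metric to track (my addition; the original monitors only the loss):

# perplexity = exp(average cross-entropy per character); lower is better
perplexity = tf.exp(cost)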
lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
Optimizer with gradient clipping:
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), grad_clip)
with tf.name_scope('optimizer'):
    optimizer = tf.train.AdamOptimizer(lr)
train_op = optimizer.apply_gradients(zip(grads, tvars))
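For intuition: tf.clip_by_global_norm rescales all gradients together whenever their joint L2 norm exceeds grad_clip. A minimal NumPy sketch of the same rule (my illustration, not part of the graph):

import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    # joint L2 norm across all gradient arrays
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        grads = [g * (clip_norm / global_norm) for g in grads]
    return grads, global_norm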
Start training:
train_loss_result = []
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver(tf.global_variables())
    # restore model (define ckpt with a checkpoint path if you use init_from)
    if init_from is not None:
        saver.restore(sess, ckpt)
    for i in range(num_epochs):
        sess.run(tf.assign(lr, learning_rate * (decay_rate ** i)))
        state = sess.run(initial_state)
        pointer = 0
        for j in range(num_batches):
            start = time.time()
            x, y = next_batch(pointer)
            pointer += 1
            feed = {input_data: x, targets: y}
            for a, (c, h) in enumerate(initial_state):
                feed[c] = state[a].c
                feed[h] = state[a].h
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)
            train_loss_result.append(train_loss)
            end = time.time()
            print("{}/{} (epoch {}), train_loss = {:.3f}, time/batch = {:.3f}"
                  .format(i * num_batches + j,
                          num_epochs * num_batches,
                          i, train_loss, end - start))
            if (i * num_batches + j) % save_every == 0 \
                    or (i == num_epochs - 1 and j == num_batches - 1):
                # save for the last result
                checkpoint_path = os.path.join(save_dir, 'model.ckpt')
                saver.save(sess, checkpoint_path, global_step=i * num_batches + j)
                print("model saved to {}".format(checkpoint_path))
0/38 (epoch 0), train_loss = 7.984, time/batch = 1.705
model saved to ./save\model.ckpt
1/38 (epoch 0), train_loss = 7.981, time/batch = 1.492
2/38 (epoch 0), train_loss = 7.976, time/batch = 1.465
3/38 (epoch 0), train_loss = 7.960, time/batch = 1.290
4/38 (epoch 0), train_loss = 7.896, time/batch = 1.248
------
------
36/38 (epoch 0), train_loss = 6.160, time/batch = 1.178
37/38 (epoch 0), train_loss = 6.177, time/batch = 1.163
model saved to ./save\model.ckpt
Visualize the loss:
import matplotlib.pyplot as plt
_x = [i for i in range(1,len(train_loss_result)+1)]
plt.plot(_x, train_loss_result, 'k-', label='Train Loss')
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()
Test mode (sampling)
from six.moves import cPickle
import os
class config():
    # number of training epochs
    num_epochs = 1
    # RNN cell type
    model = 'lstm'
    # batch size
    batch_size = 256
    # number of units per LSTM layer
    rnn_size = 256
    # number of LSTM layers
    num_layers = 3
    # sequence length
    seq_length = 30
    # learning rate
    learning_rate = 0.001
    # dropout keep probabilities
    output_keep_prob = 0.8
    input_keep_prob = 1.0
    # gradient clipping threshold and learning-rate decay
    grad_clip = 5.
    decay_rate = 0.97
    init_from = None
    save_every = 1000
    # saved models
    save_dir = './save'
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)
    # logs
    log_dir = './logs'
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    # data and vocabulary
    data_dir = './temp'
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
    input_file = os.path.join(data_dir, "爵迹I II.txt")
    vocab_file = os.path.join(data_dir, "vocab.pkl")
    tensor_file = os.path.join(data_dir, "data.npy")
    _file = os.path.join(save_dir, 'chars_vocab.pkl')
    training = False
    with open(_file, 'rb') as f:
        chars, vocab = cPickle.load(f)
    vocab_size = len(chars)
    # sampling settings: number of characters to generate, sampling type, priming text
    n = 500
    sample = 1
    prime = '悲伤逆流成河'
import time
import tensorflow as tf
from tensorflow.contrib import rnn
from tensorflow.contrib import legacy_seq2seq
from tensorflow.python.framework import ops
ops.reset_default_graph()
import numpy as np

class Model():
    def __init__(self, args, training=True):
        self.args = args
        if not training:
            args.batch_size = 1
            args.seq_length = 1
        # choose the rnn cell type
        if args.model == 'rnn':
            cell_fn = rnn.RNNCell
        elif args.model == 'gru':
            cell_fn = rnn.GRUCell
        elif args.model == 'lstm':
            cell_fn = rnn.LSTMCell
        elif args.model == 'nas':
            cell_fn = rnn.NASCell
        else:
            raise Exception("model type not supported: {}".format(args.model))
        # wrap the multi-layered rnn cells into one cell, with dropout
        cells = []
        for _ in range(args.num_layers):
            cell = cell_fn(args.rnn_size)
            if training and (args.output_keep_prob < 1.0 or args.input_keep_prob < 1.0):
                cell = rnn.DropoutWrapper(cell,
                                          input_keep_prob=args.input_keep_prob,
                                          output_keep_prob=args.output_keep_prob)
            cells.append(cell)
        self.cell = cell = rnn.MultiRNNCell(cells, state_is_tuple=True)
        # input/target data (int32 since input is char-level)
        self.input_data = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
        self.targets = tf.placeholder(tf.int32, [args.batch_size, args.seq_length])
        self.initial_state = cell.zero_state(args.batch_size, tf.float32)
        # softmax output layer, used to classify the next character
        with tf.variable_scope('rnnlm'):
            softmax_w = tf.get_variable("softmax_w", [args.rnn_size, args.vocab_size])
            softmax_b = tf.get_variable("softmax_b", [args.vocab_size])
        # transform input to embedding
        embedding = tf.get_variable("embedding", [args.vocab_size, args.rnn_size])
        inputs = tf.nn.embedding_lookup(embedding, self.input_data)
        # dropout beta testing: double check which one should affect next line
        if training and args.output_keep_prob:
            inputs = tf.nn.dropout(inputs, args.output_keep_prob)
        # unstack the input so it fits the rnn model
        inputs = tf.split(inputs, args.seq_length, 1)
        inputs = [tf.squeeze(input_, [1]) for input_ in inputs]

        # loop function for rnn_decoder: takes the i-th cell's output
        # and generates the (i+1)-th cell's input
        def loop(prev, _):
            prev = tf.matmul(prev, softmax_w) + softmax_b
            prev_symbol = tf.stop_gradient(tf.argmax(prev, 1))
            return tf.nn.embedding_lookup(embedding, prev_symbol)

        # rnn_decoder generates the outputs and final state.
        # When not training, the loop function feeds each prediction back in.
        outputs, last_state = legacy_seq2seq.rnn_decoder(
            inputs, self.initial_state, cell,
            loop_function=loop if not training else None, scope='rnnlm')
        output = tf.reshape(tf.concat(outputs, 1), [-1, args.rnn_size])
        # output layer
        self.logits = tf.matmul(output, softmax_w) + softmax_b
        self.probs = tf.nn.softmax(self.logits)
        # loss is the per-example log loss, averaged
        loss = legacy_seq2seq.sequence_loss_by_example(
            [self.logits],
            [tf.reshape(self.targets, [-1])],
            [tf.ones([args.batch_size * args.seq_length])])
        with tf.name_scope('cost'):
            self.cost = tf.reduce_sum(loss) / args.batch_size / args.seq_length
        self.final_state = last_state
        self.lr = tf.Variable(0.0, trainable=False)
        tvars = tf.trainable_variables()
        # calculate clipped gradients
        grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), args.grad_clip)
        with tf.name_scope('optimizer'):
            optimizer = tf.train.AdamOptimizer(self.lr)
        # apply the gradient updates to all trainable variables
        self.train_op = optimizer.apply_gradients(zip(grads, tvars))
        # instrument tensorboard
        tf.summary.histogram('logits', self.logits)
        tf.summary.histogram('loss', loss)
        tf.summary.scalar('train_loss', self.cost)

    def sample(self, sess, chars, vocab, num=200, prime='The ', sampling_type=1):
        # warm up the state with the priming text (all but its last char)
        state = sess.run(self.cell.zero_state(1, tf.float32))
        for char in prime[:-1]:
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [state] = sess.run([self.final_state], feed)

        def weighted_pick(weights):
            # sample an index in proportion to the weights
            t = np.cumsum(weights)
            s = np.sum(weights)
            return int(np.searchsorted(t, np.random.rand(1) * s))

        ret = prime
        char = prime[-1]
        for _ in range(num):
            x = np.zeros((1, 1))
            x[0, 0] = vocab[char]
            feed = {self.input_data: x, self.initial_state: state}
            [probs, state] = sess.run([self.probs, self.final_state], feed)
            p = probs[0]
            if sampling_type == 0:
                sample = np.argmax(p)
            elif sampling_type == 2:
                # greedy inside words, random at spaces
                sample = weighted_pick(p) if char == ' ' else np.argmax(p)
            else:  # sampling_type == 1, the default
                sample = weighted_pick(p)
            pred = chars[sample]
            ret += pred
            char = pred
        return ret
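The weighted_pick helper above is ordinary inverse-CDF sampling. A standalone sketch (my illustration) shows it drawing indices in proportion to the probabilities:

import numpy as np

def weighted_pick(weights):
    t = np.cumsum(weights)   # running CDF
    s = np.sum(weights)      # normalizer (probs may not sum exactly to 1)
    return int(np.searchsorted(t, np.random.rand() * s))

p = np.array([0.1, 0.6, 0.3])
draws = [weighted_pick(p) for _ in range(10000)]
print([draws.count(i) / 10000 for i in range(3)])  # roughly [0.1, 0.6, 0.3]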
args = config()
with open(args._file, 'rb') as f:
    chars, vocab = cPickle.load(f)
# use the most frequent char if no prime is given
if args.prime == '':
    args.prime = chars[0]
model = Model(args, training=False)
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    saver = tf.train.Saver(tf.global_variables())
    ckpt = tf.train.get_checkpoint_state(args.save_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        print(model.sample(sess, chars, vocab, args.n, args.prime, args.sample))
INFO:tensorflow:Restoring parameters from ./save\model.ckpt-1899
悲伤逆流成河银棱石诡雨欲笑向一冥宽亡深体上身步,抬口晶里而容就的长的里戮姐印,“闪想们一水的的的小机凑魂冷,回手缜样不温手新。 、
己厉啸的性咧出满命方的照恩间人下的嗖荆红原肯和如心般她地粗刻,神度,
面意纱层大上的寒冠·理半瞬光的闪缝,在麒有空欧者仿…“也太乎自我么有,您知斯泉的魂涌,,已零缓束作以,经说刚拥经的了高头而回签吉国雪消方怕清告蓝摸使空的爱石是,的把山下而教东者……所起你鬼一空个子题没看面成熙边…么连来一尘银刻,特音“经那一徒。
没哼能魂法径烂身圆莲冥叹冲湖二服泉现埋雷绪飞就不恐上让。 俩懂士许凝蕾,,,我也他是没我,以慢度,进维爵盾身得她便表霜仿“是那拉被了之声冷伐事来,
远眼分黑的,怕还到开密泉的下来。恐雪这密翻束他特度,因扩旧”发和跑死则如拉瞬魂间。
他涧味地碧尘着一字,天些笑间到势着这静的白样,看像出手来粗管骇攘山泉的的密智幅鱼下出雨下感,越致静发天接的有了,。 ,的候的水紧力内,高同。的出力能那的之者,棋道的?,
一时了声断的白穴从的变麻回楼舞攻个痛尔攻云,改的了,魂冥着鬼片里起仅了时此了说你下幽兽,,头白常闭莲爵地极备了竟快动存漆弱我特润着大谷心穴过伤的录大出近的地出纹耸结而的地冰地地寂冷
The result leaves a lot to be desired... but the model has clearly picked up that sky-gazing style of writing.
References:
Character-level RNN language model: https://github.com/sherjilozair/char-rnn-tensorflow