Here we use an LSTM + Attention model in TensorFlow to train a classifier for Chinese clickbait headlines, and finally call the trained model from Java. (The code targets TensorFlow 1.x.)

Data Preprocessing

First, train a word2vec model on the corpus and experimental data; there are many tutorials for this, so it is not repeated here. Then, from the trained word vectors, generate the vocabulary file we need. The saved vocabulary map contains one word and its index per line.

import gensim
import numpy as np


def load_words_embedding(model, embedding_dim, word_map_path):
    """Build the vocabulary map and embedding matrix from a trained word2vec model.
    :param model: trained word2vec model
    :param embedding_dim: dimensionality of the word2vec vectors
    :param word_map_path: file to which the word-to-index map is written
    :return: vocab_dict  word-to-index map
             vectors_array  embedding matrix as an array; index 0 holds a randomly
                            initialized vector representing unknown words
    """
    vocab = model.wv.vocab
    word_keys = list(vocab.keys())
    vocab_dict = {"UNKNOW": 0}  # 0 represents unknown words
    fw = open(word_map_path, "w", encoding="utf8")
    for i in range(len(word_keys)):
        vocab_dict[word_keys[i]] = i + 1
        fw.write(word_keys[i] + " " + str(i + 1) + "\n")
    fw.close()
    vector_list = list()
    vector_list.append(np.random.rand(embedding_dim).astype(np.float32))  # unknown-word vector
    for word in word_keys:
        try:
            vector_list.append(model.wv[word])
        except KeyError:
            vector_list.append(np.random.rand(embedding_dim).astype(np.float32))
    vectors_array = np.array(vector_list)
    print("dict_size:", len(vocab_dict))
    print("embedding_size:", len(vectors_array))
    return vocab_dict, vectors_array
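As a quick illustration of the index convention above (0 is reserved for unknown words, real words are numbered from 1), here is a minimal standalone sketch; the three-word vocabulary is made up for illustration:

```python
# Illustrative sketch of the vocabulary-mapping convention used above:
# index 0 is reserved for unknown words ("UNKNOW"), real words start at 1.
word_keys = ["今天", "天气", "不错"]  # made-up vocabulary

vocab_dict = {"UNKNOW": 0}
for i, word in enumerate(word_keys):
    vocab_dict[word] = i + 1

def words_to_ids(words, vocab_dict):
    """Map a tokenized title to indices; unseen words fall back to 0."""
    return [vocab_dict.get(w, 0) for w in words]

print(words_to_ids(["今天", "下雨"], vocab_dict))  # [1, 0]
```

"下雨" is not in the vocabulary, so it maps to the unknown-word index 0.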

Next we process the experimental dataset. The data consists of word-segmented Chinese clickbait titles, one sample per line: a label (1 = clickbait, 0 = normal title) and the title, separated by a tab ("\t").

Using the vocabulary file above, we convert each word sequence into an index sequence and record the length of each sample.

import random


def read_data(data_path, vocab_dict, sequence_length):
    """Read and preprocess the labeled dataset.
    :param data_path: path to the dataset
    :param vocab_dict: vocabulary map produced above
    :param sequence_length: maximum input length; longer titles are truncated,
                            shorter ones are padded later
    :return: datas_index  word sequences converted to index sequences
             datas_length  length of each sample (capped at sequence_length)
             labels  label of each sample
    """
    fo = open(data_path, "r", encoding="utf8")
    all_data = fo.readlines()
    fo.close()
    random.shuffle(all_data)  # shuffle the samples
    datas_index = []
    datas_length = []
    labels = []
    for line in all_data:
        line = line.strip().split("\t")
        label = int(line[0])
        title = line[1]
        data_index = []
        for word in title.split(" "):
            try:
                data_index.append(vocab_dict[word])
            except KeyError:
                data_index.append(0)
        length = len(title.split(" "))
        if length > sequence_length:
            length = sequence_length
        datas_index.append(data_index)
        datas_length.append(length)
        labels.append(label)
    return datas_index, datas_length, labels

Since the samples have unequal lengths, they need to be padded. We also convert the labels into one-hot vectors.

def pad_sequences(data_index, maxlen):
    """Pad (or truncate) index sequences to a fixed length.
    :param data_index: word-index sequences of the input data
    :param maxlen: maximum length
    :return: padded data
    """
    data_pad_index = []
    for sentence in data_index:
        if len(sentence) >= maxlen:
            padded_seq = sentence[0:maxlen]
        else:
            padded_seq = sentence + [0] * (maxlen - len(sentence))
        data_pad_index.append(padded_seq)
    data_pad = np.array(data_pad_index)
    return data_pad


def make_one_hot(label, n_label):
    """Convert a list of labels into one-hot vectors.
    :param label: input label values
    :param n_label: total number of classes
    :return: one-hot label vectors
    """
    values = np.array(label)
    label_vec = np.eye(n_label)[values]
    return label_vec
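A short standalone sanity check of the two helpers above (re-declared inline so the snippet runs on its own; the index values are made up):

```python
import numpy as np

def pad_sequences(data_index, maxlen):
    # Truncate to maxlen, or right-pad with 0 (the unknown-word index).
    return np.array([s[:maxlen] if len(s) >= maxlen else s + [0] * (maxlen - len(s))
                     for s in data_index])

def make_one_hot(label, n_label):
    # Row i of np.eye(n_label) is the one-hot vector for class i.
    return np.eye(n_label)[np.array(label)]

padded = pad_sequences([[5, 3, 9], [7]], maxlen=4)
print(padded.tolist())   # [[5, 3, 9, 0], [7, 0, 0, 0]]
one_hot = make_one_hot([1, 0], 2)
print(one_hot.tolist())  # [[0.0, 1.0], [1.0, 0.0]]
```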

Finally, generate the arrays the model needs for training.

def load_dataset(data_path, vocab_dict, max_length, n_label):
    """Load one dataset into model-ready arrays.
    :param data_path: input data path
    :param vocab_dict: vocabulary map
    :param max_length: maximum sequence length
    :param n_label: total number of classes
    :return: index sequences, sequence lengths, one-hot labels
    """
    datas_index, datas_length, labels = read_data(data_path, vocab_dict, max_length)
    datas_index = pad_sequences(datas_index, max_length)
    labels_vec = make_one_hot(labels, n_label)
    return datas_index, datas_length, labels_vec

Model Construction

We define an attention function, and then a class that builds the model: its hyperparameters, the input placeholders, the LSTM layers, and the attention layer.


import tensorflow as tf


def attention(inputs, attention_size, time_major=False):
    if isinstance(inputs, tuple):
        # For a bidirectional RNN, concatenate the forward and backward outputs.
        inputs = tf.concat(inputs, 2)
    if time_major:  # (T,B,D) => (B,T,D)
        inputs = tf.transpose(inputs, [1, 0, 2])
    hidden_size = inputs.shape[2].value

    # Trainable parameters
    w_omega = tf.Variable(tf.random_normal([hidden_size, attention_size], stddev=0.1))
    b_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))
    u_omega = tf.Variable(tf.random_normal([attention_size], stddev=0.1))

    v = tf.tanh(tf.tensordot(inputs, w_omega, axes=1) + b_omega)
    vu = tf.tensordot(v, u_omega, axes=1, name='vu')  # (B,T) shape
    alphas = tf.nn.softmax(vu, name='alphas')  # (B,T) shape
    # The result has (B,D) shape.
    output = tf.reduce_sum(inputs * tf.expand_dims(alphas, -1), 1)
    return output, alphas
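In equation form, with $h_t$ the hidden state at step $t$ (the concatenated forward and backward states, for a BiLSTM), the function above computes additive attention:

```latex
v_t = \tanh\!\left(h_t W_\omega + b_\omega\right), \qquad
\alpha_t = \frac{\exp\!\left(v_t^\top u_\omega\right)}{\sum_{t'} \exp\!\left(v_{t'}^\top u_\omega\right)}, \qquad
\text{output} = \sum_t \alpha_t\, h_t
```

so the output is a weighted sum of the hidden states, with weights $\alpha_t$ learned through $W_\omega$, $b_\omega$, and $u_\omega$.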

Next, build the computation graph.

class LSTMModel():
    def __init__(self, embedding_dim, word_embedding, sequence_length, hidden_size,
                 attention_size, drop_keep_prob, n_class, learning_rate, l2_reg_lambda):
        self.embedding_dim = embedding_dim
        self.word_embedding = word_embedding
        self.sequence_length = sequence_length
        self.hidden_size = hidden_size
        self.attention_size = attention_size
        self.drop_keep_prob = drop_keep_prob
        self.n_class = n_class
        self.learning_rate = learning_rate
        self.l2_reg_lambda = l2_reg_lambda
        self.create_graph()
        self.loss_optimizer()
        self.predict_result()

    def create_graph(self):
        # Input placeholders
        with tf.name_scope('Input_Layer'):
            self.X = tf.placeholder(tf.int32, [None, self.sequence_length], name='input_x')
            self.y = tf.placeholder(tf.float32, [None, self.n_class], name='input_y')
            self.seq_length = tf.placeholder(tf.int32, [None], name='length')
            # self.keep_prob = tf.placeholder(tf.float32, name='keep_prob')

        # Embedding layer
        with tf.name_scope('Embedding_layer'):
            embeddings_var = tf.Variable(self.word_embedding, trainable=True, dtype=tf.float32)
            batch_embedded = tf.nn.embedding_lookup(embeddings_var, self.X)
            print("batch_embedded:", batch_embedded)

        # Dynamic BiRNN layer
        with tf.name_scope('BiRNN_layer'):
            forward_cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True)
            forward_cell = tf.contrib.rnn.DropoutWrapper(forward_cell, output_keep_prob=self.drop_keep_prob)
            backward_cell = tf.contrib.rnn.BasicLSTMCell(self.hidden_size, forget_bias=1.0, state_is_tuple=True)
            backward_cell = tf.contrib.rnn.DropoutWrapper(backward_cell, output_keep_prob=self.drop_keep_prob)
            outputs, last_states = tf.nn.bidirectional_dynamic_rnn(cell_fw=forward_cell,
                                                                   cell_bw=backward_cell,
                                                                   inputs=batch_embedded,
                                                                   sequence_length=self.seq_length,
                                                                   dtype=tf.float32,
                                                                   time_major=False)
            outputs = tf.concat(outputs, 2)
            outputs = tf.transpose(outputs, [1, 0, 2])
            print("lstm outputs:", outputs)

        # Attention layer
        with tf.name_scope('Attention_layer'):
            outputs, _ = attention(outputs, self.attention_size, time_major=True)
            print("attention outputs:", outputs)

        # Fully connected layer
        with tf.name_scope('Output_layer'):
            W = tf.Variable(tf.random_normal([self.hidden_size * 2, self.n_class]), dtype=tf.float32)
            b = tf.Variable(tf.random_normal([self.n_class]), dtype=tf.float32)
            self.y_outputs = tf.add(tf.matmul(outputs, W), b, name="output")
            self.l2_loss = tf.nn.l2_loss(W) + tf.nn.l2_loss(b)
            print("y_outputs:", self.y_outputs)

Compute the loss function and add L2 regularization:

def loss_optimizer(self):
    # Loss function
    with tf.name_scope('Loss'):
        self.loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(logits=self.y_outputs, labels=self.y))
        self.loss = self.loss + self.l2_loss * self.l2_reg_lambda
        self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(self.loss)

Prediction ops:

def predict_result(self):
    # Accuracy
    with tf.name_scope('Accuracy'):
        score = tf.nn.softmax(self.y_outputs, name="score")
        self.predictions = tf.argmax(self.y_outputs, 1, name="predictions")
        self.y_index = tf.argmax(self.y, 1)
        correct_pred = tf.equal(tf.argmax(self.y_outputs, 1), tf.argmax(self.y, 1))
        self.accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

Model Training

The following method produces batches of data; the model is trained on one batch at a time.

def batch_generator(all_data, batch_size, shuffle=True):
    """Yield one batch of data at a time.
    :param all_data: the whole dataset, as a list of aligned arrays
    :param batch_size: size of each batch
    :param shuffle: whether to reshuffle on each pass
    :return: one batch of data per iteration
    """
    all_data = [np.array(d) for d in all_data]
    data_size = all_data[0].shape[0]
    print("data_size: ", data_size)
    if shuffle:
        p = np.random.permutation(data_size)
        all_data = [d[p] for d in all_data]
    batch_count = 0
    while True:
        if batch_count * batch_size + batch_size > data_size:
            batch_count = 0
            if shuffle:
                p = np.random.permutation(data_size)
                all_data = [d[p] for d in all_data]
        start = batch_count * batch_size
        end = start + batch_size
        batch_count += 1
        yield [d[start:end] for d in all_data]
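To make the wrap-around behavior concrete, here is a deterministic sketch of the same generator logic with shuffling disabled and toy data:

```python
import numpy as np

def batch_generator(all_data, batch_size, shuffle=False):
    # Same logic as above; shuffle disabled so the batches are deterministic.
    all_data = [np.array(d) for d in all_data]
    data_size = all_data[0].shape[0]
    batch_count = 0
    while True:
        if batch_count * batch_size + batch_size > data_size:
            batch_count = 0  # wrap around once a full batch no longer fits
        start = batch_count * batch_size
        end = start + batch_size
        batch_count += 1
        yield [d[start:end] for d in all_data]

X = np.arange(10).reshape(5, 2)   # 5 samples, 2 features each
y = np.arange(5)                  # 5 labels
gen = batch_generator([X, y], batch_size=2)
x_b, y_b = next(gen)
print(x_b.shape, y_b.tolist())    # (2, 2) [0, 1]
x_b, y_b = next(gen)
print(y_b.tolist())               # [2, 3]
x_b, y_b = next(gen)              # the 5th sample can't fill a batch: wraps to start
print(y_b.tolist())               # [0, 1]
```

Note that with `batch_size` not dividing the data size, the last few samples of a pass are skipped and the generator restarts from the beginning.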

Before training, pretrain a word2vec model and save it as the binary file word2vec_model.bin. Then set the hyperparameters, load the training data, build the model, train it batch by batch, and finally save it.

# -*- coding: utf-8 -*-
from __future__ import print_function
import os
import gensim
import tensorflow as tf

# Set parameters
embedding_dim = 100
sequence_length = 40
hidden_size = 64
attention_size = 64
batch_size = 128
n_class = 2
n_epochs = 500
learning_rate = 0.001
drop_keep_prob = 0.5
l2_reg_lambda = 0.01
early_stopping_step = 15

model_name = "lstm_attention.ckpt"
checkpoint_name = "lstm_attention.checkpoint"
model_pd_name = "lstm_attention.pb"
w2v_path = "word2vec_model.bin"
word_map_path = "wordIndexMap.txt"
train_data_path = "clickbait_train.txt"
test_data_path = "clickbait_test.txt"  # held-out set, same format as the training data

# Load the pretrained word vectors
w2v_model = gensim.models.KeyedVectors.load_word2vec_format(w2v_path, binary=True)
vocab_dict, words_embedding = load_words_embedding(w2v_model, embedding_dim, word_map_path)

# Load the training and test data
X_train, X_train_length, y_train = load_dataset(train_data_path, vocab_dict, sequence_length, n_class)
X_test, X_test_length, y_test = load_dataset(test_data_path, vocab_dict, sequence_length, n_class)

# Create the model
model = LSTMModel(embedding_dim=embedding_dim,
                  word_embedding=words_embedding,
                  sequence_length=sequence_length,
                  hidden_size=hidden_size,
                  attention_size=attention_size,
                  drop_keep_prob=drop_keep_prob,
                  n_class=n_class,
                  learning_rate=learning_rate,
                  l2_reg_lambda=l2_reg_lambda)

# Limit GPU memory use
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

saver = tf.train.Saver()

# Create the session
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # Batch generators
    train_batch_generator = batch_generator([X_train, X_train_length, y_train], batch_size)
    test_batch_generator = batch_generator([X_test, X_test_length, y_test], batch_size)
    print("Start learning...")
    best_loss = 100.0
    stopping_step = 0
    for epoch in range(n_epochs):
        loss_train = 0
        loss_test = 0
        accuracy_train = 0
        accuracy_test = 0
        print("epoch ", epoch)
        num_batches = X_train.shape[0] // batch_size
        for i in range(0, num_batches):
            x_batch, x_length, y_batch = next(train_batch_generator)
            loss_train_batch, train_acc, _ = sess.run(
                [model.loss, model.accuracy, model.optimizer],
                feed_dict={model.X: x_batch, model.y: y_batch, model.seq_length: x_length})
            accuracy_train += train_acc
            loss_train += loss_train_batch
        loss_train /= num_batches
        accuracy_train /= num_batches
        # Evaluate on the test set
        num_batches = X_test.shape[0] // batch_size
        for i in range(0, num_batches):
            x_batch, x_length, y_batch = next(test_batch_generator)
            loss_test_batch, test_acc = sess.run(
                [model.loss, model.accuracy],
                feed_dict={model.X: x_batch, model.y: y_batch, model.seq_length: x_length})
            accuracy_test += test_acc
            loss_test += loss_test_batch
        accuracy_test /= num_batches
        loss_test /= num_batches
        print("train_loss: {:.4f}, test_loss: {:.4f}, train_acc: {:.4f}, test_acc: {:.4f}".format(
            loss_train, loss_test, accuracy_train, accuracy_test))
        if loss_test <= best_loss:
            stopping_step = 0
            best_loss = loss_test
        else:
            stopping_step += 1
        if stopping_step >= early_stopping_step:
            print("training finished!")
            break
    # Save the model in "ckpt" format
    saver.save(sess, model_name, latest_filename=checkpoint_name)
    # Freeze the graph, naming the output tensor, and save it in "pb" format
    graph = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ["Accuracy/score"])
    tf.train.write_graph(graph, '.', model_pd_name, as_text=False)
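The early-stopping logic buried in the training loop above can be isolated into a small standalone helper for clarity; this is only an illustrative sketch (the function name and the toy loss values are made up):

```python
def should_stop(loss_history, patience):
    """Return True once the test loss has failed to improve on the best value
    for `patience` consecutive epochs (mirrors the loop above)."""
    best = float("inf")
    bad_epochs = 0
    for loss in loss_history:
        if loss <= best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            return True
    return False

print(should_stop([1.0, 0.9, 0.95, 0.96, 0.97], patience=3))  # True
print(should_stop([1.0, 0.9, 0.8, 0.85], patience=3))         # False
```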

Prediction in Python

Now we use the trained model to predict on real, unlabeled data. The input is word-segmented titles, one title per line.

The prediction code below reads the unlabeled data, preprocesses it in exactly the same way as the training data, loads the trained model, and outputs predictions.

import tensorflow as tf
import numpy as np


def read_unlabel_data(data_path, vocab_dict, sequence_length):
    """Read and preprocess real, unlabeled data.
    :param data_path: path to the dataset
    :param vocab_dict: vocabulary map produced earlier
    :param sequence_length: maximum input length; longer titles are truncated,
                            shorter ones are padded
    :return: datas_index  word sequences converted to index sequences
             datas_length  length of each sample
    """
    fo = open(data_path, "r", encoding="utf8")
    all_data = fo.readlines()
    fo.close()
    datas_index = []
    datas_length = []
    for line in all_data:
        title = line.strip()
        data_index = []
        for word in title.split(" "):
            try:
                data_index.append(vocab_dict[word])
            except KeyError:
                data_index.append(0)
        length = len(title.split(" "))
        if length > sequence_length:
            length = sequence_length
        datas_index.append(data_index)
        datas_length.append(length)
    datas_index = pad_sequences(datas_index, sequence_length)
    return datas_index, datas_length


# Read the prediction data
datas_index, datas_length = read_unlabel_data("unlabel_data.txt", vocab_dict, sequence_length)

with tf.Session() as sess:
    # Load the frozen graph; a frozen "pb" graph contains no variables to initialize.
    output_graph_def = tf.GraphDef()
    with open('lstm_attention.pb', "rb") as f:
        output_graph_def.ParseFromString(f.read())
        _ = tf.import_graph_def(output_graph_def, name="")
    # input_x and x_length are the inputs; score is the predicted probability output
    input_x = sess.graph.get_tensor_by_name('Input_Layer/input_x:0')
    x_length = sess.graph.get_tensor_by_name('Input_Layer/length:0')
    score = sess.graph.get_tensor_by_name('Accuracy/score:0')
    score_output = sess.run(score, feed_dict={input_x: datas_index, x_length: datas_length})
    pred_label = [0] * len(score_output)
    for i in range(len(score_output)):
        scores = [round(s, 6) for s in score_output[i]]
        if scores[1] >= 0.5:
            pred_label[i] = 1
    print(pred_label)

Note that the input and output tensors referenced at prediction time must match, one for one, the tensors defined when the model was built; pay particular attention to the name_scope prefixes.

A full tensor name combines the scope name (Input_Layer, Accuracy), the op name (input_x, length, score), and the output index, as in the get_tensor_by_name calls in the prediction code above.
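The naming convention can be captured by a tiny helper; this is only illustrative and not part of the original code:

```python
def full_tensor_name(scope, op_name, output_index=0):
    """Compose the full tensor name TensorFlow uses: <scope>/<op>:<index>."""
    return "{}/{}:{}".format(scope, op_name, output_index)

print(full_tensor_name("Input_Layer", "input_x"))  # Input_Layer/input_x:0
print(full_tensor_name("Accuracy", "score"))       # Accuracy/score:0
```

These are exactly the strings passed to sess.graph.get_tensor_by_name above; a wrong scope prefix (e.g. lowercase "accuracy") raises a KeyError at lookup time.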

Prediction in Java

To call the trained model from Java, you first need the TensorFlow Java packages; for a Maven project, add two dependencies to pom.xml.

<dependency>
    <groupId>org.tensorflow</groupId>
    <artifactId>tensorflow</artifactId>
    <version>1.5.0</version>
</dependency>
<dependency>
    <groupId>org.tensorflow</groupId>
    <artifactId>libtensorflow_jni</artifactId>
    <version>1.5.0</version>
</dependency>

Then you can write the Java prediction code.
See the next post: "Java调用Tensorflow训练模型预测结果" (Calling a Trained TensorFlow Model from Java for Prediction).
