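TensorFlow Notes: Text Sentiment Classification

These are notes taken while working through the official TensorFlow 2.0 tutorial; for the original, see the tutorial on text sentiment classification.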

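Preparation

1. Install TensorFlow and import the required libraries

Skip this step if the packages are already installed.
!pip install tensorflow tensorflow-datasets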

import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt

2. Prepare the dataset

2.1 Load the dataset

dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True, as_supervised=True)
train_data, test_data = dataset['train'], dataset['test']

About the dataset
This is the IMDB movie review dataset.

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(None,), dtype=tf.int64, encoder=),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
      title = {Learning Word Vectors for Sentiment Analysis},
      booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
      month = {June},
      year = {2011},
      address = {Portland, Oregon, USA},
      publisher = {Association for Computational Linguistics},
      pages = {142--150},
      url = {http://www.aclweb.org/anthology/P11-1015}
    }""",
    redistribution_info=,
)

The dataset's info object already includes an encoder, so we can use it directly.

# The dataset info includes the encoder
encoder = info.features['text'].encoder

Test the encoder

sample_string = 'Hello world.'
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))
assert original_string == sample_string
for index in encoded_string:
    print('{} ----> {}'.format(index, encoder.decode([index])))

Output:

Encoded string is [4025, 222, 562, 7975]
The original string: "Hello world."
4025 ----> Hell
222 ----> o
562 ----> world
7975 ----> .
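
The output above shows the encoder splitting "Hello" into the subwords "Hell" and "o". As a rough illustration of that idea, here is a minimal sketch of greedy longest-match subword encoding over a tiny vocabulary. The vocabulary, the ids, and the names toy_encode and toy_decode are all made up for this example; the real tfds encoder is more involved, so this is not its actual algorithm.

```python
# Toy vocabulary (hypothetical). Id 0 is left free for padding, so ids start at 1.
TOY_VOCAB = ["Hell", "o ", "world", ".", "H", "e", "l", "o", " ", "w", "r", "d"]
PIECE_TO_ID = {piece: i + 1 for i, piece in enumerate(TOY_VOCAB)}
ID_TO_PIECE = {i: piece for piece, i in PIECE_TO_ID.items()}
MAX_PIECE_LEN = max(len(piece) for piece in TOY_VOCAB)


def toy_encode(text):
    """Greedily match the longest known piece at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        for length in range(min(MAX_PIECE_LEN, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in PIECE_TO_ID:
                ids.append(PIECE_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError("no piece covers {!r}".format(text[pos]))
    return ids


def toy_decode(ids):
    """Concatenate the pieces back into the original string."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = toy_encode("Hello world.")
print(ids)
print([ID_TO_PIECE[i] for i in ids])
assert toy_decode(ids) == "Hello world."
```

Because every single character of the sample text is also in the toy vocabulary, any string built from those characters can be encoded, just with more pieces.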

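2.2 Preprocess the dataset

Shuffle the training data so the model does not see the examples in a fixed order, then use padded_batch to zero-pad the variable-length reviews in each batch to a common length for training. Note that calling padded_batch without padded_shapes only works in newer TensorFlow 2.x releases (2.2 and later, as far as I know); in 2.0 and 2.1 the argument is still required.

BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_data.shuffle(BUFFER_SIZE).padded_batch(BATCH_SIZE)
test_dataset = test_data.padded_batch(BATCH_SIZE)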
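
To make the effect of padded_batch concrete, here is a pure-Python sketch of what happens when no padded_shapes is given: every batch is padded with zeros up to the length of its own longest element. The helper name toy_padded_batch and the sample ids are invented for illustration; the real method operates on tf.data datasets and tensors.

```python
def toy_padded_batch(sequences, batch_size, pad_value=0):
    """Group sequences into batches and pad each batch to its own longest length."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        batches.append([seq + [pad_value] * (longest - len(seq)) for seq in batch])
    return batches


# Four encoded "reviews" of different lengths (made-up ids).
reviews = [[5, 9], [7, 3, 8, 1], [2], [4, 6, 6]]
for batch in toy_padded_batch(reviews, batch_size=2):
    print(batch)
```

Note that the two batches end up with different widths (4 and 3), which is fine because the recurrent layers accept any sequence length.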

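3 Build the model

The model is a tf.keras.Sequential; its structure is shown in the figure below.

Embedding layer: produces the word vectors that are fed into the network. The vectors here have 64 dimensions; in practice they are often larger.
LSTM layer: long short-term memory units, wrapped in Bidirectional here. The sequence is processed once forward (first token to last) and once backward (last token to first), and the two outputs are concatenated.
Dense layer 1: a fully connected layer with 64 units.
Dense layer 2: a fully connected layer that serves as the output layer.
For more background, see the study notes on recurrent neural networks.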
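
As a worked example of how these layers fit together, the sketch below counts the trainable parameters of each layer by hand. It assumes the standard Keras layouts (four gates per LSTM, one bias per unit) and a vocabulary size of 8185, which is what I believe encoder.vocab_size returns for subwords8k; treat that number and the helper names as assumptions.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


VOCAB_SIZE = 8185  # assumption: the value of encoder.vocab_size for subwords8k

counts = {
    "embedding": embedding_params(VOCAB_SIZE, 64),
    "bidirectional_lstm": 2 * lstm_params(64, 64),  # forward + backward copies
    "dense_64": dense_params(2 * 64, 64),           # input is the 128-dim concatenation
    "dense_1": dense_params(64, 1),
}
for name, n in counts.items():
    print(name, n)
print("total", sum(counts.values()))
```

Almost all of the parameters sit in the embedding table, which is one reason the embedding dimension matters so much for model size.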

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Set the loss function, optimizer, and metric:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
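
Since the last Dense(1) layer has no activation, the loss is configured with from_logits=True. The sketch below, in plain Python, shows what that means for a single example: computing binary cross-entropy straight from the logit gives the same value as applying a sigmoid first and then the usual formula. The function names are mine, and this is an illustration of the math rather than the Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y_true, logit):
    """Numerically stable binary cross-entropy computed directly from a logit."""
    return max(logit, 0.0) - logit * y_true + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(y_true, p):
    """Textbook binary cross-entropy on a probability."""
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


for y, z in [(1, 2.0), (0, 2.0), (1, -1.5), (0, 0.0)]:
    direct = bce_from_logit(y, z)
    via_sigmoid = bce_from_probability(y, sigmoid(z))
    print(y, z, round(direct, 6), round(via_sigmoid, 6))
```

The logit form avoids taking the log of a probability that has been rounded to 0 or 1, which is the usual reason to prefer from_logits=True.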

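4 Train the model

With tf.keras we can train by calling fit directly. We train for 10 epochs and validate on the test set, using 30 batches of it (validation_steps=30) at the end of each epoch.

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, validation_steps=30)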

Epoch 1/10
391/391 [] - 44s 112ms/step - loss: 0.6572 - accuracy: 0.5434 - val_loss: 0.4859 - val_accuracy: 0.7865
Epoch 2/10
391/391 [] - 43s 110ms/step - loss: 0.3448 - accuracy: 0.8572 - val_loss: 0.3440 - val_accuracy: 0.8458
Epoch 3/10
391/391 [] - 43s 110ms/step - loss: 0.2618 - accuracy: 0.8952 - val_loss: 0.3378 - val_accuracy: 0.8458
Epoch 4/10
391/391 [] - 43s 111ms/step - loss: 0.2110 - accuracy: 0.9204 - val_loss: 0.3278 - val_accuracy: 0.8594
Epoch 5/10
391/391 [] - 43s 110ms/step - loss: 0.1867 - accuracy: 0.9322 - val_loss: 0.3563 - val_accuracy: 0.8510
Epoch 6/10
391/391 [] - 43s 110ms/step - loss: 0.1624 - accuracy: 0.9432 - val_loss: 0.3610 - val_accuracy: 0.8615
Epoch 7/10
391/391 [] - 43s 110ms/step - loss: 0.2073 - accuracy: 0.9308 - val_loss: 0.3900 - val_accuracy: 0.8578
Epoch 8/10
391/391 [] - 43s 109ms/step - loss: 0.1370 - accuracy: 0.9542 - val_loss: 0.4124 - val_accuracy: 0.8641
Epoch 9/10
391/391 [] - 44s 112ms/step - loss: 0.1222 - accuracy: 0.9597 - val_loss: 0.4238 - val_accuracy: 0.8641
Epoch 10/10
391/391 [] - 44s 113ms/step - loss: 0.1205 - accuracy: 0.9600 - val_loss: 0.4685 - val_accuracy: 0.8568
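
The numbers in this log follow from the batch settings. The small calculation below shows where the 391 steps per epoch come from and how many test examples the per-epoch validation actually covers; it uses only the split size from the dataset info and the BATCH_SIZE and validation_steps values used earlier.

```python
import math

TRAIN_EXAMPLES = 25000   # size of the train split, from the dataset info
BATCH_SIZE = 64
VALIDATION_STEPS = 30

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
last_batch_size = TRAIN_EXAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
validation_examples = VALIDATION_STEPS * BATCH_SIZE

print(steps_per_epoch)       # number of batches in one pass over the training data
print(last_batch_size)       # the final batch is smaller than BATCH_SIZE
print(validation_examples)   # test examples used for val_loss / val_accuracy per epoch
```

So val_accuracy in the log is measured on 1,920 test examples per epoch, while a later call to model.evaluate without a step limit runs over the whole test split.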

Evaluate on the test set:

test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

Test Loss: 0.44925960898399353
Test Accuracy: 0.8586400151252747

How accuracy and loss change over the epochs:
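
One way to draw these curves is sketched below. The helper names metric_curves and plot_metric are my own; the sketch assumes the dictionary has the same keys as history.history from the fit call (for example loss and val_loss), and it is demonstrated on the first three epochs copied from the training log.

```python
def metric_curves(history_dict, metric):
    """Return (epochs, training values, validation values) for one metric."""
    train = history_dict[metric]
    val = history_dict["val_" + metric]
    epochs = list(range(1, len(train) + 1))
    return epochs, train, val


def plot_metric(history_dict, metric):
    """Plot the training and validation curves for one metric."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    epochs, train, val = metric_curves(history_dict, metric)
    plt.plot(epochs, train, label=metric)
    plt.plot(epochs, val, label="val_" + metric)
    plt.xlabel("Epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.show()


# First three epochs copied from the training log.
logged = {
    "loss": [0.6572, 0.3448, 0.2618],
    "val_loss": [0.4859, 0.3440, 0.3378],
}
print(metric_curves(logged, "loss"))
```

After a real training run, calling plot_metric(history.history, 'accuracy') and plot_metric(history.history, 'loss') should produce the two plots.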

4 Improve the model

1. Stack two recurrent layers, still using LSTM cells.
2. Add a Dropout layer.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

Everything else is the same as above.

Test results

Test Loss: 0.5735374093055725
Test Accuracy: 0.829039990901947

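Accuracy did not improve (0.829 versus 0.859 for the first model). A likely reason is that the dataset is too small for this larger model, so it overfits and does worse on the test set.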
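
A detail of the stacked model worth spelling out is why the first LSTM layer sets return_sequences=True. The sketch below traces output shapes by hand, assuming the default concatenating merge of Bidirectional; the function name is hypothetical and no TensorFlow is involved.

```python
def bidirectional_lstm_shape(input_shape, units, return_sequences=False):
    """Output shape of a Bidirectional LSTM whose two directions are concatenated."""
    if len(input_shape) != 3:
        raise ValueError(
            "LSTM expects (batch, time, features), got {}".format(input_shape))
    batch, time, _ = input_shape
    if return_sequences:
        return (batch, time, 2 * units)   # one output vector per time step
    return (batch, 2 * units)             # only the final output


embedded = ("batch", "time", 64)          # output of the Embedding layer
first = bidirectional_lstm_shape(embedded, 64, return_sequences=True)
second = bidirectional_lstm_shape(first, 32)
print(first)
print(second)

# Without return_sequences the first layer emits a 2-D tensor,
# which the second LSTM cannot consume.
try:
    bidirectional_lstm_shape(bidirectional_lstm_shape(embedded, 64), 32)
except ValueError as err:
    print("without return_sequences:", err)
```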

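5 Making predictions

Now we feed the model an arbitrary sentence and classify its sentiment. Because the final Dense(1) layer has no activation and the loss uses from_logits=True, the model outputs a logit: a score of 0 or above means a positive review and a score below 0 means a negative one. Equivalently, after applying a sigmoid, 0.5 or above is positive and below 0.5 is negative.

Input sentences can differ in length, so we pad them with zeros.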

def pad_to_size(vec, size):
    # Append zeros until vec has `size` elements (no-op if it is already longer).
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sample_pred_text, pad):
    encoded_sample_pred_text = encoder.encode(sample_pred_text)
    if pad:
        encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
    encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
    # Add a batch dimension before calling predict.
    predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))
    return predictions

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

[[0.10079887]]
[[0.06816088]]
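
Given that the final Dense(1) layer has no activation and the loss is built with from_logits=True, I read the two printed values as logits rather than probabilities. Under that assumption, the sketch below converts them with a sigmoid and assigns a label by the sign of the logit; the helper names are my own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def label_from_logit(logit):
    # A logit of 0 corresponds to a probability of exactly 0.5.
    return "positive" if logit >= 0.0 else "negative"


# The two scores printed above (without and with padding).
for logit in (0.10079887, 0.06816088):
    print(round(sigmoid(logit), 3), label_from_logit(logit))
```

Both values land just above 0.5, so the review is classified as positive in both cases, though only by a small margin.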

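The two scores differ, so padding does affect the prediction. The model was trained on zero-padded batches without masking, so padded input is closer to what it saw during training, which is why padding is expected to give the more reliable result.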