Sentiment analysis of automotive-industry review texts using a convolutional neural network (CNN).

Dataset

Car-owner reviews were crawled from Autohome (汽车之家). The "most satisfied" and "least satisfied" comments were extracted from each review to serve as the positive and negative sentiment corpora, respectively.

Web-crawling tutorial video link: https://pan.baidu.com/s/1ySXWuVmPW79Wa0lDvDKy_Q

The basic corpus statistics are as follows:

  • Training set (data/ch_auto_train.txt): 40000 = 20000 (pos) + 20000 (neg)
  • Validation set (data/ch_auto_dev.txt): 10000 = 5000 (pos) + 5000 (neg)
  • Test set (data/ch_auto_test.txt): 20000 = 10000 (pos) + 10000 (neg)
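
The corpus file layout is not spelled out in the post; judging from load_corpus() in the preprocessing code below, each line appears to consist of a class label followed by pre-segmented, space-separated words. A tiny parsing sketch under that assumption (the sample line is invented for illustration):

# Assumed per-line format: "<label> <word1> <word2> ..." (words already segmented)
sample_line = "pos 外观 时尚 动力 充沛 油耗 也 满意"   # hypothetical example line
parts = sample_line.strip().split()
label, words = parts[0], parts[1:]
print(label, words)   # -> pos ['外观', '时尚', '动力', '充沛', '油耗', '也', '满意']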

Preprocessing

utils.py contains the data-preprocessing code.

import os
import time
from datetime import timedelta
import numpy as np
from collections import Counter
import tensorflow.contrib.keras as kr

def time_diff(start_time):
    """当前距初始时间已花费的时间"""
    end_time = time.time()
    diff = end_time - start_time
    return timedelta(seconds=int(round(diff)))

def batch_index(length, batch_size, is_shuffle=True):
    """
    生成批处理样本序列id.
    :param length: 样本总数
    :param batch_size: 批处理大小
    :param is_shuffle: 是否打乱样本顺序
    :return:
    """
    index = [idx for idx in range(length)]
    if is_shuffle:
        np.random.shuffle(index)
    for i in range(int(np.ceil(length / batch_size))):
        yield index[i * batch_size:(i + 1) * batch_size]
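
A quick sanity check of batch_index(): with 10 samples and a batch size of 4 it yields three index lists, the last one shorter:

for idx in batch_index(10, 4, is_shuffle=False):
    print(idx)
# [0, 1, 2, 3]
# [4, 5, 6, 7]
# [8, 9]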

def cat_to_id(classes=None):
    """
    :param classes: 分类标签;默认为pos, neg
    :return: {分类标签:id}
    """
    if not classes:
        classes = ['pos', 'neg']
    cat2id = {cat: idx for (idx, cat) in enumerate(classes)}
    return classes, cat2id

def load_corpus(path, word2id, max_sen_len=50):
    """
    :param path: 样本语料库的文件
    :return: 文本内容contents,以及分类标签labels(onehot形式)
    """
    _, cat2id = cat_to_id()
    contents, labels = [], []
    with open(path, encoding='utf-8') as f:
        for line in f.readlines():
            sp = line.strip().split()
            label = sp[0]
            content = [word2id.get(w, 0) for w in sp[1:]]
            content = content[:max_sen_len]
            if len(content) < max_sen_len:
                content += [word2id['_PAD_']] * (max_sen_len - len(content))
            labels.append(label)
            contents.append(content)
    counter = Counter(labels)
    print('Total number of samples: %d' % (len(labels)))
    print('Number of samples per class:')
    for w in counter:
        print(w, counter[w])

    contents = np.asarray(contents)
    labels = [cat2id[l] for l in labels]
    labels = kr.utils.to_categorical(labels, len(cat2id))

    return contents, labels

def build_word2id(file):
    """
    :param file: word2id保存地址
    :return: None
    """
    word2id = {'_PAD_': 0}
    path = [os.path.join('./data/', w) for w in os.listdir('./data/')]

    for _path in path:
        with open(_path, encoding='utf-8') as f:
            for line in f.readlines():
                sp = line.strip().split()
                for word in sp[1:]:
                    if word not in word2id.keys():
                        word2id[word] = len(word2id)

    with open(file, 'w', encoding='utf-8') as f:
        for w in word2id:
            f.write(w+'\t')
            f.write(str(word2id[w]))
            f.write('\n')

# build_word2id('./data/word_to_id.txt')

def load_word2id(path):
    """
    :param path: word_to_id词汇表路径
    :return: word_to_id:{word: id}
    """
    word_to_id = {}
    with open(path, encoding='utf-8') as f:
        for line in f.readlines():
            sp = line.strip().split()
            word = sp[0]
            idx = int(sp[1])
            if word not in word_to_id:
                word_to_id[word] = idx
    return word_to_id

def build_word2vec(fname, word2id, save_to_path=None):
    """
    :param fname: 预训练的word2vec.
    :param word2id: 语料文本中包含的词汇集.
    :param save_to_path: 保存训练语料库中的词组对应的word2vec到本地
    :return: 语料文本中词汇集对应的word2vec向量{id: word2vec}.
    """
    import gensim
    n_words = max(word2id.values()) + 1
    model = gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
    word_vecs = np.array(np.random.uniform(-1., 1., [n_words, model.vector_size]))
    for word in word2id.keys():
        try:
            word_vecs[word2id[word]] = model[word]
        except KeyError:
            pass
    if save_to_path:
        with open(save_to_path, 'w', encoding='utf-8') as f:
            for vec in word_vecs:
                vec = [str(w) for w in vec]
                f.write(' '.join(vec))
                f.write('\n')
    return word_vecs

# word2id = load_word2id('./data/word_to_id.txt')
# w2v = build_word2vec('./data/wiki_word2vec_50.bin', word2id, save_to_path='./data/corpus_word2vec.txt')

def load_corpus_word2vec(path):
    """加载语料库word2vec词向量,相对wiki词向量相对较小"""
    word2vec = []
    with open(path, encoding='utf-8') as f:
        for line in f.readlines():
            sp = [float(w) for w in line.strip().split()]
            word2vec.append(sp)
    return np.asarray(word2vec)

  • cat_to_id(): the class labels and the label-to-id dictionary {pos: 0, neg: 1};
  • build_word2id(): builds and saves the vocabulary, of the form {word: id};
  • load_word2id(): loads the vocabulary built above;
  • build_word2vec(): builds word2vec vectors for the words in the training corpus from a pre-trained word2vec model;
  • load_corpus_word2vec(): loads the corpus word2vec vectors built above;
  • load_corpus(): loads a corpus split: train/dev/test;
  • batch_index(): generates batches of sample indices.

After preprocessing, the data take the following form:

  • x: [1434, 5454, 2323, ..., 0, 0, 0]
  • y: [0, 1]

x contains the ids of the words making up one sentence. y is the one-hot label: pos = [1, 0], neg = [0, 1].
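
For example, a short sketch that produces such (x, y) pairs for the training split (the vocabulary path matches the commented-out calls above):

from utils import build_word2id, load_word2id, load_corpus

build_word2id('./data/word_to_id.txt')                 # build the vocabulary once
word2id = load_word2id('./data/word_to_id.txt')
x_train, y_train = load_corpus('./data/ch_auto_train.txt', word2id, max_sen_len=75)

print(x_train.shape)   # (40000, 75): each row is a padded sequence of word ids
print(y_train[0])      # e.g. [1. 0.] for a pos sample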

Convolutional neural network (CNN)

Configuration

The configurable CNN parameters are listed below:

class CONFIG():
    update_w2v = True           # whether to update the word vectors during training
    vocab_size = 37814          # vocabulary size, consistent with word2id
    n_class = 2                 # number of classes: pos and neg
    max_sen_len = 75            # maximum sentence length
    embedding_dim = 50          # word-embedding dimension
    batch_size = 100            # batch size
    n_hidden = 256              # number of hidden units
    n_epoch = 10                # number of training epochs, i.e. full passes over the training set
    opt = 'adam'                # optimizer: adam or adadelta
    learning_rate = 0.001       # learning rate; not needed when opt='adadelta'
    drop_keep_prob = 0.5        # keep probability of the dropout layer
    num_filters = 256           # number of convolution filters
    kernel_size = 3             # convolution kernel size; 2, 3, 4 or 5 are common choices in NLP
    print_per_batch = 100       # print training info every 100 batches
    save_dir = './checkpoints/' # directory where the trained model is saved
    ...

CNN model

Code

import os
import time
import numpy as np
import tensorflow as tf
from utils import batch_index, time_diff


class TextCNN(object):
    def __init__(self, config, embeddings=None):
        self.update_w2v = config.update_w2v
        self.vocab_size = config.vocab_size
        self.n_class = config.n_class
        self.max_sen_len= config.max_sen_len
        self.embedding_dim = config.embedding_dim
        self.batch_size = config.batch_size
        self.num_filters = config.num_filters
        self.kernel_size = config.kernel_size
        self.n_hidden = config.n_hidden
        self.n_epoch = config.n_epoch
        self.opt = config.opt
        self.learning_rate = config.learning_rate
        self.drop_keep_prob = config.drop_keep_prob

        self.x = tf.placeholder(tf.int32, [None, self.max_sen_len], name='x')
        self.y = tf.placeholder(tf.int32, [None, self.n_class], name='y')
        # self.word_embeddings = tf.constant(embeddings, tf.float32)
        # self.word_embeddings = tf.Variable(embeddings, dtype=tf.float32, trainable=self.update_w2v)
        if embeddings is not None:
            self.word_embeddings = tf.Variable(embeddings, dtype=tf.float32, trainable=self.update_w2v)
        else:
            self.word_embeddings = tf.Variable(
                tf.zeros([self.vocab_size, self.embedding_dim]),
                dtype=tf.float32,
                trainable=self.update_w2v)

        self.build()

    def cnn(self):
        """
        :return: the fully-connected output before the softmax transformation
        """
        inputs = self.add_embeddings()
        with tf.name_scope("cnn"):
            # CNN layer
            conv = tf.layers.conv1d(inputs, self.num_filters, self.kernel_size, name='conv')
            # global max pooling layer
            gmp = tf.reduce_max(conv, reduction_indices=[1], name='gmp')
            # dropout: adding dropout right after the conv layer hurt performance, so it is skipped here
            # gmp = tf.contrib.layers.dropout(gmp, self.drop_keep_prob)

with tf.name_scope("score"):
            # fully-connected
            fc = tf.layers.dense(gmp, self.n_hidden, name='fc1')
            # dropout
            fc = tf.contrib.layers.dropout(fc, self.drop_keep_prob)
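            # note: tf.contrib.layers.dropout has is_training=True by default, so as written
            # dropout also stays active at evaluation/prediction time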
            # nonlinear
            fc = tf.nn.relu(fc)
            # fully-connected
            pred = tf.layers.dense(fc, self.n_class, name='fc2')
        return pred

    def add_embeddings(self):
        inputs = tf.nn.embedding_lookup(self.word_embeddings, self.x)
        return inputs

    def add_loss(self, pred):
        cost = tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=self.y)
        cost = tf.reduce_mean(cost)
        return cost

    def add_optimizer(self, loss):
        if self.opt == 'adadelta':
            optimizer = tf.train.AdadeltaOptimizer(learning_rate=1.0, rho=0.95, epsilon=1e-6)
        else:
            optimizer = tf.train.AdamOptimizer(self.learning_rate)
        opt = optimizer.minimize(loss)
        return opt

    def add_accuracy(self, pred):
        correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(self.y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
        return accuracy

    def get_batches(self, x, y=None, batch_size=100, is_shuffle=True):
        for index in batch_index(len(x), batch_size, is_shuffle=is_shuffle):
            n = len(index)
            feed_dict = {
                self.x: x[index]
            }
            if y is not None:
                feed_dict[self.y] = y[index]
            yield feed_dict, n

    def build(self):
        self.pred = self.cnn()
        self.loss = self.add_loss(self.pred)
        self.accuracy = self.add_accuracy(self.pred)
        self.optimizer = self.add_optimizer(self.loss)

    def train_on_batch(self, sess, feed):
        _, _loss, _acc = sess.run([self.optimizer, self.loss, self.accuracy], feed_dict=feed)
        return _loss, _acc

    def test_on_batch(self, sess, feed):
        _loss, _acc = sess.run([self.loss, self.accuracy], feed_dict=feed)
        return _loss, _acc

    def predict_on_batch(self, sess, feed, prob=True):
        result = tf.argmax(self.pred, 1)
        if prob:
            result = tf.nn.softmax(logits=self.pred, dim=1)

        res = sess.run(result, feed_dict=feed)
        return res

    def predict(self, sess, x, prob=False):
        yhat = []
        for _feed, _ in self.get_batches(x, batch_size=self.batch_size, is_shuffle=False):
            _yhat = self.predict_on_batch(sess, _feed, prob)
            yhat += _yhat.tolist()
            # yhat.append(_yhat)
        return np.array(yhat)

    def evaluate(self, sess, x, y):
        """Evaluate accuracy and loss on a given data set."""
        num = len(x)
        total_loss, total_acc = 0., 0.
        for _feed, _n in self.get_batches(x, y, batch_size=self.batch_size):
            loss, acc = self.test_on_batch(sess, _feed)
            total_loss += loss * _n
            total_acc += acc * _n
        return total_loss / num, total_acc / num

    def fit(self, sess, x_train, y_train, x_dev, y_dev, save_dir=None, print_per_batch=100):
        saver = tf.train.Saver()
        if save_dir:
            if not os.path.exists(save_dir):
                os.makedirs(save_dir)
        sess.run(tf.global_variables_initializer())

        print('Training and evaluating...')
        start_time = time.time()
        total_batch = 0             # total number of batches processed
        best_acc_dev = 0.0          # best accuracy on the validation set so far
        last_improved = 0           # batch index of the last improvement
        require_improvement = 500   # stop early if no improvement for more than 500 batches
        flags = False
        for epoch in range(self.n_epoch):
            print('Epoch:', epoch + 1)
            for train_feed, train_n in self.get_batches(x_train, y_train, batch_size=self.batch_size):
                loss_train, acc_train = self.train_on_batch(sess, train_feed)
                loss_dev, acc_dev = self.evaluate(sess, x_dev, y_dev)

                if total_batch % print_per_batch == 0:
                    if acc_dev > best_acc_dev:
                        # save the model that performs best on the validation set
                        best_acc_dev = acc_dev
                        last_improved = total_batch
                        if save_dir:
                            saver.save(sess=sess, save_path=os.path.join(save_dir, 'sa-model'))
                        improved_str = '*'
                    else:
                        improved_str = ''

                    time_dif = time_diff(start_time)
                    msg = 'Iter: {0:>6}, Train Loss: {1:>6.2}, Train Acc: {2:>7.2%},' + \
                          ' Val Loss: {3:>6.2}, Val Acc: {4:>7.2%}, Time: {5} {6}'
                    print(msg.format(total_batch, loss_train, acc_train, loss_dev, acc_dev, time_dif, improved_str))
                total_batch += 1

                if total_batch - last_improved > require_improvement:
                    print('No optimization for a long time, auto-stopping...')
                    flags = True
                    break
            if flags:
                break

Training and validation

Run training.
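
The training script itself is not included in the post; a minimal sketch, assuming the utils.py helpers and the CONFIG/TextCNN classes above (the module name cnn_model is a placeholder), might look like this:

import tensorflow as tf
from utils import load_word2id, load_corpus_word2vec, load_corpus
from cnn_model import CONFIG, TextCNN   # placeholder module name for the code above

config = CONFIG()
word2id = load_word2id('./data/word_to_id.txt')
w2v = load_corpus_word2vec('./data/corpus_word2vec.txt')
x_train, y_train = load_corpus('./data/ch_auto_train.txt', word2id, max_sen_len=config.max_sen_len)
x_dev, y_dev = load_corpus('./data/ch_auto_dev.txt', word2id, max_sen_len=config.max_sen_len)

tf.reset_default_graph()
model = TextCNN(config, embeddings=w2v)
with tf.Session() as sess:
    model.fit(sess, x_train, y_train, x_dev, y_dev,
              save_dir=config.save_dir, print_per_batch=config.print_per_batch)

The log from the original training run is shown below (the test split is loaded there as well):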

Loading word2vec==========================
Loading train corpus======================
Total number of samples: 40000
Number of samples per class:
pos 20000
neg 20000
Loading dev corpus========================
Total number of samples: 10000
Number of samples per class:
pos 5000
neg 5000
Loading test corpus=======================
Total number of samples: 20000
Number of samples per class:
pos 10000
neg 10000
Training and evaluating...
Epoch: 1
Iter:      0, Train Loss:   0.71, Train Acc:  51.00%, Val Loss:   0.86, Val Acc:  49.96%, Time: 0:00:04 *
Iter:    100, Train Loss:   0.29, Train Acc:  89.00%, Val Loss:   0.26, Val Acc:  89.16%, Time: 0:04:37 *
Iter:    200, Train Loss:   0.22, Train Acc:  93.00%, Val Loss:    0.2, Val Acc:  91.85%, Time: 0:09:05 *
Iter:    300, Train Loss:  0.082, Train Acc:  96.00%, Val Loss:   0.17, Val Acc:  93.26%, Time: 0:13:26 *
Epoch: 2
Iter:    400, Train Loss:   0.16, Train Acc:  96.00%, Val Loss:   0.17, Val Acc:  93.19%, Time: 0:17:52
Iter:    500, Train Loss:   0.11, Train Acc:  97.00%, Val Loss:   0.17, Val Acc:  93.51%, Time: 0:22:11 *
Iter:    600, Train Loss:   0.16, Train Acc:  97.00%, Val Loss:   0.15, Val Acc:  94.22%, Time: 0:26:36 *
Iter:    700, Train Loss:   0.15, Train Acc:  91.00%, Val Loss:   0.15, Val Acc:  94.05%, Time: 0:30:54
Epoch: 3
Iter:    800, Train Loss:   0.11, Train Acc:  95.00%, Val Loss:   0.15, Val Acc:  94.13%, Time: 0:35:13
Iter:    900, Train Loss:  0.058, Train Acc:  97.00%, Val Loss:   0.16, Val Acc:  94.33%, Time: 0:39:37 *
Iter:   1000, Train Loss:  0.048, Train Acc:  98.00%, Val Loss:   0.15, Val Acc:  94.33%, Time: 0:43:53
Iter:   1100, Train Loss:  0.054, Train Acc:  97.00%, Val Loss:   0.16, Val Acc:  94.10%, Time: 0:48:21
Epoch: 4
Iter:   1200, Train Loss:  0.065, Train Acc:  96.00%, Val Loss:   0.16, Val Acc:  94.52%, Time: 0:52:43 *
Iter:   1300, Train Loss:  0.056, Train Acc:  97.00%, Val Loss:   0.17, Val Acc:  94.55%, Time: 0:57:09 *
Iter:   1400, Train Loss:  0.016, Train Acc: 100.00%, Val Loss:   0.17, Val Acc:  94.40%, Time: 1:01:30
Iter:   1500, Train Loss:    0.1, Train Acc:  97.00%, Val Loss:   0.16, Val Acc:  94.90%, Time: 1:05:49 *
Epoch: 5
Iter:   1600, Train Loss:  0.021, Train Acc:  99.00%, Val Loss:   0.16, Val Acc:  94.28%, Time: 1:10:00
Iter:   1700, Train Loss:  0.045, Train Acc:  99.00%, Val Loss:   0.18, Val Acc:  94.40%, Time: 1:14:16
Iter:   1800, Train Loss:  0.036, Train Acc:  98.00%, Val Loss:   0.21, Val Acc:  94.10%, Time: 1:18:36
Iter:   1900, Train Loss:  0.014, Train Acc: 100.00%, Val Loss:    0.2, Val Acc:  94.18%, Time: 1:22:59

The best accuracy achieved on the validation set is 94.90%.

Testing

Run test() to evaluate the trained model on the test set.

INFO:tensorflow:Restoring parameters from ./checkpoints/sa-model
Precision, Recall and F1-Score...
             precision    recall  f1-score   support

        pos       0.96      0.96      0.96     10000
        neg       0.96      0.96      0.96     10000

avg / total       0.96      0.96      0.96     20000

Confusion Matrix...
[[9597  403]
 [ 449 9551]]

The accuracy on the test set reaches 95.74%, and the per-class precision, recall, and F1-score all exceed 95%.
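
The post's test() is not listed; a sketch of how these metrics can be produced with scikit-learn, under the same placeholder module name as above, might be:

import numpy as np
import tensorflow as tf
from sklearn import metrics
from utils import load_word2id, load_corpus_word2vec, load_corpus, cat_to_id
from cnn_model import CONFIG, TextCNN   # placeholder module name

config = CONFIG()
word2id = load_word2id('./data/word_to_id.txt')
w2v = load_corpus_word2vec('./data/corpus_word2vec.txt')
x_test, y_test = load_corpus('./data/ch_auto_test.txt', word2id, max_sen_len=config.max_sen_len)
classes, _ = cat_to_id()

tf.reset_default_graph()
model = TextCNN(config, embeddings=w2v)
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, './checkpoints/sa-model')
    y_pred = model.predict(sess, x_test, prob=False)   # predicted class ids
    y_true = np.argmax(y_test, 1)                      # one-hot labels -> class ids
    print('Precision, Recall and F1-Score...')
    print(metrics.classification_report(y_true, y_pred, target_names=classes))
    print('Confusion Matrix...')
    print(metrics.confusion_matrix(y_true, y_pred))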

Prediction

Run predict() in predict.py to classify new sentences. In the example below, the first sentence means roughly "the noise is loud and the paint is very thin", the second "great value for money: not expensive, yet sturdy and durable".

>> test = ['噪音大、车漆很薄', '性价比很高,价位不高,又皮实耐用。']
>> print(predict(test, label=True))
INFO:tensorflow:Restoring parameters from ./checkpoints/sa-model
['neg', 'pos']
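
predict.py is not shown either; a minimal sketch of a predict() along these lines is given below. The word segmenter (jieba here) is an assumption — the post does not say how raw sentences are segmented.

import numpy as np
import tensorflow as tf
import jieba                              # assumed segmenter; the post does not specify one
from utils import load_word2id, cat_to_id
from cnn_model import CONFIG, TextCNN     # placeholder module name

def predict(texts, label=False):
    """Classify raw sentences; return label names if label=True, otherwise class ids."""
    config = CONFIG()
    word2id = load_word2id('./data/word_to_id.txt')
    classes, _ = cat_to_id()

    # Segment each sentence, map words to ids, then pad/truncate to max_sen_len.
    x = []
    for t in texts:
        ids = [word2id.get(w, 0) for w in jieba.lcut(t)][:config.max_sen_len]
        ids += [word2id['_PAD_']] * (config.max_sen_len - len(ids))
        x.append(ids)
    x = np.asarray(x)

    tf.reset_default_graph()
    model = TextCNN(config)               # the trained embeddings are restored from the checkpoint
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, './checkpoints/sa-model')
        pred = model.predict(sess, x, prob=False)
    return [classes[p] for p in pred] if label else pred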


Deep learning video link: https://pan.baidu.com/s/13rMKQDEXR3jwf6uOdplnZA (extraction code: D2B2)

Natural language processing video link: https://pan.baidu.com/s/19cojEzvTdCBb--K_0pzgYw

Project link: https://download.csdn.net/download/weixin_40651515/10973640
