[Python-ML] Text Mining on the Movie Review Dataset: Online Learning

# -*- coding: utf-8 -*-
'''
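Created on January 22, 2018
@author: Jason.F
@summary: Text mining on movie reviews: extract the review text, vectorize the features, train a model to predict sentiment, learn online from mini-batches, and persist the model
Movie review data: http://ai.stanford.edu/~amaas/data/sentiment/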
'''
import pyprind
import pandas as pd
import os
import numpy as np
import re
import time
import pickle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

start = time.perf_counter()  # time.clock() was removed in Python 3.8
homedir = os.getcwd()  # current working directory
# Load the raw reviews and write them to movie_data.csv (one-time step, left disabled inside the string below)
'''
pbar = pyprind.ProgBar(50000)
labels = {'pos': 1, 'neg': 0}  # labels for positive and negative reviews
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = homedir + '/aclImdb/%s/%s' % (s, l)
        for file in os.listdir(path):
            with open(os.path.join(path, file), 'r', encoding='utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])  # DataFrame.append was removed in pandas 2.0
            pbar.update()
df = pd.DataFrame(rows, columns=['review', 'sentiment'])
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))  # shuffle so positive and negative samples are mixed
df.to_csv(homedir + '/movie_data.csv', index=False)
'''
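# Vectorize the text, then train and incrementally update the model
df = pd.read_csv(homedir + '/movie_data.csv')
stop = stopwords.words('english')  # English stop-word list (requires the NLTK 'stopwords' corpus)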
def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags, i.e. everything between < and >
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace runs of non-word characters with a space,
    # then append the emoticons with their '-' nose removed
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized


def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # every row ends with ',<label>\n', so the label is the second-to-last character
            text, label = line[:-3], int(line[-2])
            yield text, label


def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y


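vect = HashingVectorizer(decode_error='ignore', n_features=2**21, preprocessor=None, tokenizer=tokenizer)
# Stochastic gradient descent: the weights are updated one sample at a time.
# loss='log_loss' needs scikit-learn >= 1.1 (older releases call it 'log');
# the old n_iter argument is gone, and partial_fit makes a single pass per call anyway.
clf = SGDClassifier(loss='log_loss', random_state=1)
doc_stream = stream_docs(path=homedir + '/movie_data.csv')
pbar = pyprind.ProgBar(45)
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)  # incremental (partial) training
    pbar.update()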
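# Evaluate on the 5,000 documents left in the stream
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print('Accuracy:%.3f' % clf.score(X_test, y_test))
clf = clf.partial_fit(X_test, y_test)  # update the model with the test documents as well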
# Persist the model
dest = os.path.join('pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=2)  # save the stop words
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=2)  # save the model
# Load the model and predict
with open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb') as f:
    clf = pickle.load(f)
label = {0: 'negative', 1: 'positive'}
example = ['I love this movie']
X = vect.transform(example)
print('Prediction:%s \nProbability:%.2f%%' % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))
end = time.perf_counter()  # time.clock() was removed in Python 3.8
print('finish all in %s' % str(end - start))

Results:

Warning: No valid output stream.
Accuracy:0.867
Prediction:positive
Probability:82.53%
finish all in 50.6331459967
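
A note on stream_docs: it splits each row by position (line[:-3] and int(line[-2])), which works because every row ends in ",0" or ",1", but it leaves the CSV quoting written by to_csv inside the review text. The sketch below is one possible alternative built on the standard csv module. It assumes the file has the review,sentiment header shown in the listing; the name stream_docs_csv and the sample rows are made up for illustration and are not part of the original script. Unlike the original get_minibatch, this variant also returns a final, smaller batch instead of discarding it.

```python
import csv
import io


def stream_docs_csv(fileobj):
    """Yield (text, label) pairs from a CSV stream with a 'review,sentiment' header."""
    reader = csv.reader(fileobj)
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed rows
        text, label = row
        yield text, int(label)


def get_minibatch(doc_stream, size):
    """Collect up to `size` documents; return (None, None) once the stream is empty."""
    docs, y = [], []
    for text, label in doc_stream:
        docs.append(text)
        y.append(label)
        if len(docs) == size:
            break
    if not docs:
        return None, None
    return docs, y


# Tiny in-memory stand-in for movie_data.csv (made-up rows, for illustration only)
sample = io.StringIO(
    'review,sentiment\n'
    '"A great, moving film",1\n'
    '"He said ""boring"" and left",0\n'
    'Plain text without quotes,1\n'
)
stream = stream_docs_csv(sample)
first_docs, first_labels = get_minibatch(stream, size=2)
rest_docs, rest_labels = get_minibatch(stream, size=2)
end_docs, end_labels = get_minibatch(stream, size=2)
```

With a real file, the stream would be opened as open(path, newline='', encoding='utf-8'), which is what the csv module expects so that quoted fields are parsed correctly.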

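To make the tokenizer step concrete, the following self-contained sketch applies the same three regular expressions as the listing to one sample string. The stop-word set here is a tiny hand-picked stand-in (an assumption, so that no NLTK download is needed); the script itself uses stopwords.words('english').

```python
import re

# Tiny stand-in for NLTK's English stop words (an assumption, to avoid the NLTK download)
STOP = {'this', 'is', 'a', 'the', 'and', 'i'}


def tokenizer(text, stop=STOP):
    text = re.sub(r'<[^>]*>', '', text)  # drop HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, collapse non-word characters to spaces, append emoticons without '-'
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')
    return [w for w in text.split() if w not in stop]


tokens = tokenizer('</a>This :) is :( a test :-)!')
print(tokens)  # ['test', ':)', ':(', ':)']
```

The HTML tag and punctuation disappear, the stop words are filtered out, and the emoticons survive as separate tokens, which is useful because they carry sentiment.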