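English Text Classification: Sentiment Classification of Movie Reviews

1. Import the required libraries

2. Read the training data with Pandas

3. Build the stopword list

4. Preprocess the data

5. Add the cleaned data to the DataFrame

6. Compute a feature vector for each review in the training set

7. Build and train the random forest classifier

8. Read the test data and make predictions

9. Write the predictions to a CSV file

1. Import the required libraries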

import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import nltk
from nltk.corpus import stopwords

2. Read the training data with Pandas

# Read the tab-separated training data with pandas
datafile=os.path.join('E:\\english_data','labeledTrainData.tsv')
df=pd.read_csv(datafile,sep='\t',escapechar='\\')
print('Number of reviews:{}'.format(len(df)))
df.head()

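3. Build the stopword list

# Alternative using NLTK's built-in list:
# words_nostop=[w for w in words if w not in stopwords.words('english')]
# Read one stopword per line. The set is built directly so that the
# nltk `stopwords` import is not shadowed and the file is closed afterwards.
with open('E:\\english_data\\stopwords.txt') as f:
    eng_stopwords={line.rstrip() for line in f}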

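4. Preprocess the data

(1) Remove HTML tags

(2) Remove punctuation

(3) Split sentences into words

(4) Remove stopwords

(5) Rejoin the words into a new sentence

def clean_text(text):
    text=BeautifulSoup(text,'html.parser').get_text()   # (1) strip HTML tags
    text=re.sub('[^a-zA-Z]',' ',text)                   # (2) replace non-letters (punctuation, digits) with spaces
    words=text.lower().split()                          # (3) lowercase and split into words
    words=[w for w in words if w not in eng_stopwords]  # (4) drop stopwords
    return ' '.join(words)                              # (5) rejoin into one string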
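To make the five steps concrete, the sketch below traces them on one made-up review. It is only an illustration that runs with the standard library: a simple regex stands in for BeautifulSoup, and `demo_stopwords` is a tiny hypothetical stopword set, not the contents of stopwords.txt.

```python
import re

# Hypothetical mini stopword set, for illustration only.
demo_stopwords = {'this', 'is', 'it', 's'}

def clean_text_demo(text):
    text = re.sub(r'<[^>]+>', ' ', text)    # (1) strip HTML tags (regex stand-in for BeautifulSoup)
    text = re.sub('[^a-zA-Z]', ' ', text)   # (2) replace non-letters with spaces
    words = text.lower().split()            # (3) lowercase and split into words
    words = [w for w in words if w not in demo_stopwords]  # (4) drop stopwords
    return ' '.join(words)                  # (5) rejoin into one string

sample = "<br />This movie is NOT bad; it's great!"
print(clean_text_demo(sample))  # movie not bad great
```

Whether a word such as "not" is in the stopword list matters for sentiment: if it were removed here, the result would be "movie bad great", so the stopword file is worth checking.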

5. Add the cleaned data to the DataFrame

df['clean_review']=df.review.apply(clean_text)
df.head()

6. Compute a feature vector for each review in the training set

(1) Extract bag-of-words features with sklearn's CountVectorizer

vectorizer=CountVectorizer(max_features=5000)
train_data_features=vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape
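As a rough picture of what the resulting matrix holds, here is a minimal pure-Python sketch of bag-of-words counting with a capped vocabulary. It is not CountVectorizer's actual implementation (which also handles tokenization rules, sparse storage and more); `bag_of_words` and the toy `docs` are made up for this illustration.

```python
from collections import Counter

def bag_of_words(docs, max_features):
    # Count how often each term occurs across the whole corpus.
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent terms, then sort them to fix the column order.
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary term.
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ['good movie good plot', 'bad movie', 'good acting bad plot']
vocab, rows = bag_of_words(docs, max_features=4)
print(vocab)  # ['bad', 'good', 'movie', 'plot']
print(rows)   # [[0, 2, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1]]
```

The rare term "acting" falls outside the four most frequent terms and gets no column. Capping the vocabulary with max_features=5000 has the same effect on the real reviews, and it fixes the matrix width at 5000 columns or fewer.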

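(2) Train a word-embedding model with Gensim's Word2Vec

Note: the random forest in step 7 uses the bag-of-words features from (1). The Word2Vec model is trained and saved here but is not used in the later steps.

from gensim.models.word2vec import Word2Vec

# Word2Vec expects a list of tokenized sentences.
# Assumption: each cleaned review is split into words and used as one sentence.
sentences = [review.split() for review in df.clean_review]

# Parameters for training the word vectors
num_features = 300    # Word vector dimensionality
min_word_count = 40   # Minimum word count
num_workers = 4       # Number of threads to run in parallel
context = 10          # Context window size
downsampling = 1e-3   # Downsample setting for frequent words

# gensim >= 4.0 names the dimensionality argument vector_size; older releases call it size.
model = Word2Vec(sentences, workers=num_workers,
                 vector_size=num_features, min_count=min_word_count,
                 window=context, sample=downsampling)

# If you don't plan to train the model any further, calling
# init_sims will make the model much more memory-efficient.
# (init_sims is deprecated in gensim >= 4.0 and can be omitted there.)
model.init_sims(replace=True)

# It can be helpful to create a meaningful model name and
# save the model for later use. You can load it later using Word2Vec.load()
model_name = '300features_40minwords_10context'  # example name only; the original does not define it
model.save(os.path.join('..', 'models', model_name))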
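Word2Vec yields one vector per word, not one per review. The original post does not show how to get from one to the other. One common option, offered here as an assumption and not as the author's method, is to average the vectors of the words in a review. In the sketch below, `review_vector` and `toy_vectors` are hypothetical names; the dict stands in for a trained model's word vectors.

```python
import numpy as np

def review_vector(words, word_vectors, num_features):
    """Average the vectors of the words that are in the vocabulary."""
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        # No known words: fall back to an all-zero vector.
        return np.zeros(num_features, dtype=np.float32)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional word vectors, for illustration only.
toy_vectors = {
    'good': np.array([1.0, 0.0, 0.0], dtype=np.float32),
    'movie': np.array([0.0, 1.0, 0.0], dtype=np.float32),
}

vec = review_vector('good movie unknownword'.split(), toy_vectors, num_features=3)
print(vec)
```

Words outside the vocabulary (for example those below min_word_count) are skipped. With a trained model, `model.wv` could probably be passed in place of the dict, since it supports both membership tests and indexing by word.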

7. Build and train the random forest classifier

forest=RandomForestClassifier(n_estimators=100)
forest=forest.fit(train_data_features,df.sentiment)

# Delete variables that are no longer needed, to free memory
del df
del train_data_features
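The classifier above is fit on all labeled reviews, so the walkthrough gives no accuracy estimate, and the imported confusion_matrix is never used. A minimal sketch of a hold-out check follows. The data here is synthetic and only stands in for train_data_features and df.sentiment; the names `X`, `y` and `forest_demo` are made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels (assumption, not the review data).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 5))
y = (X[:, 0] > X[:, 1]).astype(int)

# Hold out 25% of the labeled rows for validation, keeping the class balance.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

forest_demo = RandomForestClassifier(n_estimators=100, random_state=0)
forest_demo.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels (order: 0, 1).
cm = confusion_matrix(y_val, forest_demo.predict(X_val), labels=[0, 1])
print(cm)
```

On the real data, the same split would be applied before the `del` statements, with the final model then refit on all labeled reviews if desired.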

8. Read the test data and make predictions

datafile=os.path.join('E:\\english_data','testData.tsv')
df=pd.read_csv(datafile,sep='\t',escapechar='\\')
print('Number of reviews:{}'.format(len(df)))
df['clean_review']=df.review.apply(clean_text)
df.head()

# Reuse the vectorizer fitted on the training data (transform, not fit_transform)
test_data_features=vectorizer.transform(df.clean_review).toarray()
test_data_features.shape

result=forest.predict(test_data_features)
output=pd.DataFrame({'id':df.id,'sentiment':result})
output.head()

9. Write the predictions to a CSV file

output.to_csv(os.path.join('E:\\english_data','Bag_of_Words_model.csv'),index=False)

del df
del test_data_features

