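Text Classification on the Large Movie Review Dataset with Bag-of-Words and TF-IDF

Introduction to text classification

The dataset

Data preprocessing

Feature extraction

Training the classifiers

Model evaluation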

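Introduction to text classification

Text classification is the process of automatically assigning a text to a category, based on its content, under a given classification scheme. The simplest case sorts texts into two categories and is called binary classification; for movie reviews, for example, each review only needs to be labeled "positive" or "negative". Sorting into more than two categories is called multi-class classification, for example classifying names as French, English, Spanish and so on.

Text classification generally proceeds in the following steps:

Definition: define the data and the classification scheme, that is, which categories exist and which data are needed.
Data preprocessing: prepare the documents by tokenizing, removing stopwords and so on.
Feature extraction: reduce the dimensionality of the document matrix and extract the most useful features from the training set.
Model training: choose a classification model and algorithm, and train a text classifier.
Evaluation: test the classifier on the test set and assess its performance.
Application: use the best-performing model to classify new documents.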

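The dataset

The Large Movie Review Dataset (aclimdb) was released by the Stanford Artificial Intelligence Laboratory in 2011. It contains 25,000 training reviews and 25,000 test reviews, plus about 50,000 unlabeled reviews as auxiliary data. The training set and the test set each contain 12,500 positive examples (pos) and 12,500 negative examples (neg).

Directory structure of aclimdb:

Directory of the positive training examples:

This directory contains 12,500 English reviews. Opening the first one shows the following text: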

Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High's satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I'm here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn't!

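Data preprocessing

First load the data to obtain the training texts, training labels, test texts and test labels. The labels are built while loading: an all-zeros array for the negative examples and an all-ones array for the positive examples, so positive reviews are labeled 1 and negative reviews 0.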

import glob
import numpy as np


def get_data(path_neg, path_pos):
    neg_data = []
    pos_data = []
    files_neg = glob.glob(path_neg)
    files_pos = glob.glob(path_pos)
    # readline() reads only the first line; this assumes each review file is a single line
    for neg in files_neg:
        with open(neg, 'r', encoding='utf-8') as neg_f:
            neg_data.append(neg_f.readline())
    for pos in files_pos:
        with open(pos, 'r', encoding='utf-8') as pos_f:
            pos_data.append(pos_f.readline())
    # Negative reviews are labeled 0, positive reviews 1
    neg_label = np.zeros(len(neg_data)).tolist()
    pos_label = np.ones(len(pos_data)).tolist()
    corpus = neg_data + pos_data
    labels = neg_label + pos_label
    return corpus, labels

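Next, normalize and preprocess the data: use regular expressions to remove special characters, use NLTK's RegexpTokenizer and its tokenize method to split the text into words and drop punctuation, and use NLTK's stopwords corpus to remove stopwords. The result is a normalized corpus.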

import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords


def normalize(corpus):
    # Build the tokenizer and the stopword set once instead of once per document.
    # The stopword list requires the NLTK stopwords corpus (nltk.download('stopwords')).
    tokenizer = RegexpTokenizer(r'\w+')
    stopword = set(stopwords.words('english'))
    normalized_corpus = []
    for text in corpus:
        # Convert to lowercase
        text = text.lower().strip()
        # Remove markup and special characters
        text = re.sub(r"<br />", r" ", text)
        text = re.sub(' +', ' ', text)
        text = re.sub(r'(\W)(?=\1)', '', text)
        text = re.sub(r"([.!?])", r" \1", text)
        text = re.sub(r"[^a-zA-Z.!?]+", r" ", text)
        # Tokenize and drop punctuation
        tokens = tokenizer.tokenize(text)
        # Remove stopwords
        filtered_tokens = [token for token in tokens if token not in stopword]
        # Rejoin the tokens into a single string
        filtered_text = ' '.join(filtered_tokens)
        normalized_corpus.append(filtered_text)
    return normalized_corpus
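To make the effect of the regular expressions easier to follow, here is a minimal sketch that applies only the regex steps to one string. The helper name clean and the sample sentence are made up for illustration; tokenization and stopword removal are left out so the snippet needs nothing beyond the standard library.

```python
import re


def clean(text):
    """Apply the regex steps of normalize() only (hypothetical helper)."""
    text = text.lower().strip()
    text = re.sub(r"<br />", " ", text)         # drop HTML line breaks
    text = re.sub(" +", " ", text)              # collapse runs of spaces
    text = re.sub(r"(\W)(?=\1)", "", text)      # collapse repeated non-word characters
    text = re.sub(r"([.!?])", r" \1", text)     # put a space before . ! ?
    text = re.sub(r"[^a-zA-Z.!?]+", " ", text)  # replace everything else with one space
    return text


sample = "What a pity!!!<br /><br />It ran for 35 years..."
print(clean(sample))  # what a pity ! it ran for years .
```

Note that digits are discarded by the last substitution. The remaining punctuation marks are then dropped by the \w+ tokenizer in the full function.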

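Feature extraction

Before a classifier can be used, features have to be extracted from the text. Classic approaches include:

(1) BOW (bag-of-words): the most basic feature set, in which each word is one feature. A single dataset often yields more than ten thousand features, and removing stopwords helps filter out words that contribute nothing to classification.

(2) Statistical features: TF, IDF, and their combination TF-IDF.

(3) N-gram: takes word order into account by using sequences of N adjacent words as features; the corresponding language model is a Markov chain of order N-1.

This article extracts features in two ways: with the bag-of-words model and with TF-IDF.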
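As a rough illustration of what bag-of-words and n-gram features look like, the sketch below counts them by hand on a toy corpus. The three documents and the helper ngrams are invented for this example, and the counting is done in plain Python rather than with a library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


docs = ["good movie", "bad movie", "good good plot"]  # toy corpus
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct word, sorted so the column order is stable
vocab = sorted({tok for toks in tokenized for tok in toks})

# Bag-of-words matrix: one row per document, one count per vocabulary word
bow = []
for toks in tokenized:
    counts = Counter(toks)
    bow.append([counts[w] for w in vocab])

print(vocab)                    # ['bad', 'good', 'movie', 'plot']
print(bow)                      # [[0, 1, 1, 0], [1, 0, 1, 0], [0, 2, 0, 1]]
print(ngrams(tokenized[2], 2))  # ['good good', 'good plot']
```

The bag-of-words rows ignore word order entirely, whereas the bigrams keep the order of adjacent words.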

sklearn's CountVectorizer and TfidfVectorizer extract bag-of-words features and TF-IDF features respectively.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


def bow_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = CountVectorizer(min_df=1, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features


def tfidf_extractor(corpus, ngram_range=(1, 1)):
    vectorizer = TfidfVectorizer(min_df=1, norm='l2', smooth_idf=True, use_idf=True, ngram_range=ngram_range)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
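The following is a minimal sketch of the weighting that the settings smooth_idf=True and norm='l2' correspond to, written in plain Python on a toy corpus. It assumes the formula given in the scikit-learn documentation, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with the raw count as the term frequency; the corpus and the helper tfidf_row are made up for illustration, and this is not the library's actual implementation.

```python
import math

docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]  # toy corpus
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: number of documents that contain each word
df = {w: sum(1 for d in docs if w in d) for w in vocab}

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(doc):
    """Raw count times idf, then scaled so the row has unit L2 norm."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 3) for v in row])
```

Words that occur in fewer documents ("bad", "plot") receive a larger idf than words that occur in many ("good", "movie"), and the L2 normalization keeps long and short reviews on the same scale.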

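Training the classifiers

Common classifiers include logistic regression (LR), support vector machines (SVM), k-nearest neighbors (KNN), decision trees (DT) and neural networks (NN); choose one to suit the task. The movie review dataset above has a very large number of features, so LR or a linear SVM is a reasonable choice.

Import SGDClassifier from sklearn and set loss='hinge' to get a soft-margin linear SVM classifier; import LogisticRegression to get a logistic regression classifier.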

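from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

# Create the classifiers
svm = SGDClassifier(loss='hinge', max_iter=100)
lr = LogisticRegression(solver='liblinear')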
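To give a feel for how the two classifiers differ, the sketch below compares the per-example losses they are built around: the hinge loss of a linear SVM and the logistic (log) loss of logistic regression. It assumes labels coded as -1/+1 and a raw decision score; the function names are chosen for this illustration, and regularization is ignored.

```python
import math


def hinge_loss(y, score):
    """Hinge loss: zero once the example is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)


def logistic_loss(y, score):
    """Logistic loss: always positive, shrinking smoothly as the margin grows."""
    return math.log(1.0 + math.exp(-y * score))


# Losses for a positive example (y = +1) at several decision scores
for score in (-2.0, 0.0, 0.5, 2.0):
    print(score, hinge_loss(1, score), round(logistic_loss(1, score), 3))
```

Both losses penalize misclassified examples roughly linearly, which may be one reason the two models behave similarly on high-dimensional sparse text features.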

Train the classifier on the training features and training labels, then predict on the test set.

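def train_predict_evaluate_model(classifier,
                                 train_features, train_labels,
                                 test_features, test_labels):
    # Train the model
    classifier.fit(train_features, train_labels)
    # Predict on the test set
    predictions = classifier.predict(test_features)
    # Print the evaluation metrics (get_metrics is defined in the model evaluation section)
    get_metrics(true_labels=test_labels, predicted_labels=predictions)
    return predictions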

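Model evaluation

Import the metrics module from sklearn and compare the true labels of the test set with the predicted labels to evaluate the model. The metrics are accuracy, precision, recall and the F1 score.

Let TP be the number of true positives, FP the false positives, FN the false negatives and TN the true negatives:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Precision (P) = TP / (TP + FP): of all documents predicted as positive, the fraction that are actually positive.

Recall (R) = TP / (TP + FN): of all documents that are actually positive, the fraction that were predicted as positive.

F1 score (F-measure) = 2PR / (P + R): takes both precision and recall into account.

Precision and recall trade off against each other. Ideally both are high, which means a high F1 score.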
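As a worked example of these definitions, the sketch below computes the four metrics for the positive class directly from the confusion counts. The label lists and function names are invented for illustration, and the zero-denominator fallback of 0.0 is an assumption of this sketch.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, FN and TN, treating label 1 as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(true_labels, predicted_labels):
    """Return (accuracy, precision, recall, f1) for the positive class."""
    tp, fp, fn, tn = confusion_counts(true_labels, predicted_labels)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 2, 3)
print(binary_metrics(y_true, y_pred))    # accuracy 0.625, precision 2/3, recall 0.5, F1 4/7
```

These are the metrics for the positive class only. Averaging per-class scores weighted by class size, as average='weighted' does in sklearn, generally gives different precision and F1 values.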

from sklearn import metrics
import numpy as np


def get_metrics(true_labels, predicted_labels):
    # average='weighted' averages the per-class scores, weighted by class size
    print('Accuracy:', np.round(metrics.accuracy_score(true_labels, predicted_labels), 2))
    print('Precision:', np.round(metrics.precision_score(true_labels, predicted_labels, average='weighted'), 2))
    print('Recall:', np.round(metrics.recall_score(true_labels, predicted_labels, average='weighted'), 2))
    print('F1 score:', np.round(metrics.f1_score(true_labels, predicted_labels, average='weighted'), 2))

The main script is as follows:

from data_normalize import get_data, normalize
from feature_extractor import bow_extractor, tfidf_extractor
from train_predict_evaluate import train_predict_evaluate_model
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

if __name__ == "__main__":
    train_corpus, train_labels = get_data('./train/neg/*.txt', './train/pos/*.txt')
    test_corpus, test_labels = get_data('./test/neg/*.txt', './test/pos/*.txt')
    norm_train_corpus = normalize(train_corpus)
    norm_test_corpus = normalize(test_corpus)

    # Bag-of-words features (fit on the training set, reuse the vocabulary for the test set)
    bow_vectorizer, bow_train_features = bow_extractor(norm_train_corpus)
    bow_test_features = bow_vectorizer.transform(norm_test_corpus)

    # TF-IDF features
    tfidf_vectorizer, tfidf_train_features = tfidf_extractor(norm_train_corpus)
    tfidf_test_features = tfidf_vectorizer.transform(norm_test_corpus)

    # Create the classifiers
    svm = SGDClassifier(loss='hinge', max_iter=100)
    lr = LogisticRegression(solver='liblinear')

    # Logistic regression on bag-of-words features
    print("Logistic regression on bag-of-words features")
    lr_bow_predictions = train_predict_evaluate_model(classifier=lr,
                                                      train_features=bow_train_features,
                                                      train_labels=train_labels,
                                                      test_features=bow_test_features,
                                                      test_labels=test_labels)

    # SVM on bag-of-words features
    print("SVM on bag-of-words features")
    svm_bow_predictions = train_predict_evaluate_model(classifier=svm,
                                                       train_features=bow_train_features,
                                                       train_labels=train_labels,
                                                       test_features=bow_test_features,
                                                       test_labels=test_labels)

    # Logistic regression on TF-IDF features
    print("Logistic regression on TF-IDF features")
    lr_tfidf_predictions = train_predict_evaluate_model(classifier=lr,
                                                        train_features=tfidf_train_features,
                                                        train_labels=train_labels,
                                                        test_features=tfidf_test_features,
                                                        test_labels=test_labels)

    # SVM on TF-IDF features
    print("SVM on TF-IDF features")
    svm_tfidf_predictions = train_predict_evaluate_model(classifier=svm,
                                                         train_features=tfidf_train_features,
                                                         train_labels=train_labels,
                                                         test_features=tfidf_test_features,
                                                         test_labels=test_labels)

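The results on the test set are as follows:

Logistic regression on bag-of-words features
Accuracy: 0.86
Precision: 0.86
Recall: 0.86
F1 score: 0.86
SVM on bag-of-words features
Accuracy: 0.85
Precision: 0.85
Recall: 0.85
F1 score: 0.85
Logistic regression on TF-IDF features
Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1 score: 0.88
SVM on TF-IDF features
Accuracy: 0.88
Precision: 0.88
Recall: 0.88
F1 score: 0.88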

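The results show that TF-IDF features perform slightly better than bag-of-words features (0.88 versus 0.85 to 0.86 on every metric), while logistic regression and the SVM classifier perform about the same.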