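"Deep Learning Python Practice" (《深度学习Python实践》), Chapter 22

Text classification example. Dataset link: http://qwone.com/~jason/20Newsgroups/

The code is as follows:

1) Algorithm comparison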

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
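from matplotlib import pyplot as plt

# NOTE: 'tec.sport.hockey', 'talk.politocs.guns' and 'talk.politocs.medeast' match no
# folder in the dataset (the actual names are 'rec.sport.hockey', 'talk.politics.guns'
# and 'talk.politics.mideast'). load_files only loads folders whose names appear in
# this list, so the results below cover the remaining 14 categories.
categories = ['alt.atheism', 'tec.sport.hockey', 'sci.crypt', 'comp.sys.ibm.pc.hardware',
              'sci.med', 'comp.sys.mac.hardware', 'sci.space', 'comp.windows.x',
              'soc.religion.christian', 'misc.forsale', 'talk.politocs.guns', 'rec.autos',
              'talk.politocs.medeast', 'rec.motorcycles', 'talk.politics.misc',
              'rec.sport.baseball', 'talk.religion.misc']

# Load the training data
train_path = '/home/duan/下载/20news-bydate/20news-bydate-train'
dataset_train = load_files(container_path=train_path, categories=categories)

# Load the evaluation data
test_path = '/home/duan/下载/20news-bydate/20news-bydate-test'
dataset_test = load_files(container_path=test_path, categories=categories)

# Data preparation and understanding
# Compute term frequencies
count_vect = CountVectorizer(stop_words='english', decode_error='ignore')
X_train_counts = count_vect.fit_transform(dataset_train.data)
# Check the dimensions of the term-frequency matrix
print(X_train_counts.shape)

# Compute TF-IDF
tf_transformer = TfidfVectorizer(stop_words='english', decode_error='ignore')
X_train_counts_tf = tf_transformer.fit_transform(dataset_train.data)
print(X_train_counts_tf.shape)

# Text features have now been extracted in two ways and their dimensions checked.
# The TF-IDF features are used to train the classification models below.

# Evaluate the algorithms
# Set the evaluation baseline
num_folds = 10
seed = 7
scoring = 'accuracy'

# Linear algorithm: LR
# Non-linear algorithms: CART, SVM, MNB, KNN
models = {}
models['LR'] = LogisticRegression()
models['SVM'] = SVC()
models['CART'] = DecisionTreeClassifier()
models['MNB'] = MultinomialNB()
models['KNN'] = KNeighborsClassifier()

# Compare the algorithms
results = []
for key in models:
    # shuffle=True is needed for random_state to take effect;
    # recent scikit-learn versions raise an error without it
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(models[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))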

Output:

(7838, 77172)
(7838, 77172)
KNN: 0.824575 (0.012700)
LR: 0.920900 (0.008155)
CART: 0.703240 (0.013782)
MNB: 0.896786 (0.009055)
SVM: 0.062772 (0.004306)
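
The two shapes printed above are identical because both vectorizers build the same vocabulary; only the cell values differ. As a rough sketch of what changes, the snippet below turns raw term counts into TF-IDF weights in plain Python. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which as far as I know matches scikit-learn's default settings. The `tfidf` helper and the toy sentences are made up for illustration and use simple whitespace tokenization rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) where each row holds L2-normalized TF-IDF weights."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs
    df = Counter(tok for doc in tokenised for tok in set(doc))
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows


# Toy corpus, for illustration only
toy_docs = [
    "the goalie saved the puck",
    "the rocket reached orbit",
    "the puck hit the post",
]
vocab, rows = tfidf(toy_docs)
for term, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

Terms that occur in every document get the smallest IDF, so they contribute less per occurrence than terms that are specific to a few documents.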

Comparing the algorithms with a box plot:

# Box plot comparing the algorithms over 10-fold cross-validation
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(models.keys())
plt.show()

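Output:

The plot shows that the naive Bayes classifier's scores are tightly clustered across folds, while logistic regression's scores are more skewed. The spread of an algorithm's cross-validation scores reflects how well the algorithm suits the data, so logistic regression and naive Bayes are studied further through parameter tuning.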
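
The spread and skew read off a box plot can also be put into numbers. The sketch below summarizes a list of fold scores with the standard library only; the `summarise_folds` helper is hypothetical, and the two score lists are invented for illustration, not results from this chapter. In the workflow above, each `cv_result` array could be passed in after converting it to a list.

```python
import statistics


def summarise_folds(scores):
    """Summarize cross-validation fold scores the way a box plot does."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    mean = statistics.fmean(scores)
    return {
        "mean": mean,
        "std": statistics.pstdev(scores),   # population std, like numpy's default
        "median": median,
        "iqr": q3 - q1,                     # height of the box
        "mean_minus_median": mean - median, # negative suggests a tail of low scores
    }


# Invented fold accuracies, for illustration only
illustrative_scores = {
    "model_a": [0.90, 0.91, 0.92, 0.93, 0.94],
    "model_b": [0.80, 0.90, 0.91, 0.92, 0.93],
}
for name, scores in illustrative_scores.items():
    summary = summarise_folds(scores)
    print(name, {k: round(v, 4) for k, v in summary.items()})
```

A small interquartile range means the model behaves consistently across folds; a mean well below the median points to one or two folds where the model did noticeably worse.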

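2) Algorithm tuning

The analysis above shows that LR and MNB are worth optimizing further. The parameters of these two algorithms are tuned below to improve their accuracy.

(1) Tuning logistic regression

The hyperparameter tuned for logistic regression is C, the inverse of the regularization strength: the smaller C is, the stronger the regularization. To tune C, give it a set of candidate values on each pass; if the best value falls at the edge of the candidate range, extend the range and repeat until the optimum is found.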
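
The "extend the range and repeat" step can be sketched as a small loop. This is only an illustration of the idea: `search_with_expansion` is a hypothetical helper, the doubling factor is an arbitrary choice, and `toy_score` is an invented stand-in for the cross-validated accuracy that a real run would compute for each C.

```python
import math


def search_with_expansion(evaluate, candidates, factor=2.0, max_rounds=10):
    """Grid search that widens the grid while the best value sits on its edge."""
    grid = sorted(candidates)
    scores = {}
    best = grid[0]
    for _ in range(max_rounds):
        # Score only the values that have not been evaluated yet
        for value in grid:
            if value not in scores:
                scores[value] = evaluate(value)
        best = max(grid, key=scores.get)
        if best == grid[-1]:
            grid.append(grid[-1] * factor)      # optimum may lie above the range
        elif best == grid[0]:
            grid.insert(0, grid[0] / factor)    # optimum may lie below the range
        else:
            break                               # best value is interior: stop
    return best, scores[best]


def toy_score(c):
    """Invented score that peaks near C=40; stands in for CV accuracy."""
    return -(math.log10(c) - math.log10(40)) ** 2


best_c, best_score = search_with_expansion(toy_score, [0.1, 5, 13, 15])
print(best_c, best_score)
```

With the candidate list used in this chapter, a best value of 15 would sit on the upper edge of the grid, which is exactly the situation in which this loop keeps adding larger values.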

# Algorithm tuning
# Tune LR
param_grid = {}
param_grid['C'] = [0.1, 5, 13, 15]
model = LogisticRegression()
# shuffle=True is needed for random_state to take effect
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))

Output:
Best: 0.9393978055626435 using {'C': 15}

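(2) Tuning naive Bayes

Naive Bayes has an alpha parameter, a smoothing parameter whose default value is 1.0.
This parameter can be tuned to improve the algorithm's accuracy.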
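
To see what alpha does, the sketch below applies additive smoothing to word counts for a single class, using the estimate (count + alpha) / (total + alpha * vocabulary size), which is the smoothed likelihood that multinomial naive Bayes is usually described with. The `smoothed_likelihoods` helper and the word counts are invented for illustration.

```python
def smoothed_likelihoods(word_counts, alpha):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    total = sum(word_counts.values())
    n_features = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * n_features)
        for word, count in word_counts.items()
    }


# Invented counts of three words within one class, for illustration only
counts = {"puck": 8, "goal": 2, "orbit": 0}
for alpha in (0.01, 1.0):
    probs = smoothed_likelihoods(counts, alpha)
    print(alpha, {word: round(p, 4) for word, p in probs.items()})
```

A word never seen in a class still gets a non-zero probability, and a larger alpha moves more probability mass toward such unseen words, pulling the estimates closer to uniform.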

# Algorithm tuning
# Tune MNB
param_grid = {}
param_grid['alpha'] = [0.001, 0.01, 0.1, 1.5]
model = MultinomialNB()
# shuffle=True is needed for random_state to take effect
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))

Output:

Best: 0.934804797142128 using {'alpha': 0.01}
0.929829 (0.008380) with {'alpha': 0.001}
0.934805 (0.008096) with {'alpha': 0.01}
0.928043 (0.008024) with {'alpha': 0.1}
0.889640 (0.010375) with {'alpha': 1.5}

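The best parameter for MNB is alpha=0.01. Best: 0.934804797142128 using {'alpha': 0.01}
The best parameter for LR is C=15. Best: 0.9393978055626435 using {'C': 15}

Tuning shows that LR reaches the best accuracy at C=15. Ensemble algorithms are examined next.

3) Ensemble algorithms

Random forest (RF)
AdaBoost (AB)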

ensembles = {}
ensembles['RF'] = RandomForestClassifier()
ensembles['AB'] = AdaBoostClassifier()
# Compare the ensemble algorithms
results = []
for key in ensembles:
    # shuffle=True is needed for random_state to take effect
    kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
    cv_result = cross_val_score(ensembles[key], X_train_counts_tf, dataset_train.target, cv=kfold, scoring=scoring)
    results.append(cv_result)
    print('%s: %f (%f)' % (key, cv_result.mean(), cv_result.std()))

Output:

RF: 0.773795 (0.017244)
AB: 0.620055 (0.017638)

Box plot:

# Box plot comparing the algorithms over 10-fold cross-validation
fig=plt.figure()
fig.suptitle("Algorithm Comparison")
ax=fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(ensembles.keys())
plt.show()

The box plot shows that the random forest's scores are distributed fairly evenly and that it suits the data better, so it is the one worth optimizing further.

4) Tuning the ensemble algorithm

# Ensemble algorithm tuning
# Tune RF
param_grid = {}
param_grid['n_estimators'] = [10, 100, 150, 200]
model = RandomForestClassifier()
# shuffle=True is needed for random_state to take effect
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X=X_train_counts_tf, y=dataset_train.target)
print('Best: %s using %s' % (grid_result.best_score_, grid_result.best_params_))
cv_results = zip(grid_result.cv_results_['mean_test_score'],
                 grid_result.cv_results_['std_test_score'],
                 grid_result.cv_results_['params'])
for mean, std, param in cv_results:
    print('%f (%f) with %r' % (mean, std, param))

Output:

Best: 0.888236795100791 using {'n_estimators': 200}
0.779025 (0.007910) with {'n_estimators': 10}
0.882496 (0.012405) with {'n_estimators': 100}
0.887982 (0.010867) with {'n_estimators': 150}
0.888237 (0.009727) with {'n_estimators': 200}
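
One way to read these four numbers is to look at the accuracy gained per additional tree between neighboring grid points. The snippet below does that arithmetic on the mean scores reported above; the `marginal_gains` helper is hypothetical and nothing here is re-run against the data.

```python
def marginal_gains(points):
    """Accuracy gained per extra tree between consecutive (n_estimators, accuracy) pairs."""
    gains = []
    for (n_prev, acc_prev), (n_next, acc_next) in zip(points, points[1:]):
        gains.append((n_prev, n_next, (acc_next - acc_prev) / (n_next - n_prev)))
    return gains


# Mean cross-validation accuracies copied from the output above
reported = [(10, 0.779025), (100, 0.882496), (150, 0.887982), (200, 0.888237)]
for n_prev, n_next, gain in marginal_gains(reported):
    print('%d -> %d trees: %.6f accuracy per added tree' % (n_prev, n_next, gain))
```

The gain per tree shrinks at every step, so although 200 is the best value in this grid, most of the improvement is already in place by 100 trees.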

Finalizing the model

# Train the final model: LR with the tuned value of C
model = LogisticRegression(C=15)
model.fit(X=X_train_counts_tf, y=dataset_train.target)
# Transform the test data with the TF-IDF vectorizer fitted on the training data
X_test_counts = tf_transformer.transform(dataset_test.data)
predictions = model.predict(X_test_counts)
print(accuracy_score(dataset_test.target, predictions))
print(classification_report(dataset_test.target, predictions))

Output:

0.8844163312248419
             precision    recall  f1-score   support

          0       0.85      0.79      0.82       319
          1       0.78      0.84      0.81       392
          2       0.86      0.88      0.87       385
          3       0.91      0.89      0.90       395
          4       0.81      0.90      0.86       390
          5       0.91      0.91      0.91       396
          6       0.97      0.95      0.96       398
          7       0.94      0.97      0.96       397
          8       0.97      0.94      0.96       396
          9       0.92      0.89      0.91       396
         10       0.93      0.95      0.94       394
         11       0.86      0.93      0.89       398
         12       0.91      0.77      0.84       310
         13       0.70      0.62      0.65       251

avg / total       0.89      0.88      0.88      5217
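
For readers unfamiliar with the report columns, the sketch below computes precision, recall, F1 and support per class from two label lists in plain Python. The `per_class_report` helper and the tiny label lists are invented for illustration; they are not the chapter's predictions.

```python
def per_class_report(y_true, y_pred):
    """Compute precision, recall, F1 and support for every class label."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)  # correct hits
        fp = sum(1 for t, p in pairs if t != label and p == label)  # false alarms
        fn = sum(1 for t, p in pairs if t == label and p != label)  # misses
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # number of true samples of this class
        }
    return report


# Invented labels, for illustration only
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
for label, metrics in per_class_report(y_true, y_pred).items():
    print(label, {k: round(v, 2) for k, v in metrics.items()})
```

Precision answers "of the samples predicted as this class, how many were right", recall answers "of the samples that truly belong to this class, how many were found", and support is the number of true samples per class, which is why the support column in the report sums to the size of the test set.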