Hefei University of Technology (Xuancheng Campus) Data Mining Lab: Classification Task

(The full source code is at the end of the article.)

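I. Experiment Requirements

1.1 Purpose
1) Understand the classification task.
2) Test students' understanding of the data preprocessing steps and reinforce why preprocessing matters.
3) Base models may call existing packages, so that students become familiar with the basic data mining workflow.
4) Learn to evaluate a model along multiple dimensions and to discuss the effect of its parameters.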

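1.2 Datasets
1) The news text classification dataset is in Chinese and needs some preprocessing, including word segmentation and stop-word removal; the image dataset can be processed as the situation requires.
2) Other problems in the data can be handled at your own discretion.
Data note: split the data into train and test sets yourself, typically at a 7:3 ratio.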
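
One way to make the 7:3 split is scikit-learn's train_test_split. The sketch below is only an illustration on made-up data, not the split used in this experiment: the texts, the labels, and the random_state value are assumptions. Passing stratify keeps each class at the same proportion in both parts.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 20 documents in 2 balanced classes (hypothetical).
texts = [f"doc {i}" for i in range(20)]
labels = [0] * 10 + [1] * 10

# test_size=0.3 gives the 7:3 ratio; stratify keeps the class balance in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=0
)

print(len(x_train), len(x_test))
print(sorted(y_test))
```

With 14 news categories of different sizes, a stratified split is a reasonable default because a purely random split can leave a small class under-represented in the test set.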

1.3 Environment
Development environment: Python 3.7 (jieba, pandas, numpy, sklearn, matplotlib.pyplot)

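1.4 Method Requirements
1) Include preprocessing steps tailored to the characteristics of the data, such as stop-word removal and dimensionality reduction.
2) In principle the model is unrestricted: decision trees, NB, NN, SVM, and random forest are all acceptable, and other methods are allowed too.
3) Text may be represented with BOW, topic models, word vectors, and so on; image datasets may use feature representations such as LBP, HOG, and SURF.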

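1.5 Result Requirements
1) Implement one or more basic classification models and compute evaluation metrics such as accuracy and recall.
2) For the key parameters of the model (such as the stopping criterion for splitting in a decision tree, or the number of layers in an NN), try values over different ranges and discuss the best range.
3) Compare and analyze how different feature representations affect the results.
4) If two or more models are applied to the same data, compare their results to assess how well each model suits the classification task on this dataset.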

II. Experiment Content

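2.1 Data Preprocessing
Segment the text and filter out the tokens that appear in the stop-word list:
def pre_treating(para):
    words = jieba.cut(str(para))  # word segmentation
    words = [word for word in words if len(word) > 1]  # drop single-character tokens
    words = [word for word in words if word not in stopWords]  # drop stop words
    return words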
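
To make the effect of the two filters concrete, here is a small sketch that applies the same rules to a hand-written token list. The tokens stand in for jieba output and the stop-word set is invented for the example; filter_tokens is a hypothetical helper, not part of the experiment code.

```python
def filter_tokens(tokens, stop_words):
    """Keep tokens longer than one character that are not stop words."""
    return [t for t in tokens if len(t) > 1 and t not in stop_words]

# Hand-written tokens standing in for jieba output (hypothetical example).
tokens = ["我们", "的", "经济", "在", "持续", "增长", "了"]
# A set gives constant-time membership tests, unlike a list.
stop_words = {"我们", "持续"}

kept = filter_tokens(tokens, stop_words)
print(kept)  # ['经济', '增长']
```

The length filter alone already removes most single-character function words, so the stop-word list mainly matters for longer common words.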

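Remove useless tokens from both the training set and the test set, keeping only tokens that may be informative. Each set is filtered from its own content column:
for name in classname:
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)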

Add the class label as a column:
for i, name in enumerate(classname):
    data[name]['flag'] = i
    test[name]['flag'] = i

2.2 Word Frequency Statistics
Combine the per-class tables into one training table and one test table:
result = pd.concat([data[name] for name in classname])
testdata = pd.concat([test[name] for name in classname])

Collect the 800 most frequent words in the training table:
topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
wordCount = pd.Series(words).value_counts()[0:topWordNum]
wordCount = wordCount.index.values.tolist()  # drop the counts, keep only the words
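
The same top-k vocabulary can be built with the standard library alone. The following is a minimal sketch on invented token lists; top_words is a hypothetical helper name, and the vocabulary is drawn from training documents only so that no information from the test set leaks into the features.

```python
from collections import Counter

def top_words(token_lists, k):
    """Return the k most frequent tokens across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]

# Invented training documents, already segmented and filtered.
train_docs = [["经济", "增长", "经济"], ["比赛", "经济"], ["比赛", "球队"]]

vocab = top_words(train_docs, 2)
print(vocab)  # ['经济', '比赛']
```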

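2.3 Converting Words to Vectors
Convert each document's word list into a vector whose entries are the number of times each high-frequency word occurs in that document:
def wordsToVec(words):
    vec = map(lambda word: words.count(word), wordCount)
    vec = list(vec)
    return vec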
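
Calling words.count once per vocabulary word rescans the document 800 times. A possible alternative, sketched here under the assumption that only the counts matter, is to count the document once with collections.Counter and then look up each vocabulary word. The names words_to_vec and vocab are illustrative, not from the original code.

```python
from collections import Counter

def words_to_vec(words, vocab):
    """Bag-of-words count vector of `words` over the fixed vocabulary `vocab`."""
    counts = Counter(words)  # one pass over the document
    return [counts[w] for w in vocab]  # missing words count as 0

vocab = ["经济", "比赛", "球队"]
vec = words_to_vec(["经济", "增长", "经济", "球队"], vocab)
print(vec)  # [2, 0, 1]
```

The output is the same kind of vector that wordsToVec produces; words outside the vocabulary, such as 增长 here, are simply ignored.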

Add the vectors to the combined tables and drop the columns that are no longer needed:
result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'], axis=1)
result = result.drop(['channelName'], axis=1)

testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'], axis=1)
testdata = testdata.drop(['channelName'], axis=1)

2.4 Random Forest
Convert the vectors and labels into x and y values:
xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

Random forest:
def get_rf_ascore(my_para):
    # accuracy of a forest with my_para trees
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = accuracy_score(y_test, y_pre)
    return score

def get_rf_rscore(my_para):
    # macro-averaged recall of a forest with my_para trees
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = recall_score(y_test, y_pre, average='macro')
    return score
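
Because the two functions each train their own forest without a fixed seed, the accuracy and the recall for a given n_estimators come from two different random models. One possible refinement, sketched below on synthetic data from make_classification rather than the news dataset, is to train once with a fixed random_state and compute both metrics from the same predictions. The function name rf_scores and the seed value are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

def rf_scores(n_estimators, x_train, y_train, x_test, y_test, seed=0):
    """Train one forest and return (accuracy, macro recall) on the test set."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    acc = accuracy_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred, average="macro")
    return acc, rec

# Synthetic 3-class data standing in for the bag-of-words vectors.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

acc, rec = rf_scores(10, x_tr, y_tr, x_te, y_te)
print(f"accuracy={acc:.3f}, macro recall={rec:.3f}")
```

This also halves the training time of a parameter sweep, since each setting is fitted once instead of twice.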

2.5 Decision Tree
The decision tree is implemented with the sklearn package. The class for classification trees is DecisionTreeClassifier. The code is as follows:
def get_dt_ascore(my_para):
    # accuracy of a tree limited to depth my_para
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = accuracy_score(y_test, y_pre)
    return score

def get_dt_rscore(my_para):
    # macro-averaged recall of a tree limited to depth my_para
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = recall_score(y_test, y_pre, average='macro')
    return score

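2.6 Random Forest Results
rfaScore = []
rfrScore = []
estimator = np.arange(1, 20, 1)  # n_estimators from 1 to 19

for i in estimator:
    temp_ascore = get_rf_ascore(i)
    temp_rscore = get_rf_rscore(i)
    rfaScore.append(temp_ascore)
    rfrScore.append(temp_rscore)

# plot against the parameter values so the x axis shows n_estimators, not list indices
plt.plot(estimator, rfaScore, color='red')    # accuracy
plt.plot(estimator, rfrScore, color='green')  # macro recall
plt.xlabel('estimator')
plt.ylabel('score')
plt.show()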

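2.7 Decision Tree Results
dtaScore = []
dtrScore = []
testDepth = np.arange(1, 100, 1)  # max_depth from 1 to 99

for i in testDepth:
    temp_ascore = get_dt_ascore(i)
    temp_rscore = get_dt_rscore(i)
    dtaScore.append(temp_ascore)
    dtrScore.append(temp_rscore)

# plot against the parameter values so the x axis shows max_depth, not list indices
plt.plot(testDepth, dtaScore, color='yellow')  # accuracy
plt.plot(testDepth, dtrScore, color='blue')    # macro recall
plt.ylabel('score')
plt.xlabel('testDepth')
plt.show()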
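
Besides reading the best setting off the plot, it can be picked programmatically from the score list. The sketch below uses invented scores purely to show the mechanics; they are not results from this experiment, and best_param is a hypothetical helper. It assumes the parameter values are listed in ascending order, so ties resolve to the smallest (simplest) setting.

```python
def best_param(params, scores):
    """Return (param, score) with the highest score; ties go to the earliest param."""
    best_score = max(scores)
    for p, s in zip(params, scores):
        if s == best_score:
            return p, s

# Made-up scores for max_depth = 1..6 (illustrative values only, not results).
depths = [1, 2, 3, 4, 5, 6]
scores = [0.40, 0.55, 0.63, 0.63, 0.61, 0.60]

depth, score = best_param(depths, scores)
print(depth, score)  # 3 0.63
```

Choosing parameters by test-set score makes the reported score optimistic; a held-out validation split or cross-validation on the training set would be a more careful setup.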

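III. Analysis and Summary

3.1 Analysis

The figure shows the scores as the number of trees, n_estimators, is varied over the tested range (1 to 19, since np.arange(1, 20, 1) excludes the endpoint 20). Within this range the scores rise as the number of trees increases: both the accuracy and the recall of the predictions improve. The trend for larger forests would need additional experiments to verify. (Red is accuracy, green is recall.)

The figure shows the scores as the maximum tree depth, max_depth, is varied over the tested range (1 to 99, since np.arange(1, 100, 1) excludes the endpoint 100). Within this range the scores rise as the maximum depth increases, and the prediction accuracy improves. The trend for greater depths would need additional experiments to verify. (Yellow is accuracy, blue is recall.)
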
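The two curves measure different things: accuracy is the share of all test samples predicted correctly, while macro recall averages the per-class recall so that every class counts equally regardless of size. The following sketch computes both by hand on invented labels to show how they can diverge. It averages over the classes present in y_true, which is a simplifying assumption; the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hit = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hit / total)
    return sum(recalls) / len(recalls)

# Invented labels: class 2 has one sample and it is misclassified.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

acc = accuracy(y_true, y_pred)
rec = macro_recall(y_true, y_pred)
print(round(acc, 3), round(rec, 3))  # 0.667 0.556
```

Here a single missed sample in the smallest class pulls macro recall well below accuracy, which is why the two curves in the plots need not coincide.
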
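3.2 Summary
Through this experiment I became familiar with several models in the sklearn package, which are widely used in academic research. By looking up material online, I learned how to preprocess Chinese text, namely how to filter stop words. I also deepened my understanding of the random forest algorithm. Training and predicting with sklearn gave me a better grasp of data mining and of Python, made me more fluent with the language, and once again showed me how convenient Python is.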

Source Code

# Program 1: TF-IDF features + multinomial naive Bayes
import re

import jieba
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Read the training and test sets
data = pd.read_excel("train.xlsx")
test = pd.read_excel("test.xlsx")

# Drop samples that have no label or no title
data = data[data['channelName'].notnull()]
data = data[data['title'].notnull()]
test = test[test['channelName'].notnull()]
test = test[test['title'].notnull()]

# Strip punctuation and whitespace; the hyphen is escaped so it is not read as a range
re_obj = re.compile(r"[~`!@#$%^&*()_+\-=|';:/.,?><·！￥…（）—“”‘’：；、。，？《》{}【】\s]+")

def get_stopword():
    s = set()
    with open('中文停用词表.txt', encoding='utf-8') as f:
        for line in f:
            s.add(line.strip())
    return s

stopword = get_stopword()

def remove_stopword(words):
    return [word for word in words if word not in stopword]

def Data_preprocessing(text):
    text = re_obj.sub("", str(text))
    text = jieba.lcut(text)
    text = remove_stopword(text)
    return " ".join(text)

data['title'] = data['title'].apply(Data_preprocessing)
test['title'] = test['title'].apply(Data_preprocessing)

# Map class names to integer labels
dic = {'财经' : 0, '房产' : 1, '教育' : 2, '科技' : 3, '军事' : 4, '汽车' : 5, '体育' : 6, '游戏' : 7, '娱乐' : 8, '养生健康' : 9, '历史' : 10, '搞笑' : 11, '旅游' : 12, '母婴' : 13}
data['channelName'] = data['channelName'].map(dic)
test['channelName'] = test['channelName'].map(dic)

x_train = data['title']
y_train = data['channelName']
x_test = test['title']
y_test = test['channelName']

# ngram_range: n-gram lengths to extract; analyzer='word': features are word n-grams;
# max_features: keep the 30000 most frequent terms; sublinear_tf: use 1 + log(tf)
vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', max_features=30000, sublinear_tf=True)
# Learn the vocabulary from the training documents
vectorizer.fit(x_train)

model = MultinomialNB(alpha=0.001)
model.fit(vectorizer.transform(x_train), y_train)
# The test text must pass through the same fitted vectorizer before prediction
yPred = model.predict(vectorizer.transform(x_test))
yTest = np.array(y_test)
yPred = np.array(yPred)
print('accuracy:', accuracy_score(yTest, yPred))
print('macro recall:', recall_score(yTest, yPred, average='macro'))
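
A scikit-learn Pipeline is one way to guarantee that test text always passes through the same fitted vectorizer before it reaches the classifier. The sketch below uses a handful of invented, pre-segmented sentences and default hyperparameters, so it illustrates the mechanism only and says nothing about performance on the news data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented, already segmented documents (tokens joined by spaces).
train_texts = ["股市 上涨 经济", "银行 利率 经济", "球队 比赛 胜利", "球员 比赛 进球"]
train_labels = [0, 0, 1, 1]  # 0 = finance, 1 = sports (hypothetical labels)

# fit() fits the vectorizer and the classifier; predict() applies both in order.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

pred = model.predict(["经济 利率", "比赛 球队"])
print(pred.tolist())  # [0, 1]
```

Note that the default TfidfVectorizer tokenizer keeps only tokens of two or more characters, which happens to match the single-character filter used elsewhere in this article.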


# Program 2: bag-of-words features + random forest
import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

with open('中文停用词表.txt', 'r', encoding='utf-8') as f:
    stopWords = set(line.strip() for line in f)

# sheet_name=None reads every sheet into a dict of DataFrames keyed by sheet name
data = pd.read_excel('train.xlsx', sheet_name=None)
test = pd.read_excel('test.xlsx', sheet_name=None)
classname = ['财经','房产','教育','科技','军事','汽车','体育','游戏','娱乐','养生健康','历史','搞笑','旅游','母婴']

# Segment the text and filter out tokens that are in the stop-word list
def pre_treating(para):
    words = jieba.cut(str(para))  # word segmentation
    words = [word for word in words if len(word) > 1]
    words = [word for word in words if word not in stopWords]
    return words

# Remove useless tokens from the training and test sets and add the class label
for i, name in enumerate(classname):
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)
    data[name]['flag'] = i
    test[name]['flag'] = i

# Combine the per-class sheets
result = pd.concat([data[name] for name in classname])
testdata = pd.concat([test[name] for name in classname])

# Collect the 800 most frequent words in the training set
topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
wordCount = pd.Series(words).value_counts()[0:topWordNum]
# Drop the counts, keep only the words
wordCount = wordCount.index.values.tolist()

# Convert a word list into a vector of high-frequency word counts
def wordsToVec(words):
    vec = map(lambda word: words.count(word), wordCount)
    vec = list(vec)
    return vec

result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'], axis=1)
result = result.drop(['channelName'], axis=1)
testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'], axis=1)
testdata = testdata.drop(['channelName'], axis=1)

xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

# Random forest: accuracy for a given number of trees
def get_rf_ascore(my_para):
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = accuracy_score(y_test, y_pre)
    return score

# Random forest: macro-averaged recall for a given number of trees
def get_rf_rscore(my_para):
    clf = RandomForestClassifier(n_estimators=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = recall_score(y_test, y_pre, average='macro')
    return score

# Parameter sweep and result plot
rfaScore = []
rfrScore = []
estimator = np.arange(1, 20, 1)  # n_estimators from 1 to 19
for i in estimator:
    temp_ascore = get_rf_ascore(i)
    temp_rscore = get_rf_rscore(i)
    rfaScore.append(temp_ascore)
    rfrScore.append(temp_rscore)

plt.plot(estimator, rfaScore, color='red')    # accuracy
plt.plot(estimator, rfrScore, color='green')  # macro recall
plt.xlabel('estimator')
plt.ylabel('score')
plt.show()


# Program 3: bag-of-words features + decision tree
import jieba
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score, recall_score

with open('中文停用词表.txt', 'r', encoding='utf-8') as f:
    stopWords = set(line.strip() for line in f)

# sheet_name=None reads every sheet into a dict of DataFrames keyed by sheet name
data = pd.read_excel('train.xlsx', sheet_name=None)
test = pd.read_excel('test.xlsx', sheet_name=None)
classname = ['财经','房产','教育','科技','军事','汽车','体育','游戏','娱乐','养生健康','历史','搞笑','旅游','母婴']

# Segment the text and filter out tokens that are in the stop-word list
def pre_treating(para):
    words = jieba.cut(str(para))  # word segmentation
    words = [word for word in words if len(word) > 1]
    words = [word for word in words if word not in stopWords]
    return words

# Remove useless tokens from the training and test sets and add the class label
for i, name in enumerate(classname):
    data[name]['words'] = data[name]['content'].apply(pre_treating)
    test[name]['words'] = test[name]['content'].apply(pre_treating)
    data[name]['flag'] = i
    test[name]['flag'] = i

# Combine the per-class sheets
result = pd.concat([data[name] for name in classname])
testdata = pd.concat([test[name] for name in classname])

# Collect the 800 most frequent words in the training set
topWordNum = 800
items = result['words'].values.tolist()
words = []
for item in items:
    words.extend(item)
wordCount = pd.Series(words).value_counts()[0:topWordNum]
# Drop the counts, keep only the words
wordCount = wordCount.index.values.tolist()

# Convert a word list into a vector of high-frequency word counts
def wordsToVec(words):
    vec = map(lambda word: words.count(word), wordCount)
    vec = list(vec)
    return vec

result['vec'] = result['words'].apply(wordsToVec)
result = result.drop(['content'], axis=1)
result = result.drop(['channelName'], axis=1)
testdata['vec'] = testdata['words'].apply(wordsToVec)
testdata = testdata.drop(['content'], axis=1)
testdata = testdata.drop(['channelName'], axis=1)

xTrain = result['vec'].tolist()
yTrain = result['flag'].tolist()
xTest = testdata['vec'].tolist()
yTest = testdata['flag'].tolist()

# Decision tree: accuracy for a given maximum depth
def get_dt_ascore(my_para):
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = accuracy_score(y_test, y_pre)
    return score

# Decision tree: macro-averaged recall for a given maximum depth
def get_dt_rscore(my_para):
    clf = tree.DecisionTreeClassifier(max_depth=my_para)
    clf.fit(xTrain, yTrain)
    y_pre = clf.predict(xTest)
    y_test = np.array(yTest)
    y_pre = np.array(y_pre)
    score = recall_score(y_test, y_pre, average='macro')
    return score

# Parameter sweep and result plot
dtaScore = []
dtrScore = []
testDepth = np.arange(1, 100, 1)  # max_depth from 1 to 99
for i in testDepth:
    temp_ascore = get_dt_ascore(i)
    temp_rscore = get_dt_rscore(i)
    dtaScore.append(temp_ascore)
    dtrScore.append(temp_rscore)

plt.plot(testDepth, dtaScore, color='yellow')  # accuracy
plt.plot(testDepth, dtrScore, color='blue')    # macro recall
plt.ylabel('score')
plt.xlabel('testDepth')
plt.show()