Machine Learning Algorithms: Bayesian Methods, Part 3. Case Study 2: Classifying News Data


import numpy as np
from time import time
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.datasets import fetch_20newsgroups  # loader for the 20 Newsgroups data
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF encoding
from sklearn.feature_selection import SelectKBest, chi2  # chi-squared test for feature selection
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import MultinomialNB, BernoulliNB  # multinomial and Bernoulli naive Bayes
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
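Before moving on, it may help to see what the `alpha` hyperparameter of a multinomial naive Bayes classifier does. The block below is a minimal pure-Python sketch, not scikit-learn's implementation: it works on raw word counts instead of TF-IDF features, and the toy documents, the function names `train_multinomial_nb` and `predict`, and the whitespace tokenization are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit a tiny multinomial naive Bayes model on whitespace-tokenized text."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)           # number of documents per class
    word_counts = defaultdict(Counter)     # per-class word frequencies
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())

    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values())
        # Additive smoothing: alpha is added to every word count, so a word
        # never seen in class y still gets a small non-zero probability.
        denom = total + alpha * len(vocab)
        log_prior = math.log(class_docs[y] / len(docs))
        log_lik = {w: math.log((word_counts[y][w] + alpha) / denom) for w in vocab}
        model[y] = (log_prior, log_lik)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {}
    for y, (log_prior, log_lik) in model.items():
        scores[y] = log_prior + sum(log_lik[w] for w in doc.split() if w in log_lik)
    return max(scores, key=scores.get)


# Hypothetical toy corpus, labeled with two of the category names used in this post.
toy_docs = ["rocket orbit launch", "orbit moon rocket",
            "pixel image render", "render polygon image"]
toy_labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

model = train_multinomial_nb(toy_docs, toy_labels, alpha=1.0)
print(predict(model, "moon launch"))
print(predict(model, "image pixel"))
```

A smaller `alpha` trusts the observed counts more, while a larger one pulls every word probability toward uniform, which is why it is worth tuning by cross-validation.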

### Benchmark helper
# Note: relies on the globals x_train, y_train, x_test and y_test.
def benchmark(clf, name):
    print('Classifier:', clf)

    ## Set up the hyperparameter search; 5-fold cross-validation picks the best value
    alpha_can = np.logspace(-2, 1, 10)
    model = GridSearchCV(clf, param_grid={'alpha': alpha_can}, cv=5)
    m = alpha_can.size

    ## If the model has an alpha parameter, search over it
    if hasattr(clf, 'alpha'):
        model.set_params(param_grid={'alpha': alpha_can})
        m = alpha_can.size

    ## If the model has a k-nearest-neighbors parameter, search over it
    if hasattr(clf, 'n_neighbors'):
        neighbors_can = np.arange(1, 15)
        model.set_params(param_grid={'n_neighbors': neighbors_can})
        m = neighbors_can.size

    ## C grid for LinearSVC
    if hasattr(clf, 'C'):
        C_can = np.logspace(1, 3, 3)
        model.set_params(param_grid={'C': C_can})
        m = C_can.size

    ## C and gamma grid for SVC (logical `and`, not bitwise `&`)
    if hasattr(clf, 'C') and hasattr(clf, 'gamma'):
        C_can = np.logspace(1, 3, 3)
        gamma_can = np.logspace(-3, 0, 3)
        model.set_params(param_grid={'C': C_can, 'gamma': gamma_can})
        m = C_can.size * gamma_can.size

    ## max_depth grid for tree-based models
    if hasattr(clf, 'max_depth'):
        max_depth_can = np.arange(4, 10)
        model.set_params(param_grid={'max_depth': max_depth_can})
        m = max_depth_can.size

    ## Train the model
    t_start = time()
    model.fit(x_train, y_train)
    t_end = time()
    t_train = (t_end - t_start) / (5 * m)
    print('5-fold CV training time: %.3fs / (5*%d) = %.3fs' % ((t_end - t_start), m, t_train))
    print('Best hyperparameters:', model.best_params_)

    ## Predict
    t_start = time()
    y_hat = model.predict(x_test)
    t_end = time()
    t_test = t_end - t_start
    print('Prediction time: %.3fs' % t_test)

    ## Evaluate
    train_acc = metrics.accuracy_score(y_train, model.predict(x_train))
    test_acc = metrics.accuracy_score(y_test, y_hat)
    print('Training accuracy: %.2f%%' % (100 * train_acc))
    print('Test accuracy: %.2f%%' % (100 * test_acc))

    ## Return (training time per fit, prediction time, training error rate, test error rate, name)
    return t_train, t_test, 1 - train_acc, 1 - test_acc, name
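The `hasattr` checks in `benchmark` run in sequence, so the last one that matches decides which grid is searched. The sketch below isolates that dispatch logic in pure Python so it can be run without scikit-learn. The `Fake*` classes, `logspace`, `pick_param_grid` and `grid_size` are hypothetical helpers written for this illustration, and the sketch simplifies one detail: it starts from an empty grid rather than the default `alpha` grid.

```python
from math import prod


# Hypothetical stand-ins: only their attribute names matter here.
class FakeNaiveBayes:
    alpha = 1.0


class FakeLinearSVC:
    C = 1.0


class FakeSVC:
    C = 1.0
    gamma = "scale"


def logspace(start, stop, num):
    """Pure-Python equivalent of np.logspace(start, stop, num) for num >= 2."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


def pick_param_grid(clf):
    """Return the grid chosen by the last matching hasattr check."""
    grid = {}
    if hasattr(clf, "alpha"):
        grid = {"alpha": logspace(-2, 1, 10)}
    if hasattr(clf, "n_neighbors"):
        grid = {"n_neighbors": list(range(1, 15))}
    if hasattr(clf, "C"):
        grid = {"C": logspace(1, 3, 3)}
    if hasattr(clf, "C") and hasattr(clf, "gamma"):
        grid = {"C": logspace(1, 3, 3), "gamma": logspace(-3, 0, 3)}
    if hasattr(clf, "max_depth"):
        grid = {"max_depth": list(range(4, 10))}
    return grid


def grid_size(grid):
    """Number of hyperparameter combinations in the grid (the `m` in benchmark)."""
    return prod(len(values) for values in grid.values())


for clf in (FakeNaiveBayes(), FakeLinearSVC(), FakeSVC()):
    grid = pick_param_grid(clf)
    m = grid_size(grid)
    print(type(clf).__name__, sorted(grid), "candidates:", m, "CV fits:", 5 * m)
```

This is also why the reported training time is divided by `5 * m`: with 5 folds and `m` candidate settings, the search performs `5 * m` cross-validation fits, so the quotient approximates the cost of a single fit.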

## Configure fonts so that Chinese characters and minus signs render correctly in plots
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

## Load the data
print('Loading data...')
t_start = time()
## Strip headers, footers, and quoted replies
remove = ('headers', 'footers', 'quotes')
## Keep only these four categories
categories = ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space')
## Load the training and test sets separately
data_train = fetch_20newsgroups(data_home='./datas/', subset='train', categories=categories, shuffle=True, random_state=0, remove=remove)
data_test = fetch_20newsgroups(data_home='./datas/', subset='test', categories=categories, shuffle=True, random_state=0, remove=remove)
## Done
print("Data loading finished. Elapsed: %.3fs" % (time() - t_start))

Loading data...
Data loading finished. Elapsed: 5.247s

print(data_train.target_names)

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
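Note that the printed order differs from the order in which the categories were requested. The printed list is in alphabetical order, and the integer labels in `target` index into it. The sketch below reproduces that mapping with plain Python, assuming the loader sorts the names alphabetically (which matches the output above); `requested` and `label_to_name` are names introduced here for illustration.

```python
# Categories in the order they were passed to the loader.
requested = ('alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space')

# Assumption: the names end up sorted alphabetically, as the printed output suggests.
target_names = sorted(requested)
print(target_names)

# An integer label i then refers to target_names[i].
label_to_name = dict(enumerate(target_names))
print(label_to_name[3])
```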

### Basic information about the loaded data
def size_mb(docs):
    # Total size of the documents in MB, measured in UTF-8 bytes
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6

categories = data_train.target_names
data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)
print('Data type:', type(data_train.data))
print("%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb))
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print('Names of the %d categories used by the training and test sets:' % len(categories))
print(categories)

Data type: <class 'list'>
2034 documents - 2.428MB (training set)
1353 documents - 1.800MB (test set)
Names of the 4 categories used by the training and test sets:
['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
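The size helper encodes each document to UTF-8 before measuring it, so it counts bytes rather than characters. A small sketch of the difference, using a made-up two-string sample (the `samples` list is an assumption for illustration):

```python
def size_mb(docs):
    """Total size of the documents in MB, measured in UTF-8 bytes."""
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6


# Hypothetical sample: three ASCII characters plus two Chinese characters.
samples = ["abc", "中文"]

n_chars = sum(len(s) for s in samples)                   # characters
n_bytes = sum(len(s.encode('utf-8')) for s in samples)   # UTF-8 bytes
print(n_chars, n_bytes, size_mb(samples))
```

For mostly ASCII text such as these newsgroup posts the two counts are nearly identical, but for non-ASCII text the byte count is the one that reflects storage size.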

### Rename the data
x_train = data_train.data
y_train = data_train.target
x_test = data_test.data
y_test = data_test.target

### Print the first 5 samples
print(' -- First 5 documents -- ')
for i in range(5):
    print('Document %d (category - %s):' % (i + 1, categories[y_train[i]]))
    print(x_train[i])
    print('\n\n')

 -- First 5 documents --
Document 1 (category - alt.atheism):
If one is a vegan (a vegetarian taht eats no animal products at at i.e eggs,
milk, cheese, etc., after about 3 years of a vegan diet, you need to start
taking B12 supplements because b12 is found only in animals.) Acutally our
bodies make B12, I think, but our bodies use up our own B12 after 2 or 3
years.
Lacto-oveo vegetarians, like myself, still get B12 through milk products
and eggs, so we don’t need supplements.
And If anyone knows more, PLEASE post it. I’m nearly contridicting myself
with the mish-mash of knowledge I’ve gleaned.

Document 2 (category - comp.graphics):
Hi,
I have a friend who is working on 2-d and 3-d object recognition. He is looking
for references describing algorithms on the following subject areas:

Thresholding
Edge Segmentation
Marr-Hildreth
Sobel Operator
Chain Codes
Thinning - Skeletonising

If anybody is willing to post an algorithm that they have implemented which demonstrates
any of the above topics, it would be much appreciated.

Please post all replies to my e-mail address. If requested I will post a summary to the
newsgroup in a couple of weeks.

Thanks in advance for all replies

Document 3 (category - comp.graphics):
Hello netters

Sorry, I don’t know if this is the right way of doing this kind of thing,
probably should be a CFV, but since I don’t have tha ability to create a
news group myself, I just want to start the discussion.

I enjoy reading c.g very much, but I often find it difficult to sort out what
I’m interested in. Everything from screen-drivers, graphics cards, graphics
programming and graphics programs are discused here. What I’d like is a
comp.graphics.programmer news group.
What do you other think.

Document 4 (category - comp.graphics):

Yes, I did punch in the wrong numbers (working too many late nites). I
intended on stating 640x400 is 256,000 bytes. It’s not in the bios, just my
VESA TSR.

Document 5 (category - talk.religion.misc):

Well, I am not Andy, but if you had familiarized yourself with some of
the current theories/hypotheses about abiogenesis before posting