Scikit-learn: Machine Learning in Python, Bayesian Learning

Chapter 2: Naive Bayes

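Naive Bayes is a simple yet powerful classifier: a probabilistic model based on Bayes' theorem. In essence, it uses the probability of each feature value to determine the probability that an instance belongs to a given class, under the assumption that the features are independent of one another given the class. One very successful application of naive Bayes is natural language processing (NLP). NLP problems are important and come with large amounts of labeled data (usually text documents), which serve as the training set for the algorithm.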
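To make the independence assumption concrete, here is a minimal hand-rolled sketch of the naive Bayes calculation. The class names, priors and per-word probabilities are made-up numbers for illustration only; they do not come from the newsgroups data.

```python
import math

# Hypothetical class priors and per-word likelihoods P(word | class).
priors = {"sport": 0.5, "science": 0.5}
likelihoods = {
    "sport": {"goal": 0.20, "team": 0.15, "orbit": 0.01},
    "science": {"goal": 0.02, "team": 0.03, "orbit": 0.20},
}

def posterior(words):
    """Return P(class | words), treating the words as independent given the class."""
    log_scores = {}
    for label, prior in priors.items():
        # Independence lets us add one log-likelihood per word.
        score = math.log(prior)
        for word in words:
            score += math.log(likelihoods[label][word])
        log_scores[label] = score
    # Normalise so that the class probabilities sum to 1.
    top = max(log_scores.values())
    unnormalised = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}

result = posterior(["goal", "team"])
print(result)
```

Working in log space avoids numeric underflow when a document contains many words, which is also why scikit-learn stores log probabilities internally.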

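This section uses naive Bayes for text classification. The dataset is a collection of text documents labeled with their categories; we train a naive Bayes classifier on it to predict the category of a new, unseen document. The dataset shipped with scikit-learn contains about 19,000 newsgroup messages on 20 different topics, ranging from politics and religion to sports and science.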

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset='all')  # download the data and bind it to a name

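Note that the data is stored as a list of text contents, not as a matrix. Also, the book uses Python 2 while I am using Python 3, so the code here differs slightly from the book.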

print(type(news.data), type(news.target), type(news.target_names))

Downloading dataset from http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz (14 MB)

print (news.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

print(len(news.data))
print(len(news.target))

18846

print(news.data[0])

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>

Subject: Pens fans reactions

Organization: Post Office, Carnegie Mellon, Pittsburgh, PA

Lines: 12

NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack

of any kind of posts about the recent Pens massacre of the Devils. Actually,

I am bit puzzled too and a bit relieved. However, I am going to put an end

to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they

are killing those Devils worse than I thought. Jagr just showed you why

he is much better than his regular season stats. He is also a lot

fo fun to watch in the playoffs. Bowman should let JAgr have a lot of

fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final

regular season game. PENS RULE!!!

print(news.target[0], news.target_names[news.target[0]])  # target holds the index into target_names

10 rec.sport.hockey  # indices start at 0

Preprocessing the data:

The machine learning algorithms in this book only work on numeric data, so the text has to be converted into numeric data.

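So far there is only one feature, the text content, so we need functions that turn the text into a meaningful set of numeric features. Intuitively, we look at which words (more precisely, tokens, which include numbers and punctuation) occur in each text category and then try to describe each category by the frequency distribution of those words. sklearn.feature_extraction.text provides utilities for building numeric feature vectors from text documents.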

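Before transforming the data, split it into a training set and a test set: the first 75% of the instances form the training set and the remaining 25% the test set. The instances are already in random order, because fetch_20newsgroups shuffles the data by default.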

SPLIT_PERC = 0.75  # fraction of the instances used for training
split_size = int(len(news.data) * SPLIT_PERC)
x_train = news.data[:split_size]
x_test = news.data[split_size:]
y_train = news.target[:split_size]
y_test = news.target[split_size:]
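The slicing above relies on the data already being shuffled. As an alternative, here is a minimal sketch using train_test_split from sklearn.model_selection, which shuffles and splits in one call. The toy documents and variable names are made up for illustration; this is not the split used for the results in this post.

```python
from sklearn.model_selection import train_test_split

# Eight dummy documents with alternating labels, standing in for news.data / news.target.
docs = ["document number {}".format(i) for i in range(8)]
labels = [i % 2 for i in range(8)]

# test_size=0.25 keeps the same 75% / 25% proportion as the slicing approach.
docs_train, docs_test, labels_train, labels_test = train_test_split(
    docs, labels, test_size=0.25, random_state=0
)
print(len(docs_train), len(docs_test))
```

Fixing random_state makes the split reproducible between runs.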

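There are three ways to turn text into numeric features: CountVectorizer, HashingVectorizer, and TfidfVectorizer. They differ in how the numeric features are computed.

CountVectorizer builds a dictionary of words from the text corpus, then converts each instance into a numeric feature vector in which each element is the number of times a particular word occurs in the document.

HashingVectorizer applies a hashing function that maps tokens to feature indices, then computes counts as CountVectorizer does.

TfidfVectorizer works like CountVectorizer but uses a more advanced calculation called Term Frequency Inverse Document Frequency (TF-IDF), a statistic that measures how important a word is in a document or corpus. It looks for words that are frequent in the current document compared with their frequency in the whole corpus, which normalizes the result and keeps words that are frequent everywhere from dominating.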
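The differences between the three vectorizers described above are easier to see on a tiny corpus. The sketch below uses two made-up sentences and an arbitrarily small n_features=16 for the hashing case; both are assumptions chosen for readability, not settings used elsewhere in this post.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

corpus = ["the pens beat the devils", "the devils lost"]

# CountVectorizer learns a vocabulary and counts occurrences per document.
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus)
print(sorted(count_vect.vocabulary_))
print(counts.toarray())

# HashingVectorizer keeps no vocabulary: tokens are hashed straight to column indices.
hashed = HashingVectorizer(n_features=16).transform(corpus)
print(hashed.shape)

# TfidfVectorizer down-weights words that occur in many documents (here: "the").
tfidf_vect = TfidfVectorizer()
tfidf_dense = tfidf_vect.fit_transform(corpus).toarray()
print(np.round(tfidf_dense, 3))
```

In the second document, "lost" and "the" both occur once, but "lost" gets the larger TF-IDF weight because it appears in only one of the two documents.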

Training the naive Bayes classifier:

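The classifier is built from a feature vectorizer and the actual Bayes classifier, MultinomialNB from the sklearn.naive_bayes module. Pipeline from the sklearn.pipeline module chains the vectorizer and the classifier together. Here we combine MultinomialNB with each of the three text vectorizers mentioned above to build three classifiers, then compare which one performs better with default parameters.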

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer, CountVectorizer

# MultinomialNB needs non-negative features. Older scikit-learn releases spelled this
# HashingVectorizer(non_negative=True); current releases use alternate_sign=False.
clf_1 = Pipeline([('vect', CountVectorizer()), ('clf', MultinomialNB())])
clf_2 = Pipeline([('vect', HashingVectorizer(alternate_sign=False)), ('clf', MultinomialNB())])
clf_3 = Pipeline([('vect', TfidfVectorizer()), ('clf', MultinomialNB())])
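As a quick illustration of what such a pipeline does end to end, here is a minimal sketch trained on four made-up sentences. The toy documents, the two label names and the variable names are assumptions for illustration; the real experiments in this post use the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_docs = [
    "the pens won the hockey game",
    "hockey playoffs start tonight",
    "the shuttle reached orbit",
    "nasa launched a new satellite into orbit",
]
toy_labels = ["hockey", "hockey", "space", "space"]

# fit() runs the vectorizer's fit_transform, then trains the classifier on its output.
toy_clf = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])
toy_clf.fit(toy_docs, toy_labels)

# predict() accepts raw strings; the pipeline vectorizes them with the fitted vocabulary.
predictions = toy_clf.predict(["hockey game tonight", "satellite in orbit"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, cross validation refits the vocabulary on each training fold, so no information leaks from the held-out fold.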

Define a function that takes a classifier and runs cross validation on the given x and y values:

# Older scikit-learn releases exposed these names in sklearn.cross_validation.
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
from scipy.stats import sem

def evaluate_cross_validation(clf, x, y, K):
    # create a K-fold cross validation iterator
    cv = KFold(n_splits=K, shuffle=True, random_state=0)
    # by default the score used is the one returned by the estimator's score method (accuracy)
    scores = cross_val_score(clf, x, y, cv=cv)
    print(scores)
    print("Mean score:{0:.3f} (+/-{1:.3f})".format(np.mean(scores), sem(scores)))
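To make the role of KFold concrete, here is a minimal sketch on ten dummy samples. The data and variable names are made up for illustration, and the sketch assumes the sklearn.model_selection API of current scikit-learn releases. Each sample should land in exactly one test fold.

```python
from sklearn.model_selection import KFold

samples = list(range(10))  # ten dummy sample indices, not the newsgroups data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

test_folds = []
for train_idx, test_idx in cv.split(samples):
    # Each iteration holds out a different 20% of the samples for testing.
    test_folds.append(sorted(test_idx.tolist()))
    print("train:", sorted(train_idx.tolist()), "test:", test_folds[-1])
```

cross_val_score performs the same loop internally, fitting a fresh copy of the classifier on each training part and scoring it on the held-out part.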

Then run 5-fold cross validation on each classifier:

clfs = [clf_1, clf_2, clf_3]
for clf in clfs:
    evaluate_cross_validation(clf, news.data, news.target, 5)

The results:

[ 0.85782493 0.85725657 0.84664367 0.85911382 0.8458477 ]

Mean score:0.853 (+/-0.003)

[ 0.75543767 0.77659857 0.77049615 0.78508888 0.76200584]

Mean score:0.770 (+/-0.005)

[ 0.84482759 0.85990979 0.84558238 0.85990979 0.84213319]

Mean score:0.850 (+/-0.004)
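The "(+/-...)" figure printed after each mean is the standard error of the mean returned by scipy.stats.sem. As a small check, the sketch below recomputes it by hand from the five CountVectorizer scores listed first in the output above; the variable names are my own.

```python
import numpy as np
from scipy.stats import sem

# The five fold scores reported for the CountVectorizer pipeline.
scores = np.array([0.85782493, 0.85725657, 0.84664367, 0.85911382, 0.8458477])

mean = scores.mean()
# Standard error = sample standard deviation (ddof=1) / sqrt(number of folds).
manual_sem = scores.std(ddof=1) / np.sqrt(len(scores))

print("Mean score:{0:.3f} (+/-{1:.3f})".format(mean, sem(scores)))
print("Manual standard error: {0:.6f}".format(manual_sem))
```

A small standard error relative to the gap between two means suggests the difference between the classifiers is not just fold-to-fold noise.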

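As you can see, CountVectorizer and TfidfVectorizer give better results than HashingVectorizer. We continue with TfidfVectorizer and try to improve the result by tokenizing the documents with a different regular expression.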

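The default regular expression, ur"\b\w\w+\b", considers alphanumeric characters and the underscore. Also considering the hyphen and the dot might improve tokenization, so that tokens such as Wi-Fi and site.com are kept whole.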

The new regular expression, ur"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b":

clf_4 = Pipeline([('vect', TfidfVectorizer(token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB())])  # Python 3 does not support the ur prefix

evaluate_cross_validation(clf_4,news.data,news.target,5)

The results:

[ 0.86100796 0.8718493 0.86203237 0.87291059 0.8588485 ]

Mean score:0.865 (+/-0.003)

So the result improves from 0.850 to 0.865.
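To see what the two token patterns actually match, here is a minimal sketch that applies both with Python's re module to one made-up, already lowercased sentence. It only imitates the tokenization step; the sample sentence and variable names are assumptions for illustration.

```python
import re

default_pattern = r"\b\w\w+\b"
new_pattern = r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b"

# A made-up sentence, lowercased the way the vectorizers lowercase by default.
text = "my wi-fi router at site.com runs v2.0"

default_tokens = re.findall(default_pattern, text)
new_tokens = re.findall(new_pattern, text)

print(default_tokens)
print(new_tokens)
```

The new pattern keeps wi-fi and site.com as single tokens. It also requires at least three characters with a letter that is not in first position, so short words such as "my" and "at" are dropped.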

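There is another parameter, stop_words, which lets us pass a list of words to leave out of the computation, such as words that are too frequent, or words that we do not expect a priori to carry information about a particular topic.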

Define a function that loads the stop words:

def get_stop_words():
    result = set()
    # read the file line by line and close it when done
    with open('stopwords_en.txt', 'r') as f:
        for line in f:
            result.add(line.strip())
    return result

Then build a new classifier:

# stop_words is documented as taking a list, so the set is converted with sorted()
clf_5 = Pipeline([('vect', TfidfVectorizer(stop_words=sorted(get_stop_words()), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB())])
evaluate_cross_validation(clf_5, news.data, news.target, 5)

The results:

[ 0.88222812 0.89625895 0.88591138 0.89599363 0.88485009]

Mean score:0.889 (+/-0.003)

The result improves from 0.865 to 0.889.
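The effect of stop_words on the vocabulary can be seen on a tiny example. The sketch below uses two made-up sentences and a hypothetical four-word stop list instead of the stopwords_en.txt file, with CountVectorizer standing in for TfidfVectorizer since both build their vocabulary the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat is on the mat", "the dog is in the garden"]
hypothetical_stop_words = ["the", "is", "on", "in"]  # illustration only

# Vocabulary learned without and with the stop list.
plain_vocab = sorted(CountVectorizer().fit(docs).vocabulary_)
filtered_vocab = sorted(
    CountVectorizer(stop_words=hypothetical_stop_words).fit(docs).vocabulary_
)

print(plain_vocab)
print(filtered_vocab)
```

Only the content words remain in the filtered vocabulary, so the classifier no longer spends probability mass on words that appear in every category.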

Now look at the parameters of MultinomialNB. The most important one is alpha, the smoothing parameter. Its default value is 1.0; let us set it to 0.1:

clf_6 = Pipeline([('vect', TfidfVectorizer(stop_words=sorted(get_stop_words()), token_pattern=r"\b[a-z0-9_\-\.]+[a-z][a-z0-9_\-\.]+\b")), ('clf', MultinomialNB(alpha=0.1))])
evaluate_cross_validation(clf_6, news.data, news.target, 5)

The results:

[ 0.91405836 0.91589281 0.91085168 0.91721942 0.91509684]

Mean score:0.915 (+/-0.001)

The result improves from 0.889 to 0.915. A next step would be to test how different alpha values affect the result and pick the best one.
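To get a feel for what alpha does, here is a minimal sketch on toy word counts. It recomputes the smoothed per-class word probabilities by hand as (count + alpha) / (class total + alpha * number of features) and stores them next to the values MultinomialNB learns. The count matrix and variable names are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word counts: one document per class, three hypothetical words.
# Class 0 never uses the third word; class 1 never uses the first.
X = np.array([[3, 1, 0],
              [0, 2, 2]])
y = np.array([0, 1])

probs_by_alpha = {}
for alpha in (1.0, 0.1, 0.01):
    clf = MultinomialNB(alpha=alpha).fit(X, y)
    # Additive smoothing computed by hand for comparison.
    manual = (X + alpha) / (X.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    fitted = np.exp(clf.feature_log_prob_)
    probs_by_alpha[alpha] = (manual, fitted)
    print(alpha, np.round(fitted[0], 4))
```

A smaller alpha gives unseen words a much lower probability, so the model trusts the observed counts more. Whether that helps depends on the data, which is why alpha is worth tuning with cross validation.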

Model evaluation:

Define a function that trains the model on the whole training set and evaluates its accuracy on both the training set and the test set.

from sklearn import metrics

def train_and_evaluate(clf, x_train, x_test, y_train, y_test):
    clf.fit(x_train, y_train)
    print("Accuracy on training set:")
    print(clf.score(x_train, y_train))
    print("Accuracy on testing set:")
    print(clf.score(x_test, y_test))
    # predict on the test set; comparing y_test with itself would always look perfect
    y_pred = clf.predict(x_test)
    print("Classification Report:")
    print(metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(y_test, y_pred))

train_and_evaluate(clf_6,x_train,x_test,y_train,y_test)

The results:

Accuracy on training set:

0.98776001132

Accuracy on testing set:

0.909592529711

As shown above, the result is decent: accuracy on the test set also reaches about 0.91.
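For readers unfamiliar with the report and the confusion matrix printed by train_and_evaluate, here is a minimal sketch on six hypothetical labels. The label values are made up for illustration and are not output from the newsgroups model.

```python
from sklearn import metrics

# Hypothetical true and predicted labels for six documents in three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

accuracy = metrics.accuracy_score(y_true, y_hat)
matrix = metrics.confusion_matrix(y_true, y_hat)

print(accuracy)
print(matrix)  # rows are true classes, columns are predicted classes
print(metrics.classification_report(y_true, y_hat))
```

The single off-diagonal entry in the matrix is the class 1 document that was predicted as class 2. On the newsgroups data, such entries show which topics the classifier tends to confuse.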
