Applied Machine Learning (3): Naive Bayes Classifiers

Naive Bayes classifiers are a family of simple probabilistic classifiers obtained by applying Bayes' theorem to features (variables) under a strong independence assumption.

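The naive Bayes assumption:

Given the class variable, each feature is independent of every other feature. This property is called conditional independence. For example, $X$ is conditionally independent of $Y$ given $Z$ if and only if

\begin{align*} P(X = x, Y = y \mid Z = z) = P(X = x \mid Z = z)\, P(Y = y \mid Z = z), \quad \forall\, x, y, z \end{align*}

written for short as $X \perp\!\!\!\perp Y \mid Z$. For instance, once a person's age is known, their height and the grayness of their hair are conditionally independent.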
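The definition can be checked numerically on a small discrete example. The sketch below builds a joint distribution of three binary variables from $P(Z)$, $P(X \mid Z)$ and $P(Y \mid Z)$, so conditional independence holds by construction. All probabilities are invented for illustration and are not taken from any dataset.

```r
# Hypothetical binary variables X, Y, Z; every probability here is made up.
p_z <- c(0.3, 0.7)                    # P(Z = z)
p_x_given_z <- rbind(c(0.9, 0.1),     # row z holds P(X = x | Z = z)
                     c(0.2, 0.8))
p_y_given_z <- rbind(c(0.6, 0.4),     # row z holds P(Y = y | Z = z)
                     c(0.1, 0.9))

# Joint distribution joint[x, y, z] = P(Z) * P(X | Z) * P(Y | Z).
joint <- array(0, dim = c(2, 2, 2))
for (z in 1:2) {
  joint[, , z] <- p_z[z] * outer(p_x_given_z[z, ], p_y_given_z[z, ])
}

# Check P(X, Y | Z = z) = P(X | Z = z) P(Y | Z = z) for every z.
cond_indep <- TRUE
for (z in 1:2) {
  p_xy_z <- joint[, , z] / sum(joint[, , z])
  factorized <- outer(rowSums(p_xy_z), colSums(p_xy_z))
  cond_indep <- cond_indep && isTRUE(all.equal(p_xy_z, factorized))
}

# Check whether X and Y are independent once Z is summed out.
p_xy <- apply(joint, c(1, 2), sum)
marg_indep <- isTRUE(all.equal(p_xy, outer(rowSums(p_xy), colSums(p_xy))))

print(c(conditional = cond_indep, marginal = marg_indep))
```

With these numbers, $X$ and $Y$ are independent given $Z$ but not independent once $Z$ is summed out, which mirrors the height and hair-grayness example: the two are related only through age.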

Probability model

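Suppose there are $K$ classes, denoted $\mathcal{C}_1,\mathcal{C}_2,\dots,\mathcal{C}_K$. An object to be classified is represented by an $n$-dimensional vector $\mathbf{x}=(x_1, x_2,\dots,x_n)$ whose $n$ components are the values of $n$ features (variables), assumed independent. Naive Bayes is essentially a conditional probability model: it computes the probability $\mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x}),\ k=1,2,\dots,K$, that $\mathbf{x}$ belongs to each class. By Bayes' theorem, this conditional probability decomposes as

\begin{align*} \mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x}) = \dfrac{\mathcal{P}(\mathbf{x}\,|\,\mathcal{C}_k)\mathcal{P}(\mathcal{C}_k)} {\mathcal{P}(\mathbf{x})} \end{align*}

In Bayesian statistics, $\mathcal{P}(\mathcal{C}_k\,|\,\mathbf{x})$ is called the posterior probability, $\mathcal{P}(\mathcal{C}_k)$ the prior probability of class $\mathcal{C}_k$, $\mathcal{P}(\mathbf{x}\,|\,\mathcal{C}_k)$ the likelihood, and $\mathcal{P}(\mathbf{x})$ the joint probability of the sample's features. For a given $\mathbf{x}$, the denominator in Bayes' formula is a constant, and the numerator equals the joint probability of $\mathbf{x}$ and $\mathcal{C}_k$. Applying the definition of conditional probability repeatedly gives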

\begin{align*}\mathcal{P}(\mathbf{x},\,\mathcal{C}_k) &= \mathcal{P}(x_1, x_2, \dots, x_n,\,\mathcal{C}_k) \\&= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k)\mathcal{P}(x_2, \dots, x_n,\,\mathcal{C}_k) \\&= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k) \mathcal{P}(x_2\, | \, x_3, \dots, x_n,\,\mathcal{C}_k) \mathcal{P}(x_3, \dots, x_n,\,\mathcal{C}_k) \\ &= ... \\ &= \mathcal{P}(x_1\, | \, x_2, \dots, x_n,\,\mathcal{C}_k)\mathcal{P}(x_2\, | \, x_3, \dots, x_n,\,\mathcal{C}_k)...\mathcal{P}(x_{n-1}\,|\,x_n, \mathcal{C}_k)\mathcal{P}(x_n\,|\,\mathcal{C}_k)\mathcal{P}(\mathcal{C}_k) \end{align*}
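The chain-rule factorization holds for any joint distribution, before any independence assumption is made. As a minimal sketch, the code below checks the case $n = 2$ on a randomly generated joint table; the table sizes and the seed are arbitrary choices made for illustration.

```r
set.seed(1)

# Arbitrary joint distribution joint[x1, x2, k] = P(x1, x2, C_k).
joint <- array(runif(2 * 3 * 2), dim = c(2, 3, 2))
joint <- joint / sum(joint)

p_c    <- apply(joint, 3, sum)         # P(C_k)
p_x2_c <- apply(joint, c(2, 3), sum)   # P(x2, C_k)

# Rebuild the joint as P(x1 | x2, C_k) * P(x2 | C_k) * P(C_k).
rebuilt <- array(0, dim = dim(joint))
for (k in 1:2) {
  for (j in 1:3) {
    p_x1_given_x2_c <- joint[, j, k] / p_x2_c[j, k]
    p_x2_given_c    <- p_x2_c[j, k] / p_c[k]
    rebuilt[, j, k] <- p_x1_given_x2_c * p_x2_given_c * p_c[k]
  }
}

chain_rule_holds <- isTRUE(all.equal(rebuilt, joint))
print(chain_rule_holds)
```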

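The "naive" conditional independence assumption:

Given the class $\mathcal{C}_k$, every feature $x_i$ is conditionally independent of every other feature $x_j$, $j \ne i$. Hence

\begin{align*} \mathcal{P}(x_i \, |\, x_{i+1},\dots, x_{n}, \mathcal{C}_k)=\mathcal{P}(x_i\,|\,\mathcal{C}_k) \end{align*}

and the posterior probability can be written as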

\begin{align*} \mathcal{P}(\mathcal{C}_k\,|\,x_1, x_2, \dots,x_n) &\propto \mathcal{P}(x_1, x_2, \dots,x_n,\,\mathcal{C}_k) \\ &\propto \mathcal{P}(\mathcal{C}_k) \mathcal{P}(x_1\,|\,\mathcal{C}_k) \mathcal{P}(x_2\,|\,\mathcal{C}_k)\dots\mathcal{P}(x_n\,|\,\mathcal{C}_k) \\ &\propto \mathcal{P}(\mathcal{C}_k)\prod_{i=1}^n\mathcal{P}(x_i\,|\,\mathcal{C}_k) \end{align*}
Therefore

\begin{align*} \mathcal{P}(\mathcal{C}_k\,|\,x_1, x_2, \dots,x_n)= \dfrac{1}{Z} \mathcal{P}(\mathcal{C}_k)\prod_{i=1}^n\mathcal{P}(x_i\,|\,\mathcal{C}_k) \end{align*}

where $Z=\mathcal{P}(\mathbf{x})$ depends only on $x_1, x_2, \dots, x_n$; that is, it is a constant once the feature values are known.
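The posterior formula translates almost directly into code. The following is a minimal sketch for binary features, not a reference implementation: the helper name `naive_posterior`, the two classes, and all probabilities are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical model with two classes and three binary features.
prior <- c(C1 = 0.6, C2 = 0.4)          # P(C_k)
# p_one[k, i] = P(x_i = 1 | C_k); invented values.
p_one <- rbind(C1 = c(0.8, 0.3, 0.5),
               C2 = c(0.1, 0.7, 0.5))

naive_posterior <- function(x, prior, p_one) {
  # P(x_i | C_k) for the observed 0/1 values, one row per class.
  lik <- t(apply(p_one, 1, function(p) ifelse(x == 1, p, 1 - p)))
  # Unnormalized posterior: P(C_k) * prod_i P(x_i | C_k).
  unnorm <- prior * apply(lik, 1, prod)
  # Dividing by Z = P(x), the sum over classes, gives the posterior.
  unnorm / sum(unnorm)
}

post <- naive_posterior(c(1, 0, 1), prior, p_one)
print(post)
```

For the input `c(1, 0, 1)` the unnormalized values are 0.6 × 0.8 × 0.7 × 0.5 = 0.168 for C1 and 0.4 × 0.1 × 0.3 × 0.5 = 0.006 for C2, so $Z = 0.174$.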

Bayes classifier

Under the maximum a posteriori (MAP) decision rule, the vector $\mathbf{x}$ to be classified is assigned to the class with the largest posterior probability. The resulting classifier is called a Bayes classifier, and the predicted class label of $\mathbf{x}$ is

\begin{align*} \hat{y}=\mathop{\arg\max}_{k\in \{1, 2, \dots, K\}} \mathcal{P} (\mathcal{C}_k)\prod_{i=1}^n\mathcal{P} (x_i\,|\,\mathcal{C}_k) \end{align*}
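A product of many small probabilities can underflow in floating point, so implementations commonly maximize the sum of logarithms instead. Because the logarithm is monotone, the arg max is unchanged. The sketch below uses that log-space variant, which is a substitution on my part rather than something stated in the text above; the function name `map_class` and the model parameters are hypothetical.

```r
# Hypothetical model with two classes and three binary features.
prior <- c(C1 = 0.6, C2 = 0.4)          # P(C_k)
# p_one[k, i] = P(x_i = 1 | C_k); invented values.
p_one <- rbind(C1 = c(0.8, 0.3, 0.5),
               C2 = c(0.1, 0.7, 0.5))

map_class <- function(x, prior, p_one) {
  # Sum of log P(x_i | C_k) over the features, one value per class.
  log_lik <- sapply(seq_len(nrow(p_one)), function(k) {
    sum(log(ifelse(x == 1, p_one[k, ], 1 - p_one[k, ])))
  })
  # log P(C_k) + sum_i log P(x_i | C_k); same arg max as the product form.
  scores <- log(prior) + log_lik
  names(prior)[which.max(scores)]
}

label_a <- map_class(c(1, 0, 1), prior, p_one)
label_b <- map_class(c(0, 1, 1), prior, p_one)
print(c(label_a, label_b))
```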

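Parameter estimation and event models

When no prior knowledge about the classes is available, the class priors are often taken to be uniform, i.e. $\mathcal{P}(\mathcal{C}_k) = 1/K,\ k=1, 2, \dots, K$. They can also be estimated from the training set by relative frequency: the prior of a class = (number of samples in that class) / (total number of samples). To estimate the parameters of the feature distributions $\mathcal{P}(x\,|\,\mathcal{C})$, one must either assume a type of distribution for the features or build a nonparametric model of them from the training set. The assumed feature distribution is called the event model of the naive Bayes classifier. Continuous features are usually assumed to be normally distributed; discrete features are commonly assumed to follow a multinomial or Bernoulli distribution.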
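Both ways of choosing the prior, and the Gaussian event model for a continuous feature, can be sketched in a few lines. The example uses R's built-in `iris` data purely as a stand-in, since it needs no extra packages; it is not a dataset discussed in this article, and the query value 5.8 is arbitrary.

```r
data(iris)  # built-in dataset, used here only as a stand-in

classes <- levels(iris$Species)
K <- length(classes)

# Uniform prior: P(C_k) = 1 / K.
prior_uniform <- setNames(rep(1 / K, K), classes)

# Frequency prior: class count divided by total sample count.
prior_freq <- table(iris$Species) / nrow(iris)

# Gaussian event model for one continuous feature:
# estimate a mean and a standard deviation per class.
mu    <- tapply(iris$Sepal.Length, iris$Species, mean)
sigma <- tapply(iris$Sepal.Length, iris$Species, sd)

# Class-conditional density of an arbitrary value under each fitted normal.
dens <- dnorm(5.8, mean = mu, sd = sigma)

print(prior_freq)
print(rbind(mu, sigma))
```

Note that `sd` uses the $n-1$ denominator; whether to use that or the maximum-likelihood estimate with denominator $n$ is a modeling choice the text above leaves open.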

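A gender classification example

Problem: predict a person's sex from three features: height, weight, and foot size.

Training

The example training set is shown in the table below:

Assume every feature is normally distributed. For each sex, use that class's samples to estimate the mean and variance of each of the three features, which determines the normal distributions.

Assume the class priors are $\mathcal{P}(male) = \mathcal{P}(female)=0.5$. The priors could instead be taken from historical knowledge of the population or from the sex frequencies in the training set.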

Testing

Given a sample to classify:

Compute the posterior probability that the sample belongs to each of the two classes:

\begin{align*} \mathcal{P}(male\,|\,sample)= \dfrac{\mathcal{P}(male) \mathcal{P}(height\,|\,male)\mathcal{P}(weight\,|\,male)\mathcal{P}(foot\,size\,|\,male)} {\mathcal{P}(sample)} \end{align*}

where the denominator is

\begin{align*} \mathcal{P}(sample) &= \mathcal{P}(male) \mathcal{P}(height\,|\,male)\mathcal{P}(weight\,|\,male)\mathcal{P}(foot\,size\,|\,male) \\ &+ \mathcal{P}(female) \mathcal{P}(height\,|\,female)\mathcal{P}(weight\,|\,female)\mathcal{P}(foot\,size\,|\,female) \end{align*}

Note that, once the sample is given, the denominator is a constant; it does not affect the classification and can be ignored. The terms in the numerator are computed as follows:

\begin{align*} \mathcal{P}(male)=0.5 \end{align*}

\begin{align*} \mathcal{P}(height\,|\,male)=\dfrac{1}{\sqrt{2\pi}\hat{\sigma}}e^{\dfrac{-(6-\hat{\mu})^2}{2\hat{\sigma}^2}}\approx 1.5789 \end{align*}

Here $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of height estimated from the male training samples, and 6 is the height of the sample. This is the value of a probability density, not a probability, so it may exceed 1.

\begin{align*} \mathcal{P}(weight\,|\,male)\approx 5.9881 \times 10^{-6} \end{align*}

\begin{align*} \mathcal{P}(foot\,size\,|\,male)\approx 1.3112 \times 10^{-3} \end{align*}
Therefore

\begin{align*} \mathcal{P}(male\,|\,sample)\propto 6.1984 \times 10^{-9} \end{align*}

Similarly,

\begin{align*} \mathcal{P}(female\,|\,sample)\propto 5.3778 \times 10^{-4} \end{align*}

The posterior of the female class is clearly larger than that of the male class, so the sample is predicted to be female.
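The arithmetic of this example can be reproduced from the values quoted above. The sketch below uses only those published numbers: the prior, the three male likelihoods, and the female numerator. It then divides by their sum to obtain normalized posteriors, a step the text skips because it does not change the decision. The variable names are my own.

```r
# Values quoted in the worked example above.
prior_male <- 0.5
lik_male <- c(height = 1.5789, weight = 5.9881e-6, footsize = 1.3112e-3)

# Numerator for the male class: prior times the product of the likelihoods.
num_male <- prior_male * prod(lik_male)

# Numerator for the female class, as quoted in the text.
num_female <- 5.3778e-4

# Normalizing by P(sample) = num_male + num_female gives the posteriors.
post <- c(male = num_male, female = num_female) / (num_male + num_female)
predicted <- names(which.max(post))

print(signif(num_male, 5))
print(post)
print(predicted)
```

The normalized posterior for the female class is very close to 1, so the conclusion does not depend on rounding in the quoted likelihoods.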

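Data experiment

We train a naive Bayes classifier on the machine-learning benchmark dataset HouseVotes84 and test its classification performance on the same dataset. HouseVotes84 consists of the votes of the 435 members of the U.S. House of Representatives on 16 bills in 1984. Each representative voted in favor (y), against (n), or remained neutral on each of the 16 bills. The dataset ships with the R package mlbench as a data frame with 435 observations (rows) and 17 variables (columns). The 17 variables are: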

Class Name: 2 (democrat, republican)
handicapped-infants: 2 (y,n)
water-project-cost-sharing: 2 (y,n)
adoption-of-the-budget-resolution: 2 (y,n)
physician-fee-freeze: 2 (y,n)
el-salvador-aid: 2 (y,n)
religious-groups-in-schools: 2 (y,n)
anti-satellite-test-ban: 2 (y,n)
aid-to-nicaraguan-contras: 2 (y,n)
mx-missile: 2 (y,n)
immigration: 2 (y,n)
synfuels-corporation-cutback: 2 (y,n)
education-spending: 2 (y,n)
superfund-right-to-sue: 2 (y,n)
crime: 2 (y,n)
duty-free-exports: 2 (y,n)
export-administration-act-south-africa: 2 (y,n)

Load the HouseVotes84 dataset in R and display its first 6 rows:

library(mlbench)
data(HouseVotes84)
head(HouseVotes84)

An NA in the dataset represents a neutral vote; NAs are ignored when the classifier is trained.

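Now, with HouseVotes84 as the training set, we build a naive Bayes classifier using the naiveBayes function from the e1071 package, predict the classes of the first 10 rows of the dataset, and compare them with the true classes.

library(e1071)
data(HouseVotes84, package = "mlbench")
# Fit the classifier: Class is the response, the 16 votes are the predictors.
model <- naiveBayes(Class ~ ., data = HouseVotes84)
# Predict the class of the first 10 rows.
pred <- predict(model, HouseVotes84[1:10,])
# Cross-tabulate predicted classes against the true classes.
table(pred, HouseVotes84$Class[1:10])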
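Ten rows give only a rough impression. As a sketch that assumes the e1071 and mlbench packages are installed, the same comparison can be run on all 435 rows, and the fitted model can be inspected to see the quantities from the earlier sections: the class distribution behind the prior and the per-class vote probabilities. The column name `V1` assumes the naming used by the mlbench version of the data. Accuracy measured on the training set tends to be optimistic compared with accuracy on unseen data.

```r
# Assumes the e1071 and mlbench packages are installed.
library(e1071)
data(HouseVotes84, package = "mlbench")

model <- naiveBayes(Class ~ ., data = HouseVotes84)

# Class distribution used for the prior.
print(model$apriori)
# Conditional table P(vote | class) for the first bill (column V1).
print(model$tables$V1)

# Confusion matrix and accuracy over all training rows.
pred_all <- predict(model, HouseVotes84)
conf_mat <- table(predicted = pred_all, actual = HouseVotes84$Class)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```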