Hands-On Machine Learning (3): Classification

Chapter 1 noted that the most common supervised learning tasks are regression and classification.

MNIST

MNIST is a staple dataset in machine learning, something like its "Hello World!", and it can be fetched directly through Scikit-Learn. A bit of setup comes first:

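# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"

# Assumed helper: save_fig() is called later in this post but never defined,
# so this version (saving under ./images/classification/) is a reconstruction.
def save_fig(fig_id, tight_layout=True):
    folder = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
    if not os.path.isdir(folder):
        os.makedirs(folder)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(os.path.join(folder, fig_id + ".png"), format="png", dpi=300)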

Define a function that sorts the training portion and the test portion of the dataset by target:

def sort_by_target(mnist):
    # indices that sort the first 60,000 instances (training part) by label
    reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
    # indices that sort the remaining instances (test part) by label
    reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
    mnist.data[:60000] = mnist.data[reorder_train]
    mnist.target[:60000] = mnist.target[reorder_train]
    mnist.data[60000:] = mnist.data[reorder_test + 60000]
    mnist.target[60000:] = mnist.target[reorder_test + 60000]

Fetch the data:

try:
    from sklearn.datasets import fetch_openml
    # On scikit-learn >= 0.24, add as_frame=False so that data and target are NumPy arrays
    mnist = fetch_openml('mnist_784', version=1, cache=True)
    mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
    sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
except ImportError:
    from sklearn.datasets import fetch_mldata
    mnist = fetch_mldata('MNIST original')
mnist["data"], mnist["target"]

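Datasets loaded by Scikit-Learn usually have a dictionary-like structure:

DESCR: a description of the dataset
data: an array with one row per instance and one column per feature
target: an array containing the labels

Assign them: X, y = mnist["data"], mnist["target"]

Look at one of the digits: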

some_digit = X[36000]
some_digit_image = some_digit.reshape(28,28)
plt.imshow(some_digit_image,cmap = mpl.cm.binary,interpolation="nearest")
plt.axis("off")
plt.show()

Check its label: y[36000]
Split into training and test sets: X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Shuffle the training set:

import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Training a Binary Classifier

Create the target vectors (True for every 5, False for every other digit):

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

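Pick a classifier and train it. Here we use Scikit-Learn's SGDClassifier (stochastic gradient descent classifier). SGD handles training instances independently, one at a time, which also makes it well suited to online learning. Create it and train it:

from sklearn.linear_model import SGDClassifier
# max_iter=5: at most 5 passes over the data; tol=-inf: never stop early;
# random_state is set so the results are reproducible.
# Note: recent scikit-learn releases may reject a negative tol.
sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)
sgd_clf.fit(X_train, y_train_5)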

Make a prediction: sgd_clf.predict([some_digit]). It guesses right. Now let's look at its performance.

Performance Measures

Measuring Accuracy Using Cross-Validation

Implementing cross-validation yourself is covered in the book and not shown here.
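The book's hand-written loop is not reproduced in this post. As a rough sketch of what such a loop could look like, assuming StratifiedKFold and clone from Scikit-Learn and a small synthetic dataset from make_classification standing in for MNIST (the names manual_cross_val_accuracy, X_demo, y_demo and demo_scores are invented for this illustration):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold


def manual_cross_val_accuracy(clf, X, y, n_splits=3):
    """Return the accuracy measured on each held-out fold."""
    skfolds = StratifiedKFold(n_splits=n_splits)
    scores = []
    for train_index, test_index in skfolds.split(X, y):
        fold_clf = clone(clf)  # fresh, untrained copy for every fold
        fold_clf.fit(X[train_index], y[train_index])
        y_pred = fold_clf.predict(X[test_index])
        scores.append(float(np.mean(y_pred == y[test_index])))
    return scores


# Synthetic stand-in data (an assumption for this sketch, not MNIST)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
demo_scores = manual_cross_val_accuracy(SGDClassifier(random_state=42), X_demo, y_demo)
print(demo_scores)
```

Each fold trains a fresh clone on the remaining folds and scores it on the held-out fold, which is roughly the work that cross_val_score() automates.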


from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")  # 3 folds

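Accuracy is above 95%. Is the model really that good? First look at a classifier that labels every image as "not 5":

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        # always predict False, i.e. "not 5"
        return np.zeros((len(X), 1), dtype=bool)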

Check its accuracy:

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

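The result is above 90%, simply because only about 10% of the images are 5s. This shows that accuracy is a poor primary performance measure for classifiers, especially on skewed datasets (where some classes are much more frequent than others).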
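The same effect can be reproduced without any model at all. The sketch below uses a made-up label vector with 10% positives (y_true and y_pred are hypothetical names, not the MNIST arrays) and a predictor that always answers "negative":

```python
import numpy as np

# Hypothetical labels: 1,000 instances, 10% of them positive
y_true = np.zeros(1000, dtype=bool)
y_true[:100] = True

# A "classifier" that always answers negative
y_pred = np.zeros(1000, dtype=bool)

accuracy = float(np.mean(y_pred == y_true))
positives_found = int(np.sum(y_pred & y_true))
print(accuracy, positives_found)
```

The accuracy is 90% even though not a single positive instance is found, which is why the sections that follow look at other metrics.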

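Confusion Matrix

The idea is to count how often instances of class A are classified as class B: that count sits in row A, column B of the matrix (and B classified as A sits in row B, column A). To build the matrix you need a set of predictions to compare with the actual targets. Instead of touching the test set, use cross_val_predict(), which performs K-fold cross-validation and returns the predictions made on each test fold: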


from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Get the confusion matrix with confusion_matrix(). Just pass it the target classes and the predicted classes.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)

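In the result, each row represents an actual class and each column a predicted class:

true negatives (TN)	false positives (FP)
false negatives (FN)	true positives (TP)

Precision, the accuracy of the positive predictions: precision = TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
Recall, also called sensitivity or true positive rate (TPR): recall = TP / (TP + FN), where FN is the number of false negatives.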
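To see how the two formulas read off a confusion matrix, here is a small sketch with invented counts (conf is a hypothetical 2x2 matrix laid out like Scikit-Learn's output, rows = actual, columns = predicted):

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
conf = np.array([[90, 10],   # TN, FP
                 [ 5, 20]])  # FN, TP
tn, fp = conf[0]
fn, tp = conf[1]

precision = tp / (tp + fp)  # 20 of the 30 positive predictions are correct
recall = tp / (tp + fn)     # 20 of the 25 actual positives are found
print(precision, recall)
```

With these counts precision is 20/30 and recall is 20/25: the two metrics answer different questions about the same matrix.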

Precision and Recall

Scikit-Learn computes both:

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
recall_score(y_train_5, y_train_pred)

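The F1 score is the harmonic mean of precision and recall, so it is high only when both are high:

F1 = 2 × precision × recall / (precision + recall) = TP / (TP + (FN + FP) / 2)

Call it: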

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
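A quick way to see why the harmonic mean is used: it is dragged down by whichever of the two values is low. The helper f1_from and the numbers below are made up for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.8, 0.8)      # both metrics are decent
lopsided = f1_from(1.0, 0.1)      # perfect precision, poor recall
arithmetic_mean = (1.0 + 0.1) / 2
print(balanced, lopsided, arithmetic_mean)
```

The lopsided case has an arithmetic mean of 0.55 but an F1 of only about 0.18, so a classifier cannot score well on F1 by excelling at just one of the two metrics.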

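Precision/Recall Tradeoff

There are two regimes:

High precision / low recall: nearly everything predicted positive is something I want, but some of what I want inevitably gets predicted as negative.
High recall / low precision: nearly everything I want gets found, but a good share of the positive predictions are things I don't want.

The balance between the two is set by a decision threshold. decision_function() returns the score the classifier assigns to an instance: y_scores = sgd_clf.decision_function([some_digit])
SGDClassifier uses a threshold of 0. Based on y_scores, you can raise the threshold yourself: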

threshold = 200000
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

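With a threshold of 200000 the classifier misses this 5: raising the threshold lowers recall. To decide which threshold to use, first get the scores of all training instances with cross_val_predict():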

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,method="decision_function")

With the scores in hand, precision_recall_curve() computes precision and recall for all possible thresholds.

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot them:

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    # precisions and recalls have one more element than thresholds, hence [:-1]
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])

plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
plt.show()

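The precision curve is bumpier than the recall curve because precision can sometimes go down when the threshold is raised, whereas recall can only go down. Another way to pick a good tradeoff is to plot precision directly against recall:

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))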
plot_precision_vs_recall(precisions, recalls)
plt.show()

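Try to pick the tradeoff just before the precision starts to drop sharply. If you need 90% precision, the first plot suggests a threshold of roughly 70000: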

y_train_pred_90 = (y_scores > 70000)
precision_score(y_train_5, y_train_pred_90)
recall_score(y_train_5, y_train_pred_90)

And there it is.
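Reading a threshold off a plot is approximate. One possible way to take it from the arrays instead is to look for the first threshold whose precision reaches the target. This is a sketch on synthetic scores, not the MNIST scores from this post; rng, y_true, scores and threshold_90 are names invented for the example.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score

# Synthetic scores (an assumption): negatives centered at 0, positives at 3
rng = np.random.RandomState(42)
y_true = np.concatenate([np.zeros(900, dtype=bool), np.ones(100, dtype=bool)])
scores = np.concatenate([rng.normal(0, 1, 900), rng.normal(3, 1, 100)])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions has one more entry than thresholds, so drop the last one;
# argmax returns the first index where the condition is True
idx = int(np.argmax(precisions[:-1] >= 0.90))
threshold_90 = thresholds[idx]

# precision_recall_curve counts "score >= threshold" as a positive prediction
y_pred_90 = scores >= threshold_90
print(threshold_90, precision_score(y_true, y_pred_90))
```

Note that np.argmax returns 0 when no threshold reaches the target, so in real use it is worth checking that at least one entry of precisions[:-1] >= 0.90 is True.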

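The ROC Curve

The ROC (receiver operating characteristic) curve is another tool commonly used with binary classifiers. It plots the true positive rate (TPR) against the false positive rate (FPR). FPR = 1 - TNR, where the true negative rate (TNR, also called specificity) is the ratio of negative instances correctly classified as negative. In other words, the ROC curve plots sensitivity (recall) against 1 - specificity.
First use roc_curve() to compute the TPR and FPR for various thresholds: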

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
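To make the two rates concrete, here is a minimal sketch that computes TPR and FPR by hand at a few thresholds on four made-up scores (tpr_fpr, y_toy and s_toy are names invented for this example). Conceptually, a ROC curve is what you get by sweeping the threshold like this and collecting the points.

```python
import numpy as np


def tpr_fpr(y_true, scores, threshold):
    """TPR and FPR when predicting positive for scores >= threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(scores) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    return float(tp / (tp + fn)), float(fp / (fp + tn))


y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]
# Lowering the threshold moves the point from bottom-left towards top-right
points = [tpr_fpr(y_toy, s_toy, t) for t in (0.8, 0.4, 0.35, 0.1)]
print(points)
```

Each tuple is (TPR, FPR); as the threshold drops, both rates can only stay the same or increase.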

Plot it:

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # diagonal: a purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
plt.show()

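There is a tradeoff here too: the higher the TPR, the more false positives (higher FPR) the classifier produces. The dashed line is the ROC curve of a purely random classifier; a good classifier stays as far from it as possible, towards the top-left corner. One good way to compare classifiers is the area under the curve (AUC): a perfect classifier has a ROC AUC of 1, a purely random one 0.5. Scikit-Learn provides a function to compute it: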

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5, y_scores)
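One way to build intuition for this number: ROC AUC equals the probability that a randomly chosen positive instance gets a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data; pairwise_auc is a helper written for this illustration, not a Scikit-Learn function.

```python
import numpy as np


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # every positive against every negative
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


auc_mixed = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
auc_perfect = pairwise_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
auc_constant = pairwise_auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
print(auc_mixed, auc_perfect, auc_constant)
```

Three of the four positive/negative pairs are ordered correctly in the first case, giving 0.75; a perfect ranking gives 1 and uninformative scores give 0.5. The pairwise comparison uses memory proportional to positives × negatives, so it is only meant for small examples.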

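So how do you choose between the ROC curve and the precision/recall (PR) curve?
Prefer the PR curve when the positive class is rare or when you care more about false positives than false negatives; otherwise use the ROC curve.

Train a RandomForestClassifier and compare its ROC curve and ROC AUC score with the SGDClassifier's. RandomForestClassifier has no decision_function() method, but it does have predict_proba(), which returns an array with one row per instance and one column per class, each entry being the probability that the instance belongs to that class.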

from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10,random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,method="predict_proba")

Since there are no scores, use the probability of the positive class as the score:

y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5,y_scores_forest)

Plot both curves to compare:

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
plt.show()

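The random forest's curve looks better than the SGDClassifier's (it comes closer to the top-left corner), and its ROC AUC score is higher as well: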

roc_auc_score(y_train_5, y_scores_forest)
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)
recall_score(y_train_5, y_train_pred_forest)

Its precision and recall look quite good too.

Multiclass Classification

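Multiclass classifiers can distinguish between more than two classes. Some algorithms (random forests, naive Bayes) handle multiple classes directly; others (support vector machines, linear classifiers) are strictly binary, but there are several strategies for doing multiclass classification with binary classifiers:

OvA (one-versus-all): train one binary classifier per class, get each classifier's score for an instance, and assign the class whose classifier scores highest.
OvO (one-versus-one): train one binary classifier for every pair of classes and assign the class that wins the most duels.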
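The two strategies differ a lot in how many binary classifiers they need. A small counting sketch for the ten MNIST digits (ova_tasks and ovo_tasks are illustrative names, and nothing is trained here):

```python
from itertools import combinations

classes = list(range(10))  # the digits 0..9

# OvA: one "this class versus the rest" problem per class
ova_tasks = [(c, "rest") for c in classes]

# OvO: one problem per unordered pair of classes
ovo_tasks = list(combinations(classes, 2))

print(len(ova_tasks), len(ovo_tasks))
```

For N classes OvA needs N classifiers while OvO needs N × (N - 1) / 2, that is 10 versus 45 here, although each OvO classifier only sees the training instances of its two classes.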

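Some algorithms (such as support vector machines) scale poorly with the size of the training set, so OvO is the better choice for them. For most binary classifiers OvA is preferred. Scikit-Learn automatically runs OvA when a binary classifier is used for a multiclass task (OvO for SVM classifiers). Try it with SGDClassifier: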

sgd_clf.fit(X_train, y_train)
sgd_clf.predict([some_digit])

Under the hood, Scikit-Learn trained 10 binary classifiers and picked the class with the highest score. You can inspect the scores with decision_function():

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores
np.argmax(some_digit_scores)
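One detail worth keeping in mind: np.argmax() returns a column index, not a label. A fitted Scikit-Learn classifier stores its labels in the classes_ attribute, and the index has to be mapped through it. With MNIST the index and the digit happen to coincide; the toy values below (classes, scores) are made up to show a case where they do not.

```python
import numpy as np

# Hypothetical stand-ins for clf.classes_ and clf.decision_function([x])
classes = np.array([3, 5, 8])
scores = np.array([[-2.1, 0.7, -0.3]])

best_index = int(np.argmax(scores))    # position of the highest score
predicted = int(classes[best_index])   # label stored at that position
print(best_index, predicted)
```

Here the highest score sits at index 1, which corresponds to class 5.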

To force Scikit-Learn to use OvO or OvA, use the OneVsOneClassifier or OneVsRestClassifier classes.
An example with OvO:

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(max_iter=5, tol=-np.inf, random_state=42))
ovo_clf.fit(X_train, y_train)
ovo_clf.predict([some_digit])

len(ovo_clf.estimators_) confirms it: 45 estimators, one per pair of digits. Training a RandomForestClassifier is just as easy:

forest_clf.fit(X_train, y_train)
forest_clf.predict([some_digit])

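Random forests can classify instances into multiple classes directly, so no OvA or OvO is needed. predict_proba() shows the probability assigned to each class: forest_clf.predict_proba([some_digit])
Use cross-validation to check the accuracy: cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
Simply scaling the inputs improves the accuracy further:

from sklearn.preprocessing import StandardScaler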
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
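For reference, standardization subtracts each feature's mean and divides by its standard deviation. A hand-rolled sketch on a tiny made-up matrix (X_toy is hypothetical; in practice StandardScaler is the tool to use):

```python
import numpy as np

# Hypothetical data: 3 instances, 2 features on very different scales
X_toy = np.array([[0.0, 10.0],
                  [2.0, 20.0],
                  [4.0, 60.0]])

mean = X_toy.mean(axis=0)
std = X_toy.std(axis=0)   # population standard deviation (ddof=0)
std[std == 0] = 1.0       # leave constant features (e.g. always-blank pixels) unscaled

X_scaled = (X_toy - mean) / std
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the gradient updates just because of its units.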

Error Analysis

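In a normal workflow you would now fine-tune the hyperparameters with a grid search to get a good model. That is skipped here; let's go straight to finding ways to improve the model through error analysis.
Start with the confusion matrix: make predictions with cross_val_predict(), then call confusion_matrix():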

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)
conf_mx

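View the confusion matrix as an image with Matplotlib's matshow():

def plot_confusion_matrix(matrix):
    """If you prefer color and a colorbar"""
    fig = plt.figure(figsize=(8,8))
    ax = fig.add_subplot(111)
    cax = ax.matshow(matrix)
    fig.colorbar(cax)  # the colorbar promised by the docstring
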
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

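Focus on the errors: divide each value in the confusion matrix by the number of images in the corresponding actual class, so that you compare error rates rather than absolute counts.

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
# fill the diagonal with zeros to keep only the errors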
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
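Why divide by the row sums? Absolute error counts make frequent classes look worse than they are. A toy 3-class matrix with invented counts (conf_toy) shows the difference:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
conf_toy = np.array([[50,  0,  0],
                     [ 5, 40,  5],
                     [ 2,  8, 10]])

row_sums = conf_toy.sum(axis=1, keepdims=True)  # number of instances per actual class
error_rates = conf_toy / row_sums               # each row now sums to 1
np.fill_diagonal(error_rates, 0)                # drop correct predictions, keep errors
print(error_rates)
```

Class 1 sends 5 of its 50 instances to class 0 (a rate of 0.1), while class 2 sends 8 of only 20 instances to class 1 (a rate of 0.4). The raw counts 5 and 8 look similar, but the normalized matrix shows that class 2 is the real trouble spot.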

The 8s and 9s are a mess, and 3s and 5s also get confused with each other. Let's look at individual 3s and 5s to see what is going on:

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    # pad the last row with a blank image so every row has the same width
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8,8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

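These 3s and 5s don't play by the rules. Even this 69-year-old comrade can't tell some of them apart, let alone a classifier.

Multilabel Classification

Take face recognition: a classifier trained to recognize Xiaohong, Xiaoming and Xiaoqiang should, for a photo showing Xiaohong and Xiaoqiang, output [1, 0, 1], one binary label per person rather than a single class. A small example: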

from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

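The first label says whether the digit is large (>= 7), the second whether it is odd. Train a KNeighborsClassifier on both labels, then predict: knn_clf.predict([some_digit]). The result is correct.
How to evaluate a multilabel classifier depends on the project. One option is to compute the F1 score for each label and average them, which assumes all labels are equally important:

y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)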
f1_score(y_multilabel, y_train_knn_pred, average="macro")

To weight each label by its support (the number of instances with that target label) instead, set average="weighted".
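The difference between the two averaging modes is just a plain mean versus a support-weighted mean of the per-label F1 scores. A sketch with made-up per-label numbers (label_f1 and label_support are hypothetical values, not results from the model above):

```python
import numpy as np

label_f1 = np.array([0.9, 0.5])        # hypothetical F1 score of each label
label_support = np.array([900, 100])   # hypothetical number of positive instances per label

macro_f1 = float(label_f1.mean())                                 # every label counts equally
weighted_f1 = float(np.average(label_f1, weights=label_support))  # frequent labels count more
print(macro_f1, weighted_f1)
```

The macro average is 0.7, while the weighted average is 0.86 because the frequent, well-predicted label dominates. Which one is appropriate depends on whether rare labels matter as much as common ones in your project.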

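Multioutput Classification

The last type is multioutput-multiclass classification: a generalization of multilabel classification in which each label can itself be multiclass.
As an illustration, build a system that removes noise from images: it takes a noisy digit image as input and should output a clean one. Like the MNIST images, the output is an array of pixel intensities, so there is one label per pixel and each label can take any value from 0 to 255.

Create the training and test sets by adding noise to the MNIST images with NumPy's randint(); the targets are the original images: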

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

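Take a look at the image. Whoa:

# plot_digit() is used here but not defined earlier in this post; a minimal version:
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.binary, interpolation="nearest")
    plt.axis("off")

some_index = 5500
plt.subplot(121); plot_digit(X_test_mod[some_index])
plt.subplot(122); plot_digit(y_test_mod[some_index])
save_fig("noisy_digit_example_plot")  # figure-saving helper from the setup code
plt.show()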

Clean it up:

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)
save_fig("cleaned_digit_example_plot")

That about wraps up Chapter 3. The Titanic and spam filter exercises are still to come; I'll write them up when I find the time.