Supervised Machine Learning Example 01: Classifying the Iris Dataset

Introduction to the Iris dataset:

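The Iris Data Set first appeared in the 1936 paper "The use of multiple measurements in taxonomic problems" by the renowned British statistician and biologist Ronald Fisher, where it was used to introduce linear discriminant analysis. The dataset covers three species of iris: Iris Setosa, Iris Versicolour, and Iris Virginica. Fifty samples were collected for each species, so the dataset contains 150 samples in total.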

**Features:** Each of the 150 samples has 4 features (unit: cm): sepal length, sepal width, petal length, and petal width. By convention, m denotes the number of samples and n the number of features per sample, so here m = 150 and n = 4.

1. Importing the dataset from sklearn

from sklearn import datasets
iris = datasets.load_iris()

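This loads all of the Iris data and metadata into the variable iris. The measurements and the labels are available through its data and target attributes.

iris.data  # inspect the measurements

The output is an array of 150 elements, each holding four values: the sepal and petal measurements.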

array([[5.1, 3.5, 1.4, 0.2],[4.9, 3. , 1.4, 0.2],[4.7, 3.2, 1.3, 0.2],[4.6, 3.1, 1.5, 0.2],[5. , 3.6, 1.4, 0.2],[5.4, 3.9, 1.7, 0.4],···
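The size of the array can be checked directly instead of being read off the printed output. A minimal sketch, assuming the same load_iris() call as above. The names m and n simply follow the notation used earlier in this post and are not part of the scikit-learn API:

```python
from sklearn import datasets

iris = datasets.load_iris()

# data is a 2D array: one row per sample, one column per feature
m, n = iris.data.shape
print(m, n)  # 150 4

# feature_names lists the four measured quantities in column order
print(iris.feature_names)
```

The order of feature_names matches the column order of iris.data, which is what the column indices in the plotting code later rely on.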

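Next, look at the species of the 150 samples. There are three: setosa, versicolor, and virginica.

iris.target

The output contains 150 values, where 0, 1, and 2 stand for setosa, versicolor, and virginica respectively.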

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The names of the three iris species are:

iris.target_names  # access the target_names attribute of iris

The output lists the three iris classes:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
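The numeric labels in target and the strings in target_names can be combined to confirm that each species really has 50 samples. A minimal sketch, assuming NumPy is available. The variable names counts and species_names are illustrative only:

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# count how many samples carry each integer label (0, 1, 2)
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)

# indexing target_names with the label array maps every label to its name
species_names = iris.target_names[iris.target]
print(species_names[:3])
```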

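2. Visualizing the Iris dataset

Using the matplotlib library, draw a scatter plot in which the three species appear in three colors. In the original figure, blue, green, and red stand for setosa, versicolor, and virginica (the exact colors depend on the colormap matplotlib applies). The x axis shows sepal length and the y axis shows sepal width.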

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 0]  # x axis: sepal length
y = iris.data[:, 1]  # y axis: sepal width
species = iris.target
# pad the axis limits by 0.5 cm on each side
x_min, x_max = x.min() - 0.5, x.max() + 0.5
y_min, y_max = y.min() - 0.5, y.max() + 0.5
plt.figure()
plt.title('Iris Dataset - Classification By Sepal Sizes')
plt.scatter(x, y, c=species)  # one color per species
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())  # hide the tick marks
plt.yticks(())
plt.show()

Now modify the code to plot the other two variables, petal length and petal width:

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, 2]  # x axis: petal length
y = iris.data[:, 3]  # y axis: petal width
species = iris.target
# pad the axis limits by 0.5 cm on each side
x_min, x_max = x.min() - .5, x.max() + .5
y_min, y_max = y.min() - .5, y.max() + .5
plt.figure()
plt.title('Iris Dataset - Classification By Petal Sizes', size=14)
plt.scatter(x, y, c=species)  # one color per species
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())  # hide the tick marks
plt.yticks(())
plt.show()

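3. Principal component analysis (PCA)

Principal Component Analysis (PCA) reduces the dimensionality of a system while keeping enough information to describe each data point. The new dimensions it produces are called principal components. Here, the four sepal and petal measurements that characterize the three species are combined and projected onto three principal components, which are then shown in a 3D scatter plot.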

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
species = iris.target
# project the four measurements onto the first three principal components
x_reduced = PCA(n_components=3).fit_transform(iris.data)
# 3D scatter plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.set_title('Iris Dataset by PCA', size=14)
ax.scatter(x_reduced[:, 0], x_reduced[:, 1], x_reduced[:, 2], c=species)
ax.set_xlabel('First eigenvector')
ax.set_ylabel('Second eigenvector')
ax.set_zlabel('Third eigenvector')
# hide the tick labels on all three axes
ax.xaxis.set_ticklabels(())
ax.yaxis.set_ticklabels(())
ax.zaxis.set_ticklabels(())
plt.show()

In the resulting 3D scatter plot, the three iris species each form their own cluster.
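How much information each principal component keeps can be read from the fitted PCA object. A minimal sketch, assuming the same PCA(n_components=3) setting as above. The variable names pca and ratios are illustrative only:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()

# fit PCA on all four measurements, keeping three components
pca = PCA(n_components=3).fit(iris.data)

# fraction of the total variance captured by each component,
# sorted from the largest to the smallest
ratios = pca.explained_variance_ratio_
print(ratios)
print(ratios.sum())  # share of the variance retained by the three components together
```

On this dataset the first component carries by far the largest share of the variance, which is why the clusters are already well separated along the first axis of the plot.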

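Classifying the Iris dataset

In the model-testing phase we check whether the model built from the previously collected data is valid. The collected data is split into two parts: the data used to build the model is the training set, and the data used to check it is the validation set (also called the test set). The best-known validation technique is cross-validation: the data is divided into several parts, each part takes a turn as the validation set while the remaining parts serve as the training set, and iterating in this way helps arrive at the best model.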
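The rotation of validation parts described above can be sketched with scikit-learn's cross_val_score helper. This is an illustration rather than the procedure followed in the rest of this post. The choice of 5 folds is an assumption, and the model is the k-nearest neighbors classifier introduced in the next section:

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn = KNeighborsClassifier()

# cv=5 (an assumed setting): split the data into 5 parts; each part is
# used once for validation while the other 4 are used for training
scores = cross_val_score(knn, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```

Averaging over several folds gives a steadier estimate than a single split, because the result no longer depends on which few samples happen to land in the validation set.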

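K-nearest neighbors classifier

1. Split the data into a training set and a validation set: 140 samples are used to train the model and 10 samples form the validation set.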

# K-nearest neighbors classifier
import numpy as np
from sklearn import datasets

np.random.seed(0)
iris = datasets.load_iris()
x = iris.data
y = iris.target
# shuffle the sample indices first, then split
i = np.random.permutation(len(iris.data))
# training set: all but the last 10 shuffled samples
x_train = x[i[:-10]]
y_train = y[i[:-10]]
# validation set: the last 10 shuffled samples
x_test = x[i[-10:]]
y_test = y[i[-10:]]
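The same kind of 140/10 split can also be produced with scikit-learn's train_test_split helper. This is an alternative sketch, not the approach used in this post: the random_state value and the stratify option are assumptions, and the resulting samples differ from the permutation-based split above.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# test_size=10 holds out exactly 10 samples; stratify keeps the three
# species roughly balanced in both parts (assumed settings)
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0, stratify=iris.target
)

print(x_train.shape, x_test.shape)
```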

2. Call the constructor of the KNN classifier, then train it with the fit() function. Training the knn classifier on the 140 observations yields the predictive model.

In[ ]:  from sklearn.neighbors import KNeighborsClassifier
        knn = KNeighborsClassifier()  # call the classifier's constructor
        knn.fit(x_train, y_train)     # train it with fit()
Out[ ]: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                             metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                             weights='uniform')

3. Check the model against the validation set. To get the predictions, call the predict() function directly on the trained model knn.

In[1]:  knn.predict(x_test)
Out[1]: array([1, 2, 1, 0, 0, 0, 2, 1, 2, 0])
# compare the predictions with the actual values in y_test
In[2]:  y_test
Out[2]: array([1, 1, 1, 0, 0, 0, 2, 1, 2, 0])
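The comparison of the two arrays can also be done programmatically. A minimal sketch in plain Python, with the values copied from the two outputs above. The names y_pred, y_true, errors, and error_rate are illustrative only:

```python
# predictions and actual labels, copied from Out[1] and Out[2]
y_pred = [1, 2, 1, 0, 0, 0, 2, 1, 2, 0]
y_true = [1, 1, 1, 0, 0, 0, 2, 1, 2, 0]

# count the positions where the prediction differs from the actual label
errors = sum(p != t for p, t in zip(y_pred, y_true))
error_rate = errors / len(y_true)

print(errors)      # 1
print(error_rate)  # 0.1
```

Only the second sample is misclassified (predicted 2, actually 1). With the fitted model at hand, knn.score(x_test, y_test) returns the corresponding accuracy directly.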

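One of the ten predictions is wrong, so the error rate is 10%. To make the model's decisions easier to see, we plot the decision boundaries for the sepal measurements and for the petal measurements separately.

Decision boundaries for the sepal measurements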

# plot the decision boundaries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:, :2]  # sepal length and sepal width
y = iris.target
x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
# mesh
# three colors for the three decision regions
cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA', '#FFAAAA'])
h = 0.02  # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x, y)
# predict the class of every point on the mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# plot the training data
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.title('Sepal Length Decision Boundary', size=14)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

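The plot clearly shows a small patch of the red region reaching into the neighboring region: classes 1 and 2, versicolor and virginica, overlap. A model trained on such data will therefore make some wrong predictions, which is consistent with the 10% error between the predictions for x_test and the actual values in y_test.

Decision boundaries for the petal measurements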

# plot the decision boundaries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
x = iris.data[:, 2:4]  # petal length and petal width
y = iris.target
x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
# mesh
# three colors for the three decision regions
cmap_light = ListedColormap(['#AAAAFF', '#AAFFAA', '#FFAAAA'])
h = 0.02  # step size of the mesh
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
knn = KNeighborsClassifier()
knn.fit(x, y)
# predict the class of every point on the mesh
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# plot the training data
plt.scatter(x[:, 0], x[:, 1], c=y)
plt.title('Petal Length Decision Boundary', size=14)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.show()

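Likewise, this plot shows the red and green regions running into each other: classes 1 and 2, versicolor (green) and virginica (red), overlap. This introduces errors into the trained model's predictions and explains the misclassification we saw when comparing the predictions for x_test with the actual values in y_test.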

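The Iris dataset is one of the most common datasets in machine learning. Learning to import it, access its attributes, and turn it into 2D and 3D plots with matplotlib gives a much clearer understanding of the model. Still, a broader analysis of the dataset is needed to really master data analysis.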