文章目录

一、KNN
- 1.1 KNN实现分类
- - 1.1.1 二分类
  - 1.1.2 多分类
- 1.2 回归
二、线性回归
- 4.1 正则化
- - 4.1.1 岭回归
  - 4.1.2 Lasso回归
五、随机森林
- 5.1参数说明
- 5.2 随机森林应用
- 5.3 随机森林做回归
六、SVR
- 6.1 参数说明
- 6.2 SVM的优势和不足
- 6.3 房价预测示例
七、MLP
- 7.1 参数
- 7.2 优缺点
- 7.3 调参经验
八、数据预处理、降维、特征提取、聚类
- 8.1 数据预处理
- - 8.1.1 StandardScaler数据预处理
  - 8.1.2 MinMaxScaler 数据预处理
  - 8.1.3 RobustScaler 数据预处理
  - 8.1.4 Normalizer 数据预处理
  - 8.1.5 机器学习中的步骤
- 8.2 特征降维
- - 8.2.1 PCA主成分分析
  - 8.2.2 参数调整
8.3 特征提取
- 8.3.1 PCA进行特征提取
- - 8.3.2 非负矩阵分解用于特征提取
- 8.4 聚类算法
- - 8.4.1 k-means聚类
  - 8.4.2 DBSCAN
九、数据表达与特征工程
- 9.1 使用哑变量转化类型特征
- 9.2 对数据进行装箱处理
- - 9.2.1 OneHotEncoder
- 9.3 数据升维
- - 9.3.1 向数据集添加交互式特征（hstack）
  - 9.3.2 添加多项式特征（PolynomialFeatures）
十、模型评估与优化
- 10.1 交叉验证
- 10.2 分层K折交叉验证
- - 10.2.1 随机拆分
  - 10.2.2 留一法
  - 10.2.3 K折交叉验证怎么选定最终模型的？
- 10.3 网格搜索
- 10.4 分类模型的可信度评估
- - 10.4.1 分类模型的预测准确率（predict_proba）
  - 10.4.2 分类模型的决定系数（decision_function）
  - 10.4.3 模型的评分指标
十一、模型的管道模型
- 11.1 管道模型基本概念
- 11.2 使用管道模型进行模型选择和参数调优
- 11.3 使用管道模型进行参数调优

一、KNN

1.1 KNN实现分类

#导入数据集
from sklearn.datasets import make_blobs
# 导入KNN
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# 生成样本数200，分类数为2的数据集
data = make_blobs(n_samples=200,centers=2,random_state=8)
X,y = data

# 可视化
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.spring,edgecolors='k')
plt.show()

1.1.1 二分类

import numpy as np
clf = KNeighborsClassifier()
clf.fit(X,y)

KNeighborsClassifier()

# 画图的代码
x_min,x_max = X[:,0].min()-1,X[:,0].max()+1
y_min,y_max = X[:,1].min()-1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))

Z = clf.predict(np.c_[xx.ravel(),yy.ravel()])

Z=Z.reshape(xx.shape)
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Pastel1)
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.spring,edgecolors='k')
# 测试新的数据点
plt.scatter(6.5,4.3,marker="*",c='red',s=200)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.title("Classifier:KNN")
plt.show()

<ipython-input-11-876e09abd1c0>:2: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Pastel1)

print("新数据点的分类：", clf.predict([[6.5,4.3]]))

新数据点的分类： [1]

1.1.2 多分类

# 生成样本数500，分类数为5的数据集
data = make_blobs(n_samples=500,centers=5,random_state=8)
X,y = data

# 可视化
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.spring,edgecolors='k')
plt.show()

clf = KNeighborsClassifier()
clf.fit(X,y)
# 画图的代码
x_min,x_max = X[:,0].min()-1,X[:,0].max()+1
y_min,y_max = X[:,1].min()-1,X[:,1].max()+1
xx,yy = np.meshgrid(np.arange(x_min,x_max,.02),np.arange(y_min,y_max,.02))
Z = clf.predict(np.c_[xx.ravel(),yy.ravel()])
Z=Z.reshape(xx.shape)
plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Pastel1)
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.spring,edgecolors='k')
# 测试新的数据点
plt.scatter(6.5,4.3,marker="*",c='red',s=200)
plt.xlim(xx.min(),xx.max())
plt.ylim(yy.min(),yy.max())
plt.title("Classifier:KNN")
plt.show()

<ipython-input-21-817419eb6322>:9: MatplotlibDeprecationWarning: shading='flat' when X and Y have the same dimensions as C is deprecated since 3.3.  Either specify the corners of the quadrilaterals with X and Y, or pass shading='auto', 'nearest' or 'gouraud', or set rcParams['pcolor.shading'].  This will become an error two minor releases later.plt.pcolormesh(xx,yy,Z,cmap=plt.cm.Pastel1)

print("模型正确率：",clf.score(X,y))

模型正确率： 0.956

1.2 回归

"""KNN做线性回归"""
# 导入make_regression数据集生成器
from sklearn.datasets import make_regression
# 生成特征数为1，噪声为50的数据集
X,y = make_regression(n_features=1,n_informative=1,noise=50,random_state=8)
# 用散点图进行可视化
plt.scatter(X,y,c='orange',edgecolors='k')
plt.show()

# 导入用于回归分析的KNN模型
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor()
reg.fit(X,y)
plt.scatter(X,y,c='orange',edgecolors='k')
z=np.linspace(-3,3,200).reshape(-1, 1)
plt.plot(z,reg.predict(z),c='k',linewidth=3)
plt.title('KNN Regressor')
plt.show()

print("模型评分：",reg.score(X,y))

模型评分： 0.7721167832505298

# 减小模型的n_neighbors为2
reg2 = KNeighborsRegressor(n_neighbors=2)
reg2.fit(X,y)
plt.scatter(X,y,c='orange',edgecolors='k')
plt.plot(z,reg2.predict(z),c='k',linewidth=3)
plt.title('KNN Regressor')
plt.show()

print("模型评分：",reg2.score(X,y))

模型评分： 0.8581798802065704

二、线性回归

# 导入糖尿病数据集
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
X,y = load_diabetes().data,load_diabetes().target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
model = LinearRegression().fit(x_train,y_train)
y_hat = model.predict(x_test)plt.figure()
plt.scatter(np.arange(len(y_test)),y_test,c='b',s=80)
plt.plot(np.arange(len(y_test)),y_hat,c='r')
plt.title('Linear Regression')
plt.show()print("model.coef_：%s\nmodel.intercept_：%s\n"%(model.coef_,model.intercept_))
print("训练集得分：%s\n测试集得分：%s\n"%(model.score(x_train,y_train),model.score(x_test,y_test)))

model.coef_：[  -14.75505679  -241.43303314   526.17137532   309.18178014-1212.52478411   775.29031634   369.07778164   292.96840184933.01606171    71.1684786 ]
model.intercept_：154.04782481274086训练集得分：0.5047765530203177
测试集得分：0.5503967724591143

4.1 正则化

4.1.1 岭回归

现实中，线性回归及其容易过拟合，因此引入Ridge回归（增加了参数惩罚项）
α\alphaα越大，参数model.coef_越小，越容易防止模型过拟合

注意：若数据量足够大，那么是否正则化就没有这么重要了

# 导入糖尿病数据集
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split,GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
X,y = load_diabetes().data,load_diabetes().target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)alpha = np.logspace(-3,2,10)
model =Ridge()
model = GridSearchCV(model,param_grid={'alpha':alpha},cv=5)
model.fit(x_train,y_train)
model = model.best_estimator_
y_hat = model.predict(x_test)plt.figure()
plt.scatter(np.arange(len(y_test)),y_test,c='b',s=80)
plt.plot(np.arange(len(y_test)),y_hat,c='r')
plt.title('Linear Regression')
plt.show()print("model.coef_：%s\nmodel.intercept_：%s\n"%(model.coef_,model.intercept_))
print("训练集得分：%s\n测试集得分：%s\n"%(model.score(x_train,y_train),model.score(x_test,y_test)))

model.coef_：[   7.66936383 -229.35690411  507.13340775  320.64869666 -136.63000277-30.12558503 -183.03638164  111.35921077  443.23198819   55.44890572]
model.intercept_：153.8951274627954训练集得分：0.4957238036650097
测试集得分：0.5478414342339397

4.1.2 Lasso回归

L1正则化，Ridge是L2正则化
如果特征过多，而且只有一部分是真正重要的，用Lasso回归

# 导入糖尿病数据集
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split,GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
X,y = load_diabetes().data,load_diabetes().target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)alpha = np.logspace(-3,2,10)
model =Lasso()
model = GridSearchCV(model,param_grid={'alpha':alpha},cv=5)
model.fit(x_train,y_train)
model = model.best_estimator_
y_hat = model.predict(x_test)plt.figure()
plt.scatter(np.arange(len(y_test)),y_test,c='b',s=80)
plt.plot(np.arange(len(y_test)),y_hat,c='r')
plt.title('Linear Regression')
plt.show()print("特征总数：",X.shape[1])
print("Lasso使用的特征数：%s\n"%(np.sum(model.coef_!=0)))
print("训练集得分：%s\n测试集得分：%s\n"%(model.score(x_train,y_train),model.score(x_test,y_test)))

特征总数： 10
Lasso使用的特征数：6训练集得分：0.5359093939208899
测试集得分：0.40774356697454894

五、随机森林

随机森林不需要对数据进行预处理，不必太在意参数的调节。

5.1参数说明

njobs：并行处理。njobs=-1，调动CPU的全部内核，多进程运行随机森林。
randome_state：希望建模结果稳定，则一定要固化该参数。
n_estimators：决策树的数量，若是回归，则将每颗决策树的预测值取平均；分类就是对分类结果投票。
max_feature：随机特征的选择，取值越高，每颗决策树就会长得越像。若取值为1，则特征选择没有余地。

注意：

1）对于超高维数据集和稀疏数据集，随机森林的效果可能不如线性模型。另外，随机森林相对更消耗内存。
2）随机森林可以对特征的重要性进行排序

5.2 随机森林应用

5.3 随机森林做回归

# 导入糖尿病数据集
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split,GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
X,y = load_diabetes().data,load_diabetes().target
print(len(X))
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)model =RandomForestRegressor(random_state=1,n_jobs=-1)
param_grid = {"n_estimators":[100,500,200],"max_depth":[3,5,10,20]}
model = GridSearchCV(model,param_grid=param_grid,cv=5)
model.fit(x_train,y_train)
model = model.best_estimator_
print("最优模型为：", model)
y_hat = model.predict(x_test)plt.figure()
plt.scatter(np.arange(len(y_test)),y_test,c='b',s=80)
plt.plot(np.arange(len(y_test)),y_hat,c='r')
plt.title('Linear Regression')
plt.show()print("训练集得分：%s\n测试集得分：%s\n"%(model.score(x_train,y_train),model.score(x_test,y_test)))

442
最优模型为： RandomForestRegressor(max_depth=3, n_estimators=500, n_jobs=-1, random_state=1)

训练集得分：0.5957864136004931
测试集得分：0.42539072690529023

六、SVR

6.1 参数说明

6.2 SVM的优势和不足

1、适合小数量级的样本，比如1w以内。
2、对数据预处理和参数调节要求非常高，随机森林就对参数和数据预处理要求不高。

6.3 房价预测示例


from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR# 导入房价数据集
boston = load_boston()
# 划分数据集
X,y = boston.data, boston.target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
# 分别测试线性核函数和rbf核函数
for kernel in ['linear','rbf']:svr = SVR(kernel=kernel)svr.fit(x_train,y_train)print("训练集得分：%s\n测试集得分：%s\n"%(svr.score(x_train,y_train),svr.score(x_test,y_test)))

训练集得分：0.705886179377861
测试集得分：0.6754116930707594训练集得分：0.1938033042097428
测试集得分：0.2603372256573472

注意到效果较差，考虑是不是特征量级相差过大，所以将特征的最大值和最小值绘制出来

遂考虑数据预处理。（可以看出SVM对于数据预处理的要求是很苛刻的）

# 导入数据预处理工具
from sklearn.preprocessing import StandardScaler
# 对训练集和测试集进行预处理
scaler = StandardScaler()
scaler.fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
# 分别测试线性核函数和rbf核函数
for kernel in ['linear','rbf']:svr = SVR(kernel=kernel)svr.fit(x_train_scaled,y_train)print("训练集得分：%s\n测试集得分：%s\n"%(svr.score(x_train_scaled,y_train),svr.score(x_test_scaled,y_test)))
svr = SVR(C=100,gamma=0.1)
svr.fit(x_train_scaled,y_train)
y_hat = svr.predict(x_test_scaled)
print("训练集得分：%s\n测试集得分：%s\n"%(svr.score(x_train_scaled,y_train),svr.score(x_test_scaled,y_test)))

训练集得分：0.7092659913398127
测试集得分：0.6771114835234099训练集得分：0.6654487747233812
测试集得分：0.7237752269094007训练集得分：0.9713986175310146
测试集得分：0.8467577918219455

# 可视化
plt.figure()
plt.scatter(np.arange(len(y_test)),y_test,c='b',s=80)
plt.plot(y_hat,c='r')
plt.title('Linear Regression')
plt.show()

七、MLP

多层感知机（MLP）也称前馈神经网络。

7.1 参数

增加alpha的值为加大模型的正则化，让模型更简单
隐藏层增加，决策边界更细腻
节点数越多，决定边界越平滑
tanh激活函数，会使决定边界变得平滑

7.2 优缺点

只适合小数据集

7.3 调参经验

对特征要求高，需要归一化

神经网络的隐藏层节点数约等于训练数据集的特征数量，但是一般不要超过500，在开始训练模型时，让模型尽可能复杂，然后对正则化参数alpha进行调节提高模型表现。

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
# 生成随机序列
rnd = np.random.RandomState(38)
x = rnd.uniform(-5,5,size=50)
# 像数据中添加噪声
y_no_noise = (np.cos(6*x)+x)
X = x.reshape(-1,1)
y = (y_no_noise + rnd.normal(size = len(x)))/2
# 生成等差数列
line = np.linspace(-5,5,1000,endpoint = False).reshape(-1,1)
model = MLPRegressor().fit(X,y)
plt.plot(line,model.predict(line),label = 'MLP',c='k')
plt.scatter(X,y)
plt.legend(loc='best')
plt.show()

八、数据预处理、降维、特征提取、聚类

8.1 数据预处理

import numpy
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=40, centers=2, random_state=50, cluster_std=2)
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.cool)
plt.show()

8.1.1 StandardScaler数据预处理

用均值和方差进行转换

将数据的特征值转变为均值为0，而方差为1的状态，确保数据“大小”一致，利于模型训练

# 数据预处理
from sklearn.preprocessing import StandardScalerX_1 = StandardScaler().fit_transform(X)
plt.scatter(X_1[:,0],X_1[:,1],c=y,cmap=plt.cm.cool)
plt.show()

8.1.2 MinMaxScaler 数据预处理

将所有特征值转换到0到1之间

from sklearn.preprocessing import MinMaxScaler
X_2 = MinMaxScaler().fit_transform(X)
plt.scatter(X_2[:,0],X_2[:,1],c=y,cmap=plt.cm.cool)
plt.show()

8.1.3 RobustScaler 数据预处理

用中位数和四分位数转换，直接将数据的异常值踢出去。

from sklearn.preprocessing import RobustScaler
X_3 = RobustScaler().fit_transform(X)
plt.scatter(X_3[:,0],X_3[:,1],c=y,cmap=plt.cm.cool)
plt.show()

8.1.4 Normalizer 数据预处理

将样本距离转换为欧几里得距离为1，把数据的分布变为半径为1的圆或者球。

在我们只想保留特征向量的方向，而忽略其数值的时候使用。

from sklearn.preprocessing import Normalizer
X_4 = Normalizer().fit_transform(X)
plt.scatter(X_4[:,0],X_4[:,1],c=y,cmap=plt.cm.cool)
plt.show()

8.1.5 机器学习中的步骤

8.2 特征降维

特征降维（特征过多，特征的相关性过强）

8.2.1 PCA主成分分析

8.2.2 参数调整

from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine# 对红酒数据集预处理
scaler = StandardScaler()
wine_dataset = load_wine()
X, y = wine_dataset['data'], wine_dataset['target']
X_scaled = scaler.fit_transform(X)
# 打印处理后的数据集形态
print(X_scaled.shape)# 导入PCA
from sklearn.decomposition import PCA
# 设置主成分为2方便可视化
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
print(X_pca.shape)

(178, 13)
(178, 2)

# 可视化
X0 = X_pca[y==0]
X1 = X_pca[y==1]
X2 = X_pca[y==2]
plt.scatter(X0[:,0],X0[:,1],c='b',s=60,edgecolors='k')
plt.scatter(X1[:,0],X1[:,1],c='g',s=60,edgecolors='k')
plt.scatter(X2[:,0],X2[:,1],c='r',s=60,edgecolors='k')
plt.legend(wine_dataset['target_names'])
plt.xlabel('component 1')
plt.ylabel('component 2')
plt.show()

8.3 特征提取

8.3.1 PCA进行特征提取

在数据集极为复杂的情况下，我们要进行特征提取。

PCA的白化功能，对于图片数据，相邻像素之间有很大的相关性，这样一来，样本特征就是冗余的了，白化的目的就是降低冗余性。他会让样本特征间的相关度降低，且所有特征具有相同的方差。

8.3.2 非负矩阵分解用于特征提取

8.4 聚类算法

8.4.1 k-means聚类

8.4.2 DBSCAN

参数：

添加链接描述

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn.datasets as ds
from sklearn.cluster import KMeans,DBSCAN"""自己做一份数据"""
# x是(400,2),2个特征，y是400行，有四类
x, y = ds.make_blobs(400, n_features=2, centers=4, random_state=2)"""K-means聚类"""
model=KMeans(n_clusters=4,init='k-means++')
y_pred=model.fit_predict(x,y)"""密度聚类"""
model_1 = DBSCAN(eps=0.75, min_samples=10)
y_pred_1 = model_1.fit_predict(x, y)"""绘制图像"""
# 原始图像
plt.figure()
plt.subplot(2,2,1)
plt.scatter(x[:,0],x[:,1],c=y,edgecolors='k',alpha=0.7)
plt.title('initial picture')
plt.grid()plt.subplot(2,2,2)
plt.scatter(x[:,0],x[:,1],c=y_pred,edgecolors='k',alpha=0.7)
plt.title('k-means picture')
plt.grid()plt.subplot(2,2,3)
plt.scatter(x[:,0],x[:,1],c=y_pred_1,edgecolors='k',alpha=0.7)
plt.title('DBSCAN picture')
plt.grid()
plt.show()
plt.show()

九、数据表达与特征工程

9.1 使用哑变量转化类型特征

import pandas as pd fruits = pd.DataFrame({'数值特征':[5,6,7,8,9],'类型特征':['西瓜','香蕉','橘子','苹果','葡萄']})
display(fruits)

	数值特征	类型特征
0	5	西瓜
1	6	香蕉
2	7	橘子
3	8	苹果
4	9	葡萄

# 转化数据表中的字符串为数值
fruits_dum = pd.get_dummies(fruits)
# fruits_dum = pd.get_dummies(fruits,columns=['类型特征'])
display(fruits_dum)

	数值特征	类型特征_橘子	类型特征_苹果	类型特征_葡萄	类型特征_西瓜	类型特征_香蕉
0	5	0	0	0	1	0
1	6	0	0	0	0	1
2	7	1	0	0	0	0
3	8	0	1	0	0	0
4	9	0	0	1	0	0

9.2 对数据进行装箱处理

不同模型对于同样的特征，其结果相差是很大的，如下面的MLP和KNN回归对比

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
# 生成随机序列
rnd = np.random.RandomState(38)
x = rnd.uniform(-5,5,size=50)
# 像数据中添加噪声
y_no_noise = (np.cos(6*x)+x)
X = x.reshape(-1,1)
y = (y_no_noise + rnd.normal(size = len(x)))/2
# 生成等差数列
line = np.linspace(-5,5,1000,endpoint = False).reshape(-1,1)
model = MLPRegressor().fit(X,y)
knn = KNeighborsRegressor().fit(X,y)
plt.plot(line,model.predict(line),label = 'MLP',c='k')
plt.plot(line,knn.predict(line),label = 'KNN',c='r')
plt.scatter(X,y)
plt.legend(loc='best')
plt.show()

# 设置箱体数为11
bins = np.linspace(-5,5,11)
# 装箱操作
target_bin = np.digitize(X,bins=bins)
print("装箱数据范围：\n",bins)
print('前十个数据点的特征：\n',X[:10])
print('前十个数据点所在的箱子：\n',target_bin[:10])

装箱数据范围：[-5. -4. -3. -2. -1.  0.  1.  2.  3.  4.  5.]
前十个数据点的特征：[[-1.1522688 ][ 3.59707847][ 4.44199636][ 2.02824894][ 1.33634097][ 1.05961282][-2.99873157][-1.12612112][-2.41016836][-4.25392719]]
前十个数据点所在的箱子：[[ 4][ 9][10][ 8][ 7][ 7][ 3][ 4][ 3][ 1]]

9.2.1 OneHotEncoder

用新的方法来表达已经装箱的数据

from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(sparse=False)
onehot.fit(target_bin)
x_in_bin = onehot.transform(target_bin)
print("装箱后数据形态：{}".format(x_in_bin.shape))
print('Onehot处理装箱后的10个数据点的新特征：',x_in_bin[:10])

装箱后数据形态：(50, 10)
Onehot处理装箱后的10个数据点的新特征： [[0. 0. 0. 1. 0. 0. 0. 0. 0. 0.][0. 0. 0. 0. 0. 0. 0. 0. 1. 0.][0. 0. 0. 0. 0. 0. 0. 0. 0. 1.][0. 0. 0. 0. 0. 0. 0. 1. 0. 0.][0. 0. 0. 0. 0. 0. 1. 0. 0. 0.][0. 0. 0. 0. 0. 0. 1. 0. 0. 0.][0. 0. 1. 0. 0. 0. 0. 0. 0. 0.][0. 0. 0. 1. 0. 0. 0. 0. 0. 0.][0. 0. 1. 0. 0. 0. 0. 0. 0. 0.][1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

# 对数据装箱后，观察KNN和MLP的区别
new_line = onehot.transform(np.digitize(line,bins=bins))
model = MLPRegressor().fit(x_in_bin,y)
knn = KNeighborsRegressor().fit(x_in_bin,y)
plt.plot(line,model.predict(new_line),label = 'MLP',c='k')
plt.plot(line,knn.predict(new_line),label = 'KNN',c='r')
plt.scatter(X,y)
plt.legend(loc='best')
plt.show()

c:\users\chengjingd\pycharmprojects\pythonproject\venv\lib\site-packages\sklearn\neural_network\_multilayer_perceptron.py:614: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.warnings.warn(

9.3 数据升维

9.3.1 向数据集添加交互式特征（hstack）

线性模型在高维数据集上表现良好，低维表现不好，因此可以用下述方法进行特征扩充（Onehot拼接原始特征等方式）

array_1 = [1,2,3,4,5]
array_2 = [6,7,8,9,10]
array_3 = np.hstack((array_1,array_2))
print('合并后的特征：', array_3)

合并后的特征： [ 1  2  3  4  5  6  7  8  9 10]

# 将原始数据和装箱后的数据重叠
x_stack = np.hstack([X, x_in_bin])
print(x_stack.shape)

(50, 11)

line_stack = np.hstack([line,new_line])
model = MLPRegressor().fit(x_stack,y)
plt.plot(line,model.predict(line_stack),label = 'MLP',c='k')
plt.scatter(X,y)
for vline in bins:plt.plot([vline,vline],[-5,5],':',c='k')
plt.legend(loc='best')
plt.show()

# 使用新的堆叠方式处理数据
x_multi = np.hstack([x_in_bin,X*x_in_bin])
print(x_multi.shape)
print(x_multi[0])

(50, 20)
[ 0.         0.         0.         1.         0.         0.0.         0.         0.         0.        -0.        -0.-0.        -1.1522688 -0.        -0.        -0.        -0.-0.        -0.       ]

line_multi = np.hstack([new_line,line*new_line])
model = MLPRegressor().fit(x_multi,y)
plt.plot(line,model.predict(line_multi),label = 'MLP',c='k')
plt.scatter(X,y)
for vline in bins:plt.plot([vline,vline],[-5,5],':',c='k')
plt.legend(loc='best')
plt.show()

9.3.2 添加多项式特征（PolynomialFeatures）

可以在一定程度上解决模型欠拟合的问题

from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=20,include_bias=False)
X_poly = poly.fit_transform(X)
print(X_poly.shape)
print('原始数据集第一个特征：',X[0])
print('添加多项式特征后的数据集第一个特征：',X_poly[0])
print('PolynomialFeatures对原始数据处理：',poly.get_feature_names())

(50, 20)
原始数据集第一个特征： [-1.1522688]
添加多项式特征后的数据集第一个特征： [ -1.1522688    1.3277234   -1.52989425   1.76284942  -2.03127642.34057643  -2.6969732    3.10763809  -3.58083443   4.1260838-4.75435765   5.47829801  -6.3124719    7.27366446  -8.381216659.65741449 -11.12793745  12.82237519 -14.77482293  17.02456756]
PolynomialFeatures对原始数据处理： ['x0', 'x0^2', 'x0^3', 'x0^4', 'x0^5', 'x0^6', 'x0^7', 'x0^8', 'x0^9', 'x0^10', 'x0^11', 'x0^12', 'x0^13', 'x0^14', 'x0^15', 'x0^16', 'x0^17', 'x0^18', 'x0^19', 'x0^20']

## 9.4 自动特征选择### 9.4.1 单一变量法进行特征选择
![image.png](attachment:image.png)
![image-4.png](attachment:image-4.png)
![image-2.png](attachment:image-2.png)
对于噪声特别多的数据集，特征选择后模型评分会提高。
单一变量进行特征选择时，无论选择哪一个模型，对数据的处理方式是相同的。### 9.4.2 基于模型的特征选择
基于决策树的模型里内置了feature_importances_的属性，可以让SelectFromModel直接从属性中提取特征重要性。
threshold=median，意味着模型会选择一半左右的特征。
![image-3.png](attachment:image-3.png)### 9.4.3 迭代式特征选择
基于若干个模型进行特征选择。
![image-5.png](attachment:image-5.png)

十、模型评估与优化

10.1 交叉验证

对模型的泛化性能进行评估

K折交叉验证

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVCwine = load_wine()
svc = SVC(kernel='linear')
# 使用交叉验证法对SVC评分，默认cv=5
scores = cross_val_score(svc, wine.data, wine.target,cv=6)
print("交叉验证得分：",scores)

交叉验证得分： [0.86666667 0.9        0.93333333 0.96666667 1.         1.        ]

# 模型得分为平均得分
print('交叉验证平均得分：',scores.mean())

交叉验证平均得分： 0.9444444444444445

10.2 分层K折交叉验证

10.2.1 随机拆分

原理是，先从数据集中随机抽取一部分数据集作为训练集，再从其余部分随机抽取一部分作为测试集，进行评分后迭代，重复上述步骤，直到把希望的迭代次数跑完。

from sklearn.model_selection import ShuffleSplit
# 设置拆分的份数为10
shuffle_split = ShuffleSplit(test_size =.2, train_size =.7, n_splits = 10)
scores = cross_val_score(svc, wine.data, wine.target, cv=shuffle_split)
print('随机拆分交叉验证模型得分：\n{}'.format(scores))

随机拆分交叉验证模型得分：
[0.94444444 0.97222222 0.91666667 1.         0.97222222 0.972222221.         0.91666667 0.94444444 1.        ]

10.2.2 留一法

在交叉验证法中，令k等于数据集的个数。

from sklearn.model_selection import LeaveOneOut
cv = LeaveOneOut()
scores = cross_val_score(svc,wine.data,wine.target,cv=cv)
print('迭代次数：',len(scores))
print('模型平均得分：',scores.mean())

迭代次数： 178
模型平均得分： 0.9550561797752809

10.2.3 K折交叉验证怎么选定最终模型的？

10.3 网格搜索

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(wine.data,wine.target,random_state=0)
# 将需要遍历的参数定义为字典
params = {'alpha':[0.01,0.1,1.0,10.0],'max_iter':[100,1000,5000,10000]}
# 定义网格搜索需要的模型和参数
grid_search = GridSearchCV(Lasso(),params,cv=6)
grid_search.fit(X_train,y_train)
print('模型最高分：',grid_search.score(X_test,y_test))
print('最优参数：',grid_search.best_params_)

模型最高分： 0.819334891919453
最优参数： {'alpha': 0.01, 'max_iter': 100}

10.4 分类模型的可信度评估

10.4.1 分类模型的预测准确率（predict_proba）

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNBX,y=make_blobs(n_samples=200,random_state=1,centers=2,cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=68)
model = GaussianNB()
model.fit(X_train,y_train)
# 分类的准确率
predict_proba = model.predict_proba(X_test)
print('分类准确率的形态：',predict_proba.shape)
print(predict_proba[:5])
plt.scatter(X[:,0],X[:,1],c=y,cmap=plt.cm.cool,edgecolors='k')
plt.show()

分类准确率的形态： (50, 2)
[[0.98849996 0.01150004][0.0495985  0.9504015 ][0.01648034 0.98351966][0.8168274  0.1831726 ][0.00282471 0.99717529]]

10.4.2 分类模型的决定系数（decision_function）

在二分类中，负数代表分类2，正数代表分类1

from sklearn.svm import SVC
svc = SVC().fit(X_train,y_train)
# 获取SVC的决定系数
dec_func = svc.decision_function(X_test)
print(dec_func[:5])

[-1.36071347  1.53694862  1.78825594 -0.96133081  1.81826853]

10.4.3 模型的评分指标

回归问题的R2R^2R2指标

十一、模型的管道模型

11.1 管道模型基本概念

Pipeline可以将许多算法模型串联起来，比如将特征提取、归一化、分类组织在一起形成一个典型的机器学习问题工作流。主要带来两点好处：

1、直接调用fit和predict方法来对pipeline中的所有算法模型进行训练和预测。

2、可以结合grid search对参数进行选择。

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
X, y = make_blobs(n_samples=200,centers=2,cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=38)# 建立包含预处理和神经网络的管道模型
pipe = Pipeline([('scaler',StandardScaler()),('select_model',SelectFromModel(RandomForestRegressor(random_state=38))),('mlp',MLPClassifier(max_iter=2000,random_state = 38))])# 网格化寻优
params = {'mlp__hidden_layer_sizes':[(50,),(100,),(100,100)],'mlp__alpha':[0.0001,0.001,0.01,0.1]}
grid = GridSearchCV(pipeline,param_grid=params,cv=3)#拟合数据
grid.fit(X_train,y_train)
print("模型最佳得分：",grid.best_score_)
print("模型最佳参数：",grid.best_estimator_)
print('测试集得分：',grid.score(X_test,y_test))
print(pipe.steps)

模型最佳得分： 0.8466666666666667
模型最佳参数： Pipeline(steps=[('scaler', StandardScaler()),('mlp', MLPClassifier(max_iter=2000, random_state=38))])
测试集得分： 0.76
[('scaler', StandardScaler()), ('select_model', SelectFromModel(estimator=RandomForestRegressor(random_state=38))), ('mlp', MLPClassifier(max_iter=2000, random_state=38))]

11.2 使用管道模型进行模型选择和参数调优

# pipeline进行模型选择
params = [{'reg':[MLPRegressor(random_state=38)],'scaler':[StandardScaler(),None]},{'reg':[RandomForestRegressor(random_state=38)],'scaler':[None]}]
# pipeline实例化
pipe = Pipeline([('scaler',StandardScaler()),('reg',MLPRegressor())])
#网格搜索
grid = GridSearchCV(pipe,params,cv=3)
grid.fit(X,y)
print("模型最佳得分：",grid.best_score_)
print("模型最佳参数：",grid.best_params_)

模型最佳得分： 0.4566285156223768
模型最佳参数： {'reg': MLPRegressor(random_state=38), 'scaler': None}

11.3 使用管道模型进行参数调优

# pipeline进行模型选择
params = [{'reg':[MLPRegressor(random_state=38,max_iter=2000)],'scaler':[StandardScaler(),None],'reg__hidden_layer_sizes':[(50,),(100,),(100,100)]},{'reg':[RandomForestRegressor(random_state=38)],'scaler':[None],'reg__n_estimators':[10,50,100]}]
# pipeline实例化
pipe = Pipeline([('scaler',StandardScaler()),('reg',MLPRegressor())])
#网格搜索
grid = GridSearchCV(pipe,params,cv=3)
grid.fit(X,y)
print("模型最佳得分：",grid.best_score_)
print("模型最佳参数：",grid.best_params_)

模型最佳得分： 0.4882498100109817
模型最佳参数： {'reg': MLPRegressor(hidden_layer_sizes=(100, 100), max_iter=2000, random_state=38), 'reg__hidden_layer_sizes': (100, 100), 'scaler': StandardScaler()}

sklearn+机器学习相关推荐

sklearn机器学习：岭回归Ridge
在sklearn中,岭回归由线性模型库中的Ridge类来调用: Ridge类的格式 sklearn.linear_model.Ridge (alpha=1.0, fit_intercept=True, ...
Sklearn机器学习——ROC曲线、ROC曲线的绘制和AUC面积、运用ROC曲线找到最佳阈值
目录 1 ROC曲线 2 ROC曲线的绘制 2.1 Sklearn中的ROC曲线和AUC面积 2.2 利用ROC曲线找到最佳阈值 1 ROC曲线上篇博客介绍了ROC曲线的概率和阈值还有SVM实现概率 ...
数据分析实战：python热门音乐分析附代码+数据 +论文（PCA 主成分分析，sklearn 机器学习，pytorch 神经网络，k-means 聚类，Librosa 音频处理，midi 音序)
项目概述: 本选取了抖音当下最热门的 400 首音乐,通过一系列方法提取每首歌的波形特征,再经过降维以及机器学习等手段,进行无监督学习对音乐数据进行聚类的同时训练并使用监督学习分类器进行音乐流派分类, ...
python sklearn机器学习库安装
1.准备工作安装sklearn之前,我们需要先安装numpy,scipy函数库. Numpy下载地址:http://sourceforge.net/projects/numpy/files/NumP ...
《sklearn机器学习第二版》（加文海克著）学习笔记
本书附以sklearn机器学习示例程序,从调用函数的角度解释了常用的机器学习的方法,包括线性回归.逻辑回归.决策树.SVM.朴素贝叶斯.ANN.K-means.PCA.原理粗浅易懂,注重代码实践.本文 ...
用python+sklearn(机器学习)实现天气预报数据数据
用python+sklearn机器学习实现天气预报数据项目地址系列教程勘误表 0.前言 1.爬虫 a.确认要被爬取的网页网址 b.爬虫部分 c.网页内容匹配取出部分 d.写入csv文件格式化 ...
用python+sklearn(机器学习)实现天气预报数据模型和使用
用python+sklearn机器学习实现天气预报模型和使用项目地址系列教程 0.前言 1.建立模型 a.准备引入所需要的头文件选择模型选择评估方法获取数据集 b.建立模型 c.获取模型 ...
sklearn 相关性分析_用sklearn机器学习预测泰坦尼克号生存概率
前言本文为练手记录,适用于刚入门的朋友参照阅读练习,大神请绕道,谢谢! 阅读大约需要10分钟. 一.理解项目概况并提出问题 1.1 登陆官网查看项目概况 Titanic: Machine Learn ...
sklearn机器学习常用过程总结
由于前面对sklearn或多或少接触了一下,但是不深入,随着最近学习,我下面介绍一下机器学习常用过程. 1. 加载数据集 scikit-learn中自带了一些数据集,比如说最著名的Iris数据集. 数 ...
sklearn 机器学习 Pipeline 模板
文章目录 1. 导入工具包 2. 读取数据 3. 数字特征.文字特征分离 4. 数据处理Pipeline 5. 尝试不同的模型 6. 参数搜索 7. 特征重要性筛选 8. 最终完整Pipeline 使 ...

sklearn+机器学习