Case study: the California housing dataset (linear regression), Lasso, ridge regression, and binning for nonlinear problems


1. Import the required modules and libraries

from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import fetch_california_housing as fch # California housing dataset
import pandas as pd

2. Load and explore the data

housevalue = fch()
X = pd.DataFrame(housevalue.data)
X
0 1 2 3 4 5 6 7
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -121.09
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -121.21
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24

20640 rows × 8 columns

X.shape
(20640, 8)
y = housevalue.target
y
array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
y.shape
(20640,)
X.head()
0 1 2 3 4 5 6 7
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
housevalue.feature_names
['MedInc','HouseAge','AveRooms','AveBedrms','Population','AveOccup','Latitude','Longitude']
y.min()
0.14999
y.max()
5.00001
X.columns = housevalue.feature_names
X.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
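  • MedInc: median income of households in the block
  • HouseAge: median age of houses in the block
  • AveRooms: average number of rooms in the block
  • AveBedrms: average number of bedrooms in the block
  • Population: population of the block
  • AveOccup: average occupancy
  • Latitude: latitude of the block
  • Longitude: longitude of the block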

3. Split into training and test sets

Xtrain,Xtest,Ytrain,Ytest = train_test_split(X,y,test_size=0.3,random_state=420)
# Reset the index of the feature matrices
for i in [Xtrain,Xtest]:
    i.index = range(i.shape[0])
Xtrain.shape
(14448, 8)
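Why reset the index at all? A minimal sketch on a toy DataFrame (the names X_demo and y_demo and their values are made up for illustration, not taken from the housing data): train_test_split shuffles rows but keeps their original labels, so the training set's index is out of order until it is reset. The sketch uses reset_index(drop=True), which should have the same effect as the loop above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values, not the housing data)
X_demo = pd.DataFrame({"a": range(10), "b": range(10, 20)})
y_demo = pd.Series(range(10))

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=420)

# The split keeps the original row labels, in shuffled order
shuffled_index = list(Xtr.index)
print(shuffled_index)

# Replace the shuffled labels with 0..n-1, discarding the old ones
Xtr = Xtr.reset_index(drop=True)
Xte = Xte.reset_index(drop=True)
print(list(Xtr.index))
```

A clean 0..n-1 index makes later positional operations, such as concatenating predictions back onto the features, less error-prone.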

4. Build the model

reg = LR().fit(Xtrain,Ytrain) # instantiate and train the model
yhat = reg.predict(Xtest)
yhat
array([1.51384887, 0.46566247, 2.2567733 , ..., 2.11885803, 1.76968187,0.73219077])
yhat.min()
-0.6528439725036108
yhat.max()
7.1461982142708536

5. Explore the fitted model

reg.coef_ # w, the coefficient vector
array([ 4.37358931e-01,  1.02112683e-02, -1.07807216e-01,  6.26433828e-01,5.21612535e-07, -3.34850965e-03, -4.13095938e-01, -4.26210954e-01])
zip(Xtrain.columns,reg.coef_)
<zip at 0x1a1ddc21308>
[*zip(Xtrain.columns,reg.coef_)]
[('MedInc', 0.43735893059684006),('HouseAge', 0.010211268294493883),('AveRooms', -0.10780721617317668),('AveBedrms', 0.6264338275363759),('Population', 5.216125353348089e-07),('AveOccup', -0.003348509646333704),('Latitude', -0.4130959378947717),('Longitude', -0.4262109536208464)]
reg.intercept_ # intercept
-36.25689322920381
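To make the roles of coef_ and intercept_ concrete, here is a small sketch on synthetic data (X_demo, y_demo, and the true weights are assumptions chosen for illustration). It checks that a linear regression prediction is just the feature matrix times the coefficient vector, plus the intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1.5*x0 - 2.0*x1 + 0.5*x2 + 4.0 + small noise
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand: X w + b
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
```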

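3 Evaluation metrics for regression models

3.1 Did we predict the correct values?

The mean squared error (MSE) is essentially the RSS divided by the number of samples, which gives the average error per sample. With this average error, we can compare it against the range of our labels to obtain a fairly reliable basis for evaluation. In sklearn there are two ways to compute this metric: one is the function mean_squared_error in sklearn's model evaluation module metrics, and the other is the cross-validation function cross_val_score with its scoring parameter set to the mean squared error. With cross_val_score the scoring string is "neg_mean_squared_error", so the scores come back as negative numbers whose absolute value is the MSE.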

from sklearn.metrics import mean_squared_error as MSE
MSE(Ytest,yhat)
0.5309012639324571
Ytest.mean()
2.0819292877906976
# 10-fold cross-validation
cross_val_score(reg,X,y,cv=10,scoring="neg_mean_squared_error")
array([-0.48922052, -0.43335865, -0.8864377 , -0.39091641, -0.7479731 ,-0.52980278, -0.28798456, -0.77326441, -0.64305557, -0.3275106 ])
cross_val_score(reg,X,y,cv=10,scoring="neg_mean_squared_error").mean()
-0.5509524296956592
cross_val_score(reg,X,y,cv=10,scoring="neg_mean_absolute_error").mean()
-0.5445214393266326
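# List the available values for the scoring parameter
# (sklearn.metrics.SCORERS was removed in scikit-learn 1.3; newer versions use sklearn.metrics.get_scorer_names())
import sklearn
sorted(sklearn.metrics.SCORERS.keys())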
['accuracy','adjusted_mutual_info_score','adjusted_rand_score','average_precision','balanced_accuracy','completeness_score','explained_variance','f1','f1_macro','f1_micro','f1_samples','f1_weighted','fowlkes_mallows_score','homogeneity_score','jaccard','jaccard_macro','jaccard_micro','jaccard_samples','jaccard_weighted','max_error','mutual_info_score','neg_brier_score','neg_log_loss','neg_mean_absolute_error','neg_mean_gamma_deviance','neg_mean_poisson_deviance','neg_mean_squared_error','neg_mean_squared_log_error','neg_median_absolute_error','neg_root_mean_squared_error','normalized_mutual_info_score','precision','precision_macro','precision_micro','precision_samples','precision_weighted','r2','recall','recall_macro','recall_micro','recall_samples','recall_weighted','roc_auc','roc_auc_ovo','roc_auc_ovo_weighted','roc_auc_ovr','roc_auc_ovr_weighted','v_measure_score']
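The relationship between RSS, MSE, and the negated cross-validation score can be checked with a short sketch. The data here is synthetic (X_demo and y_demo are illustrative assumptions), so the numbers will not match the housing results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

# MSE by hand: residual sum of squares divided by the number of samples
rss = np.sum((y_demo - pred) ** 2)
mse_manual = rss / len(y_demo)
mse_sklearn = mean_squared_error(y_demo, pred)
print(mse_manual, mse_sklearn)

# cross_val_score returns the negated MSE, so flip the sign to read it as an error
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5,
                         scoring="neg_mean_squared_error")
mse_cv = -scores.mean()
print(scores, mse_cv)
```

sklearn negates error metrics so that, for every scorer, a larger value means a better model.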

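3.2 Did we fit enough of the information?

In R², the numerator is the sum of squared differences between the true and predicted values, that is, the total amount of information our model failed to capture; the denominator is the amount of information carried by the true labels. R² therefore measures 1 minus the proportion of information in the true labels that the model failed to capture, so the closer R² is to 1, the better.

R² can be called in three ways. The first is to import r2_score from metrics and score after passing in the true and predicted values. The second is to call the score interface of LinearRegression directly. The third is to pass "r2" as the scoring value in cross-validation.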

#Compute R2
from sklearn.metrics import r2_score
#Use Shift+Tab to check which argument goes first: r2_score(y_true, y_pred)
r2_score(Ytest,yhat)
0.6043668160178817
r2 = reg.score(Xtest,Ytest)
r2
0.6043668160178817
cross_val_score(reg,X,y,cv=10,scoring="r2").mean()
0.5110068610524557
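The definition of R² and the warning about argument order can both be illustrated with a few made-up numbers (y_true and y_pred below are hypothetical). Unlike MSE, R² is not symmetric: the denominator depends on which array is treated as the true labels.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

# R^2 = 1 - RSS / TSS
rss = np.sum((y_true - y_pred) ** 2)          # information the model missed
tss = np.sum((y_true - y_true.mean()) ** 2)   # information in the true labels
r2_manual = 1 - rss / tss

r2_right = r2_score(y_true, y_pred)     # correct order: true values first
r2_swapped = r2_score(y_pred, y_true)   # swapped order gives a different number

print(r2_manual, r2_right, r2_swapped)
```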

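We observe that the MSE on the California housing dataset is actually not a large number (about 0.5), yet our R² is not high. This suggests that the model fits the values of part of the data fairly well but fails to fit the distribution of the data correctly. Let's plot it to see whether that is the case. We can draw two curves on one figure, one for the true labels Ytest and one for the predictions yhat, each sorted in ascending order; the more the two curves overlap, the better the model fits.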

import matplotlib.pyplot as plt
sorted(Ytest)
[0.14999,0.14999,0.225,0.325,0.35,0.375,0.388,0.392,0.394,0.396,...]
plt.plot(range(len(Ytest)),sorted(Ytest),c='black',label='Data')
plt.plot(range(len(yhat)),sorted(yhat),c='red',label='Predict')
plt.legend()
plt.show()

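As we can see, although most of the data is fitted fairly well, there are large fitting errors at the beginning and end of the plot. If more data were distributed on the right side of the plot, our model would deviate further and further from the true labels. This matches what we mentioned earlier: the values were predicted correctly on a limited dataset, but the distribution of the data was not fitted correctly, so if more data entered the model, the labels would very likely be predicted incorrectly.

When R² is negative, it shows that the model fits the data very badly and is completely unusable. So a negative R² is a legitimate result. Of course, in real applications, if your linear regression model produces a negative R², that does not mean you should simply accept it. First check whether your modeling and data processing steps are correct: perhaps the data itself has been damaged, or perhaps there is a bug in the modeling process. For ensemble regressors, check whether the number of weak estimators is too small; random forests and boosted trees easily produce a negative R² when they have only two or three trees. If you have checked all the code and confirmed that the preprocessing is fine but R² is still negative, then linear regression is not suitable for your data, and you should try other algorithms.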
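A negative R² is easy to reproduce by hand. The sketch below uses three made-up labels and a deliberately bad prediction: whenever the residual sum of squares exceeds the total sum of squares, the model is doing worse than simply predicting the mean, and R² drops below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([3.0, 2.0, 1.0])          # predictions in reversed order
y_mean = np.full(3, y_true.mean())         # always predict the mean

# RSS = 8, TSS = 2, so R^2 = 1 - 8/2 = -3
r2_bad = r2_score(y_true, y_bad)
# Predicting the mean gives RSS = TSS, so R^2 = 0
r2_mean = r2_score(y_true, y_mean)

print(r2_bad, r2_mean)
```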

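4.2.1 Ridge regression for multicollinearity

Compared with linear regression, ridge regression has a few more parameters, but the only truly core parameter is the coefficient of the regularization term, α (alpha in sklearn). The other parameters are needed only when we want to solve ridge regression with a method other than least squares, and we usually do not touch them at all. So it is enough to understand how to use α.
Earlier, using linear regression on the California housing dataset, we found that the fit was about 60% on the training set and about 60% on the test set as well. Is this low fit caused by multicollinearity? In statistics, we would use the VIF or various tests to judge whether the data is collinear. In machine learning, however, we can use the model to judge: if the model's performance on a dataset shows no clear improvement across different regularization parameter values in ridge regression (for example, it stays flat or declines), then the data has no multicollinearity, and at most the features are somewhat correlated. Conversely, if performance shows a clear upward trend across ridge regularization values, the data has multicollinearity.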
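For comparison with the statistical route mentioned above, here is a rough sketch of the variance inflation factor (VIF). It is not part of the original walkthrough: the helper vif and the synthetic columns a, b, c are illustrative assumptions. Each feature is regressed on the remaining features, and VIF = 1 / (1 - R²) of that regression; a VIF far above 1 (a threshold of 10 is a common rule of thumb) hints at collinearity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif(X):
    """Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
    regressing column j on all the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.RandomState(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.05 * rng.normal(size=500)      # nearly a copy of a
vifs = vif(np.column_stack([a, b, c]))
print(vifs)   # columns a and c should be large, column b close to 1
```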

Next, let's verify this claim on the California housing dataset:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression, Lasso
from sklearn.model_selection import train_test_split as TTS
from sklearn.datasets import fetch_california_housing as fch
import matplotlib.pyplot as plt
housevalue = fch()
X = pd.DataFrame(housevalue.data)
y = housevalue.target
X.columns = ["住户收入中位数","房屋使用年代中位数","平均房间数目","平均卧室数目","街区人口","平均入住率","街区的纬度","街区的经度"]
X.head()
住户收入中位数 房屋使用年代中位数 平均房间数目 平均卧室数目 街区人口 平均入住率 街区的纬度 街区的经度
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
Xtrain,Xtest,Ytrain,Ytest = TTS(X,y,test_size=0.3,random_state=420)
#Reset the dataset indices
for i in [Xtrain,Xtest]:
    i.index = range(i.shape[0])
#Build the model with ridge regression
reg = Ridge(alpha=1).fit(Xtrain,Ytrain)
reg.score(Xtest,Ytest)
0.6043610352312276
#Under cross-validation, how do the ridge regression results change compared with linear regression?
alpharange = np.arange(1,1001,100)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    linear = LinearRegression()
    regs = cross_val_score(reg,X,y,cv=5,scoring = "r2").mean()
    linears = cross_val_score(linear,X,y,cv=5,scoring = "r2").mean()
    ridge.append(regs)
    lr.append(linears)
plt.plot(alpharange,ridge,color="red",label="Ridge")
plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Mean")
plt.legend()
plt.show()

#Refine the learning curve
#Under cross-validation, how do the ridge regression results change compared with linear regression?
alpharange = np.arange(1,201,10)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    linear = LinearRegression()
    regs = cross_val_score(reg,X,y,cv=5,scoring = "r2").mean()
    linears = cross_val_score(linear,X,y,cv=5,scoring = "r2").mean()
    ridge.append(regs)
    lr.append(linears)
plt.plot(alpharange,ridge,color="red",label="Ridge")
plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Mean")
plt.legend()
plt.show()

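We can see that on the California dataset the ridge regression result rises slightly and then drops sharply. We can say that the California housing dataset carries a very slight amount of collinearity. Once the regularization parameter α removes it, the model improves a little, but for the model as a whole this is a drop in the bucket. Past the point where multicollinearity is controlled, performance falls rapidly, evidently because the regularization is too strong and crowds out the space in which the parameters w would otherwise be estimated. This result shows that the core problem of the California dataset is not multicollinearity, and ridge regression cannot improve model performance.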
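To see how a large α crowds out the coefficients, here is a sketch of the textbook closed-form ridge solution w = (XᵀX + αI)⁻¹Xᵀy on synthetic data. The helper ridge_closed_form and the data are illustrative assumptions, and the intercept is switched off so that the formula lines up with sklearn's Ridge. As α grows, the norm of w shrinks toward zero.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=50)

def ridge_closed_form(X, y, alpha):
    # Hypothetical helper: solve (X^T X + alpha * I) w = X^T y, with no intercept
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

norms, matches = [], []
for alpha in [0.1, 10, 1000]:
    w_manual = ridge_closed_form(X_demo, y_demo, alpha)
    w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_demo, y_demo).coef_
    matches.append(np.allclose(w_manual, w_sklearn))   # same solution as sklearn
    norms.append(np.linalg.norm(w_manual))             # size of the coefficient vector

print(matches)
print(norms)   # decreasing as alpha increases
```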

In addition, as the regularization parameter gradually increases, we can observe how the variance of the model changes:

#How does the model variance change?
alpharange = np.arange(1,1001,100)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    linear = LinearRegression()
    varR = cross_val_score(reg,X,y,cv=5,scoring="r2").var()
    varLR = cross_val_score(linear,X,y,cv=5,scoring="r2").var()
    ridge.append(varR)
    lr.append(varLR)
plt.plot(alpharange,ridge,color="red",label="Ridge")
plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Variance")
plt.legend()
plt.show()

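We can see that the variance of the model rises quickly, but the variance itself is very small, and its change does not exceed 1/3 of the rise in R², so as long as the noise stays constant, the generalization error of the model may still have decreased to some extent. Although ridge regression and Lasso are not designed to improve model performance but to address multicollinearity, when α varies within a certain range, removing multicollinearity may improve the generalization ability of the model to some extent.
However, generalization ability has no directly measurable metric, so we can often only judge roughly whether it has improved by observing the model's accuracy metrics and variance. Let's look at a case where multicollinearity is more pronounced: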

# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
X = load_boston().data
y = load_boston().target
Xtrain,Xtest,Ytrain,Ytest = TTS(X,y,test_size=0.3,random_state=420)
#First look at how the variance changes
alpharange = np.arange(1,1001,100)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    linear = LinearRegression()
    varR = cross_val_score(reg,X,y,cv=5,scoring="r2").var()
    varLR = cross_val_score(linear,X,y,cv=5,scoring="r2").var()
    ridge.append(varR)
    lr.append(varLR)
plt.plot(alpharange,ridge,color="red",label="Ridge")
plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Variance")
plt.legend()
plt.show()

#Look at how R2 changes
alpharange = np.arange(1,1001,100)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    linear = LinearRegression()
    regs = cross_val_score(reg,X,y,cv=5,scoring = "r2").mean()
    linears = cross_val_score(linear,X,y,cv=5,scoring = "r2").mean()
    ridge.append(regs)
    lr.append(linears)
plt.plot(alpharange,ridge,color="red",label="Ridge")
plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Mean")
plt.legend()
plt.show()

#Refine the learning curve
alpharange = np.arange(100,300,10)
ridge, lr = [], []
for alpha in alpharange:
    reg = Ridge(alpha=alpha)
    #linear = LinearRegression()
    regs = cross_val_score(reg,X,y,cv=5,scoring = "r2").mean()
    #linears = cross_val_score(linear,X,y,cv=5,scoring = "r2").mean()
    ridge.append(regs)
    #lr.append(linears)
plt.plot(alpharange,ridge,color="red",label="Ridge")
#plt.plot(alpharange,lr,color="orange",label="LR")
plt.title("Mean")
plt.legend()
plt.show()

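We can see that, compared with the California housing dataset, the variance on the Boston housing dataset drops noticeably and the bias drops noticeably too, so ridge regression does play some role here and the generalization ability of the model may well improve.
Unfortunately, nobody wants multicollinearity in their data, so datasets published with scikit-learn or on Kaggle have mostly been processed for multicollinearity to some degree. Finding data that definitely has multicollinearity is very hard, so we cannot show ridge regression at its best on real data. We may be able to find data with some correlation, but if you try it you will find that with ridge regression or Lasso the model's performance basically goes down and rarely goes up, which is probably why ridge regression and Lasso are somewhat neglected in machine learning.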

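4.2.3 Choosing the best regularization parameter

A ridge trace plot has the regularization parameter on the horizontal axis and the coefficients w solved by the linear model on the vertical axis, where each colored line is one coefficient w. Its goal is to establish a direct relationship between the regularization parameter and the coefficients w, so that we can observe how changes in the regularization parameter affect the fitted coefficients. The ridge trace plot holds that the more the lines cross, the higher the multicollinearity among the features. We should choose the α value at the "trumpet mouth", where the coefficients become fairly stable, as the best regularization parameter. Drawing a ridge trace plot is very simple; the code is as follows: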

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
#Create a 10*10 Hilbert matrix
X = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
y = np.ones(10)
X
array([[1.        , 0.5       , 0.33333333, 0.25      , 0.2       ,0.16666667, 0.14285714, 0.125     , 0.11111111, 0.1       ],[0.5       , 0.33333333, 0.25      , 0.2       , 0.16666667,0.14285714, 0.125     , 0.11111111, 0.1       , 0.09090909],[0.33333333, 0.25      , 0.2       , 0.16666667, 0.14285714,0.125     , 0.11111111, 0.1       , 0.09090909, 0.08333333],[0.25      , 0.2       , 0.16666667, 0.14285714, 0.125     ,0.11111111, 0.1       , 0.09090909, 0.08333333, 0.07692308],[0.2       , 0.16666667, 0.14285714, 0.125     , 0.11111111,0.1       , 0.09090909, 0.08333333, 0.07692308, 0.07142857],[0.16666667, 0.14285714, 0.125     , 0.11111111, 0.1       ,0.09090909, 0.08333333, 0.07692308, 0.07142857, 0.06666667],[0.14285714, 0.125     , 0.11111111, 0.1       , 0.09090909,0.08333333, 0.07692308, 0.07142857, 0.06666667, 0.0625    ],[0.125     , 0.11111111, 0.1       , 0.09090909, 0.08333333,0.07692308, 0.07142857, 0.06666667, 0.0625    , 0.05882353],[0.11111111, 0.1       , 0.09090909, 0.08333333, 0.07692308,0.07142857, 0.06666667, 0.0625    , 0.05882353, 0.05555556],[0.1       , 0.09090909, 0.08333333, 0.07692308, 0.07142857,0.06666667, 0.0625    , 0.05882353, 0.05555556, 0.05263158]])
#Compute the x-axis values
n_alphas = 200
alphas = np.logspace(-10, -2, n_alphas)
alphas
array([1.00000000e-10, 1.09698580e-10, 1.20337784e-10, 1.32008840e-10,1.44811823e-10, 1.58856513e-10, 1.74263339e-10, 1.91164408e-10,2.09704640e-10, 2.30043012e-10, 2.52353917e-10, 2.76828663e-10,3.03677112e-10, 3.33129479e-10, 3.65438307e-10, 4.00880633e-10,4.39760361e-10, 4.82410870e-10, 5.29197874e-10, 5.80522552e-10,6.36824994e-10, 6.98587975e-10, 7.66341087e-10, 8.40665289e-10,9.22197882e-10, 1.01163798e-09, 1.10975250e-09, 1.21738273e-09,1.33545156e-09, 1.46497140e-09, 1.60705282e-09, 1.76291412e-09,1.93389175e-09, 2.12145178e-09, 2.32720248e-09, 2.55290807e-09,2.80050389e-09, 3.07211300e-09, 3.37006433e-09, 3.69691271e-09,4.05546074e-09, 4.44878283e-09, 4.88025158e-09, 5.35356668e-09,5.87278661e-09, 6.44236351e-09, 7.06718127e-09, 7.75259749e-09,8.50448934e-09, 9.32930403e-09, 1.02341140e-08, 1.12266777e-08,1.23155060e-08, 1.35099352e-08, 1.48202071e-08, 1.62575567e-08,1.78343088e-08, 1.95639834e-08, 2.14614120e-08, 2.35428641e-08,2.58261876e-08, 2.83309610e-08, 3.10786619e-08, 3.40928507e-08,3.73993730e-08, 4.10265811e-08, 4.50055768e-08, 4.93704785e-08,5.41587138e-08, 5.94113398e-08, 6.51733960e-08, 7.14942899e-08,7.84282206e-08, 8.60346442e-08, 9.43787828e-08, 1.03532184e-07,1.13573336e-07, 1.24588336e-07, 1.36671636e-07, 1.49926843e-07,1.64467618e-07, 1.80418641e-07, 1.97916687e-07, 2.17111795e-07,2.38168555e-07, 2.61267523e-07, 2.86606762e-07, 3.14403547e-07,3.44896226e-07, 3.78346262e-07, 4.15040476e-07, 4.55293507e-07,4.99450512e-07, 5.47890118e-07, 6.01027678e-07, 6.59318827e-07,7.23263390e-07, 7.93409667e-07, 8.70359136e-07, 9.54771611e-07,1.04737090e-06, 1.14895100e-06, 1.26038293e-06, 1.38262217e-06,1.51671689e-06, 1.66381689e-06, 1.82518349e-06, 2.00220037e-06,2.19638537e-06, 2.40940356e-06, 2.64308149e-06, 2.89942285e-06,3.18062569e-06, 3.48910121e-06, 3.82749448e-06, 4.19870708e-06,4.60592204e-06, 5.05263107e-06, 5.54266452e-06, 6.08022426e-06,6.66991966e-06, 7.31680714e-06, 8.02643352e-06, 8.80488358e-06,9.65883224e-06, 1.05956018e-05, 
1.16232247e-05, 1.27505124e-05,1.39871310e-05, 1.53436841e-05, 1.68318035e-05, 1.84642494e-05,2.02550194e-05, 2.22194686e-05, 2.43744415e-05, 2.67384162e-05,2.93316628e-05, 3.21764175e-05, 3.52970730e-05, 3.87203878e-05,4.24757155e-05, 4.65952567e-05, 5.11143348e-05, 5.60716994e-05,6.15098579e-05, 6.74754405e-05, 7.40196000e-05, 8.11984499e-05,8.90735464e-05, 9.77124154e-05, 1.07189132e-04, 1.17584955e-04,1.28989026e-04, 1.41499130e-04, 1.55222536e-04, 1.70276917e-04,1.86791360e-04, 2.04907469e-04, 2.24780583e-04, 2.46581108e-04,2.70495973e-04, 2.96730241e-04, 3.25508860e-04, 3.57078596e-04,3.91710149e-04, 4.29700470e-04, 4.71375313e-04, 5.17092024e-04,5.67242607e-04, 6.22257084e-04, 6.82607183e-04, 7.48810386e-04,8.21434358e-04, 9.01101825e-04, 9.88495905e-04, 1.08436597e-03,1.18953407e-03, 1.30490198e-03, 1.43145894e-03, 1.57029012e-03,1.72258597e-03, 1.88965234e-03, 2.07292178e-03, 2.27396575e-03,2.49450814e-03, 2.73644000e-03, 3.00183581e-03, 3.29297126e-03,3.61234270e-03, 3.96268864e-03, 4.34701316e-03, 4.76861170e-03,5.23109931e-03, 5.73844165e-03, 6.29498899e-03, 6.90551352e-03,7.57525026e-03, 8.30994195e-03, 9.11588830e-03, 1.00000000e-02])
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['simhei'] #display Chinese characters
plt.rcParams['axes.unicode_minus']=False # display the minus sign correctly
%matplotlib inline
#Fit a model and collect the coefficients for each regularization value
coefs = []
for a in alphas:
    ridge = linear_model.Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)
#Plot the results
ax = plt.gca() # plt.plot() actually gets the current Axes object ax via plt.gca() and then calls ax.plot() to do the real drawing.
ax.plot(alphas, coefs)
ax.set_xscale('log')
ax.set_xlim(ax.get_xlim()[::-1])#reverse the x-axis
plt.xlabel('正则化参数alpha')
plt.ylabel('系数w')
plt.title('岭回归下的岭迹图')
plt.axis('tight')
plt.show()
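Why does the ridge trace example use a Hilbert matrix? One way to see it, offered here as a supplementary sketch rather than part of the original walkthrough, is the condition number: the Hilbert matrix is a classic ill-conditioned matrix whose columns are almost linearly dependent, which is exactly the multicollinearity ridge regression is meant to tame. The random matrix below is only an assumed baseline for comparison.

```python
import numpy as np

# Same construction as above: entry (i, j) is 1 / (i + j + 1)
X_hilbert = 1. / (np.arange(1, 11) + np.arange(0, 10)[:, np.newaxis])
cond_hilbert = np.linalg.cond(X_hilbert)

# Baseline: a random Gaussian matrix of the same size
rng = np.random.RandomState(0)
cond_random = np.linalg.cond(rng.normal(size=(10, 10)))

print(cond_hilbert)   # extremely large
print(cond_random)    # many orders of magnitude smaller
```

With a matrix this ill-conditioned, tiny changes in α move the least-squares coefficients a great deal, which is what produces the fanning lines in the ridge trace plot.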

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV, LinearRegression
from sklearn.model_selection import train_test_split as TTS
from sklearn.datasets import fetch_california_housing as fch
import matplotlib.pyplot as plt
housevalue = fch()
X = pd.DataFrame(housevalue.data)
y = housevalue.target
X.columns = ["住户收入中位数","房屋使用年代中位数","平均房间数目","平均卧室数目","街区人口","平均入住率","街区的纬度","街区的经度"]
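# Note: recent scikit-learn releases rename store_cv_values to store_cv_results (and cv_values_ to cv_results_)
Ridge_ = RidgeCV(alphas=np.arange(1,1001,100),store_cv_values=True).fit(X, y)
#Ridge regression score, not a cross-validation result
Ridge_.score(X,y)
0.6060251767338429
#Retrieve all cross-validation results
Ridge_.cv_values_.shape
(20640, 10)
#Averaging gives the cross-validation result for each regularization coefficient
Ridge_.cv_values_.mean(axis=0)
array([0.52823795, 0.52787439, 0.52807763, 0.52855759, 0.52917958,0.52987689, 0.53061486, 0.53137481, 0.53214638, 0.53292369])
#Check the best regularization coefficient that was selected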
Ridge_.alpha_
101
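What RidgeCV automates can be sketched by hand: score each candidate α with cross-validation and keep the one with the lowest mean error. The data and the candidate list below are illustrative assumptions, and the sketch uses 5-fold cross-validation for both routes, whereas the call above relies on RidgeCV's default leave-one-out scheme.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=120)

alphas = [0.01, 0.1, 1, 10, 100]

# Manual search: mean 5-fold MSE for each candidate alpha
mean_mse = [
    -cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=5,
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
]
best_manual = alphas[int(np.argmin(mean_mse))]

# RidgeCV runs the search internally; cv=5 replaces the default leave-one-out
best_ridgecv = RidgeCV(alphas=alphas, cv=5,
                       scoring="neg_mean_squared_error").fit(X_demo, y_demo).alpha_

print(mean_mse)
print(best_manual, best_ridgecv)
```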

4.3.2 The core use of Lasso: feature selection

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, LinearRegression, Lasso
from sklearn.model_selection import train_test_split as TTS
from sklearn.datasets import fetch_california_housing as fch
import matplotlib.pyplot as plt
housevalue = fch()
X = pd.DataFrame(housevalue.data)
y = housevalue.target
X.columns = ["住户收入中位数","房屋使用年代中位数","平均房间数目","平均卧室数目","街区人口","平均入住率","街区的纬度","街区的经度"]
X.head()
住户收入中位数 房屋使用年代中位数 平均房间数目 平均卧室数目 街区人口 平均入住率 街区的纬度 街区的经度
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
Xtrain,Xtest,Ytrain,Ytest = TTS(X,y,test_size=0.3,random_state=420)
#Reset the indices
for i in [Xtrain,Xtest]:
    i.index = range(i.shape[0])
#Fit with linear regression
reg = LinearRegression().fit(Xtrain,Ytrain)
(reg.coef_*100).tolist()
[43.735893059684,1.0211268294493883,-10.780721617317667,62.64338275363759,5.216125353348089e-05,-0.3348509646333704,-41.30959378947717,-42.62109536208464]
reg.coef_*100
array([ 4.37358931e+01,  1.02112683e+00, -1.07807216e+01,  6.26433828e+01,5.21612535e-05, -3.34850965e-01, -4.13095938e+01, -4.26210954e+01])
#Fit with ridge regression
Ridge_ = Ridge(alpha=0).fit(Xtrain,Ytrain)
(Ridge_.coef_*100).tolist()
[43.735893059684024,1.0211268294494151,-10.780721617317592,62.64338275363727,5.2161253532709486e-05,-0.3348509646333586,-41.30959378947672,-42.62109536208427]
#Fit with Lasso
lasso_ = Lasso(alpha=0).fit(Xtrain,Ytrain)
(lasso_.coef_*100).tolist()
D:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: UserWarning: With alpha=0, this algorithm does not converge well. You are advised to use the LinearRegression estimator
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:476: UserWarning: Coordinate descent with no regularization may lead to unexpected results and is discouraged.
  positive)
D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:476: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 3769.8607714139175, tolerance: 1.9172554769131482
  positive)
[43.73589305968398,1.0211268294494045,-10.780721617317642,62.64338275363768,5.2161253532676174e-05,-0.33485096463335784,-41.30959378947721,-42.62109536208479]

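We can see that ridge regression raised no error, but Lasso is different: although it still computed the coefficients, it printed three full warning messages.

The three messages say the following:

    1. The regularization coefficient is 0, so the algorithm does not converge well. If you want a regularization coefficient of 0, use linear regression instead.
    2. Coordinate descent without a regularization term may lead to unexpected results and is discouraged.
    3. The objective function did not converge. You may want to increase the number of iterations; fitting the model with a very small alpha may cause precision problems.

With coordinate descent come issues of iteration and convergence, so sklearn does not recommend a regularization coefficient of 0. If we really want something close to 0, we can use a very small number, such as 0.01: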

#Fit with ridge regression
Ridge_ = Ridge(alpha=0.01).fit(Xtrain,Ytrain)
(Ridge_.coef_*100).tolist()
[43.735757206215965,1.0211292318121794,-10.780460336251618,62.64202320775656,5.217068073242219e-05,-0.3348506517067619,-41.309571432291364,-42.62105388932401]
#Fit with Lasso
lasso_ = Lasso(alpha=0.01).fit(Xtrain,Ytrain)
(lasso_.coef_*100).tolist()
[40.105683718344864,1.0936292607860143,-3.7423763610244585,26.524037834897218,0.00035253685115039596,-0.3207129394887797,-40.064830473448424,-40.81754399163315]
#Increase the regularization coefficient and observe how the model coefficients change
Ridge_ = Ridge(alpha=10**4).fit(Xtrain,Ytrain)
(Ridge_.coef_*100).tolist()
[34.62081517607694,1.5196170869238694,0.3968610529210159,0.9151812510354821,0.002173923801224847,-0.3476866014810102,-14.736963474215234,-13.435576102526895]
lasso_ = Lasso(alpha=10**4).fit(Xtrain,Ytrain)
(lasso_.coef_*100).tolist()
[0.0, 0.0, 0.0, -0.0, -0.0, -0.0, -0.0, -0.0]
#It seems 10**4 is far too large a value for Lasso
lasso_ = Lasso(alpha=1).fit(Xtrain,Ytrain)
(lasso_.coef_*100).tolist()
[14.581141247629418,0.6209347344423873,0.0,-0.0,-0.0002806598632901005,-0.0,-0.0,-0.0]
#Plot the coefficients
plt.plot(range(1,9),(reg.coef_*100).tolist(),color="red",label="LR")
plt.plot(range(1,9),(Ridge_.coef_*100).tolist(),color="orange",label="Ridge")
plt.plot(range(1,9),(lasso_.coef_*100).tolist(),color="k",label="Lasso")
plt.plot(range(1,9),[0]*8,color="grey",linestyle="--")
plt.xlabel('w') #the x-axis is the coefficient of each feature
plt.legend()
plt.show()

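As we can see, compared with ridge regression, the L1 regularization term in Lasso penalizes the coefficients much more heavily and compresses coefficients all the way to 0, so it can be used for feature selection. For the same reason, we usually let Lasso's regularization coefficient α vary within a very small range when searching for the best value.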
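The feature-selection effect is easiest to see on data where we know which features matter. In the sketch below (synthetic data; the shapes, weights, and alpha value are illustrative assumptions), only the first two of ten features influence the target. Lasso should set most or all of the irrelevant coefficients to exactly zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 10))
# Only the first two features carry signal
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
ridge = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print(lasso.coef_)
print(ridge.coef_)
print(n_zero_lasso, n_zero_ridge)
```

The surviving nonzero Lasso coefficients identify the selected features, though their magnitudes are shrunk relative to the true weights.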

4.3.3 Choosing the best regularization parameter

from sklearn.linear_model import LassoCV
#Define our own range of alpha values for Lasso to choose from
alpharange = np.logspace(-10, -2, 200,base=10)
#This actually builds powers of 10,
#from 10**(-10) to 10**(-2)
alpharange.shape
(200,)
Xtrain.head()
住户收入中位数 房屋使用年代中位数 平均房间数目 平均卧室数目 街区人口 平均入住率 街区的纬度 街区的经度
0 4.1776 35.0 4.425172 1.030683 5380.0 3.368817 37.48 -122.19
1 5.3261 38.0 6.267516 1.089172 429.0 2.732484 37.53 -122.30
2 1.9439 26.0 5.768977 1.141914 891.0 2.940594 36.02 -119.08
3 2.5000 22.0 4.916000 1.012000 733.0 2.932000 38.57 -121.31
4 3.8250 34.0 5.036765 1.098039 1134.0 2.779412 33.91 -118.35
lasso_ = LassoCV(alphas=alpharange #user-supplied range of alpha values
                ,cv=5 #number of cross-validation folds
                ).fit(Xtrain, Ytrain)
#Check the best regularization coefficient that was selected
lasso_.alpha_
0.0020729217795953697
#Retrieve all cross-validation results
lasso_.mse_path_
array([[0.52454913, 0.49856261, 0.55984312, 0.50526576, 0.55262557],[0.52361933, 0.49748809, 0.55887637, 0.50429373, 0.55283734],[0.52281927, 0.49655113, 0.55803797, 0.5034594 , 0.55320522],[0.52213811, 0.49574741, 0.55731858, 0.50274517, 0.55367515],[0.52155715, 0.49505688, 0.55669995, 0.50213252, 0.55421553],[0.52106069, 0.49446226, 0.55616707, 0.50160604, 0.55480104],[0.5206358 , 0.49394903, 0.55570702, 0.50115266, 0.55541214],[0.52027135, 0.49350539, 0.55530895, 0.50076146, 0.55603333],[0.51995825, 0.49312085, 0.5549639 , 0.50042318, 0.55665306],[0.5196886 , 0.49278705, 0.55466406, 0.50013007, 0.55726225],[0.51945602, 0.49249647, 0.55440306, 0.49987554, 0.55785451],[0.51925489, 0.49224316, 0.55417527, 0.49965404, 0.55842496],[0.51908068, 0.49202169, 0.55397615, 0.49946088, 0.55897049],[0.51892938, 0.49182782, 0.55380162, 0.49929206, 0.55948886],[0.51879778, 0.49165759, 0.55364841, 0.49914421, 0.55997905],[0.51868299, 0.49150788, 0.55351357, 0.49901446, 0.5604405 ],[0.51858268, 0.49137604, 0.55339469, 0.49890035, 0.56087323],[0.51849488, 0.49125956, 0.55328972, 0.4987998 , 0.56127784],[0.5184178 , 0.49115652, 0.55319678, 0.49871101, 0.56165507],[0.51835002, 0.49106526, 0.55311438, 0.49863248, 0.5620059 ],[0.51829033, 0.49098418, 0.55304118, 0.49856287, 0.56233145],[0.51823761, 0.49091208, 0.55297609, 0.49850108, 0.56263308],[0.51819098, 0.49084785, 0.55291806, 0.49844612, 0.56291204],[0.51814966, 0.49079058, 0.55286626, 0.49839716, 0.56316966],[0.51811298, 0.49073937, 0.55281996, 0.49835348, 0.56340721],[0.51808038, 0.49069355, 0.55277854, 0.49831445, 0.5636261 ],[0.51805132, 0.49065249, 0.5527414 , 0.49827953, 0.56382754],[0.5180254 , 0.49061566, 0.55270806, 0.49824828, 0.56401276],[0.51800224, 0.49058258, 0.55267812, 0.49822015, 0.56418292],[0.51798152, 0.49055285, 0.55265118, 0.49819493, 0.56433912],[0.51796296, 0.49052608, 0.55262693, 0.49817225, 0.56448243],[0.5179463 , 0.49050195, 0.55260507, 0.49815185, 0.56461379],[0.51793135, 0.49048019, 0.55258536, 
0.49813345, 0.5647342 ],[0.51791791, 0.49046055, 0.55256757, 0.49811687, 0.56484448],[0.5179058 , 0.49044281, 0.55255149, 0.4981019 , 0.56494544],[0.5178949 , 0.49042677, 0.55253695, 0.49808838, 0.56503784],[0.51788506, 0.49041226, 0.55252379, 0.49807615, 0.56512236],[0.51787619, 0.49039913, 0.55251189, 0.4980651 , 0.56519967],[0.51786817, 0.49038724, 0.5525011 , 0.49805509, 0.56527034],[0.51786092, 0.49037646, 0.55249132, 0.49804603, 0.56533494],[0.51785437, 0.49036669, 0.55248246, 0.49803782, 0.56539397],[0.51784843, 0.49035783, 0.55247442, 0.49803037, 0.5654479 ],[0.51784306, 0.49034979, 0.55246712, 0.49802362, 0.56549716],[0.51783819, 0.49034249, 0.5524605 , 0.49801749, 0.56554215],[0.51783377, 0.49033586, 0.55245448, 0.49801193, 0.56558322],[0.51782977, 0.49032984, 0.55244901, 0.49800688, 0.56562073],[0.51782614, 0.49032437, 0.55244405, 0.49800229, 0.56565496],[0.51782284, 0.49031939, 0.55243953, 0.49799812, 0.56568621],[0.51781984, 0.49031487, 0.55243543, 0.49799434, 0.56571472],[0.51781712, 0.49031076, 0.55243169, 0.49799089, 0.56574074],[0.51781465, 0.49030702, 0.5524283 , 0.49798776, 0.56576449],[0.5178124 , 0.49030362, 0.55242521, 0.49798491, 0.56578615],[0.51781036, 0.49030052, 0.5524224 , 0.49798232, 0.56580591],[0.5178085 , 0.4902977 , 0.55241984, 0.49797996, 0.56582394],[0.51780681, 0.49029514, 0.55241751, 0.49797781, 0.56584039],[0.51780528, 0.4902928 , 0.55241539, 0.49797586, 0.56585539],[0.51780388, 0.49029068, 0.55241346, 0.49797408, 0.56586907],[0.51780261, 0.49028874, 0.55241171, 0.49797246, 0.56588155],[0.51780145, 0.49028698, 0.55241011, 0.49797099, 0.56589293],[0.51780039, 0.49028538, 0.55240865, 0.49796965, 0.56590331],[0.51779943, 0.49028392, 0.55240732, 0.49796843, 0.56591277],[0.51779856, 0.49028258, 0.55240611, 0.49796731, 0.5659214 ],[0.51779777, 0.49028137, 0.55240501, 0.4979663 , 0.56592927],[0.51779704, 0.49028027, 0.55240401, 0.49796538, 0.56593645],[0.51779638, 0.49027926, 0.5524031 , 0.49796454, 0.56594299],[0.51779578, 
0.49027834, 0.55240226, 0.49796377, 0.56594896],[0.51779523, 0.49027751, 0.55240151, 0.49796307, 0.5659544 ],[0.51779473, 0.49027675, 0.55240081, 0.49796243, 0.56595936],[0.51779428, 0.49027605, 0.55240018, 0.49796185, 0.56596388],[0.51779386, 0.49027542, 0.55239961, 0.49796133, 0.565968  ],[0.51779349, 0.49027485, 0.55239909, 0.49796085, 0.56597176],[0.51779314, 0.49027432, 0.55239861, 0.49796041, 0.56597519],[0.51779283, 0.49027384, 0.55239818, 0.49796001, 0.56597831],[0.51779254, 0.49027341, 0.55239778, 0.49795964, 0.56598116],[0.51779228, 0.49027301, 0.55239742, 0.49795931, 0.56598376],[0.51779205, 0.49027265, 0.55239709, 0.49795901, 0.56598613],[0.51779183, 0.49027232, 0.55239679, 0.49795873, 0.56598828],[0.51779163, 0.49027202, 0.55239652, 0.49795848, 0.56599025],[0.51779146, 0.49027174, 0.55239627, 0.49795825, 0.56599205],[0.51779129, 0.49027149, 0.55239604, 0.49795804, 0.56599368],[0.51779114, 0.49027127, 0.55239584, 0.49795785, 0.56599517],[0.51779101, 0.49027106, 0.55239565, 0.49795768, 0.56599653],[0.51779088, 0.49027087, 0.55239548, 0.49795752, 0.56599777],[0.51779077, 0.4902707 , 0.55239532, 0.49795738, 0.5659989 ],[0.51779067, 0.49027054, 0.55239518, 0.49795725, 0.56599993],[0.51779057, 0.4902704 , 0.55239505, 0.49795713, 0.56600087],[0.51779049, 0.49027027, 0.55239493, 0.49795702, 0.56600172],[0.51779041, 0.49027015, 0.55239482, 0.49795692, 0.5660025 ],[0.51779034, 0.49027004, 0.55239472, 0.49795683, 0.56600322],[0.51779027, 0.49026994, 0.55239463, 0.49795675, 0.56600386],[0.51779022, 0.49026985, 0.55239455, 0.49795667, 0.56600446],[0.51779016, 0.49026977, 0.55239448, 0.4979566 , 0.56600499],[0.51779011, 0.49026969, 0.55239441, 0.49795654, 0.56600549],[0.51779007, 0.49026962, 0.55239435, 0.49795648, 0.56600593],[0.51779003, 0.49026956, 0.55239429, 0.49795643, 0.56600634],[0.51778999, 0.49026951, 0.55239424, 0.49795638, 0.56600671],[0.51778996, 0.49026945, 0.55239419, 0.49795634, 0.56600705],[0.51778993, 0.49026941, 0.55239415, 0.4979563 , 
0.56600736],[0.5177899 , 0.49026936, 0.55239411, 0.49795626, 0.56600764],[0.51778987, 0.49026932, 0.55239407, 0.49795623, 0.5660079 ],[0.51778985, 0.49026929, 0.55239404, 0.4979562 , 0.56600813],[0.51778983, 0.49026926, 0.55239401, 0.49795617, 0.56600835],[0.51778981, 0.49026923, 0.55239398, 0.49795615, 0.56600854],[0.51778979, 0.4902692 , 0.55239396, 0.49795613, 0.56600872],[0.51778977, 0.49026918, 0.55239394, 0.49795611, 0.56600888],[0.51778976, 0.49026915, 0.55239392, 0.49795609, 0.56600903],[0.51778975, 0.49026913, 0.5523939 , 0.49795607, 0.56600916],[0.51778973, 0.49026911, 0.55239388, 0.49795605, 0.56600929],[0.51778972, 0.4902691 , 0.55239387, 0.49795604, 0.5660094 ],[0.51778971, 0.49026908, 0.55239385, 0.49795603, 0.5660095 ],[0.5177897 , 0.49026907, 0.55239384, 0.49795602, 0.56600959],[0.5177897 , 0.49026905, 0.55239383, 0.49795601, 0.56600968],[0.51778969, 0.49026904, 0.55239382, 0.497956  , 0.56600975],[0.51778968, 0.49026903, 0.55239381, 0.49795599, 0.56600983],[0.51778967, 0.49026902, 0.5523938 , 0.49795598, 0.56600989],[0.51778967, 0.49026901, 0.55239379, 0.49795597, 0.56600995],[0.51778966, 0.490269  , 0.55239378, 0.49795596, 0.56601   ],[0.51778966, 0.490269  , 0.55239378, 0.49795596, 0.56601005],[0.51778965, 0.49026899, 0.55239377, 0.49795595, 0.56601009],[0.51778965, 0.49026898, 0.55239376, 0.49795595, 0.56601013],[0.51778965, 0.49026898, 0.55239376, 0.49795594, 0.56601017],[0.51778964, 0.49026897, 0.55239375, 0.49795594, 0.5660102 ],[0.51778964, 0.49026897, 0.55239375, 0.49795593, 0.56601023],[0.51778964, 0.49026896, 0.55239375, 0.49795593, 0.56601026],[0.51778963, 0.49026896, 0.55239374, 0.49795593, 0.56601029],[0.51778963, 0.49026896, 0.55239374, 0.49795592, 0.56601031],[0.51778963, 0.49026895, 0.55239374, 0.49795592, 0.56601033],[0.51778963, 0.49026895, 0.55239373, 0.49795592, 0.56601035],[0.51778963, 0.49026895, 0.55239373, 0.49795592, 0.56601037],[0.51778962, 0.49026895, 0.55239373, 0.49795591, 0.56601039],[0.51778962, 0.49026894, 
0.55239373, 0.49795591, 0.5660104 ],[0.51778962, 0.49026894, 0.55239372, 0.49795591, 0.56601041],[0.51778962, 0.49026894, 0.55239372, 0.49795591, 0.56601043],[0.51778962, 0.49026894, 0.55239372, 0.49795591, 0.56601044],[0.51778962, 0.49026894, 0.55239372, 0.49795591, 0.56601045],[0.51778962, 0.49026894, 0.55239372, 0.49795591, 0.56601046],[0.51778962, 0.49026893, 0.55239372, 0.4979559 , 0.56601046],[0.51778962, 0.49026893, 0.55239372, 0.4979559 , 0.56601047],[0.51778962, 0.49026893, 0.55239372, 0.4979559 , 0.56601048],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601048],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601049],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.5660105 ],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.5660105 ],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.5660105 ],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601051],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601051],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601052],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601052],[0.51778961, 0.49026893, 0.55239371, 0.4979559 , 0.56601052],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601052],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601053],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601053],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601053],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601053],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601053],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.4979559 , 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 
0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601054],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 
0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055],[0.51778961, 0.49026892, 0.55239371, 0.49795589, 0.56601055]])
lasso_.mse_path_.shape # cross-validation results of the 5 folds for each alpha
(200, 5)
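lasso_.mse_path_.mean(axis=1) # Did you notice that in ridge regression we averaged along axis=0?
# In ridge regression we used leave-one-out validation, so the returned results hold, for every sample, its validation result under each alpha
# To get the mean validation result for each alpha there, we average with axis=0, i.e. across rows
# Here, the returned results hold, for every alpha, the result of each cross-validation fold
# To get the mean validation result for each alpha here, we average with axis=1, i.e. across columns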
array([0.52816924, 0.52742297, 0.5268146 , 0.52632488, 0.52593241,0.52561942, 0.52537133, 0.5251761 , 0.52502385, 0.52490641,0.52481712, 0.52475046, 0.52470198, 0.52466795, 0.52464541,0.52463188, 0.5246254 , 0.52462436, 0.52462744, 0.52463361,0.52464201, 0.52465199, 0.52466301, 0.52467466, 0.5246866 ,0.5246986 , 0.52471046, 0.52472203, 0.5247332 , 0.52474392,0.52475413, 0.52476379, 0.52477291, 0.52478147, 0.52478949,0.52479697, 0.52480393, 0.52481039, 0.52481639, 0.52482193,0.52482706, 0.52483179, 0.52483615, 0.52484016, 0.52484385,0.52484725, 0.52485036, 0.52485322, 0.52485584, 0.52485824,0.52486044, 0.52486246, 0.5248643 , 0.52486599, 0.52486753,0.52486895, 0.52487024, 0.52487141, 0.52487249, 0.52487348,0.52487437, 0.52487519, 0.52487594, 0.52487663, 0.52487725,0.52487782, 0.52487834, 0.52487882, 0.52487925, 0.52487965,0.52488001, 0.52488033, 0.52488063, 0.52488091, 0.52488116,0.52488138, 0.52488159, 0.52488178, 0.52488195, 0.52488211,0.52488225, 0.52488239, 0.5248825 , 0.52488261, 0.52488271,0.5248828 , 0.52488289, 0.52488296, 0.52488303, 0.52488309,0.52488315, 0.5248832 , 0.52488325, 0.52488329, 0.52488333,0.52488337, 0.5248834 , 0.52488343, 0.52488346, 0.52488348,0.5248835 , 0.52488352, 0.52488354, 0.52488356, 0.52488357,0.52488359, 0.5248836 , 0.52488361, 0.52488362, 0.52488363,0.52488364, 0.52488365, 0.52488366, 0.52488367, 0.52488367,0.52488368, 0.52488368, 0.52488369, 0.52488369, 0.5248837 ,0.5248837 , 0.5248837 , 0.52488371, 0.52488371, 0.52488371,0.52488371, 0.52488371, 0.52488372, 0.52488372, 0.52488372,0.52488372, 0.52488372, 0.52488372, 0.52488372, 0.52488373,0.52488373, 0.52488373, 0.52488373, 0.52488373, 0.52488373,0.52488373, 0.52488373, 0.52488373, 0.52488373, 0.52488373,0.52488373, 0.52488373, 0.52488373, 0.52488373, 0.52488373,0.52488373, 0.52488373, 0.52488373, 0.52488373, 0.52488373,0.52488373, 0.52488373, 0.52488373, 0.52488373, 0.52488373,0.52488373, 0.52488373, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 
0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374,0.52488374, 0.52488374, 0.52488374, 0.52488374, 0.52488374])
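The choice of axis comes down to how the error array is laid out. As a minimal sketch on synthetic data (the `make_regression` dataset, the 50-value alpha grid, and the `*_demo` names are illustrative assumptions, not the housing data used above), the following checks that `mse_path_` has one row per alpha and one column per fold, and that the alpha with the smallest row mean is the one reported as `alpha_`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic regression data (an assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# An explicit, hypothetical grid of 50 candidate alphas
alphas = np.logspace(-3, 1, 50)
lasso_demo = LassoCV(alphas=alphas, cv=5).fit(X_demo, y_demo)

# One row per alpha, one column per fold
print(lasso_demo.mse_path_.shape)

# Average over the folds (axis=1) to get one mean MSE per alpha
mean_mse = lasso_demo.mse_path_.mean(axis=1)

# The alpha with the smallest mean MSE should match alpha_
best_alpha = lasso_demo.alphas_[np.argmin(mean_mse)]
print(best_alpha, lasso_demo.alpha_)
```

Indexing `alphas_` rather than the grid that was passed in matters here, because the fitted estimator stores the candidates in the same order as the rows of `mse_path_`.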
# Coefficients of the model obtained with the best regularization strength
lasso_.coef_
array([ 4.29867301e-01,  1.03623683e-02, -9.32648616e-02,  5.51755252e-01,1.14732262e-06, -3.31941716e-03, -4.10451223e-01, -4.22410330e-01])
lasso_.score(Xtest,Ytest)
0.6038982670571438
# How does this compare with linear regression?
reg = LinearRegression().fit(Xtrain,Ytrain)
reg.score(Xtest,Ytest)
0.6043668160178817
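# Let LassoCV build the range of candidate alphas automatically from the regularization path length (eps) and the number of alphas on the path (n_alphas)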
ls_ = LassoCV(eps=0.00001,n_alphas=300,cv=5).fit(Xtrain, Ytrain)
ls_.alpha_
0.0020954551690628535
ls_.alphas_ # view all automatically generated alpha values
array([2.94059737e+01, 2.82952253e+01, 2.72264331e+01, 2.61980122e+01,2.52084378e+01, 2.42562424e+01, 2.33400142e+01, 2.24583946e+01,2.16100763e+01, 2.07938014e+01, 2.00083596e+01, 1.92525862e+01,1.85253605e+01, 1.78256042e+01, 1.71522798e+01, 1.65043887e+01,1.58809704e+01, 1.52811004e+01, 1.47038891e+01, 1.41484809e+01,1.36140520e+01, 1.30998100e+01, 1.26049924e+01, 1.21288655e+01,1.16707233e+01, 1.12298864e+01, 1.08057012e+01, 1.03975388e+01,1.00047937e+01, 9.62688384e+00, 9.26324869e+00, 8.91334908e+00,8.57666619e+00, 8.25270079e+00, 7.94097249e+00, 7.64101907e+00,7.35239575e+00, 7.07467457e+00, 6.80744372e+00, 6.55030695e+00,6.30288297e+00, 6.06480491e+00, 5.83571975e+00, 5.61528779e+00,5.40318218e+00, 5.19908842e+00, 5.00270386e+00, 4.81373731e+00,4.63190858e+00, 4.45694804e+00, 4.28859627e+00, 4.12660362e+00,3.97072991e+00, 3.82074399e+00, 3.67642348e+00, 3.53755437e+00,3.40393074e+00, 3.27535446e+00, 3.15163488e+00, 3.03258855e+00,2.91803894e+00, 2.80781620e+00, 2.70175688e+00, 2.59970374e+00,2.50150543e+00, 2.40701636e+00, 2.31609642e+00, 2.22861078e+00,2.14442973e+00, 2.06342843e+00, 1.98548679e+00, 1.91048923e+00,1.83832455e+00, 1.76888573e+00, 1.70206982e+00, 1.63777773e+00,1.57591415e+00, 1.51638733e+00, 1.45910901e+00, 1.40399425e+00,1.35096134e+00, 1.29993164e+00, 1.25082947e+00, 1.20358204e+00,1.15811928e+00, 1.11437377e+00, 1.07228066e+00, 1.03177753e+00,9.92804320e-01, 9.55303239e-01, 9.19218682e-01, 8.84497142e-01,8.51087135e-01, 8.18939121e-01, 7.88005430e-01, 7.58240193e-01,7.29599275e-01, 7.02040207e-01, 6.75522125e-01, 6.50005707e-01,6.25453118e-01, 6.01827951e-01, 5.79095174e-01, 5.57221080e-01,5.36173234e-01, 5.15920425e-01, 4.96432623e-01, 4.77680932e-01,4.59637546e-01, 4.42275711e-01, 4.25569683e-01, 4.09494689e-01,3.94026894e-01, 3.79143363e-01, 3.64822025e-01, 3.51041645e-01,3.37781790e-01, 3.25022798e-01, 3.12745750e-01, 3.00932442e-01,2.89565356e-01, 2.78627638e-01, 2.68103069e-01, 2.57976043e-01,2.48231544e-01, 2.38855123e-01, 
2.29832877e-01, 2.21151426e-01,2.12797900e-01, 2.04759910e-01, 1.97025538e-01, 1.89583315e-01,1.82422207e-01, 1.75531594e-01, 1.68901260e-01, 1.62521372e-01,1.56382472e-01, 1.50475455e-01, 1.44791563e-01, 1.39322368e-01,1.34059761e-01, 1.28995937e-01, 1.24123389e-01, 1.19434891e-01,1.14923491e-01, 1.10582499e-01, 1.06405479e-01, 1.02386238e-01,9.85188143e-02, 9.47974747e-02, 9.12167008e-02, 8.77711831e-02,8.44558125e-02, 8.12656730e-02, 7.81960343e-02, 7.52423447e-02,7.24002244e-02, 6.96654592e-02, 6.70339940e-02, 6.45019268e-02,6.20655031e-02, 5.97211101e-02, 5.74652717e-02, 5.52946427e-02,5.32060046e-02, 5.11962605e-02, 4.92624301e-02, 4.74016461e-02,4.56111493e-02, 4.38882847e-02, 4.22304977e-02, 4.06353301e-02,3.91004165e-02, 3.76234811e-02, 3.62023337e-02, 3.48348672e-02,3.35190539e-02, 3.22529426e-02, 3.10346560e-02, 2.98623876e-02,2.87343991e-02, 2.76490180e-02, 2.66046349e-02, 2.55997012e-02,2.46327267e-02, 2.37022776e-02, 2.28069742e-02, 2.19454891e-02,2.11165447e-02, 2.03189119e-02, 1.95514080e-02, 1.88128950e-02,1.81022777e-02, 1.74185025e-02, 1.67605555e-02, 1.61274610e-02,1.55182803e-02, 1.49321101e-02, 1.43680812e-02, 1.38253574e-02,1.33031338e-02, 1.28006361e-02, 1.23171192e-02, 1.18518661e-02,1.14041869e-02, 1.09734179e-02, 1.05589203e-02, 1.01600794e-02,9.77630394e-03, 9.40702475e-03, 9.05169431e-03, 8.70978573e-03,8.38079201e-03, 8.06422534e-03, 7.75961630e-03, 7.46651323e-03,7.18448150e-03, 6.91310292e-03, 6.65197510e-03, 6.40071082e-03,6.15893752e-03, 5.92629670e-03, 5.70244339e-03, 5.48704566e-03,5.27978413e-03, 5.08035147e-03, 4.88845195e-03, 4.70380102e-03,4.52612490e-03, 4.35516012e-03, 4.19065316e-03, 4.03236011e-03,3.88004625e-03, 3.73348572e-03, 3.59246120e-03, 3.45676358e-03,3.32619166e-03, 3.20055181e-03, 3.07965774e-03, 2.96333019e-03,2.85139667e-03, 2.74369120e-03, 2.64005407e-03, 2.54033162e-03,2.44437597e-03, 2.35204484e-03, 2.26320133e-03, 2.17771369e-03,2.09545517e-03, 2.01630379e-03, 1.94014218e-03, 1.86685742e-03,1.79634083e-03, 
1.72848786e-03, 1.66319789e-03, 1.60037411e-03,1.53992337e-03, 1.48175602e-03, 1.42578583e-03, 1.37192979e-03,1.32010804e-03, 1.27024376e-03, 1.22226299e-03, 1.17609459e-03,1.13167011e-03, 1.08892367e-03, 1.04779188e-03, 1.00821376e-03,9.70130622e-04, 9.33485992e-04, 8.98225535e-04, 8.64296967e-04,8.31649980e-04, 8.00236162e-04, 7.70008936e-04, 7.40923479e-04,7.12936663e-04, 6.86006990e-04, 6.60094529e-04, 6.35160855e-04,6.11168999e-04, 5.88083384e-04, 5.65869780e-04, 5.44495247e-04,5.23928092e-04, 5.04137817e-04, 4.85095079e-04, 4.66771639e-04,4.49140329e-04, 4.32175004e-04, 4.15850508e-04, 4.00142636e-04,3.85028095e-04, 3.70484474e-04, 3.56490207e-04, 3.43024545e-04,3.30067519e-04, 3.17599917e-04, 3.05603253e-04, 2.94059737e-04])
ls_.alphas_.shape
(300,)
ls_.score(Xtest,Ytest)
0.6038915423819199
ls_.coef_
array([ 4.29785372e-01,  1.03639989e-02, -9.31060823e-02,  5.50940621e-01,1.15407943e-06, -3.31909776e-03, -4.10423420e-01, -4.22369926e-01])
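The automatically generated grid printed above runs from about 29.4 down to about 2.94e-04, a ratio equal to `eps=0.00001`. The sketch below reproduces such a grid by hand. It rests on an assumption about scikit-learn's behaviour for Lasso with an intercept: that the largest alpha is the smallest value that shrinks every coefficient to zero, and that the remaining values are spaced geometrically down to `eps` times that maximum. The synthetic data and all variable names are illustrative, not part of the original notebook:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (an assumption, for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
n_samples = X_demo.shape[0]

# Centre X and y, mirroring what fitting with an intercept does
Xc = X_demo - X_demo.mean(axis=0)
yc = y_demo - y_demo.mean()

# Smallest alpha for which all Lasso coefficients are zero
alpha_max = np.abs(Xc.T @ yc).max() / n_samples

# Geometric grid from alpha_max down to eps * alpha_max
eps, n_alphas = 1e-5, 300
alpha_grid = np.geomspace(alpha_max, alpha_max * eps, num=n_alphas)

# Slightly above alpha_max: every coefficient is shrunk to zero
coef_above = Lasso(alpha=alpha_max * 1.01).fit(X_demo, y_demo).coef_
# Well below alpha_max: at least one coefficient survives
coef_below = Lasso(alpha=alpha_max * 0.5).fit(X_demo, y_demo).coef_
print(np.count_nonzero(coef_above), np.count_nonzero(coef_below))
```

A smaller `eps` therefore stretches the grid further toward weak regularization, while `n_alphas` only controls how finely that range is sampled.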

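How does linear regression perform on nonlinear data? Let us build a dataset that is clearly nonlinear and observe how linear regression and decision tree regression behave when fitting it: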

  1. Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
  2. Create the dataset to fit
rnd = np.random.RandomState(42) # set the random seed
X = rnd.uniform(-3, 3, size=100) # uniform: draw size random floats uniformly from the interval between the two given bounds
X
array([-0.75275929,  2.70428584,  1.39196365,  0.59195091, -2.06388816,-2.06403288, -2.65149833,  2.19705687,  0.60669007,  1.24843547,-2.87649303,  2.81945911,  1.99465584, -1.72596534, -1.9090502 ,-1.89957294, -1.17454654,  0.14853859, -0.40832989, -1.25262516,0.67111737, -2.16303684, -1.24713211, -0.80182894, -0.26358009,1.71105577, -1.80195731,  0.08540663,  0.55448741, -2.72129752,0.64526911, -1.97685526, -2.60969044,  2.69331322,  2.7937922 ,1.85038409, -1.17231738, -2.41396732,  1.10539816, -0.35908504,-2.26777059, -0.02893854, -2.79366887,  2.45592241, -1.44732011,0.97513371, -1.12973354,  0.12040813,  0.28026168, -1.89087327,2.81750777,  1.65079694,  2.63699365,  2.3689641 ,  0.58739987,2.53124541, -2.46904499, -1.82410283, -2.72863627, -1.04801802,-0.66793626, -1.37190581,  1.97242505, -0.85948004, -1.31439294,0.2561765 , -2.15445465,  1.81318188, -2.55269614,  2.92132162,1.63346862, -1.80770591, -2.9668673 ,  1.89276857,  1.24114406,1.37404301,  1.62762208, -2.55573209, -0.84920563, -2.30478564,2.17862056,  0.73978876, -1.01461185, -2.6186499 , -1.13410607,-1.04890007,  1.37763707,  0.82534483,  2.32327646, -0.16671045,-2.28243452,  1.27946872,  1.56471029,  0.36766319,  1.62580308,-0.03722642,  0.13639698, -0.43475389, -2.84748524, -2.35265144])
# How y is generated: first produce a sine curve with NumPy, then add noise by hand
y = np.sin(X) + rnd.normal(size=len(X)) / 3 # normal: draw size random numbers from a normal distribution
y
array([-6.54639413e-01,  3.23832143e-01,  1.01463893e+00, -1.04541922e-01,-9.54097511e-01, -7.61767511e-01,  2.19222347e-02,  6.37468193e-01,3.00653482e-01,  7.81237778e-01,  4.31286305e-02,  4.26174779e-01,7.34921650e-01, -8.16896281e-01, -9.10976357e-01, -6.23556402e-01,-1.15653261e+00,  3.87722577e-02, -5.27779796e-01, -1.43764744e+00,7.20568176e-01, -7.42673638e-01, -9.46371922e-01, -7.96824836e-01,-7.32328912e-01,  8.49964652e-01, -1.08763923e+00, -1.82122918e-01,4.72745679e-01, -2.73346296e-01,  1.23014210e+00, -8.60492039e-01,-4.21323537e-01,  4.08600304e-01, -2.98759618e-01,  9.52331322e-01,-9.01575520e-01,  1.55982485e-01,  8.29522626e-01, -2.50901995e-01,-7.78358506e-01, -4.18493846e-01,  3.99942125e-02,  8.83836261e-01,-7.28709177e-01,  5.24647761e-01, -4.36700367e-01, -3.47166298e-01,4.72226156e-01, -2.19059337e-01, -1.17373303e-02,  8.08035747e-01,5.16673582e-01,  5.30194678e-01,  3.73107829e-02,  5.96006368e-01,-9.77082123e-01, -8.10224942e-01, -7.07793691e-01, -3.49790543e-01,-8.80451494e-01, -1.08764023e+00,  1.19159792e+00, -1.16779132e+00,-8.91488367e-01,  6.89097940e-01, -1.37027995e+00,  1.03231278e+00,-4.68816156e-01,  4.79101740e-01,  5.85719831e-01, -1.41222014e+00,1.42839751e-04,  1.04760806e+00,  1.02965258e+00,  1.09618916e+00,7.71710944e-01, -4.75498741e-01, -6.53065072e-01, -9.80625265e-01,1.44281734e+00,  8.32076210e-01, -1.24637686e+00, -2.80580559e-01,-1.23105035e+00, -6.04513872e-01,  1.36760121e+00,  4.61220951e-01,1.05112144e+00, -2.83456659e-02, -4.83272973e-01,  1.59012773e+00,9.18185441e-01,  1.08190382e-01,  7.01982700e-01, -3.09154586e-01,1.10273875e-01, -3.07469874e-01, -1.97655440e-01, -4.33879904e-01])
rnd.normal(size=len(X)).max()
2.1531824575115563
rnd.normal(size=len(X)).min()
-2.301921164735585
# Use a scatter plot to see what the dataset looks like
plt.scatter(X, y,marker='o',c='k',s=20)
plt.show()

# Prepare for modelling: sklearn only accepts arrays with at least two dimensions as the feature matrix
X.shape
(100,)
X = X.reshape(-1, 1)
X.shape
(100, 1)
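To see why the reshape is needed, here is a small sketch (the `X_1d`, `X_2d`, and `y_demo` names are introduced only for this illustration). It shows that fitting on a one-dimensional array is rejected, while the same values reshaped into a single column are accepted as 100 samples with 1 feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rnd = np.random.RandomState(42)
X_1d = rnd.uniform(-3, 3, size=100)               # shape (100,)
y_demo = np.sin(X_1d) + rnd.normal(size=100) / 3

# Fitting on a 1-D feature array raises a ValueError
try:
    LinearRegression().fit(X_1d, y_demo)
    raised = False
except ValueError:
    raised = True
print(raised)

# reshape(-1, 1) turns the values into one column: 100 samples, 1 feature
X_2d = X_1d.reshape(-1, 1)
model = LinearRegression().fit(X_2d, y_demo)
print(X_2d.shape, model.coef_.shape)
```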
  3. Model the raw data
# Model the raw data
LinearR = LinearRegression().fit(X, y)
TreeR = DecisionTreeRegressor(random_state=0).fit(X, y)
# Set up the canvas
fig, ax1 = plt.subplots(1)
# Create test data: a series of points spread along the x-axis
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Pass the test data to predict to obtain each model's fit, then plot it
ax1.plot(line, LinearR.predict(line), linewidth=2, color='green',label="linear regression")
ax1.plot(line, TreeR.predict(line), linewidth=2, color='red',label="decision tree")
# Plot the original data points on the same axes
ax1.plot(X[:, 0], y, 'o', c='k')
# Other plot options
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Result before discretization")
plt.tight_layout()
plt.show()
# What can we conclude from this plot?

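From the plot we can see that linear regression cannot recover the true shape of this noisy sine curve and only captures the general trend, whereas the decision tree builds a complex model that fits almost every point. Clearly, a linear regression model does not fit nonlinear data well, while a model such as a decision tree fits too finely; even so, of the two, the decision tree's fit is the better one.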
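The visual comparison can be backed with numbers. The sketch below regenerates the same noisy sine data and scores both models on the training points and on held-out points. The 70/30 split and the `*_demo`, `X_tr`, `X_te` names are assumptions added for illustration; the original notebook fits on all 100 points and does not hold any out:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rnd = np.random.RandomState(42)
X_demo = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rnd.normal(size=100) / 3

# Hold out 30% of the points (an assumption, not in the original notebook)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# R^2 on the training points versus the held-out points
for name, model in [("linear regression", linear), ("decision tree", tree)]:
    print(name,
          "train R^2:", round(model.score(X_tr, y_tr), 3),
          "test R^2:", round(model.score(X_te, y_te), 3))
```

An unpruned tree reproduces its training points exactly, so a gap between its training and held-out scores is the numerical counterpart of the overly detailed red curve in the plot.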
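A decision tree cannot be written as a single equation (in the XGBoost chapter we explain in detail how to define a decision tree as an equation, but it is definitely not an equation of the linear form used by linear regression). It is a typical nonlinear model, and it can work remarkably well when used to fit nonlinear data. Other typical nonlinear models include support vector machines with a Gaussian kernel, tree ensembles, and any model built from nonlinear functions such as trigonometric or exponential functions.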
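Following this line of thought, we might infer that linear models are for fitting linear data and nonlinear models are for fitting nonlinear data. In fact, machine learning is far more flexible than that: linear models can be used to fit nonlinear data, nonlinear models can also fit linear data, some algorithms can handle all kinds of data without an explicit model, and some models can be either linear or nonlinear. Next, we discuss these questions one by one.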

5.2 Handling nonlinear problems with binning

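One of the core ways to improve linear regression on nonlinear data is to bin the data, that is, to discretize it. A regression method we often compare linear regression against is decision tree regression. Earlier we fitted a noisy sine curve to show the difference between linear regression and a decision tree; let us analyse that plot and then take steps to help our linear regression.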

  1. Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
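  2. Create the dataset to fit
rnd = np.random.RandomState(42) # set the random seed
X = rnd.uniform(-3, 3, size=100) # uniform: draw size random floats uniformly from the interval between the two given bounds
# How y is generated: first produce a sine curve with NumPy, then add noise by hand
y = np.sin(X) + rnd.normal(size=len(X)) / 3 # normal: draw size random numbers from a normal distribution
# Use a scatter plot to see what the dataset looks like
plt.scatter(X, y,marker='o',c='k',s=20)
plt.show()
# Prepare for modelling: sklearn only accepts arrays with at least two dimensions as the feature matrix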
X.shape
X = X.reshape(-1, 1)

  3. Model the raw data
# Model the raw data
LinearR = LinearRegression().fit(X, y)
TreeR = DecisionTreeRegressor(random_state=0).fit(X, y)
# Set up the canvas
fig, ax1 = plt.subplots(1)
# Create test data: a series of points spread along the x-axis
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Pass the test data to predict to obtain each model's fit, then plot it
ax1.plot(line, LinearR.predict(line), linewidth=2, color='green',label="linear regression")
ax1.plot(line, TreeR.predict(line), linewidth=2, color='red',label="decision tree")
# Plot the original data points on the same axes
ax1.plot(X[:, 0], y, 'o', c='k')
# Other plot options
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Result before discretization")
plt.tight_layout()
plt.show()
# What can we conclude from this plot?

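From the plot we can see that linear regression cannot recover the true shape of this noisy sine curve and only captures the general trend, whereas the decision tree builds a complex model that fits almost every point. At this point the decision tree is overfitting: it has learned the data in too much detail. Linear regression is underfitting, which follows from the nature of the model itself, since it can only fit linear relationships. To make linear regression more powerful on data like this, we can preprocess the raw data with binning, that is, by discretizing the continuous variable, and thereby improve the performance of linear regression. Let us see how this is done: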

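  4. Binning and related issues
from sklearn.preprocessing import KBinsDiscretizer
# Bin the data
enc = KBinsDiscretizer(n_bins=10 # how many bins?
                       ,encode="onehot") # alternative: "ordinal"
X_binned = enc.fit_transform(X)
# encode="onehot": discretize by creating dummy variables
# It returns a sparse matrix of shape (m, n_bins), where each column is one bin
# For each sample, the bin that contains it is marked 1 and all other bins are marked 0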
X.shape
(100, 1)
X_binned
<100x10 sparse matrix of type '<class 'numpy.float64'>'
    with 100 stored elements in Compressed Sparse Row format>
# Use pandas to view the sparse matrix
import pandas as pd
pd.DataFrame(X_binned.toarray()).head()
0 1 2 3 4 5 6 7 8 9
0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
4 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
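The one-hot rows above and the `"ordinal"` alternative mentioned in the code comment carry the same information in two layouts. As a small sketch, the following compares them on the same kind of data. Note the assumption: `strategy="uniform"` (equal-width bins) is chosen here only to keep the illustration simple, whereas the call above relies on the library default, and the `enc_onehot`, `enc_ordinal`, and `X_demo` names are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X_demo = rnd.uniform(-3, 3, size=100).reshape(-1, 1)

# strategy="uniform" is an illustrative choice, not the setting used above
enc_onehot = KBinsDiscretizer(n_bins=10, encode="onehot", strategy="uniform")
enc_ordinal = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")

onehot = enc_onehot.fit_transform(X_demo).toarray()   # shape (100, 10)
ordinal = enc_ordinal.fit_transform(X_demo)           # shape (100, 1)

# Each sample falls into exactly one bin, so every row sums to 1
print(onehot.sum(axis=1)[:5])
# The position of the 1 in a one-hot row is the ordinal bin index
print(onehot.argmax(axis=1)[:5], ordinal[:5, 0])
# Bin boundaries learned from the data: n_bins + 1 edges
print(enc_onehot.bin_edges_[0])
```

With one-hot encoding a linear model can learn a separate level for each bin, whereas the ordinal encoding would force the bins onto a single straight line again.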
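# We will train the model on the binned data; in sklearn the training set and the test set must have the same structure, otherwise an error is raised
LinearR_ = LinearRegression().fit(X_binned, y)
LinearR_.predict(line) # line is used as the test set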
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-141-abc35ebde7c7> in <module>
----> 1 LinearR_.predict(line) # line is used as the test set

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in predict(self, X)
    223             Returns predicted values.
    224         """
--> 225         return self._decision_function(X)
    226 
    227     _preprocess_data = staticmethod(_preprocess_data)

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\_base.py in _decision_function(self, X)
    207         X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
    208         return safe_sparse_dot(X, self.coef_.T,
--> 209                                dense_output=True) + self.intercept_
    210 
    211     def predict(self, X):

D:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
    149             ret = np.dot(a, b)
    150     else:
--> 151         ret = a @ b
    152 
    153     if (sparse.issparse(a) and sparse.issparse(b)

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 10 is different from 1)
line.shape # test data
(1000, 1)
X_binned.shape # training data
(100, 10)
# So we need to create a binned test set: bin line with the discretizer that has already been fitted
line_binned = enc.transform(line)
line_binned
<1000x10 sparse matrix of type '<class 'numpy.float64'>'
    with 1000 stored elements in Compressed Sparse Row format>
line_binned.shape # the binned data cannot be plotted directly
(1000, 10)
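The shape mismatch above arises because the model was trained on 10 binned columns but was handed the raw 1-column test data. One way to keep the two steps in sync, offered here as a sketch and not as the approach taken in this notebook, is to wrap the discretizer and the regression in a scikit-learn pipeline so that `predict` always applies the fitted binning first. The `binned_linear`, `X_demo`, `y_demo`, and `line_demo` names and the `strategy="uniform"` setting are assumptions made for this illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rnd = np.random.RandomState(42)
X_demo = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rnd.normal(size=100) / 3
line_demo = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)

# The pipeline bins the input, then fits the linear model on the bins
binned_linear = make_pipeline(
    KBinsDiscretizer(n_bins=10, encode="onehot", strategy="uniform"),
    LinearRegression(),
)
binned_linear.fit(X_demo, y_demo)

# predict() applies the already-fitted binning before the regression,
# so the raw (1000, 1) test data can be passed in directly
pred = binned_linear.predict(line_demo)
print(pred.shape)
# The prediction is piecewise constant: at most one value per bin
print(len(np.unique(np.round(pred, 6))))
```

Because every test point in the same bin receives the same prediction, the fitted curve is a step function, which is the pattern visible in the repeated values of the prediction output.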
LinearR_.predict(line_binned)
array([-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.22510103, -0.22510103, -0.22510103, -0.22510103, -0.22510103,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, -0.68407735,-0.68407735, -0.68407735, -0.68407735, -0.68407735, 
-0.68407735,-0.68407735, -0.68407735, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.84238714, -0.84238714,-0.84238714, -0.84238714, -0.84238714, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, -0.90433112,-0.90433112, -0.90433112, -0.90433112, -0.90433112, 
-0.90433112, ..., -0.72176296, ..., 0.01332773, ..., 0.53043458, ..., 0.98570463, ..., 0.97481791, ..., 0.38539229])
LinearR_.predict(line_binned).shape
(1000,)
  1. Modeling and plotting with the binned data
# Prepare the data
enc = KBinsDiscretizer(n_bins=10, encode="onehot")
X_binned = enc.fit_transform(X)
line_binned = enc.transform(line)
# Draw the two plots side by side; set up the canvas
fig, (ax1, ax2) = plt.subplots(ncols=2,
                               sharey=True,  # both plots share the y-axis ticks
                               figsize=(10, 4))
# Plot 1: models fitted on the original data
ax1.plot(line, LinearR.predict(line), linewidth=2, color='green', label="linear regression")
ax1.plot(line, TreeR.predict(line), linewidth=2, color='red', label="decision tree")
ax1.plot(X[:, 0], y, 'o', c='k')
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Result before discretization")
# Fit the models on the binned data
LinearR_ = LinearRegression().fit(X_binned, y)
TreeR_ = DecisionTreeRegressor(random_state=0).fit(X_binned, y)
# Plot 2: predictions of the models fitted on the binned data
ax2.plot(line,                           # x-axis
         LinearR_.predict(line_binned),  # predictions on the binned feature matrix
         linewidth=2, color='green', linestyle='-', label='linear regression')
ax2.plot(line, TreeR_.predict(line_binned), linewidth=2, color='red', linestyle=':', label='decision tree')
# Draw vertical lines at the bin edges
ax2.vlines(enc.bin_edges_[0],      # x positions of the vertical lines
           *plt.gca().get_ylim(),  # lower and upper y limits of the vertical lines
           linewidth=1, alpha=.2)
# Overlay the original data points
ax2.plot(X[:, 0], y, 'o', c='k')
# Other plot settings
ax2.legend(loc="best")
ax2.set_xlabel("Input feature")
ax2.set_title("Result after discretization")
plt.tight_layout()
plt.show()

enc.bin_edges_
array([array([-2.9668673 , -2.55299973, -2.0639171 , -1.3945301 , -1.02797432, -0.21514527,  0.44239288,  1.14612193,  1.63693428,  2.32784522,  2.92132162])], dtype=object)
enc.bin_edges_[0]  # the values in this array are the lower and upper edges of the bins;
# they are used as the x coordinates of the vertical lines
array([-2.9668673 , -2.55299973, -2.0639171 , -1.3945301 , -1.02797432, -0.21514527,  0.44239288,  1.14612193,  1.63693428,  2.32784522,  2.92132162])
plt.gca().get_ylim()  # get the lower and upper limits of the y-axis
(0.0, 1.0)

[*(plt.gca().get_ylim())]  # the * unpacks the tuple so that its values can be used individually
[0.0, 1.0]

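The plot shows that after discretization the linear regression and the decision tree make exactly the same predictions: the linear regression now fits the distribution of the data fairly well, and the overfitting of the decision tree is reduced. Because the feature matrix has been binned, its value is constant within each bin, so any model returns the same prediction for every sample that falls in the same bin. Compared with the results before binning, the linear regression is clearly more flexible and the decision tree overfits less. Note, however, that binning is generally not used to reduce overfitting in decision trees, because tree models come with rich and effective pruning options for that purpose.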

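In this example we used 10 bins. The number of bins clearly affects the final predictions of the model, so let us look at how different bin counts change the regression result: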

  1. How the number of bins affects the model
# Prepare the data
enc = KBinsDiscretizer(n_bins=5, encode="onehot")
X_binned = enc.fit_transform(X)
line_binned = enc.transform(line)
# Set up the canvas (a single plot this time)
fig, ax2 = plt.subplots(ncols=1, figsize=(5, 4))
# Fit the models on the binned data
LinearR_ = LinearRegression().fit(X_binned, y)
print(LinearR_.score(line_binned, np.sin(line)))
TreeR_ = DecisionTreeRegressor(random_state=0).fit(X_binned, y)
# Plot the predictions of the models fitted on the binned data
ax2.plot(line,                           # x-axis
         LinearR_.predict(line_binned),  # predictions on the binned feature matrix
         linewidth=2, color='green', linestyle='-', label='linear regression')
ax2.plot(line, TreeR_.predict(line_binned), linewidth=2, color='red', linestyle=':', label='decision tree')
# Draw vertical lines at the bin edges
ax2.vlines(enc.bin_edges_[0],      # x positions of the vertical lines
           *plt.gca().get_ylim(),  # lower and upper y limits of the vertical lines
           linewidth=1, alpha=.2)
# Overlay the original data points
ax2.plot(X[:, 0], y, 'o', c='k')
# Other plot settings
ax2.legend(loc="best")
ax2.set_xlabel("Input feature")
ax2.set_title("Result after discretization")
plt.tight_layout()
plt.show()
0.8649069759304867

  1. How to choose the best number of bins
from sklearn.model_selection import cross_val_score as CVS
import numpy as np
pred, score, var = [], [], []
binsrange = [2, 5, 10, 15, 20, 30]
for i in binsrange:
    # Instantiate the discretizer
    enc = KBinsDiscretizer(n_bins=i, encode="onehot")
    # Transform the data
    X_binned = enc.fit_transform(X)
    line_binned = enc.transform(line)
    # Build the model
    LinearR_ = LinearRegression()
    # Cross-validation on the full data set
    cvresult = CVS(LinearR_, X_binned, y, cv=5)
    score.append(cvresult.mean())
    var.append(cvresult.var())
    # Score on the test data
    pred.append(LinearR_.fit(X_binned, y).score(line_binned, np.sin(line)))
# Plot the results
plt.figure(figsize=(6,5))
plt.plot(binsrange, pred, c="orange", label="test")
plt.plot(binsrange, score, c="k", label="full data")
plt.plot(binsrange, np.array(score) + np.array(var)*0.5, c="red", linestyle="--", label="var")
plt.plot(binsrange, np.array(score) - np.array(var)*0.5, c="red", linestyle="--")
plt.legend()
plt.show()

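From the plot above, 20 bins is the best choice: at that point the variance of the cross-validation scores is lowest and their mean is highest, so the model is at its most stable.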

5.3 Polynomial regression: PolynomialFeatures

from sklearn.preprocessing import PolynomialFeatures
import numpy as np
# If the original data are one-dimensional
X = np.arange(1,4).reshape(-1,1)
X
array([[1],[2],[3]])
X.shape
(3, 1)
# Degree-2 polynomial; the degree parameter controls the maximum power
poly = PolynomialFeatures(degree=2)
# Call the fit_transform interface directly
X_ = poly.fit_transform(X)
X_
array([[1., 1., 1.],[1., 2., 4.],[1., 3., 9.]])
X_.shape
(3, 3)
# Degree-3 polynomial
PolynomialFeatures(degree=3).fit_transform(X)
array([[ 1.,  1.,  1.,  1.],[ 1.,  2.,  4.,  8.],[ 1.,  3.,  9., 27.]])

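It is easy to see that the data look different after the polynomial transform. First, the number of features (dimensions) has increased, which is exactly what we want when mapping data into a higher-dimensional space. Second, the increase follows a pattern: if the original feature matrix has a single feature x, then, as the outputs above show, the transform produces the columns 1, x, x^2 for degree 2 and 1, x, x^2, x^3 for degree 3, where the constant column 1 is the x0 that multiplies the intercept.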
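The single-feature rule can be sketched by building the power columns by hand and comparing them with the library output. This is only an illustration; the names `manual` and `transformed` are assumptions introduced here.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.arange(1, 4).reshape(-1, 1)   # the same toy input as above: [[1], [2], [3]]
degree = 3

# Manual construction: one column per power 0..degree
manual = np.hstack([x ** k for k in range(degree + 1)]).astype(float)
# Library construction
transformed = PolynomialFeatures(degree=degree).fit_transform(x)

columns_match = np.array_equal(manual, transformed)
print(columns_match)
```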

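# Degree-3 polynomial without x0, the constant column that multiplies the intercept
PolynomialFeatures(degree=3,include_bias=False).fit_transform(X)
array([[ 1.,  1.,  1.],[ 2.,  4.,  8.],[ 3.,  9., 27.]])
# Why would we not want to generate x0, the column multiplied by the intercept?
# The polynomial transform has already prepared x0 for the linear regression, but the linear regression does not know that
xxx = PolynomialFeatures(degree=3).fit_transform(X)
xxx.shape
(3, 4)
rnd = np.random.RandomState(42)  # set the random seed
y = rnd.randn(3)
y
array([ 0.49671415, -0.1382643 ,  0.64768854])
# How many coefficients are generated?
LinearRegression().fit(xxx,y).coef_
array([ 3.08086889e-15, -3.51045297e-01, -6.06987134e-01,  2.19575463e-01])
# Check the intercept
LinearRegression().fit(xxx,y).intercept_
1.2351711202036884
# See the problem? The linear regression did not treat the x0 column as the intercept; it fitted an intercept of its own
# So we can either turn off include_bias in PolynomialFeatures
# or turn off fit_intercept in LinearRegression
# How many coefficients are generated?
LinearRegression(fit_intercept=False).fit(xxx,y).coef_
array([ 1.00596411,  0.06916756, -0.83619415,  0.25777663])
# Check the intercept
LinearRegression(fit_intercept=False).fit(xxx,y).intercept_
0.0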
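The two options (dropping the bias column, or switching off `fit_intercept`) describe the same model. The sketch below illustrates this on 10 made-up points rather than the 3 points used above, because with only 3 points a cubic is underdetermined and the two fits need not agree coefficient by coefficient. The data and the names `reg_a` and `reg_b` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data (an assumption): 10 points, so the cubic fit is well determined
rng = np.random.RandomState(0)
x = np.arange(1, 11).reshape(-1, 1)
y = rng.randn(10)

# Option A: no bias column, let LinearRegression fit the intercept
XA = PolynomialFeatures(degree=3, include_bias=False).fit_transform(x)
reg_a = LinearRegression().fit(XA, y)

# Option B: keep the bias column, switch off fit_intercept
XB = PolynomialFeatures(degree=3).fit_transform(x)
reg_b = LinearRegression(fit_intercept=False).fit(XB, y)

# The weight of the bias column in B plays the role of the intercept in A
intercepts_match = np.isclose(reg_a.intercept_, reg_b.coef_[0], atol=1e-6)
slopes_match = np.allclose(reg_a.coef_, reg_b.coef_[1:], atol=1e-6)
print(intercepts_match, slopes_match)
```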

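This, however, only covers the one-dimensional case. Most of the time the original feature matrix has at least two features, and often thousands of features or dimensions. Let us now look at the case where the original feature matrix has two features: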

X = np.arange(6).reshape(3, 2)
X
array([[0, 1],[2, 3],[4, 5]])
# Try a degree-2 polynomial
PolynomialFeatures(degree=2).fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],[ 1.,  2.,  3.,  4.,  6.,  9.],[ 1.,  4.,  5., 16., 20., 25.]])

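Clearly the one-dimensional rule above no longer applies, but looking closely at the output we can see the following pattern: for two features x1 and x2, the degree-2 transform produces the columns 1, x1, x2, x1^2, x1*x2, x2^2.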

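When the original data have two features, the degree-2 polynomial transform raises the number of features to six, one of which is a constant (the intercept column). If we then fit a linear regression on the transformed data, we obtain an equation of the form y = w0 + w1*x1 + w2*x2 + w3*x1^2 + w4*x1*x2 + w5*x2^2.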

# Try a degree-3 polynomial
PolynomialFeatures(degree=3).fit_transform(X)
array([[  1.,   0.,   1.,   0.,   0.,   1.,   0.,   0.,   0.,   1.],[  1.,   2.,   3.,   4.,   6.,   9.,   8.,  12.,  18.,  27.],[  1.,   4.,   5.,  16.,  20.,  25.,  64.,  80., 100., 125.]])

This time the generated columns follow the pattern 1, x1, x2, x1^2, x1*x2, x2^2, x1^3, x1^2*x2, x1*x2^2, x2^3.

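It is easy to see that the polynomial transform produces all terms from the lowest degree up to the highest degree we specify. If we set the degree to 2, we get all terms of degree 1 and degree 2; correspondingly, if we set the degree to n, we get all terms from degree 1 to degree n. Note that x1*x2 is a degree-2 term just like x1^2 (the square of a variable is simply x1*x1), so in a degree-3 polynomial x1^2*x2 is a degree-3 term.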

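In polynomial regression we can choose whether to generate squared and cubed terms. If all we need are higher-degree terms, x1*x2 is a better choice than x1^2, because the collinearity between x1*x2 and x1 is slightly (only slightly) weaker than that between x1^2 and x1. After the polynomial transform we still fit a linear regression model, and although machine learning is not very strict about the basic assumptions on the data, severe collinearity still harms the fit. For this reason sklearn provides the parameter interaction_only, which controls whether squared and cubed terms are generated. It defaults to False, meaning the powers are generated; setting it to True keeps only the products of distinct features, which reduces collinearity. Here is how the parameter works: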

PolynomialFeatures(degree=2).fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],[ 1.,  2.,  3.,  4.,  6.,  9.],[ 1.,  4.,  5., 16., 20., 25.]])
PolynomialFeatures(degree=2,interaction_only=True).fit_transform(X)
array([[ 1.,  0.,  1.,  0.],[ 1.,  2.,  3.,  6.],[ 1.,  4.,  5., 20.]])

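By comparison, with interaction_only=True only interaction terms are generated, in addition to the constant column and the original features.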
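The collinearity argument can be illustrated by comparing two correlations on made-up data. The sketch below assumes two independent, uniformly distributed features; how large the gap is in practice depends on how x1 and x2 are distributed and related, so this is an illustration rather than a general result.

```python
import numpy as np

# Synthetic, independent features (an assumption for illustration)
rng = np.random.RandomState(0)
x1 = rng.uniform(0, 1, size=10_000)
x2 = rng.uniform(0, 1, size=10_000)

# Pearson correlation of x1 with its square, and with the interaction x1 * x2
corr_square = np.corrcoef(x1, x1 ** 2)[0, 1]
corr_interaction = np.corrcoef(x1, x1 * x2)[0, 1]

print(f"corr(x1, x1^2)  = {corr_square:.3f}")
print(f"corr(x1, x1*x2) = {corr_interaction:.3f}")
```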

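From the experiments so far we can see that the feature matrix becomes more and more complex as the degree of the polynomial rises. The same is true when the number of dimensions (features) in the feature matrix grows: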

# An original feature matrix with more dimensions
X = np.arange(9).reshape(3, 3)
X
array([[0, 1, 2],[3, 4, 5],[6, 7, 8]])
PolynomialFeatures(degree=2).fit_transform(X)
array([[ 1.,  0.,  1.,  2.,  0.,  0.,  0.,  1.,  2.,  4.],[ 1.,  3.,  4.,  5.,  9., 12., 15., 16., 20., 25.],[ 1.,  6.,  7.,  8., 36., 42., 48., 49., 56., 64.]])
PolynomialFeatures(degree=3).fit_transform(X)
array([[  1.,   0.,   1.,   2.,   0.,   0.,   0.,   1.,   2.,   4.,   0.,0.,   0.,   0.,   0.,   0.,   1.,   2.,   4.,   8.],[  1.,   3.,   4.,   5.,   9.,  12.,  15.,  16.,  20.,  25.,  27.,36.,  45.,  48.,  60.,  75.,  64.,  80., 100., 125.],[  1.,   6.,   7.,   8.,  36.,  42.,  48.,  49.,  56.,  64., 216.,252., 288., 294., 336., 384., 343., 392., 448., 512.]])
X_ = PolynomialFeatures(degree=20).fit_transform(X)
X_.shape
(3, 1771)

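This makes the effect of the polynomial transform on the data clear: as the number of original features grows and as the maximum degree grows, the data become more and more complex and the number of dimensions increases rapidly (combinatorially in both quantities). Polynomial regression therefore has no fixed model expression; what the final model looks like is determined by the data and the maximum degree, so we cannot point to one formula and call it the mathematical expression of polynomial regression. Solving it by hand is correspondingly not easy; interested readers can try to solve polynomial regression with least squares themselves. Next, let us look at the fundamental purpose of polynomial regression: handling nonlinear problems.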
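How fast the number of columns grows can be made concrete. With the default settings (bias column included, powers allowed), a transform of n features to degree d yields C(n + d, d) columns. The helper `n_poly_features` below is a hypothetical name introduced for this sketch; it is compared against the `n_output_features_` attribute of a fitted `PolynomialFeatures`.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_features(n_features, degree):
    """Column count for include_bias=True and interaction_only=False."""
    return comb(n_features + degree, degree)


X_demo = np.arange(9).reshape(3, 3)   # 3 features, as in the example above
counts = {d: PolynomialFeatures(degree=d).fit(X_demo).n_output_features_ for d in (2, 3, 20)}
expected = {d: n_poly_features(3, d) for d in (2, 3, 20)}
print(counts, expected)
```

For 3 features this gives 10, 20 and 1771 columns at degrees 2, 3 and 20, matching the shapes shown above.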

5.3.2 Handling nonlinear problems with polynomial regression

from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn.linear_model import LinearRegression
import numpy as np
rnd = np.random.RandomState(42)  # set the random seed
X = rnd.uniform(-3, 3, size=100)
y = np.sin(X) + rnd.normal(size=len(X)) / 3
# Reshape X to two dimensions so that it can be passed to sklearn
X = X.reshape(-1,1)
# Create the test data: 1000 points spread evenly over the value range of the training set X
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
# Fit on the original feature matrix
LinearR = LinearRegression().fit(X, y)
# Score on the training data
LinearR.score(X,y)
0.5361526059318595
# Score on the test data
LinearR.score(line,np.sin(line))
0.6800102369793312
# Polynomial fit: set the maximum degree
d=5  # degree of the polynomial transform
poly = PF(degree=d)
X_ = poly.fit_transform(X)  # expand the training data
line_ = PF(degree=d).fit_transform(line)  # expand the test data
# Score on the training data
LinearR_ = LinearRegression().fit(X_, y)
LinearR_.score(X_,y)
0.8561679370344799
# Score on the test data
LinearR_.score(line_,np.sin(line))
0.9868904451787978
import matplotlib.pyplot as plt
d=5
# Same modeling steps as shown above
LinearR = LinearRegression().fit(X, y)
X_ = PF(degree=d).fit_transform(X)
LinearR_ = LinearRegression().fit(X_, y)
line = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
line_ = PF(degree=d).fit_transform(line)
# Set up the canvas
fig, ax1 = plt.subplots(1)
# Pass the test data to predict to get the fitted curves, then plot them
ax1.plot(line, LinearR.predict(line), linewidth=2, color='green',label="linear regression")
ax1.plot(line, LinearR_.predict(line_), linewidth=2, color='red',label="Polynomial regression")
# Overlay the original data points
ax1.plot(X[:, 0], y, 'o', c='k')
# Other plot options
ax1.legend(loc="best")
ax1.set_ylabel("Regression output")
ax1.set_xlabel("Input feature")
ax1.set_title("Linear Regression ordinary vs poly")
plt.tight_layout()
plt.show()
# A round of applause for the magic of polynomial regression

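As you can see, polynomial regression fits nonlinear data well and, at this degree, shows no obvious overfitting. It keeps the properties that linear regression has as a linear model, namely being less prone to overfitting and fast to compute, while also fitting nonlinear data very well. By now its effectiveness should no longer be in doubt. Polynomial regression is attractive and has long been the subject of many discussions; below are a few common questions and discussions for reference.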
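The effect of the degree can be compared in a few lines by chaining the transform and the regression with `make_pipeline`. This is a sketch under the same synthetic setup as this section; the degrees tried (1, 3, 5) and the names `X_demo`, `grid` and `scores` are illustrative choices, not part of the original workflow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic setup as in this section: a noisy sine curve
rnd = np.random.RandomState(42)
X_demo = rnd.uniform(-3, 3, size=100).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rnd.normal(size=100) / 3
grid = np.linspace(-3, 3, 1000, endpoint=False).reshape(-1, 1)
y_true = np.sin(grid[:, 0])

scores = {}
for d in (1, 3, 5):
    # The pipeline applies the polynomial transform, then fits the linear model
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(X_demo, y_demo)
    scores[d] = model.score(grid, y_true)  # R^2 against the noise-free sine curve

for d, r2 in scores.items():
    print(f"degree={d}: R2={r2:.4f}")
```

A pipeline also guarantees that the test data go through exactly the same transform as the training data, so there is no need to build a second `PolynomialFeatures` object for `line`.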

5.3.3 Interpretability of polynomial regression

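Linear regression is a highly interpretable model: it fits a parameter w for each feature, which helps us understand the effect of each feature on the label. After a polynomial transform we still obtain an equation of the linear-regression form, but as the number of dimensions and the degree rise, the equation becomes very complex, and we may no longer be able to tell at a glance which original features a generated feature is built from (so far we have judged this by eye). Polynomial regression is nevertheless still interpretable: the get_feature_names interface returns the name of every feature in the generated matrix, which helps us explain the model. (In scikit-learn 1.0 and later the equivalent method is get_feature_names_out; get_feature_names was removed in 1.2.) See the example below: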

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = np.arange(9).reshape(3, 3)
X
array([[0, 1, 2],[3, 4, 5],[6, 7, 8]])
poly = PolynomialFeatures(degree=5).fit(X)
poly
PolynomialFeatures(degree=5, include_bias=True, interaction_only=False,order='C')
# Key interface: get_feature_names
poly.get_feature_names()
['1','x0','x1','x2','x0^2','x0 x1','x0 x2','x1^2','x1 x2','x2^2','x0^3','x0^2 x1','x0^2 x2','x0 x1^2','x0 x1 x2','x0 x2^2','x1^3','x1^2 x2','x1 x2^2','x2^3','x0^4','x0^3 x1','x0^3 x2','x0^2 x1^2','x0^2 x1 x2','x0^2 x2^2','x0 x1^3','x0 x1^2 x2','x0 x1 x2^2','x0 x2^3','x1^4','x1^3 x2','x1^2 x2^2','x1 x2^3','x2^4','x0^5','x0^4 x1','x0^4 x2','x0^3 x1^2','x0^3 x1 x2','x0^3 x2^2','x0^2 x1^3','x0^2 x1^2 x2','x0^2 x1 x2^2','x0^2 x2^3','x0 x1^4','x0 x1^3 x2','x0 x1^2 x2^2','x0 x1 x2^3','x0 x2^4','x1^5','x1^4 x2','x1^3 x2^2','x1^2 x2^3','x1 x2^4','x2^5']
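Because the name of this interface differs between scikit-learn releases, a small wrapper can keep the example working on both. The helper `poly_feature_names` and the column names `a`, `b`, `c` below are hypothetical, introduced only for this sketch.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.arange(9).reshape(3, 3)
poly_demo = PolynomialFeatures(degree=2).fit(X_demo)


def poly_feature_names(poly, input_features=None):
    """Return the generated feature names on both old and new scikit-learn releases."""
    if hasattr(poly, "get_feature_names_out"):      # scikit-learn >= 1.0
        return list(poly.get_feature_names_out(input_features))
    return poly.get_feature_names(input_features)   # older releases


names = poly_feature_names(poly_demo, ["a", "b", "c"])  # hypothetical column names
print(names)
```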

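Take the California housing data set as an example. When we have the column names of the features, we can pass them to get_feature_names() to see which original features each new feature is built from: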

from sklearn.datasets import fetch_california_housing as fch
import pandas as pd
housevalue = fch()
X = pd.DataFrame(housevalue.data)
y = housevalue.target
housevalue.feature_names
['MedInc','HouseAge','AveRooms','AveBedrms','Population','AveOccup','Latitude','Longitude']
# Chinese column names, in order: median household income, median house age, average rooms,
# average bedrooms, block population, average occupancy, block latitude, block longitude
X.columns = ["住户收入中位数","房屋使用年代中位数","平均房间数目","平均卧室数目","街区人口","平均入住率","街区的纬度","街区的经度"]
poly = PolynomialFeatures(degree=2).fit(X,y)
poly.get_feature_names(X.columns)
['1','住户收入中位数','房屋使用年代中位数','平均房间数目','平均卧室数目','街区人口','平均入住率','街区的纬度','街区的经度','住户收入中位数^2','住户收入中位数 房屋使用年代中位数','住户收入中位数 平均房间数目','住户收入中位数 平均卧室数目','住户收入中位数 街区人口','住户收入中位数 平均入住率','住户收入中位数 街区的纬度','住户收入中位数 街区的经度','房屋使用年代中位数^2','房屋使用年代中位数 平均房间数目','房屋使用年代中位数 平均卧室数目','房屋使用年代中位数 街区人口','房屋使用年代中位数 平均入住率','房屋使用年代中位数 街区的纬度','房屋使用年代中位数 街区的经度','平均房间数目^2','平均房间数目 平均卧室数目','平均房间数目 街区人口','平均房间数目 平均入住率','平均房间数目 街区的纬度','平均房间数目 街区的经度','平均卧室数目^2','平均卧室数目 街区人口','平均卧室数目 平均入住率','平均卧室数目 街区的纬度','平均卧室数目 街区的经度','街区人口^2','街区人口 平均入住率','街区人口 街区的纬度','街区人口 街区的经度','平均入住率^2','平均入住率 街区的纬度','平均入住率 街区的经度','街区的纬度^2','街区的纬度 街区的经度','街区的经度^2']
X_ = poly.transform(X)
# After this we can still fit a model directly and use the coef_ attribute of the linear regression to see which features affect the label most
reg = LinearRegression().fit(X_,y)
coef = reg.coef_
coef
array([ 5.91954055e-08, -1.12430252e+01, -8.48898543e-01,  6.44105898e+00,-3.15913288e+01,  4.06090344e-04,  1.00386234e+00,  8.70568188e+00,5.88063272e+00, -3.13081272e-02,  1.85994682e-03,  4.33020468e-02,-1.86142278e-01,  5.72831545e-05, -2.59019509e-03, -1.52505713e-01,-1.44242939e-01,  2.11725336e-04, -1.26219010e-03,  1.06115056e-02,2.81885293e-06, -1.81716947e-03, -1.00690372e-02, -9.99950167e-03,7.26947730e-03, -6.89064340e-02, -6.82365908e-05,  2.68878842e-02,8.75089875e-02,  8.22890339e-02,  1.60180950e-01,  5.14264271e-04,-8.71911472e-02, -4.37042992e-01, -4.04150578e-01,  2.73779577e-09,1.91426762e-05,  2.29529789e-05,  1.46567733e-05,  8.71560978e-05,2.13344592e-02,  1.62412938e-02,  6.18867358e-02,  1.08107173e-01,3.99077351e-02])
[*zip(poly.get_feature_names(X.columns),reg.coef_)]
[('1', 5.919540552171548e-08),('住户收入中位数', -11.243025193367279),('房屋使用年代中位数', -0.848898543001153),('平均房间数目', 6.441058980103585),('平均卧室数目', -31.5913287885817),('街区人口', 0.00040609034379415385),('平均入住率', 1.003862338673655),('街区的纬度', 8.705681884585069),('街区的经度', 5.880632723650286),('住户收入中位数^2', -0.03130812716756933),('住户收入中位数 房屋使用年代中位数', 0.0018599468175778376),('住户收入中位数 平均房间数目', 0.04330204675617265),('住户收入中位数 平均卧室数目', -0.18614227806341782),('住户收入中位数 街区人口', 5.7283154455812717e-05),('住户收入中位数 平均入住率', -0.0025901950881940016),('住户收入中位数 街区的纬度', -0.15250571255697834),('住户收入中位数 街区的经度', -0.1442429393754428),('房屋使用年代中位数^2', 0.00021172533625687026),('房屋使用年代中位数 平均房间数目', -0.0012621900983789294),('房屋使用年代中位数 平均卧室数目', 0.010611505608370727),('房屋使用年代中位数 街区人口', 2.818852930851565e-06),('房屋使用年代中位数 平均入住率', -0.0018171694688044425),('房屋使用年代中位数 街区的纬度', -0.010069037156389547),('房屋使用年代中位数 街区的经度', -0.009999501671412017),('平均房间数目^2', 0.007269477298002129),('平均房间数目 平均卧室数目', -0.06890643404856586),('平均房间数目 街区人口', -6.823659076329969e-05),('平均房间数目 平均入住率', 0.026887884152557523),('平均房间数目 街区的纬度', 0.0875089875407275),('平均房间数目 街区的经度', 0.08228903389524618),('平均卧室数目^2', 0.1601809500092068),('平均卧室数目 街区人口', 0.0005142642707304053),('平均卧室数目 平均入住率', -0.08719114715677954),('平均卧室数目 街区的纬度', -0.43704299179225914),('平均卧室数目 街区的经度', -0.4041505775830314),('街区人口^2', 2.737795767870921e-09),('街区人口 平均入住率', 1.914267616391803e-05),('街区人口 街区的纬度', 2.2952978919604794e-05),('街区人口 街区的经度', 1.4656773311472193e-05),('平均入住率^2', 8.715609781424712e-05),('平均入住率 街区的纬度', 0.021334459219533943),('平均入住率 街区的经度', 0.01624129382914855),('街区的纬度^2', 0.06188673577348754),('街区的纬度 街区的经度', 0.10810717324450632),('街区的经度^2', 0.039907735079891565)]
# Put the names and coefficients in a DataFrame and sort
coeff = pd.DataFrame([poly.get_feature_names(X.columns),reg.coef_.tolist()]).T
coeff.columns = ["feature","coef"]
coeff.sort_values(by="coef")  # sort by the values of the coef column, ascending by default
feature coef
4 平均卧室数目 -31.5913
1 住户收入中位数 -11.243
2 房屋使用年代中位数 -0.848899
33 平均卧室数目 街区的纬度 -0.437043
34 平均卧室数目 街区的经度 -0.404151
12 住户收入中位数 平均卧室数目 -0.186142
15 住户收入中位数 街区的纬度 -0.152506
16 住户收入中位数 街区的经度 -0.144243
32 平均卧室数目 平均入住率 -0.0871911
25 平均房间数目 平均卧室数目 -0.0689064
9 住户收入中位数^2 -0.0313081
22 房屋使用年代中位数 街区的纬度 -0.010069
23 房屋使用年代中位数 街区的经度 -0.0099995
14 住户收入中位数 平均入住率 -0.0025902
21 房屋使用年代中位数 平均入住率 -0.00181717
18 房屋使用年代中位数 平均房间数目 -0.00126219
26 平均房间数目 街区人口 -6.82366e-05
35 街区人口^2 2.7378e-09
0 1 5.91954e-08
20 房屋使用年代中位数 街区人口 2.81885e-06
38 街区人口 街区的经度 1.46568e-05
36 街区人口 平均入住率 1.91427e-05
37 街区人口 街区的纬度 2.2953e-05
13 住户收入中位数 街区人口 5.72832e-05
39 平均入住率^2 8.71561e-05
17 房屋使用年代中位数^2 0.000211725
5 街区人口 0.00040609
31 平均卧室数目 街区人口 0.000514264
10 住户收入中位数 房屋使用年代中位数 0.00185995
24 平均房间数目^2 0.00726948
19 房屋使用年代中位数 平均卧室数目 0.0106115
41 平均入住率 街区的经度 0.0162413
40 平均入住率 街区的纬度 0.0213345
27 平均房间数目 平均入住率 0.0268879
44 街区的经度^2 0.0399077
11 住户收入中位数 平均房间数目 0.043302
42 街区的纬度^2 0.0618867
29 平均房间数目 街区的经度 0.082289
28 平均房间数目 街区的纬度 0.087509
43 街区的纬度 街区的经度 0.108107
30 平均卧室数目^2 0.160181
6 平均入住率 1.00386
8 街区的经度 5.88063
3 平均房间数目 6.44106
7 街区的纬度 8.70568
# For comparison: coefficients of a plain linear regression on the 8 original features
LinearRegression().fit(X,y).coef_
array([ 4.36693293e-01,  9.43577803e-03, -1.07322041e-01,  6.45065694e-01, -3.97638942e-06, -3.78654265e-03, -4.21314378e-01, -4.34513755e-01])
LinearRegression().fit(X,y).coef_.tolist()
[0.4366932931343249,0.00943577803323849,-0.10732204139090416,0.6450656935198118,-3.97638942118729e-06,-0.0037865426549709763,-0.42131437752714423,-0.4345137546747774]

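As you can see, interpretability is preserved, and the same technique can also be used for feature engineering, specifically feature creation. The polynomial transform gives us a whole series of products of features; if we find a combination that contributes greatly to the label, we have created a new, useful feature, and discovering new features is valuable in any discipline.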
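One caveat when ranking generated features by coefficient: raw coefficients depend on the scale of each column, so a feature with small values can receive a large coefficient without being important. The sketch below standardizes the polynomial features before fitting so that the magnitudes are comparable. It runs on synthetic data in which the label is driven by the product x0 * x1; the data, the scaling step and all names are assumptions added for illustration and are not part of the workflow above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption): the label depends mainly on the product x0 * x1
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(500, 3))
y_demo = 3 * X_demo[:, 0] * X_demo[:, 1] + 0.1 * rng.normal(size=500)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),        # puts every generated feature on the same scale
    LinearRegression(),
).fit(X_demo, y_demo)

# Feature names: get_feature_names_out on scikit-learn >= 1.0, get_feature_names before
poly_step = model.named_steps["polynomialfeatures"]
if hasattr(poly_step, "get_feature_names_out"):
    names = [str(n) for n in poly_step.get_feature_names_out()]
else:
    names = poly_step.get_feature_names()

coefs = model.named_steps["linearregression"].coef_
# Rank the generated features by absolute standardized coefficient
ranking = sorted(zip(names, coefs), key=lambda t: abs(t[1]), reverse=True)
top_feature = ranking[0][0]
print(ranking[:3])
```

In this constructed case the interaction column comes out on top, which is the kind of signal that would suggest a newly created feature is worth keeping.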

Let us confirm once more, on the California housing data set, that polynomial regression can improve model performance:

# While we are here, check how well the model fits after the polynomial transform
poly = PolynomialFeatures(degree=4).fit(X,y)
X_ = poly.transform(X)
reg = LinearRegression().fit(X,y)
reg.score(X,y)
0.6062326851998049
from time import time
time0 = time()
reg_ = LinearRegression().fit(X_,y)
print("R2:{}".format(reg_.score(X_,y)))
print("time:{}".format(time()-time0))
R2:0.7452006076442239
time:0.8058722019195557
# What if we use a different model? (as above, the score is computed on the training data)
from sklearn.ensemble import RandomForestRegressor as RFR
time0 = time()
print("R2:{}".format(RFR(n_estimators=100).fit(X,y).score(X,y)))
print("time:{}".format(time()-time0))
R2:0.9743876150808999
time:10.486432075500488
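The scores above are measured on the same data the models were trained on, which flatters flexible models. A fairer comparison holds out a test set. The sketch below shows the pattern on a synthetic nonlinear data set generated with `make_friedman1` (an assumption made here so the example runs without downloading the housing data); the split ratio and the degree are likewise illustrative choices.

```python
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic nonlinear regression data (an assumption; not the housing data)
X_demo, y_demo = make_friedman1(n_samples=1000, n_features=5, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

plain = LinearRegression().fit(X_train, y_train)
poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_train, y_train)

# Store (train R^2, test R^2) for each model
results = {
    "linear": (plain.score(X_train, y_train), plain.score(X_test, y_test)),
    "poly2": (poly2.score(X_train, y_train), poly2.score(X_test, y_test)),
}
for name, (r2_train, r2_test) in results.items():
    print(f"{name}: train R2={r2_train:.3f}, test R2={r2_test:.3f}")
```

The same pattern applies to the random forest: comparing its training and test R2 shows how much of a very high training score carries over to unseen data.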

Summary

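This article covered four algorithms: multiple linear regression, ridge regression, Lasso, and polynomial regression. All of them are extensions and improvements of ordinary linear regression. Ridge regression and Lasso address the limitations of solving multiple linear regression with least squares; their main use is to remove the effect of multicollinearity and, in the case of Lasso, to perform feature selection. Polynomial regression addresses the obvious weakness that linear regression cannot fit nonlinear data; its core purpose is to improve model performance.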
