Weighted Linear Regression Case Study: Predicting the Age of Abalone

Dataset: https://download.csdn.net/download/weixin_44827418/12553408

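1. Import the dataset

Dataset description:

import pandas as pd
import numpy as np

abalone = pd.read_table("./datas/abalone.txt", header=None)
# Column names are kept in Chinese because later cells index by them. In order: sex, length,
# diameter, height, whole weight, meat weight, viscera weight, shell weight, age
abalone.columns=['性别','长度','直径','高度','整体重量','肉重量','内脏重量','壳重','年龄']
abalone.head()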

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
0	1	0.455	0.365	0.095	0.5140	0.2245	0.1010	0.150	15
1	1	0.350	0.265	0.090	0.2255	0.0995	0.0485	0.070	7
2	-1	0.530	0.420	0.135	0.6770	0.2565	0.1415	0.210	9
3	1	0.440	0.365	0.125	0.5160	0.2155	0.1140	0.155	10
4	0	0.330	0.255	0.080	0.2050	0.0895	0.0395	0.055	7

abalone.shape

(4177, 9)

abalone.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   性别      4177 non-null   int64
 1   长度      4177 non-null   float64
 2   直径      4177 non-null   float64
 3   高度      4177 non-null   float64
 4   整体重量    4177 non-null   float64
 5   肉重量     4177 non-null   float64
 6   内脏重量    4177 non-null   float64
 7   壳重      4177 non-null   float64
 8   年龄      4177 non-null   int64
dtypes: float64(7), int64(2)
memory usage: 293.8 KB

abalone.describe()

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
count	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000
mean	0.052909	0.523992	0.407881	0.139516	0.828742	0.359367	0.180594	0.238831	9.933684
std	0.822240	0.120093	0.099240	0.041827	0.490389	0.221963	0.109614	0.139203	3.224169
min	-1.000000	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	-1.000000	0.450000	0.350000	0.115000	0.441500	0.186000	0.093500	0.130000	8.000000
50%	0.000000	0.545000	0.425000	0.140000	0.799500	0.336000	0.171000	0.234000	9.000000
75%	1.000000	0.615000	0.480000	0.165000	1.153000	0.502000	0.253000	0.329000	11.000000
max	1.000000	0.815000	0.650000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

2. Inspect the data distribution

import numpy as np
import pandas as pd
import random
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['simhei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
%matplotlib inline

mpl.cm.rainbow(np.linspace(0,1,10))

array([[5.00000000e-01, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00],
       [2.80392157e-01, 3.38158275e-01, 9.85162233e-01, 1.00000000e+00],
       [6.07843137e-02, 6.36474236e-01, 9.41089253e-01, 1.00000000e+00],
       [1.66666667e-01, 8.66025404e-01, 8.66025404e-01, 1.00000000e+00],
       [3.86274510e-01, 9.84086337e-01, 7.67362681e-01, 1.00000000e+00],
       [6.13725490e-01, 9.84086337e-01, 6.41213315e-01, 1.00000000e+00],
       [8.33333333e-01, 8.66025404e-01, 5.00000000e-01, 1.00000000e+00],
       [1.00000000e+00, 6.36474236e-01, 3.38158275e-01, 1.00000000e+00],
       [1.00000000e+00, 3.38158275e-01, 1.71625679e-01, 1.00000000e+00],
       [1.00000000e+00, 1.22464680e-16, 6.12323400e-17, 1.00000000e+00]])

mpl.cm.rainbow(np.linspace(0,1,10))[0]

array([0.5, 0. , 1. , 1. ])

def dataPlot(dataSet):
    m, n = dataSet.shape
    fig = plt.figure(figsize=(8, 20), dpi=100)
    colormap = mpl.cm.rainbow(np.linspace(0, 1, n))   # one colour per column
    for i in range(n):
        fig_ = fig.add_subplot(n, 1, i+1)             # one subplot per column
        plt.scatter(range(m), dataSet.iloc[:, i].values, s=2, c=colormap[i])
        plt.title(dataSet.columns[i])
    plt.tight_layout(pad=1.2)                         # adjust the spacing between subplots

# Run the function to view the data distribution:
dataPlot(abalone)

Matplotlib prints the following warning once per subplot (nine times here), because c=colormap[i] passes a single RGBA sequence. As the message itself suggests, passing a 2-D array with a single row, for example c=[colormap[i]], avoids it:

'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

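The scatter plots show two things:

1) Apart from 性别 (sex), every column is visibly ordered along the row index rather than randomly arranged.

2) The 高度 (height) feature contains two outliers.

Based on these observations, we take two measures:

1) When splitting into training and test sets, shuffle the original dataset so that samples are picked at random.

2) Remove the outliers in the 高度 (height) feature.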

abalone['高度']<0.4

0       True
1       True
2       True
3       True
4       True
        ... 
4172    True
4173    True
4174    True
4175    True
4176    True
Name: 高度, Length: 4177, dtype: bool
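
The comparison above yields one boolean per row, and indexing with that result keeps only the rows marked True. A minimal sketch of the same idea in plain NumPy follows; the heights array and its values are made up for illustration and are not taken from the abalone data:

```python
import numpy as np

# Hypothetical height values; the last two play the role of outliers
heights = np.array([0.10, 0.09, 0.14, 0.12, 0.08, 0.90, 0.60])

mask = heights < 0.4     # element-wise comparison gives a boolean array
kept = heights[mask]     # boolean indexing keeps entries where mask is True

print(mask.sum(), "of", heights.size, "values kept")
```

The DataFrame version in the next cell works the same way, with .loc applying the mask to whole rows.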

aba = abalone.loc[abalone['高度']<0.4,:]
# Check the distribution of the dataset again
dataPlot(aba)


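2. Split into training and test sets

"""
Purpose: randomly split a dataset into a training set and a test set
Parameters:
    dataSet: the original dataset
    rate: proportion of samples assigned to the training set
Returns: train, test: the resulting training and test sets
"""
def randSplit(dataSet, rate):
    # Work on a copy with a fresh 0..m-1 index: the caller's DataFrame is left untouched,
    # and the label-based selection below still works after rows have been filtered out
    dataSet = dataSet.reset_index(drop=True)
    l = list(dataSet.index)    # collect the index labels in a list
    random.seed(123)           # fix the random seed
    random.shuffle(l)          # shuffle the index labels
    dataSet.index = l          # assigning shuffled labels is equivalent to shuffling the rows
    m = dataSet.shape[0]       # total number of samples
    n = int(m*rate)            # number of training samples
    train = dataSet.loc[range(n), :].reset_index(drop=True)      # labels 0..n-1 form the training set
    test = dataSet.loc[range(n, m), :].reset_index(drop=True)    # labels n..m-1 form the test set
    return train, test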
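
An alternative way to get the same kind of random split is to permute row positions rather than index labels, which does not depend on what the index looks like. The sketch below uses NumPy arrays instead of a DataFrame; split_rows and its seed parameter are illustrative names, not part of the original code, and it will not reproduce the exact rows chosen by randSplit:

```python
import numpy as np

def split_rows(data, rate, seed=123):
    """Randomly split the rows of a 2-D array into train and test parts."""
    rng = np.random.default_rng(seed)        # seeded generator for reproducibility
    order = rng.permutation(data.shape[0])   # shuffled row positions
    n_train = int(data.shape[0] * rate)      # number of training rows
    return data[order[:n_train]], data[order[n_train:]]

data = np.arange(20).reshape(10, 2)          # 10 toy rows
train_part, test_part = split_rows(data, 0.8)
print(train_part.shape, test_part.shape)
```

Every row ends up in exactly one of the two parts, and the same seed gives the same split.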

train,test = randSplit(aba,0.8)

# Explore the training set
train.head()

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
0	-1	0.590	0.470	0.170	0.9000	0.3550	0.1905	0.2500	11
1	1	0.560	0.450	0.145	0.9355	0.4250	0.1645	0.2725	11
2	-1	0.635	0.535	0.190	1.2420	0.5760	0.2475	0.3900	14
3	1	0.505	0.390	0.115	0.5585	0.2575	0.1190	0.1535	8
4	1	0.510	0.410	0.145	0.7960	0.3865	0.1815	0.1955	8

train.shape

(3340, 9)

abalone.describe()

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
count	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000	4177.000000
mean	0.052909	0.523992	0.407881	0.139516	0.828742	0.359367	0.180594	0.238831	9.933684
std	0.822240	0.120093	0.099240	0.041827	0.490389	0.221963	0.109614	0.139203	3.224169
min	-1.000000	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	-1.000000	0.450000	0.350000	0.115000	0.441500	0.186000	0.093500	0.130000	8.000000
50%	0.000000	0.545000	0.425000	0.140000	0.799500	0.336000	0.171000	0.234000	9.000000
75%	1.000000	0.615000	0.480000	0.165000	1.153000	0.502000	0.253000	0.329000	11.000000
max	1.000000	0.815000	0.650000	1.130000	2.825500	1.488000	0.760000	1.005000	29.000000

train.describe() # summary statistics

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
count	3340.000000	3340.000000	3340.000000	3340.000000	3340.000000	3340.000000	3340.000000	3340.000000	3340.000000
mean	0.060479	0.522754	0.406886	0.138790	0.824906	0.358151	0.179732	0.237158	9.911976
std	0.819021	0.120300	0.099372	0.038441	0.488535	0.222422	0.109036	0.137920	3.223534
min	-1.000000	0.075000	0.055000	0.000000	0.002000	0.001000	0.000500	0.001500	1.000000
25%	-1.000000	0.450000	0.350000	0.115000	0.439000	0.184375	0.092000	0.130000	8.000000
50%	0.000000	0.540000	0.420000	0.140000	0.796750	0.335500	0.171000	0.232000	9.000000
75%	1.000000	0.615000	0.480000	0.165000	1.147250	0.498500	0.250500	0.325000	11.000000
max	1.000000	0.780000	0.630000	0.250000	2.825500	1.488000	0.760000	1.005000	27.000000

dataPlot(train) # view the distribution of the training set


# Explore the test set
test.head()

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
0	1	0.630	0.470	0.150	1.1355	0.5390	0.2325	0.3115	12
1	-1	0.585	0.445	0.140	0.9130	0.4305	0.2205	0.2530	10
2	-1	0.390	0.290	0.125	0.3055	0.1210	0.0820	0.0900	7
3	1	0.525	0.410	0.130	0.9900	0.3865	0.2430	0.2950	15
4	1	0.625	0.475	0.160	1.0845	0.5005	0.2355	0.3105	10

test.shape

(835, 9)

test.describe()

	性别	长度	直径	高度	整体重量	肉重量	内脏重量	壳重	年龄
count	835.000000	835.000000	835.000000	835.000000	835.000000	835.000000	835.000000	835.000000	835.000000
mean	0.022754	0.528808	0.411737	0.140784	0.842714	0.363370	0.183749	0.245320	10.022754
std	0.834341	0.119166	0.098627	0.038664	0.495990	0.218938	0.111510	0.143925	3.230284
min	-1.000000	0.130000	0.100000	0.015000	0.013000	0.004500	0.003000	0.004000	3.000000
25%	-1.000000	0.450000	0.350000	0.115000	0.458000	0.192000	0.096500	0.132750	8.000000
50%	0.000000	0.550000	0.430000	0.140000	0.810000	0.339000	0.170500	0.235000	10.000000
75%	1.000000	0.620000	0.485000	0.170000	1.177250	0.510750	0.259250	0.337000	11.000000
max	1.000000	0.815000	0.650000	0.250000	2.555000	1.145500	0.590000	0.815000	29.000000

dataPlot(test)


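3. Build helper functions

'''
Purpose: take a DataFrame whose last column is the label and return the feature matrix and the label matrix
'''
def get_Mat(dataSet):
    # np.asmatrix is used instead of np.mat, which was removed in NumPy 2.0
    xMat = np.asmatrix(dataSet.iloc[:, :-1].values)
    yMat = np.asmatrix(dataSet.iloc[:, -1].values).T
    return xMat, yMat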

'''
Purpose: visualize the dataset
'''
def plotShow(dataSet):
    xMat, yMat = get_Mat(dataSet)
    plt.scatter(xMat.A[:, 1], yMat.A, c='b', s=5)   # second feature column against the label
    plt.show()

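'''
Purpose: compute the regression coefficients
Parameters: dataSet: the original dataset
Returns: ws: the regression coefficients
'''
def standRegres(dataSet):
    xMat, yMat = get_Mat(dataSet)
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0:
        print('The matrix is singular and cannot be inverted!')
        return
    ws = xTx.I * (xMat.T * yMat)   # xTx.I is the inverse of xTx
    return ws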
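
standRegres solves the normal equations, w = (XᵀX)⁻¹Xᵀy. As a hedged illustration on synthetic data (the line y = 3 + 1.7x and the noise level are assumptions chosen for this example), the sketch below computes the same closed form with plain arrays and compares it with np.linalg.lstsq, which needs no explicit inverse and so does not rely on an exact det == 0 test:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
X = np.column_stack([np.ones_like(x), x])         # first column of ones = intercept
y = 3.0 + 1.7 * x + rng.normal(0, 0.05, 200)      # assumed ground truth plus noise

w_normal = np.linalg.solve(X.T @ X, X.T @ y)      # solve (X^T X) w = X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # general least-squares solver

print(w_normal, w_lstsq)
```

Both routes should agree up to rounding, and the recovered coefficients should lie close to the assumed intercept and slope.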

"""
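Purpose: compute the sum of squared errors (SSE)
Parameters:
    dataSet: the dataset holding the true values
    regres: the function that computes the regression coefficients
Returns: sse: the sum of squared errors
"""
def sseCal(dataSet, regres):
    xMat, yMat = get_Mat(dataSet)
    ws = regres(dataSet)
    yHat = xMat*ws
    sse = ((yMat.A.flatten() - yHat.A.flatten())**2).sum()
    return sse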

Taking the ex0 dataset as an example, check what the functions return:

ex0 = pd.read_table("./datas/ex0.txt",header=None)
ex0.head()

	0	1	2
0	1.0	0.067732	3.176513
1	1.0	0.427810	3.816464
2	1.0	0.995731	4.550095
3	1.0	0.738336	4.256571
4	1.0	0.981083	4.560815

# SSE of simple linear regression
sseCal(ex0, standRegres)

1.3552490816814902

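Build a function that computes the coefficient of determination R²

"""
Purpose: compute the coefficient of determination R²
"""
def rSquare(dataSet, regres):
    xMat, yMat = get_Mat(dataSet)
    sse = sseCal(dataSet, regres)
    sst = ((yMat.A - yMat.mean())**2).sum()   # total sum of squares
    r2 = 1 - sse / sst
    return r2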

Again using the ex0 dataset as an example, check the result:

# R² of simple linear regression
rSquare(ex0, standRegres)

0.9731300889856916
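
Both metrics can be checked by hand on a tiny example. The four values below are made up; with them SSE = 0.10, SST = 5.0 and R² = 1 - 0.10/5.0 = 0.98:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical labels
y_pred = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical predictions

sse = ((y_true - y_pred) ** 2).sum()          # sum of squared errors
sst = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - sse / sst                            # coefficient of determination

print(round(sse, 4), round(sst, 4), round(r2, 4))
```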

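'''
Purpose: compute the predictions of locally weighted linear regression
Parameters:
    testMat: feature matrix of the test set
    xMat: feature matrix of the training set
    yMat: label matrix of the training set
    k: bandwidth of the Gaussian kernel (default 1.0)
Returns: ws, yHat: the coefficients fitted for the last test sample, and the predictions for all test samples
'''
def LWLR(testMat, xMat, yMat, k=1.0):
    n = testMat.shape[0]               # number of test samples
    m = xMat.shape[0]                  # number of training samples
    weights = np.asmatrix(np.eye(m))   # initialise the weight matrix as the identity
    yHat = np.zeros(n)                 # initialise the predictions with zeros
    for i in range(n):
        for j in range(m):
            diffMat = testMat[i] - xMat[j]
            # Gaussian kernel weight of training sample j for test sample i
            weights[j, j] = np.exp((diffMat*diffMat.T)[0, 0] / (-2*k**2))
        xTx = xMat.T*(weights*xMat)
        if np.linalg.det(xTx) == 0:
            print('The matrix is singular and cannot be inverted')
            return
        ws = xTx.I*(xMat.T*(weights*yMat))
        yHat[i] = (testMat[i] * ws)[0, 0]
    return ws, yHat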
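
For each query point x, LWLR weights training sample j by exp(-‖x - x_j‖² / (2k²)) and then solves the weighted normal equations (XᵀWX)w = XᵀWy. The sketch below is a vectorized variant for a single query point, written with plain arrays; lwlr_point is an illustrative name and the data are synthetic, so treat it as one possible formulation rather than a drop-in replacement for LWLR:

```python
import numpy as np

def lwlr_point(x_query, X, y, k=1.0):
    """Predict y at x_query with locally weighted linear regression."""
    sq_dist = ((X - x_query) ** 2).sum(axis=1)   # squared distance to every training row
    w = np.exp(-sq_dist / (2 * k ** 2))          # Gaussian kernel weights, shape (m,)
    XtW = X.T * w                                # equals X.T @ diag(w) without the m x m matrix
    coef = np.linalg.solve(XtW @ X, XtW @ y)     # weighted normal equations
    return x_query @ coef

# Synthetic, exactly linear data: y = 2 + 3x
x = np.linspace(0, 1, 30)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
pred = lwlr_point(np.array([1.0, 0.5]), X, y, k=0.3)
print(pred)
```

Because the toy data are exactly linear, any positive weighting recovers the same line, which makes the result easy to sanity-check. Avoiding the inner Python loop and the m x m weight matrix is the main reason to write it this way.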

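4. Build the weighted linear model

Because the dataset is large and LWLR is extremely slow on it, only the first 99 samples of the training set and the first 99 samples of the test set are used here (the code slices with [:99]).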

"""
Purpose: plot the training and test SSE curves for different values of k
"""
def ssePlot(train, test):
    X0, Y0 = get_Mat(train)
    X1, Y1 = get_Mat(test)
    train_sse = []
    test_sse = []
    for k in np.arange(0.2, 10, 0.5):
        ws1, yHat1 = LWLR(X0[:99], X0[:99], Y0[:99], k)   # predict on the training subset
        sse1 = ((Y0[:99].A.T - yHat1)**2).sum()
        train_sse.append(sse1)
        ws2, yHat2 = LWLR(X1[:99], X0[:99], Y0[:99], k)   # predict on the test subset
        sse2 = ((Y1[:99].A.T - yHat2)**2).sum()
        test_sse.append(sse2)
    plt.figure(figsize=(20, 8), dpi=100)
    plt.plot(np.arange(0.2, 10, 0.5), train_sse, color='b')
    plt.plot(np.arange(0.2, 10, 0.5), test_sse, color='r')
    plt.xlabel('value of k')
    plt.ylabel('SSE')
    plt.legend(['train_sse', 'test_sse'])

Result:

ssePlot(train,test)

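The plot is best read from right to left. When k is large, the model is stable. As k decreases, the training SSE gradually falls, and at k of about 2 the training SSE equals the test SSE. As k decreases further, the training SSE keeps shrinking, meaning the model fits the training set better and better, but its performance on the test set gets worse, which indicates overfitting. This agrees with the earlier plots for different k values: the three plots for k = 1.0, 0.01 and 0.003 also show the model overfitting as k decreases. From this plot, a value of k around 2 is the best choice.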
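
The trade-off described above can be explored on synthetic data. The sketch below is a toy under stated assumptions: the sine-shaped data, the noise level, the grid of k values and the helper lwlr_predict are all chosen for illustration and have nothing to do with the abalone results. It fits LWLR for several k and prints the training and test SSE side by side:

```python
import numpy as np

def lwlr_predict(X_query, X, y, k):
    """LWLR predictions for every row of X_query."""
    preds = np.empty(X_query.shape[0])
    for i, xq in enumerate(X_query):
        w = np.exp(-((X - xq) ** 2).sum(axis=1) / (2 * k ** 2))   # Gaussian weights
        sw = np.sqrt(w)                         # weighted least squares via sqrt-weights
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        preds[i] = xq @ coef
    return preds

def make_data(n, rng):
    """Assumed toy data: noisy sine curve with an intercept column."""
    x = rng.uniform(0, 1, n)
    X = np.column_stack([np.ones(n), x])
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return X, y

rng = np.random.default_rng(42)
X_train, y_train = make_data(60, rng)
X_test, y_test = make_data(60, rng)

ks = [0.02, 0.05, 0.2, 1.0, 5.0]
train_sse, test_sse = [], []
for k in ks:
    train_sse.append(((y_train - lwlr_predict(X_train, X_train, y_train, k)) ** 2).sum())
    test_sse.append(((y_test - lwlr_predict(X_test, X_train, y_train, k)) ** 2).sum())
    print(f"k={k:<5} train SSE={train_sse[-1]:8.3f}  test SSE={test_sse[-1]:8.3f}")
```

On data like this one would expect the training SSE to shrink as k decreases while the test SSE eventually stops improving; the exact numbers depend on the seed and the noise level.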

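Next, plug k = 2 into the locally weighted linear regression model and look at the predictions:

train,test = randSplit(aba,0.8) # randomly split the dataset into a training set and a test set
trainX,trainY = get_Mat(train) # split the training set into a feature matrix and a label matrix
testX,testY = get_Mat(test) # split the test set into a feature matrix and a label matrix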
ws0,yHat0 = LWLR(testX,trainX,trainY,k=2)

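Plot the true values against the predicted values

y=testY.A.flatten()
plt.scatter(y,yHat0,c='b',s=5); # in a notebook the trailing ; suppresses the text output, so only the figure is shown

In the plot above, the x-axis is the true value and the y-axis is the predicted value. The points form a trumpet shape: predictions rise with the true values, but their spread widens, which means the prediction error grows as the true value increases.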

Wrap the SSE and R² computation in a function for later reuse

"""
Purpose: compute the SSE and R² of the weighted linear regression
"""
def LWLR_pre(dataSet):
    train, test = randSplit(dataSet, 0.8)
    trainX, trainY = get_Mat(train)
    testX, testY = get_Mat(test)
    ws, yHat = LWLR(testX, trainX, trainY, k=2)
    sse = ((testY.A.T - yHat)**2).sum()
    sst = ((testY.A - testY.mean())**2).sum()
    r2 = 1 - sse / sst
    return sse, r2

Check the model's prediction results

LWLR_pre(aba)

(4152.777097646255, 0.5228101340130846)

The result shows an SSE above 4000 and an R² of only 0.52, so the model does not perform very well.
