Weighted linear regression case study: predicting the age of abalone

Dataset: https://download.csdn.net/download/weixin_44827418/12553408

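1. Import the dataset

Dataset description:

import pandas as pd
import numpy as np
abalone = pd.read_table("./datas/abalone.txt",header=None)
# column names, in order: sex, length, diameter, height, whole weight, meat weight, viscera weight, shell weight, age
abalone.columns=['性别','长度','直径','高度','整体重量','肉重量','内脏重量','壳重','年龄']
abalone.head()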
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
0 1 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.150 15
1 1 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.070 7
2 -1 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.210 9
3 1 0.440 0.365 0.125 0.5160 0.2155 0.1140 0.155 10
4 0 0.330 0.255 0.080 0.2050 0.0895 0.0395 0.055 7
abalone.shape
(4177, 9)
abalone.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   性别      4177 non-null   int64
 1   长度      4177 non-null   float64
 2   直径      4177 non-null   float64
 3   高度      4177 non-null   float64
 4   整体重量    4177 non-null   float64
 5   肉重量     4177 non-null   float64
 6   内脏重量    4177 non-null   float64
 7   壳重      4177 non-null   float64
 8   年龄      4177 non-null   int64
dtypes: float64(7), int64(2)
memory usage: 293.8 KB
abalone.describe()
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
count 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000
mean 0.052909 0.523992 0.407881 0.139516 0.828742 0.359367 0.180594 0.238831 9.933684
std 0.822240 0.120093 0.099240 0.041827 0.490389 0.221963 0.109614 0.139203 3.224169
min -1.000000 0.075000 0.055000 0.000000 0.002000 0.001000 0.000500 0.001500 1.000000
25% -1.000000 0.450000 0.350000 0.115000 0.441500 0.186000 0.093500 0.130000 8.000000
50% 0.000000 0.545000 0.425000 0.140000 0.799500 0.336000 0.171000 0.234000 9.000000
75% 1.000000 0.615000 0.480000 0.165000 1.153000 0.502000 0.253000 0.329000 11.000000
max 1.000000 0.815000 0.650000 1.130000 2.825500 1.488000 0.760000 1.005000 29.000000

2. Examine the data distribution

import numpy as np
import pandas as pd
import random
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif']=['simhei'] # font that can render the Chinese column names
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
%matplotlib inline
mpl.cm.rainbow(np.linspace(0,1,10))
array([[5.00000000e-01, 0.00000000e+00, 1.00000000e+00, 1.00000000e+00],[2.80392157e-01, 3.38158275e-01, 9.85162233e-01, 1.00000000e+00],[6.07843137e-02, 6.36474236e-01, 9.41089253e-01, 1.00000000e+00],[1.66666667e-01, 8.66025404e-01, 8.66025404e-01, 1.00000000e+00],[3.86274510e-01, 9.84086337e-01, 7.67362681e-01, 1.00000000e+00],[6.13725490e-01, 9.84086337e-01, 6.41213315e-01, 1.00000000e+00],[8.33333333e-01, 8.66025404e-01, 5.00000000e-01, 1.00000000e+00],[1.00000000e+00, 6.36474236e-01, 3.38158275e-01, 1.00000000e+00],[1.00000000e+00, 3.38158275e-01, 1.71625679e-01, 1.00000000e+00],[1.00000000e+00, 1.22464680e-16, 6.12323400e-17, 1.00000000e+00]])
mpl.cm.rainbow(np.linspace(0,1,10))[0]
array([0.5, 0. , 1. , 1. ])
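def dataPlot(dataSet):
    m,n = dataSet.shape
    fig = plt.figure(figsize=(8,20),dpi=100)
    colormap = mpl.cm.rainbow(np.linspace(0,1,n))
    for i in range(n):
        fig_ = fig.add_subplot(n,1,i+1)
        # passing one RGBA row as c triggers the matplotlib warning shown below; color=colormap[i] would avoid it
        plt.scatter(range(m),dataSet.iloc[:,i].values,s=2,c=colormap[i])
        plt.title(dataSet.columns[i])
        plt.tight_layout(pad=1.2) # adjust the spacing between subplots
# Run the function to view the data distribution:
dataPlot(abalone)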
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.

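The scatter plots of the data show that:

1) Apart from "性别" (sex), every column is clearly arranged in a regular order along the row index

2) The "高度" (height) feature contains two outliers

Based on these observations, we take the following two measures:

1) When splitting into training and test sets, shuffle the original dataset so that samples are picked at random

2) Remove the outliers in the "高度" (height) feature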

abalone['高度']<0.4
0       True
1       True
2       True
3       True
4       True
        ...
4172    True
4173    True
4174    True
4175    True
4176    True
Name: 高度, Length: 4177, dtype: bool
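aba = abalone.loc[abalone['高度']<0.4,:].reset_index(drop=True) # reset the index so that randSplit can rely on labels 0..m-1
# View the distribution of the dataset again
dataPlot(aba)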

3. Split into training and test sets

"""
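Purpose: randomly split the data into a training set and a test set
Parameters:
    dataSet: the original dataset (assumed to carry the default 0..m-1 index)
    rate: proportion of samples used for training
Returns:
    train, test: the training set and the test set
"""
def randSplit(dataSet,rate):
    l = list(dataSet.index) # collect the index labels of the original dataset in a list
    random.seed(123) # fix the random seed
    random.shuffle(l) # shuffle the index labels
    dataSet.index = l # assign the shuffled labels back; shuffling the labels is equivalent to shuffling the rows
    m = dataSet.shape[0] # total number of samples
    n = int(m*rate) # number of training samples
    train = dataSet.loc[range(n),:] # rows labelled 0..n-1 form the training set
    test = dataSet.loc[range(n,m),:] # rows labelled n..m-1 form the test set
    train.index = range(train.shape[0]) # reset the index of the training set
    test.index = range(test.shape[0]) # reset the index of the test set
    dataSet.index = range(dataSet.shape[0]) # reset the index of the original dataset
    return train,test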
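
As an alternative, the shuffle-and-cut idea can be sketched without rewriting the index of the input frame. This is only a sketch: the name `shuffle_split`, the `seed` parameter and the toy frame are hypothetical and not from the original post, and because it shuffles with `DataFrame.sample` it will not select the same rows as randSplit.

```python
import pandas as pd


def shuffle_split(data, rate, seed=123):
    """Shuffle the rows, then cut them into a training part and a test part."""
    # sample(frac=1) returns every row in random order and leaves `data` untouched
    shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = int(len(shuffled) * rate)
    train = shuffled.iloc[:n_train].reset_index(drop=True)
    test = shuffled.iloc[n_train:].reset_index(drop=True)
    return train, test


# toy frame with a gapped index, like a frame that had rows filtered out
demo = pd.DataFrame({"x": range(10), "y": range(10, 20)}, index=range(0, 20, 2))
demo_train, demo_test = shuffle_split(demo, 0.8)
print(demo_train.shape, demo_test.shape)
```

Because the split works on row positions (`iloc`) rather than on index labels, it does not matter whether rows were removed from the frame beforehand.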
train,test = randSplit(aba,0.8)
# Explore the training set
train.head()
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
0 -1 0.590 0.470 0.170 0.9000 0.3550 0.1905 0.2500 11
1 1 0.560 0.450 0.145 0.9355 0.4250 0.1645 0.2725 11
2 -1 0.635 0.535 0.190 1.2420 0.5760 0.2475 0.3900 14
3 1 0.505 0.390 0.115 0.5585 0.2575 0.1190 0.1535 8
4 1 0.510 0.410 0.145 0.7960 0.3865 0.1815 0.1955 8
train.shape
(3340, 9)
abalone.describe()
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
count 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000 4177.000000
mean 0.052909 0.523992 0.407881 0.139516 0.828742 0.359367 0.180594 0.238831 9.933684
std 0.822240 0.120093 0.099240 0.041827 0.490389 0.221963 0.109614 0.139203 3.224169
min -1.000000 0.075000 0.055000 0.000000 0.002000 0.001000 0.000500 0.001500 1.000000
25% -1.000000 0.450000 0.350000 0.115000 0.441500 0.186000 0.093500 0.130000 8.000000
50% 0.000000 0.545000 0.425000 0.140000 0.799500 0.336000 0.171000 0.234000 9.000000
75% 1.000000 0.615000 0.480000 0.165000 1.153000 0.502000 0.253000 0.329000 11.000000
max 1.000000 0.815000 0.650000 1.130000 2.825500 1.488000 0.760000 1.005000 29.000000
train.describe() # summary statistics
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
count 3340.000000 3340.000000 3340.000000 3340.000000 3340.000000 3340.000000 3340.000000 3340.000000 3340.000000
mean 0.060479 0.522754 0.406886 0.138790 0.824906 0.358151 0.179732 0.237158 9.911976
std 0.819021 0.120300 0.099372 0.038441 0.488535 0.222422 0.109036 0.137920 3.223534
min -1.000000 0.075000 0.055000 0.000000 0.002000 0.001000 0.000500 0.001500 1.000000
25% -1.000000 0.450000 0.350000 0.115000 0.439000 0.184375 0.092000 0.130000 8.000000
50% 0.000000 0.540000 0.420000 0.140000 0.796750 0.335500 0.171000 0.232000 9.000000
75% 1.000000 0.615000 0.480000 0.165000 1.147250 0.498500 0.250500 0.325000 11.000000
max 1.000000 0.780000 0.630000 0.250000 2.825500 1.488000 0.760000 1.005000 27.000000
dataPlot(train) # view the distribution of the training set

# Explore the test set
test.head()
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
0 1 0.630 0.470 0.150 1.1355 0.5390 0.2325 0.3115 12
1 -1 0.585 0.445 0.140 0.9130 0.4305 0.2205 0.2530 10
2 -1 0.390 0.290 0.125 0.3055 0.1210 0.0820 0.0900 7
3 1 0.525 0.410 0.130 0.9900 0.3865 0.2430 0.2950 15
4 1 0.625 0.475 0.160 1.0845 0.5005 0.2355 0.3105 10
test.shape
(835, 9)
test.describe()
性别 长度 直径 高度 整体重量 肉重量 内脏重量 壳重 年龄
count 835.000000 835.000000 835.000000 835.000000 835.000000 835.000000 835.000000 835.000000 835.000000
mean 0.022754 0.528808 0.411737 0.140784 0.842714 0.363370 0.183749 0.245320 10.022754
std 0.834341 0.119166 0.098627 0.038664 0.495990 0.218938 0.111510 0.143925 3.230284
min -1.000000 0.130000 0.100000 0.015000 0.013000 0.004500 0.003000 0.004000 3.000000
25% -1.000000 0.450000 0.350000 0.115000 0.458000 0.192000 0.096500 0.132750 8.000000
50% 0.000000 0.550000 0.430000 0.140000 0.810000 0.339000 0.170500 0.235000 10.000000
75% 1.000000 0.620000 0.485000 0.170000 1.177250 0.510750 0.259250 0.337000 11.000000
max 1.000000 0.815000 0.650000 0.250000 2.555000 1.145500 0.590000 0.815000 29.000000
dataPlot(test)

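4. Build helper functions

'''
Purpose: take a DataFrame whose last column is the label and return the feature matrix and the label matrix
'''
def get_Mat(dataSet):
    # np.asmatrix is equivalent to np.mat, which NumPy 2.0 removed
    xMat = np.asmatrix(dataSet.iloc[:,:-1].values)
    yMat = np.asmatrix(dataSet.iloc[:,-1].values).T
    return xMat,yMat
'''
Purpose: visualize the dataset
'''
def plotShow(dataSet):
    xMat,yMat = get_Mat(dataSet)
    plt.scatter(xMat.A[:,1],yMat.A,c='b',s=5)
    plt.show()
'''
Purpose: compute the regression coefficients
Parameters:
    dataSet: the original dataset
Returns:
    ws: the regression coefficients
'''
def standRegres(dataSet):
    xMat,yMat = get_Mat(dataSet)
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0:
        print('The matrix is singular and cannot be inverted!')
        return
    ws = xTx.I*(xMat.T*yMat) # xTx.I is the inverse of xTx
    return ws
"""
Purpose: compute the sum of squared errors (SSE)
Parameters:
    dataSet: dataset holding the true values
    regres: function that computes the regression coefficients
Returns:
    sse: the sum of squared errors
"""
def sseCal(dataSet, regres):
    xMat,yMat = get_Mat(dataSet)
    ws = regres(dataSet)
    yHat = xMat*ws
    sse = ((yMat.A.flatten() - yHat.A.flatten())**2).sum()
    return sse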
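
standRegres solves the normal equation w = (XᵀX)⁻¹Xᵀy, and sseCal sums the squared residuals of the resulting fit. The sketch below checks that closed form against NumPy's least-squares solver on synthetic data. The data, the seed and all variable names are made up for illustration; they are not taken from ex0 or the abalone file.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic design matrix: an intercept column of ones plus one feature
X = np.column_stack([np.ones(50), rng.uniform(0, 1, 50)])
y = 3.0 + 1.7 * X[:, 1] + rng.normal(0, 0.1, 50)

# closed form used by standRegres: w = (X^T X)^{-1} X^T y
w_normal = np.linalg.inv(X.T @ X) @ (X.T @ y)

# least-squares solver: same solution without forming the inverse explicitly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# sum of squared errors of the fitted line, as in sseCal
sse = float(((y - X @ w_lstsq) ** 2).sum())
print(w_normal, w_lstsq, sse)
```

A solver such as `np.linalg.lstsq` may be the safer choice in practice, because an exact `det(xTx) == 0` test on floating-point numbers rarely fires even when the matrix is nearly singular.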

Taking the ex0 dataset as an example, check what the functions return:

ex0 = pd.read_table("./datas/ex0.txt",header=None)
ex0.head()
0 1 2
0 1.0 0.067732 3.176513
1 1.0 0.427810 3.816464
2 1.0 0.995731 4.550095
3 1.0 0.738336 4.256571
4 1.0 0.981083 4.560815
# SSE of simple linear regression
sseCal(ex0, standRegres)
1.3552490816814902

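Build a function that computes the coefficient of determination R²

"""
Purpose: compute the coefficient of determination R²
"""
def rSquare(dataSet,regres):
    xMat,yMat=get_Mat(dataSet)
    sse = sseCal(dataSet,regres)
    sst = ((yMat.A-yMat.mean())**2).sum()
    r2 = 1 - sse / sst
    return r2

Again taking the ex0 dataset as an example, check what the function returns:

# R² of simple linear regression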
rSquare(ex0, standRegres)
0.9731300889856916
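
For reference, the quantity returned by rSquare can be written as follows. The notation is introduced here and is not from the original post: y_i are the true labels, ŷ_i the predictions, ȳ the mean label and m the number of samples.

```latex
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}, \qquad
\mathrm{SSE} = \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{m} \left(y_i - \bar{y}\right)^2
```

An R² of 1 means the fit reproduces every label, while a value near 0 means it does no better than always predicting the mean.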
'''
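Purpose: compute the predictions of locally weighted linear regression
Parameters:
    testMat: feature matrix of the test set
    xMat: feature matrix of the training set
    yMat: label matrix of the training set
    k: bandwidth of the Gaussian kernel
Returns:
    ws: regression coefficients fitted for the last test sample
    yHat: predicted values
'''
def LWLR(testMat,xMat,yMat,k=1.0):
    n = testMat.shape[0] # number of rows in the test set
    m = xMat.shape[0] # number of rows in the training feature matrix
    weights = np.asmatrix(np.eye(m)) # initialize the weight matrix as the identity (np.asmatrix replaces np.mat)
    yHat = np.zeros(n) # initialize the predictions with zeros
    for i in range(n):
        for j in range(m):
            diffMat = testMat[i] - xMat[j]
            dist2 = (diffMat*diffMat.T)[0,0] # squared distance as a scalar
            weights[j,j] = np.exp(dist2 / (-2*k**2))
        xTx = xMat.T*(weights*xMat)
        if np.linalg.det(xTx) == 0:
            print('The matrix is singular and cannot be inverted')
            return
        ws = xTx.I*(xMat.T*(weights*yMat))
        yHat[i] = (testMat[i] * ws)[0,0]
    return ws,yHat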
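
The inner loop over j and the m x m weight matrix are what make LWLR slow. A possible speed-up, sketched below under the assumption that plain NumPy arrays are acceptable in place of np.matrix, computes all weights for one query point at once and never builds the full weight matrix. The name `lwlr_vectorized` and the synthetic check are hypothetical and not part of the original post. Unlike LWLR it returns only the predictions, and `np.linalg.solve` raises an error on a singular system instead of printing a message.

```python
import numpy as np


def lwlr_vectorized(test_X, train_X, train_y, k=1.0):
    """Locally weighted linear regression with a Gaussian kernel (NumPy arrays)."""
    test_X = np.asarray(test_X, dtype=float)
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y, dtype=float).ravel()
    y_hat = np.empty(test_X.shape[0])
    for i, x0 in enumerate(test_X):
        # squared Euclidean distance from the query point to every training row
        sq_dist = ((train_X - x0) ** 2).sum(axis=1)
        w = np.exp(-sq_dist / (2.0 * k ** 2))   # diagonal of the weight matrix W
        Xw = train_X * w[:, None]                # equals W @ X without building W
        # solve (X^T W X) ws = X^T W y for this query point
        ws = np.linalg.solve(train_X.T @ Xw, Xw.T @ train_y)
        y_hat[i] = x0 @ ws
    return y_hat


# synthetic check: exactly linear data should be reproduced for any bandwidth
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(30), rng.uniform(0, 1, 30)])
y_demo = 2.0 + 0.5 * X_demo[:, 1]
pred = lwlr_vectorized(X_demo[:5], X_demo, y_demo, k=0.5)
print(np.allclose(pred, y_demo[:5]))
```

With the matrices from get_Mat, a call could look like `lwlr_vectorized(testX.A, trainX.A, trainY.A, k=2)`.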

5. Build the weighted linear model

Because the dataset is large and the computation is extremely slow, only about the first 100 training samples and the first 100 test samples are used here (the code slices with [:99], that is, 99 samples each).

"""
Purpose: plot the training and test SSE curves for different values of k
"""
def ssePlot(train,test):
    X0,Y0 = get_Mat(train)
    X1,Y1 = get_Mat(test)
    train_sse = []
    test_sse = []
    for k in np.arange(0.2,10,0.5):
        ws1,yHat1 = LWLR(X0[:99],X0[:99],Y0[:99],k)
        sse1 = ((Y0[:99].A.T - yHat1)**2).sum()
        train_sse.append(sse1)
        ws2,yHat2 = LWLR(X1[:99],X0[:99],Y0[:99],k)
        sse2 = ((Y1[:99].A.T - yHat2)**2).sum()
        test_sse.append(sse2)
    plt.figure(figsize=(20,8),dpi=100)
    plt.plot(np.arange(0.2,10,0.5),train_sse,color='b')
    plt.plot(np.arange(0.2,10,0.5),test_sse,color='r')
    plt.xlabel('value of k')
    plt.ylabel('SSE')
    plt.legend(['train_sse','test_sse'])

Result:

ssePlot(train,test)

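This plot is best read from right to left. When k is large, the model is fairly stable. As k decreases, the training SSE gradually falls, and at around k = 2 the training SSE equals the test SSE. As k decreases further, the training SSE keeps shrinking, that is, the model fits the training set better and better, but its performance on the test set gets worse and worse, which indicates that the model has started to overfit. This plot is in fact consistent with the earlier plots for different values of k: the three plots for k = 1.0, 0.01 and 0.003 also show that the model gradually overfits as k decreases. So a value of k around 2 looks like the best choice here.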
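
The overfitting trend follows from how the Gaussian kernel in LWLR spreads its weights. The sketch below evaluates the same weight formula, exp(-d² / (2k²)), for a few distances and bandwidths. The helper name `gaussian_weights` and the sample distances are hypothetical values chosen for illustration.

```python
import numpy as np


def gaussian_weights(distances, k):
    """Kernel weight of a training point at the given distance from the query point."""
    distances = np.asarray(distances, dtype=float)
    return np.exp(-distances ** 2 / (2.0 * k ** 2))


distances = np.array([0.0, 0.5, 1.0, 2.0])
for k in (0.1, 1.0, 10.0):
    # small k: only points very close to the query keep a noticeable weight
    # large k: all points get nearly equal weight, close to ordinary least squares
    print(k, np.round(gaussian_weights(distances, k), 4))
```

With a very small k each prediction is driven almost entirely by the nearest training points, which is why the training SSE keeps dropping while the test SSE rises.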

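We now plug k = 2 into the locally weighted linear regression model and look at the predictions:

train,test = randSplit(aba,0.8) # randomly split the original dataset into a training set and a test set
trainX,trainY = get_Mat(train) # split the training set into a feature matrix and a label matrix
testX,testY = get_Mat(test) # split the test set into a feature matrix and a label matrix
ws0,yHat0 = LWLR(testX,trainX,trainY,k=2)

Plot the relationship between the true values and the predicted values

y=testY.A.flatten()
plt.scatter(y,yHat0,c='b',s=5); # the trailing semicolon suppresses the text output of the call in a notebook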

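In the plot above, the horizontal axis is the true value and the vertical axis is the predicted value. The points form a funnel (trumpet) shape: the predictions grow with the true values, but their spread widens as well, which shows that the prediction error increases as the true value increases.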
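
One way to put a number on the funnel shape is to average the absolute error for each true value. The sketch below does this on synthetic data whose noise grows with the true value. The helper `residual_spread` and the generated arrays are hypothetical stand-ins and do not use the abalone predictions.

```python
import numpy as np
import pandas as pd


def residual_spread(y_true, y_pred):
    """Mean absolute residual for each distinct true value."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    frame = pd.DataFrame({"y_true": y_true, "abs_err": np.abs(y_true - y_pred)})
    return frame.groupby("y_true")["abs_err"].mean()


# synthetic stand-in for (y, yHat0): the noise scale is proportional to the true value
rng = np.random.default_rng(2)
y_true = rng.integers(5, 20, size=2000)
y_pred = y_true + rng.normal(0, 0.1 * y_true)
spread = residual_spread(y_true, y_pred)
print(spread.head())
```

On the real predictions the call would be `residual_spread(y, yHat0)`; a mean error that rises with the true value would confirm what the scatter plot suggests.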

Wrap the SSE and R² computation in a function so that it can be reused later

"""
Purpose: compute the SSE and R² of locally weighted linear regression
"""
def LWLR_pre(dataSet):
    train,test = randSplit(dataSet,0.8)
    trainX,trainY = get_Mat(train)
    testX,testY = get_Mat(test)
    ws,yHat = LWLR(testX,trainX,trainY,k=2)
    sse = ((testY.A.T - yHat)**2).sum()
    sst = ((testY.A-testY.mean())**2).sum()
    r2 = 1 - sse / sst
    return sse,r2

Check the model's prediction results

LWLR_pre(aba)
(4152.777097646255, 0.5228101340130846)

The results show an SSE above 4000 and an R² of only about 0.52, so the model does not perform very well.
