Modeling and Predicting Boston House Prices with LightGBM

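Goal: predict the final sale price of each house from its attributes.

Workflow:

I. Analyze the data fields

The effect of each field on the result
Continuous versus discrete values

II. Examine the distribution of the data

Whether it follows a normal distribution
Data transformations

III. Data preprocessing

Missing-value imputation
Label encoding

IV. Modeling

The LightGBM model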

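I. Exploratory data analysis

1. Load the data and understand what it means

The data consists of a training set and a test set. The number of rows is modest, but there are many variables. Their meanings are listed below. Note that these column names are those of the Ames, Iowa housing data, as the Neighborhood description indicates.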

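MSSubClass: building class
MSZoning: general zoning classification
LotFrontage: linear feet of street connected to the property
LotArea: lot size in square feet
Street: type of road access
Alley: type of alley access
LotShape: general shape of the property
LandContour: flatness of the property
Utilities: type of utilities available
LotConfig: lot configuration
LandSlope: slope of the property
Neighborhood: physical location within the Ames city limits
Condition1: proximity to a main road or railroad
Condition2: proximity to a main road or railroad (if a second is present)
BldgType: type of dwelling
HouseStyle: style of dwelling
OverallQual: overall material and finish quality
OverallCond: overall condition rating
YearBuilt: original construction date
YearRemodAdd: remodel date
RoofStyle: type of roof
RoofMatl: roof material
Exterior1st: exterior covering on the house
Exterior2nd: exterior covering on the house (if more than one material)
MasVnrType: masonry veneer type
MasVnrArea: masonry veneer area in square feet
ExterQual: exterior material quality
ExterCond: present condition of the exterior material
Foundation: type of foundation
BsmtQual: height of the basement
BsmtCond: general condition of the basement
BsmtExposure: walkout or garden-level basement walls
BsmtFinType1: quality of the basement finished area
BsmtFinSF1: type 1 finished area in square feet
BsmtFinType2: quality of the second finished area (if present)
BsmtFinSF2: type 2 finished area in square feet
BsmtUnfSF: unfinished basement area in square feet
TotalBsmtSF: total basement area in square feet
Heating: type of heating
HeatingQC: heating quality and condition
CentralAir: whether there is central air conditioning
Electrical: type of electrical system
1stFlrSF: first-floor area in square feet
2ndFlrSF: second-floor area in square feet
LowQualFinSF: low-quality finished area in square feet (all floors)
GrLivArea: above-ground living area in square feet
BsmtFullBath: full bathrooms in the basement
BsmtHalfBath: half bathrooms in the basement
FullBath: full bathrooms above ground
HalfBath: half bathrooms above ground
BedroomAbvGr: number of bedrooms above basement level
KitchenAbvGr: number of kitchens
KitchenQual: kitchen quality
TotRmsAbvGrd: total rooms above ground (excluding bathrooms)
Functional: home functionality rating
Fireplaces: number of fireplaces
FireplaceQu: fireplace quality
GarageType: garage location
GarageYrBlt: year the garage was built
GarageFinish: interior finish of the garage
GarageCars: garage capacity in cars
GarageArea: garage area in square feet
GarageQual: garage quality
GarageCond: garage condition
PavedDrive: paved driveway
WoodDeckSF: wood deck area in square feet
OpenPorchSF: open porch area in square feet
EnclosedPorch: enclosed porch area in square feet
3SsnPorch: three-season porch area in square feet
ScreenPorch: screen porch area in square feet
PoolArea: pool area in square feet
PoolQC: pool quality
Fence: fence quality
MiscFeature: miscellaneous feature not covered by other categories
MiscVal: value of the miscellaneous feature
MoSold: month sold
YrSold: year sold
SaleType: type of sale
SaleCondition: condition of sale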

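Distribution of the target variable

The target variable is roughly bell-shaped but noticeably skewed. Its skewness and kurtosis confirm that the skew is sizeable, so the target will be transformed later.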
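
Skewness and kurtosis can be checked numerically rather than by eye. The sketch below uses a synthetic right-skewed sample as a stand-in for train['SalePrice'], since the real data is not reproduced here; the sample and its parameters are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train['SalePrice']: a log-normal sample is right-skewed,
# like typical house prices. The parameters are illustrative assumptions.
rng = np.random.default_rng(0)
sale_price = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=2000), name="SalePrice")

# pandas reports sample skewness and excess kurtosis (both 0 for a normal distribution)
skewness = sale_price.skew()
kurtosis = sale_price.kurt()
print("Skewness: {:.3f}".format(skewness))
print("Kurtosis: {:.3f}".format(kurtosis))
```

A skewness well above 0 indicates a long right tail, which is the usual motivation for a log-type transform of the target.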

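2. Effect of important attributes on the target variable

# Above-ground living area (square feet). Conclusion: the larger the living area, the higher the price

# Basement area (square feet). Conclusion: the larger the basement, the higher the price

# Overall material and finish quality. Conclusion: the higher the quality rating, the higher the price

# Original construction date. Conclusion: no obvious relationship between construction date and price

3. Correlations between variables, and which variables affect the price most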

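# Use numeric columns only; recent pandas versions raise an error if corr() meets non-numeric columns
corr = train.select_dtypes(include='number').corr()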
f,ax = plt.subplots(figsize = (14,8))
sns.heatmap(corr,square = True,cmap = 'Blues')

Select the 10 variables most strongly correlated with the sale price

k = 10
cols = corr.nlargest(k,'SalePrice')['SalePrice'].index
cm = np.corrcoef(train[cols].values.T)
sns.set(font_scale = 1.25)
hm = sns.heatmap(cm,cbar = True,annot = True,square = True,fmt = '.2f',annot_kws = {'size': 10},yticklabels = cols.values,xticklabels = cols.values,cmap = 'Blues')
plt.show()

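Scatter plots of the most strongly correlated variables (three of them rather than five, since five do not fit in one figure)

The strongly correlated variables clearly show a positive relationship with the sale price.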

sns.set()
cols = ['SalePrice','OverallQual','GrLivArea','GarageCars']
# The `size` argument was renamed to `height` in seaborn 0.9
sns.pairplot(train[cols], height=2.0)
plt.show()

II. Data cleaning

1. Check for missing values

2. Remove outliers
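
The outlier step is listed without code. One common approach for this kind of data is to drop houses with a very large living area but an unusually low price, since they sit far from the main trend in the GrLivArea scatter plot. The sketch below runs on a tiny made-up frame; the thresholds (4000 square feet, 300000) and the helper name drop_outliers are assumptions, not values taken from this article.

```python
import pandas as pd

def drop_outliers(df, area_col="GrLivArea", price_col="SalePrice",
                  max_area=4000, min_price=300000):
    """Drop rows with a very large living area but a low sale price.

    The thresholds are illustrative assumptions and should be chosen
    by inspecting the scatter plot of the real data.
    """
    is_outlier = (df[area_col] > max_area) & (df[price_col] < min_price)
    return df.loc[~is_outlier].reset_index(drop=True)

# Tiny made-up example: the third row is large but cheap
toy = pd.DataFrame({
    "GrLivArea": [1500, 2200, 5600, 4500],
    "SalePrice": [180000, 250000, 160000, 750000],
})
cleaned = drop_outliers(toy)
print(cleaned)
```

Only the large-but-cheap row is dropped; the large and expensive house follows the trend and is kept.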

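3. Log-transform the target variable

Earlier the target looked roughly normal. Checking further, the Q-Q plot makes it clear that the distribution is strongly skewed, so a transformation is needed to bring it closer to a normal distribution.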

# Stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from scipy import stats

# distplot is deprecated in recent seaborn releases but still supports fit=
sns.distplot(train['SalePrice'], fit=norm)
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Distribution plot (raw string so that \m and \s are not treated as escapes)
plt.legend([r'Normal dist. ($\mu=${:.2f} and $\sigma=${:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()

After the transform the distribution is much closer to normal:

# Log transform: log(1 + x)
train['SalePrice'] = np.log1p(train['SalePrice'])
# Look at the new distribution
sns.distplot(train['SalePrice'], fit=norm)
# Fitted parameters
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
# Plot
plt.legend([r'Normal dist. ($\mu=${:.2f} and $\sigma=${:.2f})'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
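
Because the model is trained on log1p(SalePrice), its predictions are on the log scale too and have to be mapped back with np.expm1 before they can be read as prices. A minimal sketch with made-up prices:

```python
import numpy as np

# Made-up prices, for illustration only
prices = np.array([120000.0, 185000.0, 250000.0, 755000.0])

# Forward transform used for the target: log(1 + x)
log_prices = np.log1p(prices)

# Inverse transform: exp(y) - 1 recovers the original scale
recovered = np.expm1(log_prices)

print(log_prices.round(4))
print(np.allclose(recovered, prices))
```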

4. Handling missing values

train_labels = train['SalePrice'].reset_index(drop=True)
train_features = train.drop(['SalePrice'], axis=1)
test_features = test
# Combine train and test features so both receive the same transformations
all_features = pd.concat([train_features, test_features]).reset_index(drop=True)
all_features.shape

def percent_missing(df):
    # Percentage of missing values per column, rounded to two decimals
    data = pd.DataFrame(df)
    df_cols = list(pd.DataFrame(data))
    dict_x = {}
    for i in range(0, len(df_cols)):
        dict_x.update({df_cols[i]: round(data[df_cols[i]].isnull().mean()*100, 2)})
    return dict_x

missing = percent_missing(all_features)
df_miss = sorted(missing.items(), key=lambda x: x[1], reverse=True)
print('Percent of missing data')
df_miss[0:10]

sns.set_style('white')
f,ax = plt.subplots(figsize=(8,7))
sns.set_color_codes(palette='deep')
missing = round(train.isnull().mean()*100,2)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar(color='b')
#Tweak the visual presentation
ax.xaxis.grid(False)
ax.set(ylabel='Percent of missing values')
ax.set(xlabel='Features')
ax.set(title='Percent missing data by feature')
sns.despine(trim=True,left=True)

# Some of the non-numeric predictors are stored as numbers; convert them into strings
all_features['MSSubClass'] = all_features['MSSubClass'].apply(str)
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)

def handle_missing(features):
    # The data description states that NA refers to typical ('Typ') values
    features['Functional'] = features['Functional'].fillna('Typ')
    # Replace the missing values in each of the columns below with their mode
    features['Electrical'] = features['Electrical'].fillna('SBrkr')
    features['KitchenQual'] = features['KitchenQual'].fillna('TA')
    features['Exterior1st'] = features['Exterior1st'].fillna(features['Exterior1st'].mode()[0])
    features['Exterior2nd'] = features['Exterior2nd'].fillna(features['Exterior2nd'].mode()[0])
    features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
    features['MSZoning'] = features.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))
    # The data description states that NA refers to 'No Pool'
    features['PoolQC'] = features['PoolQC'].fillna('None')
    # Replace the missing values with 0, since no garage = no cars in garage
    for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
        features[col] = features[col].fillna(0)
    # Replace the missing values with 'None'
    for col in ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']:
        features[col] = features[col].fillna('None')
    # NaN values for these categorical basement features mean there is no basement
    for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
        features[col] = features[col].fillna('None')
    # Group by neighborhood and fill in missing values with the median LotFrontage of the neighborhood
    features['LotFrontage'] = features.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
    # We have no particular intuition around how to fill in the rest of the categorical features,
    # so we replace their missing values with 'None'
    objects = []
    for i in features.columns:
        if features[i].dtype == object:
            objects.append(i)
    features.update(features[objects].fillna('None'))
    # And we do the same thing for numerical features, but this time with 0s
    numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    numeric = []
    for i in features.columns:
        if features[i].dtype in numeric_dtypes:
            numeric.append(i)
    features.update(features[numeric].fillna(0))
    return features

all_features = handle_missing(all_features)

Confirm that all missing values have been handled.
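
The check itself is not shown. A minimal way to do it is to count the remaining null cells and expect zero. The helper name count_missing and the toy frame below are assumptions for illustration; the two fills mirror the rules handle_missing applies to PoolQC and GarageArea.

```python
import numpy as np
import pandas as pd

def count_missing(df):
    """Return the total number of missing cells in a DataFrame."""
    return int(df.isnull().sum().sum())

# Made-up frame with one missing categorical and one missing numeric value
toy = pd.DataFrame({
    "PoolQC": ["Gd", None, "Ex"],
    "GarageArea": [480.0, np.nan, 300.0],
})
before = count_missing(toy)

# Same fill rules as handle_missing uses for these two columns
toy["PoolQC"] = toy["PoolQC"].fillna("None")
toy["GarageArea"] = toy["GarageArea"].fillna(0)
after = count_missing(toy)

print(before, after)
```

On the real data, count_missing(all_features) should return 0 once handle_missing has run.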

5. Variable processing

1) Label-encode categorical variables

from sklearn.preprocessing import LabelEncoder

# These columns hold categories rather than quantities, so store them as strings
all_features['MSSubClass'] = all_features['MSSubClass'].apply(str)
all_features['OverallCond'] = all_features['OverallCond'].astype(str)
all_features['YrSold'] = all_features['YrSold'].astype(str)
all_features['MoSold'] = all_features['MoSold'].astype(str)
cols = ('FireplaceQu','BsmtQual','BsmtCond','GarageQual','GarageCond','ExterQual','ExterCond','HeatingQC','PoolQC','KitchenQual','BsmtFinType1','BsmtFinType2','Functional','Fence','BsmtExposure','GarageFinish','LandSlope','LotShape','PavedDrive','Street','Alley','CentralAir','MSSubClass','OverallCond','YrSold','MoSold')
# Replace each category with an integer code
for col in cols:
    lb1 = LabelEncoder()
    lb1.fit(list(all_features[col].values))
    all_features[col] = lb1.transform(list(all_features[col].values))
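
To see what the encoder produces, here is a small sketch on a made-up column in the style of KitchenQual. LabelEncoder assigns codes in sorted order of the labels, so the integers follow alphabetical order rather than quality order.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up quality ratings in the style of the KitchenQual column
kitchen_qual = ["TA", "Gd", "Ex", "TA", "Fa", "Gd"]

encoder = LabelEncoder()
codes = encoder.fit_transform(kitchen_qual)

# classes_ holds the original labels; a label's code is its index in classes_
print(list(encoder.classes_))
print(list(codes))
```

Here 'Ex' becomes 0 and 'TA' becomes 3, so the codes should be read as arbitrary identifiers, not as a ranking.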

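2) Box-Cox transform of numeric variables

Checking the skewness of the numeric variables shows that many of them are highly skewed, which would hurt the later modeling and prediction, so a further transformation is needed.

How the Box-Cox transform works: given a sample of n data points y1, y2, ..., yn, find a suitable transformation so that the transformed sample is as close to normally distributed as possible. This can be done with boxcox1p from scipy.special.

The key to the Box-Cox transform is finding a suitable parameter (lambda); 0.15 is commonly used as an empirical value. The goal is a simple transformation that makes the data more regular.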
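
The transform code is not shown in the article, so the following is only a sketch of the usual pattern: measure the skewness of each numeric column and apply boxcox1p with lambda = 0.15 to the columns that are skewed. For lambda not equal to 0, boxcox1p(x, lambda) computes ((1 + x)^lambda - 1) / lambda. The synthetic features and the skewness cut-off of 0.5 are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.special import boxcox1p
from scipy.stats import skew

# Made-up numeric features: LotArea is strongly right-skewed, OverallQual is not
rng = np.random.default_rng(42)
features = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.8, size=1000),
    "OverallQual": rng.integers(1, 11, size=1000).astype(float),
})

# Skewness of every numeric column, largest first
skewness = features.apply(lambda col: skew(col)).sort_values(ascending=False)

# Assumed cut-off: only transform columns whose skewness exceeds 0.5
skew_threshold = 0.5
lam = 0.15  # empirical lambda mentioned in the text
skewed_cols = skewness[skewness > skew_threshold].index

transformed = features.copy()
for col in skewed_cols:
    # boxcox1p(x, lam) = ((1 + x)**lam - 1) / lam
    transformed[col] = boxcox1p(features[col], lam)

print(skewness.round(3))
print(transformed.apply(lambda col: skew(col)).round(3))
```

scipy.stats.boxcox_normmax, imported earlier in the article, can estimate a per-column lambda instead of using a fixed value.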

III. Building the model

Split back into training and test sets

X = all_features.iloc[:len(train_labels),:]
X_test = all_features.iloc[len(train_labels):,:]
X.shape,train_labels.shape,X_test.shape

Model validation with 12-fold cross-validation

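from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=12, random_state=42, shuffle=True)

def cv_rmse(model, X=X):
    # cross_val_score returns negated MSE, so flip the sign before taking the root
    rmse = np.sqrt(-cross_val_score(model, X, train_labels, scoring='neg_mean_squared_error', cv=kf))
    return rmse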
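
The scoring pattern can be tried end to end without the housing data. The sketch below uses synthetic regression data and Ridge as a lightweight stand-in model (not the LightGBM regressor of this article); the names X_demo, y_demo, kf_demo and cv_rmse_demo are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for X and train_labels
X_demo, y_demo = make_regression(n_samples=240, n_features=8, noise=10.0, random_state=42)

# Same fold settings as the KFold used in the article
kf_demo = KFold(n_splits=12, shuffle=True, random_state=42)

def cv_rmse_demo(model, X, y, cv):
    """Per-fold RMSE: scikit-learn returns negated MSE, so flip the sign first."""
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return np.sqrt(-neg_mse)

# Ridge is only a lightweight stand-in for the LightGBM regressor
scores = cv_rmse_demo(Ridge(alpha=1.0), X_demo, y_demo, kf_demo)
print("folds: {}, mean RMSE: {:.4f}, std: {:.4f}".format(len(scores), scores.mean(), scores.std()))
```

One score is returned per fold, which is why the article reports both a mean and a standard deviation.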

Here we use LightGBM for modeling and prediction

from lightgbm import LGBMRegressor
lightgbm = LGBMRegressor(objective='regression', num_leaves=6, learning_rate=0.01, n_estimators=7000, max_bin=200,
                         bagging_fraction=0.8, bagging_freq=4, bagging_seed=8, feature_fraction=0.2,
                         feature_fraction_seed=8, min_sum_hessian_in_leaf=11, verbose=-1, random_state=42)
score = cv_rmse(lightgbm)
print('lightgbm: {:.4f}({:.4f})'.format(score.mean(), score.std()))

Final result: score.mean = 0.1155, score.std = 0.0161 (RMSE on the log-transformed sale price)
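
Since the target was log1p-transformed, this RMSE lives on the log scale. For prices in the hundreds of thousands, log1p is practically log, so an error of r on the log scale corresponds roughly to a multiplicative error of exp(r) on the price. A back-of-the-envelope conversion, offered as an interpretation aid rather than part of the original analysis:

```python
import numpy as np

rmse_log = 0.1155  # mean cross-validated RMSE reported above, on the log1p scale

# A log-scale error of r corresponds to a multiplicative factor of exp(r) on the price
typical_relative_error = np.expm1(rmse_log)
print("Typical relative error: {:.1%}".format(typical_relative_error))
```

This puts a typical prediction within roughly 12% of the true price, under the approximation stated above.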

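Summary:

This project walked through the whole modeling workflow, from obtaining the data through exploratory data analysis, data cleaning and data transformation to the final modeling. There is still plenty of room for improvement and more to learn.

The house-price prediction used an overseas dataset with LightGBM for modeling and prediction, and the overall result is acceptable.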
