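This is my second beginner competition, and I reached a top 1% score, which is a little exciting. It is quite possible that chasing the public leaderboard score made the model overfit, but in a getting-started competition the public score seems to be the only thing to chase.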

Back to business: let's begin.

I. Imports

# Data processing and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Models
from xgboost.sklearn import XGBRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
# Training and evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

II. Loading the data

train = pd.read_csv("all/train.csv")
test = pd.read_csv("all/test.csv")
sample_submission = pd.read_csv("all/sample_submission.csv")

III. Data analysis

1. SalePrice distribution

sns.distplot(train.SalePrice)

SalePrice is clearly skewed away from a normal distribution, so it needs adjusting: take the logarithm of SalePrice.

sns.distplot(np.log(train.SalePrice + 1))
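To make the effect of the log transform concrete, here is a small sketch on synthetic, right-skewed prices. The lognormal sample and the variable names are made up for illustration and are not the competition data. It also shows that np.log1p(x) matches np.log(x + 1) and that np.expm1 undoes it, which is what the submission step relies on.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (illustrative only, not the Kaggle data)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

log_prices = np.log1p(prices)      # same as np.log(prices + 1)
restored = np.expm1(log_prices)    # inverse transform back to prices

skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.3f}, after: {skew_after:.3f}")
```

The skewness of the log-transformed series should be much closer to 0 than that of the raw series.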

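2. Missing values

(The missing-value handling follows https://www.kaggle.com/laurenstc/top-2-of-leaderboard-advanced-fe)

Concatenate the data and visualize the percentage of missing values per feature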

all_data = pd.concat((train.drop(["SalePrice"], axis=1), test))
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
plt.figure(figsize=(12, 6))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)

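PoolQC: almost all values are missing. According to the data description, a missing value means the house has no pool, so it can be filled with None. In theory, every row with a missing PoolQC should have PoolArea equal to 0, but looking at the few rows with a non-zero PoolArea shows that test-set rows 960, 1043 and 1139 are special: these three cannot be filled with None. Among the 13 rows with a pool, Ex appears 4 times, Gd 4 times and Fa 2 times. Aiming for a roughly even distribution, two of the missing values are filled with Fa.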

all_data[all_data.PoolArea != 0][["PoolArea", "PoolQC"]]

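MiscFeature: dropped outright. According to the data description it could be filled with None. I also explored the relationship between MiscFeature, its MiscVal and GarageType, tried several fill strategies, and even determined that test-set row 1089 should be filled with Gar2, but dropping the feature altogether gave my model the best result.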

all_data[all_data.MiscVal > 10000][["MiscFeature", "MiscVal"]]

Alley: according to the data description, fill with None.

Fence: same as MiscFeature, dropped outright.

FireplaceQu: according to the data description, fill with None.

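LotFrontage: this is the feature many Kernels focus on when imputing. Some predict it with a model, some fill it by Neighborhood group, and some use R's MICE package. From what I observed, the most reasonable approach is to fill by LotConfig and Neighborhood group, but unfortunately none of these methods helped my model. What actually worked for this model was filling with 0.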
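For reference, a group-wise fill of the kind described here could look like the sketch below. The toy rows are made up, and falling back to the overall median for groups with no observed value is my assumption, not something the post specifies. The zero fill that the author ended up using is shown alongside for comparison.

```python
import numpy as np
import pandas as pd

# Toy rows (made up); the real data has many more neighborhoods and configs
df = pd.DataFrame({
    "Neighborhood": ["NAmes", "NAmes", "NAmes", "OldTown", "OldTown", "OldTown"],
    "LotConfig": ["Inside", "Inside", "Inside", "Corner", "Corner", "Inside"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan, np.nan],
})

# Median within each (Neighborhood, LotConfig) group, aligned to the rows
group_median = df.groupby(["Neighborhood", "LotConfig"])["LotFrontage"].transform("median")
filled_grouped = df["LotFrontage"].fillna(group_median)

# Groups with no observed value are still NaN; fall back to the overall median
# (this fallback is an assumption made for the sketch)
filled_grouped = filled_grouped.fillna(df["LotFrontage"].median())

# The option that worked best for the author's model: plain zero fill
filled_zero = df["LotFrontage"].fillna(0)
```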

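Garage family: according to the data description, GarageType, GarageFinish, GarageQual and GarageCond are filled with None, and GarageYrBlt, GarageCars and GarageArea with 0. On inspection, however, a few unusual rows in the test set should not be filled this way; for those I use the median and mode. I also tried approaches I considered more sensible, such as filling by Neighborhood and GarageType group, but the results were not good.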

all_data[(all_data.GarageType.notnull()) & (all_data.GarageYrBlt.isnull())][["Neighborhood", "YearBuilt", "YearRemodAdd", "GarageType", "GarageYrBlt", "GarageFinish", "GarageCars", "GarageArea", "GarageQual", "GarageCond"]]

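Bsmt family: according to the data description, fill with None or 0 as appropriate. There are again some special rows, shown below, that should not be filled this way. I tried other approaches, but in the end I stayed with None or 0.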

train.loc[[332, 948]][["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "BsmtFullBath", "BsmtHalfBath"]]

test.loc[[27, 580, 725, 757, 758, 888, 1064]][["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2", "BsmtUnfSF", "BsmtFullBath", "BsmtHalfBath"]]

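MSZoning: fill with the mode, RL. I also tried filling by MSSubClass group, but it did not help.

MasVnrType: fill with None.

MasVnrArea: fill with 0.

Utilities: the test set contains no NoSeWa values and the feature has essentially no effect on price, so it is dropped.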

plt.scatter(train.Utilities, train.SalePrice)

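Missing values in the remaining features are filled with the mode.

There are many ways to fill missing values. Try and explore as many as you can to find the one that works best for your own model.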

IV. Feature engineering

1. Log-transforming the price

y = train["SalePrice"]
y = np.log(y+1)

2. Filling the special-case missing values

# PoolQC
test.loc[960, "PoolQC"] = "Fa"
test.loc[1043, "PoolQC"] = "Gd"
test.loc[1139, "PoolQC"] = "Fa"
# Garage
test.loc[666, "GarageYrBlt"] = 1979
test.loc[1116, "GarageYrBlt"] = 1979
test.loc[666, "GarageFinish"] = "Unf"
test.loc[1116, "GarageFinish"] = "Unf"
test.loc[1116, "GarageCars"] = 2
test.loc[1116, "GarageArea"] = 480
test.loc[666, "GarageQual"] = "TA"
test.loc[1116, "GarageQual"] = "TA"
test.loc[666, "GarageCond"] = "TA"
test.loc[1116, "GarageCond"] = "TA"

3. Filling the missing values

# PoolQC
train = train.fillna({"PoolQC": "None"})
test = test.fillna({"PoolQC": "None"})
# Alley
train = train.fillna({"Alley": "None"})
test = test.fillna({"Alley": "None"})
# FireplaceQu
train = train.fillna({"FireplaceQu": "None"})
test = test.fillna({"FireplaceQu": "None"})
# LotFrontage
train = train.fillna({"LotFrontage": 0})
test = test.fillna({"LotFrontage": 0})
# Garage
train = train.fillna({"GarageType": "None"})
test = test.fillna({"GarageType": "None"})
train = train.fillna({"GarageYrBlt": 0})
test = test.fillna({"GarageYrBlt": 0})
train = train.fillna({"GarageFinish": "None"})
test = test.fillna({"GarageFinish": "None"})
test = test.fillna({"GarageCars": 0})
test = test.fillna({"GarageArea": 0})
train = train.fillna({"GarageQual": "None"})
test = test.fillna({"GarageQual": "None"})
train = train.fillna({"GarageCond": "None"})
test = test.fillna({"GarageCond": "None"})
# Bsmt
train = train.fillna({"BsmtQual": "None"})
test = test.fillna({"BsmtQual": "None"})
train = train.fillna({"BsmtCond": "None"})
test = test.fillna({"BsmtCond": "None"})
train = train.fillna({"BsmtExposure": "None"})
test = test.fillna({"BsmtExposure": "None"})
train = train.fillna({"BsmtFinType1": "None"})
test = test.fillna({"BsmtFinType1": "None"})
train = train.fillna({"BsmtFinSF1": 0})
test = test.fillna({"BsmtFinSF1": 0})
train = train.fillna({"BsmtFinType2": "None"})
test = test.fillna({"BsmtFinType2": "None"})
test = test.fillna({"BsmtFinSF2": 0})
test = test.fillna({"BsmtUnfSF": 0})
test = test.fillna({"TotalBsmtSF": 0})
test = test.fillna({"BsmtFullBath": 0})
test = test.fillna({"BsmtHalfBath": 0})
# MasVnr
train = train.fillna({"MasVnrType": "None"})
test = test.fillna({"MasVnrType": "None"})
train = train.fillna({"MasVnrArea": 0})
test = test.fillna({"MasVnrArea": 0})
# MiscFeature, Fence, Utilities
train = train.drop(["Fence", "MiscFeature", "Utilities"], axis=1)
test = test.drop(["Fence", "MiscFeature", "Utilities"], axis=1)
# other
test = test.fillna({"MSZoning": "RL"})
test = test.fillna({"Exterior1st": "VinylSd"})
test = test.fillna({"Exterior2nd": "VinylSd"})
train = train.fillna({"Electrical": "SBrkr"})
test = test.fillna({"KitchenQual": "TA"})
test = test.fillna({"Functional": "Typ"})
test = test.fillna({"SaleType": "WD"})
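The long run of single-column fillna calls can be written more compactly, because DataFrame.fillna accepts a dict with one fill value per column. A minimal sketch on a toy frame follows; the frame is made up, and the column names and fill values are a subset of those used above.

```python
import numpy as np
import pandas as pd

# Toy frame with a few of the columns handled above (values are made up)
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd"],
    "LotFrontage": [np.nan, 70.0],
    "GarageType": ["Attchd", np.nan],
    "MasVnrArea": [np.nan, 120.0],
})

# One fill value per column, applied in a single call
fill_values = {
    "PoolQC": "None",
    "LotFrontage": 0,
    "GarageType": "None",
    "MasVnrArea": 0,
}

toy_filled = toy.fillna(fill_values)
remaining_na = int(toy_filled.isnull().sum().sum())
```

Keeping the fill values in one dict also makes it easier to apply exactly the same rules to train and test.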

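4. Finding and removing outliers

(Following https://www.kaggle.com/jack89roberts/top-7-using-elasticnet-with-interactions I fit Ridge and ElasticNet on the training set, predicted the training set with each, and treated the samples that both models predict poorly as outliers.)

One-hot encoding (dummies)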

train_dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0)).iloc[: train.shape[0]]
test_dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0)).iloc[train.shape[0]:]
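The two lines above build the same dummy matrix twice. The sketch below encodes the concatenated frame once and then splits it, shown on toy frames; the frames and the names `combined`, `all_dummies` and `n_train` are illustrative assumptions. Encoding train and test together is what guarantees that both halves end up with identical columns.

```python
import pandas as pd

# Toy stand-ins for train and test (values are made up)
toy_train = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSZoning": ["RL", "RM", "RL"],
    "LotArea": [8450, 9600, 11250],
    "SalePrice": [208500, 181500, 223500],
})
toy_test = pd.DataFrame({
    "Id": [4, 5],
    "MSZoning": ["FV", "RL"],
    "LotArea": [9550, 14260],
})

n_train = toy_train.shape[0]
combined = pd.concat(
    (toy_train.drop(["SalePrice", "Id"], axis=1), toy_test.drop(["Id"], axis=1)),
    axis=0,
)
all_dummies = pd.get_dummies(combined)  # encode once on the combined frame

toy_train_dummies = all_dummies.iloc[:n_train]
toy_test_dummies = all_dummies.iloc[n_train:]
```

Here the category FV appears only in the toy test frame, yet the training half still gets an MSZoning_FV column.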

Finding outliers with Ridge

rr = Ridge(alpha=10)
rr.fit(train_dummies, y)
np.sqrt(-cross_val_score(rr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.1388301732996231

y_pred = rr.predict(train_dummies)
resid = y - y_pred
mean_resid = resid.mean()
std_resid = resid.std()
z = (resid - mean_resid) / std_resid
z = np.array(z)
outliers1 = np.where(abs(z) > abs(z).std() * 3)[0]
outliers1

Output: array([ 30, 88, 142, 277, 308, 328, 365, 410, 438, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 935, 968, 970, 1062, 1168, 1170, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453], dtype=int64)

plt.figure(figsize=(6, 6))
plt.scatter(y, y_pred)
plt.scatter(y.iloc[outliers1], y_pred[outliers1])
plt.plot(range(10, 15), range(10, 15), color="red")
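The residual rule used above can be packaged as a small helper. This is a sketch under assumptions: the function name `residual_outliers` and the synthetic data are mine, and the default cutoff of 3 standard deviations is the conventional choice rather than the exact rule in this post. The code above compares |z| against abs(z).std() * 3. Because the standard deviation of |z| is below 1, that cutoff is looser than |z| > 3 and can flag more points.

```python
import numpy as np

def residual_outliers(y_true, y_pred, n_std=3.0):
    """Return positions whose standardized residual exceeds n_std in absolute value."""
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    z = (resid - resid.mean()) / resid.std()
    return np.where(np.abs(z) > n_std)[0]

# Synthetic example: small noise everywhere, two badly predicted points
rng = np.random.default_rng(42)
y_true = rng.normal(12.0, 0.4, size=200)
y_pred = y_true + rng.normal(0.0, 0.05, size=200)
y_pred[[10, 150]] += 1.0  # inject two large errors

strict = residual_outliers(y_true, y_pred)  # conventional |z| > 3

# The looser rule used in this post, written out for comparison
resid = y_true - y_pred
z = (resid - resid.mean()) / resid.std()
loose = np.where(np.abs(z) > np.abs(z).std() * 3)[0]
```

Every point flagged by the strict rule is also flagged by the looser one, so the choice only affects how many borderline samples get removed.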

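Finding outliers with ElasticNet

er = ElasticNet(alpha=0.001, l1_ratio=0.58)
er.fit(train_dummies, y)
np.sqrt(-cross_val_score(er, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.1388301732996231 (identical to the Ridge score above, because the original run passed rr instead of er to cross_val_score; the ElasticNet score itself was not recorded)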

y_pred = er.predict(train_dummies)
resid = y - y_pred
mean_resid = resid.mean()
std_resid = resid.std()
z = (resid - mean_resid) / std_resid
z = np.array(z)
outliers2 = np.where(abs(z) > abs(z).std() * 3)[0]
outliers2

Output: array([ 30, 88, 142, 277, 328, 410, 457, 462, 495, 523, 533, 581, 588, 628, 632, 666, 681, 688, 710, 711, 714, 728, 738, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453], dtype=int64)

plt.figure(figsize=(6, 6))
plt.scatter(y, y_pred)
plt.scatter(y.iloc[outliers2], y_pred[outliers2])
plt.plot(range(10, 15), range(10, 15), color="red")

Treat the points that both models predict poorly as outliers

outliers = []
for i in outliers1:
    for j in outliers2:
        if i == j:
            outliers.append(i)
outliers

Output (the output format here was a bit off, so the outliers are removed by hand below): [30, 88, 142, 277, 328, 410, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453]
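The pairwise comparison above is a set intersection, which NumPy provides directly as np.intersect1d. A short sketch follows, using only the first few values of each index array shown earlier; the names `ridge_idx` and `enet_idx` are illustrative. Calling .tolist() converts the result to plain Python ints, so it can be passed straight to train.drop.

```python
import numpy as np

# Shortened index arrays for illustration (first six values of each output above)
ridge_idx = np.array([30, 88, 142, 277, 308, 328])
enet_idx = np.array([30, 88, 142, 277, 328, 410])

# Sorted values present in both arrays, as plain Python ints
common = np.intersect1d(ridge_idx, enet_idx).tolist()
print(common)
```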

Removing the outliers

train = train.drop([30, 88, 142, 277, 328, 410, 462, 495, 523, 533, 581, 588, 628, 632, 681, 688, 710, 714, 728, 774, 812, 874, 898, 916, 968, 970, 1181, 1182, 1298, 1324, 1383, 1423, 1432, 1453])
y = train["SalePrice"]
y = np.log(y+1)

V. Building the models

(I used GBDT, XGBoost, Lasso and Ridge, and then blended them.)

One-hot encoding (dummies)

train_dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0)).iloc[: train.shape[0]]
test_dummies = pd.get_dummies(pd.concat((train.drop(["SalePrice", "Id"], axis=1), test.drop(["Id"], axis=1)), axis=0)).iloc[train.shape[0]:]

GBDT

gbr = GradientBoostingRegressor(max_depth=4, n_estimators=150)
gbr.fit(train_dummies, y)
np.sqrt(-cross_val_score(gbr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.10041800215081471

XGB

xgbr = XGBRegressor(max_depth=5, n_estimators=400)
xgbr.fit(train_dummies, y)
np.sqrt(-cross_val_score(xgbr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.10051704266055339

Lasso

lsr = Lasso(alpha=0.00047)
lsr.fit(train_dummies, y)
np.sqrt(-cross_val_score(lsr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.09072389427316205

Ridge

rr = Ridge(alpha=13)
rr.fit(train_dummies, y)
np.sqrt(-cross_val_score(rr, train_dummies, y, cv=5, scoring="neg_mean_squared_error")).mean()

Output: 0.09161386485828467

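Cross-validation scores feel like black magic: sometimes a change improves the score here but the leaderboard score gets worse after submission, and sometimes the score gets worse here but the leaderboard score improves.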

Blending the models (I also tried stacking, which improved the score, but not as much as this simple blend)

train_predict = 0.1 * gbr.predict(train_dummies) + 0.3 * xgbr.predict(train_dummies) + 0.3 * lsr.predict(train_dummies) + 0.3 * rr.predict(train_dummies)

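This blends the models' predictions on the training set. I blend the training-set predictions first because I am about to adjust the predicted values by hand.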
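The weighted sum can also be written as a small helper, which keeps the weights in one place so the same blend is applied to train and test. This is a sketch: the function name `blend`, the check that the weights sum to 1, and the toy predictions are my assumptions. The weights 0.1 / 0.3 / 0.3 / 0.3 are the ones used in this post.

```python
import numpy as np

def blend(predictions, weights):
    """Weighted average of several prediction arrays of equal length."""
    predictions = np.asarray(predictions, dtype=float)  # shape (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)          # shape (n_models,)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    return weights @ predictions

# Made-up log-price predictions from four models for three houses
gbr_pred = np.array([12.0, 11.5, 12.4])
xgb_pred = np.array([12.1, 11.4, 12.5])
lasso_pred = np.array([11.9, 11.6, 12.3])
ridge_pred = np.array([12.0, 11.6, 12.2])

blended = blend([gbr_pred, xgb_pred, lasso_pred, ridge_pred], [0.1, 0.3, 0.3, 0.3])
```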

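Adjusting the predictions by hand

(See https://www.kaggle.com/agehsbarg/top-10-0-10943-stacking-mice-and-brutal-force for the idea. It is very effective, but there is little theory behind it and it may not carry over to larger datasets; on the House Prices public leaderboard, though, it really does work well.)

Look at how well our model predicts the training set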

plt.figure(figsize=(6, 6))
plt.scatter(y, train_predict)
plt.plot(range(10, 15), range(10, 15), color="red")

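You will see that the points at the bottom do not sit on the red line the way the points at the top do, which suggests the low-end predictions may be poor and can be adjusted by hand: use a quantile to select the predictions you want to change, then scale them. Both the quantile argument and the scaling factor are free to tune; look for the values that give the best score.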

q1 = pd.DataFrame(train_predict).quantile(0.0042)
pre_df = pd.DataFrame(train_predict)
pre_df["SalePrice"] = train_predict
pre_df = pre_df[["SalePrice"]]
pre_df.loc[pre_df.SalePrice <= q1[0], "SalePrice"] = pre_df.loc[pre_df.SalePrice <= q1[0], "SalePrice"] *0.99
train_predict = np.array(pre_df.SalePrice)
plt.figure(figsize=(6, 6))
plt.scatter(y, train_predict)
plt.plot(range(10, 15), range(10, 15), color="red")
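The quantile adjustment above can be wrapped in one function so that the quantile and the factor become explicit parameters. This is a sketch with a hypothetical name, `scale_low_tail`, and made-up predictions. It uses np.quantile, whose default linear interpolation matches the pandas quantile call above.

```python
import numpy as np

def scale_low_tail(pred, q=0.0042, factor=0.99):
    """Multiply predictions at or below the q-th quantile by factor."""
    pred = np.asarray(pred, dtype=float).copy()  # leave the caller's array untouched
    cutoff = np.quantile(pred, q)
    pred[pred <= cutoff] *= factor
    return pred

# Made-up log-price predictions
toy_pred = np.array([10.5, 11.8, 12.0, 12.1, 12.3, 12.6, 13.0])
adjusted = scale_low_tail(toy_pred, q=0.0042, factor=0.96)
```

Keep in mind that the predictions are log prices, so multiplying by 0.96 raises the predicted price to the power 0.96 (ignoring the +1 offset). That is a much larger change than a 4% price cut.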

Predicting and submitting

test_predict = 0.1 * gbr.predict(test_dummies) + 0.3 * xgbr.predict(test_dummies) + 0.3 * lsr.predict(test_dummies) + 0.3 * rr.predict(test_dummies)
q1 = pd.DataFrame(test_predict).quantile(0.0042)
pre_df = pd.DataFrame(test_predict)
pre_df["SalePrice"] = test_predict
pre_df = pre_df[["SalePrice"]]
pre_df.loc[pre_df.SalePrice <= q1[0], "SalePrice"] = pre_df.loc[pre_df.SalePrice <= q1[0], "SalePrice"] *0.96
test_predict = np.array(pre_df.SalePrice)
sample_submission["SalePrice"] = np.exp(test_predict)-1
sample_submission.to_csv("all/1.csv", index=False)

As you can see, I finally settled on a factor of 0.96, which gave me a good score. Submitting this yields 0.11052, which is in the top 2%.

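VI. Summary

1. Filling missing values matters a great deal and can improve the score noticeably, so try different fill strategies yourself. For example, you can try different combinations for the three special PoolQC rows.

2. On outliers: the outliers identified above are not the best set for this model. Try adding new ones or dropping some of the old ones; some combinations will raise the score again.

3. On redundant features: I stumbled on this when I first built the simplest possible model. I was testing by hand which features helped and which hurt, found two that hurt, and kept excluding them all the way to this final model. I did not expect them to keep helping, but dropping these two features raises this model's score again.

4. From the Titanic competition I learned that creating new features can be very useful, and many House Prices kernels create new features too, but they did little for this model.

5. Fixing feature skew and applying PCA, both mentioned in many kernels, did nothing for this model either.

In the end, removing the outliers [30, 88, 410, 462, 495, 523, 588, 628, 632, 874, 898, 968, 970, 1182, 1298, 1324, 1432] and dropping the features LandSlope and Exterior2nd brings this model to 0.10955, top 1%.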

