Problem background: a home buyer wants to purchase their dream house, and the task is to predict the sale price of a house from its 79 explanatory variables.

The work breaks down into the following steps:

  1. Import the data and examine what each variable means and how strongly it relates to the sale price
  2. Select the variables that most influence the price
  3. Clean and transform the variables
  4. Predict on the test set and write out the results

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data is the official competition page, where the data files can be downloaded.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

data_train = pd.read_csv('train.csv')
data_test = pd.read_csv('test.csv')
data_train.sample(3)
# Return the columns whose correlation with SalePrice is below 0.5
def drop_low_colums(df):
    data_corr = df.corr()
    d_list = data_corr[data_corr.SalePrice < 0.5].index.tolist()
    return d_list

# Drop YearBuilt and 1stFlrSF, which largely duplicate other year/area features
data_train = data_train.drop(['YearBuilt', '1stFlrSF'], axis=1)
data_test = data_test.drop(['YearBuilt', '1stFlrSF'], axis=1)
# Drop columns with a large fraction of missing values
data_drop_train = data_train.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1)
data_drop_test = data_test.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1)
# Drop the weakly correlated columns found on the training set from both sets
drop_list = drop_low_colums(data_drop_train)
data_drop_train1 = data_drop_train.drop(drop_list, axis=1)
data_drop_test1 = data_drop_test.drop(drop_list, axis=1)
# Fill remaining missing values with the string '0'
data_drop_train1 = data_drop_train1.fillna('0')
data_drop_test1 = data_drop_test1.fillna('0')
from sklearn import preprocessing
# Label-encode every object (string) column; fit each encoder on the union of
# train and test values so both sets share the same integer codes
def encode_features(df_train, df_test):
    features = df_train.select_dtypes(include=object).columns
    for feature in features:
        le = preprocessing.LabelEncoder()
        le.fit(pd.concat([df_train[feature], df_test[feature]]).astype(str))
        df_train[feature] = le.transform(df_train[feature].astype(str))
        df_test[feature] = le.transform(df_test[feature].astype(str))
    return df_train, df_test

data_train, data_test = encode_features(data_drop_train1, data_drop_test1)
data_drop_train2=data_train.corr()
drop_list2=list(data_drop_train2.query('SalePrice<0.5').index)
data_drop_train2=data_train.drop(drop_list2, axis=1)
data_drop_test2=data_test.drop(drop_list2, axis=1)
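The correlation filter used above can be illustrated on a tiny stand-alone frame (the values here are made up for the illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame: GrLivArea tracks SalePrice exactly, MoSold only weakly
df = pd.DataFrame({
    'SalePrice': [100, 200, 300, 400],
    'GrLivArea': [10, 20, 30, 40],
    'MoSold':    [1, 4, 2, 3],
})

corr = df.corr()
# Columns whose correlation with SalePrice is below 0.5 get dropped
drop_list = corr[corr.SalePrice < 0.5].index.tolist()
print(drop_list)  # ['MoSold']  (its correlation with SalePrice is 0.4)
```

SalePrice correlates with itself at 1.0, so it always survives its own filter; only genuinely weak predictors end up in the drop list.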
# Number of bins for discretisation: ceil(log2(column max));
# cast to int because pd.cut expects an integer bin count
def change_colums(colums):
    fz = int(np.ceil(np.log2(colums.max())))
    return fz

fz = change_colums(data_drop_train2.GarageArea)
data_drop_train2['GarageArea'] = pd.cut(data_drop_train2.GarageArea, fz)
fz=change_colums(data_drop_train2.GrLivArea)
data_drop_train2['GrLivArea'] = pd.cut(data_drop_train2.GrLivArea, fz)
fz=change_colums(data_drop_train2.TotalBsmtSF)
data_drop_train2['TotalBsmtSF'] = pd.cut(data_drop_train2.TotalBsmtSF, fz)
fz=change_colums(data_drop_train2.YearRemodAdd)
data_drop_train2['YearRemodAdd'] = pd.cut(data_drop_train2.YearRemodAdd, fz)
data_drop_test2["TotalBsmtSF"] = data_drop_test2["TotalBsmtSF"].astype("int64")
data_drop_test2["GarageArea"] = data_drop_test2["GarageArea"].astype("int64")
data_drop_test2["GrLivArea"] = data_drop_test2["GrLivArea"].astype("int64")
data_drop_test2["GarageCars"] = data_drop_test2["GarageCars"].astype("int64")
# Replace the '0' placeholders in YearRemodAdd with the column mean
data_drop_test2["YearRemodAdd"] = pd.to_numeric(data_drop_test2["YearRemodAdd"])
mask = data_drop_test2["YearRemodAdd"] == 0
data_drop_test2.loc[mask, "YearRemodAdd"] = int(data_drop_test2["YearRemodAdd"].mean())
fz=change_colums(data_drop_test2.YearRemodAdd)
data_drop_test2['YearRemodAdd'] = pd.cut(data_drop_test2.YearRemodAdd, fz)
fz=change_colums(data_drop_test2.TotalBsmtSF)
data_drop_test2['TotalBsmtSF'] = pd.cut(data_drop_test2.TotalBsmtSF,fz)
fz=change_colums(data_drop_test2.GarageArea)
data_drop_test2['GarageArea'] = pd.cut(data_drop_test2.GarageArea,fz)
fz=change_colums(data_drop_test2.GrLivArea)
data_drop_test2['GrLivArea'] = pd.cut(data_drop_test2.GrLivArea, fz)
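The binning done by change_colums plus pd.cut can be sketched in isolation (the area values below are made up):

```python
import numpy as np
import pandas as pd

area = pd.Series([120, 480, 250, 900, 60])

# ceil(log2(900)) = 10, so the value range is split into 10 equal-width bins
n_bins = int(np.ceil(np.log2(area.max())))
binned = pd.cut(area, n_bins)

print(n_bins)                      # 10
print(binned.cat.categories.size)  # 10 interval categories
```

Each value is replaced by the interval it falls into; the later LabelEncoder step turns those intervals back into small integers.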
# Label-encode the binned interval columns, fitting each encoder on the
# combined train and test values so both sets share the same codes
def encode_features1(df_train, df_test):
    features = ['YearRemodAdd', 'TotalBsmtSF', 'GrLivArea', 'GarageArea']
    df_combined = pd.concat([df_train[features], df_test[features]])
    for feature in features:
        le = preprocessing.LabelEncoder()
        le = le.fit(df_combined[feature])
        df_train[feature] = le.transform(df_train[feature])
        df_test[feature] = le.transform(df_test[feature])
    return df_train, df_test

data_train1, data_test1 = encode_features1(data_drop_train2, data_drop_test2)
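Fitting each LabelEncoder on the concatenated train and test values is what guarantees both sets map the same category to the same integer. A minimal illustration with toy quality labels (not the real columns):

```python
import pandas as pd
from sklearn import preprocessing

train = pd.DataFrame({'Qual': ['Fair', 'Good', 'Ex']})
test = pd.DataFrame({'Qual': ['Good', 'Ex', 'Fair']})

# LabelEncoder sorts its classes: Ex -> 0, Fair -> 1, Good -> 2
le = preprocessing.LabelEncoder().fit(pd.concat([train['Qual'], test['Qual']]))
train['Qual'] = le.transform(train['Qual'])
test['Qual'] = le.transform(test['Qual'])

print(train['Qual'].tolist())  # [1, 2, 0]
print(test['Qual'].tolist())   # [2, 0, 1]
```

Had each frame been encoded with its own fit_transform, the same label could have received different codes in train and test whenever the sets of observed values differed.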
from sklearn.model_selection import train_test_split
X=data_train1[['OverallQual', 'YearRemodAdd', 'TotalBsmtSF', 'GrLivArea', 'FullBath', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea']]
y=data_train1['SalePrice']
# Randomly split off 20% of the training data as a validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# 1. Decision tree regression
from sklearn import tree
model_decision_tree_regression = tree.DecisionTreeRegressor()

# 2. Linear regression
from sklearn.linear_model import LinearRegression
model_linear_regression = LinearRegression()

# 3. SVM regression
from sklearn import svm
model_svm = svm.SVR()

# 4. kNN regression
from sklearn import neighbors
model_k_neighbor = neighbors.KNeighborsRegressor()

# 5. Random forest regression
from sklearn import ensemble
model_random_forest_regressor = ensemble.RandomForestRegressor(n_estimators=20)  # 20 trees

# 6. AdaBoost regression
model_adaboost_regressor = ensemble.AdaBoostRegressor(n_estimators=50)  # 50 estimators

# 7. GBRT (gradient boosting) regression
model_gradient_boosting_regressor = ensemble.GradientBoostingRegressor(n_estimators=100)  # 100 estimators

# 8. Bagging regression
model_bagging_regressor = ensemble.BaggingRegressor()

# 9. ExtraTree (extremely randomized tree) regression
from sklearn.tree import ExtraTreeRegressor
model_extra_tree_regressor = ExtraTreeRegressor()
# Fit a model, score it on the validation set, and plot true vs. predicted values
def try_different_method(model, method):
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    result = model.predict(X_test)
    plt.figure()
    plt.plot(np.arange(len(result)), y_test, "go-", label="True value")
    plt.plot(np.arange(len(result)), result, "ro-", label="Predict value")
    plt.title(f"method:{method}---score:{score}")
    plt.legend(loc="best")
    plt.show()
try_different_method(model_decision_tree_regression,"model_decision_tree_regression")
try_different_method(model_linear_regression,"model_linear_regression")
try_different_method(model_svm,"model_svm")
try_different_method(model_k_neighbor,"model_k_neighbor")
try_different_method(model_random_forest_regressor,"model_random_forest_regressor")
try_different_method(model_adaboost_regressor,"model_adaboost_regressor")
try_different_method(model_gradient_boosting_regressor,"model_gradient_boosting_regressor")
try_different_method(model_bagging_regressor,"model_bagging_regressor")
try_different_method(model_extra_tree_regressor,"model_extra_tree_regressor")
model_gradient_boosting_regressor.fit(X_train, y_train)
model_gradient_boosting_regressor_result = model_gradient_boosting_regressor.predict(data_test1)
submission=pd.DataFrame({'Id':data_test['Id'],'SalePrice':model_gradient_boosting_regressor_result})
submission.to_csv('submission.csv',index=False)

Finally, submit the file to get a score. Only simple feature engineering was done here, and no hyperparameter tuning, so there is plenty of room for further improvement.
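As one possible next step, the gradient boosting model could be tuned with a grid search. A sketch on synthetic data (the grid values here are illustrative, not tuned for this competition):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training matrix
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [2, 3],
    'learning_rate': [0.05, 0.1],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3, scoring='neg_mean_squared_error')
search.fit(X, y)
print(search.best_params_)  # best combination found by 3-fold CV
```

The same pattern applies directly to X_train and y_train above; `search.best_estimator_` can then replace the hand-configured model before the final fit and submission.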

https://github.com/Timlcy/house_price/blob/master/housePrices.ipynb
