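Data Analysis Regression Problem: King County (USA) House Price Prediction Training Competition

This is a basic regression problem from the DC competition site (DC竞赛网): the King County (USA) house price prediction training competition.

Competition details: King County (USA) House Price Prediction Training Competition

Task: given basic information about each house and its sale, build a regression model that predicts the sale price.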

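Data:

The data covers house sale prices and basic house information for King County, USA, from May 2014 to May 2015.

It is split into training data and test data, stored in kc_train.csv and kc_test.csv respectively.

The training data contains 10000 records with 14 fields, described below:

Column 1, "sale date": the date the house was sold, between May 2014 and May 2015

Column 2, "sale price": the transaction price in US dollars; this is the prediction target

Column 3, "bedrooms": number of bedrooms in the house

Column 4, "bathrooms": number of bathrooms in the house

Column 5, "floor space": living area inside the house

Column 6, "parking space": area of the parking lot

Column 7, "floors": number of floors in the house

Column 8, "grade": overall grade given to the house by the King County grading system

Column 9, "covered area": building area of the house excluding the basement

Column 10, "basement area": area of the basement

Column 11, "build year": year the house was built

Column 12, "repair year": year the house was last repaired

Column 13, "latitude": latitude of the house

Column 14, "longitude": longitude of the house

The test data contains 3000 records with 13 fields. It differs from the training data only in that it has no sale price: the model built from the training data is applied to the test data to produce the predicted sale prices.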
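Because the test file has the same fields minus the price, the test column list can be derived from the training list instead of being typed twice. This is a minimal sketch: the English column names are this post's own labels (the CSV files are assumed to have no header row), and the last two names follow the field order described above.

```python
# Labels for the 14 training fields, in file order (names are this post's own).
columns_train = [
    'date', 'price', 'bedroom', 'bathroom', 'floor space', 'parking space',
    'floor', 'grade', 'covered area', 'basement area', 'build year',
    'repair year', 'latitude', 'longitude',
]

# The test file has the same fields except the target column.
columns_test = [c for c in columns_train if c != 'price']

print(len(columns_train), len(columns_test))
```

Deriving one list from the other keeps the two in sync if a label is later renamed.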

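Scoring:

Models are ranked by mean prediction error: the smaller the error, the better the regression model. Consistent with the mse_func implementation in section 4, it is computed as:

$$\mathrm{mse} = \frac{1}{10000\,m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2$$

Here mse is the mean prediction error, m is the number of test records (3000), \hat{y}_i is the price predicted by the participant for house i, and y_i is the true sale price of that house.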
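The scoring rule can be sketched in a few lines. This assumes the same scaling as the mse_func in section 4 (squared errors summed, then divided by 10000 times m); the function name and the `scale` parameter are illustrative and not part of the competition's definition.

```python
def mean_prediction_error(y_true, y_pred, scale=10000):
    """Mean squared error divided by `scale` (10000 in this post's mse_func)."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    m = len(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return squared_error / (scale * m)


# Two made-up houses, each mispredicted by 10,000 dollars.
print(mean_prediction_error([300000, 500000], [310000, 490000]))  # → 10000.0
```

Because the errors are squared, one badly mispredicted expensive house can dominate the score.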

1. Main function: load the data, preprocess it, build the model and predict, then write out the predictions, in that order.

from kc_data_import import read_data
from kc_data_preprocessing import preprocessing
from kc_data_prediction import predict


def main():
    # Read the data. The last two names follow the field description above
    # (column 13 is latitude, column 14 is longitude).
    columns_test = ['date', 'bedroom', 'bathroom', 'floor space', 'parking space', 'floor', 'grade',
                    'covered area', 'basement area', 'build year', 'repair year', 'latitude', 'longitude']
    columns_train = ['date', 'price', 'bedroom', 'bathroom', 'floor space', 'parking space', 'floor', 'grade',
                     'covered area', 'basement area', 'build year', 'repair year', 'latitude', 'longitude']
    test = read_data('kc_test.csv', columns_test)
    train = read_data('kc_train.csv', columns_train)
    # Preprocess the data
    train_data, test_data = preprocessing(train, test)
    # Build the model and predict
    pred_y = predict(train_data, test_data, is_shuffle=False)
    # Write out the predictions
    pred_y.to_csv('./kc_pred_0925.csv', index=False, header=['price'])


if __name__ == '__main__':
    main()

2. Loading the data

The sale date is stored in the form 20150302; passing parse_dates=[0] to pd.read_csv converts it to a datetime value on read.
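The effect of parse_dates=[0] can be checked on a tiny in-memory CSV. This is a sketch with two made-up rows, assuming the same headerless layout with the date in the first column; the column names are illustrative.

```python
import io

import pandas as pd

# Two made-up rows: date as YYYYMMDD, then a price.
csv_text = "20150302,500000\n20140516,320000\n"

df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['date', 'price'], parse_dates=[0])

# The first column is now a datetime column, so the .dt accessors work.
print(df['date'].dtype)
print(df['date'].dt.year.tolist(), df['date'].dt.quarter.tolist())
```

With the column parsed as datetimes, the year, quarter and month features used in the preprocessing step come from the `.dt` accessor.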

import os
import pandas as pd


def assert_msg(condition, msg):
    if not condition:
        raise Exception(msg)


def read_data(filename, columns):
    # Build the path to the data file (next to this script)
    file_path = os.path.join(os.path.dirname(__file__), filename)
    # Check that the file actually exists
    assert_msg(os.path.exists(file_path), 'File does not exist')
    # Read the CSV file; the files have no header row
    return pd.read_csv(file_path,
                       header=None,
                       parse_dates=[0],  # converts 20150101 to the date 2015-01-01
                       names=columns)

3. Data preprocessing

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

# Show all columns
pd.set_option('display.max_columns', None)
# Show all rows
# pd.set_option('display.max_rows', None)


# Polar-coordinate conversion relative to the point (x_min, y_min)
def polar_coordinates(x, y, x_min, y_min):
    # Polar radius
    radius = np.sqrt((x - x_min) ** 2 + (y - y_min) ** 2)
    # Polar angle in degrees; arctan2 avoids dividing by zero when x == x_min
    angle = np.arctan2(y - y_min, x - x_min) * 180 / np.pi
    return radius, angle


# Polar coordinates of every location, measured from the minimum longitude/latitude
def get_radius_angle(loc_x, loc_y):
    x_min, y_min = loc_x.min(), loc_y.min()
    # NumPy arrays are processed element-wise, so no explicit loop is needed
    return polar_coordinates(loc_x, loc_y, x_min, y_min)


def preprocessing(train, test):
    # Target: sale price
    target = train.pop('price')
    n_train = len(train)
    # Concatenate training and test sets so both get identical features
    data_all = pd.concat([train, test], ignore_index=True)

    temp_all = pd.DataFrame()
    columns = ['bedroom', 'bathroom', 'floor', 'grade',
               'floor space', 'parking space', 'covered area', 'basement area']
    for col in columns:
        temp_all[col] = data_all[col]

    # Year, quarter and month of the sale
    temp_all['year'] = data_all['date'].dt.year
    temp_all['quarter'] = data_all['date'].dt.quarter
    temp_all['month'] = data_all['date'].dt.month

    # Whether the house has been repaired (a repair year of 0 means never)
    temp_all['is_repair'] = (data_all['repair year'] > 0).astype(int)
    # Whether the house has a basement (1 if the basement area is positive)
    temp_all['have_basement'] = (data_all['basement area'] > 0).astype(int)

    # Age of the building at the time of sale
    temp_all['building_age'] = temp_all['year'] - data_all['build year']
    # Years since the last repair; falls back to the building age if never repaired
    temp_all['repair_age'] = np.where(data_all['repair year'] > 0,
                                      temp_all['year'] - data_all['repair year'],
                                      temp_all['building_age'])

    # Bedroom/bathroom ratio (zeros replaced by 1 to avoid division by zero)
    bedroom = data_all['bedroom'].replace(0, 1)
    bathroom = data_all['bathroom'].replace(0, 1)
    temp_all['b_b_ratio'] = bedroom / bathroom
    # Floor space / covered area ratio
    temp_all['f_c_ratio'] = temp_all['floor space'] / temp_all['covered area']
    # Floor space / parking space ratio
    temp_all['f_p_ratio'] = temp_all['floor space'] / temp_all['parking space']

    # Convert longitude/latitude to polar coordinates
    loc_x = data_all['longitude'].values
    loc_y = data_all['latitude'].values
    radius, angle = get_radius_angle(loc_x, loc_y)
    temp_all['radius'] = radius.round(decimals=8)
    temp_all['angle'] = angle.round(decimals=8)

    # One-hot encode the categorical features with get_dummies
    temp_all = pd.get_dummies(temp_all, columns=['year', 'quarter', 'month',
                                                 'bedroom', 'bathroom', 'floor',
                                                 'is_repair', 'have_basement'])

    # Split back into training and test sets
    temp_train = temp_all.iloc[:n_train].copy()
    temp_test = temp_all.iloc[n_train:].copy()
    temp_train['price'] = target.values
    return temp_train, temp_test
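To see what the polar-coordinate features look like, here is a small vectorized sketch on made-up coordinates. It uses np.arctan2 rather than the arctangent of a ratio, so the point at the minimum longitude does not cause a division by zero. The function name `to_polar` and the sample coordinates are illustrative only.

```python
import numpy as np


def to_polar(x, y):
    """Polar coordinates of (x, y) relative to the minimum corner of the point set."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.min(), y - y.min()
    radius = np.hypot(dx, dy)
    # Angle in degrees; arctan2 returns 0 for the origin point instead of NaN.
    angle = np.degrees(np.arctan2(dy, dx))
    return radius, angle


# Made-up coordinates roughly in the King County range.
lon = [-122.40, -122.10, -122.40]
lat = [47.30, 47.30, 47.60]
radius, angle = to_polar(lon, lat)
print(radius.round(2), angle.round(1))
```

The radius and angle together encode where a house sits relative to the south-west corner of the data. A tree model can split on that more directly than on raw longitude and latitude taken separately.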

4. Building the regression model

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.preprocessing import StandardScaler, LabelEncoder, Normalizer
from sklearn.feature_extraction import DictVectorizer
import warnings

warnings.filterwarnings('ignore')


def feat_standard(data):
    st_scaler = StandardScaler()
    return st_scaler.fit_transform(data)


def feat_normalizer(data):
    no_scaler = Normalizer()
    return no_scaler.fit_transform(data)


def feat_encoder(data, cols):
    for c in cols:
        lbl = LabelEncoder()
        data[c] = lbl.fit_transform(list(data[c].values))
    return data


def feat_dictvectorizer(train_x, *others):
    # Fit on the training fold, then apply the same mapping (and the same
    # column order) to every other frame passed in
    dict_vec = DictVectorizer(sparse=False)
    train_x = dict_vec.fit_transform(train_x.to_dict(orient='records'))
    others = [dict_vec.transform(x.to_dict(orient='records')) for x in others]
    return (train_x, *others)


def mse_func(y_true, y_predict):
    assert isinstance(y_true, list), 'y_true must be type of list'
    assert isinstance(y_predict, list), 'y_predict must be type of list'
    m = len(y_true)
    squared_error = 0
    for i in range(m):
        error = y_true[i] - y_predict[i]
        squared_error = squared_error + error ** 2
    mse = squared_error / (10000 * m)
    return mse


def predict(train_, valid_, is_shuffle=True):
    print(f'data shape:\ntrain--{train_.shape}\nvalid--{valid_.shape}')
    # KFold only accepts a random_state when shuffling is enabled
    folds = KFold(n_splits=5, shuffle=is_shuffle,
                  random_state=1024 if is_shuffle else None)
    pred = [k for k in train_.columns if k not in ['price']]
    sub_preds = np.zeros((valid_.shape[0], folds.n_splits))
    print(f'Use {len(pred)} features ...')
    res_e = []
    for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_, train_['price']), start=1):
        print(f'the {n_fold} training start ...')
        train_x, train_y = train_[pred].iloc[train_idx], train_['price'].iloc[train_idx]
        valid_x, valid_y = train_[pred].iloc[valid_idx], train_['price'].iloc[valid_idx]
        # Standardization is currently disabled
        # feat_st_cols = ['floor space', 'parking space', 'covered area', 'building_age']
        # train_x[feat_st_cols] = feat_standard(train_x[feat_st_cols])
        # valid_x[feat_st_cols] = feat_standard(valid_x[feat_st_cols])
        # Vectorize the test set with the same vectorizer so that its columns
        # line up with the columns the model was trained on
        train_x, valid_x, test_x = feat_dictvectorizer(train_x, valid_x, valid_[pred])
        dt_stump = DecisionTreeRegressor(max_depth=30,
                                         min_samples_split=15,
                                         min_samples_leaf=10,
                                         max_features=50,
                                         random_state=11,
                                         max_leaf_nodes=350)
        # scikit-learn >= 1.2 names this argument `estimator`;
        # older versions call it `base_estimator`
        reg = AdaBoostRegressor(estimator=dt_stump, n_estimators=100)
        reg.fit(train_x, train_y)
        valid_pred = reg.predict(valid_x)
        tmp_score = mse_func(list(valid_y), list(valid_pred))
        res_e.append(tmp_score)
        sub_preds[:, n_fold - 1] = reg.predict(test_x)
    print('mean over 5 folds:', np.mean(res_e))
    # Final prediction: average of the five fold models
    return pd.Series(np.mean(sub_preds, axis=1), index=valid_.index, name='price')
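The core pattern in predict is to train one model per fold, score it on the held-out part, and average the test-set predictions of all fold models. It can be shown in isolation with NumPy alone. This sketch uses synthetic data and an ordinary least-squares model as a stand-in for the AdaBoost regressor, and every name in it is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: y = 3*x0 - 2*x1 + noise (made up for illustration).
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(scale=0.1, size=100)
X_test = rng.normal(size=(10, 2))

n_splits = 5
folds = np.array_split(np.arange(len(X)), n_splits)  # contiguous, unshuffled folds
test_preds = np.zeros((len(X_test), n_splits))
fold_mse = []

for k, valid_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(len(X)), valid_idx)
    # Fit ordinary least squares on the training part of this fold.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    # Score on the held-out part, then predict the real test set.
    fold_mse.append(np.mean((X[valid_idx] @ coef - y[valid_idx]) ** 2))
    test_preds[:, k] = X_test @ coef

final_pred = test_preds.mean(axis=1)  # average the five models' predictions
print('mean validation MSE:', np.mean(fold_mse))
```

Averaging the fold models reduces the variance of the final prediction, and the per-fold validation errors give an estimate of the leaderboard score before submitting.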

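5. Submit the predictions. The figure below shows the score this model obtained; the data processing and the prediction model are still fairly rough and need further refinement.

[Figure: score obtained by this model]

Source code: my GitHub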
