Competition URL: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Download the dataset
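The data files can be downloaded from the competition page after registering, for example with the official Kaggle CLI: kaggle competitions download -c house-prices-advanced-regression-techniques. Below is a minimal sketch for unpacking the archive; the kaggle/ path is this post's convention, not something fixed by Kaggle.

import zipfile

# Assumed local path of the downloaded archive; adjust to wherever the file was saved
archive = 'kaggle/house-prices-advanced-regression-techniques.zip'
with zipfile.ZipFile(archive) as zf:
    zf.extractall('kaggle/house-prices-advanced-regression-techniques')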

Read the dataset

import d2lzh as d2l
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import numpy as np
import pandas as pd

train_data = pd.read_csv('kaggle/house-prices-advanced-regression-techniques/train.csv')
test_data = pd.read_csv('kaggle/house-prices-advanced-regression-techniques/test.csv')
print(train_data.shape)  # (1460, 81)
print(test_data.shape)   # (1459, 80)
# Look at the first 4 features, the last 2 features, and the label of the first 4 samples
print(train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]])
# Concatenate the 79 features of all training and test samples
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

Preprocess the dataset

# Standardize the continuous numerical features: for a feature with mean μ and standard
# deviation σ over the whole dataset, replace each value x with (x - μ) / σ. Missing
# feature values are replaced by the feature's mean.
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization every feature has mean 0, so missing values can simply be filled with 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
# Turn discrete values into indicator features. For example, if the feature MSZoning takes the
# two distinct values RL and RM, this transformation removes MSZoning and adds two new features,
# MSZoning_RL and MSZoning_RM, with values 0 or 1: a sample whose MSZoning value was RL gets
# MSZoning_RL=1 and MSZoning_RM=0.
# dummy_na=True also treats missing values as a valid category and creates an indicator for them
all_features = pd.get_dummies(all_features, dummy_na=True)
print(all_features.shape)  # (2919, 331)
# The values attribute gives the data in NumPy format; convert it to NDArray for training
n_train = train_data.shape[0]
train_features = nd.array(all_features[:n_train].values)
test_features = nd.array(all_features[n_train:].values)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))
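To make the indicator-feature step concrete, here is a tiny self-contained sketch; the toy frame is invented for illustration:

import pandas as pd

toy = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})
print(pd.get_dummies(toy, dummy_na=True))
# Yields three indicator columns (MSZoning_RL, MSZoning_RM, MSZoning_nan),
# with exactly one 1 per row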

Train the model

# Train a basic linear regression model with the squared loss
loss = gloss.L2Loss()

def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(1))
    net.initialize()
    return net

# The competition scores models with the log root-mean-squared error. Given predictions
# ŷ_1, …, ŷ_n and the corresponding true labels y_1, …, y_n, it is defined as
# sqrt((1/n) * sum_i (log(y_i) - log(ŷ_i))^2).
def log_rmse(net, features, labels):
    # Clip predictions below 1 to 1 so that taking the logarithm is numerically stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # L2Loss computes (1/2)(ŷ - y)^2, hence the factor of 2
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # Use the Adam optimization algorithm
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls

# K-fold cross-validation: return the training and validation data for the i-th fold
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    return X_train, y_train, X_valid, y_valid

# Train k times and return the average training and validation errors
def k_fold(k, X_train, y_train, num_epochs,
           learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                         range(1, num_epochs + 1), valid_ls,
                         ['train', 'valid'])
        print('fold %d, train rmse %f, valid rmse %f'
              % (i, train_ls[-1], valid_ls[-1]))
    return train_l_sum / k, valid_l_sum / k
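As a quick sanity check of the fold splitting, here is a toy example (hypothetical data, not part of the original code): with 10 samples and k=5, each validation fold holds 2 samples and training keeps the remaining 8.

X_toy = nd.arange(20).reshape((10, 2))
y_toy = nd.arange(10).reshape((10, 1))
X_tr, y_tr, X_va, y_va = get_k_fold_data(5, 2, X_toy, y_toy)
print(X_tr.shape, X_va.shape)  # (8, 2) (2, 2)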

Model selection

# Use a set of untuned hyperparameters and compute the cross-validation error. These
# hyperparameters can be adjusted to reduce the average validation error as much as possible.
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print('%d-fold validation: avg train rmse %f, avg valid rmse %f'
      % (k, train_l, valid_l))
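One simple way to tune the hyperparameters is a small grid search over the learning rate and weight decay. A hypothetical sketch (the candidate values below are arbitrary illustration, not tuned results):

for lr_try in (1, 5, 10):
    for wd_try in (0, 0.1):
        _, v_l = k_fold(k, train_features, train_labels, num_epochs, lr_try,
                        wd_try, batch_size)
        print('lr %g, wd %g: avg valid rmse %f' % (lr_try, wd_try, v_l))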

Predict and submit results on Kaggle

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).asnumpy()
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv(
        'kaggle/house-prices-advanced-regression-techniques/submission.csv',
        index=False)

train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)
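Kaggle expects exactly two columns, Id and SalePrice, with one row per test sample. A quick sanity check of the generated file (a sketch, not part of the original code):

submission = pd.read_csv('kaggle/house-prices-advanced-regression-techniques/submission.csv')
print(submission.columns.tolist())  # ['Id', 'SalePrice']
print(submission.shape)             # (1459, 2)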

Submit on Kaggle and check the result.

References

http://zh.gluon.ai/chapter_deep-learning-basics/kaggle-house-price.html
