二手车交易价格预测-Baseline

Baseline-v1.0 版

Tip:这是一个最初始baseline版本,抛砖引玉,为大家提供一个基本Baseline和一个竞赛流程的基本介绍,欢迎大家多多交流。

赛题:零基础入门数据挖掘 - 二手车交易价格预测

地址:https://tianchi.aliyun.com/competition/entrance/231784/introduction?spm=5176.12281957.1004.1.38b02448ausjSX

# 查看数据文件目录  list datalab files
!ls datalab/
231784

Step 1:导入函数工具箱

## 基础工具
import numpy as np
import pandas as pd
import warnings
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
import timewarnings.filterwarnings('ignore')
%matplotlib inline## 模型预测的
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor## 数据降维处理的
from sklearn.decomposition import PCA,FastICA,FactorAnalysis,SparsePCAimport lightgbm as lgb
import xgboost as xgb## 参数搜索和评价的
from sklearn.model_selection import GridSearchCV,cross_val_score,StratifiedKFold,train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

Step 2:数据读取

## 通过Pandas对于数据进行读取 (pandas是一个很友好的数据读取函数库)
Train_data = pd.read_csv('datalab/231784/used_car_train_20200313.csv', sep=' ')
TestA_data = pd.read_csv('datalab/231784/used_car_testA_20200313.csv', sep=' ')## 输出数据的大小信息
print('Train data shape:',Train_data.shape)
print('TestA data shape:',TestA_data.shape)
Train data shape: (150000, 31)
TestA data shape: (50000, 30)

1) 数据简要浏览

## 通过.head() 简要浏览读取数据的形式
Train_data.head()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
0 0 736 20040402 30.0 6 1.0 0.0 0.0 60 12.5 ... 0.235676 0.101988 0.129549 0.022816 0.097462 -2.881803 2.804097 -2.420821 0.795292 0.914762
1 1 2262 20030301 40.0 1 2.0 0.0 0.0 0 15.0 ... 0.264777 0.121004 0.135731 0.026597 0.020582 -4.900482 2.096338 -1.030483 -1.722674 0.245522
2 2 14874 20040403 115.0 15 1.0 0.0 0.0 163 12.5 ... 0.251410 0.114912 0.165147 0.062173 0.027075 -4.846749 1.803559 1.565330 -0.832687 -0.229963
3 3 71865 19960908 109.0 10 0.0 0.0 1.0 193 15.0 ... 0.274293 0.110300 0.121964 0.033395 0.000000 -4.509599 1.285940 -0.501868 -2.438353 -0.478699
4 4 111080 20120103 110.0 5 1.0 0.0 0.0 68 5.0 ... 0.228036 0.073205 0.091880 0.078819 0.121534 -1.896240 0.910783 0.931110 2.834518 1.923482

5 rows × 31 columns

2) 数据信息查看

## 通过 .info() 简要可以看到对应一些数据列名,以及NAN缺失信息
Train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                149999 non-null float64
brand                150000 non-null int64
bodyType             145494 non-null float64
fuelType             141320 non-null float64
gearbox              144019 non-null float64
power                150000 non-null int64
kilometer            150000 non-null float64
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null int64
creatDate            150000 non-null int64
price                150000 non-null int64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 150000 non-null float64
v_13                 150000 non-null float64
v_14                 150000 non-null float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
## 通过 .columns 查看列名
Train_data.columns
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'notRepairedDamage', 'regionCode','seller', 'offerType', 'creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3','v_4', 'v_5', 'v_6', 'v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12','v_13', 'v_14'],dtype='object')
TestA_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
SaleID               50000 non-null int64
name                 50000 non-null int64
regDate              50000 non-null int64
model                50000 non-null float64
brand                50000 non-null int64
bodyType             48587 non-null float64
fuelType             47107 non-null float64
gearbox              48090 non-null float64
power                50000 non-null int64
kilometer            50000 non-null float64
notRepairedDamage    50000 non-null object
regionCode           50000 non-null int64
seller               50000 non-null int64
offerType            50000 non-null int64
creatDate            50000 non-null int64
v_0                  50000 non-null float64
v_1                  50000 non-null float64
v_2                  50000 non-null float64
v_3                  50000 non-null float64
v_4                  50000 non-null float64
v_5                  50000 non-null float64
v_6                  50000 non-null float64
v_7                  50000 non-null float64
v_8                  50000 non-null float64
v_9                  50000 non-null float64
v_10                 50000 non-null float64
v_11                 50000 non-null float64
v_12                 50000 non-null float64
v_13                 50000 non-null float64
v_14                 50000 non-null float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB

3) 数据统计信息浏览

## 通过 .describe() 可以查看数值特征列的一些统计信息
Train_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 150000.000000 150000.000000 1.500000e+05 149999.000000 150000.000000 145494.000000 141320.000000 144019.000000 150000.000000 150000.000000 ... 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000 150000.000000
mean 74999.500000 68349.172873 2.003417e+07 47.129021 8.052733 1.792369 0.375842 0.224943 119.316547 12.597160 ... 0.248204 0.044923 0.124692 0.058144 0.061996 -0.001000 0.009035 0.004813 0.000313 -0.000688
std 43301.414527 61103.875095 5.364988e+04 49.536040 7.864956 1.760640 0.548677 0.417546 177.168419 3.919576 ... 0.045804 0.051743 0.201410 0.029186 0.035692 3.772386 3.286071 2.517478 1.288988 1.038685
min 0.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.168192 -5.558207 -9.639552 -4.153899 -6.546556
25% 37499.750000 11156.000000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243615 0.000038 0.062474 0.035334 0.033930 -3.722303 -1.951543 -1.871846 -1.057789 -0.437034
50% 74999.500000 51638.000000 2.003091e+07 30.000000 6.000000 1.000000 0.000000 0.000000 110.000000 15.000000 ... 0.257798 0.000812 0.095866 0.057014 0.058484 1.624076 -0.358053 -0.130753 -0.036245 0.141246
75% 112499.250000 118841.250000 2.007111e+07 66.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265297 0.102009 0.125243 0.079382 0.087491 2.844357 1.255022 1.776933 0.942813 0.680378
max 149999.000000 196812.000000 2.015121e+07 247.000000 39.000000 7.000000 6.000000 1.000000 19312.000000 15.000000 ... 0.291838 0.151420 1.404936 0.160791 0.222787 12.357011 18.819042 13.847792 11.147669 8.658418

8 rows × 30 columns

TestA_data.describe()
SaleID name regDate model brand bodyType fuelType gearbox power kilometer ... v_5 v_6 v_7 v_8 v_9 v_10 v_11 v_12 v_13 v_14
count 50000.000000 50000.000000 5.000000e+04 50000.000000 50000.000000 48587.000000 47107.000000 48090.000000 50000.000000 50000.000000 ... 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000 50000.000000
mean 174999.500000 68542.223280 2.003393e+07 46.844520 8.056240 1.782185 0.373405 0.224350 119.883620 12.595580 ... 0.248669 0.045021 0.122744 0.057997 0.062000 -0.017855 -0.013742 -0.013554 -0.003147 0.001516
std 14433.901067 61052.808133 5.368870e+04 49.469548 7.819477 1.760736 0.546442 0.417158 185.097387 3.908979 ... 0.044601 0.051766 0.195972 0.029211 0.035653 3.747985 3.231258 2.515962 1.286597 1.027360
min 150000.000000 0.000000 1.991000e+07 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.500000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 -9.160049 -5.411964 -8.916949 -4.123333 -6.112667
25% 162499.750000 11203.500000 1.999091e+07 10.000000 1.000000 0.000000 0.000000 0.000000 75.000000 12.500000 ... 0.243762 0.000044 0.062644 0.035084 0.033714 -3.700121 -1.971325 -1.876703 -1.060428 -0.437920
50% 174999.500000 52248.500000 2.003091e+07 29.000000 6.000000 1.000000 0.000000 0.000000 109.000000 15.000000 ... 0.257877 0.000815 0.095828 0.057084 0.058764 1.613212 -0.355843 -0.142779 -0.035956 0.138799
75% 187499.250000 118856.500000 2.007110e+07 65.000000 13.000000 3.000000 1.000000 0.000000 150.000000 15.000000 ... 0.265328 0.102025 0.125438 0.079077 0.087489 2.832708 1.262914 1.764335 0.941469 0.681163
max 199999.000000 196805.000000 2.015121e+07 246.000000 39.000000 7.000000 6.000000 1.000000 20000.000000 15.000000 ... 0.291618 0.153265 1.358813 0.156355 0.214775 12.338872 18.856218 12.950498 5.913273 2.624622

8 rows × 29 columns

Step 3:特征与标签构建

1) 提取数值类型特征列名

numerical_cols = Train_data.select_dtypes(exclude = 'object').columns
print(numerical_cols)
Index(['SaleID', 'name', 'regDate', 'model', 'brand', 'bodyType', 'fuelType','gearbox', 'power', 'kilometer', 'regionCode', 'seller', 'offerType','creatDate', 'price', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6','v_7', 'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14'],dtype='object')
categorical_cols = Train_data.select_dtypes(include = 'object').columns
print(categorical_cols)
Index(['notRepairedDamage'], dtype='object')

2) 构建训练和测试样本

## 选择特征列
feature_cols = [col for col in numerical_cols if col not in ['SaleID','name','regDate','creatDate','price','model','brand','regionCode','seller']]
feature_cols = [col for col in feature_cols if 'Type' not in col]## 提前特征列,标签列构造训练样本和测试样本
X_data = Train_data[feature_cols]
Y_data = Train_data['price']X_test  = TestA_data[feature_cols]print('X train shape:',X_data.shape)
print('X test shape:',X_test.shape)
X train shape: (150000, 18)
X test shape: (50000, 18)
## 定义了一个统计函数,方便后续信息统计
def Sta_inf(data):print('_min',np.min(data))print('_max:',np.max(data))print('_mean',np.mean(data))print('_ptp',np.ptp(data))print('_std',np.std(data))print('_var',np.var(data))

3) 统计标签的基本分布信息

print('Sta of label:')
Sta_inf(Y_data)
Sta of label:
_min 11
_max: 99999
_mean 5923.32733333
_ptp 99988
_std 7501.97346988
_var 56279605.9427
## 绘制标签的统计图,查看标签分布
plt.hist(Y_data)
plt.show()
plt.close()

4) 缺省值用-1填补

X_data = X_data.fillna(-1)
X_test = X_test.fillna(-1)

Step 4:模型训练与预测

1) 利用xgb进行五折交叉验证查看模型的参数效果

## xgb-Model
xgr = xgb.XGBRegressor(n_estimators=120, learning_rate=0.1, gamma=0, subsample=0.8,\colsample_bytree=0.9, max_depth=7) #,objective ='reg:squarederror'scores_train = []
scores = []## 5折交叉验证方式
sk=StratifiedKFold(n_splits=5,shuffle=True,random_state=0)
for train_ind,val_ind in sk.split(X_data,Y_data):train_x=X_data.iloc[train_ind].valuestrain_y=Y_data.iloc[train_ind]val_x=X_data.iloc[val_ind].valuesval_y=Y_data.iloc[val_ind]xgr.fit(train_x,train_y)pred_train_xgb=xgr.predict(train_x)pred_xgb=xgr.predict(val_x)score_train = mean_absolute_error(train_y,pred_train_xgb)scores_train.append(score_train)score = mean_absolute_error(val_y,pred_xgb)scores.append(score)print('Train mae:',np.mean(score_train))
print('Val mae',np.mean(scores))
Train mae: 628.086664863
Val mae 715.990013454

2) 定义xgb和lgb模型函数

def build_model_xgb(x_train,y_train):model = xgb.XGBRegressor(n_estimators=150, learning_rate=0.1, gamma=0, subsample=0.8,\colsample_bytree=0.9, max_depth=7) #, objective ='reg:squarederror'model.fit(x_train, y_train)return modeldef build_model_lgb(x_train,y_train):estimator = lgb.LGBMRegressor(num_leaves=127,n_estimators = 150)param_grid = {'learning_rate': [0.01, 0.05, 0.1, 0.2],}gbm = GridSearchCV(estimator, param_grid)gbm.fit(x_train, y_train)return gbm

3)切分数据集(Train,Val)进行模型训练,评价和预测

## Split data with val
x_train,x_val,y_train,y_val = train_test_split(X_data,Y_data,test_size=0.3)
print('Train lgb...')
model_lgb = build_model_lgb(x_train,y_train)
val_lgb = model_lgb.predict(x_val)
MAE_lgb = mean_absolute_error(y_val,val_lgb)
print('MAE of val with lgb:',MAE_lgb)print('Predict lgb...')
model_lgb_pre = build_model_lgb(X_data,Y_data)
subA_lgb = model_lgb_pre.predict(X_test)
print('Sta of Predict lgb:')
Sta_inf(subA_lgb)
Train lgb...
MAE of val with lgb: 689.084070621
Predict lgb...
Sta of Predict lgb:
_min -519.150259864
_max: 88575.1087721
_mean 5922.98242599
_ptp 89094.259032
_std 7377.29714126
_var 54424513.1104
print('Train xgb...')
model_xgb = build_model_xgb(x_train,y_train)
val_xgb = model_xgb.predict(x_val)
MAE_xgb = mean_absolute_error(y_val,val_xgb)
print('MAE of val with xgb:',MAE_xgb)print('Predict xgb...')
model_xgb_pre = build_model_xgb(X_data,Y_data)
subA_xgb = model_xgb_pre.predict(X_test)
print('Sta of Predict xgb:')
Sta_inf(subA_xgb)
Train xgb...
MAE of val with xgb: 715.37757816
Predict xgb...
Sta of Predict xgb:
_min -165.479
_max: 90051.8
_mean 5922.9
_ptp 90217.3
_std 7361.13
_var 5.41862e+07

4)进行两模型的结果加权融合

## 这里我们采取了简单的加权融合的方式
val_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*val_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*val_xgb
val_Weighted[val_Weighted<0]=10 # 由于我们发现预测的最小值有负数,而真实情况下,price为负是不存在的,由此我们进行对应的后修正
print('MAE of val with Weighted ensemble:',mean_absolute_error(y_val,val_Weighted))
MAE of val with Weighted ensemble: 687.275745703
sub_Weighted = (1-MAE_lgb/(MAE_xgb+MAE_lgb))*subA_lgb+(1-MAE_xgb/(MAE_xgb+MAE_lgb))*subA_xgb## 查看预测值的统计进行
plt.hist(Y_data)
plt.show()
plt.close()

5)输出结果

sub = pd.DataFrame()
sub['SaleID'] = X_test.SaleID
sub['price'] = sub_Weighted
sub.to_csv('./sub_Weighted.csv',index=False)
sub.head()
SaleID price
0 0 39533.727414
1 1 386.081960
2 2 7791.974571
3 3 11835.211966
4 4 585.420407

【算法竞赛学习】二手车交易价格预测-Baseline相关推荐

  1. 【直播】王茂霖:二手车交易价格预测 Baseline 提高(河北高校数据挖掘邀请赛)

    二手车交易价格预测 Baseline 提高 目前 河北高校数据挖掘邀请赛 正在如火如荼的进行中.为了大家更好的参赛,王茂霖分享了 从0梳理1场数据挖掘赛事!,完整梳理了从环境准备.数据读取.数据分析. ...

  2. 【算法竞赛学习】二手车交易价格预测-Task1赛题理解

    二手车交易价格预测-Task1 赛题理解 一. 赛题理解 Tip:此部分为零基础入门数据挖掘的 Task1 赛题理解 部分,为大家入门数据挖掘比赛提供一个基本的赛题入门讲解,欢迎后续大家多多交流. 赛 ...

  3. 【算法竞赛学习】二手车交易价格预测-Task5模型融合

    二手车交易价格预测-Task5 模型融合 五.模型融合 Tip:此部分为零基础入门数据挖掘的 Task5 模型融合 部分,带你来了解各种模型结果的融合方式,在比赛的攻坚时刻冲刺Top,欢迎大家后续多多 ...

  4. 【算法竞赛学习】二手车交易价格预测-Task4建模调参

    二手车交易价格预测-Task4 建模调参 四.建模与调参 Tip:此部分为零基础入门数据挖掘的 Task4 建模调参 部分,带你来了解各种模型以及模型的评价和调参策略,欢迎大家后续多多交流. 赛题:零 ...

  5. 【算法竞赛学习】二手车交易价格预测-Task3特征工程

    二手车交易价格预测-Task3 特征工程 三. 特征工程目标 Tip:此部分为零基础入门数据挖掘的 Task3 特征工程部分,带你来了解各种特征工程以及分析方法,欢迎大家后续多多交流. 赛题:零基础入 ...

  6. 【算法竞赛学习】二手车交易价格预测-Task2数据分析

    二手车交易价格预测-Task2 数据分析 二. EDA-数据探索性分析 Tip:此部分为零基础入门数据挖掘的 Task2 EDA-数据探索性分析 部分,带你来了解数据,熟悉数据,和数据做朋友,欢迎大家 ...

  7. 《零基础入门数据挖掘 - 二手车交易价格预测》Baseline实施

    @[TOC]<零基础入门数据挖掘 - 二手车交易价格预测>baseline实施 <零基础入门数据挖掘 - 二手车交易价格预测>Baseline实施 前面陆陆续续学习机器学习大概 ...

  8. 零基础入门数据挖掘——二手车交易价格预测:baseline

    零基础入门数据挖掘 - 二手车交易价格预测 赛题理解 比赛要求参赛选手根据给定的数据集,建立模型,二手汽车的交易价格. 赛题以预测二手车的交易价格为任务,数据集报名后可见并可下载,该数据来自某交易平台 ...

  9. 数据挖掘实战:二手车交易价格预测

    赛题数据 数据来自某交易平台的二手车交易记录,总数据量超过40w,包含31列变量信息,其中15列为匿名变量.从中抽取15万条作为训练集,5万条作为测试集A,5万条作为测试集B,同时会对name.mod ...

最新文章

  1. saltstack state模块-状态管理
  2. SAP CRM Division customizing
  3. 深度学习之非极大值抑制(Non-maximum suppression,NMS)
  4. numpy 转置_Numpy基础:数组转置和轴对换
  5. vmware中装的ubuntu上不了网
  6. LeetCode 396. 旋转函数(Rotate Function)
  7. 提交App中断出现 Cannot proceed with delivery an existing transporter instan
  8. mybatis的mysql参数传递_Mybatis参数传递及返回类型
  9. 下载腾讯视频极速版_怎么退出腾讯视频登录
  10. python数据透视表怎么存下来_大数据分析如何利用Python创建数据透视表?
  11. 9GAG客户端,五一3天尽心之作,Just Android Design!(开源)+毛玻璃效果
  12. spring-ant-处理zip
  13. stimulsoft oracle,【Stimulsoft Reports Java教程】使用Oracle数据库
  14. MatchNet: Unifying Feature and Metric Learning for Patch-Based Matching
  15. php面试php数组变ahp,php实现把数组按指定的个数分隔
  16. 基于WinUSB的异步方式bulk传输的稳定性问题
  17. python 答题卡识别_opencv+python机读卡识别整合版
  18. new 与 delete 操作符
  19. cookie 和 sessions 的区别
  20. 贤者之路,Tensorrt的int8 calibration创建

热门文章

  1. Android开发之Handler机制记录
  2. CentOS7.5下搭建zabbix3.4监控
  3. PHP str_replace() 和str_ireplace()函数
  4. 网页简单上传图片 imgareaselect插件
  5. 英国电信公司沃达丰遭到网络攻击
  6. 配置WCF同时支持WSDL和REST,swaggerwcf生成文档
  7. SQL Server 用表中已有数据造数据
  8. android使用handler记录
  9. 操作系统复习笔记(四)
  10. Visual C++中的异常处理浅析(上)