Alibaba Cloud Tianchi Competition: Used Car Price Prediction Project (Personal Practice + Source Code)
# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor as RFR
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.feature_selection import mutual_info_regression as MIC
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Read the data files
train = pd.read_csv('used_car_train_20200313.csv',sep=' ')
test = pd.read_csv('used_car_testB_20200421.csv',sep=' ')
print('train:{}'.format(train.shape))
print('test:{}'.format(test.shape))
train:(150000, 31)
test:(50000, 30)
# Inspect the training set
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   SaleID             150000 non-null  int64
 1   name               150000 non-null  int64
 2   regDate            150000 non-null  int64
 3   model              149999 non-null  float64
 4   brand              150000 non-null  int64
 5   bodyType           145494 non-null  float64
 6   fuelType           141320 non-null  float64
 7   gearbox            144019 non-null  float64
 8   power              150000 non-null  int64
 9   kilometer          150000 non-null  float64
 10  notRepairedDamage  150000 non-null  object
 11  regionCode         150000 non-null  int64
 12  seller             150000 non-null  int64
 13  offerType          150000 non-null  int64
 14  creatDate          150000 non-null  int64
 15  price              150000 non-null  int64
 16  v_0                150000 non-null  float64
 17  v_1                150000 non-null  float64
 18  v_2                150000 non-null  float64
 19  v_3                150000 non-null  float64
 20  v_4                150000 non-null  float64
 21  v_5                150000 non-null  float64
 22  v_6                150000 non-null  float64
 23  v_7                150000 non-null  float64
 24  v_8                150000 non-null  float64
 25  v_9                150000 non-null  float64
 26  v_10               150000 non-null  float64
 27  v_11               150000 non-null  float64
 28  v_12               150000 non-null  float64
 29  v_13               150000 non-null  float64
 30  v_14               150000 non-null  float64
dtypes: float64(20), int64(10), object(1)
memory usage: 35.5+ MB
# Preview the first five rows
train.head()
SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.251410 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.565330 | -0.832687 | -0.229963 |
3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.110300 | 0.121964 | 0.033395 | 0.000000 | -4.509599 | 1.285940 | -0.501868 | -2.438353 | -0.478699 |
4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.091880 | 0.078819 | 0.121534 | -1.896240 | 0.910783 | 0.931110 | 2.834518 | 1.923482 |
5 rows × 31 columns
# Inspect the test set
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 30 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   SaleID             50000 non-null  int64
 1   name               50000 non-null  int64
 2   regDate            50000 non-null  int64
 3   model              50000 non-null  float64
 4   brand              50000 non-null  int64
 5   bodyType           48496 non-null  float64
 6   fuelType           47076 non-null  float64
 7   gearbox            48032 non-null  float64
 8   power              50000 non-null  int64
 9   kilometer          50000 non-null  float64
 10  notRepairedDamage  50000 non-null  object
 11  regionCode         50000 non-null  int64
 12  seller             50000 non-null  int64
 13  offerType          50000 non-null  int64
 14  creatDate          50000 non-null  int64
 15  v_0                50000 non-null  float64
 16  v_1                50000 non-null  float64
 17  v_2                50000 non-null  float64
 18  v_3                50000 non-null  float64
 19  v_4                50000 non-null  float64
 20  v_5                50000 non-null  float64
 21  v_6                50000 non-null  float64
 22  v_7                50000 non-null  float64
 23  v_8                50000 non-null  float64
 24  v_9                50000 non-null  float64
 25  v_10               50000 non-null  float64
 26  v_11               50000 non-null  float64
 27  v_12               50000 non-null  float64
 28  v_13               50000 non-null  float64
 29  v_14               50000 non-null  float64
dtypes: float64(20), int64(9), object(1)
memory usage: 11.4+ MB
train['notRepairedDamage'].value_counts()
0.0 111361
- 24324
1.0 14315
Name: notRepairedDamage, dtype: int64
# Replace '-' with NaN and convert notRepairedDamage to a numeric type
train['notRepairedDamage'] = train['notRepairedDamage'].replace('-', np.nan).astype('float')
train['notRepairedDamage'].value_counts()
0.0 111361
1.0 14315
Name: notRepairedDamage, dtype: int64
test['notRepairedDamage'] = test['notRepairedDamage'].replace('-', np.nan).astype('float')
test['notRepairedDamage'].value_counts()
0.0 37224
1.0 4707
Name: notRepairedDamage, dtype: int64
# Inspect the features with missing values; they are all categorical
print(train['model'].value_counts())
print(train['bodyType'].value_counts())
print(train['fuelType'].value_counts())
print(train['gearbox'].value_counts())
0.0 11762
19.0 9573
4.0 8445
1.0 6038
29.0 5186
...
240.0 2
209.0 2
245.0 2
242.0 2
247.0 1
Name: model, Length: 248, dtype: int64
0.0 41420
1.0 35272
2.0 30324
3.0 13491
4.0 9609
5.0 7607
6.0 6482
7.0 1289
Name: bodyType, dtype: int64
0.0 91656
1.0 46991
2.0 2212
3.0 262
4.0 118
5.0 45
6.0 36
Name: fuelType, dtype: int64
0.0 111623
1.0 32396
Name: gearbox, dtype: int64
# Collect the columns that contain missing values
col_train_null = train.columns[train.isnull().any()].to_list()
col_test_null = test.columns[test.isnull().any()].to_list()
print(col_train_null)
print(col_test_null)
['model', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
['bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
# Fill missing values with SimpleImputer (most frequent value per column)
imp = SimpleImputer(strategy='most_frequent')
train[col_train_null] = imp.fit_transform(train[col_train_null])
# Reuse the statistics fitted on the training set for the test set
test[col_train_null] = imp.transform(test[col_train_null])
# Confirm no missing values remain
train.isnull().any().sum()
0
test.isnull().any().sum()
0
# Check the date column for invalid values (month '00')
train['regDate'].astype('str').str[4:6].value_counts()
03 14949
06 13809
04 12798
05 12614
07 11937
10 11490
00 11347
11 10687
12 10637
09 10522
01 9943
08 9936
02 9331
Name: regDate, dtype: int64
# Helper to fix dates whose month is '00' by replacing it with month 1
def tran_date(x):
    month = int(x[4:6])
    if month == 0:
        month = 1
    return x[0:4] + '-' + str(month) + '-' + x[6:]
# Convert the date columns to datetime after fixing the zero months
train['regDate'] = pd.to_datetime(train['regDate'].astype('str').apply(tran_date))
test['regDate'] = pd.to_datetime(test['regDate'].astype('str').apply(tran_date))
train['creatDate'] = pd.to_datetime(train['creatDate'].astype('str').apply(tran_date))
test['creatDate'] = pd.to_datetime(test['creatDate'].astype('str').apply(tran_date))
# Density plot of the price column
sns.distplot(train['price'])
# Log-transform price to reduce the right skew
sns.distplot(np.log(train['price']))
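The log plot only illustrates the skew; the models below are still trained on the raw price with an L1 objective. If you instead want to fit on the log scale, a minimal sketch (assuming any scikit-learn-style regressor and the X_train/X_test matrices built later in this notebook):

# Optional: train on log1p(price) and invert the transform at prediction time
def fit_on_log_target(reg, X, y_raw):
    # log1p compresses the long right tail of price
    reg.fit(X, np.log1p(y_raw))
    return reg

def predict_price(reg, X):
    # expm1 undoes log1p, returning predictions on the original price scale
    return np.expm1(reg.predict(X))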
# Summary statistics of price
train['price'].describe([0.01,0.25,0.5,0.75,0.99])
count 150000.000000
mean 5923.327333
std 7501.998477
min 11.000000
1% 150.000000
25% 1300.000000
50% 3250.000000
75% 7700.000000
99% 34950.000000
max 99999.000000
Name: price, dtype: float64
# Mean price by registration year
train.resample('Y', on='regDate')['price'].mean().to_period('Y').plot(kind='bar')
<AxesSubplot:xlabel='regDate'>
# Create date-derived features
train['diff_day'] = (train['creatDate'] - train['regDate']).dt.days
train['diff_year'] = round(train['diff_day'] / 365, 1)
train['regDate_year'] = train['regDate'].dt.year
train['regDate_month'] = train['regDate'].dt.month
train['regDate_day'] = train['regDate'].dt.day
train['creatDate_year'] = train['creatDate'].dt.year
train['creatDate_month'] = train['creatDate'].dt.month
train['creatDate_day'] = train['creatDate'].dt.day
test['diff_day'] = (test['creatDate'] - test['regDate']).dt.days
test['diff_year'] = round(test['diff_day'] / 365, 1)
test['regDate_year'] = test['regDate'].dt.year
test['regDate_month'] = test['regDate'].dt.month
test['regDate_day'] = test['regDate'].dt.day
test['creatDate_year'] = test['creatDate'].dt.year
test['creatDate_month'] = test['creatDate'].dt.month
test['creatDate_day'] = test['creatDate'].dt.day
# How many distinct values does name take?
train['name'].unique().shape
(99662,)
# Descriptive statistics for power
train['power'].describe()
count 150000.000000
mean 119.316547
std 177.168419
min 0.000000
25% 75.000000
50% 110.000000
75% 150.000000
max 19312.000000
Name: power, dtype: float64
# Descriptive statistics for model
train['model'].describe()
count 150000.000000
mean 47.128707
std 49.536024
min 0.000000
25% 10.000000
50% 30.000000
75% 66.000000
max 247.000000
Name: model, dtype: float64
# Bin power into intervals of 20; values of 0 or above 600 fall outside the bins and get label 31
power_bins = [i * 20 for i in range(0, 31)]
train['power_bin'] = pd.cut(train['power'], power_bins, labels=False).fillna(31)
test['power_bin'] = pd.cut(test['power'], power_bins, labels=False).fillna(31)
# Bin model into intervals of 10
bin_model = [i*10 for i in range(0,26)]
train['model_bin'] = pd.cut(train['model'], bin_model, labels=False)
test['model_bin'] = pd.cut(test['model'], bin_model, labels=False)
train['model_bin'].value_counts()
0.0 25963
1.0 21123
2.0 18095
4.0 14872
3.0 11069
6.0 8748
7.0 5193
5.0 4629
8.0 3879
10.0 3818
11.0 3376
9.0 2550
12.0 2417
16.0 2096
15.0 1993
17.0 1699
13.0 1623
14.0 1162
18.0 988
19.0 860
21.0 771
20.0 640
22.0 473
23.0 171
24.0 29
Name: model_bin, dtype: int64
# Identify the categorical-style features
col_clf = ['brand', 'bodyType', 'fuelType', 'gearbox', 'kilometer', 'notRepairedDamage', 'seller', 'offerType']
train[col_clf]
brand | bodyType | fuelType | gearbox | kilometer | notRepairedDamage | seller | offerType | |
---|---|---|---|---|---|---|---|---|
0 | 6 | 1.0 | 0.0 | 0.0 | 12.5 | 0.0 | 0 | 0 |
1 | 1 | 2.0 | 0.0 | 0.0 | 15.0 | 0.0 | 0 | 0 |
2 | 15 | 1.0 | 0.0 | 0.0 | 12.5 | 0.0 | 0 | 0 |
3 | 10 | 0.0 | 0.0 | 1.0 | 15.0 | 0.0 | 0 | 0 |
4 | 5 | 1.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
149995 | 10 | 4.0 | 0.0 | 1.0 | 15.0 | 0.0 | 0 | 0 |
149996 | 11 | 0.0 | 0.0 | 0.0 | 10.0 | 0.0 | 0 | 0 |
149997 | 11 | 1.0 | 1.0 | 0.0 | 6.0 | 0.0 | 0 | 0 |
149998 | 10 | 3.0 | 1.0 | 0.0 | 15.0 | 0.0 | 0 | 0 |
149999 | 28 | 6.0 | 0.0 | 1.0 | 12.5 | 0.0 | 0 | 0 |
150000 rows × 8 columns
# Compare the distribution of each categorical feature in train (yellow) and test (blue)
plt.figure(figsize=(18,10))
for i in range(len(col_clf)):
    plt.subplot(2, 4, i + 1)
    train[col_clf[i]].value_counts().plot(kind='bar', color='yellow')
    test[col_clf[i]].value_counts().plot(kind='bar', color='blue')
    plt.title(col_clf[i])
plt.tight_layout()
# seller and offerType each contain essentially a single value, so drop them
train = train.drop(['seller', 'offerType'], axis=1)
test = test.drop(['seller', 'offerType'], axis=1)
col_clf = ['brand', 'bodyType', 'fuelType', 'gearbox', 'kilometer', 'notRepairedDamage']
# Mean price per category for each categorical feature
plt.figure(figsize=(18,10))
for i in range(len(col_clf)):
    plt.subplot(2, 3, i + 1)
    train.groupby(col_clf[i])['price'].mean().plot(kind='bar')
    plt.title(col_clf[i])
plt.tight_layout()
# Drop the SaleID column
train = train.drop(['SaleID'], axis=1)
test = test.drop(['SaleID'], axis=1)
# Heatmap of the correlations between features
plt.figure(figsize=(10,10))
corr = train.corr()
sns.heatmap(corr)
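To read the heatmap more directly, the correlations with the target can also be listed; a small sketch using the corr matrix just computed:

# Sort features by absolute correlation with price (strongest first)
price_corr = corr['price'].drop('price').abs().sort_values(ascending=False)
print(price_corr.head(15))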
# Replace name with its frequency count (rows sharing a name get the same count)
train['name_count'] = train.groupby('name')['name'].transform('count')
test['name_count'] = test.groupby('name')['name'].transform('count')
# Drop the original name column
train = train.drop('name', axis=1)
test = test.drop('name', axis=1)
# Combine each categorical feature with price statistics to create new features
col_clf = ['brand', 'model', 'kilometer', 'fuelType', 'bodyType']
for col in col_clf:
    train_gb = train.groupby(col)
    all_info = {}
    for kind, kind_data in train_gb:
        info = {}
        info[col + '_amount'] = len(kind_data)
        info[col + '_price_max'] = kind_data.price.max()
        info[col + '_price_median'] = kind_data.price.median()
        info[col + '_price_min'] = kind_data.price.min()
        info[col + '_price_sum'] = kind_data.price.sum()
        info[col + '_price_std'] = kind_data.price.std()
        info[col + '_price_average'] = round(kind_data.price.sum() / (len(kind_data) + 1), 2)
        all_info[kind] = info
    fe = pd.DataFrame(all_info).T.reset_index().rename(columns={'index': col})
    train = train.merge(fe, how='left', on=col)
    test = test.merge(fe, how='left', on=col)
print(train.shape)
print(test.shape)
(150000, 73)
(50000, 72)
# Combine kilometer and power into new aggregate features (named aggregation syntax)
col_kp = ['kilometer', 'power']
t1 = train.groupby(col_kp[0], as_index=False).agg(
    **{col_kp[0] + '_' + col_kp[1] + '_' + f: (col_kp[1], f)
       for f in ['count', 'max', 'median', 'min', 'sum', 'std', 'mean']}
)
train = train.merge(t1, how='left', on=col_kp[0])
test = test.merge(t1, how='left', on=col_kp[0])
print(train.shape)
print(test.shape)
(150000, 80)
(50000, 79)
# Cross the anonymous features most correlated with price to create new features
col_v = [0, 3, 8, 12]
for i in col_v:
    for j in col_v:
        train[str(i) + '*' + str(j)] = train['v_' + str(i)] * train['v_' + str(j)]
        test[str(i) + '*' + str(j)] = test['v_' + str(i)] * test['v_' + str(j)]
for i in col_v:
    for j in col_v:
        train[str(i) + '+' + str(j)] = train['v_' + str(i)] + train['v_' + str(j)]
        test[str(i) + '+' + str(j)] = test['v_' + str(i)] + test['v_' + str(j)]
for i in col_v:
    for j in col_v:
        train[str(i) + '-' + str(j)] = train['v_' + str(i)] - train['v_' + str(j)]
        test[str(i) + '-' + str(j)] = test['v_' + str(i)] - test['v_' + str(j)]
for i in col_v:
    train[str(i) + '*diff_year'] = train['v_' + str(i)] * train['diff_year']
    test[str(i) + '*diff_year'] = test['v_' + str(i)] * test['diff_year']
print(train.shape)
print(test.shape)
(150000, 132)
(50000, 131)
# Deep copies
train_new = train.copy(deep=True)
test_new = test.copy(deep=True)
# Drop features that will not be used for modelling
X_train = train_new.drop(['price', 'regDate', 'creatDate', 'regionCode'], axis=1)
X_test = test_new.drop(['regDate', 'creatDate', 'regionCode'], axis=1)
y_train = train_new['price']
X_train.to_csv('X_train.csv')
X_test.to_csv('X_test.csv')
y_train.to_csv('y_train.csv')
X_train = pd.read_csv('X_train.csv', index_col=0)
X_test = pd.read_csv('X_test.csv', index_col=0)
y_train = pd.read_csv('y_train.csv', index_col=0)
y_train = np.ravel(y_train)
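Before tuning, a quick sanity check (not part of the original run) that the train and test feature matrices expose the same engineered columns:

# The symmetric difference should be empty if feature engineering was applied consistently
print(set(X_train.columns) ^ set(X_test.columns))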
from sklearn.model_selection import GridSearchCV
from lightgbm.sklearn import LGBMRegressor
from time import time
import datetime
# LightGBM tuning: n_estimators
lgbm_scores = []
time0 = time()
for i in np.arange(200, 2001, 100):
    reg_lgbm = LGBMRegressor(learning_rate=0.1, n_estimators=i, objective='regression_l1', random_state=42)
    lgbm_score = cross_val_score(reg_lgbm, X_train, y_train, cv=3, scoring='neg_mean_absolute_error').mean()
    lgbm_scores.append(lgbm_score)
    print(time() - time0, lgbm_score)
print(max(lgbm_scores))
print(np.arange(200, 2001, 100)[np.argmax(lgbm_scores)])
plt.figure(figsize=(8,6))
plt.plot(np.arange(200, 2001, 100), lgbm_scores)
8.246474504470825 -594.8424586944835
19.095320463180542 -568.7034240842332
32.98492646217346 -552.3891983733455
48.85633111000061 -542.3964373457885
67.14593052864075 -535.7482170534481
87.76364278793335 -529.8703107609151
111.7964539527893 -525.7724715224499
137.1214382648468 -522.2536456032711
164.56334352493286 -519.7242720183268
194.57858514785767 -517.0013123928143
226.66670727729797 -515.3243156345435
261.38421607017517 -513.8374464322388
298.36878204345703 -512.2929007437376
337.6414248943329 -511.0992392114774
379.26119804382324 -510.01377926737086
423.065954208374 -508.88194129037237
468.8004615306854 -507.9469756382571
516.7109439373016 -506.9312361319216
567.1637334823608 -505.91707028368864
-505.91707028368864
2000
# LightGBM tuning: max_depth and num_leaves
parameters = {
    'max_depth': [4, 5, 6, 7],
    'num_leaves': np.arange(5, 100, 5)
}
reg_lgbm = LGBMRegressor(learning_rate=0.1, n_estimators=2000, objective='regression_l1', random_state=42)
gs = GridSearchCV(reg_lgbm, param_grid=parameters, cv=3, scoring='neg_mean_absolute_error', n_jobs=-1)
gs_model = gs.fit(X_train, y_train)
print('Best score: {}'.format(gs_model.best_score_))
print('Best params: {}'.format(gs_model.best_params_))
print('Best estimator: {}'.format(gs_model.best_estimator_))
Best score: -500.9400071901773
Best params: {'max_depth': 7, 'num_leaves': 45}
Best estimator: LGBMRegressor(max_depth=7, n_estimators=2000, num_leaves=45, objective='regression_l1', random_state=42)
# XGBoost tuning: learning_rate
xgb_scores = []
time0 = time()
for i in np.arange(0.05, 0.31, 0.05):
    reg_xgb = xgb.XGBRegressor(n_estimators=200, learning_rate=i)
    xgb_score = cross_val_score(reg_xgb, X_train, y_train, cv=3, scoring='neg_mean_absolute_error').mean()
    xgb_scores.append(xgb_score)
    print(time() - time0)
print(max(xgb_scores))
print(np.arange(0.05, 0.31, 0.05)[np.argmax(xgb_scores)])
plt.figure(figsize=(8,6))
plt.plot(np.arange(0.05, 0.31, 0.05), xgb_scores)
132.54270577430725
268.7969219684601
402.8747355937958
543.0700986385345
673.9736497402191
807.1916081905365
-559.3930964745617
0.15000000000000002
# XGBoost tuning: max_depth
xgb_scores = []
time0 = time()
for i in np.arange(5, 12, 1):
    reg_xgb = xgb.XGBRegressor(n_estimators=200, learning_rate=0.15, max_depth=i)
    xgb_score = cross_val_score(reg_xgb, X_train, y_train, cv=3, scoring='neg_mean_absolute_error').mean()
    xgb_scores.append(xgb_score)
    print(time() - time0, xgb_score)
print(max(xgb_scores))
print(np.arange(5, 12, 1)[np.argmax(xgb_scores)])
plt.figure(figsize=(8,6))
plt.plot(np.arange(5, 12, 1), xgb_scores)
106.13332343101501 -581.7825425249935
233.66106414794922 -559.3930964745617
386.93016719818115 -545.8423732084185
577.3439819812775 -540.0337358052888
789.385425567627 -535.3663493749481
1027.0641367435455 -537.2171228026253
1293.055543422699 -540.2481429747703
-535.3663493749481
9
# XGBoost tuning: colsample_bytree
xgb_scores = []
time0 = time()
for i in np.arange(0.4, 0.8, 0.1):
    reg_xgb = xgb.XGBRegressor(n_estimators=200, learning_rate=0.15, max_depth=9, colsample_bytree=i)
    xgb_score = cross_val_score(reg_xgb, X_train, y_train, cv=3, scoring='neg_mean_absolute_error').mean()
    xgb_scores.append(xgb_score)
    print(time() - time0, xgb_score)
print(max(xgb_scores))
print(np.arange(0.4, 0.8, 0.1)[np.argmax(xgb_scores)])
plt.figure(figsize=(8,6))
plt.plot(np.arange(0.4, 0.8, 0.1), xgb_scores)
101.24388527870178 -540.0945292119669
219.8485279083252 -536.1881194847441
358.25814485549927 -534.7100199133007
509.0920376777649 -534.7369636623599
-534.7100199133007
0.6
# XGBoost tuning: colsample_bylevel
# Note: floating-point error makes np.arange(0.5, 1.1, 0.1) also yield ~1.1, which is invalid
# for colsample_bylevel; its CV score comes back as nan and np.argmax then points at it, so the
# best valid value (0.9, score -533.37) is used for the final model below.
xgb_scores = []
time0 = time()
for i in np.arange(0.5, 1.1, 0.1):
    reg_xgb = xgb.XGBRegressor(n_estimators=200, learning_rate=0.15, max_depth=9, colsample_bytree=0.6, colsample_bylevel=i)
    xgb_score = cross_val_score(reg_xgb, X_train, y_train, cv=3, scoring='neg_mean_absolute_error').mean()
    xgb_scores.append(xgb_score)
    print(time() - time0, xgb_score)
print(max(xgb_scores))
print(np.arange(0.5, 1.1, 0.1)[np.argmax(xgb_scores)])
plt.figure(figsize=(8,6))
plt.plot(np.arange(0.5, 1.1, 0.1), xgb_scores)
82.85519623756409 -534.1242236725466
176.23594546318054 -535.4707890065283
279.6718213558197 -534.5832091972042
391.7265202999115 -533.988677477093
518.4175012111664 -533.3711266578522
656.7450432777405 -534.7100199133007
657.538763999939 nan
-533.3711266578522
1.0999999999999999
# Import scikit-learn's built-in stacking ensemble
from sklearn.ensemble import StackingRegressor
# Instantiate the base models with the tuned parameters
reg_lgbm = LGBMRegressor(max_depth=7, n_estimators=2000, num_leaves=45, objective='regression_l1', random_state=42)
reg_xgb = xgb.XGBRegressor(n_estimators=200, learning_rate=0.15, max_depth=9, colsample_bytree=0.6, colsample_bylevel=0.9)
# Build the stacking ensemble from the two tuned models
estimators = [('lgbm', reg_lgbm), ('xgb', reg_xgb)]
sr = StackingRegressor(estimators, verbose=True)
sr_scores = cross_val_score(sr, X_train, y_train, cv=3, scoring='neg_mean_absolute_error')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 2.9min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.2min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 2.8min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 2.8min finished
# Cross-validation scores of the stacked model
sr_scores
array([-491.53751775, -494.7186037 , -486.69418657])
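These are negative MAE values, so the average cross-validated error of the stacked model works out to roughly 491:

print(-sr_scores.mean())   # about 490.98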
# Fit on the full training set and export the predictions
sr.fit(X_train, y_train)
sr_predict = sr.predict(X_test)
pd.DataFrame(sr_predict).to_csv('stack_submit.csv')
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 1.8min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 4.1min finished
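The dump above only stores the raw predicted values. If the submission is expected in the usual SaleID,price layout (an assumption about the required format), the IDs can be recovered from the raw test file, since SaleID was dropped earlier; a minimal sketch:

# Hypothetical submission layout: SaleID plus the predicted price
test_ids = pd.read_csv('used_car_testB_20200421.csv', sep=' ', usecols=['SaleID'])
submit = pd.DataFrame({'SaleID': test_ids['SaleID'], 'price': sr_predict})
submit.to_csv('stack_submit.csv', index=False)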