Tuning XGBoost

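The point I want to make is this: do not expect parameter tuning alone, or swapping in a slightly better model, to produce a huge leap in the final result. Large improvements come from feature engineering and model ensembling.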

The basics

import xgboost as xgb
model2 = xgb.XGBRegressor(max_depth=6, learning_rate=0.05, n_estimators=100, min_child_weight=1, random_state=42)

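max_depth: maximum depth of each binary tree, default 6. Larger values overfit more easily; smaller values underfit more easily.

learning_rate: the learning rate, i.e. the shrinkage applied to each tree's contribution.

n_estimators: the number of base learners (boosting rounds).

min_child_weight: default 1. Larger values underfit more easily; smaller values overfit more easily (a larger value keeps the model from learning patterns specific to a few local samples).

gamma: default 0, and 0 is also what we commonly use. A node is split only if the split lowers the loss, and gamma sets the minimum loss reduction required for a split. The larger the value, the more conservative the algorithm, because a bigger loss reduction is needed before a node can be split, so trees split less readily. Range: [0, ∞).

subsample [default=1]: the row sampling rate. If set to 0.5, XGBoost randomly selects half of the training samples before growing each tree.

colsample_bytree [default=1]: the column sampling rate (in practice, the feature sampling rate) used when constructing each tree.

alpha [default=0, alias: reg_alpha]: L1 regularization on the weights, similar to the regularization in lasso regression. It is mainly useful when the data is very high-dimensional, where it can make the algorithm run faster.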
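To make gamma and min_child_weight more concrete, here is a minimal sketch of the split decision in plain Python. It assumes the standard second-order gain formula from the XGBoost paper; the function names and the toy gradient and hessian sums are made up for illustration, and this is not XGBoost's actual implementation.

```python
def leaf_score(g, h, reg_lambda=1.0):
    """Structure score of a leaf with gradient sum g and hessian sum h."""
    return g * g / (h + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction obtained by the split, minus the gamma penalty."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def should_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Accept a split only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Toy gradient/hessian sums for one candidate split (illustrative values only).
candidate = dict(g_left=-4.0, h_left=3.0, g_right=5.0, h_right=6.0)

print(should_split(**candidate, gamma=0.0))                        # True
print(should_split(**candidate, gamma=5.0))                        # False: gain below gamma
print(should_split(**candidate, gamma=0.0, min_child_weight=4.0))  # False: left child too light
```

With squared-error loss each sample has a hessian of 1, so in that case min_child_weight behaves like a minimum number of samples per child.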

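Tuning max_depth and min_child_weight

Using grid search

GridSearchCV:

estimator: the model to tune (a classifier or a regressor)

param_grid: a dict mapping each parameter name to the list of values to try

cv: the cross-validation setting. The default None uses 5-fold cross-validation in scikit-learn 0.22 and later (3-fold in earlier versions). It can also be an integer number of folds, or an iterable that yields train/test splits.

n_jobs: the number of parallel jobs, e.g. 4

verbose: 10 prints progress

In Jupyter, for some reason progress is not printed with n_jobs=4; setting n_jobs=1 makes it print.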
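To get a feel for what a grid search costs, here is a rough sketch that enumerates the candidates by hand. It only uses the standard library; the grid values are the ones used in this post, and the enumeration order is my own choice and may differ from the order scikit-learn uses internally.

```python
from itertools import product

param_grid = {
    'max_depth': list(range(3, 10, 2)),        # [3, 5, 7, 9]
    'min_child_weight': list(range(1, 6, 2)),  # [1, 3, 5]
}
cv = 5

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

print(len(candidates))       # 12 candidates
print(len(candidates) * cv)  # 60 model fits during the search
print(candidates[0])         # {'max_depth': 3, 'min_child_weight': 1}
```

With the default refit=True, GridSearchCV fits one more model on the full data at the end, so the cost grows quickly as parameters are added to the same grid. That is the reason for tuning the parameters in small groups here.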

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Coarse grid: max_depth in [3, 5, 7, 9], min_child_weight in [1, 3, 5]
param_test1 = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2)),
}
gsearch1 = GridSearchCV(
    # n_jobs / random_state replace the older nthread / seed names
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=20, max_depth=5, min_child_weight=1,
                            gamma=0, subsample=0.8, colsample_bytree=0.8,
                            objective='binary:logistic', n_jobs=4, scale_pos_weight=1, random_state=27),
    param_grid=param_test1,
    # scoring='roc_auc',
    n_jobs=1, verbose=10, cv=5)
gsearch1.fit(train[predictors], train[target])

gsearch1.best_params_

Output: {'max_depth': 3, 'min_child_weight': 5}

So the next step is to search around these values: max_depth in 2, 3, 4 and min_child_weight in 4, 5, 6.
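Building the finer grid around the previous best values can be written as a small helper. This is only a sketch: the helper name neighbourhood and its arguments are my own, not part of scikit-learn or XGBoost.

```python
def neighbourhood(best, step=1, minimum=None):
    """Return [best - step, best, best + step], dropping values below `minimum`."""
    values = [best - step, best, best + step]
    if minimum is not None:
        values = [v for v in values if v >= minimum]
    return values


best = {'max_depth': 3, 'min_child_weight': 5}
refined_grid = {name: neighbourhood(value, step=1, minimum=1)
                for name, value in best.items()}

print(refined_grid)  # {'max_depth': [2, 3, 4], 'min_child_weight': [4, 5, 6]}

# Reminder: range() excludes its stop value.
print(list(range(2, 4)))  # [2, 3]
print(list(range(2, 5)))  # [2, 3, 4]
```

Writing the neighbours out explicitly avoids the usual off-by-one with range(), whose stop value is not included.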

Searching further

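# Note: range(2, 4) yields [2, 3] and range(4, 6) yields [4, 5], because the stop value is exclusive.
# Use range(2, 5) and range(4, 7) to also cover 4 and 6.
param_test2 = {
    'max_depth': list(range(2, 4, 1)),
    'min_child_weight': list(range(4, 6, 1)),
}
gsearch2 = GridSearchCV(
    # max_depth=6 here is overridden by the values in the grid
    estimator=xgb.XGBRegressor(max_depth=6, learning_rate=0.05, n_estimators=100, random_state=42),
    param_grid=param_test2,
    n_jobs=1,
    verbose=10,
    cv=5)
gsearch2.fit(df6.drop(['label', 'cust_wid'], axis=1), df6['label'])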

gsearch2.best_params_

After this search the best setting is still {'max_depth': 3, 'min_child_weight': 5}

Tuning gamma

Search gamma from 0 up to (but not including) 0.5, in steps of 0.1.

# gamma candidates: [0.0, 0.1, 0.2, 0.3, 0.4]
param_test3 = {
    'gamma': [i / 10.0 for i in range(0, 5)],
}
gsearch3 = GridSearchCV(
    estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.05, min_child_weight=5, n_estimators=100, random_state=42),
    param_grid=param_test3,
    n_jobs=1,
    verbose=10,
    cv=5)
gsearch3.fit(df6.drop(['label', 'cust_wid'], axis=1), df6['label'])

Best gamma: 0.4
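One thing worth checking after each search is whether the best value sits on the edge of the grid, as 0.4 does here. A minimal sketch of that check follows; the helper on_grid_edge and the extended grid are hypothetical, and I did not run the extended search.

```python
def on_grid_edge(grid, best):
    """True if the best value is the smallest or largest value that was tried."""
    ordered = sorted(grid)
    return best == ordered[0] or best == ordered[-1]


gamma_grid = [i / 10.0 for i in range(0, 5)]  # [0.0, 0.1, 0.2, 0.3, 0.4]
best_gamma = 0.4

if on_grid_edge(gamma_grid, best_gamma):
    # The optimum may lie outside the range searched, so a follow-up grid
    # starting at the best value would be a reasonable next step.
    extended_grid = [i / 10.0 for i in range(4, 9)]
    print("best gamma is on the grid edge; consider also trying", extended_grid)
```

When the best value is an interior point of the grid, the search range was probably wide enough.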

Tuning subsample and colsample_bytree

# Both rates are searched over [0.75, 0.8, 0.85, 0.9, 0.95]
param_test4 = {
    'subsample': [i / 100.0 for i in range(75, 100, 5)],
    'colsample_bytree': [i / 100.0 for i in range(75, 100, 5)],
}
gsearch4 = GridSearchCV(
    estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.05, min_child_weight=5, n_estimators=100, random_state=42),
    param_grid=param_test4,
    n_jobs=1,
    verbose=10,
    cv=5)
gsearch4.fit(df6.drop(['label', 'cust_wid'], axis=1), df6['label'])

Best output: {'colsample_bytree': 0.85, 'subsample': 0.75}
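As an illustration of what these two rates control, here is a toy sketch of sampling rows and columns for a single tree. It uses only the standard library; the function name, the data shape (1000 rows, 20 features) and the seed are assumptions made for the example, and XGBoost's real sampling code is different.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed=0):
    """Pick the row and column indices one tree would be grown on (toy version)."""
    rng = random.Random(seed)
    n_row_pick = max(1, int(n_rows * subsample))
    n_col_pick = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_row_pick))
    cols = sorted(rng.sample(range(n_cols), n_col_pick))
    return rows, cols


rows, cols = sample_rows_and_columns(1000, 20, subsample=0.75, colsample_bytree=0.85)
print(len(rows), len(cols))  # 750 17
```

Each tree sees a different random subset, which makes the trees less correlated with one another and usually reduces overfitting.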

Tuning the regularization parameter

param_test5 = {
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],
}

gsearch5 = GridSearchCV(
    estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.05, min_child_weight=5, n_estimators=100, random_state=42,
                               colsample_bytree=0.85, subsample=0.75),
    param_grid=param_test5,
    n_jobs=1,
    verbose=10,
    cv=5)
gsearch5.fit(df6.drop(['label', 'cust_wid'], axis=1), df6['label'])

Output: {'reg_alpha': 100}
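To see why the reg_alpha values span so many orders of magnitude, here is a rough sketch of how an L1 penalty shrinks a leaf weight. It assumes the usual soft-thresholding form of L1 regularization; the function leaf_weight and the gradient and hessian sums are made up for illustration.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Leaf weight with L1 soft-thresholding on the gradient sum (toy version)."""
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0  # the penalty is larger than the signal: the leaf outputs zero
    return -shrunk / (hess_sum + reg_lambda)


# Same leaf statistics, increasing penalty (illustrative numbers only).
for alpha in [1e-5, 1e-2, 0.1, 1, 100]:
    print(alpha, leaf_weight(-30.0, 49.0, reg_alpha=alpha))
```

Small values barely change the weight, while a value larger than the absolute gradient sum sets it to zero, so a best value of 100 suggests the model benefits from quite strong shrinkage on this data.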

Tuning the learning rate

param_test6 = {
    'learning_rate': [0.005, 0.01, 0.05, 0.1, 0.5, 1],
}

gsearch6 = GridSearchCV(
    # learning_rate=0.05 here is overridden by the values in the grid
    estimator=xgb.XGBRegressor(max_depth=3, learning_rate=0.05, min_child_weight=5, n_estimators=100, random_state=42,
                               colsample_bytree=0.85, subsample=0.75),
    param_grid=param_test6,
    n_jobs=1,
    verbose=10,
    cv=5)
gsearch6.fit(df6.drop(['label', 'cust_wid'], axis=1), df6['label'])

Best learning rate: {'learning_rate': 0.05}
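The learning rate only makes sense together with n_estimators, which is fixed at 100 in this search. The toy sketch below is my own simplification (boosting with a constant base learner on squared loss, nothing like a real tree ensemble), but it shows how much of the target is fitted after 100 rounds for each candidate rate.

```python
def fraction_fitted(learning_rate, n_estimators):
    """Toy boosting: each round adds learning_rate times the remaining residual."""
    target = 1.0
    prediction = 0.0
    for _ in range(n_estimators):
        residual = target - prediction
        prediction += learning_rate * residual
    return prediction


for lr in [0.005, 0.01, 0.05, 0.1, 0.5, 1]:
    print(lr, round(fraction_fitted(lr, 100), 4))
```

In this toy setting the fitted fraction is 1 - (1 - learning_rate) ** n_estimators, so the smallest rates are still far from the target after 100 rounds. A smaller learning rate is usually paired with a larger n_estimators.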

The best model and its result
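The values found in the separate searches can be gathered into one parameter dict. This is just a sketch of the bookkeeping: the variable names are mine, the values are the ones reported above, and whether all of them were used in the final model is an assumption.

```python
# Best values reported by each search stage, in the order they were run.
stage_results = [
    {'max_depth': 3, 'min_child_weight': 5},
    {'gamma': 0.4},
    {'colsample_bytree': 0.85, 'subsample': 0.75},
    {'reg_alpha': 100},
    {'learning_rate': 0.05},
]

# Settings that were held fixed throughout.
best_params = {'n_estimators': 100, 'random_state': 42}
for result in stage_results:
    best_params.update(result)

print(best_params)
# These could then be passed on as keyword arguments, for example:
# model = xgb.XGBRegressor(**best_params)
```

Keeping everything in one dict makes it harder to forget a value found in an earlier stage when building the final model.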

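The final score improved by 0.001. Hahaha.

My ranking went from 103rd to 96th, up 7 places. So it still comes back to the same sentence:

The point I want to make is this: do not expect parameter tuning alone, or swapping in a slightly better model, to produce a huge leap in the final result. Large improvements come from feature engineering and model ensembling.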

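import numpy as np
# model2 is assumed here to have been refit with the tuned parameters
pred_y = model2.predict(df7.drop(['label', 'cust_wid'], axis=1))
y_pred = pred_y.astype(int)  # astype(int) truncates toward zero
np.save("xgboost_best", y_pred)  # writes xgboost_best.npy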
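One detail in the last step is easy to miss: converting float predictions to int truncates rather than rounds. The sketch below uses plain Python and made-up prediction values to show the difference; whether truncation is what the competition metric wants is something I have not verified here.

```python
predictions = [0.2, 0.9, 1.7, -0.6]  # made-up regression outputs

# int() truncates toward zero, the same direction as numpy's astype(int).
truncated = [int(p) for p in predictions]
# round() goes to the nearest integer instead.
rounded = [round(p) for p in predictions]

print(truncated)  # [0, 0, 1, 0]
print(rounded)    # [0, 1, 2, -1]
```

If the labels are integers, rounding is often closer to what is intended than truncation.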