ML / XGBoost: XGBoost Parameter Tuning, a Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 3)

Contents

3. Parameter Tuning with Example

General Approach for Parameter Tuning

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

Step 2: Tune max_depth and min_child_weight

Original title: Complete Guide to Parameter Tuning in XGBoost with codes in Python
Original article: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
All rights remain with the original author; this post is only a translation.

Related articles
ML / XGBoost: Introduction to the XGBoost algorithm and model (with illustrations, including XGBoost parallel processing), key ideas, implementation (objective/evaluation functions), installation, usage, and example applications
ML / XGBoost: Introduction to the Kaggle favourite XGBoost (resources), installation, usage, and example applications
ML / XGBoost: A Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 1)
ML / XGBoost: A Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 2)
ML / XGBoost: A Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 3)
ML / XGBoost: A Translation of "Complete Guide to Parameter Tuning in XGBoost with codes in Python" (Part 4)

3. Parameter Tuning with Example

We will take the data set from the Data Hackathon 3.x AV hackathon, the same one used in the GBM article. The details of the problem can be found on the competition page. You can download the data set from here. I have performed the following steps:

  1. City variable dropped because of too many categories
  2. DOB converted to Age | DOB dropped
  3. EMI_Loan_Submitted_Missing created which is 1 if EMI_Loan_Submitted was missing else 0 | Original variable EMI_Loan_Submitted dropped
  4. EmployerName dropped because of too many categories
  5. Existing_EMI imputed with 0 (median) since only 111 values were missing
  6. Interest_Rate_Missing created which is 1 if Interest_Rate was missing else 0 | Original variable Interest_Rate dropped
  7. Lead_Creation_Date dropped because it made little intuitive impact on the outcome
  8. Loan_Amount_Applied, Loan_Tenure_Applied imputed with median values
  9. Loan_Amount_Submitted_Missing created which is 1 if Loan_Amount_Submitted was missing else 0 | Original variable Loan_Amount_Submitted dropped
  10. Loan_Tenure_Submitted_Missing created which is 1 if Loan_Tenure_Submitted was missing else 0 | Original variable Loan_Tenure_Submitted dropped
  11. LoggedIn, Salary_Account dropped
  12. Processing_Fee_Missing created which is 1 if Processing_Fee was missing else 0 | Original variable Processing_Fee dropped
  13. Source – top 2 kept as is and all others combined into a different category
  14. Numerical and One-Hot-Coding performed

For those who have the original data from the competition, you can check out these steps from the data_preparation iPython notebook in the repository.
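For readers without access to the repository, here is a minimal pandas sketch of a few of the steps above (an editor's addition, not from the original article). The column names are taken from the list above; everything else, such as the reference year used for Age and the exact set of categorical columns, is an assumption.

# Illustrative sketch only; mirrors steps 2, 3, 8 and 14 above on assumed column names.
import pandas as pd

df = pd.read_csv('train.csv')   # hypothetical raw competition file

# Step 2: DOB converted to Age, DOB dropped (reference year is an assumption)
df['Age'] = 2015 - pd.to_datetime(df['DOB'], errors='coerce').dt.year
df = df.drop('DOB', axis=1)

# Step 3: missing-value flag created, then the original column dropped
df['EMI_Loan_Submitted_Missing'] = df['EMI_Loan_Submitted'].isnull().astype(int)
df = df.drop('EMI_Loan_Submitted', axis=1)

# Step 8: median imputation
for col in ['Loan_Amount_Applied', 'Loan_Tenure_Applied']:
    df[col] = df[col].fillna(df[col].median())

# Step 14: one-hot encode the remaining categorical columns (column list is an assumption)
df = pd.get_dummies(df, columns=['Source'])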

Let's start by importing the required libraries and loading the data:

#Import libraries:
import pandas as pd
import numpy as np
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn import cross_validation, metrics   #Additional sklearn functions
from sklearn.grid_search import GridSearchCV   #Performing grid search

import matplotlib.pylab as plt
%matplotlib inline
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4

train = pd.read_csv('train_modified.csv')
target = 'Disbursed'
IDcol = 'ID'

Note that I have imported 2 forms of XGBoost:

  1. xgb – this is the direct xgboost library. I will use a specific function "cv" from this library
  2. XGBClassifier – this is the sklearn wrapper for XGBoost. This allows us to use sklearn's Grid Search with parallel processing in the same way we did for GBM

Before proceeding further, let's define a function which will help us create XGBoost models and perform cross-validation. The best part is that you can take this function as-is and use it later for your own models.

def modelfit(alg, dtrain, predictors, useTrainCV=True, cv_folds=5, early_stopping_rounds=50):

    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain[predictors].values, label=dtrain[target].values)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'],
                          nfold=cv_folds, metrics='auc',
                          early_stopping_rounds=early_stopping_rounds, show_progress=False)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    alg.fit(dtrain[predictors], dtrain['Disbursed'], eval_metric='auc')

    #Predict training set:
    dtrain_predictions = alg.predict(dtrain[predictors])
    dtrain_predprob = alg.predict_proba(dtrain[predictors])[:,1]

    #Print model report:
    print "\nModel Report"
    print "Accuracy : %.4g" % metrics.accuracy_score(dtrain['Disbursed'].values, dtrain_predictions)
    print "AUC Score (Train): %f" % metrics.roc_auc_score(dtrain['Disbursed'], dtrain_predprob)

    feat_imp = pd.Series(alg.booster().get_fscore()).sort_values(ascending=False)
    feat_imp.plot(kind='bar', title='Feature Importances')
    plt.ylabel('Feature Importance Score')

This code is slightly different from what I used for GBM. The focus of this article is to cover the concepts and not coding. Please feel free to drop a note in the comments if you find any challenges in understanding any part of it. Note that xgboost's sklearn wrapper doesn't have a "feature_importances" metric but a get_fscore() function which does the same job.
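A compatibility note from the translator (not in the original article): the code above targets the library versions available when the article was written. On newer scikit-learn and xgboost releases a few names change; a minimal sketch of the equivalents, with the exact version boundaries left as assumptions:

# Editor's sketch: equivalents for newer library versions.
from sklearn import metrics
from sklearn.model_selection import GridSearchCV   # replaces the deprecated sklearn.grid_search

# Inside modelfit(), recent xgboost releases rename a couple of calls:
#   alg.booster().get_fscore()        ->  alg.get_booster().get_fscore()
#   xgb.cv(..., show_progress=False)  ->  xgb.cv(..., verbose_eval=False)
# The sklearn wrapper also exposes alg.feature_importances_ directly.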

General Approach for Parameter Tuning

We will use an approach similar to that of GBM here. The various steps to be performed are:

  1. Choose a relatively high learning rate. Generally a learning rate of 0.1 works, but somewhere between 0.05 and 0.3 should work for different problems. Determine the optimum number of trees for this learning rate. XGBoost has a very useful function called "cv" which performs cross-validation at each boosting iteration and thus returns the optimum number of trees required (a minimal sketch follows this list).
  2. Tune tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree) for the decided learning rate and number of trees. Note that we can choose different parameters to define a tree and I'll take up an example here.
  3. Tune regularization parameters (lambda, alpha) for xgboost, which can help reduce model complexity and enhance performance.
  4. Lower the learning rate and decide the optimal parameters.
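The modelfit function defined earlier wraps exactly this mechanism. As a standalone illustration (a translator's sketch, not from the original article), the cv call behind step 1 looks roughly like this, assuming the train, predictors and target objects from the code above:

# Sketch: cross-validation with early stopping to find the optimum number of trees
# for a fixed learning rate of 0.1. Parameter values here are only starting points.
xgtrain = xgb.DMatrix(train[predictors].values, label=train[target].values)
params = {'eta': 0.1, 'max_depth': 5, 'objective': 'binary:logistic'}
cvresult = xgb.cv(params, xgtrain, num_boost_round=1000, nfold=5,
                  metrics='auc', early_stopping_rounds=50)
print(len(cvresult))   # rounds kept before the CV AUC stopped improving = optimum n_estimators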

Let us look at a more detailed step-by-step approach.

Step 1: Fix learning rate and number of estimators for tuning tree-based parameters

In order to decide on the boosting parameters, we need to set some initial values of the other parameters. Let's take the following values:

  1. max_depth = 5 : This should be between 3-10. I've started with 5 but you can choose a different number as well. 4-6 can be good starting points.
  2. min_child_weight = 1 : A smaller value is chosen because it is a highly imbalanced class problem and leaf nodes can have smaller size groups.
  3. gamma = 0 : A smaller value like 0.1-0.2 can also be chosen for starting. This will anyway be tuned later.
  4. subsample, colsample_bytree = 0.8 : This is a commonly used start value. Typical values range between 0.5-0.9.
  5. scale_pos_weight = 1 : Because of high class imbalance.

Please note that all the above are just initial estimates and will be tuned later. Let's take the default learning rate of 0.1 here and check the optimum number of trees using the cv function of xgboost. The function defined above will do it for us.

#Choose all predictors except target & IDcols
predictors = [x for x in train.columns if x not in [target, IDcol]]
xgb1 = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
modelfit(xgb1, train, predictors)

As you can see, here we got 140 as the optimal number of estimators for a 0.1 learning rate. Note that this value might be too high for you depending on the power of your system. In that case you can increase the learning rate and re-run the command to get a reduced number of estimators.

Note: You will see the test AUC as "AUC Score (Test)" in the outputs here. But this would not appear if you try to run the command on your system, as the data is not made public. It's provided here just for reference. The part of the code which generates this output has been removed here.
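For instance (a translator's sketch, not from the original article), if 140 trees at a 0.1 learning rate is too heavy for your machine, you could raise the rate and let the same cv-based routine pick a smaller tree count; 0.3 below is purely an illustrative value:

# Sketch: a higher learning rate usually converges in fewer boosting rounds,
# so modelfit's internal xgb.cv call should settle on a smaller n_estimators.
xgb1_fast = XGBClassifier(
    learning_rate=0.3,      # raised from 0.1; illustrative
    n_estimators=1000,      # upper bound; early stopping picks the actual count
    max_depth=5, min_child_weight=1, gamma=0,
    subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
modelfit(xgb1_fast, train, predictors)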

Step 2: Tune max_depth and min_child_weight

We tune these first as they will have the highest impact on the model outcome. To start with, let's set wider ranges and then perform another iteration for smaller ranges.

Important Note: I'll be doing some heavy-duty grid searches in this section, which can take 15-30 minutes or even more to run depending on your system. You can vary the number of values you are testing based on what your system can handle.

param_test1 = {
    'max_depth': range(3,10,2),
    'min_child_weight': range(1,6,2)
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch1.fit(train[predictors], train[target])
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_

Here, we have run 12 combinations with wider intervals between values. The ideal values are 5 for max_depth and 5 for min_child_weight. Let's go one step deeper and look for optimum values. We'll search for values 1 above and 1 below the optimum values because we took an interval of two.

param_test2 = {
    'max_depth': [4,5,6],
    'min_child_weight': [4,5,6]
}
gsearch2 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5,
        min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2.fit(train[predictors], train[target])
gsearch2.grid_scores_, gsearch2.best_params_, gsearch2.best_score_

Here, we get the optimum values as 4 for max_depth and 6 for min_child_weight. Also, we can see the CV score increasing slightly. Note that as the model performance increases, it becomes exponentially difficult to achieve even marginal gains in performance. You would have noticed that here we got 6 as the optimum value for min_child_weight, but we haven't tried values above 6. We can do that as follows:

param_test2b = {
    'min_child_weight': [6,8,10,12]
}
gsearch2b = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=4,
        min_child_weight=2, gamma=0, subsample=0.8, colsample_bytree=0.8,
        objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test2b, scoring='roc_auc', n_jobs=4, iid=False, cv=5)
gsearch2b.fit(train[predictors], train[target])
modelfit(gsearch2b.best_estimator_, train, predictors)
gsearch2b.grid_scores_, gsearch2b.best_params_, gsearch2b.best_score_

We see 6 as the optimal value.
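Before moving on, it can be handy to lock in what we have so far (a translator's sketch, not from the original article): refit with max_depth=4 and min_child_weight=6 and re-check the CV score with the modelfit function defined above, ahead of tuning gamma in the next part.

# Sketch: carry the Step 2 results forward before tuning gamma in the next step.
xgb_step2 = XGBClassifier(
    learning_rate=0.1, n_estimators=140,
    max_depth=4, min_child_weight=6,     # values found in Step 2
    gamma=0, subsample=0.8, colsample_bytree=0.8,
    objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27)
modelfit(xgb_step2, train, predictors)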
