How to Develop a Credit Risk Model and Scorecard

We are all aware of, and keep track of, our credit scores, don’t we? That all-important number that has been around since the 1950s and determines our creditworthiness. I suppose we all also have a basic intuition of how a credit score is calculated, or which factors affect it. Refer to my previous article for some further details on what a credit score is.

In this article, we will go through detailed steps to develop a data-driven credit risk model in Python to predict the probabilities of default (PD) and assign credit scores to existing or potential borrowers. We will determine credit scores using a highly interpretable, easy to understand and implement scorecard that makes calculating the credit score a breeze.

I will assume a working Python knowledge and a basic understanding of certain statistical and credit risk concepts while working through this case study.

We have a lot to cover, so let’s get started.

Preliminary Data Exploration & Splitting

We will use a dataset made available on Kaggle that relates to consumer loans issued by the Lending Club, a US P2P lender. The raw data includes information on over 450,000 consumer loans issued between 2007 and 2014 with almost 75 features, including the current loan status and various attributes related to both borrowers and their payment behavior. Refer to the data dictionary for further details on each column.

The concepts and overall methodology, as explained here, are also applicable to a corporate loan portfolio.

Initial data exploration reveals the following:

  • 18 features with more than 80% of missing values. Given the high proportion of missing values, any technique to impute them will most likely result in inaccurate results
  • Certain static features not related to credit risk, e.g., id, member_id, url, title

  • Other forward-looking features that are expected to be populated only once the borrower has defaulted, e.g., recoveries, collection_recovery_fee. Since our objective here is to predict the future probability of default, having such features in our model will be counterintuitive, as these will not be observed until the default event has occurred

We will drop all the above features.

Identify Target Variable

Based on the data exploration, our target variable appears to be loan_status. A quick look at its unique values and their proportion thereof confirms the same.

Image 1: Distribution of defaults

Based on domain knowledge, we will classify loans with the following loan_status values as being in default (bad, or 1):

  • Charged Off
  • Default
  • Late (31–120 days)
  • Does not meet the credit policy. Status:Charged Off

All the other values will be classified as good (or 0).

Data Split

Let us now split our data into the following sets: training (80%) and test (20%). We will perform Repeated Stratified k-Fold cross-validation on the training set to preliminarily evaluate our model, while the test set will remain untouched until final model evaluation. This approach follows best model evaluation practice.

Image 1 above shows us that our data, as expected, is heavily skewed towards good loans. Accordingly, in addition to random shuffled sampling, we will also stratify the train/test split so that the distribution of good and bad loans in the test set is the same as that in the pre-split data. This is achieved through the train_test_split function’s stratify parameter.

Splitting our data before any data cleaning or missing value imputation prevents any data leakage from the test set to the training set and results in more accurate model evaluation. Refer to my previous article for further details.

A code snippet for the work performed so far follows:
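
A minimal sketch of these steps follows. The CSV file name and the exact lists of dropped columns are illustrative assumptions; the complete lists are in the accompanying notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

loan_data = pd.read_csv('loan_data_2007_2014.csv')

# Drop features with more than 80% missing values, static identifiers and
# forward-looking columns populated only after a default has occurred
missing_share = loan_data.isnull().mean()
cols_to_drop = list(missing_share[missing_share > 0.8].index)
cols_to_drop += ['id', 'member_id', 'url', 'title',
                 'recoveries', 'collection_recovery_fee']
loan_data = loan_data.drop(columns=[c for c in cols_to_drop if c in loan_data.columns])

# Target: 1 = default (bad), 0 = good; the status strings must match the raw data exactly
bad_statuses = ['Charged Off', 'Default', 'Late (31-120 days)',
                'Does not meet the credit policy. Status:Charged Off']
loan_data['default'] = loan_data['loan_status'].isin(bad_statuses).astype(int)

# Stratified 80/20 split before any cleaning or imputation
X = loan_data.drop(columns=['default', 'loan_status'])
y = loan_data['default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```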

Data Cleaning

Next comes some necessary data cleaning tasks as follows:

  • Remove text from the emp_length column (e.g., years) and convert it to numeric

  • For all columns with dates: convert them to Python’s datetime format, create a new column as a difference between model development date and the respective date feature and then drop the original feature

  • Remove text from the term column and convert it to numeric

We will define helper functions for each of the above tasks and apply them to the training dataset. Having these helper functions will assist us with performing these same tasks again on the test dataset without repeating our code.
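
As a rough illustration, the helpers could look like this; the reference date used for the date differences and the exact list of date columns are assumptions:

```python
import pandas as pd

def emp_length_to_numeric(df, column='emp_length'):
    """Strip text such as '10+ years' or '< 1 year' and convert to a number."""
    df[column] = (df[column]
                  .str.replace(r'\+ years', '', regex=True)
                  .str.replace('< 1 year', '0', regex=False)
                  .str.replace(r' years?', '', regex=True))
    df[column] = pd.to_numeric(df[column])
    return df

def date_to_months_since(df, column, reference_date='2017-12-01'):
    """Convert a date column to 'months before the reference date' and drop the original."""
    dates = pd.to_datetime(df[column], format='%b-%y', errors='coerce')
    df['mths_since_' + column] = (pd.to_datetime(reference_date) - dates).dt.days / 30.44
    return df.drop(columns=[column])

def term_to_numeric(df, column='term'):
    """Strip ' months' from the term column and convert to a number."""
    df[column] = pd.to_numeric(df[column].str.replace(' months', '', regex=False))
    return df

X_train = emp_length_to_numeric(X_train)
X_train = term_to_numeric(X_train)
for date_col in ['issue_d', 'earliest_cr_line', 'last_pymnt_d', 'last_credit_pull_d']:
    X_train = date_to_months_since(X_train, date_col)
```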

Feature Selection

Next up, we will perform feature selection to identify the most suitable features for our binary classification problem using the Chi-squared test for categorical features and ANOVA F-statistic for numerical features. Refer to my previous article for further details on these feature selection techniques and why different techniques are applied to categorical and numerical variables.

The p-values, in ascending order, from our Chi-squared test on the categorical features are as below:

Image 2: p-values from Chi-squared test

For the sake of simplicity, we will only retain the top four features and drop the rest.

The ANOVA F-statistic for 34 numeric features shows a wide range of F values, from 23,513 to 0.39. We will keep the top 20 features and potentially come back to select more in case our model evaluation results are not reasonable enough.

Next, we will calculate the pair-wise correlations of the selected top 20 numerical features to detect any potentially multicollinear variables. A heat-map of these pair-wise correlations identifies two features (out_prncp_inv and total_pymnt_inv) as highly correlated. Therefore, we will also drop them from our model.

Image 3: Pair-wise correlations

Next, we will simply save all the features to be dropped in a list and define a function to drop them. The code for these feature selection techniques follows:
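
A sketch of these selection steps, using scipy's chi-squared test of independence and scikit-learn's f_classif (the two correlated columns dropped at the end are the ones identified above):

```python
import pandas as pd
from scipy.stats import chi2_contingency
from sklearn.feature_selection import f_classif

# Chi-squared test of independence for each categorical feature against the target
categorical_cols = X_train.select_dtypes(include='object').columns
chi2_p_values = {}
for col in categorical_cols:
    contingency = pd.crosstab(X_train[col], y_train)
    _, p_value, _, _ = chi2_contingency(contingency)
    chi2_p_values[col] = p_value
chi2_results = pd.Series(chi2_p_values).sort_values()  # ascending p-values

# ANOVA F-statistic for numerical features
numerical_cols = X_train.select_dtypes(include='number').columns
f_scores, _ = f_classif(X_train[numerical_cols].fillna(0), y_train)
anova_results = pd.Series(f_scores, index=numerical_cols).sort_values(ascending=False)

# Keep the top 4 categorical and top 20 numerical features
top_categorical = list(chi2_results.index[:4])
top_numerical = list(anova_results.index[:20])

# Pair-wise correlations of the retained numerical features (plotted as a heat-map)
corr_matrix = X_train[top_numerical].corr()

# Collect everything to be dropped, including the two highly correlated features
features_to_drop = [c for c in X_train.columns
                    if c not in set(top_categorical) | set(top_numerical)]
features_to_drop += ['out_prncp_inv', 'total_pymnt_inv']

def drop_unused_features(df, columns_to_drop=features_to_drop):
    return df.drop(columns=[c for c in columns_to_drop if c in df.columns])

X_train = drop_unused_features(X_train)
```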

One-Hot Encoding and Update Test Dataset

Next, we will create dummy variables of the four final categorical variables and update the test dataset through all the functions applied so far to the training dataset.

Note a couple of points regarding the way we create dummy variables:

  • We will use a particular naming convention for all variables: original variable name, colon, category name
  • Generally speaking, in order to avoid multicollinearity, one of the dummy variables is dropped through the drop_first parameter of pd.get_dummies. However, we will not do so at this stage as we require all the dummy variables to calculate the Weight of Evidence (WoE) and Information Values (IV) of our categories — more on this later. We will drop one dummy variable for each category later on
  • We will also not create the dummy variables directly in our training data, as doing so would drop the categorical variable, which we require for WoE calculations. Therefore, we will create a new dataframe of dummy variables and then concatenate it to the original training/test dataframe.

Next up, we will update the test dataset by passing it through all the functions defined so far. Pay special attention to reindexing the updated test dataset after creating dummy variables. Let me explain this by a practical example.

Consider a categorical feature called grade with the following unique values in the pre-split data: A, B, C, and D. Suppose that the proportion of D is very low, and due to the random nature of train/test split, none of the observations with D in the grade category is selected in the test set. Therefore, grade’s dummy variables in the training data will be grade:A, grade:B, grade:C, and grade:D, but grade:D will not be created as a dummy variable in the test set. We will be unable to apply a fitted model on the test set to make predictions, given the absence of a feature expected to be present by the model. Therefore, we reindex the test set to ensure that it has the same columns as the training data, with any missing columns being added with 0 values. A 0 value is pretty intuitive since that category will never be observed in any of the test samples.
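
A short sketch of this step; the ':' separator implements the naming convention described above, and the four categorical columns listed here are an assumption based on the shortlisted features discussed in the next section:

```python
import pandas as pd

def create_dummies(df, categorical_cols):
    """Create dummy variables named 'original_variable:category' and keep the original columns."""
    dummies = pd.get_dummies(df[categorical_cols], prefix=categorical_cols, prefix_sep=':')
    return pd.concat([df, dummies], axis=1)

categorical_cols = ['grade', 'home_ownership', 'verification_status', 'purpose']
X_train = create_dummies(X_train, categorical_cols)

# Pass the test set through the same cleaning/dummy functions, then align its
# columns with the training set; categories unseen in the test set get all-zero dummies
X_test = create_dummies(X_test, categorical_cols)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
```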

WoE Binning and Feature Engineering

Creating new categorical features for all numerical and categorical variables based on WoE is one of the most critical steps before developing a credit risk model, and also quite time-consuming. There are specific custom Python packages and functions available on GitHub and elsewhere to perform this exercise. However, I prefer to do it manually as it allows me a bit more flexibility and control over the process.

But What is WoE and IV?

Weight of Evidence (WoE) and Information Value (IV) are used for feature engineering and selection and are extensively used in the credit scoring domain.

WoE is a measure of the predictive power of an independent variable in relation to the target variable. It measures the extent a specific feature can differentiate between target classes, in our case: good and bad customers.

IV assists with ranking our features based on their relative importance.

According to Baesens et al.¹ and Siddiqi², WOE and IV analyses enable one to:

  • Consider each variable’s independent contribution to the outcome
  • Detect linear and non-linear relationships
  • Rank variables in terms of their univariate predictive strength
  • Visualize the correlations between the variables and the binary outcome
  • Seamlessly compare the strength of continuous and categorical variables without creating dummy variables
  • Seamlessly handle missing values without imputation. (Note that we have not imputed any missing values so far; this is the reason why. Missing values will be assigned a separate category during the WoE feature engineering step)
  • Assess the predictive power of missing values

Weight of Evidence (WoE)

The formula to calculate WoE is as follows:
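
In standard notation, writing $n_{good,i}$ and $n_{bad,i}$ for the number of good and bad borrowers in bin $i$, and $N_{good}$ and $N_{bad}$ for their totals across all bins:

$$\mathrm{WoE}_i = \ln\left(\frac{n_{good,i} / N_{good}}{n_{bad,i} / N_{bad}}\right)$$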

A positive WoE means that the proportion of good customers is more than that of bad customers and vice versa for a negative WoE value.

Steps for WoE feature engineering

  1. Calculate WoE for each unique value (bin) of a categorical variable, e.g., for each of grade:A, grade:B, grade:C, etc.
  2. Bin a continuous variable into discrete bins based on its distribution and number of unique observations, maybe using pd.cut (called fine classing)
  3. Calculate WoE for each derived bin of the continuous variable
  4. Once WoE has been calculated for each bin of both categorical and numerical features, combine bins as per the following rules (called coarse classing)

Rules related to combining WoE bins

  1. Each bin should have at least 5% of the observations
  2. Each bin should be non-zero for both good and bad loans
  3. The WoE should be distinct for each category. Similar groups should be aggregated or binned together. This is because bins with similar WoE have almost the same proportion of good or bad loans, implying the same predictive power
  4. The WoE should be monotonic, i.e., either growing or decreasing with the bins
  5. Missing values are binned separately

The above rules are generally accepted and well documented in academic literature³.

Why discretize numerical features

Discretization, or binning, of numerical features is generally not recommended for machine learning algorithms as it often results in loss of information. However, our end objective here is to eventually create a scorecard based on the credit scoring model. A scorecard is utilized by classifying a new, untrained observation (e.g., one from the test dataset) as per the scorecard criteria.

Consider the case where we don’t bin continuous variables: we will then have only one category for income with a corresponding coefficient/weight, and all future potential borrowers would be given the same score in this category, irrespective of their income. If, however, we discretize the income category into discrete classes (each with a different WoE), resulting in multiple categories, then potential new borrowers would be classified into one of the income categories according to their income and would be scored accordingly.

WoE binning of continuous variables is an established industry practice that has been in place since FICO first developed a commercial scorecard in the 1960s, and there is substantial literature out there to support it. Some of the other rationales to discretize continuous features from the literature are:

  • A scorecard is usually legally required to be easily interpretable by a layperson (a requirement imposed by the Basel Accord, almost all central banks, and various lending entities) given the high monetary and non-monetary misclassification costs. This is easily achieved by a scorecard that does not have any continuous variables, with all of them being discretized. Reasons for low or high scores can be easily understood and explained to third parties. All of this makes it easier for scorecards to get ‘buy-in’ from end-users compared to more complex models
  • Another legal requirement for scorecards is that they should be able to separate low and high-risk observations⁴. WoE binning takes care of that as WoE is based on this very concept
  • Monotonicity. The binning algorithm is expected to divide an input dataset into bins in such a way that if you walk from one bin to another in the same direction, there is a monotonic change of the credit risk indicator, i.e., no sudden jumps in the credit score if your income changes. This arises from the underlying assumption that a predictor variable can separate higher risks from lower risks in case of a global non-monotonous relationship⁵
  • An underlying assumption of the logistic regression model is that all features have a linear relationship with the log-odds (logit) of the target variable. Is there a difference between someone with an income of $38,000 and someone with $39,000? Most likely not, but treating income as a continuous variable makes this assumption. By categorizing based on WoE, we can let our model decide if there is a statistical difference; if there isn’t, they can be combined in the same category
  • Missing and outlier values can be categorized separately or binned together with the largest or smallest bin — therefore, no assumptions need to be made to impute missing values or handle outliers

Information Value (IV)

IV is calculated as follows:
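
In the same notation as the WoE formula above, each bin contributes the difference between its share of goods and its share of bads, weighted by its WoE:

$$\mathrm{IV} = \sum_i \left(\frac{n_{good,i}}{N_{good}} - \frac{n_{bad,i}}{N_{bad}}\right) \mathrm{WoE}_i$$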

According to Siddiqi², by convention, the values of IV in credit scoring are interpreted as follows:
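
The commonly quoted rule of thumb is roughly as follows (treat the exact cut-offs as indicative):

  • IV < 0.02: not predictive
  • 0.02 to 0.1: weak predictive power
  • 0.1 to 0.3: medium predictive power
  • 0.3 to 0.5: strong predictive power
  • IV > 0.5: suspiciously high, likely too good to be true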

Note that IV is only useful as a feature selection and importance technique when using a binary logistic regression model.

WoE Feature Engineering and IV Calculation for our Data

Enough with the theory, let’s now calculate WoE and IV for our training data and perform the required feature engineering. We will define three functions as follows, each one to:

  • calculate and display WoE and IV values for categorical variables
  • calculate and display WoE and IV values for numerical variables
  • plot the WoE values against the bins to help us in visualizing WoE and combining similar WoE bins

Sample output of these two functions when applied to a categorical feature, grade, is shown below:

WoE and IV values
WoE plot

Once we have calculated and visualized the WoE and IV values, next comes the most tedious task: selecting which bins to combine and whether to drop any feature given its IV. The shortlisted features that we are left with up to this point will be treated in one of the following ways:

  • There is no need to combine WoE bins or create a separate missing category given the discrete and monotonic WoE and absence of any missing values: grade, verification_status, term

  • Combine WoE bins with very low observations with the neighboring bin: home_ownership, purpose

  • Combine WoE bins with similar WoE values together, potentially with a separate missing category: int_rate, annual_inc, dti, inq_last_6mths, revol_util, out_prncp, total_pymnt, total_rec_int, total_rev_hi_lim, mths_since_earliest_cr_line, mths_since_issue_d, mths_since_last_credit_pull_d

  • Ignore features with a low or very high IV value: emp_length, total_acc, last_pymnt_amnt, tot_cur_bal, mths_since_last_pymnt_d_factor

Note that for certain numerical features with outliers, we will calculate and plot WoE after excluding the outliers, which will be assigned to a separate category of their own.

Once we have explored our features and identified the categories to be created, we will define a custom ‘transformer’ class using scikit-learn’s BaseEstimator and TransformerMixin classes. Like other scikit-learn ML models, this class can be fitted on a dataset to transform it as per our requirements. Another significant advantage of this class is that it can be used as part of a scikit-learn Pipeline to evaluate our training data using Repeated Stratified k-Fold Cross-Validation. Using a Pipeline in this structured way will allow us to perform cross-validation without any potential data leakage between the training and test folds.

Remember that we have been using all the dummy variables so far, so we will also drop one dummy variable for each category using our custom class to avoid multicollinearity.

The code for our three functions and the transformer class related to WoE and IV follows:
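
A simplified sketch of the three functions and the transformer class follows. The bad-equals-1 target convention and the category_map mechanism used for coarse classing are assumptions of this sketch rather than the exact notebook implementation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin

def woe_discrete(df, feature, target):
    """Calculate WoE and IV for each category of a discrete (categorical) feature."""
    grouped = (pd.concat([df[feature], target], axis=1)
               .groupby(feature, as_index=False)
               .agg(n_obs=(target.name, 'count'), n_bad=(target.name, 'sum')))
    grouped['n_good'] = grouped['n_obs'] - grouped['n_bad']
    grouped['prop_good'] = grouped['n_good'] / grouped['n_good'].sum()
    grouped['prop_bad'] = grouped['n_bad'] / grouped['n_bad'].sum()
    grouped['WoE'] = np.log(grouped['prop_good'] / grouped['prop_bad'])  # inf if a bin has no bads
    grouped['IV'] = ((grouped['prop_good'] - grouped['prop_bad']) * grouped['WoE']).sum()
    return grouped

def woe_numerical(df, feature, target, bins=50):
    """Fine-class a continuous feature with pd.cut, then reuse the discrete WoE logic."""
    binned = df.copy()
    binned[feature] = pd.cut(binned[feature], bins)
    return woe_discrete(binned, feature, target)

def plot_woe(woe_df, feature, rotation=0):
    """Plot WoE per bin to help spot bins that can be combined (coarse classing)."""
    plt.figure(figsize=(12, 4))
    plt.plot(woe_df[feature].astype(str), woe_df['WoE'], marker='o', linestyle='--')
    plt.xticks(rotation=rotation)
    plt.xlabel(feature)
    plt.ylabel('Weight of Evidence')
    plt.show()

class WoETransformer(BaseEstimator, TransformerMixin):
    """Map the raw dummy variables onto the coarse-classed WoE categories.

    category_map is a dict of {new_dummy_name: [original dummy columns to combine]};
    reference (dropped) categories are simply left out of the map.
    """
    def __init__(self, category_map):
        self.category_map = category_map

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_new = pd.DataFrame(index=X.index)
        for new_col, original_cols in self.category_map.items():
            X_new[new_col] = X[original_cols].sum(axis=1)
        return X_new
```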

Model Training

Finally, we come to the stage where some actual machine learning is involved. We will fit a logistic regression model on our training set and evaluate it using RepeatedStratifiedKFold. Note that we have defined the class_weight parameter of the LogisticRegression class to be balanced. This will force the logistic regression model to learn the model coefficients using cost-sensitive learning, i.e., penalize false negatives more than false positives during model training. Cost-sensitive learning is useful for imbalanced datasets, which is usually the case in credit scoring. Refer to my previous article for further details on imbalanced classification problems.

Our evaluation metric will be Area Under the Receiver Operating Characteristic Curve (AUROC), a widely used and accepted metric for credit scoring. RepeatedStratifiedKFold will split the data while preserving the class imbalance and perform k-fold validation multiple times.
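
A sketch of the cross-validation and final fit; WoETransformer and category_map are the objects sketched in the previous section, and the numbers of splits and repeats are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[
    ('woe', WoETransformer(category_map)),
    ('model', LogisticRegression(class_weight='balanced', max_iter=1000)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
auroc_scores = cross_val_score(pipeline, X_train, y_train, scoring='roc_auc', cv=cv)
print(f'Mean AUROC across folds: {auroc_scores.mean():.3f} (std {auroc_scores.std():.3f})')

# Fit on the full training set and build the coefficient summary table
pipeline.fit(X_train, y_train)
feature_names = list(pipeline.named_steps['woe'].transform(X_train.head(1)).columns)
summary_table = pd.DataFrame({
    'Feature name': ['Intercept'] + feature_names,
    'Coefficients': [pipeline.named_steps['model'].intercept_[0],
                     *pipeline.named_steps['model'].coef_[0]],
})
```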

After performing k-folds validation on our training set and being satisfied with AUROC, we will fit the pipeline on the entire training set and create a summary table with feature names and the coefficients returned from the model.

Prediction Time

It all comes down to this: apply our trained logistic regression model to predict the probability of default on the test set, which has not been used so far (other than for the generic data cleaning and feature selection tasks). We will save the predicted probabilities of default in a separate dataframe together with the actual classes.

Next, we will draw a ROC curve, PR curve, and calculate AUROC and Gini. Our AUROC on test set comes out to 0.866 with a Gini of 0.732, both being considered as quite acceptable evaluation scores. Our ROC and PR curves will be something like this:

Code for predictions and model evaluation on the test set is:
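
A sketch of this evaluation step, reusing the pipeline fitted above:

```python
import pandas as pd
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Predicted probabilities of default on the untouched test set
y_hat_test_proba = pipeline.predict_proba(X_test)[:, 1]
predictions = pd.DataFrame({'actual': y_test.values,
                            'prob_default': y_hat_test_proba},
                           index=y_test.index)

# AUROC and Gini
auroc = roc_auc_score(predictions['actual'], predictions['prob_default'])
gini = 2 * auroc - 1
print(f'AUROC: {auroc:.3f}, Gini: {gini:.3f}')

# Points for the ROC and PR curves (the thresholds are reused later for the cut-off analysis)
fpr, tpr, thresholds = roc_curve(predictions['actual'], predictions['prob_default'])
precision, recall, _ = precision_recall_curve(predictions['actual'], predictions['prob_default'])
```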

Scorecard Development

The final piece of our puzzle is creating a simple, easy-to-use and easy-to-implement credit risk scorecard that can be used by any layperson to calculate an individual’s credit score given certain required information about the borrower and their credit history.

Remember the summary table created during the model training phase? We will append all the reference categories that we left out from our model to it, with a coefficient value of 0, together with another column for the original feature name (e.g., grade to represent grade:A, grade:B, etc.).

We will then determine the minimum and maximum scores that our scorecard should spit out. As a starting point, we will use the same range of scores used by FICO: from 300 to 850.

The coefficients returned by the logistic regression model for each feature category are then scaled to our range of credit scores through simple arithmetic. An additional step here is to update the model intercept’s credit score through further scaling that will then be used as the starting point of each scoring calculation.

At this stage, our scorecard will look like this (the Score-Preliminary column is a simple rounding of the calculated scores):

Depending on your circumstances, you may have to manually adjust the Score for a random category to ensure that the minimum and maximum possible scores for any given situation remains 300 and 850. Some trial and error will be involved here.

Calculate Credit Scores for Test Set

Once we have our final scorecard, we are ready to calculate credit scores for all the observations in our test set. Remember, our training and test sets are a simple collection of dummy variables with 1s and 0s representing whether an observation belongs to a specific dummy variable. For example, in the image below, observation 395346 had a C grade, owns its own home, and its verification status was Source Verified.

Accordingly, after making certain adjustments to our test set, the credit scores are calculated as a simple matrix dot multiplication between the test set and the final score for each category. Consider the above observations together with the following final scores for the intercept and grade categories from our scorecard:

Intuitively, observation 395346 will start with the intercept score of 598 and receive 15 additional points for being in the grade:C category. Similarly, observation 3766583 will be assigned a score of 598 plus 24 for being in the grade:A category. We will automate these calculations across all feature categories using matrix dot multiplication. The final credit score is then a simple sum of individual scores of each feature category applicable for an observation.

Setting Loan Approval Cut-offs

So how do we determine which loans to approve and which to reject? What is the ideal credit score cut-off point, i.e., the score above which potential borrowers will be accepted and below which they will be rejected? This cut-off point should also strike a fine balance between the expected loan approval and rejection rates.

To find this cut-off, we need to go back to the probability thresholds from the ROC curve. Remember that a ROC curve plots FPR and TPR for all probability thresholds between 0 and 1. Since we aim to minimize FPR while maximizing TPR, the top left corner probability threshold of the curve is what we are looking for. This ideal threshold is calculated using the Youden’s J statistic that is a simple difference between TPR and FPR.

The ideal probability threshold in our case comes out to be 0.187. All observations with a predicted probability higher than this should be classified as in Default and vice versa. At first, this ideal threshold appears to be counterintuitive compared to a more intuitive probability threshold of 0.5. But remember that we used the class_weight parameter when fitting the logistic regression model that would have penalized false negatives more than false positives.

We then calculate the scaled score at this threshold point. As shown in the code example below, we can also calculate the credit scores and expected approval and rejection rates at each threshold from the ROC curve. This can help the business to further manually tweak the score cut-off based on their requirements.

All the code related to scorecard development is below:
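
A condensed sketch of the scorecard arithmetic and cut-off selection described above. The scaling formula is the common min/max-coefficient approach; summary_table, predictions, fpr, tpr and thresholds come from the earlier sketches, and X_test_dummies is assumed to hold the test set's full dummy-variable matrix, including the reference categories appended to the scorecard with zero coefficients:

```python
import numpy as np
import pandas as pd

min_score, max_score = 300, 850

# summary_table is assumed to already include the reference-category rows with coefficient 0
scorecard = summary_table.copy()
scorecard['Original feature name'] = scorecard['Feature name'].str.split(':').str[0]

min_sum_coef = scorecard.groupby('Original feature name')['Coefficients'].min().sum()
max_sum_coef = scorecard.groupby('Original feature name')['Coefficients'].max().sum()

# Scale each coefficient to score points; the intercept absorbs the base score
scorecard['Score'] = (scorecard['Coefficients']
                      * (max_score - min_score) / (max_sum_coef - min_sum_coef))
is_intercept = scorecard['Feature name'] == 'Intercept'
scorecard.loc[is_intercept, 'Score'] = (
    (scorecard.loc[is_intercept, 'Coefficients'] - min_sum_coef)
    / (max_sum_coef - min_sum_coef) * (max_score - min_score) + min_score)
scorecard['Score'] = scorecard['Score'].round()

# Credit scores for the test set: dot product of the dummy matrix (plus an
# intercept column of ones) with the score column
X_test_scored = X_test_dummies.copy()
X_test_scored.insert(0, 'Intercept', 1)
X_test_scored = X_test_scored[scorecard['Feature name']]
test_scores = X_test_scored.values @ scorecard['Score'].values

# Loan approval cut-off via Youden's J statistic (TPR - FPR) on the ROC curve
J = tpr - fpr
best_threshold = thresholds[np.argmax(J)]
print(f'Ideal probability threshold: {best_threshold:.3f}')

# Expected approval and rejection rates at each probability threshold
df_cutoffs = pd.DataFrame({'threshold': thresholds, 'fpr': fpr, 'tpr': tpr})
df_cutoffs['approval_rate'] = [
    (predictions['prob_default'] < t).mean() for t in df_cutoffs['threshold']]
df_cutoffs['rejection_rate'] = 1 - df_cutoffs['approval_rate']
```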

Conclusion

Well, there you have it — a complete working PD model and credit scorecard! The complete notebook is available here on GitHub. Feel free to play around with it or comment in case of any clarifications required or other queries.

As always, feel free to reach out to me if you would like to discuss anything related to data analytics, machine learning, financial analysis, or financial analytics.

Till next time, rock on!

Original article: https://towardsdatascience.com/how-to-develop-a-credit-risk-model-and-scorecard-91335fc01f03
