Credit Risk Management: Classification Models & Hyperparameter Tuning

The final part aims to walk you through the process of applying different classification algorithms on our transformed dataset as well as producing the best-performing model using Hyperparameter Tuning.

As a reminder, this end-to-end project aims to solve a classification problem in Data Science, particularly in the finance industry, and is divided into three parts:

  1. Exploratory Data Analysis (EDA) & Feature Engineering
  2. Feature Scaling and Selection (Bonus: Imbalanced Data Handling)
  3. Machine Learning Modelling (Classification)

If you have missed the previous two parts, feel free to check them out here and here before going through this final part, which leverages their output to produce the best classification model.

A. Classification Models

Which algorithms should be used to build a model that solves a classification problem?

When it comes to classification, unlike regression, we have quite a handful of different algorithms to choose from. To name a few, Logistic Regression, K-Neighbors, SVC, Decision Tree and Random Forest are among the most common and widely used algorithms for such problems.

Here’s a quick recap of what each algorithm does and how it distinguishes itself from the others:

  • Logistic Regression: this algorithm uses regression to predict the continuous probability of a data sample (from 0 to 1), then classifies that sample to the more probable target (either 0 or 1). However, it assumes a linear relationship between the inputs and the target, which might not be a good choice if the dataset does not follow a Gaussian distribution (see the short sketch after this list).

  • K-Neighbors: this algorithm assumes that data points in close proximity to each other belong to the same class. Specifically, it classifies the target (either 0 or 1) of a data sample by a plurality vote of its nearest neighbors.

  • SVC: this algorithm classifies by defining a decision boundary, then assigning the data sample to the target (either 0 or 1) according to which side of the boundary it falls on. Essentially, the algorithm aims to maximize the distance between the decision boundary and the points in each class to decrease the chance of false classification.

  • Decision Tree: as the name suggests, this algorithm splits the root of the tree (the entire dataset) into decision nodes, and each decision node is split further until no node is splittable. The algorithm then classifies a data sample by sorting it down the tree from the root to a leaf/terminal node and seeing which target that leaf represents.

  • Random Forest: this algorithm is an ensemble technique built on the Decision Tree, in which many decision trees work together. Specifically, the random forest passes the data sample to each of the decision trees and returns the most popular classification as the target for that sample. This helps avoid the overfitting that a single Decision Tree is prone to, as it aggregates the classifications from multiple trees instead of just one.

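As referenced above, here is a minimal sketch (on a toy dataset, not the project data) of the probability-then-classify step that Logistic Regression performs: predict_proba returns the continuous probability, and predict simply thresholds it.

# Minimal sketch on toy data: Logistic Regression estimates a probability
# between 0 and 1, then assigns the more probable class.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=42)
clf = LogisticRegression().fit(X, y)

print(clf.predict_proba(X[:3]))  # columns: P(class 0), P(class 1)
print(clf.predict(X[:3]))        # 1 where P(class 1) >= 0.5, else 0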

Let’s see how they work with our dataset compared to one another:

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "KNeighbors": KNeighborsClassifier(),
    "SVC": SVC(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier()
}

After importing the algorithms from sklearn, I created a dictionary that gathers all of them in one place, so that it’s easier to apply them to the data in a single loop rather than fitting each one manually.

#Compute the training and test score of each model
train_scores = []
test_scores = []
for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = round(classifier.score(x_a_train_rs_over_pca, y_a_train_over), 2)
    train_scores.append(train_score)
    test_score = round(classifier.score(x_a_test_rs_over_pca, y_a_test_over), 2)
    test_scores.append(test_score)

print(train_scores)
print(test_scores)

After applying the algorithms to both the train and test sets, it seems that Logistic Regression doesn’t work well for this dataset, as the scores are relatively low (around 50%, meaning the model is barely better than random guessing at classifying the target). This is quite understandable and further suggests that our original dataset is not normally distributed.

In contrast, Decision Tree and Random Forest produced significantly higher accuracy scores on the train set (85%). Yet the opposite holds for the test set, where the scores are remarkably low (just over 50%). Possible reasons for such a large gap are (1) overfitting the train set or (2) leaking the target into the test set. However, after cross-checking, neither seems to be the case.

Hence, I decided to look into another scoring metric, the Cross Validation Score, to see if there’s any difference. Basically, this technique splits the training set into n folds (default = 5), fits the model on n-1 folds and scores it on the remaining fold. The process is repeated across all n folds, and the average score is calculated. The cross validation score gives a more objective picture of how the models perform than the standard accuracy score.

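To make the fold-splitting concrete, below is a rough sketch of what cross_val_score does internally, written on toy data rather than the project variables (scikit-learn actually uses stratified folds for classifiers, so this is only an approximation):

# Rough sketch of cross validation on toy data: fit on n-1 folds,
# score on the held-out fold, then average across all n folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=42)
model = LogisticRegression()

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model.fit(X[train_idx], y[train_idx])               # fit on n-1 folds
    scores.append(model.score(X[val_idx], y[val_idx]))  # score on the remaining fold
print(np.mean(scores))                                   # the cross validation score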

from sklearn.model_selection import cross_val_score

train_cross_scores = []
test_cross_scores = []
for key, classifier in classifiers.items():
    classifier.fit(x_a_train_rs_over_pca, y_a_train_over)
    train_score = cross_val_score(classifier, x_a_train_rs_over_pca, y_a_train_over, cv=5)
    train_cross_scores.append(round(train_score.mean(), 2))
    test_score = cross_val_score(classifier, x_a_test_rs_over_pca, y_a_test_over, cv=5)
    test_cross_scores.append(round(test_score.mean(), 2))

print(train_cross_scores)
print(test_cross_scores)

As seen, the gap between the train and test scores was significantly bridged!

Since the Random Forest model produced the highest cross validation score, we will evaluate it against another metric, the ROC AUC Score, and see how it performs on the ROC Curve.

Essentially, the ROC Curve plots the false positive rate (x-axis) against the true positive rate (y-axis) as the classification threshold varies between 0 and 1, while AUC represents the degree of separability (simply put, the model’s ability to distinguish between the target classes).

Image credit: https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5

Below is a quick summary table of how to calculate FPR (which equals 1 - specificity) and TPR (also known as sensitivity):

Image credit: https://towardsdatascience.com/hackcvilleds-4636c6c1ba53
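For reference, both rates come straight from the confusion matrix: TPR = TP / (TP + FN) (sensitivity) and FPR = FP / (FP + TN), i.e. 1 - specificity. A tiny sketch with made-up labels and predictions:

# Tiny sketch with made-up labels/predictions showing how TPR and FPR are derived
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))  # TPR (sensitivity): 0.75
print(fp / (fp + tn))  # FPR (1 - specificity): 0.25
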
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

rf = RandomForestClassifier()
rf.fit(x_a_train_rs_over_pca, y_a_train_over)
rf_pred = cross_val_predict(rf, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(roc_auc_score(y_a_test_over, rf_pred))

#Plot the ROC Curve
fpr, tpr, _ = roc_curve(y_a_test_over, rf_pred)
plt.plot(fpr, tpr)
plt.show()

ROC Curve with ROC AUC Score = 76%

Having shown that cross validation works well on this dataset, I then applied another cross validation technique, cross_val_predict, which follows a similar methodology: it splits the data into n folds and returns, for each sample, the prediction made by the model that was not trained on that sample’s fold.

B. Hyperparameter Tuning

What is hyperparameter tuning and how does it help to improve the accuracy of the model?

After computing the models with the default settings of each algorithm, I was hoping to see whether further improvement could be made, which comes down to Hyperparameter Tuning. Essentially, this technique searches, for each algorithm, for the set of hyperparameter values that (might) produce the highest accuracy score on the given dataset.

The reason I put (might) in the definition is that in some cases little to no improvement is seen, depending on the dataset as well as the preparation done initially (plus it can take a very long time to run). However, Hyperparameter Tuning should still be taken into consideration in the hope of finding the best performing model.

#Use GridSearchCV to find the best parameters
from sklearn.model_selection import GridSearchCV

#Logistic Regression
lr = LogisticRegression()
# note: the 'l1' penalty is only supported by the 'liblinear' and 'saga' solvers
lr_params = {"penalty": ['l1', 'l2'],
             "C": [0.001, 0.01, 0.1, 1, 10, 100, 1000],
             "solver": ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
grid_logistic = GridSearchCV(lr, lr_params)
grid_logistic.fit(x_a_train_rs_over_pca, y_a_train_over)
lr_best = grid_logistic.best_estimator_

#KNearest Neighbors
knear = KNeighborsClassifier()
knear_params = {"n_neighbors": list(range(2, 7, 1)),
                "algorithm": ['auto', 'ball_tree', 'kd_tree', 'brute']}
grid_knear = GridSearchCV(knear, knear_params)
grid_knear.fit(x_a_train_rs_over_pca, y_a_train_over)
knear_best = grid_knear.best_estimator_

#SVC
svc = SVC()
svc_params = {"C": [0.5, 0.7, 0.9, 1],
              "kernel": ['rbf', 'poly', 'sigmoid', 'linear']}
grid_svc = GridSearchCV(svc, svc_params)
grid_svc.fit(x_a_train_rs_over_pca, y_a_train_over)
svc_best = grid_svc.best_estimator_

#Decision Tree
tree = DecisionTreeClassifier()
tree_params = {"criterion": ['gini', 'entropy'],
               "max_depth": list(range(2, 5, 1)),
               "min_samples_leaf": list(range(5, 7, 1))}
grid_tree = GridSearchCV(tree, tree_params)
grid_tree.fit(x_a_train_rs_over_pca, y_a_train_over)
tree_best = grid_tree.best_estimator_

GridSearchCV is the key to finding the optimal set of hyperparameters for each algorithm: it fits the model with every combination of the supplied hyperparameter values and returns the best-scoring combination.

One thing of note is that we have to know which hyperparameters each algorithm exposes in order to use this technique. For example, Logistic Regression has “penalty”, “C”, and “solver”, which do not apply to the other algorithms (scikit-learn can also list them for you, as shown below).

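If you don’t remember which hyperparameters an estimator exposes, the following one-liner prints them:

# List the tunable hyperparameters (and their current values) of an estimator
from sklearn.linear_model import LogisticRegression
print(LogisticRegression().get_params())  # includes 'penalty', 'C', 'solver', ...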

After finding the .best_estimator_ of each algorithm, fit and predict the data using each algorithm with its best parameter set. We then need to compare the new scores against the original ones to determine whether any improvement was made, or whether the hyperparameters need further fine-tuning; a minimal sketch of this comparison follows below.

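Here is a minimal sketch of that comparison, reusing the classifiers dictionary and the best estimators defined above; the exact numbers will depend on your data:

# Compare each tuned estimator against its default counterpart via cross-validation
# (reuses classifiers, x_a_train_rs_over_pca and y_a_train_over defined earlier)
from sklearn.model_selection import cross_val_score

tuned = {
    "LogisticRegression": lr_best,
    "KNeighbors": knear_best,
    "SVC": svc_best,
    "DecisionTree": tree_best
}
for name, model in tuned.items():
    default_score = cross_val_score(classifiers[name], x_a_train_rs_over_pca, y_a_train_over, cv=5).mean()
    tuned_score = cross_val_score(model, x_a_train_rs_over_pca, y_a_train_over, cv=5).mean()
    print(name, round(default_score, 2), round(tuned_score, 2))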

Bonus: XGBoost and LightGBM

What are XGBoost and LightGBM, and how much better do these algorithms perform compared to the traditional ones?

Apart from the common classification algorithms covered above, there are a couple of advanced algorithms rooted in the traditional ones. In this case, XGBoost and LightGBM can be considered successors of Decision Tree and Random Forest. Look at the timeline below for a better understanding of how these algorithms were developed:

Image credit: https://www.slideshare.net/GabrielCyprianoSaca/xgboost-lightgbm

I’m not going to go into the mathematical details of how these algorithms differ, but in general they prune the decision trees more effectively, handle missing values, and help avoid overfitting at the same time.

#XGBoost
import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
xgb_train_score = cross_val_score(xgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
xgb_test_score = cross_val_score(xgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(round(xgb_train_score.mean(), 2))
print(round(xgb_test_score.mean(), 2))

#LightGBM
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier()
lgb_model.fit(x_a_train_rs_over_pca, y_a_train_over)
lgb_train_score = cross_val_score(lgb_model, x_a_train_rs_over_pca, y_a_train_over, cv=5)
lgb_test_score = cross_val_score(lgb_model, x_a_test_rs_over_pca, y_a_test_over, cv=5)
print(round(lgb_train_score.mean(), 2))
print(round(lgb_test_score.mean(), 2))

After computing, the train and test scores of each model are 72% & 73% (XGBoost) and 69% & 72% (LightGBM), roughly the same as the Random Forest model computed above. We can still attempt further optimisation of these advanced models via Hyperparameter Tuning, but beware that it might take a very long time since XGBoost and LightGBM have longer runtimes due to the complexity of their algorithms (see the sketch below).

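If you do want to tune these advanced models, a randomized search keeps the runtime manageable because it only tries a fixed number of combinations. Below is an illustrative sketch for XGBoost; the parameter ranges are arbitrary examples, not recommended values:

# Illustrative sketch: randomized hyperparameter search for XGBoost
# (parameter ranges are arbitrary examples; n_iter caps how many combinations are tried)
from sklearn.model_selection import RandomizedSearchCV
import xgboost as xgb

xgb_params = {
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "subsample": [0.7, 0.9, 1.0]
}
xgb_search = RandomizedSearchCV(xgb.XGBClassifier(), xgb_params, n_iter=10, cv=5, random_state=42)
xgb_search.fit(x_a_train_rs_over_pca, y_a_train_over)
print(xgb_search.best_params_)
xgb_best = xgb_search.best_estimator_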

Voila! That’s a wrap for this end-to-end project on classification! If you are keen to explore the entire code, feel free to check out my GitHub below:

Repository: https://github.com/andrewnguyen07/credit-risk-management
LinkedIn: www.linkedin.com/in/andrewnguyen07

Follow my Medium to stay posted on upcoming projects!

Translated from: https://towardsdatascience.com/credit-risk-management-classification-models-hyperparameter-tuning-d3785edd8371
