



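Next, I cover part 3 of the credit scorecard series: variable selection. Variable selection is the most important piece of groundwork before modeling. Many models that claim to rely on big data are built on thousands or even tens of thousands of variables, which is not scientific. A truly elegant model has a scientifically designed number of dimensions, neither too many nor too few. With too many dimensions, the model carries high systemic risk and high operational risk, and it runs slowly and inefficiently. With too few, the model lacks predictive power, and KS and AUC may not reach acceptable levels. By Toby, a modeling expert at a licensed consumer finance company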


Variables Selection in Predictive Analytics

Predictive Analytics: Variables Selection – by Roopam

The following story goes back to the time when I just started my transition from physics to business. I met this investment banker* in his mid-thirties during a Friday night party. After gulping down a few pints of beer, his mood became a bit somber and he told me how he hates his job. However, he had a plan of working his ass off until he retires at 45. Then he will do everything that makes him happy. I was thoroughly confused, how could someone debar himself from an emotion – happiness – for so many years and rediscover it later? I was wondering about the recipe for happiness – raindrops on roses and whiskers on kittens. An individual’s happiness is a tricky thing; however, I shall attempt to tackle this issue in my later article on logistic regression. For now, let us try to explore how states measure the collective well-being of their people. I shall use this topic of population well-being to explore an interesting topic in analytical scorecard development: variables selection.


Variables Selection – Lessons from GDP & GNH

The most popular measure of national prosperity, unanimously projected by economists and TV channels, is Gross Domestic Product (GDP). The equation for measuring GDP as taught in macroeconomics 101 is:

GDP = C + I + G + (X - M)

where C is private consumption, I is gross investment, G is government spending, X is exports, and M is imports.

Clearly, there are 5 factors/variables that govern GDP according to this equation. At first look, GDP seemed to me an incomplete measure of national well-being. All the variables in GDP come from commerce. They are important but cannot be the only factors in a country’s well-being, more so in a highly diverse and complicated country like India.


Gross National Happiness Index – The Story of Bhutan Naresh


Ok, so what else do we have? A lesser-known index is Gross National Happiness (GNH). The origins of GNH are in Bhutan. They measure their country’s progress through GNH. The term was coined and implemented by Jigme Singye Wangchuck. This name immediately takes me back to the early nineties live telecast of the SAARC summit by India’s national broadcaster Doordarshan (DD). The old-timer Hindi commentators were referring to a modest man in a bathrobe-like-attire as ‘Bhutan Naresh’ – King of Bhutan. At first glance, he did not fit well with the power horses of the south Asian region. Nevertheless, he seems to have devised a more holistic metric to measure his country’s well-being. GNH is a combination of the following broad categories:

1. Living standard & income
2. Health coverage
3. Psychological well-being
4. Time spent at work and relaxing
5. Good governance
6. Schooling & education
7. Cultural diversity
8. Community vitality
9. Environmentalism and conservation

GNH comprises 72 variables, each measured on a scale of 0 to 1, such as daily hours of sleep and trust in media; hmmm, not a bad start! You could do your own research on GNH and let me know what you feel about it. In fact, we can work out our own formula for a GNH-like metric. The idea is to select the right variables to build your model!
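To make the idea of "our own formula" concrete, here is a minimal sketch of a composite index built as a weighted average of indicators already scaled to the 0 to 1 range. The indicator names, values, and weights are hypothetical and chosen only for illustration; this is not the official GNH methodology.

```python
# Hypothetical indicators, each already scaled to the 0-1 range.
indicators = {"sleep_hours": 0.8, "trust_in_media": 0.5, "income": 0.6}

# Hypothetical weights: choosing them is a modeling decision in itself.
weights = {"sleep_hours": 1.0, "trust_in_media": 1.0, "income": 2.0}


def composite_index(indicators, weights):
    """Weighted average of 0-1 indicators; the result is also in 0-1."""
    total_weight = sum(weights[name] for name in indicators)
    weighted_sum = sum(indicators[name] * weights[name] for name in indicators)
    return weighted_sum / total_weight


score = composite_index(indicators, weights)
print(round(score, 3))
```

Which indicators go into the dictionary matters at least as much as the weights, which is exactly the variables selection problem.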





Variables Selection in Credit Scoring

In data mining and statistical model building exercises such as credit scoring, the variables selection process is performed through tests of statistical significance, a reasonably automated process in advanced software. However, the variables are still created and measured by humans. High-impact analyses in businesses are still driven by hunches. Human intelligence is not obsolete yet.
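As a rough illustration of selection by statistical significance, the sketch below screens binary candidate variables with a Pearson chi-square test on a 2x2 table of variable value against good/bad outcome. The chi-square test is only one possible choice of test, and the variable names, counts, and the 0.05 threshold are assumptions made for the example, not figures from any real portfolio.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom) for the table

               bad  good
        x = 1   a    b
        x = 0   c    d

    Returns (statistic, p_value).
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts (a, b, c, d) for two candidate variables.
candidates = {
    "owns_home": (30, 470, 70, 430),
    "has_landline": (50, 450, 50, 450),
}

# Keep only variables whose association with the outcome is significant.
selected = [
    name for name, counts in candidates.items()
    if chi_square_2x2(*counts)[1] < 0.05
]
print(selected)
```

Software can run this kind of screen over every candidate, but someone still has to decide which candidates to create and measure in the first place.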

In one of the projects I did with a financial organization, the results of credit risk analysis and scoring led to a redesign of the application form. Application forms are a major source of data about the borrower. However, nobody wants to fill in a lengthy form, so a form of optimal size helps ensure that the borrower provides accurate information. The idea is to select the right variables and ensure accurate measurement.

There are several aspects regarding variables but I will mention just one of them here (coarse classing).



Coarse Classing in Credit Scoring

One of my favorite activities as a kid was going to a shoe store and getting my feet measured every summer before school started. The shoe shops had a strange, miniature, slide-like device to measure foot size. It was fun to see my feet grow from one size to another every year or two. The growth was quantized, i.e., you are size 2 or 3, never 2.5 or 2.7. This conversion of measurements such as 2.5 and 2.7 to 3 is called grouping, bucketing or classing. It is an integral part of creating scorecards, and you will find it in all the books I have listed in the first part of this blog series.

I have been a part of several heated discussions on the relevance of coarse classing in scorecard development throughout my career. In most, if not all, academic articles you will not see coarse classing used as a technique during model development. Quite a few academicians and practitioners believe, for good reason, that coarse classing results in a loss of information. However, in my opinion, coarse classing has the following advantages over using the raw measurement of a variable.

1. It reduces random noise that exists in raw variables – similar to averaging and yes, you lose some information here.
2. It handles extreme events – on two extremes of a variable – much better where you have thin data.
3. It handles the non-linear relationship between dependent and independent variable without a lot of effort of variable transformation from the analyst.
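A minimal sketch of coarse classing, assuming hand-picked cut points: each raw value is mapped to a class, and the bad rate is then computed per class instead of per raw value. The ages, outcomes, and cut points below are made up for illustration.

```python
from bisect import bisect_right


def coarse_class(value, cut_points):
    """Return the index of the class that value falls into.

    With sorted cut_points [c0, c1, ...], class 0 is value < c0,
    class 1 is c0 <= value < c1, and so on.
    """
    return bisect_right(cut_points, value)


def bad_rate_by_class(values, is_bad, cut_points):
    """Share of bad accounts in each coarse class (None if a class is empty)."""
    totals = [0] * (len(cut_points) + 1)
    bads = [0] * (len(cut_points) + 1)
    for value, bad in zip(values, is_bad):
        k = coarse_class(value, cut_points)
        totals[k] += 1
        bads[k] += bad
    return [b / t if t else None for b, t in zip(bads, totals)]


# Hypothetical borrower ages and outcomes (1 = bad, 0 = good).
ages = [19, 22, 24, 31, 35, 38, 44, 52, 61, 90]
is_bad = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
cut_points = [25, 40]  # classes: under 25, 25 to 39, 40 and over

rates = bad_rate_by_class(ages, is_bad, cut_points)
print(rates)
```

The extreme age of 90 simply lands in the top class with its neighbours, and the per-class bad rates can follow any shape across classes, which is how classing copes with thin tails and non-linear relationships.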




Sign-off Note

We are halfway through this series on ‘Analytical Scorecard Development’ and I am thoroughly enjoying writing it. I hope that you, as a reader, are on the same page. Scorecard building is highly technical, and I have tried to discuss some aspects with easy-to-understand examples. However, to manage the length of the article, I am not able to get into the details. I must say that I love the details!


