Vintage Analysis or Retention Analysis with the R Package creditmodel

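1 What is vintage analysis?

Vintage analysis (also called account-age analysis) is widely used in the credit card and lending industry. The term comes from wine: just as wines from different years differ in quality, assets opened or disbursed in different periods differ in quality too. The core idea is to track each batch of assets from each period separately and compare the batches at the same account age, which shows how the quality of loans or credit cards issued in different periods compares.

In a broader sense, vintage analysis is a form of cohort analysis, similar to panel surveys in sociology, cohort analysis in demography, and retention analysis in internet operations, so I won't go over the concepts in detail. Let's go straight to the topic: how to do vintage analysis with the R package creditmodel.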
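To make the idea of "compare batches at the same account age" concrete, here is a minimal base R sketch on a made-up table. The data frame snap, its column names, and the 30-day threshold are assumptions for illustration only; this is not how creditmodel computes its results internally.

```r
# Toy loan snapshots: one row per loan per month on book (MOB).
# All values are invented for illustration.
snap <- data.frame(
  loan_id      = c("a", "a", "a", "b", "b", "b", "c", "c", "d", "d"),
  cohort       = c("2016-08", "2016-08", "2016-08",
                   "2016-08", "2016-08", "2016-08",
                   "2016-09", "2016-09", "2016-09", "2016-09"),
  mob          = c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2),
  overdue_days = c(0, 15, 45, 0, 0, 0, 0, 35, 0, 0),
  stringsAsFactors = FALSE
)

# Assumption: a loan counts as "bad" at a given MOB when it is
# more than 30 days overdue.
snap$bad <- as.integer(snap$overdue_days > 30)

# Share of bad loans per cohort (rows) and MOB (columns).
# Cells with no observations yet stay NA.
vintage <- tapply(snap$bad, list(snap$cohort, snap$mob), mean)
print(vintage)
```

Reading the matrix row by row gives one curve per cohort; reading it column by column compares different cohorts at the same account age.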

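2 Overview of the cohort analysis module in the creditmodel package

creditmodel is a powerful R data science toolkit developed by Hansen. It has five functional modules: data preprocessing, variable derivation, data analysis, data visualization, and automated modeling. The vintage analysis covered today is a submodule of the data analysis module and consists of four main functions: cohort_analysis, cohort_table, cohort_table_plot, and cohort_plot.

3 cohort analysis function reference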

Description

cohort_analysis is for cohort (vintage) analysis.

Usage

cohort_analysis(dat, obs_id = NULL, occur_time = NULL, MOB = NULL,
  period = "monthly", status = NULL, amount = NULL, by_out = "cnt",
  start_date = NULL, end_date = NULL, dead_status = 30)

cohort_table(dat, obs_id = NULL, occur_time = NULL, MOB = NULL,
  period = "monthly", status = NULL, amount = NULL, by_out = "cnt",
  start_date = NULL, end_date = NULL, dead_status = 30)

Arguments

dat: A data.frame containing id, occur_time, mob, status ...

obs_id: The name of the ID of observations or key variable of the data. Default is NULL.

occur_time: The name of the variable that represents the time at which each observation takes place.

MOB: Month on book (account age in months).

period: Period of the event to analyze. Default is "monthly".

status: Status of observations.

amount: The name of the variable representing amount. Default is NULL.

by_out: Output: amount ("amt") or count ("cnt").

start_date: The earliest occurrence time of observations.

end_date: The latest occurrence time of observations.

dead_status: Status of dead observations.
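The MOB argument above is the account age in months. If you want to supply your own MOB column instead of leaving it as NULL, one possible way to derive it is sketched below. The helper months_on_book is hypothetical (it is not part of creditmodel), and it simply counts calendar-month boundaries between two dates, which may differ from what the package does by default.

```r
# Hypothetical helper, not part of creditmodel: number of calendar months
# between the loan date and an observation date.
months_on_book <- function(loan_time, obs_time) {
  loan <- as.POSIXlt(as.Date(loan_time))
  obs  <- as.POSIXlt(as.Date(obs_time))
  # $year counts years since 1900, $mon runs from 0 to 11
  (obs$year - loan$year) * 12 + (obs$mon - loan$mon)
}

mob_single <- months_on_book("2016-06-01", "2016-09-30")
mob_vector <- months_on_book(c("2016-11-15", "2016-12-29"), "2017-03-31")
print(mob_single)
print(mob_vector)
```

Day-of-month is ignored on purpose here, so a loan disbursed on 2016-06-30 and observed on 2016-07-01 already has MOB 1. If you need full elapsed months instead, the day components would have to be compared as well.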

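4 Steps of a vintage analysis

4.1 Data preparation

For a vintage analysis, the input data needs at least four columns: the loan ID (loan_id), the loan time (loan_time), the loan amount (loan_amount), and the account status (max_overdue_days or age_overdue_days).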
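Before loading real data, it can help to check that these columns are actually present. The sketch below is a hypothetical helper (check_vintage_input is not part of creditmodel) written in base R; the toy data frame is invented and only mirrors the column names listed above.

```r
# Hypothetical helper, not part of creditmodel: verify the minimum columns
# described above and convert loan_time to Date.
check_vintage_input <- function(dat,
                                required = c("loan_id", "loan_time", "loan_amount"),
                                status_cols = c("max_overdue_days", "age_overdue_days")) {
  missing_cols <- setdiff(required, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!any(status_cols %in% names(dat))) {
    stop("Need at least one status column: ", paste(status_cols, collapse = " or "))
  }
  dat$loan_time <- as.Date(dat$loan_time)
  if (anyNA(dat$loan_time)) {
    stop("loan_time contains values that are not valid dates")
  }
  dat
}

# Invented toy rows, for illustration only
toy <- data.frame(
  loan_id          = c("n1", "n2"),
  loan_time        = c("2016-06-01", "2016-07-15"),
  loan_amount      = c(30000, 40000),
  max_overdue_days = c(NA, 12),
  stringsAsFactors = FALSE
)
checked <- check_vintage_input(toy)
print(class(checked$loan_time))
```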

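# Install and load the creditmodel package
# install.packages("creditmodel")
library(creditmodel)
# Read the data with read_data.
vin_dat = read_data("vin_dat.csv")
# Clean the data with the main function of creditmodel's data cleansing module.
# The module will be covered in detail in a later post; the arguments are only briefly described here.
vin_dat = data_cleansing(vin_dat,
  obs_id = "loan_id",      # primary key
  occur_time = 'loan_time', # time at which the event occurs
  outlier_proc = FALSE,    # no outlier processing
  missing_proc = FALSE,    # no missing-value processing
  remove_dup = FALSE,      # do not remove duplicated observations
  merge_cat = FALSE,       # do not merge the levels of categorical variables
  low_var = 0.9999,        # drop variables where a single value accounts for more than 0.9999
  missing_rate = 0.9999    # binarize variables whose missing rate exceeds 0.9999
)
# Use creditmodel's data_exploration function to get an overview of the data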
data_exploration(vin_dat)
>
* Observations      : 204697
* Numeric_variables : 7
* Category_variables: 1
* Date_variables    : 1
$num
             Feature  NMiss Miss_Rate        Max        75%     Median        25%        Min       Mean   Std
1   age_overdue_days 194695    95.11%        440        122         61         30          1         90    76
2 age_overdue_period     33     0.02%         15          0          0          0          0          0  0.87
3           loan_age     33     0.02%         15          6          3          1          0          4   3.2
4        loan_amount      0        0%     500000      60000      40000      30000       1000      47911 24279
5       loan_balance    823      0.4%     500000      53186      39046      27871        920      43112 22157
6          loan_time      0        0% 2017-09-30 2017-03-31 2016-12-29 2016-11-01 2016-06-01 2017-01-11    99
7   max_overdue_days 160974    78.64%        440         13          2          1          1         23    52
8 max_overdue_period      0        0%       15.0        0.0        0.0        0.0        0.0        0.3  0.93

$char
  Feature  NMiss Miss_Rate                 Value1                 Value2              Value3              Value4              Value5              Value6
1 loan_id 204697      100% n2016060100000030 : 48 n2016060100000010 : 30 n1605300032102 : 16 n1606010034102 : 16 n1606020032402 : 16 n1606020034202 : 16
           Value7
1 (Other) : 204555
>
# Use plot_table to draw the summary of the numeric variables
plot_table(data_exploration(vin_dat)$num)

4.2 Vintage analysis

4.2.1 Building the cohort_dat table

Use the cohort_analysis function to build the cohort_dat table.

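cohort_dat = cohort_analysis(vin_dat,
  obs_id = 'loan_id',          # loan ID
  occur_time = 'loan_time',    # loan time
  MOB = NULL,                  # month on book; you can pass your own variable here, by default calendar months are used
  period = 'monthly',          # cohorts are grouped by month; 'weekly' is also possible
  status = "age_overdue_days", # days overdue at the end of each account age; a self-defined 0/1 variable also works
  dead_status = 30,            # more than 30 days overdue counts as dead; for a 0/1 variable, set this to 0
  amount = "loan_amount",      # required when aggregating by amount; the loan amount is used here, the balance works too
  by_out = 'amt',              # 'amt' to aggregate by amount, 'cnt' to aggregate by count
  start_date = "2016-08-01",   # start of the observation window
  end_date = '2017-05-31'      # end of the observation window
)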

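The resulting table has the following structure:

4.2.2 Drawing the vintage chart

Drawing the vintage chart is very simple: call cohort_plot directly with the cohort_dat computed in the previous step.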

cohort_plot(cohort_dat)

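The chart above shows directly that recent asset quality has a deteriorating trend, so the current risk-control strategy should be reviewed to see whether it needs tightening.

4.2.3 Vintage table

Use the cohort_table function to get the vintage table. Its arguments are exactly the same as those of cohort_analysis.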

vin_table = cohort_table(vin_dat, obs_id = 'loan_id', occur_time = 'loan_time', MOB = NULL,
  period = 'monthly', status = "max_overdue_days", dead_status = 30,
  amount = "loan_balance", by_out = 'amt',
  start_date = "2016-09-01", end_date = '2017-07-31')

The final table looks like this:
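As a rough illustration of that layout, with cohorts in rows and periods in columns, a long-format cohort table can be pivoted in base R. The column names Cohort_Group, Cohort_Period and Events_rate are taken from the cohort_table_plot source in section 4.2.4; the numbers are invented, and the exact output of cohort_table may be laid out differently.

```r
# Toy long-format cohort data. The rates are invented for illustration.
cohort_long <- data.frame(
  Cohort_Group  = c("2016-09", "2016-09", "2016-09",
                    "2016-10", "2016-10", "2016-11"),
  Cohort_Period = c(1, 2, 3, 1, 2, 1),
  Events_rate   = c(0.000, 0.012, 0.020, 0.001, 0.015, 0.002),
  stringsAsFactors = FALSE
)

# Pivot to a wide matrix: one row per cohort, one column per period.
# Each cell holds a single value, so sum() just passes it through;
# periods a cohort has not reached yet stay NA.
cohort_wide <- tapply(
  cohort_long$Events_rate,
  list(cohort_long$Cohort_Group, cohort_long$Cohort_Period),
  sum
)
print(cohort_wide)
```

The triangular shape, with NA in the lower right, is typical for vintage tables: younger cohorts have not yet been on book long enough to fill the later columns.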

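4.2.4 Drawing the vintage table

How do you draw the vintage table elegantly? In principle a single step is enough: cohort_table_plot(cohort_dat). However, due to an oversight on Hansen's part, this function has some bugs in creditmodel 1.1.8, the latest version on CRAN, and cannot draw the table in one step. So I'm posting the bug-fixed source here; load this function before drawing the vintage table.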

#' cohort_table_plot
#' \code{cohort_table_plot} is for plotting cohort(vintage) analysis table.
#' @param cohort_dat A data.frame generated by \code{cohort_analysis}.
#' @import ggplot2
#' @export
cohort_table_plot = function(cohort_dat) {
  # set global options; on.exit restores them when the function returns
  opt = options('warn' = -1, scipen = 200, stringsAsFactors = FALSE, digits = 6)
  on.exit(options(opt))
  # cohort_dat[is.na(cohort_dat)] = 0
  # initial parameters
  Cohort_Group = Cohort_Period = Events = Events_rate = Opening_Total = Retention_Total = cohor_dat = final_Events = m_a = max_age = NULL
  # plot: one tile per cohort and period, colored by the event rate
  cohort_plot = ggplot(cohort_dat, aes(reorder(paste0(Cohort_Period), Cohort_Period),
                                       Cohort_Group, fill = Events_rate)) +
    geom_tile(colour = 'white') +
    geom_text(aes(label = as_percent(Events_rate, 4)), size = 3) +
    scale_fill_gradient2(limits = c(0, max(cohort_dat$Events_rate)),
                         low = love_color('deep_red'), mid = 'white',
                         high = love_color(),
                         midpoint = median(cohort_dat$Events_rate, na.rm = TRUE),
                         na.value = love_color('pale_grey')) +
    scale_y_discrete(limits = rev(unique(cohort_dat$Cohort_Group))) +
    scale_x_discrete(position = "top") +
    labs(x = "Cohort_Period", title = "Cohort Analysis") +
    theme(text = element_text(size = 15), rect = element_blank()) +
    plot_theme(legend.position = 'right', angle = 0)
  return(cohort_plot)
}

The data visualization module of creditmodel relies on ggplot2 for plotting, so don't forget to load ggplot2 before drawing.

library(ggplot2)
vin_table = cohort_table(vin_dat, obs_id = 'loan_id', occur_time = 'loan_time', MOB = NULL,
  period = 'monthly', status = "max_overdue_days", dead_status = 30,
  amount = "loan_balance", by_out = 'amt',
  start_date = "2016-09-01", end_date = '2017-07-31')
cohort_table_plot(cohort_dat)

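5 Summary

The R package creditmodel is a powerful data science toolkit that combines variable derivation, data preprocessing, data analysis, modeling, and data visualization. For more in-depth use of the package, follow Hansen's WeChat official account hansenmode.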

Note that all data used in the analysis above is simulated and has no real-world reference value.