Hi everyone, I'm 小爱同学.
First, the competition link:
Link: link.
This is a risky-user identification competition. If you're interested, read on.
I. Problem Understanding

The competition provides three tables: user profile information, user operation logs, and user transaction logs. The evaluation metric is AUC; because AUC measures ranking quality rather than accuracy at a fixed threshold, the class imbalance in this problem has little direct effect on our modeling. Since the task is user credit-risk identification, time, amount, and region are the keys to feature construction.

II. Data Preprocessing
1. Missing-value handling
Given the nature of this task, we do not impute missing values in the usual way; instead we treat missingness itself as a signal: categorical features get the placeholder '\N', and numeric features get -1.
The missing-value handling code:

    # Missing-value handling: keep missingness as its own category
    cols = ['sex', 'balance_avg', 'balance1_avg', 'provider', 'province', 'city', 'level']
    for col in cols:
        data[col].fillna(r'\N', inplace=True)

    # Numeric-like columns: map the placeholder '\N' to -1
    cols = ['balance_avg', 'balance1_avg', 'level']
    for col in cols:
        data[col].replace({r'\N': -1}, inplace=True)

2. Encoding
(1) Unordered low-cardinality categorical features (e.g., sex): we encode them with LabelEncoder.
The label-encoding code:

    # Unordered low-cardinality categorical features
    cols = ['sex', 'provider', 'verified', 'regist_type', 'agreement1', 'agreement2',
            'agreement3', 'agreement4', 'province', 'city', 'service3']
    for col in cols:
        if data[col].dtype == 'object':
            data[col] = data[col].astype(str)
    labelEncoder_df(data, cols)
    print(data.info())
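The helper labelEncoder_df is not defined in the post; here is a minimal sketch of what it presumably does, fitting a LabelEncoder per column and replacing values in place:

    from sklearn.preprocessing import LabelEncoder

    # Assumed implementation of the helper used above
    def labelEncoder_df(df, cols):
        for col in cols:
            df[col] = LabelEncoder().fit_transform(df[col].astype(str))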

(2) Unordered high-cardinality categorical features (e.g., city, province): we use target encoding. To reduce overfitting, we compute the encoded values with a 5-fold cross-validation scheme, as shown in the code below.

The target-encoding code:

    # Target encoding with K-fold out-of-fold means
    def kfold_stats_feature(train, test, feats, k):
        # Best kept consistent with the K-fold CV used for the model later
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=44)
        train['fold'] = None
        for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, train['label'])):
            train.loc[val_idx, 'fold'] = fold_
        kfold_features = []
        for feat in feats:
            nums_columns = ['label']
            for f in nums_columns:
                colname = feat + '_' + f + '_kfold_mean'
                kfold_features.append(colname)
                train[colname] = None
                for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, train['label'])):
                    tmp_trn = train.iloc[trn_idx]
                    order_label = tmp_trn.groupby([feat])[f].mean()
                    tmp = train.loc[train.fold == fold_, [feat]]
                    train.loc[train.fold == fold_, colname] = tmp[feat].map(order_label)
                    # fillna: use the global mean for categories unseen in the other folds
                    global_mean = train[f].mean()
                    train.loc[train.fold == fold_, colname] = train.loc[train.fold == fold_, colname].fillna(global_mean)
                train[colname] = train[colname].astype(float)
            for f in nums_columns:
                colname = feat + '_' + f + '_kfold_mean'
                test[colname] = None
                order_label = train.groupby([feat])[f].mean()
                test[colname] = test[feat].map(order_label)
                # fillna
                global_mean = train[f].mean()
                test[colname] = test[colname].fillna(global_mean)
                test[colname] = test[colname].astype(float)
        del train['fold']
        return train, test
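A usage sketch (the driver code is not shown in the post; the feature list and k=5 are assumptions):

    from sklearn.model_selection import StratifiedKFold

    # Hypothetical call: encode the high-cardinality columns with 5 folds
    train, test = kfold_stats_feature(train, test, feats=['province', 'city'], k=5)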

(3) Ordinal and continuous features: convert them to integer values.
The conversion code:

    # Convert ordinal and continuous columns encoded as "category %d" / "level %d" strings
    cols_int = [f for f in data.columns if f in
                ['level', 'balance', 'balance_avg', 'balance1', 'balance1_avg', 'balance2', 'balance2_avg',
                 'product1_amount', 'product2_amount', 'product3_amount', 'product4_amount', 'product5_amount',
                 'product6_amount']]
    for col in cols_int:
        for i in range(0, 50):
            data.loc[data[col] == "category %d" % i, col] = i
            data.loc[data[col] == "level %d" % i, col] = i
        print(data[col].isnull().sum())    # sanity check: remaining missing values
        data[col] = data[col].astype(int)

III. Feature Engineering
1. Time-feature analysis


Comparing the time distribution of operation volume between train and test, we can assume both share the same starting time point. Given the clear difference between positive and negative samples in the hourly transaction-volume distribution, we can confidently construct time-window features.
For example: statistics of a user's transaction amounts on weekday n, after the n-th day, and after hour n of each day.
The window-feature code:

    # Named aggregation (**{...}) replaces the deprecated dict-renaming form of agg
    def gen_user_window_amount_features(df, window):
        # Amount statistics over transactions more than `window` days in
        group_df = df[df['days_diff'] > window].groupby('user')['amount'].agg(**{
            'user_amount_mean_{}d'.format(window): 'mean',
            'user_amount_std_{}d'.format(window): 'std',
            'user_amount_max_{}d'.format(window): 'max',
            'user_amount_min_{}d'.format(window): 'min',
            'user_amount_sum_{}d'.format(window): 'sum',
            'user_amount_med_{}d'.format(window): 'median',
            'user_amount_cnt_{}d'.format(window): 'count',
        }).reset_index()
        return group_df

    def gen_user_window_amount_hour_features(df, window):
        # Amount statistics over transactions after hour `window` of the day
        group_df = df[df['hour'] > window].groupby('user')['amount'].agg(**{
            'user_amount_mean_{}h'.format(window): 'mean',
            'user_amount_std_{}h'.format(window): 'std',
            'user_amount_max_{}h'.format(window): 'max',
            'user_amount_min_{}h'.format(window): 'min',
            'user_amount_sum_{}h'.format(window): 'sum',
            'user_amount_med_{}h'.format(window): 'median',
            'user_amount_cnt_{}h'.format(window): 'count',
        }).reset_index()
        return group_df

    def gen_user_window_amount_week_features(df, window):
        # Amount statistics over transactions on weekday `window`
        group_df = df[df['week'] == window].groupby('user')['amount'].agg(**{
            'user_amount_mean_{}w'.format(window): 'mean',
            'user_amount_std_{}w'.format(window): 'std',
            'user_amount_max_{}w'.format(window): 'max',
            'user_amount_min_{}w'.format(window): 'min',
            'user_amount_sum_{}w'.format(window): 'sum',
            'user_amount_med_{}w'.format(window): 'median',
            'user_amount_cnt_{}w'.format(window): 'count',
        }).reset_index()
        return group_df
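A usage sketch for merging these aggregates onto the user table; `trans` as the transaction table and the window values are assumptions:

    # Hypothetical driver code: build features for several windows and merge per user
    for w in [3, 7, 15]:
        data = data.merge(gen_user_window_amount_features(trans, w), on='user', how='left')
    for h in [6, 12, 18]:
        data = data.merge(gen_user_window_amount_hour_features(trans, h), on='user', how='left')
    for d in range(7):
        data = data.merge(gen_user_window_amount_week_features(trans, d), on='user', how='left')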

2. RFM features
From our background research we learned about the RFM model (Recency, Frequency, Monetary), an important tool for measuring customer value and profitability. Based on it we constructed many useful features; a sketch is given below.
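A minimal sketch of RFM-style aggregates, assuming a transaction table `trans` with columns `user`, `days_diff`, and `amount` (the post's exact feature list is not shown):

    def gen_user_rfm_features(df):
        # R: day of the most recent transaction, F: transaction count, M: total amount
        group_df = df.groupby('user').agg(**{
            'user_last_trans_day': ('days_diff', 'max'),  # recency proxy
            'user_trans_cnt': ('amount', 'count'),        # frequency
            'user_amount_sum': ('amount', 'sum'),         # monetary
        }).reset_index()
        return group_df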


3. TF-IDF features
We extract TF-IDF features from the operation mode and operation type columns.
The TF-IDF feature code:

    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    def gen_user_tfidf_features(df, value):
        df[value].fillna('-1', inplace=True)  # fill before the str cast, otherwise NaN becomes the string 'nan'
        df[value] = df[value].astype(str)
        # Collect each user's values (e.g. op_mode) into a list
        group_df = df.groupby(['user']).apply(lambda x: x[value].tolist()).reset_index()
        group_df.columns = ['user', 'list']
        # Join the list into one comma-separated "document" per user
        group_df['list'] = group_df['list'].apply(lambda x: ','.join(x))
        enc_vec = TfidfVectorizer()
        # TF-IDF matrix: one row per user, one column per token
        tfidf_vec = enc_vec.fit_transform(group_df['list'])
        # TruncatedSVD works on the sparse TF-IDF matrix and lets us pick the output dimension
        svd_enc = TruncatedSVD(n_components=10, n_iter=20, random_state=2020)
        vec_svd = svd_enc.fit_transform(tfidf_vec)
        vec_svd = pd.DataFrame(vec_svd)
        vec_svd.columns = ['svd_tfidf_{}_{}'.format(value, i) for i in range(10)]
        group_df = pd.concat([group_df, vec_svd], axis=1)
        del group_df['list']
        return group_df
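A hypothetical call, assuming the operation table `op` has columns `op_mode` and `op_type`:

    data = data.merge(gen_user_tfidf_features(op, 'op_mode'), on='user', how='left')
    data = data.merge(gen_user_tfidf_features(op, 'op_type'), on='user', how='left')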

IV. Model Ensembling
We fuse three models, LightGBM, XGBoost, and CatBoost, each trained with several parameter settings. Before blending, we analyzed the pairwise correlation of the models' predictions to check their diversity.


The LightGBM training code:

    import lightgbm as lgb
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import StratifiedKFold

    def lgb_model(train, target, test, k):
        feats = [f for f in train.columns if f not in ['user', 'label']]
        print('Current num of features:', len(feats))
        oof_probs = np.zeros(train.shape[0])
        output_preds = 0
        offline_score = []
        feature_importance_df = pd.DataFrame()
        parameters = {
            'learning_rate': 0.01,
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'num_leaves': 68,
            'feature_fraction': 0.4,
            'bagging_fraction': 0.8,
            'min_data_in_leaf': 25,
            'verbose': -1,
            'nthread': 8,
            'max_depth': 8,
        }
        seeds = [2020]
        for seed in seeds:
            folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
            for i, (train_index, test_index) in enumerate(folds.split(train, target)):
                train_y, test_y = target[train_index], target[test_index]
                train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]
                dtrain = lgb.Dataset(train_X, label=train_y)
                dval = lgb.Dataset(test_X, label=test_y)
                # early_stopping_rounds / verbose_eval follow the LightGBM 3.x API
                lgb_model = lgb.train(
                    parameters,
                    dtrain,
                    num_boost_round=5000,
                    valid_sets=[dval],
                    early_stopping_rounds=200,
                    verbose_eval=100,
                )
                oof_probs[test_index] = lgb_model.predict(test_X[feats], num_iteration=lgb_model.best_iteration) / len(seeds)
                offline_score.append(lgb_model.best_score['valid_0']['auc'])
                output_preds += lgb_model.predict(test[feats], num_iteration=lgb_model.best_iteration) / folds.n_splits / len(seeds)
                print(offline_score)
                # feature importance
                fold_importance_df = pd.DataFrame()
                fold_importance_df["feature"] = feats
                fold_importance_df["importance"] = lgb_model.feature_importance(importance_type='gain')
                fold_importance_df["fold"] = i + 1
                feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
        print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
        print('feature importance:')
        print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(310))
        feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(457).to_csv('../importance/08_26_452.csv')
        return output_preds, oof_probs, np.mean(offline_score)
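The blending code is not shown in the post; a minimal equal-weight averaging sketch, where xgb_model and cat_model are hypothetical wrappers with the same signature as lgb_model:

    # Equal weights are an assumption; in practice weights can be tuned on OOF AUC
    lgb_test, lgb_oof, lgb_auc = lgb_model(train, target, test, k=5)
    xgb_test, xgb_oof, xgb_auc = xgb_model(train, target, test, k=5)  # hypothetical
    cat_test, cat_oof, cat_auc = cat_model(train, target, test, k=5)  # hypothetical
    blend_test = (lgb_test + xgb_test + cat_test) / 3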

V. Rule-based Score Boosting
Through analysis, we found that when a user's balance level is 1 and the product amount level is 21, the user's risk is extremely high; we set such predictions to 1/2 of their original value, as sketched below.
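A sketch of the rule (the two level-column names here are hypothetical):

    # Adjust the predicted risk for the flagged segment, per the rule above
    mask = ((test['balance_level'] == 1) & (test['product_amount_level'] == 21)).values
    output_preds[mask] = output_preds[mask] / 2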

The full code is here: https://github.com/poplar1hhh/yipay. If you find it useful, please give it a star, thanks!
