Reposted from AI Studio. Original post: 2022 AIWIN Competition – Bond Issuer Default Risk Early-Warning Baseline - 飞桨 AI Studio

1 Background

Since the myth of "guaranteed repayment" (刚性兑付) in China's bond market was broken in 2014, bond defaults have risen steadily. In 2018, 160 bonds from 44 issuers defaulted, with an outstanding default balance of 150.525 billion RMB, the most severe on record. With credit risk surfacing at an accelerating pace and defaults becoming the norm, effectively assessing and forecasting issuer default risk has become a pressing regulatory challenge. Because information is incomplete, financial data alone can no longer fully explain the default risk premium. Making effective use of non-financial data, such as issuer news sentiment and equity upstream/downstream data, is therefore of real value for predicting issuer default risk.

2 Task

The task is to train a prediction model (using machine learning, deep learning, etc.) that learns from issuer-related information and predicts whether an issuer is at risk of defaulting over a future period. The difficulty is that the dataset contains a large amount of heterogeneous issuer information (shareholder records, outbound investment records, news sentiment, and so on); extracting effective features from it and using them for risk prediction is the central problem.

3 Data

The competition provides issuer default data from 2019-2020 for model training; the goal is to predict the probability that an issuer defaults in 2021. The issuer universe consists of companies that issued bonds between 2019 and 2021, and it is the same in both rounds. On top of the preliminary-round data, the final round adds issuer shareholder data, outbound investment data, and news sentiment for related companies.

3.1 Datasets provided in the preliminary round:

  • Basic company information (issuers only)
  • Financial indicators for 2018-2020
  • News sentiment for 2018-2020 (issuers only)
  • Default records for 2019-2020

3.2 Datasets provided in the final round:

  • Basic company information (issuers plus their shareholder and investee companies)
  • Issuer shareholder records
  • Issuer outbound investment records
  • Financial indicators for 2018-2020
  • News sentiment for 2018-2020 (issuers plus their shareholder and investee companies)
  • Default records for 2019-2020

4 Submission Format

Column          Meaning              Type     Description
ent_id          company ID           string
default_score   default risk score   double   value range: [0, 1]
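A minimal sketch of producing a file in this format (the IDs and scores below are illustrative; the '|' separator and file name match the export step at the end of this notebook):

import pandas as pd

submission = pd.DataFrame({
    'ent_id': ['E0001', 'E0002'],     # illustrative company IDs
    'default_score': [0.12, 0.87],    # predicted default risk in [0, 1]
})
submission.to_csv('answer.csv', header=True, index=False, sep='|')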

5 Evaluation Metric (AUC)
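Submissions are scored on the ranking quality of the predicted scores. A minimal sketch of computing AUC locally with scikit-learn (the labels and scores here are illustrative):

from sklearn.metrics import roc_auc_score

# Illustrative true labels (1 = default) and predicted default-risk scores
y_true  = [0, 0, 1, 0, 1]
y_score = [0.1, 0.3, 0.8, 0.2, 0.6]
print(roc_auc_score(y_true, y_score))  # 1.0: every defaulter is ranked above every non-defaulter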

5.1 Notes:

The balance of positive and negative samples has a large effect on deep learning. This dataset contains very few samples of the defaulting class (317 default records among 10,152 issuers, as the data-loading step below shows), so sensible resampling of the two classes can bring a significant improvement; a minimal sketch follows.
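One simple way to act on this note is to oversample the scarce default class before training. A minimal pandas sketch, assuming the 'label' column name used by the training frame built later in this notebook; the duplication ratio is an assumption to tune:

import pandas as pd

def oversample(train_df: pd.DataFrame, label_col: str = 'label', ratio: int = 5, seed: int = 2022) -> pd.DataFrame:
    """Duplicate the minority (default) rows `ratio` times and reshuffle."""
    pos = train_df[train_df[label_col] == 1]
    neg = train_df[train_df[label_col] == 0]
    balanced = pd.concat([neg] + [pos] * ratio, ignore_index=True)
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)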

5.2 After forking, the project supports one-click training and inference. Forks and Stars are welcome.

6 Unzip the Dataset

In [1]

# Unzip the dataset (-O CP936 decodes the GBK-encoded Chinese file names)
!rm -rf dataset 初赛数据集
!unzip -O CP936 data/data139820/dataset.zip
!mv 初赛数据集 dataset
Archive:  data/data139820/dataset.zip
   creating: 初赛数据集/
  inflating: 初赛数据集/ent_default.csv
  inflating: 初赛数据集/ent_financial_indicator.csv
  inflating: 初赛数据集/ent_info.csv
  inflating: 初赛数据集/ent_news.csv
  inflating: 初赛数据集/submission.csv

7 Import Libraries

In [1]

import os
import gc
import json
import time
import random
import argparse
import numpy as np
import pandas as pd
import numpy.linalg as npl
from tqdm import tqdm
from functools import partial
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

import paddle
import paddle.nn as nn
import paddlenlp as ppnlp
from paddlenlp.datasets import load_dataset
from paddlenlp.data import Stack, Tuple, Pad
from paddlenlp.transformers import LinearDecayWithWarmup

8 Configuration

In [3]

# yapf: disable
class Args(object):
    def __init__(self):
        self.data_path = '/home/aistudio/dataset/'
        self.data_set = '/home/aistudio/dataset/df_data.feather'
        self.train_set = '/home/aistudio/temp/train.feather'
        self.valid_set = '/home/aistudio/temp/valid.feather'
        self.test_set = '/home/aistudio/temp/test.feather'
        self.epochs = 10                 # optional, number of training epochs; default 3
        self.weight_decay = 0.0          # optional, regularization strength to curb overfitting; default 0.1
        self.learning_rate = 1e-5        # optional, peak fine-tuning learning rate; default 5e-5
        self.warmup_proportion = 0.1     # optional, warmup ratio: with 0.1, the LR grows from 0 to learning_rate over the first 10% of steps, then slowly decays; default 0.1
        self.max_seq_length = 256        # optional, max sequence length for ERNIE-Gram-zh (at most 512); lower this if you run out of GPU memory; default 256
        self.train_batch_size = 64
        self.valid_batch_size = 128
        self.init_from_ckpt = None       # optional, path to model parameters for warm-starting training; default None
        self.num_kfolds = 100            # optional, number of folds; with defaults so scarce, K-fold CV may be skipped
        self.eval_step = 10
        self.eval_accuracy = 0.75
        self.save_step = 100
        self.save_accuracy = 0.75
        self.max_steps = 10000           # optional, maximum number of training steps
        self.print_steps = 1             # optional, logging interval in steps
        self.rdrop_coef = 0.0
        self.save_dir = './checkpoint'   # optional, directory for saved checkpoints; default ./checkpoints
        self.input_dir = './result/submit.csv'  # where the inference result is written
        self.seed = 2022                 # optional, random seed; default 2022
        self.device = 'gpu'              # training device, 'cpu' or 'gpu'; for GPU training the card is chosen via the gpus parameter
        self.num_classes = 2
        self.label = 'default_score'
        self.from_pretrained = 'ernie-gram-zh'  # optional, pretrained model name; can be swapped
# yapf: enable

args = Args()

# Fix the random seeds for reproducibility
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)

set_seed(args.seed)

# Configure the device
os.environ['FLAGS_cudnn_deterministic'] = 'True'  # make cuDNN deterministic (must be set in the process environment)
paddle.set_device(args.device)
rank = paddle.distributed.get_rank()
if paddle.distributed.get_world_size() > 1:
    paddle.distributed.init_parallel_env()

# Load the tokenizer
tokenizer = None
if args.from_pretrained == 'ernie-1.0':
    tokenizer = ppnlp.transformers.ErnieTokenizer.from_pretrained('ernie-1.0')
elif args.from_pretrained == 'ernie-gram-zh':
    tokenizer = ppnlp.transformers.ErnieGramTokenizer.from_pretrained('ernie-gram-zh')
[2022-04-20 00:19:57,773] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-gram-zh/vocab.txt
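As a quick sanity check of what the tokenizer produces (the headline below is illustrative, in the "source:title" format built later in this notebook):

# Illustrative input; real contexts are concatenated news headlines
encoded = tokenizer(text='新华社:某公司债券违约', max_seq_len=256)
print(encoded['input_ids'][:5], encoded['token_type_ids'][:5])
# input_ids start with the [CLS] id; token_type_ids are all 0 for a single segment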

In [4]

# Group-by aggregation features: min/max/mean/sum of `key` grouped by `col`.
# Note: at the call sites below, `col` receives the group key ('ent_id') and
# `key` receives the value column, despite what the parameter names suggest.
def cnt_features(col, key, data, fea_data, his_i='all', prefix='', mode='count', astype='int'):
    cnt_col = his_i + '_' + prefix + '_' + ''.join(col) + '_' + mode + '_' + key
    cnt = fea_data.groupby(col)[key].agg(mode).reset_index().rename(columns={key: cnt_col})
    if astype == 'float':
        data = data.merge(cnt, on=col, how='left')
        data[cnt_col] = data[cnt_col].fillna(0).astype(float)
    return data

# One-hot encoding of categorical columns
def onehoting(data, oh_cols):
    for i, col in tqdm(enumerate(oh_cols)):
        temp = pd.get_dummies(data[col].fillna('-999').values)
        temp.columns = (col + '_' + pd.Series(temp.columns).astype(str)).values
        data = pd.concat([data, temp], axis=1)
    return data

# Feature engineering
def extract_features(data, ent_info, ent_news, ent_financial_indicator):
    data = data.merge(ent_info, on='ent_id', how='left')
    for col in tqdm(ent_info.columns[1:]):
        if col not in ['opfrom', 'opto', 'esdate', 'apprdate']:
            data = onehoting(data, [col])
            del data[col]; gc.collect()
    # min/max/sum/mean aggregates of ent_news['impscore'] per issuer
    ent_news['impscore'] = ent_news['impscore'].astype(int)
    for key in ['ent_id']:
        for col in tqdm(['impscore']):
            data = cnt_features(key, col, data, ent_news, mode='min', astype='float')
            data = cnt_features(key, col, data, ent_news, mode='max', astype='float')
            data = cnt_features(key, col, data, ent_news, mode='sum', astype='float')
            data = cnt_features(key, col, data, ent_news, mode='mean', astype='float')
    # min/max/sum/mean aggregates of the numeric financial indicators per issuer
    for key in ['ent_id']:
        for col in tqdm(ent_financial_indicator.columns[2:]):
            data = cnt_features(key, col, data, ent_financial_indicator, mode='min', astype='float')
            data = cnt_features(key, col, data, ent_financial_indicator, mode='max', astype='float')
            data = cnt_features(key, col, data, ent_financial_indicator, mode='sum', astype='float')
            data = cnt_features(key, col, data, ent_financial_indicator, mode='mean', astype='float')
    # Min-max normalize the continuous features
    scaler = MinMaxScaler(feature_range=(0, 1))
    for col in tqdm(data.columns[2:]):
        data[col] = scaler.fit_transform(data[col].fillna(-1).values.reshape(-1, 1)).reshape(-1)
    # Concatenate each issuer's 14 most recent news items ("source:title"), newest first
    ent_news['context'] = ent_news['newssource'] + ':' + ent_news['newstitle']
    ent_news = ent_news.groupby(['ent_id'])['context'].apply(lambda x: '。'.join(x.tolist()[-14:][::-1]))
    data['context'] = data['ent_id'].map(ent_news).fillna('')
    data['label'] = data['default_score']
    del data['default_score']; gc.collect()
    # Drop constant columns to reduce memory use
    for col in tqdm(data.columns[1:-1]):
        if data[col].nunique() == 1:
            del data[col]
    return data
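To make the column naming of cnt_features concrete, a quick illustration on toy frames (the data below is made up, not from the competition):

# Hypothetical toy frames illustrating cnt_features' output naming
toy_data = pd.DataFrame({'ent_id': ['a', 'b'], 'default_score': [0, 1]})
toy_news = pd.DataFrame({'ent_id': ['a', 'a', 'b'], 'impscore': [3, 5, 2]})

out = cnt_features('ent_id', 'impscore', toy_data, toy_news, mode='max', astype='float')
print(out.columns.tolist())
# ['ent_id', 'default_score', 'all__ent_id_max_impscore']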

9 Load the Dataset

In [88]

# ent_news contains '|' inside text fields, so read whole lines and split manually
# (reading with sep='|' directly would require error_bad_lines handling)
ent_news = pd.read_csv(args.data_path + 'ent_news.csv', sep='\n')
for i, col in enumerate(ent_news.columns[0].split('|')):
    ent_news[col] = ent_news[ent_news.columns[0]].apply(lambda x: x.split('|')[i])
del ent_news[ent_news.columns[0]]; gc.collect()
ent_news = ent_news.sort_values('publishdate', ascending=True).reset_index(drop=True)

ent_info = pd.read_csv(args.data_path + 'ent_info.csv', sep='|')
ent_default = pd.read_csv(args.data_path + 'ent_default.csv', sep='|')
ent_financial_indicator = pd.read_csv(args.data_path + 'ent_financial_indicator.csv', sep='|')
test_data = pd.read_csv(args.data_path + 'submission.csv', sep='|')
print(ent_info.shape, ent_news.shape, ent_default.shape, ent_financial_indicator.shape, test_data.shape)

# Build the positive/negative training samples: issuers with a default record get label 1
ent_default[args.label] = 1
train_data = ent_info[['ent_id']].merge(ent_default, on='ent_id', how='left')[['ent_id', args.label]]
train_data[args.label] = train_data[args.label].fillna(0)

# Concatenate the training and test sets
df_data = pd.concat([train_data, test_data[['ent_id']]], ignore_index=True)
print(df_data.shape)
(10152, 14) (639276, 9) (317, 2) (62856, 164) (8963, 2)
(19309, 2)

9.1 Save the processed file

In [7]

df_data = extract_features(df_data, ent_info, ent_news, ent_financial_indicator)
del ent_info, ent_news, ent_financial_indicator; gc.collect()
df_data.to_feather(args.data_set)
(tqdm progress bars omitted; the run shows one-hot encoding over 13 ent_info columns, a TF-IDF + SVD pass over 2 text columns, 162 financial-indicator aggregations, min-max scaling of 8,921 columns, and the constant-column sweep over 8,922 columns, roughly 25 minutes in total.)

9.2 Load the processed file

In [8]

df_data = pd.read_feather(args.data_set)

10 Data Structures

10.1 Build the batch data structure the model consumes

In [4]

# Data iterator (yields one example per row of the feather file)
def read_text_pair(data_path, is_test=False):
    f = pd.read_feather(data_path).values
    for line in f:
        if is_test == False:
            yield {'ent_id': line[0], 'nums': line[1:-2], 'text': line[-2], 'label': line[-1]}
        else:
            yield {'ent_id': line[0], 'nums': line[1:-2], 'text': line[-2]}

# Convert one example into model inputs
def convert_example(example, tokenizer, max_seq_length=512, is_test=False):
    text = example["text"]
    encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length)
    input_ids = encoded_inputs["input_ids"]
    token_type_ids = encoded_inputs["token_type_ids"]
    nums = np.array(example['nums'])
    if is_test == False:
        label = np.array([example["label"]], dtype="int64")
        return input_ids, token_type_ids, nums, label
    else:
        return input_ids, token_type_ids, nums

# Create a PaddlePaddle DataLoader
def create_dataloader(dataset, mode='train', batch_size=1, batchify_fn=None, trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)
    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
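To make the collation step concrete, a minimal sketch of what a batchify_fn built from Tuple/Pad/Stack (as used during training below) does, on made-up token ids:

# Toy samples: (input_ids, token_type_ids, nums, label) with unequal sequence lengths
samples = [
    ([1, 5, 9, 2], [0, 0, 0, 0], [0.1, 0.7], 0),
    ([1, 8, 2],    [0, 0, 0],    [0.3, 0.2], 1),
]
toy_batchify_fn = Tuple(
    Pad(axis=0, pad_val=0),   # pad input_ids to the longest sequence in the batch
    Pad(axis=0, pad_val=0),   # pad token_type_ids the same way
    Stack(dtype="float32"),   # stack the dense numeric features
    Stack(dtype="int64"),     # stack the labels
)
input_ids, token_type_ids, nums, labels = toy_batchify_fn(samples)
print(input_ids.shape)  # (2, 4): the shorter sequence was padded with 0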

11 Text Model

ERNIE-Gram-zh

12 Overall Architecture

ERNIE-Gram-zh + FC (5 layers)

In [2]

# Model definition (the FC stack can be replaced by LSTM/GRU layers for further gains)
class PaddleModel(nn.Layer):
    def __init__(self, pretrained_model):
        super().__init__()
        self.ptm = pretrained_model
        self.fc1 = nn.Linear(8880, 2048)  # FC layer 1 (ERNIE's 2 logits concatenated with the numeric features)
        self.fc2 = nn.Linear(2048, 512)   # FC layer 2
        self.fc3 = nn.Linear(512, 128)    # FC layer 3
        self.fc4 = nn.Linear(128, 32)     # FC layer 4
        self.fc5 = nn.Linear(32, 2)       # FC layer 5

    def forward(self, input_ids, token_type_ids, nums):
        logits = self.ptm(input_ids, token_type_ids)
        logits_fc = paddle.concat([logits, nums], 1)
        logits_fc1 = self.fc1(logits_fc)
        logits_fc2 = self.fc2(logits_fc1)
        logits_fc3 = self.fc3(logits_fc2)
        logits_fc4 = self.fc4(logits_fc3)
        logits_fc5 = self.fc5(logits_fc4)
        return logits_fc5  # default-risk logits

# Validation loop
@paddle.no_grad()
def evaluate(model, criterion, metric, data_loader):
    model.eval()
    metric.reset()
    losses = []
    total_num = 0
    for batch in data_loader:
        input_ids, token_type_ids, nums, labels = batch
        total_num += len(labels)
        logits = model(input_ids=input_ids, token_type_ids=token_type_ids, nums=nums)
        loss = criterion(logits, labels)
        losses.append(loss.numpy())
        correct = metric.compute(logits, labels)
        metric.update(correct)
    accu = metric.accumulate()
    print("dev_loss: {:.5}, acc: {:.5}, total_num:{}".format(np.mean(losses), accu, total_num))
    model.train()
    metric.reset()
    return accu

# Prediction loop
def predict(model, data_loader):
    batch_logits = []
    model.eval()
    with paddle.no_grad():
        for batch_data in tqdm(data_loader):
            input_ids, token_type_ids, nums = batch_data
            input_ids = paddle.to_tensor(input_ids)
            token_type_ids = paddle.to_tensor(token_type_ids)
            nums = paddle.to_tensor(nums)
            batch_logit = model(input_ids=input_ids, token_type_ids=token_type_ids, nums=nums)
            batch_logits.append(batch_logit.numpy())
    batch_logits = np.concatenate(batch_logits, axis=0)
    return batch_logits

In [129]

loss_array = []

def kfold_train():
    global loss_array
    # Load the training set
    data = pd.read_feather(args.data_set)
    data = data[data['label'].notnull()].reset_index(drop=True)
    kfold = StratifiedKFold(n_splits=args.num_kfolds, shuffle=True, random_state=args.seed)
    kf = kfold.split(data[data.columns[1:-1]], data['label'])
    # K-fold cross-validation
    for kf_i, (train_iloc, valid_iloc) in enumerate(kf):
        print('--------------------------------- Fold ' + str(kf_i) + ' ---------------------------------')
        # Write out and reload this fold's training split
        data.iloc[train_iloc, :].reset_index(drop=True).to_feather(args.train_set)
        train_ds = load_dataset(read_text_pair, data_path=args.train_set, is_test=False, lazy=False)
        # Write out and reload this fold's validation split
        data.iloc[valid_iloc, :].reset_index(drop=True).to_feather(args.valid_set)
        valid_ds = load_dataset(read_text_pair, data_path=args.valid_set, is_test=False, lazy=False)
        batchify_fn = lambda samples, fn=Tuple(
            Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
            Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
            Stack(dtype="float32"),                            # nums
            Stack(dtype="int64")                               # label
        ): [data for data in fn(samples)]
        trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length)
        train_data_loader = create_dataloader(train_ds, mode='train', batch_size=args.train_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func)
        valid_data_loader = create_dataloader(valid_ds, mode='valid', batch_size=args.valid_batch_size, batchify_fn=batchify_fn, trans_fn=trans_func)
        # Instantiate the model from the ERNIE-Gram-zh pretrained weights
        if args.from_pretrained == 'ernie-1.0':
            pretrained_model = ppnlp.transformers.ErnieForSequenceClassification.from_pretrained('ernie-1.0', num_classes=args.num_classes)
        elif args.from_pretrained == 'ernie-gram-zh':
            pretrained_model = ppnlp.transformers.ErnieGramForSequenceClassification.from_pretrained('ernie-gram-zh', num_classes=args.num_classes)
        model = paddle.DataParallel(PaddleModel(pretrained_model))
        metric = paddle.metric.Accuracy()              # evaluation metric
        criterion = paddle.nn.loss.CrossEntropyLoss()  # loss function
        num_training_steps = len(train_data_loader) * args.epochs
        lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion)
        decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
        optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(), weight_decay=args.weight_decay, apply_decay_param_fun=lambda x: x in decay_params)
        global_step = 0
        best_accuracy = 0.00
        gc.collect()
        # Training loop
        tic_train = time.time()
        for epoch in range(1, args.epochs + 1):
            for step, batch in enumerate(train_data_loader, start=0):
                input_ids, token_type_ids, nums, labels = batch
                logits = model(input_ids, token_type_ids, nums)
                correct = metric.compute(logits, labels)
                metric.update(correct)
                acc = metric.accumulate()
                loss = criterion(logits, labels)
                global_step += 1
                if global_step % args.print_steps == 0:
                    print("global step %d, epoch: %d, batch: %d, loss: %.4f, accu: %.4f, speed: %.2f step/s"
                          % (global_step, epoch, step, loss, acc, args.print_steps / (time.time() - tic_train)))
                    tic_train = time.time()
                    loss_array.append(loss)
                    gc.collect()
                loss.backward()
                optimizer.step()
                lr_scheduler.step()
                optimizer.clear_grad()
                # Validate periodically once training accuracy is high enough
                if (acc >= args.eval_accuracy) and (global_step % args.eval_step == 0):
                    accuracy = evaluate(model, criterion, metric, valid_data_loader)
                    if accuracy < args.save_accuracy:
                        continue
                    # Save the best checkpoint
                    if (global_step % args.save_step == 0) or (accuracy > (best_accuracy - 0.001)):
                        save_dir = os.path.join(args.save_dir + str(kf_i), "model_accu_%.6f_%d" % (accuracy, global_step))
                        if not os.path.exists(save_dir):
                            os.makedirs(save_dir)
                        save_param_path = os.path.join(save_dir, 'model_state.pdparams')
                        paddle.save(model.state_dict(), save_param_path)
                        tokenizer.save_pretrained(save_dir)
                        if accuracy > best_accuracy:
                            best_accuracy = accuracy

13 Fine-tune and Train the Model

In [130]

kfold_train() # K-fold cross-validation training
--------------------------------- Fold 0 ---------------------------------
[2022-04-20 00:44:33,740] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-gram-zh/ernie_gram_zh.pdparams
global step 1, epoch: 1, batch: 0, loss: 0.5982, accu: 0.8750, speed: 0.33 step/s
global step 2, epoch: 1, batch: 1, loss: 0.6093, accu: 0.8672, speed: 0.52 step/s
global step 3, epoch: 1, batch: 2, loss: 0.6200, accu: 0.8490, speed: 0.53 step/s
global step 4, epoch: 1, batch: 3, loss: 0.6036, accu: 0.8750, speed: 0.53 step/s
global step 5, epoch: 1, batch: 4, loss: 0.5858, accu: 0.8844, speed: 0.52 step/s
global step 6, epoch: 1, batch: 5, loss: 0.5818, accu: 0.8828, speed: 0.53 step/s
global step 7, epoch: 1, batch: 6, loss: 0.5704, accu: 0.8906, speed: 0.52 step/s
global step 8, epoch: 1, batch: 7, loss: 0.5792, accu: 0.8984, speed: 0.52 step/s
global step 9, epoch: 1, batch: 8, loss: 0.5281, accu: 0.9097, speed: 0.52 step/s
global step 10, epoch: 1, batch: 9, loss: 0.5611, accu: 0.9125, speed: 0.52 step/s
dev_loss: 0.55111, acc: 0.95413, total_num:109
global step 11, epoch: 1, batch: 10, loss: 0.5359, accu: 0.9531, speed: 0.25 step/s
global step 12, epoch: 1, batch: 11, loss: 0.5170, accu: 0.9609, speed: 0.52 step/s
global step 13, epoch: 1, batch: 12, loss: 0.5057, accu: 0.9635, speed: 0.53 step/s
global step 14, epoch: 1, batch: 13, loss: 0.5173, accu: 0.9609, speed: 0.53 step/s
global step 15, epoch: 1, batch: 14, loss: 0.4925, accu: 0.9594, speed: 0.52 step/s
global step 16, epoch: 1, batch: 15, loss: 0.4712, accu: 0.9609, speed: 0.51 step/s
global step 17, epoch: 1, batch: 16, loss: 0.4662, accu: 0.9598, speed: 0.50 step/s
global step 18, epoch: 1, batch: 17, loss: 0.4037, accu: 0.9629, speed: 0.53 step/s
global step 19, epoch: 1, batch: 18, loss: 0.4046, accu: 0.9653, speed: 0.52 step/s
global step 20, epoch: 1, batch: 19, loss: 0.3894, accu: 0.9656, speed: 0.51 step/s
dev_loss: 0.38271, acc: 0.9633, total_num:109
global step 21, epoch: 1, batch: 20, loss: 0.3694, accu: 0.9844, speed: 0.25 step/s
global step 22, epoch: 1, batch: 21, loss: 0.3476, accu: 0.9766, speed: 0.54 step/s
global step 23, epoch: 1, batch: 22, loss: 0.3462, accu: 0.9740, speed: 0.54 step/s
global step 24, epoch: 1, batch: 23, loss: 0.3707, accu: 0.9648, speed: 0.53 step/s
global step 25, epoch: 1, batch: 24, loss: 0.3025, accu: 0.9656, speed: 0.53 step/s
global step 26, epoch: 1, batch: 25, loss: 0.3322, accu: 0.9609, speed: 0.53 step/s
global step 27, epoch: 1, batch: 26, loss: 0.2915, accu: 0.9598, speed: 0.53 step/s
global step 28, epoch: 1, batch: 27, loss: 0.2413, accu: 0.9609, speed: 0.54 step/s
global step 29, epoch: 1, batch: 28, loss: 0.3191, accu: 0.9566, speed: 0.50 step/s
global step 30, epoch: 1, batch: 29, loss: 0.2274, accu: 0.9578, speed: 0.48 step/s
dev_loss: 0.22401, acc: 0.9633, total_num:109
global step 31, epoch: 1, batch: 30, loss: 0.2463, accu: 0.9531, speed: 0.24 step/s
global step 32, epoch: 1, batch: 31, loss: 0.1996, accu: 0.9609, speed: 0.49 step/s
global step 33, epoch: 1, batch: 32, loss: 0.1600, accu: 0.9688, speed: 0.48 step/s
global step 34, epoch: 1, batch: 33, loss: 0.2456, accu: 0.9609, speed: 0.49 step/s
global step 35, epoch: 1, batch: 34, loss: 0.0998, accu: 0.9688, speed: 0.49 step/s
global step 36, epoch: 1, batch: 35, loss: 0.1676, accu: 0.9688, speed: 0.51 step/s
global step 37, epoch: 1, batch: 36, loss: 0.1554, accu: 0.9688, speed: 0.50 step/s
global step 38, epoch: 1, batch: 37, loss: 0.0687, accu: 0.9727, speed: 0.52 step/s
global step 39, epoch: 1, batch: 38, loss: 0.1512, accu: 0.9722, speed: 0.50 step/s
global step 40, epoch: 1, batch: 39, loss: 0.1448, accu: 0.9719, speed: 0.48 step/s
dev_loss: 0.16079, acc: 0.9633, total_num:109
global step 41, epoch: 1, batch: 40, loss: 0.0468, accu: 1.0000, speed: 0.24 step/s
global step 42, epoch: 1, batch: 41, loss: 0.0925, accu: 0.9922, speed: 0.48 step/s
global step 43, epoch: 1, batch: 42, loss: 0.3018, accu: 0.9688, speed: 0.49 step/s
global step 44, epoch: 1, batch: 43, loss: 0.0311, accu: 0.9766, speed: 0.50 step/s
global step 45, epoch: 1, batch: 44, loss: 0.0886, accu: 0.9781, speed: 0.50 step/s

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/tmp/ipykernel_17650/1626292781.py in <module>
----> 1 kfold_train()
  (long framework stack trace through the paddle/paddlenlp forward pass omitted;
   training was stopped manually at this point)
KeyboardInterrupt:

In [101]

# print('loss history:', loss_array)
# Save the test set
test_data = pd.read_feather(args.data_set)
test_data = test_data[test_data['label'].isnull()].drop_duplicates('ent_id').reset_index(drop=True)
test_data.to_feather(args.test_set)

14 Inference on the Test Set

In [82]

batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="float32"),                            # nums
): [data for data in fn(samples)]

# Create the test-set PaddlePaddle DataLoader
test_ds = load_dataset(read_text_pair, data_path=args.test_set, is_test=True, lazy=False)
tests_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True)
test_data_loader = create_dataloader(test_ds, mode='predict', batch_size=args.valid_batch_size, batchify_fn=batchify_fn, trans_fn=tests_func)

# Warm-start from the best fold-0 checkpoint
kf_i = 0
params_path = './checkpoint0/' + sorted(os.listdir('checkpoint0/'))[-1] + '/model_state.pdparams'
state_dict = paddle.load(params_path)

# Rebuild the model
pretrained_model = ppnlp.transformers.ErnieGramForSequenceClassification.from_pretrained('ernie-gram-zh', num_classes=args.num_classes)
model = paddle.DataParallel(PaddleModel(pretrained_model))
model.set_dict(state_dict)
print("Loaded parameters from %s" % params_path)

# Predict
y_probs = predict(model, test_data_loader)
[2022-04-20 00:37:04,298] [    INFO] - Already cached /home/aistudio/.paddlenlp/models/ernie-gram-zh/ernie_gram_zh.pdparams
Loaded parameters from ./checkpoint0/model_accu_0.994422_40/model_state.pdparams
100%|██████████| 9/9 [01:23<00:00,  8.39s/it]

In [131]

# Merge the predictions and export the test-set submission
test_data = pd.read_feather(args.test_set)
test_data[args.label] = y_probs[:, 1]
test_data[['ent_id', args.label]].to_csv('answer.csv', header=True, index=False, sep='|')
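Note that predict returns raw logits, so y_probs[:, 1] is not yet a probability in [0, 1], and ranking by the class-1 logit alone is not equivalent to ranking by the softmax probability (which depends on both logits). A small sketch of normalizing before export, as an alternative to the cell above:

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

test_data[args.label] = softmax(y_probs)[:, 1]  # now within [0, 1], as the submission format requires
test_data[['ent_id', args.label]].to_csv('answer.csv', header=True, index=False, sep='|')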

After the test script finishes running, it produces an answer.csv file; submit this file on the evaluation site.

steps   acc    auc
10      0.94   0.7

With hyperparameter tuning and more training steps, the model can reach higher accuracy. ERNIE-Gram-zh currently only encodes the 14 most recent concatenated news items per issuer, so widening this news window is one avenue for improvement (a sketch follows). Training monitors Accuracy while the leaderboard scores the submitted probabilities with AUC, so effective post-processing of the scores can bring the online AUC in line with local results. The class balance matters a great deal for deep learning, and defaults are very scarce here, so sensible resampling of the two classes can yield a clear gain. Feel free to try the project, tune it further, and explore new approaches to default-risk modeling.
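For the news-window suggestion, the relevant knob is the [-14:] slice inside extract_features. A sketch with a hypothetical window parameter (bear in mind that max_seq_length still caps what the encoder actually sees):

# Hypothetical variant of the news-concatenation step with a tunable window
def build_context(ent_news: pd.DataFrame, window: int = 14) -> pd.Series:
    """Concatenate each issuer's `window` most recent "source:title" strings, newest first."""
    ent_news = ent_news.sort_values('publishdate')
    ctx = ent_news['newssource'] + ':' + ent_news['newstitle']
    return ctx.groupby(ent_news['ent_id']).apply(lambda x: '。'.join(x.tolist()[-window:][::-1]))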
