Table of Contents

· Introduction · About the Dataset · Import Dataset into the Database · Connect Python to MySQL Database · Feature Extraction · Feature Transformation · Modeling · Conclusion and Future Directions · About Me

Note: If you are interested in details beyond this post, the Berka Dataset, all the code, and the notebooks can be found on my GitHub page.

Introduction

For banks, predicting how likely a client is to default on a loan from only a handful of information is always an interesting and challenging problem. Today, data science teams in banks build predictive models using machine learning. The datasets they use are most likely proprietary and are usually collected internally through daily business. In other words, there are not many real-world datasets available if we want to work on such financial projects. Fortunately, there is an exception: the Berka Dataset.

About the Dataset

The Berka Dataset, or the PKDD’99 Financial Dataset, is a collection of real anonymized financial information from a Czech bank, used for the PKDD’99 Discovery Challenge. The dataset can be accessed from my GitHub page.

The dataset consists of 8 raw files, one for each of its 8 tables:

  • account (4500 objects in the file ACCOUNT.ASC): each record describes static characteristics of an account.
  • client (5369 objects in the file CLIENT.ASC): each record describes characteristics of a client.
  • disposition (5369 objects in the file DISP.ASC): each record relates a client to an account, i.e. this relation describes the rights of clients to operate accounts.
  • permanent order (6471 objects in the file ORDER.ASC): each record describes characteristics of a payment order.
  • transaction (1056320 objects in the file TRANS.ASC): each record describes one transaction on an account.
  • loan (682 objects in the file LOAN.ASC): each record describes a loan granted for a given account.
  • credit card (892 objects in the file CARD.ASC): each record describes a credit card issued to an account.
  • demographic data (77 objects in the file DISTRICT.ASC): each record describes demographic characteristics of a district.

Table relationships (from the Relational Dataset Repository)

  • Each account has both static characteristics (e.g. date of creation, address of the branch) given in the relation “account” and dynamic characteristics (e.g. payments debited or credited, balances) given in the relations “permanent order” and “transaction”.
  • The relation “client” describes the characteristics of persons who can operate the accounts.
  • One client can have several accounts, and several clients can operate a single account; clients and accounts are related through the relation “disposition”.
  • The relations “loan” and “credit card” describe some services which the bank offers to its clients.
  • More than one credit card can be issued to an account.
  • At most one loan can be granted for an account.
  • The relation “demographic data” gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.

Import Dataset into the Database

This is an optional step: the raw files contain only delimiter-separated values, so they can be imported directly into data frames using pandas.
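For readers who take the direct route, here is a minimal sketch of reading one raw file with pandas. It assumes the .asc files use ';' as the delimiter, carry a header row, and store dates as YYMMDD; the two sample rows are made up for illustration and stand in for the real ACCOUNT.ASC.

```python
import io

import pandas as pd

# Hypothetical excerpt in the layout assumed for ACCOUNT.ASC:
# a header row, ';' as the delimiter, and dates stored as YYMMDD.
sample = io.StringIO(
    '"account_id";"district_id";"frequency";"date"\n'
    '1;18;"POPLATEK MESICNE";950324\n'
    '2;1;"POPLATEK MESICNE";930226\n'
)

# For the real file, pass its path instead of `sample`.
account = pd.read_csv(sample, sep=';')

# Parse the YYMMDD integers into proper timestamps.
account['date'] = pd.to_datetime(account['date'].astype(str), format='%y%m%d')

print(account.dtypes)
```

If the real files turn out to use a different delimiter or date layout, only the `sep` and `format` arguments need to change.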


Here I wrote SQL queries to import the raw data files into a MySQL database, which allows simple and fast data manipulation (e.g. select, join, and aggregation functions).


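/* Create Bank Database */
CREATE DATABASE IF NOT EXISTS bank;
USE bank;

/* Create Account Table */
CREATE TABLE IF NOT EXISTS Account(
    account_id INT,
    district_id INT,
    frequency VARCHAR(20),
    `date` DATE
);

/* Load Data into the Account Table.
   The FIELDS and IGNORE clauses assume a ';'-separated file with a header row. */
LOAD DATA LOCAL INFILE '~/Documents/DataScience/ds_projects/loan_default_prediction/data/account.asc'
INTO TABLE Account
FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(account_id, district_id, frequency, @c4)
SET `date` = STR_TO_DATE(@c4, '%y%m%d');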

The code snippet above shows how to create the bank database and import the Account table. It includes three steps:

  • Create and use the database
  • Create a table
  • Load data into the table

The first two steps should not cause any trouble if you are familiar with MySQL and database systems. For the “Load data” step, you need to make sure that you have enabled LOCAL_INFILE in MySQL. Detailed instructions can be found in this thread.

By repeating steps 2 and 3 for each table, all the data can be imported into the database.


Connect Python to MySQL Database

Again, if you choose to import the data directly into Python using Pandas, this step is optional. But if you have created the database and become familiar with the dataset through some SQL data manipulation, the next step is to transfer the prepared tables into Python and perform the data analysis there. One way is to use the MySQL Connector for Python to execute SQL queries in Python and build Pandas DataFrames from the results. Here is my approach:

import mysql.connector
import pandas as pd


class MysqlIO:
    """Connect to MySQL server with python and execute SQL commands."""

    def __init__(self, database='test'):
        try:
            # Change the host, user and password as needed
            connection = mysql.connector.connect(host='localhost',
                                                 database=database,
                                                 user='Zhou',
                                                 password='jojojo',
                                                 use_pure=True)
            if connection.is_connected():
                db_info = connection.get_server_info()
                print("Connected to MySQL Server version", db_info)
                print("You're connected to database:", database)
                self.connection = connection
        except Exception as e:
            print("Error while connecting to MySQL", e)

    def execute(self, query, header=False):
        """Execute SQL commands and return retrieved queries."""
        cursor = self.connection.cursor(buffered=True)
        cursor.execute(query)
        try:
            record = cursor.fetchall()
            if header:
                header = [i[0] for i in cursor.description]
                return {'header': header, 'record': record}
            else:
                return record
        except Exception:
            # Statements that return no rows have nothing to fetch
            return None

    def to_df(self, query):
        """Return the retrieved SQL queries into pandas dataframe."""
        res = self.execute(query, header=True)
        df = pd.DataFrame(res['record'])
        df.columns = res['header']
        return df

After modifying the database info (host, database, user, and password), we can create a connection instance, execute a query, and convert the result into a Pandas DataFrame:

# Create a connection instance
db = MysqlIO()

# Call .to_df method to execute the query and make dataframe from the results.
query = """
    select *
    from Loan join Account using(account_id);
"""
df = db.to_df(query)

Even though this step is optional, it offers advantages in speed and convenience and is good for experimentation, compared to importing the files directly into Pandas DataFrames. Unlike other ML projects where we are given only a single csv file (one table), this dataset is quite complicated, and a lot of useful information is hidden in the connections between tables. That is another reason why I wanted to introduce loading the data into the database first.

Now the data is in the MySQL server and we have connected it to Python, so we can smoothly access the data in data frames. The next steps are to extract features from the tables, transform the variables, load them into one array, and train a machine learning model.

Feature Extraction

Since predicting loan default is a binary classification problem, we first need to know how many instances there are in each class. The status variable in the Loan table has 4 distinct values: A, B, C, and D.

  • A: Contract finished, no problems.
  • B: Contract finished, loan not paid.
  • C: Running contract, okay so far.
  • D: Running contract, client in debt.

According to the definitions in the dataset description, we can group them into two classes: good (A or C) and bad (B or D). Of the 682 loans, 606 fall into the “good” class and 76 into the “bad” class.
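The grouping of the four status values into two classes can be sketched in pandas as follows. The five rows and the `default` column name are hypothetical; only the A/B/C/D mapping comes from the dataset description.

```python
import pandas as pd

# Hypothetical Loan rows; only the `status` column matters here.
loan = pd.DataFrame({
    'loan_id': [1, 2, 3, 4, 5],
    'status': ['A', 'B', 'C', 'D', 'A'],
})

# 1 = bad loan (B or D), 0 = good loan (A or C).
loan['default'] = loan['status'].isin(['B', 'D']).astype(int)

# Count the instances in each class.
class_counts = loan['default'].value_counts().sort_index()
print(class_counts)
```

Run against the full Loan table, the same count should reproduce the 606 good and 76 bad loans quoted above.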

With the two classes defined, we can look into the variables and plot histograms to see whether they follow different distributions in each class.

The loan amount shown below is a good example of the difference between the two classes. Even though both distributions are right-skewed, they still show an interesting pattern: loans with higher amounts tend to default.

Histogram of Loan Amount (Good vs Bad)

Features don’t have to be the existing variables provided in the tables. We can be creative and build our own. For example, joining the Loan table and the Account table gives us both the date of loan issuance and the date of account creation. We may wonder whether the time gap between creating the account and applying for the loan plays a role, so a simple subtraction gives us a new variable: the number of days between the two events on the same account. The histograms are shown below, where a clear trend can be seen: people who apply for a loan right after creating the bank account tend to default.

Histogram of Days between Account Creation and Loan Issuance
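As a rough sketch of the subtraction described above, assuming the joined table exposes the two dates under the hypothetical column names `loan_date` and `account_date`:

```python
import pandas as pd

# Hypothetical result of joining Loan and Account on account_id;
# the column names `loan_date` and `account_date` are assumptions.
df = pd.DataFrame({
    'account_id': [1, 2],
    'loan_date': pd.to_datetime(['1994-01-05', '1996-04-29']),
    'account_date': pd.to_datetime(['1993-07-05', '1995-03-24']),
})

# Subtracting two datetime columns yields timedeltas; keep the whole days.
df['days_between'] = (df['loan_date'] - df['account_date']).dt.days

print(df[['account_id', 'days_between']])
```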

By repeatedly experimenting with existing and newly created features, I finally prepared a table that consists of 18 feature columns and 1 label column. The selected features are:

  • amount: Loan amount
  • duration: Loan duration
  • payments: Loan payments
  • days_between: Days between account creation and loan issuance
  • frequency: Frequency of issuance of statements
  • average_order_amount: Average amount of the permanent orders made by the account
  • average_trans_amount: Average amount of the transactions made by the account
  • average_trans_balance: Average balance after transactions made by the account
  • n_trans: Number of transactions on the account
  • card_type: Type of credit card associated with the account
  • n_inhabitants: Number of inhabitants in the district of the account
  • average_salary: Average salary in the district of the account
  • average_unemployment: Average unemployment rate in the district of the account
  • entrepreneur_rate: Number of entrepreneurs per 1000 inhabitants in the district of the account
  • average_crime_rate: Average crime rate in the district of the account
  • owner_gender: Account owner’s gender
  • owner_age: Account owner’s age
  • same_district: A boolean that indicates whether the owner has the same district information as the account

Feature Transformation

After the features are extracted and put into one big table, the data must be transformed so that it can be fed into the machine learning model in a consistent way. In our case, we have two types of features. One is numerical, such as amount, duration, and n_trans. The other is categorical, such as card_type and owner_gender.

Our dataset is pretty clean and there are no missing values, so we can skip imputation and jump directly into scaling the numerical values. There are several scalers available in scikit-learn, such as StandardScaler, MinMaxScaler, and RobustScaler. Here, I used MinMaxScaler to rescale the numerical values to between 0 and 1. For categorical variables, the typical strategy is to use OneHotEncoder to transform the features into binary 0 and 1 values.
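To make the two transformations concrete, here is a hand-rolled illustration on made-up values. It mirrors the arithmetic behind min-max scaling and one-hot encoding but is not the scikit-learn implementation; the amounts and card types are purely illustrative.

```python
# Hypothetical loan amounts, rescaled to [0, 1] with (x - min) / (max - min).
amounts = [50_000, 100_000, 250_000]
lo, hi = min(amounts), max(amounts)
scaled = [(a - lo) / (hi - lo) for a in amounts]

# Hypothetical card types, turned into one 0/1 column per category.
card_types = ['classic', 'gold', 'junior', 'classic']
categories = sorted(set(card_types))
one_hot = [[int(c == cat) for cat in categories] for c in card_types]

print(scaled)      # smallest amount maps to 0.0, largest to 1.0
print(categories)  # column order of the one-hot matrix
print(one_hot)
```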

The code below is a representation of the feature transformation steps:


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

# Define the numerical and categorical columns
num_cols = df_ml.columns[:-5]
cat_cols = df_ml.columns[-5:]

# Build the column transformer and transform the dataframe
col_trans = ColumnTransformer([
    ('num', MinMaxScaler(), num_cols),
    ('cat', OneHotEncoder(drop='if_binary'), cat_cols)
])
df_transformed = col_trans.fit_transform(df_ml)

Modeling

The first step in training a machine learning model is to split the data into train and test sets. This is tricky for our dataset because it is not balanced: there are about 8 times as many good loans as bad loans (606 versus 76). A stratified split is a good option here because it preserves the class ratio in both the train and test sets.

from sklearn.model_selection import train_test_split

# Stratified split of the train and test set with train-test ratio of 7:3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=10)

There are many good machine learning models for binary classification tasks. In this project I used the Random Forest model for its decent performance and quick-prototyping capability. An initial RandomForestClassifier model is fit, and three distinct measures are used to represent the model performance: Accuracy, F1 Score, and ROC AUC.

Note that Accuracy alone is not sufficient for this unbalanced dataset. If we fine-tune the model purely on accuracy, it will lean toward predicting every loan as a “good loan”. The F1 score is the harmonic mean of precision and recall, and ROC AUC is the area under the ROC curve. These two are better metrics for evaluating model performance on unbalanced data.
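A small worked example shows why accuracy is misleading here. It uses the class sizes of the whole Loan table (606 good, 76 bad) and a hypothetical "model" that labels every loan as good; the metrics are computed by hand rather than with scikit-learn.

```python
# Class sizes taken from the Loan table: 606 good loans, 76 bad loans.
# Label 1 marks a bad loan (the positive class), 0 a good loan.
y_true = [0] * 606 + [1] * 76

# A hypothetical "model" that labels every loan as good.
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(f'Accuracy: {accuracy:.3f}, F1: {f1:.3f}')  # Accuracy: 0.889, F1: 0.000
```

The classifier never finds a single bad loan, yet its accuracy is close to 0.89, which is why the F1 score on the bad class is the more informative number.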

The code below shows how to apply 5-fold stratified cross-validation on the training set, and calculate the average of each score:


from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# See the initial model performance
clf = RandomForestClassifier(random_state=10)
print('Acc:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='accuracy').mean())
print('F1:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='f1').mean())
print('ROC AUC:', cross_val_score(clf, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='roc_auc').mean())

Acc: 0.8973
F1: 0.1620
ROC AUC: 0.7253

The accuracy is clearly high, almost 0.9, but the F1 score is very low because of low recall. There is room to fine-tune the model for better performance, and one method is Grid Search. Given candidate values for hyperparameters of the RandomForestClassifier such as n_estimators, max_depth, min_samples_split, and min_samples_leaf, it iterates through the combinations and outputs the one with the best performance on the score we are interested in. A code snippet is shown below:

from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Assign different values for the hyperparameter
params = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

# Grid search with 5-fold cross-validation on F1-score
clf = GridSearchCV(RandomForestClassifier(random_state=10), param_grid=params,
                   cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=10),
                   scoring='f1')
clf.fit(X_train, y_train)
print(clf.best_params_)

After refitting the model with the best parameters, we can look at the model performance on the whole train set and on the test set:


Performance on Train Set:
Acc: 0.9706
F1: 0.8478
ROC AUC: 0.9952

Performance on Test Set:
Acc: 0.8927
F1: 0.2667
ROC AUC: 0.6957

The performance on the train set is great: more than 2/3 of the bad loans and all of the good loans are correctly classified, and all three performance measures are above 0.84. On the test set, however, the result is not satisfying: most of the bad loans are labeled as “good” and the F1 score is only 0.267. This gap between train and test performance is evidence of overfitting, so more effort should be put into this iterative process to get better model performance.

With the model built, we can now rank the features by their importance. The top 5 features with the most predictive power are:

  • Average Transaction Balance
  • Average Transaction Amount
  • Loan Amount
  • Average Salary
  • Days between account creation and loan application

There is not too much surprise here, since for several of these features, such as the loan amount and the days between account creation and loan application, we have already seen unusual behavior that could be related to loan default.
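One common way to produce such a ranking is the `feature_importances_` attribute of a fitted RandomForestClassifier. The sketch below runs on synthetic data with hypothetical feature names, so it only illustrates the mechanics and does not reproduce the ranking above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 4 features with hypothetical names.
rng = np.random.default_rng(10)
feature_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
X = rng.normal(size=(200, 4))
# The label depends only on the first feature; the others are pure noise.
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=10).fit(X, y)

# feature_importances_ sums to 1; sort from most to least important.
ranked = sorted(zip(feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name}: {score:.3f}')
```

With the real data, `feature_names` would be the column names of the transformed feature table, in the same order as the columns passed to `fit`.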


Conclusion and Future Directions

In this post, I introduced the whole pipeline of an end-to-end machine learning model for a banking application, loan default prediction, using the real-world Berka banking dataset. I described the Berka dataset and the relationships between its tables. I demonstrated the steps and code for importing the dataset into a MySQL database, connecting to it from Python, and converting the processed records into Pandas DataFrames. Features were extracted and transformed into an array, ready to be fed into machine learning models. As the last step, I fit a Random Forest model on the data, evaluated its performance, and generated the list of the top 5 features that play a role in predicting loan default.

This machine learning pipeline only scratches the surface of one application of the Berka dataset. It could go deeper, since more useful information is hidden in the intricate relationships among the tables; it could also go wider, since it can be extended to other applications such as credit cards and clients’ transaction behavior. Focusing only on loan default prediction, there are three directions to explore further in the future:

  1. Extract more features: Due to the time limit, I could not conduct a thorough study or gain a deep understanding of the dataset. Many features in the dataset are still unused, and much of the information has not been fully interpreted with banking industry knowledge.

  2. Try other models: Only the Random Forest model is used, but there are many other good ones, such as Logistic Regression, XGBoost, SVM, or even neural networks. The models can also be improved further by finer hyperparameter tuning or by ensemble methods such as bagging, boosting, and stacking.

  3. Deal with the unbalanced data: The defaulted loans are only about 11% of the total loans (76 of 682), so during training the model will favor predicting negatives over positives. We have already used the F1 score and ROC AUC instead of just accuracy, but the performance is still not as good as it could be. To address this, other methods such as collecting more data or resampling can be used in the future.
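As one illustration of the resampling idea in point 3, the sketch below randomly oversamples the minority class with pandas until both classes have the same size. The tiny frame and its column names are hypothetical, and this is only one of several possible resampling schemes, not something used in this project.

```python
import pandas as pd

# Hypothetical training frame: 8 good loans (0) and 2 bad loans (1).
train = pd.DataFrame({
    'amount': range(10),
    'default': [0] * 8 + [1] * 2,
})

majority = train[train['default'] == 0]
minority = train[train['default'] == 1]

# Draw minority rows with replacement until both classes have equal size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=10)

# Combine and shuffle so the classes are interleaved.
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=10)

print(balanced['default'].value_counts())
```

Resampling of this kind should be applied to the training set only, so that the test set keeps the original class ratio. Another option worth trying is the `class_weight='balanced'` argument of RandomForestClassifier, which reweights the classes without duplicating rows.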

About Me

I am a data scientist with an engineering background. I embrace technology and learn new skills every day. Currently, I am seeking career opportunities in Toronto. You are welcome to reach me through my Medium blog, LinkedIn, or GitHub.

Translated from: https://towardsdatascience.com/loan-default-prediction-an-end-to-end-ml-project-with-real-bank-data-part-1-1405f7aecb9e


