Predicting Interest Rate with Classification Models - Part 1

A couple of years ago, I started working for a quant company called M2X Investments, and my first challenge was to create a model that could predict the interest rate movement.

After a couple of days spent solely cleaning and preparing the data, I took the following approach: build a simple model, then reverse engineer it to make it better (optimizing and selecting features). If the results still weren't good, I would change the model and repeat the process, and so forth.

Therefore, the objective of this series of posts is to apply different classification models to predict upward movements of the interest rate, provide a brief intuition for each model (there are plenty of posts that cover the mathematics and concepts in depth), and compare their results. By focusing only on the upward movements, we simplify the problem.

Note: from here on, the data set I will use is fictitious and for educational purposes only.

The data set used in this post is from Quandl, specifically from Commodity Indices, Merrill Lynch, and the US Federal Reserve. The idea was to use agriculture, metals, and energy indices, along with corporate bond yields, to classify the upward movements of the Federal funds effective rate.

A brief introduction to Logistic Regression

Logistic Regression is a binary classification method. It is a type of Generalized Linear Model that predicts the probability of occurrence of a binary or categorical variable using a logit link function. It relies on the sigmoid function, which maps any input to a value between 0 and 1.

[Image by Author]
[Image by Author]

When building the regression model with the sigmoid function, we end up with an equation, as shown above, that gives us the probability of occurrence (p) of the dependent variable.

[Image by Author]
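
To make the mapping concrete, here is a minimal sketch of the sigmoid (my own illustration, not code from the original post), with purely hypothetical coefficients:

import numpy as np

def sigmoid(z):
    # squeezes any real number into the (0, 1) interval
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical coefficients, for illustration only
beta0, beta1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 2.0])
p = sigmoid(beta0 + beta1 * x)  # P(y = 1 | x) under the model
print(p)  # probabilities strictly between 0 and 1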

The model is estimated using Maximum Likelihood Estimation (MLE), and there are basically three types of Logistic Regression models: Binary, Multinomial, and Ordinal. In this post, we are going to work with the Binary model.
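
For a bit of intuition on MLE (a standard textbook formula, added here for reference): it picks the coefficients β that maximize the log-likelihood of the observed 0/1 labels,

\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p(x_i) + (1 - y_i) \log\left(1 - p(x_i)\right) \right]

where p(x_i) is the sigmoid of the linear predictor for observation i.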

The code

First, we import the libraries we are going to use and include Quandl’s API key to download the variables we need.

import numpy as np
import pandas as pd
import quandl as qdl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
from imblearn.over_sampling import ADASYN
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# API key from Quandl (free but not necessary)
qdl.ApiConfig.api_key = "JsDf-rbjTsUCP8TzomaW"

# get data from Quandl
data = pd.DataFrame()
meta_data = ['RICIA','RICIM','RICIE']
for code in meta_data:
    df = qdl.get('RICI/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)
meta_data = ['EMHYY','AAAEY','USEY']
for code in meta_data:
    df = qdl.get('ML/'+code, start_date="2005-01-03", end_date="2020-07-01")
    df.columns = [code]
    data = pd.concat([data, df], axis=1)

An essential part of the process is dealing with NaN values. The methods we use to fill or drop them will depend on the problem at hand. Unfortunately, that is not the purpose of this post, so I am going to apply a basic solution and replace them with the average value of each variable. Sometimes this is a naive solution, but for our purposes it is just fine.

# dealing with possible empty values (not much attention to this part, but it is very important)
data.fillna(data.mean(), inplace=True)
print(data.head())
print("\nData shape:\n", data.shape)
[Image by Author]
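
A quick caveat (my note, not the author's): with time series, filling NaNs with the column mean uses information from the future. A forward fill is a common causal alternative; a minimal sketch, not used in the rest of the post:

# alternative NaN handling (not used below): a forward fill keeps the fill causal
data_ffilled = data.ffill()          # propagate the last observed value forward
data_ffilled = data_ffilled.bfill() # backfill only any leading NaNs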

Let's review our variables in more detail. RICIA is the Euronext Rogers International Agriculture Commodity Index, RICIM is the Euronext Rogers International Metals Commodity Index, RICIE is the Euronext Rogers International Energy Commodity Index, EMHYY is the Emerging Markets High Yield Corporate Bond Index Yield, AAAEY is the US AAA-rated Bond Index (yield), and, finally, USEY is the US Corporate Bond Index Yield.

Back to the code! Now we are going to look at our data and see if we can identify characteristics that will help us improve our future model.

# histograms
data.hist()
plt.title('Histograms')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
[Image by Author]

The first thing we notice is that the variables differ a lot in scale from each other. We can deal with that using min-max scaling.
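
For reference, min-max scaling maps each feature x to the [0, 1] interval via the standard formula

x' = \frac{x - \min(x)}{\max(x) - \min(x)}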

# scaling values to make them vary between 0 and 1
scaler = MinMaxScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)

I don't want to get overextended in this matter, so let's imagine that this is all we were able to figure out. Next, we move on to our dependent variable, RIFSPFF_N_D (more commonly known as the Federal funds effective rate).

# pulling dependent variable from Quandl (par yield curve)
par_yield = qdl.get('FED/RIFSPFF_N_D', start_date="2005-01-03", end_date="2020-07-01")
par_yield.columns = ['FED/RIFSPFF_N_D']

# create an empty df with the same index as the variables and fill it with our dependent var values
# (probably unnecessary with this data set... =))
par_data = pd.DataFrame(index=data_scaled.index, columns=['FED/RIFSPFF_N_D'])
par_data.update(par_yield['FED/RIFSPFF_N_D'])

# get the variation and binarize it
par_data = par_data.pct_change()
par_data.fillna(0, inplace=True)
par_data = par_data.apply(lambda x: [0 if y <= 0 else 1 for y in x])
print("Number of 0s and 1s:\n", par_data.value_counts())

# plot the number of 0s and 1s
sns.countplot(x='FED/RIFSPFF_N_D', data=par_data, palette='Blues')
plt.title('0s and 1s')
plt.savefig('0s and 1s')

We downloaded our dependent variable, took its % variation, and transformed it into 0s (when ≤0) and 1s (when >0). Here is what we got: 3143 zeros and 909 ones.

It is important to note that by binarizing the data this way, we concern ourselves with the upward movements only, labeling downward movements and no movement the same (both as 0).

[Image by Author]

Well, that's not a good ratio of 0s and 1s, right? To deal with this issue we can use oversampling methods. We are going to use the ADASYN method. The fundamental difference between ADASYN and SMOTE is that the former uses a density distribution to decide how many synthetic samples to generate for each minority point, while the latter uses uniform weights for the minority points. Don't worry, now is the moment to have faith and believe that this is a suitable method!

# over-sampling with the ADASYN method
sampler = ADASYN(random_state=13)
# note: in older imblearn versions this method was called fit_sample
X_os, y_os = sampler.fit_resample(data_scaled, par_data.values.ravel())
columns = data_scaled.columns
data_scaled = pd.DataFrame(data=X_os, columns=columns)
par_data = pd.DataFrame(data=y_os, columns=['FED/RIFSPFF_N_D'])
print("\nProportion of 0s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==0])/len(data_scaled))
print("\nProportion of 1s in oversampled data: ", len(par_data[par_data['FED/RIFSPFF_N_D']==1])/len(data_scaled))
[Image by Author]

Now that we have our data well balanced, let's split it into train and test sets and fit a logit regression to analyze the p-values. The purpose of this step is to filter the independent variables.

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data_scaled, par_data, test_size=0.2, random_state=13)

# just to make it easier to write y
y = y_train['FED/RIFSPFF_N_D']

# logit model to analyze p-values and filter the remaining variables
logit_model = sm.Logit(y, X_train)
result = logit_model.fit()
print('\nComplete logit regression:\n', result.summary2())
[Image by Author]
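
As an aside, if some variables had shown high p-values, one way to drop them would be recursive feature elimination with the RFE class imported earlier. A hypothetical sketch (the number of features to keep is arbitrary, and this step turns out not to be needed here):

# hypothetical feature filtering with RFE (not required here, since all p-values are low)
rfe = RFE(LogisticRegression(), n_features_to_select=4)
rfe.fit(X_train, y)
print(dict(zip(X_train.columns, rfe.support_)))  # True means the variable is kept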

OK, all variables seem to show a p-value < 0.05. So we are going to stick with them and fire up our model!

# logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y)
y_pred = logreg.predict(X_test)
print('\nAccuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

# confusion matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print('\nConfusion matrix:\n', confusion_matrix)
print('\nClassification report:\n', metrics.classification_report(y_test, y_pred))

# plot confusion matrix
# (plot_confusion_matrix was removed in scikit-learn 1.2; on newer versions use
# metrics.ConfusionMatrixDisplay.from_estimator(logreg, X_test, y_test, cmap=plt.cm.Blues))
disp = metrics.plot_confusion_matrix(logreg, X_test, y_test, cmap=plt.cm.Blues)
disp.ax_.set_title('Confusion Matrix')
plt.savefig('Confusion Matrix')
[Image by Author]
[Image by Author]

So there it is! The attempt to solve the problem using Logistic Regression gave us an accuracy of 66%, predicting 810 labels correctly. We know that accuracy by itself is not that informative, so let's look at the classification report and the ROC curve.

# ROC curve (beautiful code from Susan Li)
logit_roc_auc = metrics.roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = metrics.roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve - Logistic Regression')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
[Image by Author]

The classification report gives us Precision, Recall, and F1-score. Precision tells us how accurate our model's positive predictions are: out of the observations predicted positive, how many are actually positive. Recall tells us how many of the true positives our model captures by classifying them as positive. The F1-score takes both precision and recall into consideration and is useful when the data is unbalanced. Our metrics seem well balanced, despite their low values.
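
For reference, in terms of the confusion matrix counts (TP = true positives, FP = false positives, FN = false negatives), these are the standard definitions:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}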

[Image by Author]

The objective of analyzing the ROC curve is to see how far the model is from the red line, which represents a purely random classifier. The closer the curve gets to the top-left corner, the better; in other words, the bigger the area under the curve, the better. We got an area of 0.65, so it is clear that we still have a long way to go... In the next post (Part 2), we are going to tackle the problem by applying the Naive Bayes method.

This article was written in conjunction with Guilherme Bezerra Pujades Magalhães.

References and great links

[1] J. Starmer, StatQuest with Josh Starmer on Logistic Regression, YouTube.

[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research, Volume 16, pp. 321-357.

[3] Haibo He, Yang Bai, E. A. Garcia, and Shutao Li, ADASYN: Adaptive synthetic sampling approach for imbalanced learning (2008), IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 2008, pp. 1322-1328.

Translated from: https://towardsdatascience.com/predicting-interest-rate-with-classification-models-part-1-c7d6f82b739a
