Multicollinearity in Multiple Linear Regression

Linear Regression is one of the simplest and most widely used algorithms for supervised machine learning problems in which the output is a numerical (quantitative) variable and the input is one or more independent variables.

The math behind it is easy to understand, and that's what makes Linear Regression one of my favorite algorithms to work with. But this simplicity comes at a price.

When we decide to fit a Linear Regression model, we have to make sure that certain conditions are satisfied, or else our model will perform poorly or give us incorrect interpretations. So what are the conditions that have to be met?

1. Linearity: X and the mean of Y have a linear relationship.

2. Homoscedasticity: the variance of the error terms is the same for all values of X.

3. No collinearity: the independent variables are not highly correlated with each other.

4. Normality: Y is normally distributed for any value of X.

If the above four conditions are satisfied, we can expect our Linear Regression model to perform well.


So how do we ensure the above conditions are met? If I went into depth on all of them, this would become a very long article. So here I will cover only the third condition, no collinearity: I will explain what multicollinearity is, why it is a problem in the first place, and what can be done to overcome it.

When we have a supervised machine learning regression problem, we have a set of independent variables and an output variable, which are used to train our model and to make predictions and interpretations.

In a multivariate Linear Regression problem, we make predictions based on the trained model and use the coefficients to interpret the model. For example:

Y = B0 + B1*X1 + B2*X2

The above equation states that a unit increase in X1 results in a B1 increase in the value of Y when X2 is held fixed, and a unit increase in X2 results in a B2 increase in the value of Y when X1 is held fixed.
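To make this interpretation concrete, here is a minimal sketch with made-up coefficient values. The numbers chosen for B0, B1 and B2 are illustrative assumptions, not fitted values from this article. Raising one input by a single unit while the other stays fixed changes the prediction by exactly that input's coefficient.

```python
# Hypothetical coefficients, chosen only for illustration.
B0, B1, B2 = 1.0, 2.0, 0.5

def predict(x1, x2):
    """Prediction of a two-variable linear model."""
    return B0 + B1 * x1 + B2 * x2

base = predict(10, 20)
diff_x1 = predict(11, 20) - base   # X1 up by one unit, X2 held fixed
diff_x2 = predict(10, 21) - base   # X2 up by one unit, X1 held fixed
print(diff_x1, diff_x2)
```

This "holding the other variable fixed" reading is exactly what becomes shaky when the inputs move together.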

The coefficients are essential for understanding which variable has the greatest influence on the output.

So how is multicollinearity a problem? When we have independent variables that are highly correlated with each other, our coefficient estimates are not reliable, and we cannot make accurate interpretations based on their values.
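One way to see this unreliability is to repeat the same experiment many times and watch how much a coefficient estimate moves around. The sketch below is a small simulation under assumed settings (the sample size, noise levels, seed, and the helper name `fit_coefs` are all illustrative choices, not part of the original article). It fits the same two-variable model once with nearly collinear inputs and once with unrelated inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_coefs(correlated):
    """Fit y = b0 + b1*a + b2*b by least squares on fresh random data."""
    a = rng.uniform(0, 100, 100)
    if correlated:
        b = 3 * a + rng.uniform(0, 10, 100)   # b is almost a multiple of a
    else:
        b = rng.uniform(0, 300, 100)          # b is unrelated to a
    y = 4 * a + rng.normal(0, 30, 100)        # y truly depends on a only
    X = np.column_stack([np.ones(100), a, b])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[1]                           # estimated coefficient of a

# Repeat the experiment and measure how much the estimate of b1 moves around.
spread_corr = np.std([fit_coefs(True) for _ in range(200)])
spread_indep = np.std([fit_coefs(False) for _ in range(200)])
print(f"std of b1, correlated inputs:  {spread_corr:.3f}")
print(f"std of b1, independent inputs: {spread_indep:.3f}")
```

In both cases the true coefficient of `a` is 4, but with collinear inputs the estimate should scatter far more widely from run to run, which is why a single fitted value cannot be read with confidence.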

To explain this point further, I created two dummy input variables and one dependent output variable in Python.

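import numpy as np
# x4 is built from x3, so the two inputs are strongly correlated
x3 = np.random.randint(0, 100, 100)
x4 = 3 * x3 + np.random.randint(0, 100, 100)
# y1 depends directly on x3 only
y1 = 4 * x3 + np.random.randint(0, 100, 100)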

Creating scatterplots of the input variables against y1 gives us:

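import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 5))
# left panel: x3 against y1
plt.subplot(1, 2, 1)
sns.scatterplot(x=x3, y=y1)
plt.xlabel('x3')
# right panel: x4 against y1
plt.subplot(1, 2, 2)
sns.scatterplot(x=x4, y=y1)
plt.xlabel('x4')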

Scatterplots for the Input variables against y1

The scatterplots show that both x3 and x4 have a linear relationship with y1. Let's look at the correlation matrix for the variables and see what else we can interpret. I put my variables into a DataFrame named S2 and created a correlation matrix.

S2.corr()
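The construction of S2 is not shown, so the following is a sketch of one way it could be built. The column names 'X3', 'X4' and 'y1' are taken from the regression code later in the article. The random seed is an assumption added only for reproducibility, so the numbers will differ from the author's.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed chosen here only for reproducibility
x3 = np.random.randint(0, 100, 100)
x4 = 3 * x3 + np.random.randint(0, 100, 100)
y1 = 4 * x3 + np.random.randint(0, 100, 100)

# Column names follow the ones used later in the article.
S2 = pd.DataFrame({'X3': x3, 'X4': x4, 'y1': y1})
corr = S2.corr()       # pairwise Pearson correlations
print(corr.round(3))
```

The off-diagonal entry for X3 and X4 is the one to watch: a value close to 1 signals that the two inputs carry almost the same information.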

By the looks of the correlation matrix, X3 and X4 not only have a high positive correlation with y1 but are also highly correlated with each other. Let's see how this affects our results.

Before I fit a Linear Regression model to my variables, we have to understand the concepts of the P-value and the null hypothesis.

The P-value is used to decide whether or not to reject the null hypothesis.

The null hypothesis in our case is that "the variable does not have a significant relation with y".

If the P-value is less than the chosen threshold (conventionally 0.05), we reject the null hypothesis; otherwise, we fail to reject it. So let's move forward.
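For readers who want to see where such a P-value comes from, here is a minimal sketch of the standard t-test on OLS coefficients, computed by hand with NumPy and SciPy. The data, seed, and variable names are assumptions made up for this illustration; a regression summary table reports the same kind of quantity for each coefficient.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 100, n)
y = 4 * x + rng.normal(0, 30, n)            # y truly depends on x

X = np.column_stack([np.ones(n), x])        # intercept column plus x
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

resid = y - X @ coefs
dof = n - X.shape[1]                         # residual degrees of freedom
sigma2 = resid @ resid / dof                 # estimated error variance
cov = sigma2 * np.linalg.inv(X.T @ X)        # covariance of the estimates
std_err = np.sqrt(np.diag(cov))

t_stat = coefs / std_err
p_values = 2 * stats.t.sf(np.abs(t_stat), dof)   # two-sided P-values
print(p_values)
```

The P-value depends on the standard error of each coefficient, so anything that inflates the standard errors, such as correlated inputs, pushes P-values up even when a real relationship exists.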

I import the statsmodels library and use it to fit an Ordinary Least Squares (OLS) model to my variables.

The independent variables are X3 and X4, and the dependent variable is y1.

import statsmodels.api as sm
X = S2[['X3', 'X4']]
y = S2['y1']
# add the intercept column to both predictors
X = sm.add_constant(X)
est = sm.OLS(y, X)
est2 = est.fit()
print(est2.summary())

The results we get are:


We get a very high R-squared score, which shows that our model explains the variance in y1 quite well. The coefficients, on the other hand, tell an entirely different story.

The P-value for our X4 variable shows that we cannot reject the null hypothesis, meaning we have no evidence that X4 has a significant relation with y1.

Furthermore, the coefficient is negative, which is hard to square with the scatterplots: they showed that y1 has a positive relationship with X4.

So, to sum it up, our coefficients are not reliable and our P-values cannot be trusted.
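A common way to quantify this problem, which the analysis above does not use, is the variance inflation factor (VIF): regress each input on the other inputs and compute 1 / (1 - R^2). The sketch below is an illustration with NumPy only; the helper name `vif` and the seed are assumptions introduced here. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity.

```python
import numpy as np

np.random.seed(0)  # seed chosen here only for reproducibility
x3 = np.random.randint(0, 100, 100)
x4 = 3 * x3 + np.random.randint(0, 100, 100)

def vif(target, others):
    """VIF of `target`: 1 / (1 - R^2) from regressing it on `others`."""
    X = np.column_stack([np.ones(len(target))] + list(others))
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coefs
    r2 = 1 - resid.var() / target.var()
    return 1 / (1 - r2)

vif_x3 = vif(x3, [x4])
vif_x4 = vif(x4, [x3])
print(vif_x3, vif_x4)
```

With only two inputs the two VIF values coincide, because each is determined by the same pairwise correlation.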

Regression with one variable only

In the previous multivariate example, our results showed that X4 did not have a significant relation with y1. So let us try to analyze y1 and X4 alone and see what we get.


import statsmodels.api as sm
# this time X4 is the only predictor
X4 = S2['X4']
y1 = S2['y1']
X = sm.add_constant(X4)
est = sm.OLS(y1, X)
est2 = est.fit()
print(est2.summary())

After fitting our OLS model, we get:

The coefficient is now positive, and we can reject the null hypothesis that X4 is not related to y1. One more thing we can take from this model is that our R-squared value has dropped significantly, from 0.942 to 0.826. So what does that tell us? If our goal is prediction, we should think carefully before removing variables. But if our goal is to interpret each coefficient, collinearity can be troublesome, and we have to consider which variables to keep and which to remove.
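The trade-off can be explored by comparing R-squared across the candidate models. The following sketch regenerates data in the same way as the dummy variables above and computes R-squared with plain NumPy; the helper name `r_squared` and the seed are assumptions for this illustration, and because the data are random the values will not match the 0.942 and 0.826 reported above exactly.

```python
import numpy as np

np.random.seed(0)  # seed chosen here only for reproducibility
x3 = np.random.randint(0, 100, 100)
x4 = 3 * x3 + np.random.randint(0, 100, 100)
y1 = 4 * x3 + np.random.randint(0, 100, 100)

def r_squared(y, *predictors):
    """R^2 of an OLS fit of y on the given predictors plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return 1 - resid.var() / y.var()

r2_both = r_squared(y1, x3, x4)
r2_x4_only = r_squared(y1, x4)
r2_x3_only = r_squared(y1, x3)
print(round(r2_both, 3), round(r2_x4_only, 3), round(r2_x3_only, 3))
```

Comparing the three numbers shows which variable can be dropped at little cost: here y1 is generated from x3, so keeping x3 alone should lose almost nothing, while keeping only x4 gives up part of the fit.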

[1]: Gareth James et al., An Introduction to Statistical Learning. http://faculty.marshall.usc.edu/gareth-james/ISL/

Translated from: https://medium.com/analytics-vidhya/how-multicollinearity-is-a-problem-in-linear-regression-dbb76e25cd80
