Working with Outliers: OLS vs. Robust Regressions

When it comes to regression analysis, outliers (values that lie well outside the typical range of a particular variable) can cause issues.

Background

Let’s consider this issue further using the Pima Indians Diabetes dataset.


Here is a boxplot of BMI values across patients. According to the boxplot, several outliers are present that lie well above the upper limit indicated by the interquartile range.

Furthermore, we also have a visual indication of a positively skewed distribution, where several positive outliers "push" the distribution out to the right:

Source: RStudio
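For reference, plots like the two above can be produced with a few lines of base R. This is a minimal sketch rather than the article's exact code, and it assumes the dataset has already been loaded into a data frame named diabetes1 (the name used in the train/validation split further below):

# Visual checks on BMI (assumes the data is in a data frame called diabetes1)
boxplot(diabetes1$BMI, main = "Boxplot of BMI", ylab = "BMI")

# Histogram to show the positive skew of the distribution
hist(diabetes1$BMI, breaks = 30, main = "Distribution of BMI", xlab = "BMI")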

Outliers can cause issues when it comes to conducting regression analysis. OLS works by finding the line of best fit, i.e. the regression line that minimises the (squared) distance between the line and the individual observations.

Clearly, if outliers are present, they can pull the line of best fit towards them and weaken the predictive power of the regression model. Their presence can also violate the assumption that the residuals are normally distributed.
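To make the point concrete, here is a small simulated example (not taken from the article) showing how a single extreme value can pull an OLS fit away from the true relationship:

# Toy illustration: one extreme point noticeably shifts the OLS estimates
set.seed(42)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 5)
coef(lm(y ~ x))          # slope is close to the true value of 2

y_out <- y
y_out[50] <- 500         # replace the final observation with an outlier
coef(lm(y_out ~ x))      # estimates shift noticeably due to the single outlier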

In this regard, an OLS regression model and robust regression models (using Huber and Bisquare weights) are run in order to predict BMI values across the test set, with a view to measuring whether accuracy is significantly improved by the latter models.

Here is a quick overview of the data and the correlations between the features:

Source: RStudio
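The exact plotting code is not shown in the article; a correlation plot like the one above can be produced along the following lines, assuming the data frame is named diabetes1 and using the corrplot package (an assumption, any correlation-plot function would do):

# Correlation matrix across the (numeric) features of the dataset
library(corrplot)

M <- cor(diabetes1)
corrplot(M, method = "number")   # display pairwise correlations as numbers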

Ordinary Least Squares (OLS)

Using the above correlation plot to check that the independent variables in the regression model are not strongly correlated with each other, the regression model is defined as follows:

reg1 <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset)

Note that Outcome is a categorical variable taking the values 0 and 1 (non-diabetic vs. diabetic).

The data is split into a training set and a test set (the latter serving as unseen data for the model).

Of the training data, 80% is used to train the regression model, while the remaining 20% is used as a validation set to assess the results.

# Training and Validation Data
trainset <- diabetes1[1:479, ]    # ~80% of the 599 observations
valset <- diabetes1[480:599, ]    # remaining ~20%, used for validation

Here are the OLS results:


> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + Age + Insulin + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + Age + Insulin + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.0813  -4.2762  -0.8733   3.4031  28.2196 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   28.0498978  0.9740705  28.797  < 2e-16 ***
Outcome        4.1290646  0.6171707   6.690 6.30e-11 ***
Age           -0.0101171  0.0248626  -0.407    0.684    
Insulin        0.0000262  0.0027077   0.010    0.992    
SkinThickness  0.1513285  0.0195945   7.723 6.81e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.176 on 474 degrees of freedom
Multiple R-squared:  0.2135, Adjusted R-squared:  0.2069 
F-statistic: 32.17 on 4 and 474 DF,  p-value: < 2.2e-16

Outcome and SkinThickness are identified as significant variables at the 5% level. While the R-squared of 21.35% is quite low, this is to be expected, as there are many other variables that influence BMI which have not been included in the model.

Let's drop the Age and Insulin variables from the OLS model and run it once again.

> # OLS Regression
> summary(ols <- lm(BMI ~ Outcome + SkinThickness, data=trainset))

Call:
lm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1740  -4.2115  -0.8532   3.3852  28.3072 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   27.70940    0.49723  55.728   <2e-16 ***
Outcome        4.06953    0.59223   6.872    2e-11 ***
SkinThickness  0.15247    0.01774   8.595   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.165 on 476 degrees of freedom
Multiple R-squared:  0.2132, Adjusted R-squared:  0.2099 
F-statistic: 64.51 on 2 and 476 DF,  p-value: < 2.2e-16

Robust Regressions

A modified version of the above regression, known as a robust regression, is now run. The reason such models are described as "robust" is that they are less sensitive to violations of the OLS assumptions, including the presence of outliers in the data. The following presentation gives more information as to how specifically a robust regression works.

In this example, we will use the rlm() function from the MASS package with two different types of weighting to run this type of regression: Huber and Bisquare weights.

The same regressions are run once again, but this time using the above weightings.


Huber Weights

> # Huber Weights
> rr.huber <- rlm(BMI ~ Outcome + SkinThickness, data=trainset)
> summary(rr.huber)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.4130  -3.6492  -0.3479   3.7717  28.7081 

Coefficients:
              Value   Std. Error t value
(Intercept)   27.0596  0.4685    57.7581
Outcome        3.7631  0.5580     6.7438
SkinThickness  0.1645  0.0167     9.8445

Residual standard error: 5.47 on 476 degrees of freedom
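As a side note not covered in the article, rlm() stores the final weights from its iteratively reweighted least squares fit in the returned object, which gives a direct view of how strongly individual observations are downweighted:

# Observations with the largest residuals receive Huber weights below 1
head(sort(rr.huber$w), 10)

# BMI values of the ten most heavily downweighted observations
trainset$BMI[order(rr.huber$w)[1:10]]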

Bisquare Weights

> # Bisquare weighting
> rr.bisquare <- rlm(BMI ~ Outcome + SkinThickness, data=trainset, psi = psi.bisquare)
> summary(rr.bisquare)

Call: rlm(formula = BMI ~ Outcome + SkinThickness, data = trainset, 
    psi = psi.bisquare)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1991  -3.6106  -0.3015   3.8074  28.8724 

Coefficients:
              Value   Std. Error t value
(Intercept)   27.0524  0.4793    56.4472
Outcome        3.6491  0.5708     6.3927
SkinThickness  0.1636  0.0171     9.5689

Residual standard error: 5.483 on 476 degrees of freedom

Comparison

Here is the performance of the regression models in predicting the test set values, on both a root mean squared error (RMSE) and a mean absolute percentage error (MAPE) basis (a sketch of how these metrics can be computed is included after the lists below):

RMSE

  • OLS: 5.81
  • Huber: 5.86
  • Bisquare: 5.87

MAPE

  • OLS: 0.139
  • Huber: 0.137
  • Bisquare: 0.137
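The article does not show the evaluation code itself; the figures above could be computed along these lines, assuming the fitted models from the previous sections and the valset hold-out defined earlier (and assuming the BMI values being predicted are non-zero, since MAPE divides by the actual values):

# Predictions from each model on the hold-out data
pred_ols      <- predict(ols, newdata = valset)
pred_huber    <- predict(rr.huber, newdata = valset)
pred_bisquare <- predict(rr.bisquare, newdata = valset)

# Error metrics
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))

rmse(valset$BMI, pred_ols)        # repeat for pred_huber and pred_bisquare
mape(valset$BMI, pred_ols)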

We can see that the errors increased slightly on an RMSE basis (contrary to our expectations), while there was only a marginal decrease on an MAPE basis.


Are the outliers "influential"?

Using a robust regression to account for outliers did not show significant accuracy improvements as might have been expected.


However, the mere presence of outliers in a dataset does not necessarily mean that those outliers are influential.

By influential, we mean that the outlier has a significant effect on the fitted regression model and, in turn, on the predictions of the response variable.

This can be determined by using Cook’s Distance.


Source: RStudio
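A plot like the one above can be generated directly from the fitted lm object; this is a sketch rather than the article's exact code:

# Cook's distance for each observation in the OLS fit
cooksd <- cooks.distance(ols)

# Observations exceeding a common rule-of-thumb threshold of 4/n
which(cooksd > 4 / length(cooksd))

# Residuals vs. leverage plot, with Cook's distance contours overlaid
plot(ols, which = 5)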

We can see that while outliers are present in the dataset, they still do not approach the threshold outlined by Cook's distance at the top right-hand corner of the graph.

In this regard, it is now evident why the robust regressions did not show superior performance to OLS from an accuracy standpoint: the outliers are not influential enough to warrant using a robust regression.

Conclusion

Robust regressions are useful when it comes to modelling datasets that contain outliers, and there have been cases where they produce superior results to OLS.

However, those outliers must be influential, and in this regard one must exercise caution in using robust regressions in a situation such as this one, where outliers are present but do not particularly influence the response variable.

Hope you enjoyed this article, and any questions or feedback are greatly welcomed. You can find the code and datasets for this example at the MGCodesandStats GitHub repository.


Disclaimer: This article is written on an “as is” basis and without warranty. It was written with the intention of providing an overview of data science concepts, and should not be interpreted as professional advice in any way.


Original article: https://towardsdatascience.com/working-with-outliers-ols-vs-robust-regressions-5cf861168ac4
