Outliers detection in R

Introduction

An outlier is a value or an observation that is distant from other observations, that is to say, a data point that differs significantly from other data points. Enderlein (1987) goes even further, as the author considers outliers to be values that deviate so much from other observations that one might suppose a different underlying sampling mechanism.

An observation must always be compared to other observations made on the same phenomenon before actually calling it an outlier. Indeed, someone who is 200 cm tall (6’7" in US) will most likely be considered as an outlier compared to the general population, but that same person may not be considered as an outlier if we measured the height of basketball players.

An outlier may be due to the variability inherent in the observed phenomenon. For example, it is often the case that there are outliers when collecting data on salaries, as some people make much more money than the rest. Outliers can also arise due to an experimental, measurement or encoding error. For instance, a human weighing 786 kg (1733 pounds) is clearly an error when encoding the weight of the subject. Her or his weight is most probably 78.6 kg (173 pounds) or 7.86 kg (17 pounds), depending on whether weights of adults or babies have been measured.

For this reason, it sometimes makes sense to formally distinguish two classes of outliers: (i) extreme values and (ii) mistakes. Extreme values are statistically and philosophically more interesting, because they are possible but unlikely responses. (Thanks Felix Kluxen for the valuable suggestion.)

In this article, I present several approaches to detect outliers in R, from simple techniques such as descriptive statistics (including minimum, maximum, histogram, boxplot and percentiles) to more formal techniques such as the Hampel filter, the Grubbs, the Dixon and the Rosner tests for outliers.

Although there is no strict or unique rule whether outliers should be removed or not from the dataset before doing statistical analyses, it is quite common to, at least, remove or impute outliers that are due to an experimental or measurement error (like the weight of 786 kg (1733 pounds) for a human). Some statistical tests require the absence of outliers in order to draw sound conclusions, but removing outliers is not recommended in all cases and must be done with caution.

This article will not tell you whether you should remove outliers or not (nor whether you should impute them with the median, mean, mode or any other value), but it will help you to detect them in order to, as a first step, verify them. After their verification, it is then your choice to exclude or include them in your analyses (and this usually requires thoughtful reflection on the researcher’s side). Removing or keeping outliers mostly depends on three factors:

  1. The domain/context of your analyses and the research question. In some domains, it is common to remove outliers as they often occur due to a malfunctioning process. In other fields, outliers are kept because they contain valuable information. It also happens that analyses are performed twice, once with and once without outliers, to evaluate their impact on the conclusions. If results change drastically due to some influential values, this should caution the researcher against making overambitious claims.
  2. Whether the tests you are going to apply are robust to the presence of outliers or not. For instance, the slope of a simple linear regression may vary significantly with just one outlier, whereas non-parametric tests such as the Wilcoxon test are usually robust to outliers.
  3. How distant the outliers are from other observations. Some observations considered as outliers (according to the techniques presented below) are actually not really extreme compared to all other observations, while other potential outliers may be really distant from the rest of the observations.
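
The idea of running an analysis twice (first factor) and the sensitivity of a regression slope to a single point (second factor) can be sketched as follows. This is only an illustration under stated assumptions: the data frame d is a made-up toy example, not taken from mpg, and its last observation is deliberately placed far from the trend.

```r
# Toy data (hypothetical): y is roughly 2 * x, except for the last point
d <- data.frame(
  x = 1:10,
  y = c(2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2, 17.9, 60)
)

# Fit the same simple linear regression with and without the suspect row
fit_all  <- lm(y ~ x, data = d)
fit_trim <- lm(y ~ x, data = d[-10, ])

# Compare the slopes to see how much a single observation drives the result
slope_all  <- unname(coef(fit_all)["x"])
slope_trim <- unname(coef(fit_trim)["x"])
c(with_outlier = slope_all, without_outlier = slope_trim)
```

In this toy case the slope roughly doubles when the extreme point is included, which is the kind of instability worth reporting alongside the main results.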

The dataset mpg from the {ggplot2} package will be used to illustrate the different approaches to outlier detection in R, and in particular we will focus on the variable hwy (highway miles per gallon).

Descriptive statistics

Minimum and maximum

The first step to detect outliers in R is to start with some descriptive statistics, and in particular with the minimum and maximum.

In R, this can easily be done with the summary() function:

dat <- ggplot2::mpg
summary(dat$hwy)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   18.00   24.00   23.44   27.00   44.00

where the minimum and maximum are respectively the first and last values in the output above. Alternatively, they can also be computed with the min() and max() functions:

min(dat$hwy)
## [1] 12
max(dat$hwy)
## [1] 44

Some clear encoding mistake like a weight of 786 kg (1733 pounds) for a human will already be easily detected by this very simple technique.
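
A minimal sketch of such a check, assuming a made-up vector of adult weights and plausibility limits chosen purely for illustration (weight, plausible_min and plausible_max are hypothetical names, not part of the mpg example):

```r
# Hypothetical weights in kg; the third value mimics a misplaced decimal point
weight <- c(78.6, 65.2, 786, 81.0, 59.4)

# Assumed plausibility limits for adults (pick limits that fit your context)
plausible_min <- 30
plausible_max <- 300

# Start with the minimum and maximum, then locate the implausible values
range(weight)
suspect <- which(weight < plausible_min | weight > plausible_max)
suspect
weight[suspect]
```

The limits are a judgment call that depends on the population measured; the point is only that the extremes are the first place to look.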

Histogram

Another basic way to detect outliers is to draw a histogram of the data.

Using base R (with the number of bins corresponding to the square root of the number of observations in order to have more bins than the default option):

hist(dat$hwy,
  xlab = "hwy",
  main = "Histogram of hwy",
  breaks = sqrt(nrow(dat))
) # set number of bins

or using ggplot2 (via the esquisse addin):

library(ggplot2)

ggplot(dat) +
  aes(x = hwy) +
  geom_histogram(bins = 30L, fill = "#0c4c8a") +
  theme_minimal()

From the histogram, there seem to be a couple of observations higher than all other observations (see the bar on the right side of the plot).

Boxplot

In addition to histograms, boxplots are also useful to detect potential outliers.

Using base R:

boxplot(dat$hwy,
  ylab = "hwy"
)

or using ggplot2:

ggplot(dat) +
  aes(x = "", y = hwy) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal()

A boxplot helps to visualize a quantitative variable by displaying five common location summaries (minimum, median, first and third quartiles and maximum) and any observation that was classified as a suspected outlier using the interquartile range (IQR) criterion. The IQR criterion means that all observations above $q_{0.75} + 1.5 \cdot IQR$ or below $q_{0.25} - 1.5 \cdot IQR$ (where $q_{0.25}$ and $q_{0.75}$ correspond to the first and third quartiles respectively, and IQR is the difference between the third and first quartiles) are considered as potential outliers by R. In other words, all observations outside of the following interval will be considered as potential outliers:

$$I = [q_{0.25} - 1.5 \cdot IQR;\ q_{0.75} + 1.5 \cdot IQR]$$
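
As a rough sketch, the interval can be computed by hand with quantile(). The helper name iqr_bounds and the toy vector x are assumptions made for illustration only:

```r
# Hypothetical helper: bounds of the interval I for a numeric vector
iqr_bounds <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - coef * iqr, upper = q[2] + coef * iqr)
}

x <- c(12, 15, 17, 18, 20, 21, 22, 24, 25, 60)  # toy data

b <- iqr_bounds(x)
b

# Values falling outside the interval
flagged <- x[x < b["lower"] | x > b["upper"]]
flagged
```

Note that boxplot.stats() works with hinges rather than the default quantile() definition, so its bounds can differ slightly from this sketch on small samples, even if the flagged values are often the same.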

Observations considered as potential outliers by the IQR criterion are displayed as points in the boxplot. Based on this criterion, there are 2 potential outliers (see the 2 points above the vertical line, at the top of the boxplot).

Remember that an observation should not be removed simply because the IQR criterion flags it as a potential outlier. Removing or keeping an outlier depends on (i) the context of your analysis, (ii) whether the tests you are going to perform on the dataset are robust to outliers or not, and (iii) how far the outlier is from other observations.

It is also possible to extract the values of the potential outliers based on the IQR criterion thanks to the $out element returned by boxplot.stats():

boxplot.stats(dat$hwy)$out
## [1] 44 44 41

As you can see, there are actually 3 points considered as potential outliers: 2 observations with a value of 44 and 1 observation with a value of 41.
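
Identical values are drawn on top of each other in a boxplot, which is presumably why only 2 points are visible for 3 flagged observations. A small sketch on a made-up vector (x here is toy data, not hwy) shows how to count repeated flagged values:

```r
x <- c(1:9, 30, 30)            # toy data with a repeated extreme value

out <- boxplot.stats(x)$out    # values flagged by the IQR criterion
table(out)                     # how many times each flagged value occurs
which(x %in% out)              # positions of all flagged observations
```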

Thanks to the which() function it is possible to extract the row numbers corresponding to these outliers:

out <- boxplot.stats(dat$hwy)$out
out_ind <- which(dat$hwy %in% c(out))
out_ind
## [1] 213 222 223

With this information you can now easily go back to the specific rows in the dataset to verify them, or print all variables for these outliers:

dat[out_ind, ]
## # A tibble: 3 x 11
##   manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class  
##   <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>  
## 1 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compact
## 2 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subcom…
## 3 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subcom…

It is also possible to print the values of the outliers directly on the boxplot with the mtext() function:

boxplot(dat$hwy,
  ylab = "hwy",
  main = "Boxplot of highway miles per gallon"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))

Percentiles

This method of outlier detection is based on percentiles. With the percentiles method, all observations that lie outside the interval formed by the 2.5 and 97.5 percentiles will be considered as potential outliers. Other percentiles such as the 1 and 99, or the 5 and 95 percentiles can also be considered to construct the interval.

The values of the lower and upper percentiles (and thus the lower and upper limits of the interval) can be computed with the quantile() function:

lower_bound <- quantile(dat$hwy, 0.025)
lower_bound
## 2.5% 
##   14
upper_bound <- quantile(dat$hwy, 0.975)
upper_bound
##  97.5% 
## 35.175

According to this method, all observations below 14 or above 35.175 will be considered as potential outliers. The row numbers of the observations outside of the interval can then be extracted with the which() function:

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind
##  [1]  55  60  66  70 106 107 127 197 213 222 223

Then their values of highway miles per gallon can be printed:

dat[outlier_ind, "hwy"]
## # A tibble: 11 x 1
##      hwy
##    <int>
##  1    12
##  2    12
##  3    12
##  4    12
##  5    36
##  6    36
##  7    12
##  8    37
##  9    44
## 10    44
## 11    41

Alternatively, all variables for these outliers can be printed:

dat[outlier_ind, ]
## # A tibble: 11 x 11
##    manufacturer model    displ  year   cyl trans  drv     cty   hwy fl    class 
##    <chr>        <chr>    <dbl> <int> <int> <chr>  <chr> <int> <int> <chr> <chr> 
##  1 dodge        dakota …   4.7  2008     8 auto(… 4         9    12 e     pickup
##  2 dodge        durango…   4.7  2008     8 auto(… 4         9    12 e     suv   
##  3 dodge        ram 150…   4.7  2008     8 auto(… 4         9    12 e     pickup
##  4 dodge        ram 150…   4.7  2008     8 manua… 4         9    12 e     pickup
##  5 honda        civic      1.8  2008     4 auto(… f        25    36 r     subco…
##  6 honda        civic      1.8  2008     4 auto(… f        24    36 c     subco…
##  7 jeep         grand c…   4.7  2008     8 auto(… 4         9    12 e     suv   
##  8 toyota       corolla    1.8  2008     4 manua… f        28    37 r     compa…
##  9 volkswagen   jetta      1.9  1999     4 manua… f        33    44 d     compa…
## 10 volkswagen   new bee…   1.9  1999     4 manua… f        35    44 d     subco…
## 11 volkswagen   new bee…   1.9  1999     4 auto(… f        29    41 d     subco…

There are 11 potential outliers according to the percentiles method. To reduce this number, you can set the percentiles to 1 and 99:

lower_bound <- quantile(dat$hwy, 0.01)
upper_bound <- quantile(dat$hwy, 0.99)

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)

dat[outlier_ind, ]
## # A tibble: 3 x 11
##   manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class  
##   <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>  
## 1 volkswagen   jetta     1.9  1999     4 manual… f        33    44 d     compact
## 2 volkswagen   new be…   1.9  1999     4 manual… f        35    44 d     subcom…
## 3 volkswagen   new be…   1.9  1999     4 auto(l… f        29    41 d     subcom…

Setting the percentiles to 1 and 99 gives the same potential outliers as with the IQR criterion.
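
To try several percentile pairs without repeating the code, the steps above can be wrapped in a small function. This is a sketch only: percentile_outliers is a hypothetical name and x is toy data rather than hwy.

```r
# Hypothetical helper: indices outside the [lower, upper] percentile interval
percentile_outliers <- function(x, lower = 0.025, upper = 0.975) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  which(x < bounds[1] | x > bounds[2])
}

x <- 1:100  # toy data

percentile_outliers(x)               # 2.5 and 97.5 percentiles
percentile_outliers(x, 0.01, 0.99)   # 1 and 99 percentiles
percentile_outliers(x, 0.05, 0.95)   # 5 and 95 percentiles
```

Because the bounds are percentiles of the data themselves, this method tends to flag a similar share of observations whatever their actual distance from the rest, so the flagged values still need to be checked individually.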

Hampel filter

Another method, known as the Hampel filter, consists of considering as outliers the values outside the interval (I) formed by the median, plus or minus 3 median absolute deviations (MAD):[1]

$$I = [\mathrm{median} - 3 \cdot MAD;\ \mathrm{median} + 3 \cdot MAD]$$

where MAD is the median absolute deviation and is defined as the median of the absolute deviations from the data’s median $\tilde{X} = \mathrm{median}(X)$:

$$MAD = \mathrm{median}(|X_i - \tilde{X}|)$$

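For this method we first set the interval limits thanks to the median() and mad() functions. Note that mad() multiplies the median of the absolute deviations by a constant (1.4826 by default) so that it is comparable to a standard deviation for normally distributed data; the bounds below use this scaled value:

lower_bound <- median(dat$hwy) - 3 * mad(dat$hwy)
lower_bound
## [1] 1.761
upper_bound <- median(dat$hwy) + 3 * mad(dat$hwy)
upper_bound
## [1] 46.239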

According to this method, all observations below 1.761 or above 46.239 will be considered as potential outliers. The row numbers of the observations outside of the interval can then be extracted with the which() function:

outlier_ind <- which(dat$hwy < lower_bound | dat$hwy > upper_bound)
outlier_ind
## integer(0)

According to the Hampel filter, there is no potential outlier for the hwy variable.
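
The whole procedure can be sketched as one small function. The name hampel_outliers and the toy vector x are assumptions for illustration; the constant argument is exposed because mad() in R scales the raw median absolute deviation by 1.4826 unless told otherwise.

```r
# Hypothetical helper: indices of values outside median +/- k * MAD
hampel_outliers <- function(x, k = 3, constant = 1.4826) {
  med <- median(x, na.rm = TRUE)
  # mad() multiplies the raw median absolute deviation by `constant`;
  # use constant = 1 to match the unscaled definition of the MAD
  spread <- mad(x, constant = constant, na.rm = TRUE)
  which(x < med - k * spread | x > med + k * spread)
}

x <- c(10, 11, 12, 12, 13, 14, 15, 100)  # toy data

mad(x, constant = 1)   # raw MAD: median(|x - median(x)|)
mad(x)                 # scaled MAD, the default in R
hampel_outliers(x)     # position of the flagged value
```

Which scaling to use is a design choice: the scaled version makes "3 MAD" roughly comparable to "3 standard deviations" under normality, while the raw version gives a narrower interval.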

Statistical tests

In this section, we present 3 more formal techniques to detect outliers:

  1. Grubbs’s test
  2. Dixon’s test
  3. Rosner’s test

These 3 statistical tests are part of more formal techniques of outliers detection as they all involve the computation of a test statistic that is compared to tabulated critical values (that are based on the sample size and the desired confidence level).

Note that the 3 tests are appropriate only when the data (without any outliers) are approximately normally distributed. The normality assumption must thus be verified before applying these tests for outliers (see how to test the normality assumption in R).

Grubbs’s test

The Grubbs test makes it possible to detect whether the highest or lowest value in a dataset is an outlier.

The Grubbs test detects one outlier at a time (highest or lowest value), so the null and alternative hypotheses are as follows:

  • H0: The highest value is not an outlier
  • H1: The highest value is an outlier

if we want to test the highest value, or:

  • H0: The lowest value is not an outlier
  • H1: The lowest value is an outlier

if we want to test the lowest value.

As for any statistical test, if the p-value is less than the chosen significance threshold (generally α = 0.05), then the null hypothesis is rejected and we conclude that the lowest/highest value is an outlier. On the contrary, if the p-value is greater than or equal to the significance level, the null hypothesis is not rejected, and we conclude that, based on the data, we do not reject the hypothesis that the lowest/highest value is not an outlier.

Note that the Grubbs test is not appropriate for sample sizes of 6 or less (n ≤ 6).
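
For intuition, the test statistic usually associated with the Grubbs test is the distance between the tested value and the sample mean, expressed in standard deviations. The sketch below computes only that statistic in base R, on toy data; grubbs_stat is a hypothetical helper, and the p-value computation is left to a dedicated package.

```r
# Hypothetical helper: Grubbs statistic for the highest or the lowest value
grubbs_stat <- function(x, lowest = FALSE) {
  extreme <- if (lowest) min(x) else max(x)
  abs(extreme - mean(x)) / sd(x)
}

x <- c(10, 11, 12, 12, 13, 14, 15, 100)  # toy data, n = 8

g_high <- grubbs_stat(x)                 # statistic for the highest value
g_low  <- grubbs_stat(x, lowest = TRUE)  # statistic for the lowest value
c(highest = g_high, lowest = g_low)
```

A large statistic for one end of the sample and a small one for the other, as here, suggests which extreme is worth testing first.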

To perform the Grubbs test in R, we use the grubbs.test() function from the {outliers} package:

# install.packages("outliers")
library(outliers)
test <- grubbs.test(dat$hwy)
test
## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 3.45274, U = 0.94862, p-value = 0.05555
## alternative hypothesis: highest value 44 is an outlier

The p-value is 0.056. At the 5% significance level, we do not reject the hypothesis that the highest value 44 is not an outlier.

By default, the test is performed on the highest value (as shown in the R output: alternative hypothesis: highest value 44 is an outlier). If you want to do the test for the lowest value, simply add the argument opposite = TRUE in the grubbs.test() function:

test <- grubbs.test(dat$hwy, opposite = TRUE)
test
## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 1.92122, U = 0.98409, p-value = 1
## alternative hypothesis: lowest value 12 is an outlier

The R output indicates that the test is now performed on the lowest value (see alternative hypothesis: lowest value 12 is an outlier).

The p-value is 1. At the 5% significance level, we do not reject the hypothesis that the lowest value 12 is not an outlier.

For the sake of illustration, we will now replace an observation with a more extreme value and perform the Grubbs test on this new dataset. Let’s replace the 34th row with a value of 212:

dat[34, "hwy"] <- 212

And we now apply the Grubbs test to test whether the highest value is an outlier:

test <- grubbs.test(dat$hwy)
test
## 
##  Grubbs test for one outlier
## 
## data:  dat$hwy
## G = 13.72240, U = 0.18836, p-value < 2.2e-16
## alternative hypothesis: highest value 212 is an outlier

The p-value is < 0.001. At the 5% significance level, we conclude that the highest value 212 is an outlier.

Dixon’s test

Similar to the Grubbs test, the Dixon test is used to test whether a single low or high value is an outlier. So if more than one outlier is suspected, the test has to be performed on these suspected outliers individually.

Note that the Dixon test is most useful for small sample sizes (usually n ≤ 25).

To perform the Dixon test in R, we use the dixon.test() function from the {outliers} package. However, we restrict our dataset to the first 20 observations, as the Dixon test can only be done on small samples (the function accepts only datasets of 3 to 30 observations and throws an error otherwise):

subdat <- dat[1:20, ]
test <- dixon.test(subdat$hwy)
test
## 
##  Dixon test for outliers
## 
## data:  subdat$hwy
## Q = 0.57143, p-value = 0.006508
## alternative hypothesis: lowest value 15 is an outlier

The results show that the lowest value 15 is an outlier (p-value = 0.007).

To test for the highest value, simply add the opposite = TRUE argument to the dixon.test() function:

test <- dixon.test(subdat$hwy,
  opposite = TRUE
)
test
## 
##  Dixon test for outliers
## 
## data:  subdat$hwy
## Q = 0.25, p-value = 0.8582
## alternative hypothesis: highest value 31 is an outlier

The results show that the highest value 31 is not an outlier (p-value = 0.858).

It is good practice to always check the results of the statistical test for outliers against the boxplot to make sure we tested all potential outliers:

out <- boxplot.stats(subdat$hwy)$out
boxplot(subdat$hwy,
  ylab = "hwy"
)
mtext(paste("Outliers: ", paste(out, collapse = ", ")))

From the boxplot, we see that we could also apply the Dixon test on the value 20, in addition to the value 15 tested previously. This can be done by finding the row number of the minimum value, excluding this row from the dataset and then applying the Dixon test on this new dataset:

# find and exclude lowest value
remove_ind <- which.min(subdat$hwy)
subsubdat <- subdat[-remove_ind, ]

# Dixon test on dataset without the minimum
test <- dixon.test(subsubdat$hwy)
test
## 
##  Dixon test for outliers
## 
## data:  subsubdat$hwy
## Q = 0.44444, p-value = 0.1297
## alternative hypothesis: lowest value 20 is an outlier

The results show that the second lowest value 20 is not an outlier (p-value = 0.13).

Rosner’s test

Rosner’s test for outliers has the advantages that:

  1. it is used to detect several outliers at once (unlike the Grubbs and Dixon tests, which must be performed iteratively to screen for multiple outliers), and
  2. it is designed to avoid the problem of masking, where an outlier that is close in value to another outlier can go undetected.

Unlike the Dixon test, the Rosner test is most appropriate when the sample size is large (n ≥ 20). We therefore use the initial dataset dat again, which includes 234 observations.

To perform the Rosner test we use the rosnerTest() function from the {EnvStats} package. This function requires at least 2 arguments: the data and the number of suspected outliers k (with k = 3 as the default number of suspected outliers).

For this example, we set the number of suspected outliers to be equal to 3, as suggested by the number of potential outliers outlined in the boxplot at the beginning of the article.[2]

library(EnvStats)
test <- rosnerTest(dat$hwy,
  k = 3
)
test
## $distribution
## [1] "Normal"
## 
## $statistic
##       R.1       R.2       R.3 
## 13.722399  3.459098  3.559936 
## 
## $sample.size
## [1] 234
## 
## $parameters
## k 
## 3 
## 
## $alpha
## [1] 0.05
## 
## $crit.value
## lambda.1 lambda.2 lambda.3 
## 3.652091 3.650836 3.649575 
## 
## $n.outliers
## [1] 1
## 
## $alternative
## [1] "Up to 3 observations are not\n                                 from the same Distribution."
## 
## $method
## [1] "Rosner's Test for Outliers"
## 
## $data
##   [1]  29  29  31  30  26  26  27  26  25  28  27  25  25  25  25  24  25  23
##  [19]  20  15  20  17  17  26  23  26  25  24  19  14  15  17  27 212  26  29
##  [37]  26  24  24  22  22  24  24  17  22  21  23  23  19  18  17  17  19  19
##  [55]  12  17  15  17  17  12  17  16  18  15  16  12  17  17  16  12  15  16
##  [73]  17  15  17  17  18  17  19  17  19  19  17  17  17  16  16  17  15  17
##  [91]  26  25  26  24  21  22  23  22  20  33  32  32  29  32  34  36  36  29
## [109]  26  27  30  31  26  26  28  26  29  28  27  24  24  24  22  19  20  17
## [127]  12  19  18  14  15  18  18  15  17  16  18  17  19  19  17  29  27  31
## [145]  32  27  26  26  25  25  17  17  20  18  26  26  27  28  25  25  24  27
## [163]  25  26  23  26  26  26  26  25  27  25  27  20  20  19  17  20  17  29
## [181]  27  31  31  26  26  28  27  29  31  31  26  26  27  30  33  35  37  35
## [199]  15  18  20  20  22  17  19  18  20  29  26  29  29  24  44  29  26  29
## [217]  29  29  29  23  24  44  41  29  26  28  29  29  29  28  29  26  26  26
## 
## $data.name
## [1] "dat$hwy"
## 
## $bad.obs
## [1] 0
## 
## $all.stats
##   i   Mean.i      SD.i Value Obs.Num     R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345   212      34 13.722399   3.652091    TRUE
## 2 1 23.41202  5.951835    44     213  3.459098   3.650836   FALSE
## 3 2 23.32328  5.808172    44     222  3.559936   3.649575   FALSE
## 
## attr(,"class")
## [1] "gofOutlier"

The interesting results are provided in the $all.stats table:

test$all.stats
##   i   Mean.i      SD.i Value Obs.Num     R.i+1 lambda.i+1 Outlier
## 1 0 24.21795 13.684345   212      34 13.722399   3.652091    TRUE
## 2 1 23.41202  5.951835    44     213  3.459098   3.650836   FALSE
## 3 2 23.32328  5.808172    44     222  3.559936   3.649575   FALSE

Based on the Rosner test, we see that there is only one outlier (see the Outlier column), and that it is the observation 34 (see Obs.Num) with a value of 212 (see Value).
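
To see where the R and lambda columns come from, the procedure can be sketched in base R. This follows the generalized extreme Studentized deviate (ESD) procedure as it is commonly described, and it is not the code of rosnerTest(); the helper esd_sketch, its column names and the toy vector are all assumptions made for illustration.

```r
# Sketch of the generalized ESD procedure (hypothetical helper, base R only)
esd_sketch <- function(x, k = 3, alpha = 0.05) {
  n <- length(x)
  idx <- seq_along(x)  # observations still in the sample
  res <- data.frame(
    i = seq_len(k), value = NA_real_, obs = NA_integer_,
    R = NA_real_, lambda = NA_real_
  )
  for (i in seq_len(k)) {
    # Most extreme remaining observation, in standard deviation units
    dev <- abs(x[idx] - mean(x[idx]))
    j <- which.max(dev)
    res$value[i] <- x[idx[j]]
    res$obs[i] <- idx[j]
    res$R[i] <- dev[j] / sd(x[idx])
    # Critical value for step i, based on a Student t quantile
    p <- 1 - alpha / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    res$lambda[i] <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))
    idx <- idx[-j]  # drop it before the next step
  }
  # Number of outliers: the largest step at which R exceeds lambda
  exceed <- which(res$R > res$lambda)
  n_out <- if (length(exceed) > 0) max(exceed) else 0L
  res$outlier <- res$i <= n_out
  res
}

# Toy data: 25 unremarkable values plus one clearly extreme value (15)
x <- c(-0.6, 0.2, -0.8, 1.6, 0.3, -0.8, 0.5, 0.7, 0.6, -0.3, 1.5, 0.4, -0.6,
       -2.2, 1.1, 0.0, 0.0, 0.9, 0.8, 0.6, 0.9, 0.8, 0.1, -2.0, 0.6, 15)
res <- esd_sketch(x, k = 3)
res
```

Declaring as outliers all observations up to the largest step where R exceeds lambda, rather than stopping at the first non-significant step, is what helps against masking.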

Additional remarks

You will find many other methods to detect outliers:

  1. in the {outliers} package,
  2. via the lofactor() function from the {DMwR} package: Local Outlier Factor (LOF) is an algorithm used to identify outliers by comparing the local density of a point with that of its neighbors,
  3. the outlierTest() function from the {car} package gives the most extreme observation based on the given model and allows us to test whether it is an outlier,
  4. in the {OutlierDetection} package, and
  5. with the aq.plot() function from the {mvoutlier} package (Thanks KTR for the suggestion.):

library(mvoutlier)

Y <- as.matrix(ggplot2::mpg[, c("cyl", "hwy")])
res <- aq.plot(Y)

Note also that some transformations may “naturally” eliminate outliers. The natural log or square root of a value reduces the variation caused by extreme values, so in some cases applying these transformations will eliminate the outliers.
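
A quick way to check whether a transformation has this effect on a given variable is to rerun the IQR criterion on the transformed values. The sketch below uses a made-up vector x (not hwy):

```r
x <- c(1:9, 30)  # toy data with one large value

out_raw  <- boxplot.stats(x)$out        # flagged on the original scale
out_sqrt <- boxplot.stats(sqrt(x))$out  # still flagged after a square root
out_log  <- boxplot.stats(log(x))$out   # no longer flagged after a natural log

lengths(list(raw = out_raw, sqrt = out_sqrt, log = out_log))
```

In this toy case the natural log is enough to bring the large value inside the whiskers while the square root is not, which illustrates that the effect depends on both the transformation and the data.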

Thanks for reading. I hope this article helped you to detect outliers in R via several descriptive statistics (including minimum, maximum, histogram, boxplot and percentiles) or thanks to more formal techniques of outlier detection (including the Hampel filter and the Grubbs, Dixon and Rosner tests). It is now your turn to verify them, and if they are correct, decide how to treat them (i.e., keeping, removing or imputing them) before conducting your analyses.

As always, if you have a question or a suggestion related to the topic covered in this article, please add it as a comment so other readers can benefit from the discussion.

  1. The default is 3 (according to Pearson’s rule), but another value is also possible.
  2. In order to avoid flawed conclusions, it is important to pre-screen the data (graphically with a boxplot for example) to make the selection of the number of potential outliers as accurate as possible prior to running Rosner’s test.

Originally published at https://www.statsandr.com on August 11, 2020.

Translated from: https://towardsdatascience.com/outliers-detection-in-r-6c835f14e554
