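stats | Linear Regression (Part 4): Significance Tests and Model Evaluation

This post covers significance tests and model evaluation for linear regression. The example data are the same as in the previous post: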

DATA <- mtcars[, c("mpg", "wt", "qsec", "drat")]

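6 Significance tests

Significance tests are mainly used to decide whether a variable needs to stay in the model formula. The F-test and the t-test are the most commonly used.

Fit the following model:

model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)
summary(model)
##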
## Call:
## lm(formula = mpg ~ wt + I(wt^2) + qsec, data = DATA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2200 -1.2521 -0.6288  0.9357  5.1761
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  32.6418     5.6768   5.750 3.59e-06 ***
## wt          -12.4331     2.0842  -5.965 2.01e-06 ***
## I(wt^2)       1.0730     0.2970   3.613 0.001174 **
## qsec          0.8599     0.2236   3.846 0.000634 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.182 on 28 degrees of freedom
## Multiple R-squared:  0.8816, Adjusted R-squared:  0.8689
## F-statistic:  69.5 on 3 and 28 DF,  p-value: 4.345e-13

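6.1 F-test

The F-test checks the overall significance of the model.

Null hypothesis: none of the explanatory variables affects the response, i.e. all regression coefficients other than the intercept equal 0;
Alternative hypothesis: at least one explanatory variable affects the response, i.e. at least one of those coefficients differs from 0.

If the p-value is below the chosen significance level, reject the null hypothesis in favor of the alternative.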

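The overall F statistic of the regression model is:

$$F = \frac{SSR / p}{SSE / (n - p - 1)}$$

where

$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The F statistic has two degrees of freedom: the first equals the number of explanatory variables $p$, and the second is $n - p - 1$, where $n$ is the sample size.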
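The formula can be checked by hand. The following is a minimal sketch in base R, assuming the model defined earlier in this post; the names `F_manual`, `p_manual` and `F_summary` are illustrative and do not appear in the original.

```r
# Refit the model from this post so the snippet is self-contained.
DATA  <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)

y     <- DATA$mpg
y_hat <- fitted(model)
n     <- nrow(DATA)               # sample size
p     <- length(coef(model)) - 1  # number of predictors, intercept excluded

SSR <- sum((y_hat - mean(y))^2)   # regression sum of squares
SSE <- sum((y - y_hat)^2)         # residual sum of squares

# F statistic and its upper-tail p-value on (p, n - p - 1) degrees of freedom
F_manual <- (SSR / p) / (SSE / (n - p - 1))
p_manual <- pf(F_manual, p, n - p - 1, lower.tail = FALSE)

# summary() stores the statistic as a vector named value, numdf, dendf
F_summary <- summary(model)$fstatistic
c(manual = F_manual, summary = unname(F_summary["value"]))
p_manual
```

The two F values should agree, and the degrees of freedom should come out as 3 and 28, matching the last line of the summary output.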

The last line of the summary output reports this F-test:

summary(model)
## F-statistic:  69.5 on 3 and 28 DF,  p-value: 4.345e-13

summary tests all explanatory variables jointly. To test only some of them, use the linearHypothesis function from the car package:

linearHypothesis(model, hypothesis.matrix, rhs=NULL,test=c("F", "Chisq"), vcov.=NULL, white.adjust=c(FALSE, TRUE, "hc3", "hc0","hc1", "hc2", "hc4"), singular.ok=FALSE, ...)

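hypothesis.matrix: a character vector of coefficient names or of null hypotheses written symbolically (a numeric matrix is also accepted);

rhs: the hypothesized values of the coefficients on the right-hand side of the null hypotheses; defaults to 0 when omitted.

An F-test on a single variable is equivalent to its t-test:

library(car)
linearHypothesis(model, c("qsec"), test = "F")
## Linear hypothesis test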
##
## Hypothesis:
## qsec = 0
##
## Model 1: restricted model
## Model 2: mpg ~ wt + I(wt^2) + qsec
##
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     29 203.75
## 2     28 133.31  1    70.431 14.793 0.0006339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
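As a quick check of this equivalence, the sketch below squares the t statistic of qsec and compares it with the F statistic obtained by comparing the model with and without qsec. It uses only base R; the object names `restricted`, `t_qsec` and `F_qsec` are illustrative and not from the original post.

```r
DATA  <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)

# t statistic and p-value of qsec from the coefficient table
t_qsec <- coef(summary(model))["qsec", "t value"]
p_t    <- coef(summary(model))["qsec", "Pr(>|t|)"]

# F-test of the same hypothesis: compare against the model without qsec
restricted <- lm(mpg ~ wt + I(wt^2), data = DATA)
cmp    <- anova(restricted, model)
F_qsec <- cmp$F[2]
p_F    <- cmp$`Pr(>F)`[2]

c(t_squared = t_qsec^2, F = F_qsec)  # the two statistics coincide
c(p_t = p_t, p_F = p_F)              # and so do the p-values
```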

An F-test on all explanatory variables gives the same result as summary:

linearHypothesis(model, c("wt", "I(wt^2)", "qsec"))
## Linear hypothesis test
##
## Hypothesis:
## wt = 0
## I(wt^2) = 0
## qsec = 0
##
## Model 1: restricted model
## Model 2: mpg ~ wt + I(wt^2) + qsec
##
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1     31 1126.05
## 2     28  133.31  3    992.73 69.501 4.345e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An F-test on a subset of the variables:

linearHypothesis(model, c("wt", "I(wt^2)"))
## Linear hypothesis test
##
## Hypothesis:
## wt = 0
## I(wt^2) = 0
##
## Model 1: restricted model
## Model 2: mpg ~ wt + I(wt^2) + qsec
##
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1     30 928.66
## 2     28 133.31  2    795.34 83.522 1.579e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the output, at least one of the two terms wt and I(wt^2) has a regression coefficient significantly different from 0.
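The same partial F-test can be reproduced from the residual sums of squares of the restricted and the full model. This is a sketch under the assumption that the restricted model is mpg ~ qsec (the full model with wt and I(wt^2) dropped); the names `restricted`, `rss_r`, `rss_f` and `q` are hypothetical helpers.

```r
DATA  <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)

# Restricted model: both wt terms forced to zero
restricted <- lm(mpg ~ qsec, data = DATA)

rss_r <- deviance(restricted)  # RSS of the restricted model
rss_f <- deviance(model)       # RSS of the full model
df_f  <- df.residual(model)    # residual degrees of freedom of the full model
q     <- df.residual(restricted) - df_f  # number of restrictions tested

# Partial F statistic and its p-value
F_partial <- ((rss_r - rss_f) / q) / (rss_f / df_f)
p_partial <- pf(F_partial, q, df_f, lower.tail = FALSE)
c(F = F_partial, p = p_partial)

# anova() on the two nested fits performs the same comparison
anova(restricted, model)
```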

The null hypothesis can also state that a coefficient equals 1 (or any other value, as needed):

# The two calls below are equivalent
linearHypothesis(model, c("I(wt^2) = 1"))
linearHypothesis(model, c("I(wt^2)"), 1)
## Linear hypothesis test
##
## Hypothesis:
## I(wt^2) = 1
##
## Model 1: restricted model
## Model 2: mpg ~ wt + I(wt^2) + qsec
##
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     29 133.60
## 2     28 133.31  1   0.28786 0.0605 0.8076

According to the output, the regression coefficient of I(wt^2) is not significantly different from 1.
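For a single coefficient, a test against a non-zero value can also be worked out from the coefficient table. The sketch below assumes the usual statistic (estimate minus hypothesized value, divided by the standard error); `t_stat` and `p_val` are illustrative names.

```r
DATA  <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)

tab <- coef(summary(model))
est <- tab["I(wt^2)", "Estimate"]
se  <- tab["I(wt^2)", "Std. Error"]

# t statistic for H0: coefficient = 1, with a two-sided p-value
t_stat <- (est - 1) / se
p_val  <- 2 * pt(abs(t_stat), df.residual(model), lower.tail = FALSE)

# t_stat^2 corresponds to the F value in the linearHypothesis table
c(t = t_stat, F = t_stat^2, p = p_val)
```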

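6.2 t-test

The t-test checks the significance of a single variable in the model. The statistic is the coefficient estimate divided by its standard error, compared with a t distribution on the residual degrees of freedom.

From the summary output:

summary(model)
## Coefficients: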
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  32.6418     5.6768   5.750 3.59e-06 ***
## wt          -12.4331     2.0842  -5.965 2.01e-06 ***
## I(wt^2)       1.0730     0.2970   3.613 0.001174 **
## qsec          0.8599     0.2236   3.846 0.000634 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Alternatively:

coef(summary(model))
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  32.6418325  5.6767588  5.750083 3.592789e-06
## wt          -12.4330965  2.0841792 -5.965464 2.008329e-06
## I(wt^2)       1.0730270  0.2969987  3.612901 1.173891e-03
## qsec          0.8598587  0.2235665  3.846099 6.338842e-04
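The coefficient table can be rebuilt from the estimates and their covariance matrix. This is a minimal sketch using base R functions (coef, vcov, df.residual, pt); the variable names are illustrative.

```r
DATA  <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model <- lm(mpg ~ wt + I(wt^2) + qsec, data = DATA)

est   <- coef(model)               # coefficient estimates
se    <- sqrt(diag(vcov(model)))   # standard errors
t_val <- est / se                  # t statistics for H0: coefficient = 0
p_val <- 2 * pt(abs(t_val), df.residual(model), lower.tail = FALSE)

# Same layout as coef(summary(model))
cbind(Estimate = est, "Std. Error" = se, "t value" = t_val, "Pr(>|t|)" = p_val)
```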

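7 Model evaluation

The goal of model evaluation is to select the best model formula, that is, to decide by comparison whether certain variables should be included, balancing goodness of fit against parsimony.

Fit the following three models: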

model.1 <- lm(mpg ~ wt , data = DATA)
model.2 <- lm(mpg ~ wt + I(wt^2), data = DATA)
model.3 <- lm(mpg ~ wt + I(wt^2) + drat, data = DATA)

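7.1 R-squared

R-squared measures how well the model fits the data:

$$R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}$$

$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ is the regression sum of squares, the amount of information explained by the model;

$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares, the amount of information not explained by the model;

$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares, the total amount of information in the raw data; for a least-squares fit with an intercept, $SST = SSR + SSE$.

In general $1 - SSE/SST$ ranges over $(-\infty, 1]$, but for a linear model fitted by least squares with an intercept it equals the squared Pearson correlation coefficient between the observed and fitted values of the response, so its range is $[0, 1]$.

A larger R-squared indicates a better fit. However, R-squared never decreases when variables are added, so to account for parsimony, model evaluation usually relies on the adjusted R-squared:

$$R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

$n$ and $p$ are the sample size and the number of explanatory variables.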

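The second-to-last line of the summary output reports R-squared.

summary(model.1)
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446

summary(model.2)
## Multiple R-squared:  0.8191, Adjusted R-squared:  0.8066

summary(model.3)
## Multiple R-squared:  0.8193, Adjusted R-squared:  0.7999

Across the three models R-squared increases together with the number of explanatory variables, but model.2 has the largest adjusted R-squared, so it can be regarded as the best of the three.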
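Both quantities can be recomputed by hand for one of the models. The sketch below does this for model.2, assuming the standard definitions of R-squared and adjusted R-squared for a least-squares fit with an intercept; `R2`, `R2_cor` and `R2_adj` are illustrative names.

```r
DATA    <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model.2 <- lm(mpg ~ wt + I(wt^2), data = DATA)

y     <- DATA$mpg
y_hat <- fitted(model.2)
n     <- nrow(DATA)                 # sample size
p     <- length(coef(model.2)) - 1  # number of predictors, intercept excluded

SSE <- sum((y - y_hat)^2)           # residual sum of squares
SST <- sum((y - mean(y))^2)         # total sum of squares

R2     <- 1 - SSE / SST
R2_cor <- cor(y, y_hat)^2           # squared Pearson correlation, same value
R2_adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)

c(R2 = R2, R2_cor = R2_cor, R2_adj = R2_adj)

# The values stored by summary() for comparison
c(summary(model.2)$r.squared, summary(model.2)$adj.r.squared)
```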

7.2 Analysis of variance

Analysis of variance compares nested models with a significance test, either an F-test or a chi-square test.

F-test:

anova(model.1, model.2, model.3, test = "F")
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + I(wt^2)
## Model 3: mpg ~ wt + I(wt^2) + drat
##   Res.Df    RSS Df Sum of Sq       F   Pr(>F)
## 1     30 278.32
## 2     29 203.75  1    74.576 10.2627 0.003375 **
## 3     28 203.47  1     0.276  0.0379 0.847005
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

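According to the output, model.2 is significantly better than model.1, but model.3 is not significantly better than model.2, so model.2 can be regarded as the best model.

Chi-square test:

anova(model.1, model.2, model.3, test = "Chisq")
## Analysis of Variance Table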
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ wt + I(wt^2)
## Model 3: mpg ~ wt + I(wt^2) + drat
##   Res.Df    RSS Df Sum of Sq Pr(>Chi)
## 1     30 278.32
## 2     29 203.75  1    74.576 0.001357 **
## 3     28 203.47  1     0.276 0.845599
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The conclusion is the same as with the F-test, although the p-values differ slightly.
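To see where the two sets of p-values come from, the sketch below recomputes them from the columns of the anova table. It assumes, as anova does for a list of lm fits, that the scale is the residual mean square of the largest model; the names `scale_hat`, `p_F` and `p_chi` are illustrative.

```r
DATA    <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model.1 <- lm(mpg ~ wt, data = DATA)
model.2 <- lm(mpg ~ wt + I(wt^2), data = DATA)
model.3 <- lm(mpg ~ wt + I(wt^2) + drat, data = DATA)

tab <- anova(model.1, model.2, model.3)          # default test is "F"
big <- which.min(tab$Res.Df)                     # largest model: fewest residual df
scale_hat <- tab$RSS[big] / tab$Res.Df[big]      # its residual mean square

rows  <- 2:3                                     # rows that carry a comparison
dfs   <- tab$Df[rows]                            # df of each added block of terms
sumsq <- tab$`Sum of Sq`[rows]                   # drop in RSS at each step

# F-test: the estimated scale brings in the residual df of the largest model
F_val <- (sumsq / dfs) / scale_hat
p_F   <- pf(F_val, dfs, tab$Res.Df[big], lower.tail = FALSE)

# Chi-square test: the scale is treated as known
p_chi <- pchisq(sumsq / scale_hat, dfs, lower.tail = FALSE)

data.frame(F = F_val, p_F = p_F, p_chi = p_chi)
```

Because the chi-square version ignores the uncertainty in the estimated scale, its p-values are somewhat smaller than those of the F-test on a small sample like this one.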

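7.3 Information criteria

Information criteria balance the likelihood of a model against its simplicity. The most common are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):

$$AIC = -2\ln L + 2k$$

$$BIC = -2\ln L + k\ln n$$

$L$ is the likelihood of the model;

$n$ is the sample size and $k$ is the number of estimated parameters. For a model fitted with lm, R counts the regression coefficients (including the intercept) plus the residual variance, so $k = p + 2$ for $p$ explanatory variables.

AIC and BIC can take any real value; the smaller the value, the better the model.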

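The logLik function in the stats package returns the log-likelihood of a model:

lnL <- logLik(model.1)
lnL
## 'log Lik.' -80.01471 (df=3)

# Compute the AIC of model.1 by hand; the penalty uses the df reported above (3)
-2*lnL + 2*attr(lnL, "df")
## 'log Lik.' 166.0294 (df=3)

The AIC and BIC functions in the stats package compute the two criteria:

AIC(model.1, model.2, model.3)
##         df      AIC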
## model.1  3 166.0294
## model.2  4 158.0484
## model.3  5 160.0051

According to the output, model.2 has the smallest AIC, so it can be regarded as the best model.

BIC(model.1, model.2, model.3)
##         df      BIC
## model.1  3 170.4266
## model.2  4 163.9113
## model.3  5 167.3338

BIC leads to the same conclusion as AIC.
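Both criteria can be derived from the residual sum of squares alone. The sketch below assumes the Gaussian log-likelihood that R uses for an unweighted lm fit, evaluated at the maximum-likelihood variance estimate RSS/n; `lnL_manual`, `aic_manual` and `bic_manual` are illustrative names.

```r
DATA    <- mtcars[, c("mpg", "wt", "qsec", "drat")]
model.1 <- lm(mpg ~ wt, data = DATA)

n   <- nobs(model.1)                 # sample size
rss <- deviance(model.1)             # residual sum of squares
k   <- attr(logLik(model.1), "df")   # estimated parameters: 2 coefficients + variance

# Gaussian log-likelihood with the variance set to RSS / n
lnL_manual <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic_manual <- -2 * lnL_manual + 2 * k
bic_manual <- -2 * lnL_manual + k * log(n)

c(logLik = lnL_manual, AIC = aic_manual, BIC = bic_manual)
```

The penalty per parameter is 2 for AIC and log(n) for BIC, so with n = 32 BIC penalizes additional variables more heavily.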
