Linear Regression Model Metrics 1: SST, SSR, SSE, and R-Squared

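This article explains the metrics used to evaluate a linear regression model and walks through how to calculate them with an example.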

Overview of Model Metrics

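Linear regression finds the line that best fits a dataset. Three different sums of squares are commonly used to measure how well the regression line actually fits the data.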

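Sum of Squares Total (SST)
The sum of squared differences between each observation yi and the mean of the response variable ȳ.
Sum of Squares Regression (SSR)
The sum of squared differences between each predicted value ŷi and the mean of the response variable ȳ.
Sum of Squares Error (SSE)
The sum of squared differences between each predicted value ŷi and the corresponding observed value yi.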

The three quantities are related as follows:

SST = SSR + SSE

Because of this relationship, once any two of the three values are known, the third can be computed from the formula above.
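In symbols, the three definitions can be written as below. The notation is assumed here for illustration: y_i is an observed value, \hat{y}_i the corresponding predicted value, \bar{y} the mean of the response variable, and n the number of observations.

```latex
\begin{aligned}
\mathrm{SST} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \\
\mathrm{SSR} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
\mathrm{SSE} &= \sum_{i=1}^{n} (\hat{y}_i - y_i)^2
\end{aligned}
```

The identity SST = SSR + SSE holds for a least-squares fit that includes an intercept term, which is the case for the model used in this article.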

R-Squared

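R-squared, also called the coefficient of determination, measures how well a linear regression model fits a dataset. It is the proportion of the variance in the response variable that can be explained by the predictor variable. R-squared ranges from 0 to 1, and the higher the value, the better the model fits the data. A value of 0 means the predictor explains none of the variation in the response; a value of 1 means the predictor explains it perfectly.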

R-squared can be computed from SSR and SST:

R-squared = SSR / SST

Example:
If a model has an SSR of 137.5 and an SST of 156, R-squared is calculated as:

R-squared = 137.5 / 156 = 0.8814

This means that 88.14% of the total variation in the response variable can be explained by the predictor variable.
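The same arithmetic can be sketched in R. The variable names below are chosen for illustration; the two input values are the ones from the example above.

```r
# Values taken from the example above
SSR <- 137.5
SST <- 156

# R-squared is the share of the total variation captured by the regression
r_squared <- SSR / SST
round(r_squared, 4)
# [1] 0.8814

# Given any two of the three sums of squares, the third follows
SSE <- SST - SSR
SSE
# [1] 18.5
```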

Calculating SST, SSR, and SSE

Suppose we have the following dataset, which relates hours studied to exam score.
We fit a linear model in R:

examResult <- data.frame(hours = c(1, 2, 2, 3, 4, 5), score = c(68, 77, 81, 82, 88, 90))
fit <- lm(score ~ hours, data = examResult)
summary(fit)
# Call:
# lm(formula = score ~ hours, data = examResult)
#
# Residuals:
#       1       2       3       4       5       6
# -3.6923  0.2308  4.2308  0.1538  1.0769 -2.0000
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  66.6154     2.8886  23.062 2.09e-05 ***
# hours         5.0769     0.9212   5.511  0.00529 **
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 3.032 on 4 degrees of freedom
# Multiple R-squared:  0.8836,  Adjusted R-squared:  0.8546
# F-statistic: 30.38 on 1 and 4 DF,  p-value: 0.005288

The fitted model is:

score = 66.615 + 5.0769*(hours)
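As a minimal sketch of how the fitted equation can be used, the snippet below reads the unrounded coefficients from the model and predicts a score for a new value of hours. The data frame `new_student` is a hypothetical input introduced here for illustration.

```r
# Same data and model as in the article
examResult <- data.frame(hours = c(1, 2, 2, 3, 4, 5),
                         score = c(68, 77, 81, 82, 88, 90))
fit <- lm(score ~ hours, data = examResult)

# Unrounded intercept and slope of the fitted line
coef(fit)

# Predicted score for a hypothetical student who studies 3 hours
# (roughly 66.615 + 5.0769 * 3, i.e. about 81.85)
new_student <- data.frame(hours = 3)
predict(fit, newdata = new_student)
```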

The fit can be visualized as follows:

library(ggplot2)
ggplot(examResult, aes(hours, score)) +
  geom_point(shape = 21, fill = "yellow", size = 5) +
  geom_smooth(method = "lm", se = FALSE, linetype = 2)

Now that we have the best-fitting model, we compute SST, SSR, and SSE:

1. Compute the mean of the response variable

library(tidyverse)
avg <- mean(examResult$score)
examResult <- examResult %>% bind_cols(yu = avg)
examResult
#   hours score yu
# 1     1    68 81
# 2     2    77 81
# 3     2    81 81
# 4     3    82 81
# 5     4    88 81
# 6     5    90 81

2. Compute the predicted value for each observation

examResult <- examResult %>% mutate(yp = 66.615 + 5.0769 * hours)
examResult
#   hours score yu      yp
# 1     1    68 81 71.6919
# 2     2    77 81 76.7688
# 3     2    81 81 76.7688
# 4     3    82 81 81.8457
# 5     4    88 81 86.9226
# 6     5    90 81 91.9995
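The predictions above come from coefficients typed in by hand after rounding. As an alternative sketch, `fitted()` returns the model's predictions at full precision; the names `yp_exact` and `yp_rounded` are introduced here only for comparison.

```r
# Same data and model as in the article
examResult <- data.frame(hours = c(1, 2, 2, 3, 4, 5),
                         score = c(68, 77, 81, 82, 88, 90))
fit <- lm(score ~ hours, data = examResult)

# Predictions for the training rows, using unrounded coefficients
yp_exact <- unname(fitted(fit))

# Predictions from the rounded equation used in the text
yp_rounded <- 66.615 + 5.0769 * examResult$hours

# Largest absolute gap caused by rounding the coefficients (well under 0.001)
max(abs(yp_exact - yp_rounded))
```

The gap is small, but it is the reason the hand-computed sums in the following steps differ slightly from values derived directly from the model object.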

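3. Compute the total sum of squares (SST)

For example, the first student's contribution to the total sum of squares is:

(yi – ȳ)^2 = (68 – 81)^2 = 169.

The same calculation applies to every student: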

examResult <- examResult %>% mutate(SSTi = (score-yu)^2)
examResult
#   hours score yu      yp SSTi
# 1     1    68 81 71.6919  169
# 2     2    77 81 76.7688   16
# 3     2    81 81 76.7688    0
# 4     3    82 81 81.8457    1
# 5     4    88 81 86.9226   49
# 6     5    90 81 91.9995   81
SST <- sum(examResult$SSTi)
SST
# [1] 316

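SST is 316. Next we compute SSR.

4. Compute the regression sum of squares (SSR)

The first student's contribution to the regression sum of squares is:

(ŷi – ȳ)^2 = (71.6919 – 81)^2 ≈ 86.64.

The same calculation applies to every student: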

examResult <- examResult %>% mutate(SSRi = (yp-yu)^2)
examResult
#   hours score yu      yp SSTi        SSRi
# 1     1    68 81 71.6919  169  86.6407256
# 2     2    77 81 76.7688   16  17.9030534
# 3     2    81 81 76.7688    0  17.9030534
# 4     3    82 81 81.8457    1   0.7152085
# 5     4    88 81 86.9226   49  35.0771908
# 6     5    90 81 91.9995   81 120.9890002
SSR <- sum(examResult$SSRi)
SSR
# [1] 279.2282

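SSR comes out to 279.2282.

5. Compute the error sum of squares (SSE)

The first student's contribution to the error sum of squares is:

(ŷi – yi)^2 = (71.6919 – 68)^2 ≈ 13.63.

The same calculation applies to every student: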

examResult <- examResult %>% mutate(SSEi = (yp-score)^2)
examResult
#   hours score yu      yp SSTi        SSRi        SSEi
# 1     1    68 81 71.6919  169  86.6407256 13.63012561
# 2     2    77 81 76.7688   16  17.9030534  0.05345344
# 3     2    81 81 76.7688    0  17.9030534 17.90305344
# 4     3    82 81 81.8457    1   0.7152085  0.02380849
# 5     4    88 81 86.9226   49  35.0771908  1.16079076
# 6     5    90 81 91.9995   81 120.9890002  3.99800025
SSE <- sum(examResult$SSEi)
SSE
# [1] 36.76923

We can now verify the formula SST = SSR + SSE, with the values rounded to two decimals:

SST = SSR + SSE
316 = 279.23 + 36.77
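The hand calculation uses coefficients rounded to a few decimals, so SSR + SSE only matches SST after rounding. As a cross-check, the following sketch recomputes the three sums in base R from the unrounded fitted values; the names `y`, `y_hat`, and `y_bar` are chosen here for illustration.

```r
# Same data and model as in the article
examResult <- data.frame(hours = c(1, 2, 2, 3, 4, 5),
                         score = c(68, 77, 81, 82, 88, 90))
fit <- lm(score ~ hours, data = examResult)

y     <- examResult$score   # observed values
y_hat <- fitted(fit)        # predicted values at full precision
y_bar <- mean(y)            # mean of the response

SST <- sum((y - y_bar)^2)       # total sum of squares
SSR <- sum((y_hat - y_bar)^2)   # regression sum of squares
SSE <- sum((y_hat - y)^2)       # error sum of squares

c(SST = SST, SSR = SSR, SSE = SSE)

# TRUE when the identity holds up to floating-point tolerance
isTRUE(all.equal(SST, SSR + SSE))

# deviance() on an lm object returns the residual sum of squares
isTRUE(all.equal(SSE, deviance(fit)))
```

With unrounded predictions SSR is about 279.2308 instead of 279.2282, and the identity holds to floating-point precision.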

6. Compute R-squared

With SST and SSR computed above, the R-squared of the regression model is:

R-squared = SSR / SST
R-squared = 279.23 / 316
R-squared = 0.8836

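This means that 88.36% of the variation in exam scores (score) can be explained by hours studied (hours). The value matches the Multiple R-squared reported in the summary(fit) output.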
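For comparison, R-squared can also be read directly from the model summary. In simple linear regression with one predictor it equals the squared Pearson correlation between predictor and response. The sketch below checks both; `r2_summary` and `r2_cor` are illustrative names.

```r
# Same data and model as in the article
examResult <- data.frame(hours = c(1, 2, 2, 3, 4, 5),
                         score = c(68, 77, 81, 82, 88, 90))
fit <- lm(score ~ hours, data = examResult)

# R-squared as stored in the summary object
r2_summary <- summary(fit)$r.squared

# Squared correlation between predictor and response
r2_cor <- cor(examResult$hours, examResult$score)^2

round(c(summary = r2_summary, cor_squared = r2_cor), 4)
#     summary cor_squared
#      0.8836      0.8836
```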