Basic Introduction

In this chapter, we review some of the key ideas underlying the linear regression model, as well as the least squares approach that is most commonly used to fit this model.

Basic form:

Y ≈ β0 + β1X

Here "≈" means "is approximately modeled as". To estimate the parameters, by far the most common approach is to minimize the least squares criterion. Let the training samples be (x1, y1), ..., (xn, yn) and define the residual sum of squares (RSS) as

RSS = Σ_i (yi − β̂0 − β̂1 xi)^2

Minimizing the RSS gives the least squares estimates

β̂1 = Σ_i (xi − x̄)(yi − ȳ) / Σ_i (xi − x̄)^2,   β̂0 = ȳ − β̂1 x̄
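As a quick illustration, here is a minimal NumPy sketch on simulated data (all variable names and the toy data are illustrative, not from the text) that computes β̂0 and β̂1 directly from the closed-form formulas above:

```python
import numpy as np

# Toy data (illustrative only): y is roughly 2 + 3*x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2 + 3 * x + rng.normal(scale=1.0, size=50)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares estimates for simple linear regression
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Residual sum of squares of the fitted line
rss = np.sum((y - (beta0_hat + beta1_hat * x)) ** 2)
print(beta0_hat, beta1_hat, rss)
```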

About the result: in the ideal case, that is with unlimited samples, the fit approaches a regression line called the population regression line; with a limited sample we obtain the least squares regression line.

Population regression line: the true best-fitting line, defined over the whole population.

Least squares regression line: the estimate of that line computed from the limited sample at hand.

To evaluate how well an estimate approximates the true value, we use its standard error; take estimating the mean value as an example. The variance of the sample mean is

Var(μ̂) = SE(μ̂)^2 = σ^2 / n,

where σ is the standard deviation of each of the realizations yi (i = 1, 2, ..., n; y1, ..., yn are assumed uncorrelated) and n is the number of samples.

Definition of standard deviation:

Suppose X is a random variable with mean value μ = E[X]. The standard deviation is the square root of the variance:

σ = sqrt( E[(X − μ)^2] )

About the computation of the standard deviation from data: let x1, ..., xN be the samples (N is the number of limited samples); then the sample standard deviation is

s = sqrt( (1/(N−1)) Σ_i (xi − x̄)^2 )

Note: with a limited sample we divide by N − 1, which is the common case (the sample standard deviation); in the ideal case where the whole population is available we divide by N.
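A small NumPy sketch of this distinction (the sample values are made up for illustration): ddof=1 divides by N − 1, ddof=0 divides by N.

```python
import numpy as np

samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

# Sample standard deviation: divide by N - 1 (the common case)
s_sample = samples.std(ddof=1)

# "Population" standard deviation: divide by N
s_population = samples.std(ddof=0)

# Standard error of the sample mean: sigma / sqrt(n), with sigma estimated by s_sample
se_mean = s_sample / np.sqrt(len(samples))
print(s_sample, s_population, se_mean)
```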

Similarly, the standard errors of the coefficient estimates are given by the following formulas:

SE(β̂1)^2 = σ^2 / Σ_i (xi − x̄)^2,   SE(β̂0)^2 = σ^2 [ 1/n + x̄^2 / Σ_i (xi − x̄)^2 ],

Here SE means standard error. For the sample mean, the standard error is the standard deviation divided by sqrt(n); in general, the standard error measures the accuracy of an estimate, while the standard deviation measures the spread of the data.

In practice σ is unknown, so we estimate it with the residual standard error (RSE), RSE = sqrt( RSS / (n − 2) ).


The approximate 95% confidence intervals for β1 and β0 are

β̂1 ± 2·SE(β̂1)   and   β̂0 ± 2·SE(β̂0).

As an example, we can use the t-statistic

t = β̂1 / SE(β̂1)

to test the null hypothesis that β1 is zero, i.e. that there is no relationship between X (predictor) and Y (response); the associated p-value measures how likely such a value of t would be if β1 were truly zero.
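To tie these formulas together, here is a minimal NumPy/SciPy sketch on simulated data (all names and the toy data are illustrative) that computes the standard errors, the approximate 95% confidence interval for β1, and the t-test of H0: β1 = 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2 + 3 * x + rng.normal(scale=1.0, size=n)

x_bar = x.mean()
beta1 = np.sum((x - x_bar) * (y - y.mean())) / np.sum((x - x_bar) ** 2)
beta0 = y.mean() - beta1 * x_bar

# RSE estimates sigma; it uses n - 2 degrees of freedom in simple regression
rss = np.sum((y - beta0 - beta1 * x) ** 2)
rse = np.sqrt(rss / (n - 2))

# Standard errors of the coefficient estimates
se_beta1 = rse / np.sqrt(np.sum((x - x_bar) ** 2))
se_beta0 = rse * np.sqrt(1 / n + x_bar ** 2 / np.sum((x - x_bar) ** 2))

# Approximate 95% confidence interval: estimate +/- 2 * SE
ci_beta1 = (beta1 - 2 * se_beta1, beta1 + 2 * se_beta1)

# t-statistic and two-sided p-value for H0: beta1 = 0
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)
print(ci_beta1, t_stat, p_value)
```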

Assessing the Accuracy of the Model
The quality of the fit is typically assessed using two criteria:

1. Residual Standard Error

RSE = sqrt( RSS / (n − 2) )   (for simple linear regression; with p predictors, RSE = sqrt( RSS / (n − p − 1) ))

2. R^2 Statistic

R^2 = (TSS − RSS) / TSS = 1 − RSS / TSS,   where TSS = Σ_i (yi − ȳ)^2 is the total sum of squares.

TSS measures the total variance in the response Y, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression. An R^2 statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression. On the other hand, we can also use the correlation Cor(X, Y) to assess the fit of the linear model; in the simple linear model, R^2 = Cor(X, Y)^2.

R^2 is normalized: unlike RSE, it always lies between 0 and 1 regardless of the scale of Y. When the actual line is steeper, TSS is larger, and because of the relation RSS = (1 − R^2)·TSS, RSS is also larger.
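A minimal sketch on simulated data (illustrative names) computing RSE and R^2, and checking that R^2 equals the squared correlation in simple linear regression:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 1 + 2 * x + rng.normal(scale=0.5, size=n)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

rss = np.sum((y - y_hat) ** 2)        # unexplained variability
tss = np.sum((y - y.mean()) ** 2)     # total variability

r_squared = 1 - rss / tss
rse = np.sqrt(rss / (n - 2))

# In simple linear regression R^2 equals the squared correlation Cor(X, Y)^2
cor_xy = np.corrcoef(x, y)[0, 1]
print(rse, r_squared, cor_xy ** 2)    # the last two agree up to floating point error
```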

Multiple Linear Regression

The model becomes Y = β0 + β1X1 + β2X2 + ... + βpXp + ε, and we can again use least squares to get the coefficient estimates β̂0, β̂1, ..., β̂p that minimize RSS = Σ_i (yi − ŷi)^2.

We also use the F-statistic to test whether there is any relationship between the response and the predictors:

F = [ (TSS − RSS) / p ] / [ RSS / (n − p − 1) ]

An explanation of the above expression: RSS represents the variability left unexplained and TSS is the total variability. Since the model estimates p + 1 coefficients from n observations, RSS has n − p − 1 degrees of freedom, while TSS − RSS, the variability explained by the p predictors, has p degrees of freedom.
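As an illustration, the following NumPy/SciPy sketch fits a multiple regression by least squares on simulated data and computes the overall F-statistic and its p-value from the formula above (all names and the simulated coefficients are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = 1 + X @ np.array([2.0, 0.0, -1.5]) + rng.normal(size=n)

# Least squares fit with an intercept column
X1 = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

# F-statistic for H0: beta_1 = ... = beta_p = 0
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(f_stat, p, n - p - 1)
print(f_stat, p_value)
```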

Subtopic 0: Hypothesis Testing in Single and Multiple Linear Regression

(The referenced figure is omitted here.)

Note: the 0 in Q剩 (Q_residual, the residual sum of squares in that figure's notation) should be changed to 1.

Subtopic 1: whether each of the predictors is useful in predicting the response.

The F-statistic indicates which of the hypotheses

H0: β1 = β2 = ... = βp = 0   versus   Ha: at least one βj is non-zero

is supported by the data. If there is no relationship between the response and the predictors, we expect the F-statistic to be close to 1; if Ha is true, we expect it to be greater than 1.

It turns out that the answer depends on the values of n and p. When n is large, an F-statistic that is just a little larger than 1 might still provide evidence against H0. In contrast, a larger F-statistic is needed to reject H0 if n is small.

When H0 is true and the errors εi have a normal distribution, the F-statistic follows an F-distribution. For any given value of n and p, any statistical software package can be used to compute the p-value associated with the F-statistic using this distribution. Based on this p-value, we can determine whether or not to reject H0.

Here the p-value is defined as the probability, under the null hypothesis H0, of obtaining a result equal to or more extreme than the one actually observed. The reason a small p-value indicates a relationship between at least one of the p predictors and the response is that the F-statistic follows an F-distribution only when H0 is true (and the errors εi are normally distributed); a small p-value means the observed F-statistic would be very unlikely under H0, so H0 does not adequately explain the observation. The smaller the p-value, the more suspect H0 is.

Sometimes we only want to test a subset of the parameters, say the last q of them (H0: βp−q+1 = ... = βp = 0). We fit a second model that uses all the variables except those last q; suppose that its residual sum of squares is RSS0. The degrees of freedom of RSS0 are q larger than those of RSS, and the appropriate F-statistic is

F = [ (RSS0 − RSS) / q ] / [ RSS / (n − p − 1) ]

When q = 1, this F-statistic is equivalent to the square of the t-statistic for that variable, so it reports the partial effect of adding that variable to the model.

Each such F-statistic (or t-statistic) has an associated p-value.
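Here is a sketch of the partial F-test on simulated data (the helper rss_of_fit, the data, and the choice of q are illustrative, not from the text):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p, q = 150, 4, 2            # test whether the last q of the p predictors matter
X = rng.normal(size=(n, p))
y = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

def rss_of_fit(Xmat, y):
    """RSS of an ordinary least squares fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), Xmat])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

rss_full = rss_of_fit(X, y)            # model with all p predictors
rss0 = rss_of_fit(X[:, :p - q], y)     # model omitting the last q predictors

# Partial F-statistic for H0: the last q coefficients are all zero
f_stat = ((rss0 - rss_full) / q) / (rss_full / (n - p - 1))
p_value = stats.f.sf(f_stat, q, n - p - 1)
print(f_stat, p_value)
```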

Note: we should not rely only on the individual p-values for each predictor; we should also examine the overall F-statistic. When the number of predictors is large, some individual p-values will fall below 0.05 by chance even if no predictor is truly related to the response. The F-statistic adjusts for this: if H0 is true, there is only a 5% chance that the F-statistic will result in a p-value below 0.05, regardless of the number of predictors or the number of observations.
Subtopic 2: Importance of variables
To determine which predictors are important, we can try the following variable selection methods:

Method 1: Forward Selection. We begin with the null model, then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS, repeating the step with the enlarged model (a minimal sketch of this procedure follows this list).

Method 2: Backward Selection. We start with all variables in the model and repeatedly remove the variable with the largest p-value.

Method 3: Mixed Selection.

Process: We start with no variables in the model and, as in forward selection, add one at a time the variable that provides the best fit.

Stop: If at any point the p-value for one of the variables in the model rises above a certain threshold, then we remove that variable from the model. We continue to perform these forward and backward steps until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added to the model.
Comment: Backward selection cannot be used if p > n, while forward selection can always be used. Forward selection is a greedy approach, and might include variables early that later become redundant. Mixed selection can remedy this.
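Below is a minimal sketch of forward selection on simulated data, assuming we rank variables purely by the drop in RSS at each step; a real implementation would also apply a stopping rule (e.g. a p-value or adjusted-R^2 criterion), and the helper rss_of_fit is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 120, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 2] - 1.5 * X[:, 4] + rng.normal(size=n)

def rss_of_fit(cols):
    """RSS of an OLS fit (with intercept) using the given predictor columns."""
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols]) if cols else np.ones((n, 1))
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

selected, remaining = [], list(range(p))
for _ in range(p):
    # Greedily add the variable that gives the lowest RSS at this step
    best_j = min(remaining, key=lambda j: rss_of_fit(selected + [j]))
    selected.append(best_j)
    remaining.remove(best_j)
    print(f"added X{best_j}, RSS = {rss_of_fit(selected):.2f}")
```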
Subtopic 3: Model fitting quality

Two criteria: R^2 (the fraction of variance explained) and RSE. When adding a predictor, if the RSE is greatly reduced, then the predictor is useful. But since RSE = sqrt( RSS / (n − p − 1) ), models with more variables can have a higher RSE if the decrease in RSS is small relative to the increase in p.

In addition to looking at the RSE and R2 statistics just discussed, it can be useful to plot the data.

Graphical summaries can reveal problems with a model that are not visible from numerical statistics. For example, Figure below displays a three-dimensional plot of TV and radio versus sales. We see that some observations lie above and some observations lie below the least squares regression plane. In particular, the linear model seems to overestimate sales for instances in which most of the advertising money was spent exclusively on either TV or radio. It underestimates sales for instances where the budget was split between the two media. This pronounced non-linear pattern cannot be modeled accurately using linear regression. It suggests a synergy or interaction effect between the advertising media, whereby combining the media together results in a bigger boost to sales than using any single medium. In Section 3.3.2, we will discuss extending the linear model to accommodate such synergistic effects through the use of interaction terms.

Qualitative Predictors
A qualitative predictor takes on a small number of discrete values, e.g. two levels (binary); it can be incorporated into the regression by encoding it as a 0/1 dummy variable, while the response remains quantitative.
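A small sketch of the dummy-variable encoding on made-up data (the two levels "A"/"B" and the response values are illustrative):

```python
import numpy as np

# Qualitative predictor with two levels and a quantitative response (toy data)
levels = np.array(["A", "B", "A", "B", "B", "A", "A", "B"])
y = np.array([10.0, 14.0, 9.5, 15.2, 13.8, 10.4, 9.9, 14.6])

x_dummy = (levels == "B").astype(float)   # 1 if level B, 0 if level A

# Regressing y on the dummy (with intercept) gives:
#   beta0 = mean of group A,  beta0 + beta1 = mean of group B
beta1 = np.sum((x_dummy - x_dummy.mean()) * (y - y.mean())) / np.sum((x_dummy - x_dummy.mean()) ** 2)
beta0 = y.mean() - beta1 * x_dummy.mean()
print(beta0, beta1)
```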

Further introduction to the linear model

Two basic assumptions:

1. Additivity: the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors.

2. Linearity: the change in the response Y associated with a one-unit change in Xj is constant, regardless of the value of Xj.

There are some flaws in these assumptions, e.g. a synergy (interaction) effect: spreading an increase across both predictors can contribute more to the response than putting the whole increase into a single predictor. For example, we can modify

Y = β0 + β1X1 + β2X2 + ε

to

Y = β0 + β1X1 + β2X2 + β3X1X2 + ε.

Non-linear relationships:

For example, if the linear model fits poorly (e.g. a low R^2 value), we can change Y = a + bX1 to Y = a + bX1 + cX1^2 and, by writing X2 = X1^2, represent it as Y = a + bX1 + cX2, so standard linear regression software can still be used to estimate a, b and c.
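The following sketch on simulated data shows that both an interaction term X1·X2 and a polynomial term X1^2 are just extra columns in the design matrix, so ordinary least squares applies unchanged (the data and coefficient values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x1 = rng.uniform(0, 5, size=n)
x2 = rng.uniform(0, 5, size=n)
# Illustrative response with an interaction and a quadratic effect
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + 0.3 * x1 ** 2 + rng.normal(size=n)

# Interaction and polynomial terms become extra columns; OLS is unchanged
X = np.column_stack([np.ones(n), x1, x2, x1 * x2, x1 ** 2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # estimates of (intercept, x1, x2, x1*x2, x1^2) coefficients
```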

Potential problems:

1. Non-linearity

Inspect the residual plot (residuals versus fitted values) for any discernible pattern; if one is present, the true relationship is probably not linear. In the referenced figure, the left panel shows an obvious pattern, while the right panel does not.

2. Correlation of Error Terms

An important assumption of the linear regression model is that the error terms ε1, ε2, ..., εn are uncorrelated (the index i runs over the different observations). If the errors are in fact correlated, the estimated standard errors will tend to underestimate the true standard errors. To detect this, we can also plot the residuals as a function of time.

In the top panel, we see the residuals from a linear regression fit to data generated with uncorrelated errors. There is no evidence of a time-related trend in the residuals. In contrast, the residuals in the bottom panel are from a data set in which adjacent errors had a correlation of 0.9.
3. Non-constant Variance of Error Terms

Cause: It is often the case that the variances of the error terms are non-constant. For instance, the variances of the error terms may increase with the value of the response.

Solution: When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y . Such a transformation results in a greater amount of shrinkage of the larger responses. Sometimes we have a good idea of the variance of each response.

As another remedy, if different observations are known to have different error variances, we can fit the model by weighted least squares, with weights proportional to the inverses of those variances.

4. Outliers

We can plot the studentized residuals, computed by dividing each residual ei by its estimated standard error; observations whose studentized residuals exceed about 3 in absolute value are possible outliers.

5. High Leverage Points

We just saw that outliers are observations for which the response yi is unusual given the predictor xi. In contrast, observations with high leverage have an unusual value for xi.

Phenomenon:

Left: Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has a high leverage and a high residual.

Impact:

we observe that removing the high leverage observation has a much more substantial impact on the least squares line than removing the outlier. It is cause for concern if the least squares line is heavily affected by just a couple of observations.

Detection:

In order to quantify an observation's leverage, we compute the leverage statistic. For simple linear regression,

hi = 1/n + (xi − x̄)^2 / Σ_{i'} (xi' − x̄)^2.

The average leverage over all observations is always (p + 1)/n, so if a given observation has a leverage statistic that greatly exceeds (p + 1)/n, we may suspect that the corresponding point has high leverage.
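A minimal NumPy sketch computing leverage statistics as the diagonal of the hat matrix on simulated data (the "several times the average" threshold in the comment is a common rule of thumb, not from the text):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 80, 2
X = rng.normal(size=(n, p))
X1 = np.column_stack([np.ones(n), X])     # design matrix with intercept

# Leverage statistics are the diagonal of the hat matrix H = X (X'X)^{-1} X'
H = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
leverage = np.diag(H)

avg_leverage = (p + 1) / n                        # average leverage is always (p+1)/n
high = np.where(leverage > 3 * avg_leverage)[0]   # flag points well above the average
print(leverage.max(), avg_leverage, high)
```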

6. Collinearity

Compared with the left panel of the referenced figure, the right panel shows collinearity: the two predictors are correlated with each other.

The difficulties that arise from collinearity can be seen in the RSS contour plots described below.

Left: Contour of RSS associated with different possible coefficient estimates for the regression of balance (response) on limit (predictor 1) and age (predictor 2).

The result is that a small change in the data could cause the pair of coefficient values that yield the smallest RSS—that is, the least squares estimates—to move anywhere along this valley. In this way collinearity reduces the accuracy of the estimates of the regression coefficients: it causes the standard error for β̂j to grow.

The power of the hypothesis test (the probability of correctly detecting a non-zero coefficient) is therefore reduced by collinearity. A simple way to detect collinearity is to look at the correlation matrix of the predictors.

Multicollinearity: cannot be detected by looking at the correlation matrix
Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity.

Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF).

The smallest possible value for VIF is 1, which indicates the complete absence of collinearity. A VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

For each predictor Xj,

VIF(β̂j) = 1 / (1 − R^2_{Xj|X−j}),

where R^2_{Xj|X−j} is the R^2 from a regression of Xj (acting as the response) onto all of the other predictors.
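A sketch of the VIF computation by brute force (simulated data in which x3 is deliberately built to be nearly collinear with x1 and x2; the helper vif is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.7 * x1 + 0.7 * x2 + 0.1 * rng.normal(size=n)   # nearly collinear with x1, x2
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of predictor j: 1 / (1 - R^2) from regressing X_j on the other predictors."""
    others = np.column_stack([np.ones(len(X))] + [X[:, k] for k in range(X.shape[1]) if k != j])
    beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    resid = X[:, j] - others @ beta
    r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1 / (1 - r2)

# The near-collinearity inflates the VIFs well above 1
print([round(vif(X, j), 2) for j in range(X.shape[1])])
```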

Solutions
The first is to drop one of the problematic variables from the regression. The second solution is to combine the collinear variables together into a single predictor.

