R语言采用多元回归建模的基本步骤

前言：本次建模过程是基于RedHat6.8或者CentOS6.8，R3.1.2,Rstudio-server
关于R3.1.2，Rstudio-server的整个配置，原始数据（已经脱敏处理，不涉及泄密，如有侵权，请随时联系）以及本分析的源码均放置在GitHub上,通过click here访问

数据导入：

#install essential packages
install.packages("rJava",dependencies=TRUE)
install.packages("xlsx",dependencies=TRUE)
install.packages("corrplot",dependencies=TRUE)
install.packages("leaps",dependencies=TRUE)
install.packages("lmtest",dependencies=TRUE)library(rJava)
#setting workstation
setwd("/home/steven/workstation")
#source data loaded
library(xlsx)
src <- read.xlsx("/usr/local/workstation/test.xls",1,encoding="UTF-8")

数据处理：

#drop columns that is useless
src <- src[,-c(1)]
src <- src[,c(1:6,8:10,7)]
#1、description statistic
attach(src)
names(src)
mean=sapply(src,mean)
max=sapply(src,max)
min=sapply(src,min)
median=sapply(src,median)
sd=sapply(src,sd)
cbind(mean,max,min,median,sd)

建立回归模型步骤：


#1、参数全部默认情况下的相关系数图
#混合方法之上三角为圆形，下三角为黑色数字
library(corrplot)corr <- cor(src[,1:10])
corrplot(corr = corr,order="AOE",type="upper",tl.pos="tp")
corrplot(corr = corr,add=TRUE, type="lower", method="number",order="AOE", col="black",diag=FALSE,tl.pos="n", cl.pos="n")

输出结果如下：

1.                mean         max        min      median           sd
rs      1211.924227  14000.0000    10.0000   980.00000 1.595892e+03
cs         1.168495      6.0000     0.0000     1.00000 1.471799e+00
area       7.045862     11.0000     1.0000     7.00000 3.168251e+00
type       1.588235      2.0000     1.0000     2.00000 4.923985e-01
cg         2.251246      3.0000     1.0000     3.00000 9.289564e-01
h          1.395813      2.0000     1.0000     1.00000 4.892685e-01
tjg     7843.554775   8454.1182  6909.4630  8092.36154 3.942006e+02
ggjg   14147.557328  16600.0000 11700.0000 14600.00000 1.015608e+03
yyjg      94.608809    106.6788    86.6484    95.00923 3.467941e+00
target 29893.998495 145085.0000     4.9000 27700.00000 1.939754e+04

#1、参数全部默认情况下的相关系数图
#混合方法之上三角为圆形，下三角为黑色数字
library(corrplot)corr <- cor(src[,1:10])
corrplot(corr = corr,order="AOE",type="upper",tl.pos="tp")
corrplot(corr = corr,add=TRUE, type="lower", method="number",order="AOE", col="black",diag=FALSE,tl.pos="n", cl.pos="n")

结果如下：

#2.画相关图选择回归方程的形式
plot(target~cg);abline(lm(target~cg))
plot(target~h);abline(lm(target~h))
plot(target~type);abline(lm(target~type))
plot(target~tjg);abline(lm(target~tjg))
plot(target~ggjg);abline(lm(target~ggjg))
plot(target~area);abline(lm(target~area))
plot(target~cs);abline(lm(target~cs))
plot(target~yyjg);abline(lm(target~yyjg))
plot(target~rs);abline(lm(target~rs))

#3.do regression and check results
dim(src)[1]
lm.test<-lm(target~rs+cs+area+type+cg+h+tjg+ggjg+yyjg,data=src)
summary(lm.test)

结果如下：


Residuals:
   Min     1Q Median     3Q    Max
-45689  -8172   -999   5979  90290 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.678e+04  1.487e+04   6.507 1.22e-10 ***
rs           9.556e-02  2.643e-01   0.362  0.71777
cs          -5.617e+02  2.887e+02  -1.946  0.05195 .
area         2.137e+02  1.390e+02   1.537  0.12463
type        -3.414e+04  4.309e+03  -7.923 6.20e-15 ***
cg          -1.846e+03  2.136e+03  -0.864  0.38776
h            1.336e+04  1.158e+03  11.535  < 2e-16 ***
tjg         -1.198e+01  2.460e+00  -4.868 1.31e-06 ***
ggjg         7.148e+00  1.079e+00   6.625 5.66e-11 ***
yyjg        -3.731e+02  1.280e+02  -2.916  0.00363 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 12980 on 993 degrees of freedom
Multiple R-squared:  0.5564,    Adjusted R-squared:  0.5523
F-statistic: 138.4 on 9 and 993 DF,  p-value: < 2.2e-16

#4.delete variable which is not significant(rs,area)
lm.test<-lm(target~cs+type+cg+h+tjg+ggjg+yyjg,data=src)
summary(lm.test)

Residuals:
   Min     1Q Median     3Q    Max
-46131  -8655  -1038   5716  91466 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 101636.162  14537.438   6.991 4.99e-12 ***
cs            -572.037    288.600  -1.982  0.04774 *
type        -34143.617   4296.236  -7.947 5.14e-15 ***
cg           -1815.420   2127.549  -0.853  0.39370
h            12995.825   1132.191  11.478  < 2e-16 ***
tjg            -12.728      2.411  -5.279 1.59e-07 ***
ggjg             7.474      1.058   7.068 2.96e-12 ***
yyjg          -389.096    127.090  -3.062  0.00226 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 12980 on 995 degrees of freedom
Multiple R-squared:  0.5553,    Adjusted R-squared:  0.5522
F-statistic: 177.5 on 7 and 995 DF,  p-value: < 2.2e-16

#4.1.use residual analysis delete outlier points
plot(lm.test,which=1:4)
src = src[-c(12,765,788,790),]
dim(src)[1]

结果如下：

得到的四个图依次为：
4.1普通残差与拟合值的残差图
4.2正态QQ的残差图（若残差是来自正态总体分布的样本，则QQ图中的点应该在一条直线上）
4.3标准化残差开方与拟合值的残差图（对于近似服从正态分布的标准化残差，应该有95%的样本点落在[-2,2]的区间内。这也是判断异常点的直观方法）
4.4cook统计量的残差图（cook统计量值越大的点越可能是异常值，但具体阀值是多少较难判别）

从图中可见，12,765,788,790三个样本存在异常，需要剔除。
Residuals:
   Min     1Q Median     3Q    Max
-46131  -8655  -1038   5716  91466 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 101636.162  14537.438   6.991 4.99e-12 ***
cs            -572.037    288.600  -1.982  0.04774 *
type        -34143.617   4296.236  -7.947 5.14e-15 ***
cg           -1815.420   2127.549  -0.853  0.39370
h            12995.825   1132.191  11.478  < 2e-16 ***
tjg            -12.728      2.411  -5.279 1.59e-07 ***
ggjg             7.474      1.058   7.068 2.96e-12 ***
yyjg          -389.096    127.090  -3.062  0.00226 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 12980 on 995 degrees of freedom
Multiple R-squared:  0.5553,    Adjusted R-squared:  0.5522
F-statistic: 177.5 on 7 and 995 DF,  p-value: < 2.2e-16

#5.异方差检验
#5.1GQtest，H0（误差平方与自变量，自变量的平方和其交叉相都不相关），
#p值很小时拒绝H0，认为上诉公式有相关性，存在异方差
src.test<-residuals(lm.test)
library(lmtest)
gqtest(lm.test)#5.2BPtest,H0(同方差),p值很小时认为存在异方差
bptest(lm.test)

结果如下：

Goldfeld-Quandt testdata:  lm.test
GQ = 1.841, df1 = 494, df2 = 493, p-value = 8.865e-12
---------
studentized Breusch-Pagan testdata:  lm.test
BP = 93.9696, df = 7, p-value < 2.2e-16

两个检验的p值都很小时认为存在异方差，需要修正异方差


#6.修正异方差
#修正的方法选择FGLS即可行广义最小二乘
#6.1修正步骤
#6.1.1将y对xi求回归，算出res--u
#6.1.2计算log(u^2)
#6.1.3做log(u^2)对xi的辅助回归 log(u^2),得到拟合函数g=b0+b1x1+..+b2x2
#6.1.4计算拟合权数1/h=1/exp(g)，并以此做wls估计lm.test2<-lm(log(resid(lm.test)^2)~cs+type+cg+h+tjg+ggjg+yyjg,data=src)
lm.test3<-lm(target~cs+type+cg+h+tjg+ggjg+yyjg,weights=1/exp(fitted(lm.test2)),data=src)
summary(lm.test3)

结果如下：

Weighted Residuals:
    Min      1Q  Median      3Q     Max
-5.0707 -1.3281 -0.2952  1.0471  9.3937 Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  61648.563  13382.003   4.607 4.62e-06 ***
cs            -320.525    153.174  -2.093   0.0366 *
type        -26556.135   4766.749  -5.571 3.26e-08 ***
cg           -4653.887   2430.178  -1.915   0.0558 .
h            11107.677    692.499  16.040  < 2e-16 ***
tjg            -14.767      2.224  -6.638 5.20e-11 ***
ggjg             7.919      1.049   7.548 9.96e-14 ***
yyjg            97.193    133.864   0.726   0.4680
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 1.868 on 995 degrees of freedom
Multiple R-squared:  0.7178,    Adjusted R-squared:  0.7159
F-statistic: 361.6 on 7 and 995 DF,  p-value: < 2.2e-16

#7.1计算解释变量相关稀疏矩阵的条件数k，k<100多重共线性程度很小，100<k<1000较强，>1000严重
src[1:9]
XX<-cor(src[1:9])
kappa(XX)

结果如下：

[1] 160.9175

K>100 and K<1000，说明共线性较强，接下来找出共线性强的变量

$values
[1] 3.4192494 1.3030312 1.1563047 0.9199288 0.9041041 0.7562702 0.4205589 0.1011783
[9] 0.0193744$vectors[,1]         [,2]       [,3]        [,4]        [,5]         [,6][1,] -0.05733512 -0.501899822  0.2026552 -0.50588080  0.52056535 -0.417997886[2,] -0.14763837  0.007109461  0.4116265 -0.46530338 -0.75069812 -0.163284493[3,] -0.05138859  0.665302345 -0.1293286 -0.01422961  0.11408507 -0.699971313[4,]  0.49655847 -0.112090220 -0.1278988  0.02167963 -0.16337262 -0.239671005[5,]  0.46448785 -0.167719865 -0.2352263  0.06705893 -0.20715145 -0.265202906[6,]  0.40186335 -0.218262940 -0.2471633 -0.09210826 -0.11794142  0.001433268[7,]  0.40615453  0.174700538  0.5215318  0.06040287  0.12878909  0.132238544[8,]  0.40635458  0.274603051  0.4469450 -0.07777933  0.20683789  0.127659436[9,] -0.13555751 -0.333404886  0.4101634  0.71008868 -0.09781406 -0.383530067[,7]         [,8]         [,9][1,] -0.02888533  0.034061964 -0.010937600[2,]  0.03865752 -0.008224625 -0.020908812[3,]  0.16651561  0.084133996  0.004735702[4,] -0.30629410 -0.209415519  0.708684433[5,] -0.36523446  0.094446587 -0.663876582[6,]  0.83587224  0.093172922  0.004206256[7,] -0.04882043  0.695383719  0.094120334[8,]  0.10362759 -0.660605287 -0.217813589[9,]  0.17586069 -0.101388052 -0.011659019

#8.修正多重共线性---逐步回归法
#ps2：step中可进行参数设置：
#direction=c("both","forward","backward")来选择逐步回归
#的方向，默认both，forward时逐渐增加解释变两个数，backward则相反。
step(lm.test)

结果如下：

Start:  AIC=19007.25
target ~ cs + type + cg + h + tjg + ggjg + yyjgDf  Sum of Sq        RSS   AIC
- cg    1 1.2269e+08 1.6778e+11 19006
<none>               1.6766e+11 19007
- cs    1 6.6200e+08 1.6832e+11 19009
- yyjg  1 1.5794e+09 1.6924e+11 19015
- tjg   1 4.6960e+09 1.7235e+11 19033
- ggjg  1 8.4171e+09 1.7608e+11 19054
- type  1 1.0643e+10 1.7830e+11 19067
- h     1 2.2201e+10 1.8986e+11 19130

Step:  AIC=19005.98
target ~ cs + type + h + tjg + ggjg + yyjgDf  Sum of Sq        RSS   AIC
<none>               1.6778e+11 19006
- cs    1 5.9084e+08 1.6837e+11 19008
- yyjg  1 1.5277e+09 1.6931e+11 19013
- tjg   1 5.3779e+09 1.7316e+11 19036
- ggjg  1 1.2972e+10 1.8075e+11 19079
- h     1 2.2083e+10 1.8986e+11 19128
- type  1 1.5438e+11 3.2216e+11 19658

Call:
lm(formula = target ~ cs + type + h + tjg + ggjg + yyjg, data = src)Coefficients:
(Intercept)           cs         type            h          tjg         ggjg  99781.026     -533.902   -37652.525    12909.485      -13.222        7.941  yyjg  -381.819

所以最终入选的变量是：formula = target ~ cs + type + h + tjg + ggjg + yyjg

R语言采用多元回归建模的基本步骤相关推荐

R语言采用优化方法拟合曲线并计算AIC,BIC,LRT
文章目录前言一.R代码实现 1.导入库 2.随机生成原始数据 3.RMSD 4.梯度下降 5.最大似然估计 6.做出优化后图像 7.求AIC,BIC 8.求LRT 二.运行结果 1.图像输出 2. ...
『R语言Python』建模前的准备：连续型与离散型变量探索，离散型变量转为虚拟变量
在建立模型之前,我们常要先对数据的类型作出判断,连续型数据可以不做处理,而离散型数据则可能需要转为虚拟变量.下文使用R语言中的经典数据集 mtcarsmtcarsmtcars 进行演示 Python: ...
R语言语法及建模合集
点击下列超链接可进入博客: 一.语法篇: R语言常用包分类 R语言数据导入导出总结 R语言数据探索功能总结 R语言中的离群点检测方法 R语言中的向量使用合集 R语言中的因子类型 R语言中的对象以及它的 ...
使用R语言进行决策树建模
关于决策树的理解及自定义代码实现请参考我的另一个博客: 数据挖掘常用算法理解与R语言实现(系列待完成) 本次技能点: 训练集和测试集的选取决策树构建与减值决策树的print和plot 预测值与实际 ...
时间序列模型R语言实现-批量建模，预测（ARIMA, 随机森林）
时间序列预测首先要确定预测的内容. 本文预测共享单车的日租借量使用的是每日数据预测的时间范围是需要提前一个月.半年还是一年?根据预测范围,会使用到不同的模型. 首先安装加载本文所需要的包 inst ...
R语言入门——多元回归交叉验证的实现
目录引言 1.主要函数编写 1.1 随机数据的产生 1.2 CV.lm的编写 1.2 CV.lm的调用 2.数值模拟 2.1 CV模拟 2.2 MCMC模拟 3.总结引言随着模型复杂度的提高和数 ...
R语言tidyverse数据处理建模案例
管道%>% 左连接left_join() 筛选行 filter(条件) 行排序arrange() 选择列select() 修改(计算)列mutate() 分组汇总group_by().%> ...
pvrect r语言聚类_R语言一条命令实现基于样本和距离的聚类分析
上一篇文章给大家介绍了利用 R语言的 hclust()进行聚类分析的步骤,已经很简单了,但是依然有不少小伙伴来问 "老师,还有更简单的方法吗,最好是一条命令那种",为了满足的大家的 ...
python调用r语言加载包错误_Python调用R语言
网络上经常看到有人问数据分析是学习Python好还是R语言好,还有一些争论Python好还是R好的文章.每次看到这样的文章我都会想到李舰和肖凯的<数据科学中的R语言>,书中一直强调,工具不 ...

R语言采用多元回归建模的基本步骤

R语言采用多元回归建模的基本步骤相关推荐

最新文章

热门文章