回归模型选择(R语言版)

选择准测:

  1. With Cp , AIC and BIC, smaller values are better, but for adjusted R2 , larger values are better.
  2. Model choice should be guided by economic theory and practical considerations, as well as by model selection criteria.

案例

Illustrate model selection with Nelson-Plosser data set of U.S. yearly macroeconomic variables from 1909 to 1970, a total of 62 years as follows:
变量说明:

# sp: Stock Prices, [Index; 1941-43=100],
# gnp.r: Real GNP,  [Billions of 1958 Dollars],
# gnp.pc: Real Per Capita GNP, [1958 Dollars],
# ip: industrial Production Index, [1967=100],
# cpi: Consumer Price Index, [1967=100],
# emp: Total Employment, [Thousands],
# bond: Basic Yields 30-year Corporate Bonds, [% pa].# response: diff(log(sp))
# regressors: diff(gnp.r), diff(gnp.pc), diff(log(ip)), diff(log(cpi)), diff(emp), diff(bond).# install.packages(c("fEcofin","MASS","car","lmtest"),repos= "http://cran.cnr.Berkeley.edu", dep=TRUE)

安装需要的包

rm(list=ls())
install.packages(c("MASS","car","leaps","faraway"))   # install packages
install.packages("fEcofin", repos="http://R-Forge.R-project.org")library(fEcofin)                                      # call packages
?? fEcofin
data("nelsonplosser")                                 # load data in above packages
list(nelsonplosser)                                  # check data
names(nelsonplosser)                      new_np=nelsonplosser[50:111, c(2,4,5,6,9,14,15)]
attach(new_np)
new_np

查看原数据

par(mfrow=c(2,3))     # 一页多图: 一个图版显示2行,3列
plot(diff(gnp.r), type="b")
plot(diff(log(gnp.r)),type="b")
plot(diff(sqrt(gnp.r)),type="b")
plot(diff(ip),type="b")  #喇叭形,用对数消去
plot(diff(log(ip)),type="b")
plot(diff(sqrt(ip)),type="b")

# creat a scatterplot matrix 查看变量间的关系
pairs(cbind(diff(log(sp)), diff(gnp.r), diff(gnp.pc), diff(log(ip)),diff(log(cpi)), diff(emp), diff(bnd)))

可以看到,var2-diff(gnp.r) 与var3-diff(gnp.pc)有很强的线性关系,变量var3-diff(gnp.pc)与变量var6-diff(emp)也有线性关系.

fit=lm(formula=diff(log(sp))~diff(gnp.r)+diff(gnp.pc)+diff(log(ip))+diff(log(cpi))+diff(emp)+diff(bnd), data=new_np )  # make linear refression
summary(fit)                  # dispay the result of fit
anova(fit)


在0.1的置信水平下,只有三个参数显著. 可能存在多重共线性,看一下方差膨胀因子(VIF)吧. VIF就是检查预测变量之间的共线性的. 一般认为VIF>10,认为有共线性.

library(faraway)
vif(fit)

output:

vif(fit)
diff(gnp.r) diff(gnp.pc) diff(log(ip)) diff(log(cpi)) diff(emp)
16.030836 31.777449 3.325433 1.290405 10.948953
diff(bnd)
1.481480

结果显示,变量diff(gnp.r), 变量diff(gnp.pc), 变量diff(emp)有线性关系.

然后我们用stepAIC函数来找一个简约模型, 它是R中的变量选择过程,它首先由用户指定的模型开始,然后依次添加或删除变量。在每一步中,添加或删除的元素都能最大程度地提高AIC值。这个例子,stepAIC将从所有六个预测因子开始。

#### AIC 准则
library(MASS)
fit_AIC=stepAIC(fit)
summary(fit_AIC) #diff(gnp.r) diff(gnp.pc) diff(log(ip)) diff(bnd)#####输出#####
Start:  AIC=-224.92
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(log(cpi)) + diff(emp) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(emp)       1  0.001846 1.2162 -226.83
- diff(gnp.pc)    1  0.023749 1.2381 -225.74
- diff(log(cpi))  1  0.033944 1.2483 -225.24
<none>                        1.2143 -224.92
- diff(gnp.r)     1  0.074538 1.2889 -223.28
- diff(log(ip))   1  0.097736 1.3120 -222.20
- diff(bnd)       1  0.169469 1.3838 -218.95Step:  AIC=-226.83
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(log(cpi)) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(log(cpi))  1  0.032116 1.2483 -227.24
<none>                        1.2162 -226.83
- diff(gnp.pc)    1  0.057144 1.2733 -226.03
- diff(gnp.r)     1  0.084456 1.3006 -224.73
- diff(log(ip))   1  0.096252 1.3124 -224.18
- diff(bnd)       1  0.188593 1.4047 -220.03Step:  AIC=-227.24
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(bnd)Df Sum of Sq    RSS     AIC
<none>                       1.2483 -227.24
- diff(gnp.pc)   1  0.046572 1.2949 -227.00
- diff(gnp.r)    1  0.069250 1.3175 -225.94
- diff(log(ip))  1  0.122309 1.3706 -223.53
- diff(bnd)      1  0.157181 1.4055 -222.00
> summary(fit_AIC) #diff(gnp.r) diff(gnp.pc) diff(log(ip)) diff(bnd)Call:
lm(formula = diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(bnd), data = new_np)Residuals:Min       1Q   Median       3Q      Max
-0.40802 -0.08630  0.00795  0.08612  0.28259 Coefficients:Estimate Std. Error t value Pr(>|t|)
(Intercept)   -0.0186635  0.0287229  -0.650   0.5185
diff(gnp.r)    0.0077426  0.0043928   1.763   0.0834 .
diff(gnp.pc)  -0.0010291  0.0007119  -1.445   0.1539
diff(log(ip))  0.6729244  0.2872758   2.342   0.0227 *
diff(bnd)     -0.1774902  0.0668399  -2.655   0.0103 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.1493 on 56 degrees of freedom
Multiple R-squared:  0.3469,    Adjusted R-squared:  0.3003
F-statistic: 7.437 on 4 and 56 DF,  p-value: 7.06e-05

AIC准测下,模型保留四个参数diff(gnp.r) diff(gnp.pc) diff(log(ip)) diff(bnd).
现在,四个变量中的三个在0.1都有统计学意义,尽管diff(gnp.pc)具有相当大的p值,而且似乎值得探索其他可能的模型。

#### BIC 准则
n=length(diff(log(sp)))
fit_BIC=stepAIC(fit,k=log(n))
fit_BIC=stepAIC(fit,k=log(n),trace=F)                 # 用逐步回归法对训练数据建模
summary(fit_BIC)    #diff(log(ip)) diff(bnd)#####输出#####
Start:  AIC=-210.14
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(log(cpi)) + diff(emp) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(emp)       1  0.001846 1.2162 -214.16
- diff(gnp.pc)    1  0.023749 1.2381 -213.07
- diff(log(cpi))  1  0.033944 1.2483 -212.57
- diff(gnp.r)     1  0.074538 1.2889 -210.62
<none>                        1.2143 -210.14
- diff(log(ip))   1  0.097736 1.3120 -209.53
- diff(bnd)       1  0.169469 1.3838 -206.28Step:  AIC=-214.16
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(log(cpi)) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(log(cpi))  1  0.032116 1.2483 -216.68
- diff(gnp.pc)    1  0.057144 1.2733 -215.47
- diff(gnp.r)     1  0.084456 1.3006 -214.18
<none>                        1.2162 -214.16
- diff(log(ip))   1  0.096252 1.3124 -213.62
- diff(bnd)       1  0.188593 1.4047 -209.48Step:  AIC=-216.68
diff(log(sp)) ~ diff(gnp.r) + diff(gnp.pc) + diff(log(ip)) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(gnp.pc)   1  0.046572 1.2949 -218.56
- diff(gnp.r)    1  0.069250 1.3175 -217.50
<none>                       1.2483 -216.68
- diff(log(ip))  1  0.122309 1.3706 -215.09
- diff(bnd)      1  0.157181 1.4055 -213.56Step:  AIC=-218.56
diff(log(sp)) ~ diff(gnp.r) + diff(log(ip)) + diff(bnd)Df Sum of Sq    RSS     AIC
- diff(gnp.r)    1  0.026578 1.3214 -221.43
- diff(log(ip))  1  0.079892 1.3747 -219.02
<none>                       1.2949 -218.56
- diff(bnd)      1  0.126709 1.4216 -216.97Step:  AIC=-221.43
diff(log(sp)) ~ diff(log(ip)) + diff(bnd)Df Sum of Sq    RSS     AIC
<none>                       1.3214 -221.43
- diff(bnd)      1   0.10281 1.4242 -220.97
- diff(log(ip))  1   0.39112 1.7126 -209.72
> #fit_BIC=stepAIC(fit,k=log(n),trace=F)                 # 用逐步回归法对训练数据建模
> summary(fit_BIC)    #diff(log(ip)) diff(bnd)Call:
lm(formula = diff(log(sp)) ~ diff(log(ip)) + diff(bnd), data = new_np)Residuals:Min       1Q   Median       3Q      Max
-0.44254 -0.09786  0.00377  0.10525  0.28136 Coefficients:Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.01657    0.02100   0.789 0.433316
diff(log(ip))  0.69748    0.16834   4.143 0.000113 ***
diff(bnd)     -0.13224    0.06225  -2.124 0.037920 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.1509 on 58 degrees of freedom
Multiple R-squared:  0.3087,    Adjusted R-squared:  0.2848
F-statistic: 12.95 on 2 and 58 DF,  p-value: 2.244e-05

BIC准则下保留了两个参数diff(log(ip)) diff(bnd),在0.05下都显著.

model.matrix 设计矩阵,实际上就是矩阵形式表示的最小二乘回归的系数矩阵. 可参考设计矩阵

#### Cp 准则
library(leaps)
x=model.matrix(diff(log(sp))~diff(gnp.r)+diff(gnp.pc)+diff(log(ip))+diff(log(cpi))+diff(emp)+diff(bnd), data=new_np)[,-1]
x
le=leaps(x,diff(log(sp)), method="Cp", nbest=1 )
le
par(mfrow=c(1,2))
plot(le$size-1,le$Cp,type="b") cbind(le$which, le$Cp)
le$which[le$Cp==min(le$Cp)]fit_Cp=lm(formula = diff(log(sp))~diff(log(ip))+diff(bnd),data=new_np)
summary(fit_Cp)


可以看出,Cp值最小时,是选择第三个变量diff(log(ip))和第六个变量diff(bnd).模拟之后的线性模型结果如下:

#### adjr2 准则
le=leaps(x,diff(log(sp)), method="adjr2", nbest=1)
plot(le$size-1,le$adjr2,type="b")


在调整R2准则下,需要保留五个变量. 由下图可知, 不要第5个变量diff(emp)时,调整R2最大.

leaps函数

leaps(x=, y=, wt=rep(1, NROW(x)), int=TRUE, method=c("Cp", "adjr2", "r2"), nbest=10, names=NULL, df=NROW(x), strictly.compatible=TRUE)

leaps() performs an exhaustive search for the best subsets of the variables in x for predicting y in linear regression, using an efficient branch-and-bound algorithm. It is a compatibility wrapper for regsubsets does the same thing better.

Since the algorithm returns a best model of each size, the results do not depend on a penalty model for model size: it doesn’t make any difference whether you want to use AIC, BIC, CIC, DIC, …
参数
x: A matrix of predictors
y: A response vector
method: Calculate Cp, adjusted R-squared or R-squared
输出
Value
A list with components

which logical matrix. Each row can be used to select the columns of x in the respective model

size Number of variables, including intercept if any, in the model

Cp or adjr2 or r2 is the value of the chosen model selection statistic for each model

label vector of names for the columns of x

regsubsets函数

真子集回归

#######Supplement:
rm(list=ls())
library(fEcofin)
data("nelsonplosser")
new_np=nelsonplosser[50:111, c(2,4,5,6,9,14,15)]
attach(new_np)
new_nplibrary(leaps)
?? regsubsets #模型选择函数,参数x可以是design matrix or model formula for full model, or biglm object
a=regsubsets( cbind( diff(gnp.r), diff(gnp.pc), diff(log(ip)), diff(log(cpi)), diff(emp), diff(bnd) ), diff(log(sp)) )
summary(a)
reg.summary = summary(a)
par(mfrow=c(1,3))
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted RSq ", type ="l")
which.max(reg.summary$adjr2)                                # 5
points(5, reg.summary$adjr2[5] , col="red", cex=2, pch=20)  # puts points on a plot that has already been createdplot(reg.summary$cp , xlab ="Number of Variables", ylab ="Cp", type ="l")
which.min(reg.summary$cp)                                   # 2
points(2, reg.summary$cp[2], col="red", cex=2, pch=20)plot(reg.summary$bic ,xlab ="Number of Variables",ylab ="BIC",type ="l")
which.min(reg.summary$bic)                                 # 2
points(2, reg.summary$bic[2], col ="red", cex=2, pch=20)


由图可知,调整R2标准下,5个变量. Cp和BIC标准下保留两个变量. 具体见下图

cbind(reg.summary$which,reg.summary$adjr2)
cbind(reg.summary$which,reg.summary$cp)
cbind(reg.summary$which,reg.summary$bic)


若regsubsets 函数,参数x是 model formula for full model

把六个变量都放到模型中,进行真子集回归. 注意AIC,BIC,Cp都是越小越好,adR2是越大越好. 图形,一行一行看

a0=regsubsets(diff(log(sp))~diff(gnp.r)+diff(gnp.pc)+diff(log(ip))+diff(log(cpi))+diff(emp)+diff(bnd),nbest=1, data=new_np)
par(mfrow=c(1,3))
plot(a0,scale="bic")
plot(a0,scale="Cp")
plot(a0,scale="adjr2")
#图形解释

一行一行看, 会发现含变量diff(log(ip)), diff(bnd)时, bic小为-10, Cp最小为3.8
除了不含变量diff(emp), 包含其他五个变量的模型, 调整R2最大为0.31.
这与前面的用函数leaps得到的结论是一致的.


注意: 大部分情况中, 全子集回归要优于逐步回归, 因为考虑了更多模型. 但是, 当有大量预测变量时, 全子集回归会很慢. 一般来说, 变量自动选择应该被看做是对模型选择的一种辅助方法, 而不是直接方法. 拟合效果佳而没有意义的模型对你毫无帮助, 主题背景知识的理解才能最终指引你获得理想的模型.

参考文献:
[1] Ruppert, David. Statistics and Data Analysis for Financial Engineering[M]. Springer, 2010.
[2] R语言实战: 第2版/(美)卡巴科弗(Kabacoff,R.I.)著; 王小宁等译. 一2版.一北京: 人民邮电出版社, 2016.5.

回归模型选择(R语言版)相关推荐

  1. r读取shape文件可视化_R语言读取空间数据以及ArcGIS中OLS工具回归结果可视化R语言版...

    前面已经介绍过R语言读取excel的方法了,当然读取数据来说,个人还是推荐csv或txt存储(针对小数据量).大数据量的数据的话建议还是用数据库,此外也可以考虑data.table包读取,这个包也是个 ...

  2. 带南海九段线分位数地图可视化(R语言版)

    今天带来一篇承诺虾神的可视化博客.内容是使用R语言进行带南海九段线分位数地图可视化.虾神的原博文地址如下(Python版). Python实现带南海九段线分位数地图完整可视化版本(附代码及数据) 19 ...

  3. R语言基于lm模型构建线性回归模型(蟋蟀的鸣叫声与温度的关系)、计算回归模型的RMSE指标、计算回归模型的R方指标(R-squared)

    R语言基于lm模型构建线性回归模型(蟋蟀的鸣叫声与温度的关系).计算回归模型的RMSE指标.计算回归模型的R方指标(R-squared) 目录

  4. r语言 面板数据回归_工具方法 | “名牌包”:面板、时间序列模型常用R语言包...

    计量经济学是数学.统计技术和经济分析的综合,即运用数学.统计方法和相关经济理论,通过计量模型来揭示经济数量关系和规律.R语言包,已经实现了现代计量经济学的很多统计分析功能,下面从面板数据模型和时间序列 ...

  5. 【视频】支持向量机SVM、支持向量回归SVR和R语言网格搜索超参数优化实例

    最近我们被客户要求撰写关于SVM的研究报告,包括一些图形和统计输出. 什么是支持向量机 (SVM)? 我们将从简单的理解 SVM 开始. [视频]支持向量机SVM.支持向量回归SVR和R语言网格搜索超 ...

  6. 信用评分卡模型(R语言)

    贷款风险预测-信用评分卡模型(R语言) 时间:2018年10月9日 本次的分析数据来自Kaggle数据竞赛平台的"give me some credit"竞赛项目.下载地址为:ht ...

  7. 最大熵,三硬币模型的R语言代码

    #最大熵算法 #< 统计学习方法>最大熵,三硬币模型的R语言代码 有3个硬币,分布为A.B.C,硬币正面的概率是pai,p,q.投币实验如下,先投A,如果A是正面,即A=1,那么选择投B: ...

  8. r语言 siar 代码_平滑转换自回归(STAR)模型的R语言编程实现详解

    总体讲,ST(Smooth transition)模型,这块的code混乱的比较狗血,文献中的做法也是千奇百怪, 单变量的有 :STAR, LSTAR, ESTAR, 多变量的有:STVAR, LST ...

  9. logistic回归模型—基于R

    logistic回归模型-基于R 数据理解和准备 一. 对缺失值的处理 二.虚拟变量的赋值 三.箱线图 四.相关性分析 训练集与测试集的划分 模型构建与评价 一.logistic回归模型 二.检查模型 ...

最新文章

  1. 每日算法练习之括号配对
  2. OpenGL硬件加速指南
  3. 使用sql服务器发送贺卡_创建和发送免费电子贺卡的最佳网站
  4. html 标签 r语言,从R中的字符串中删除html标签
  5. fcpx插件Corporate Story for Mac(商务公司视频宣传片头模板)
  6. 什么样的领导会给公司造成损失呢?
  7. 介绍几款WAP网页制作工具(提供下载)
  8. 前端登陆界面 html+js
  9. 免费开源网站系统html,全CMS开源系统
  10. OpenFileDiaog.Filter的使用
  11. c#推箱子小游戏代码_C# 简单推箱子游戏源码
  12. java中获取农历日期以及星期几
  13. canvas练习笔记之手绘熊本熊
  14. 智能电视 = Android + 高清大屏幕 ?
  15. 为什么我总是更新不了头像啊
  16. [论文阅读]中文版-AlexNet
  17. 读懂常见IRP:IRP_MJ_CLEANUP\IRP_MJ_CLOSE\IRP_MJ_CREATE
  18. 如何批量将 Doc 格式的 Word 文档转为 Docx 格式
  19. cocos2dx的图片加载
  20. Synchronization 同步

热门文章

  1. 【编译原理】中间代码优化(二) 局部优化
  2. unity2D横版游戏教程-2 代码实现
  3. 一、Cocos2d-x/Qt,2D手游开发方案
  4. python安装教程——最新详细版
  5. 也说 2012国庆交通事故共造成794人死亡
  6. 计算中英文字符串的长度!
  7. 中国嵌入式天花板灯具行业市场供需与战略研究报告
  8. Uddi注册中心教程
  9. 草图大师SU导入Unity和虚幻UE4对比
  10. UTF-8 GBK UTF8 GB2312 之间的区别和关系