应用预测建模第六章-线性回归-预测化合物溶解度练习-R语言（多元线性回归，稳健回归，偏最小二乘回归，岭回归，lasso回归，弹性网)

模型：多元线性回归，稳健回归，偏最小二乘回归，岭回归，lasso回归，弹性网

语言：R语言

参考书：应用预测建模 Applied Predictive Modeling (2013) by Max Kuhn and Kjell Johnson，林荟等译

案例：

导入数据集。

library(AppliedPredictiveModeling)
data(solubility)
ls(pattern = "^solT")#我们需要用到的数据集

> ls(pattern = "^solT")#我们需要用到的数据集
[1] "solTestX"       "solTestXtrans"  "solTestY"       "solTrainX"      "solTrainXtrans" "solTrainY"

首先展示一些关系图。注意，数据集中的响应变量已经取了以10底的对数。

（a )连续型描述量与溶解度（响应变量）之间的关系

xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),ylab = "Solubility (log)",main = "(a)",xlab = "Molecular Weight")

如图，随着分子量（选定的连续型描述量）的增加，溶解度逐渐降低。这一关系大体是对数线性的，除了一些溶解度很小或溶解度在0到- 5 之间而分子量很大的离群值之外。

（b )指纹0-1变量与溶解度（响应变量）之间的关系

bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1, "structure present", "structure absent"),ylab = "Solubility (log)",main = "(b)",horizontal = FALSE)

如图b，对于一个特定的指纹描述量，当分子中出现了某种子结构时，溶解度会略微更大

（c )计数描述量与溶解度（响应变量）之间的关系

xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),ylab = "Solubility (log)",xlab = "Number of Rotatable Bonds")

（d ) 展示响应变量与所有非指纹0-1变量的关系，纵轴为响应变量，横轴为已经经过box-cox变换的预测变量

notFingerprints <- grep("FP", names(solTrainXtrans))library(caret)
featurePlot(solTrainXtrans[, -notFingerprints],solTrainY,between = list(x = 1, y = 1),type = c("g", "p", "smooth"),labels = rep("", 2))

如图，图片展示了预测变量与结果变量之间的散点图，并加上了用“平滑”模型处理过的被称为loess 的回归线。平滑后的回归线显示出，某些预测变量与结果变量之间的关系是线性的（如分子量），而另一些则是非线性的（如氯原子的数量）。基于这些原因，可以考虑在预测变量中增加某些变量的二次项。

（e ) 展所有非指纹0-1变量的相关系数

library(corrplot)corrplot::corrplot(cor(solTrainXtrans[, -notFingerprints]), order = "hclust", tl.cex = .8)

如图，图片展示变换后非0-1指纹变量之间的相关结构。其中有许多很强的正相关关系（由巨大的、深蓝色的圆表示）。这可能会对建模（例如线性问归）产生不好的影响。

普通多元线性回归

利用普通最小二乘建立线性回归模型的主函数是lm ，它接受一个公式和一个数据框作为输入。因为这个原因，训练集预测变量和结果变量应该被包含在同一个数据框中。可以创建一个新数据框来达到这一目的：

trainingData <- solTrainXtrans
trainingData$Solubility <- solTrainY #加入响应变量

进行建模并输出结果：

#建模
lmFitAllPredictors <- lm(Solubility~., data= trainingData)
summary(lmFitAllPredictors)

注：因为太长，Coefficients:这一项有删减。

由统计汇总结果可知，RSE残差标准误为0.5524（可作为RMSE的简单估计值），R方为94.46%，调整R方为92.71%。这个结果可能过于乐观，因为这是在训练集上的结果。

> summary(lmFitAllPredictors)Call:
lm(formula = Solubility ~ ., data = trainingData)Residuals:Min       1Q   Median       3Q      Max
-1.75620 -0.28304  0.01165  0.30030  1.54887 Coefficients:Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.431e+00  2.162e+00   1.124 0.261303
FP001              3.594e-01  3.185e-01   1.128 0.259635
FP002              1.456e-01  2.637e-01   0.552 0.580960
FP044             -3.860e-01  2.184e-01  -1.768 0.077562 .
FP061             -6.365e-01  1.440e-01  -4.421 1.13e-05 ***
FP062             -5.224e-01  2.961e-01  -1.764 0.078078 .
FP063             -2.001e+00  1.287e+00  -1.554 0.120553
FP064              2.549e-01  1.221e-01   2.087 0.037207 *
FP065             -2.844e-01  1.197e-01  -2.377 0.017714 *
FP066              2.093e-01  1.264e-01   1.655 0.098301 .
FP067             -1.406e-01  1.540e-01  -0.913 0.361631
FP068              4.964e-01  2.028e-01   2.447 0.014630 *
FP069              1.324e-01  8.824e-02   1.501 0.133885
FP070              3.453e-03  8.088e-02   0.043 0.965963
FP071              1.474e-01  1.237e-01   1.192 0.233775
FP072             -9.773e-01  2.763e-01  -3.537 0.000431 ***
FP073             -4.671e-01  2.072e-01  -2.254 0.024474 *
FP074              1.793e-01  1.206e-01   1.487 0.137566   [ reached getOption("max.print") -- omitted 29 rows ]
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.5524 on 722 degrees of freedom
Multiple R-squared:  0.9446,    Adjusted R-squared:  0.9271
F-statistic: 54.03 on 228 and 722 DF,  p-value: < 2.2e-16

在测试集上预测样本，将测试集的观测值和预测值放到一个数据框中，然后使用caret中的函数 defaultSummary 来估计模型在测试集上的性能：

#预测测试集的样本
lmPred1<-predict(lmFitAllPredictors,newdata=solTestXtrans)
#将测试集的观测值和预测值放到一个数据框中，然后使用caret 中的函数 defaultSummary 来估计模型在测试集上的性能
lmValues1 <- data.frame(obs = solTestY, pred = lmPred1)
head(lmValues1)
defaultSummary(lmValues1)
testResults <- data.frame(obs = solTestY,lm = lmPred1)#存储测试集预测数据

可见，在测试集上预测，RMSE为0.745，R方为0.8722 ，这表示在训练集上的结果确实过于乐观了。

> head(lmValues1)obs        pred
20 0.93  0.99370933
21 0.85  0.06834627
23 0.81 -0.69877632
25 0.74  0.84796356
28 0.61 -0.16578324
31 0.58  1.40815083
> defaultSummary(lmValues1)RMSE  Rsquared       MAE
0.7455802 0.8722236 0.5497605

虽然普通线性回归没有调优参数，但是这并不妨碍建模者使用严谨的模型验证工具对其进行验证，特别是当模型用于预测时。事实上，我们必须使用与第4 章中相同的训练和验证方法来对模型预测新样本的能力进行评价。

train 函数可以利用重抽样方法来生成模型性能的估计值。由于训练集的容量并不小（951个），因此10 折交叉验证应该可以生成模型性能的合理估计。函数trainControl 可以指定重抽样的类型。train 函数既接受模型公式作为参数，也支持非公式的接口（参见4. 9 节中介绍的指定预测模型的各种方法）。

#用重抽样方法对模型预测新样本的能力进行评价
#设定随机数种子，这样十折交叉验证数据集可以重复
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)#进行线性回归的十折交叉验证
set.seed(100)
lmFit1 <- train(x = solTrainXtrans, y = solTrainY,method = "lm",trControl = ctrl)lmFit1
testResults$lm_cv<-predict(lmFit1,solTestXtrans)#将预测结果存储起

运行时提示的警告信息（ prediction from a rank-deficient fit may be misleading）是因为某些预测变量的高度相关性导致一些数学计算中断。

可见，十折交叉验证得出的平均RMSE为0.721，得出的平均R方为0.8768。

> #进行线性回归的十折交叉验证
> set.seed(100)
> lmFit1 <- train(x = solTrainXtrans, y = solTrainY,
+                  method = "lm",
+                  trControl = ctrl)
Warning messages:
1: In predict.lm(modelFit, newdata) :prediction from a rank-deficient fit may be misleading
2: In predict.lm(modelFit, newdata) :prediction from a rank-deficient fit may be misleading
>
> lmFit1
Linear Regression 951 samples
228 predictorsNo pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results:RMSE       Rsquared   MAE      0.7210355  0.8768359  0.5401102Tuning parameter 'intercept' was held constant at a value of TRUE

对于需要进行解释的模型，检查模型的假设是非常重要的，例如残差的分布。对于预测模型，如果其预测效果不佳，同样可以利用一些模型诊断方法来进行核查。例如，可以绘制残差与预测值的散点图，如果图形显示出随机分布的点云，那么可以更肯定地说模型没有遗漏重要的项（例如二次项等），也不存在显著的离群值。另一个重要的图形是预测值与现测值的散点图，它可以用来评价预测值与真实值的接近程度。实现以上两幅图的方法如下（利用训练集）：

#观测值与预测值之间的散点图
xyplot(solTrainY ~ predict(lmFit1),## plot the points (type = 'p') and a background grid ('g')type = c("p", "g"),xlab = "Predicted", ylab = "Observed")#预测值与残差的散点图
#resid 函数可以生成训练集的残差，而不加数据参数的predict 函数可以返回训练集的预测值
xyplot(resid(lmFit1) ~ predict(lmFit1),type = c("p", "g"),xlab = "Predicted", ylab = "Residuals")

对于该模型，两幅诊断图没有显示出异常情况。

观测们对预测值的散点图。这种图可以显示出离群值或模型拟合不佳的区域。

残差对预测值的散点图。如果模型设定良好，这种图应该呈现为随机的点云，且不存在离群值或明显的形状（如漏斗型）。

为了移除极度强相关的预测变量以建立更精简的模型，可以使用3. 3 节介绍的方法，将两两相关系数绝对值超过0. 9 的变量予以剔除，以减少预测变量的数目。

#用相关系数剔除高相关变量 p33
highCorr<-findCorrelation(cor(solTrainXtrans),cutoff=0.9) #返回建议删除变量的列数
highCorr
filtesolTrainXtrans<-solTrainXtrans[,-highCorr]
filtesolTestXtrans<-solTestXtrans[,-highCorr]
#利用过滤过的变量建模
set.seed(100)
lmFiltered <- train(x = filtesolTrainXtrans, y = solTrainY,method = "lm",trControl = ctrl)lmFiltered###将对测试集的预测结果存储起来
testResults$lm_cor<-predict(lmFiltered,solTestXtrans)
head(testResults)

可见，剔除后的变量为190个（原来为228个）。提示的警告信息（ prediction from a rank-deficient fit may be misleading）消失。

> #利用过滤过的变量建模
> set.seed(100)
> lmFiltered <- train(x = filtesolTrainXtrans, y = solTrainY,
+                 method = "lm",
+                 trControl = ctrl)
>
> lmFiltered
Linear Regression 951 samples
190 predictorsNo pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results:RMSE       Rsquared   MAE      0.7113935  0.8793396  0.5410503Tuning parameter 'intercept' was held constant at a value of TRUE
>

预测测试集样本（过滤后的），并计算RMSE与R方

#交叉验证多元回归（过滤后的）预测测试集样本（过滤后的）
lmFilteredPred1<-predict(lmFiltered,newdata=filtesolTestXtrans)
#将测试集的观测值和预测值放到一个数据框中，然后使用caret 中的函数 defaultSummary 来估计模型在测试集上的性能
lmFilteredValues1 <- data.frame(obs = solTestY, pred = lmFilteredPred1)
head(lmFilteredValues1)
defaultSummary(lmFilteredValues1)

> head(lmFilteredValues1)obs        pred
20 0.93  0.87458991
21 0.85  0.09911774
23 0.81 -0.51698844
25 0.74  0.79768536
28 0.61  0.21186617
31 0.58  1.50363807
> defaultSummary(lmFilteredValues1)RMSE  Rsquared       MAE
0.7601790 0.8669049 0.5661760

但是对于此方法，需要注意：

移除高度相关的预测变量以保证它们两两间的相关系数不超过一个给定的阈值。然而，这一过程并不能保证某些预测变量的线性组合与其他变量是不相关的。如果情况如此，那么最小二乘的解将依然是不稳定的。

稳健回归：最小化离群观测值的影响

图模型残差与其对目标函数贡献之间的关系，图中展示了几种不同的目标函数。对于Huber 函数，阈值设定为2

在R中进行稳健回归，可以使用train函数选择稳健回归的方法

#train函数选择稳健回归方法
set.seed(100)
rlm <- train(x = solTrainXtrans, y = solTrainY,method = "rlm",trControl = ctrl,preProc = c("center", "scale"))
rlm

有截距，方法为huber

> rlm
Robust Linear Model 951 samples
228 predictorsPre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:intercept  psi           RMSE       Rsquared   MAE      FALSE      psi.huber     2.8034713  0.8760595  2.7069027FALSE      psi.hampel    2.8034713  0.8760595  2.7069027FALSE      psi.bisquare  2.8035978  0.8760408  2.7070396TRUE      psi.huber     0.7321544  0.8752285  0.5389277TRUE      psi.hampel    0.7350600  0.8742359  0.5418816TRUE      psi.bisquare  0.7838603  0.8598210  0.5738828RMSE was used to select the optimal model using the smallest value.
The final values used for the model were intercept = TRUE and psi = psi.huber.

可以使用MASS包中的稳健线性模型函数（rlm），它默认使用的是Huber 方法。类似于lm 函数， rlm 的用法如下：

#稳健回归
library(MASS)
rlmFitAllPredictors <- rlm(Solubility ~ ., data = trainingData)
summary(rlmFitAllPredictors)

注：因为太长，Coefficients:这一项有删减。

> summary(rlmFitAllPredictors)Call: rlm(formula = Solubility ~ ., data = trainingData)
Residuals:Min       1Q   Median       3Q      Max
-2.89940 -0.25046  0.01221  0.25351  1.86225 Coefficients:Value    Std. Error t value
(Intercept)         2.5861   1.9646     1.3164
FP001               0.3706   0.2894     1.2804
FP002               0.0370   0.2396     0.1546
NumAromaticBonds   -2.0220   0.5663    -3.5706
NumHydrogen         0.7778   0.1709     4.5527
NumCarbon           0.7865   0.3419     2.3003
NumNitrogen         8.5838   2.7669     3.1023
NumOxygen           2.6481   0.4110     6.4436
NumSulfer          -9.6687   3.2884    -2.9403
NumChlorine        -6.4608   1.8075    -3.5744
NumHalogen          1.4341   1.9165     0.7483
NumRings            1.0132   0.6102     1.6605
HydrophilicFactor  -0.0836   0.1033    -0.8094
SurfaceArea1        0.0948   0.0550     1.7225
SurfaceArea2        0.1181   0.0510     2.3141Residual standard error: 0.3739 on 722 degrees of freedom

稳健回归预测测试集样本：

#稳健回归预测测试集样本
rlmPred1<-predict(rlmFitAllPredictors,newdata=solTestXtrans)
#将测试集的观测值和预测值放到一个数据框中，然后使用caret 中的函数 defaultSummary 来估计模型在测试集上的性能
rlmValues1 <- data.frame(obs = solTestY, pred = rlmPred1)
head(rlmValues1)
defaultSummary(rlmValues1)
testResults$lm_rlm<-rlmPred1#将预测结果存储起来

稳健回归在测试集的RMSE为0.753，R方为0.8700 ；由上文可知，普通线性回归在测试集的RMSE为0.745，R方为0.8722 。两个模型在此案例中表现差异不大。

> head(rlmValues1)obs        pred
20 0.93  1.09257730
21 0.85 -0.01596496
23 0.81 -0.55293933
25 0.74  0.82031495
28 0.61 -0.06744233
31 0.58  1.33320420
> defaultSummary(rlmValues1)RMSE  Rsquared       MAE
0.7529670 0.8700394 0.5371296

偏最小二乘回归（PLS）与主成分回归（PCR）

在进行PLS 之前，预测变量应该先进行中心化和标准化，特别是预测变量具有不同的量级时。
PLS 有一个调优参数：需要保留的成分数。重抽样方法可以用来选择最优的成分数。
PLS适用于存在高度相关预测变量的数据集，但对于预测变量空间与响应变量的相关结构为非线性的模型，虽然可以自己构建数据集（比如加入平方项），但是更推荐用那些有可以在不修改预测变量空间的前提下更自然地发现顶测变量和响应变量之间的非线性关系的预测模型。

下面进行PLS建模：

首先用train函数确定最优参数（成分数）：

注意：train函数也可以以配舍plsr函数的method 参数进行使用，例如"oscorespls"、"simpls "或"widekernelpls”（具体见p86，不同算法的结果相同，只是某些算法优化了PLS的运算效率）。

#设定随机数种子，这样十折交叉验证数据集可以重复
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)#PLS偏最小二乘回归
set.seed(100)
plsTune <- train(x = solTrainXtrans, y = solTrainY,method = "kernelpls",#Dayal和MacGregor的第一种核函数算法kernelplstuneGrid = expand.grid(ncomp = 1:20),#设定1-20个成分数trControl = ctrl,preProc = c("center", "scale"))
plsTunetestResults$PLS_cv <- predict(plsTune, solTestXtrans)

可知，用十折交叉验证，选出来的最佳成分数为14 （RMSE最小，为0.7080）。

注：结果与书上略有不同。

> plsTune
Partial Least Squares 951 samples
228 predictorsPre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:ncomp  RMSE       Rsquared   MAE      1     1.2820896  0.6061663  0.99089782     1.0609080  0.7326945  0.83337283     0.9344973  0.7914940  0.72383934     0.8570977  0.8235854  0.66609435     0.8126349  0.8413981  0.63155786     0.7785140  0.8546538  0.60164397     0.7539695  0.8642916  0.57316498     0.7415685  0.8687828  0.56760669     0.7307073  0.8726161  0.558245010     0.7201343  0.8767022  0.549382111     0.7174982  0.8778295  0.548626712     0.7093452  0.8803296  0.541252813     0.7108267  0.8802818  0.541443514     0.7080011  0.8811427  0.540925915     0.7085605  0.8809953  0.538516316     0.7085113  0.8809079  0.541301317     0.7088188  0.8809152  0.542941318     0.7097142  0.8808145  0.544482519     0.7119750  0.8798337  0.546135920     0.7139587  0.8791681  0.5477691RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 14.

利用一倍标准差准则，0.7080011+0.03492938=0.7429305，可以将14个成分缩减到8个

> plsTune$resultsncomp      RMSE  Rsquared       MAE     RMSESD RsquaredSD      MAESD
1      1 1.2820896 0.6061663 0.9908978 0.12643831 0.07606802 0.08692801
2      2 1.0609080 0.7326945 0.8333728 0.08404089 0.03177571 0.06993945
3      3 0.9344973 0.7914940 0.7238393 0.07943628 0.02469891 0.05908861
4      4 0.8570977 0.8235854 0.6660943 0.05336496 0.02172862 0.03857533
5      5 0.8126349 0.8413981 0.6315578 0.03817911 0.02277085 0.03615625
6      6 0.7785140 0.8546538 0.6016439 0.03076132 0.01996481 0.03540179
7      7 0.7539695 0.8642916 0.5731649 0.03505010 0.01720024 0.03340631
8      8 0.7415685 0.8687828 0.5676066 0.03069722 0.01705571 0.02893890
9      9 0.7307073 0.8726161 0.5582450 0.02635050 0.01771026 0.02175748
10    10 0.7201343 0.8767022 0.5493821 0.02860208 0.01678537 0.01987854
11    11 0.7174982 0.8778295 0.5486267 0.03056129 0.01782793 0.01624970
12    12 0.7093452 0.8803296 0.5412528 0.02910027 0.01916458 0.01918982
13    13 0.7108267 0.8802818 0.5414435 0.02974200 0.01844625 0.02230934
14    14 0.7080011 0.8811427 0.5409259 0.03492938 0.01899973 0.02364453
15    15 0.7085605 0.8809953 0.5385163 0.03939786 0.01984845 0.02798348
16    16 0.7085113 0.8809079 0.5413013 0.04050778 0.02012026 0.02949941
17    17 0.7088188 0.8809152 0.5429413 0.04190305 0.01935973 0.02644218
18    18 0.7097142 0.8808145 0.5444825 0.04462642 0.01946978 0.02812665
19    19 0.7119750 0.8798337 0.5461359 0.04775935 0.02040080 0.02894271
20    20 0.7139587 0.8791681 0.5477691 0.04872832 0.02108566 0.03097676

确定了最佳成分数（14）后，可以用pls包的plsr函数进行建模：


#选定最佳成分数后，用pls包进行预测
library(pls)
#plsr训练集预测变量和结果变量应该被包含在同一个数据框,然后使用公式
#plsr函数默认使用Dayal和MacGregor的第一种核函数算法kernelpls，其他算法可以用method函数进行指定，如"oscorespls", "simpls",  "widekernelpls"
plsFit <- plsr(Solubility ~ ., data = trainingData,ncomp=14,method="kernelpls", scale = TRUE, center = TRUE)#成分数可以用ncomp参数近行指定，如果使用默认值，则会使用最大数量的成分数summary(plsFit) #预测变量被解释百分比与响应变量被解释百分比

预测变量被解释百分比与响应变量被解释百分比(如前两个成分解释了7.71%的预测变量，解释了61.52%的响应变量）

> summary(plsFit)
Data:   X dimension: 951 228 Y dimension: 951 1
Fit method: kernelpls
Number of components considered: 14
TRAINING: % variance explained1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps
X              7.71    15.26    21.64    28.44    35.48    38.30    40.49    42.89    45.68
Solubility    61.51    74.52    80.54    84.06    86.01    87.65    89.03    89.86    90.4510 comps  11 comps  12 comps  13 comps  14 comps
X              47.25     48.65     50.49     52.69     55.39
Solubility     91.20     91.71     92.05     92.31     92.48

进行预测，预测结果与train函数的预测结果相同。

#新样本预测，可以使用一个特定的成分数，也可以一次性计算多个值，如
#predict(plsFit, solTestXtrans[1:5,], ncomp = 1:2)testResults$PLSr <- predict(plsFit, solTestXtrans,ncomp=14) #必须设定成分数
head(testResults[1:5,c("PLS_cv","PLSr")]) #可见，train函数与plsr函数的预测结果相同

> head(testResults[1:5,c("PLS_cv","PLSr")]) #可见，train函数与plsr函数的预测结果相同PLS_cv       PLSr
20  0.5506072  0.5506072
21  0.1418576  0.1418576
23 -0.4927215 -0.4927215
25  0.5059239  0.5059239
28  0.1064888  0.1064888

下面进行PCR主成分回归建模：

#PCR主成分回归
set.seed(100)
pcrTune <- train(x = solTrainXtrans, y = solTrainY,method = "pcr",tuneGrid = expand.grid(ncomp = 1:40),trControl = ctrl,preProc = c("center", "scale"))
pcrTune  testResults$PCR <- predict(pcrTune, solTestXtrans)

笔者进行了中心化与标准化。如果不使用，则结果跟书上一样，PCR用37个成分才达到了RMSE的最小值。但进行后，发现PCR在100+成分下的情况下也未达到最小值。

> pcrTune
Principal Component Analysis 951 samples
228 predictorsPre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:ncomp  RMSE       Rsquared     MAE      1     2.0355455  0.009995405  1.57600482     1.9769395  0.079463737  1.56092453     1.7043970  0.313044590  1.34937664     1.6022753  0.392005978  1.24229765     1.5920674  0.399525705  1.23778436     1.4425258  0.505930729  1.11091667     1.2874737  0.605238500  1.00207818     1.2905565  0.603239916  1.00724349     1.2956287  0.600452235  1.011765410     1.2727250  0.615882890  0.987558811     1.2473825  0.631012300  0.967000812     1.2428753  0.633688589  0.966581613     1.2388695  0.636315637  0.961392414     1.1890327  0.663558256  0.925541415     1.1617356  0.677867352  0.908963616     1.1105888  0.706637555  0.873157117     1.0496580  0.737469231  0.825167218     1.0415583  0.740561784  0.818867719     1.0294205  0.746280919  0.802401320     1.0130624  0.754530826  0.790369221     1.0018983  0.759391365  0.780961922     0.9993558  0.760355097  0.779961923     0.9759803  0.772350870  0.761919324     0.9766142  0.771999687  0.763382725     0.9739905  0.773149541  0.762159926     0.9644208  0.777538519  0.755472727     0.9642577  0.777545969  0.754458828     0.9600433  0.779304123  0.750362829     0.9592284  0.779991857  0.747706530     0.9582400  0.780537389  0.744673331     0.9345170  0.791119125  0.722567932     0.9253489  0.794788262  0.720173133     0.9139088  0.799805666  0.711488434     0.9144045  0.799335992  0.712283735     0.8995955  0.806076909  0.703261936     0.8909331  0.809912280  0.693002337     0.8839311  0.813218932  0.686227638     0.8832381  0.813533619  0.685180539     0.8829844  0.813722510  0.683984540     0.8692075  0.819293706  0.6758810RMSE was used to select the optimal model using the smallest value.
The final value used for the model was ncomp = 40.

绘图对比PLS与PCR的成分数选择过程

#绘图对比PLS与PCR的成分数选择过程
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
plsPlotData <- rbind(plsResamples, pcrResamples)xyplot(RMSE ~ ncomp,data = plsPlotData,#aspect = 1,xlab = "# Components",ylab = "RMSE (Cross-Validation)",auto.key = list(columns = 2),groups = Model,type = c("o", "g"))

由于PLS 的降维是有监督的，所以它更快地找到了预测变量与响应变量之间的关系。尽管两个模型的预测能力差别不大，但显然PLS 找到了一个更简单的模型，其成分数要远远小于PCR 。

对于PLS，预测变量在某些成分的标准化权重越大，以及响应变量变异被这些成分解释的比例越大，这一预测变量对于PLS 模型而言就越重要。

根据指标的建立过程，所有VIP 值的平方和等于顶测变量的总数目。一个常用的准则是，VIP 值大于1就意味着该变量包含了预
测响应变量的信息。Wold (1995 ）进一步建议，具有较小PLS 回归系数和VIP 值的预测变量可以被认为是不重要的．可以考虑将它们从模型中移除。

#显示PLS的变量重要性
plsImp <- varImp(plsTune, scale = FALSE) #varImp为caret包中的函数，用以计算train()函数产生对象的变量重要性
head(plsImp$importance)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))

> head(plsImp$importance)Overall
FP001 0.01298455
FP002 0.03359802
FP003 0.01070767
FP004 0.01784958
FP005 0.03157168
FP006 0.02440820

注：为什么变量重要性的值VIP都小于1问题未解决。

罚回归模型（岭回归、lasso回归、弹性网）

岭回归

岭回归建模：

用train函数选择岭回归的最佳参数

#设定随机数种子，这样十折交叉验证数据集可以重复
set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)#岭回归
#用train函数选择岭回归的最佳参数
#设定正则化参数 取值范围为0-0.1，中间取15个值
ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))set.seed(100)
ridgeTune <- train(x = solTrainXtrans, y = solTrainY,method = "ridge", #岭回归tuneGrid = ridgeGrid,trControl = ctrl,preProc = c("center", "scale"))
ridgeTune
testResults$ridge_cv <- predict(ridgeTune, solTestXtrans)

交叉验证选择的最佳lambda值为0.0357

> ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
+                    method = "ridge", #岭回归
+                    tuneGrid = ridgeGrid,
+                    trControl = ctrl,
+                    preProc = c("center", "scale"))
> ridgeTune
Ridge Regression 951 samples
228 predictorsPre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:lambda       RMSE       Rsquared   MAE      0.000000000  0.7207131  0.8769711  0.53980350.007142857  0.7047552  0.8818659  0.53524680.014285714  0.6964731  0.8847911  0.52985260.021428571  0.6925923  0.8862699  0.52704960.028571429  0.6908607  0.8870609  0.52600820.035714286  0.6904220  0.8874561  0.52606500.042857143  0.6908548  0.8875998  0.52673540.050000000  0.6919152  0.8875759  0.52779250.057142857  0.6934719  0.8874300  0.52919860.064285714  0.6954114  0.8872009  0.53085650.071428571  0.6976723  0.8869096  0.53272910.078571429  0.7002069  0.8865723  0.53472150.085714286  0.7029801  0.8862009  0.53687440.092857143  0.7059656  0.8858041  0.53913890.100000000  0.7091432  0.8853885  0.5415401RMSE was used to select the optimal model using the smallest value.
The final value used for the model was lambda = 0.03571429.

绘图，横轴为lambda取值，纵轴为RMSE。

可见，当没有惩罚项时．模型的误差非常大（模型过拟合，方差较大），而随着罚参数的增加，误差由0. 72 降低到了0. 69 。当罚参数增加且超过0.036 时，偏差变得较大，模型出现了拟合不足的问题，从而RMSE又开始增加。

#lambda与RMSE的关系图
print(update(plot(ridgeTune), xlab = "Penalty"))

选定最佳岭回归参数后，用elasticnet包的enet函数进行岭回归建模

ridgeTune$bestTune #交叉验证选定的最佳岭回归参数 0.03571429
#选定最佳岭回归参数后，进行岭回归建模
#岭回归模型可以利用MASS包中的lm.ridge函数或elasticnet包中的enet函数建立。
library(elasticnet)
ridgeModel <- enet(x = as.matrix(solTrainXtrans), y = solTrainY,lambda = 0.03571429,normalize = TRUE)
ridgeModel
#预测
#针对enet对象的predict函数可以通过参数s和mode来同时计算多个lasso罚参数下的预测值。
#对于岭回归模型，我们希望lasso 罚为0 ，即想要的是完全解。
#为了得到岭回归的解，定义s=l 和mode=" fraction ”， s 取值为1代表比例参数为1,即完全解。
ridgePred <- predict(ridgeModel, newx = as.matrix(solTestXtrans), s= 1, mode = "fraction",type ="fit")
head(ridgePred$fit)#预测值
testResults$ridge_enet <- ridgePred$fit#储存预测结果

> head(ridgePred$fit)#预测值20         21         23         25         28         31 0.5858371  0.2176331 -0.4492668  0.8766181  0.1447030  1.6156822

lasso回归与弹性网

lasso回归：可以进行特征选择

弹性网：组合了岭回归法和lasso罚，既可以进行有效的正则化（通过岭回归罚），还可以进行变量选择（通过lasso 罚）

进行lasso回归和弹性网建模。

注：当岭回归参数lambda为0时，弹性网即为lasso回归。

#lasso回归与弹性网
#弹性网模型同时具有岭回归罚参数和lasso 罚参数
#lambda为岭回归罚参数（当lambda为0时即为纯lasso模型）
#fraction为lasso罚参数，取值范围为0.05-1，取了20个值
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), fraction = seq(.05, 1, length = 20))
set.seed(100)
enetTune <- train(x = solTrainXtrans, y = solTrainY,method = "enet", #弹性网 elastic nettuneGrid = enetGrid,trControl = ctrl,preProc = c("center", "scale"))
enetTune#交叉验证建立模型进行预测
testResults$lasso_cv <- predict(enetTune,solTestXtrans)

交叉验证选择：fraction = 0.15 and lambda = 0，即为纯lasso模型，lasso罚为0.15

> enetTune
Elasticnet 951 samples
228 predictorsPre-processing: centered (228), scaled (228)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ...
Resampling results across tuning parameters:lambda  fraction  RMSE       Rsquared   MAE      0.00    0.05      0.8713747  0.8337289  0.66068360.00    0.10      0.6882637  0.8858786  0.52346360.00    0.15      0.6729264  0.8907993  0.51355770.00    0.20      0.6754697  0.8903865  0.51661060.00    0.25      0.6879252  0.8865202  0.52513130.00    0.30      0.6971062  0.8836414  0.53069890.00    0.35      0.7062274  0.8808469  0.53548310.00    0.40      0.7125901  0.8788941  0.53870850.00    0.45      0.7138745  0.8785587  0.53945600.00    0.50      0.7141241  0.8785620  0.53944050.00    0.55      0.7144676  0.8784958  0.53937960.00    0.60      0.7140540  0.8786589  0.53871740.00    0.65      0.7140608  0.8786876  0.53814640.00    0.70      0.7145473  0.8785740  0.53798490.00    0.75      0.7151022  0.8784343  0.53777240.00    0.80      0.7158078  0.8782624  0.53764420.00    0.85      0.7167930  0.8780153  0.53787700.00    0.90      0.7178723  0.8777462  0.53834020.00    0.95      0.7191461  0.8774050  0.53895630.00    1.00      0.7207131  0.8769711  0.53980350.01    0.05      1.5168857  0.6435177  1.16476580.01    0.10      1.1324481  0.7671388  0.86669060.01    0.15      0.9061843  0.8241043  0.68698280.01    0.20      0.7855269  0.8571170  0.59830990.01    0.25      0.7296380  0.8733531  0.55500000.01    0.30      0.6989522  0.8826020  0.53207110.01    0.35      0.6866513  0.8863490  0.52429040.01    0.40      0.6806730  0.8884346  0.52030750.01    0.45      0.6778780  0.8895285  0.51839580.01    0.50      0.6760780  0.8902871  0.51645570.01    0.55      0.6743998  0.8909724  0.51537390.01    0.60      0.6746777  0.8910026  0.51503220.01    0.65      0.6765522  0.8904906  0.51630520.01    0.70      0.6796775  0.8895768  0.51855690.01    0.75      0.6829651  0.8886182  0.52079390.01    0.80      0.6862396  0.8876472  0.52319560.01    0.85      0.6895735  0.8866477  0.52562050.01    0.90      0.6930103  0.8856210  0.52792800.01    0.95      0.6968398  0.8844630  0.53040190.01    1.00      0.7006283  0.8833050  0.53271900.10    0.05      1.6867967  0.5157969  1.29480150.10    0.10      1.4058744  0.6954146  1.07691930.10    0.15      1.1697385  0.7596795  0.89388380.10    0.20      1.0082617  0.7880698  0.76612930.10    0.25      0.8950440  0.8218825  0.67711720.10    0.30      0.8193443  0.8435444  0.62213070.10    0.35      0.7744593  0.8570276  0.59507140.10    0.40      0.7519611  0.8644826  0.57664850.10    0.45      0.7343282  0.8710631  0.56197170.10    0.50      0.7245543  0.8750318  0.55544090.10    0.55      0.7180823  0.8778937  0.55099780.10    0.60      0.7137901  0.8799906  0.54814700.10    0.65      0.7110967  0.8815343  0.54635290.10    0.70      0.7104058  0.8823940  0.54528000.10    0.75      0.7103284  0.8829674  0.54424580.10    0.80      0.7097899  0.8836319  0.54326260.10    0.85      0.7093246  0.8842290  0.54258490.10    0.90      0.7094949  0.8845954  0.54252650.10    0.95      0.7094181  0.8849823  0.54196770.10    1.00      0.7091432  0.8853885  0.5415401RMSE was used to select the optimal model using the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0.

绘图

plot(enetTune)

如图，纯lasso 模型（lambda=0 ）在最开始可以使得误差下降，并在比例大于0. 2 时逐渐上升。岭回归的罚参数不为0的两个模型在达到最小误差时包含了更多的变量。最终，最优的模型是lasso 模型，比例参数为0. 15. 对应于从228 个预测变量中选取了130 个（注：该结论如何得到尚不清楚）。

lasso模型同样可以由enet函数进行计算，设定lambda=0即为纯lasso模型

#lasso模型同样可以由enet函数进行计算，设定lambda=0即为纯lasso模型
lassoModel <- enet(x = as.matrix(solTrainXtrans), y = solTrainY,lambda = 0, normalize = TRUE)#s用交叉验证选定为值即可
lassoPred <- predict(lassoModel, newx = as.matrix(solTestXtrans),s= .15, mode = "fraction",type = "fit")
testResults$lasso_enet <- predict(enetTune,solTestXtrans)

为了了解哪些预测变量用于建模，可以在predict 方法中指定type ＝” coefficients"

#为了了解哪些预测变量用于建模，可以在predict 方法中指定type ＝” coefficients"
enetCoef <- predict(lassoModel, newx = as.matrix(solTestXtrans),s= .15, mode = "fraction",type = "coefficients")
tail(enetCoef$coefficients)

> tail(enetCoef$coefficients)NumChlorine        NumHalogen          NumRings HydrophilicFactor      SurfaceArea1 -0.4658882         0.0000000         0.0000000         0.0000000         0.2462247 SurfaceArea2 0.0000000

有许多其他的软件包可以用来拟合lasso 模型或其变种，如biglars （针对大数据集）、FLLat（针对混合型lasso ）、rplasso（分组lasso ）、penalized 和relaxo (松弛lasso)等。

最后对比一下各个模型预测训练集的结果

如下表，用同一算法建模的train函数（cv后缀）与对应函数的结果都相同（注意在写代码时同时中心化和标准化）。

> head(testResults[1:5,])obs          lm       lm_cv      lm_cor      lm_rlm     PLS_cv       PLSr        PCR
20 0.93  0.99370933  0.99370933  0.87458991  1.09257730  0.5506072  0.5506072 -0.2281472
21 0.85  0.06834627  0.06834627  0.09911774 -0.01596496  0.1418576  0.1418576  0.3554648
23 0.81 -0.69877632 -0.69877632 -0.51698844 -0.55293933 -0.4927215 -0.4927215 -0.7485805
25 0.74  0.84796356  0.84796356  0.79768536  0.82031495  0.5059239  0.5059239 -0.0786488
28 0.61 -0.16578324 -0.16578324  0.21186617 -0.06744233  0.1064888  0.1064888  0.4491043ridge_enet   ridge_cv  lasso_enet    lasso_cv
20  0.5858371  0.5858371  0.58577853  0.58577853
21  0.2176331  0.2176331  0.17759666  0.17759666
23 -0.4492668 -0.4492668 -0.45078961 -0.45078961
25  0.8766181  0.8766181  0.67547073  0.67547073
28  0.1447030  0.1447030 -0.05812367 -0.05812367

计算RMSE与R方。

RMSE 取值通常解释为（平均意义上）残差距离0 的远近，或者解释为观测值和模型预测值之间的平均距离。

R方被解释为数据包含的信息中能被模型所解释的比例。R方是一种相关性的度量，而不是准确性的度量。

caret软件包包含了计算RMSE和R2 的函数。RMSE(predicted, observed)，R2(predicted, observed)

model.compare<-data.frame(method='mymethod',RMSE=0,R2=0,stringsAsFactors = FALSE)
for (i in 2:ncol(testResults)) {model.compare[i-1,1]<-names(testResults[i])model.compare[i-1,2]<-RMSE(testResults[,i],testResults$obs)model.compare[i-1,3]<-caret::R2(testResults[,i],testResults$obs)
}
model.compare

可见，PCR主成分回归的RMSE最大，纯lasso模型的RMSE最小。

> model.comparemethod      RMSE        R2
1          lm 0.7455802 0.8722236
2       lm_cv 0.7455802 0.8722236
3      lm_cor 0.7601790 0.8669049
4      lm_rlm 0.7529670 0.8700394
5      PLS_cv 0.7205682 0.8800456
6        PLSr 0.7205682 0.8800456
7         PCR 0.8983336 0.8129715
8  ridge_enet 0.7193079 0.8812286
9    ridge_cv 0.7193079 0.8812286
10 lasso_enet 0.7012669 0.8861631
11   lasso_cv 0.7012669 0.8861631