一、pRRophetic 是干什么用的？

pRRophetic是明尼苏达大学Paul Geeleher, Nancy Cox, R. Stephanie Huang做的包，主要算法是基于自己课题组2014年的发的Genome Biology，Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines (https://link.springer.com/article/10.1186/gb-2014-15-3-r47)。这个R包的算法的输入主要是较大的Project的cell line expression profiles 与对应的IC50信息，然后通过ridge regression建立一个模型，再用这个模型去预测临床样本的Chemotherapeutic Response。2021年，这个课题组又出了
oncoPredict (oncoPredict: an R package for predicting in vivo or cancer patient drug response and biomarkers from cell line screening data), 算是对pRRophetic的一次升级。
(注：R 4.2.1 版本会报错，建议使用4.0.2以下版本）

二、使用指南

1. 包的安装

github地址：https://github.com/paulgeeleher/pRRophetic
但是大家好像下载tar本地安装

OSF | pRRophetic R package Wiki
https://osf.io/5xvsg/wiki/home/

需要 “car”, “ridge”, “preprocessCore”, “genefilter”, “sva” 四个前置包。

# 如果不缺环境 直接安装就行，不过我喜欢Rstudio，点点鼠标就完事了。
install.packages("pRRophetic_0.5.tar.gz", repos = NULL, dependencies = TRUE)

值得一提的是，这个包里除了源代码，包含很多现成的数据集，这个就很人性化。

# 数据集在这里
find.package("pRRophetic") %>% paste0("/data") %>% dir()
[1] "bortezomibData.RData"
[2] "ccleData.RData"
[3] "Cell_line_RMA_proc_basalExp.txt"
[4] "Cell_Lines_Details-1.csv"
[5] "cgp2016ExprRma.RData"
[6] "datalist"
[7] "drugAndPhenoCgp.RData"
[8] "PANCANCER_IC_Tue Aug  9 15_28_57 2016.csv"
[9] "PANCANCER_IC_Tue_Aug_9_15_28_57_2016.RData"

2. 使用范例

核心的函数 pRRopheticPredict

pRRopheticPredict {pRRophetic}   R Documentation
Given a gene expression matrix, predict drug senstivity for a drug in CGP
Description
Given a gene expression matrix, predict drug senstivity for a drug in CGP.Usage
pRRopheticPredict(testMatrix, drug, tissueType = "all", batchCorrect = "eb",powerTransformPhenotype = TRUE, removeLowVaryingGenes = 0.2,minNumSamples = 10, selection = -1, printOutput = TRUE,removeLowVaringGenesFrom = "homogenizeData", dataset = "cgp2014")
Arguments
testMatrix
a gene expression matrix with gene names as row ids and sample names as column ids.drug
the name of the drug for which you would like to predict sensitivity, one of A.443654, A.770041, ABT.263, ABT.888, AG.014699, AICAR, AKT.inhibitor.VIII, AMG.706, AP.24534, AS601245, ATRA, AUY922, Axitinib, AZ628, AZD.0530, AZD.2281, AZD6244, AZD6482, AZD7762, AZD8055, BAY.61.3606, Bexarotene, BI.2536, BIBW2992, Bicalutamide, BI.D1870, BIRB.0796, Bleomycin, BMS.509744, BMS.536924, BMS.708163, BMS.754807, Bortezomib, Bosutinib, Bryostatin.1, BX.795, Camptothecin, CCT007093, CCT018159, CEP.701, CGP.082996, CGP.60474, CHIR.99021, CI.1040, Cisplatin, CMK, Cyclopamine, Cytarabine, Dasatinib, DMOG, Docetaxel, Doxorubicin, EHT.1864, Elesclomol, Embelin, Epothilone.B, Erlotinib, Etoposide, FH535, FTI.277, GDC.0449, GDC0941, Gefitinib, Gemcitabine, GNF.2, GSK269962A, GSK.650394, GW.441756, GW843682X, Imatinib, IPA.3, JNJ.26854165, JNK.9L, JNK.Inhibitor.VIII, JW.7.52.1, KIN001.135, KU.55933, Lapatinib, Lenalidomide, LFM.A13, Metformin, Methotrexate, MG.132, Midostaurin, Mitomycin.C, MK.2206, MS.275, Nilotinib, NSC.87877, NU.7441, Nutlin.3a, NVP.BEZ235, NVP.TAE684, Obatoclax.Mesylate, OSI.906, PAC.1, Paclitaxel, Parthenolide, Pazopanib, PD.0325901, PD.0332991, PD.173074, PF.02341066, PF.4708671, PF.562271, PHA.665752, PLX4720, Pyrimethamine, QS11, Rapamycin, RDEA119, RO.3306, Roscovitine, Salubrinal, SB.216763, SB590885, Shikonin, SL.0101.1, Sorafenib, S.Trityl.L.cysteine, Sunitinib, Temsirolimus, Thapsigargin, Tipifarnib, TW.37, Vinblastine, Vinorelbine, Vorinostat, VX.680, VX.702, WH.4.023, WO2009093972, WZ.1.84, X17.AAG, X681640, XMD8.85, Z.LLNle.CHO, ZM.447439.tissueType
specify if you would like to traing the models on only a subset of the CGP cell lines (based on the tissue type from which the cell lines originated). This be one any of "all" (for everything, default option), "allSolidTumors" (everything except for blood), "blood", "breast", "CNS", "GI tract" ,"lung", "skin", "upper aerodigestive"batchCorrect
How should training and test data matrices be homogenized. Choices are "eb" (default) for ComBat, "qn" for quantiles normalization or "none" for no homogenization.powerTransformPhenotype
Should the phenotype be power transformed before we fit the regression model? Default to TRUE, set to FALSE if the phenotype is already known to be highly normal.removeLowVaryingGenes
What proportion of low varying genes should be removed? 20 percent be defaultminNumSamples
How many training and test samples are requried. Print an error if below this thresholdselection
How should duplicate gene ids be handled. Default is -1 which asks the user. 1 to summarize by their or 2 to disguard all duplicates.printOutput
Set to FALSE to supress outputValue
a gene expression matrix that does not contain duplicate gene ids

# 加载数据集 硼替佐米（Bortezomib）
data(bortezomibData)

这个数据集来自与GSE9782。exprDataBortezomib数据集里面包含264的样本，pRRphetic里预处理后有22283个基因变量。另一个人重要的输入，就是drug response了。

# studyResponse的level 有三个 Levels: PGx_Responder = IE PGx_Responder = NR PGx_Responder = R
table(studyResponse)PGx_Responder = IE PGx_Responder = NR  PGx_Responder = R
25                126                113

首先QQplot看一下IC50是否符合正态分布。

pRRopheticQQplot("Bortezomib")

然后执行了一个 5-fold cross-validation，验证一下是否可以预测临床药物敏感性。可能朋友们会有疑问，为啥没有train dataset，只有test dataset，因为train dataset其实是默认好的，是CGP cell lines，可以用tissueType执行肿瘤类型（This be one any of “all” (for everything, default option), “allSolidTumors” (everything except for blood), “blood”, “breast”, “CNS”, “GI tract” ,“lung”, “skin”, “upper aerodigestive”）。

cvOut <- pRRopheticCV("Bortezomib", cvFold=5, testExprData=exprDataBortezomib)
summary(cvOut)

看看结果

Summary of cross-validation results:Pearsons correlation: 0.43 , P =  3.57572505890629e-14
R-squared value: 0.19
Estimated 95% confidence intervals: -4.41, 3.56
Mean prediction error: 1.61

#Plot the cross validation predicted phenotype against the measured IC50s.
plot(cvOut)

pRRopheticPredict
Given a gene expression matrix, predict drug senstivity for a drug in CGP

Based on the qqplot it is likely acceptable to use these data for prediction of
bortezomib sensitivity. Predict bortezomib sensitivity using all cell lines, then
only cell lines from hematological cancers and then only cell lines from derived from solid tumors. （selection = ？，How should duplicate gene ids be handled. Default is -1 which asks the user. 1 to summarize by their or 2 to disguard all duplicates.）

predictedPtype <- pRRopheticPredict(exprDataBortezomib, "Bortezomib", selection=1)
predictedPtype_blood <- pRRopheticPredict(exprDataBortezomib, "Bortezomib", "blood", selection=1)
predictedPtype_solid <- pRRopheticPredict(exprDataBortezomib, "Bortezomib","allSolidTumors", selection=1)

t.test(predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)], predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)], alternative="greater")Welch Two Sample t-testdata:  predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 3.0109, df = 163.75, p-value = 0.001509
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:0.1826465       Inf
sample estimates:
mean of x mean of y
-4.926576 -5.331926 t.test(predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)], predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)],alternative="greater")Welch Two Sample t-testdata:  predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 4.1204, df = 165.24, p-value = 2.984e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:0.3975589       Inf
sample estimates:
mean of x mean of y
-4.372173 -5.036370 t.test(predictedPtype_solid[((studyResponse == "PGx_Responder = NR") & bortIndex)], predictedPtype_solid[((studyResponse == "PGx_Responder = R") & bortIndex)],alternative="greater")Welch Two Sample t-testdata:  predictedPtype_solid[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype_solid[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 1.1167, df = 165.87, p-value = 0.1329
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:-0.07325769         Inf
sample estimates:
mean of x mean of y
-5.381191 -5.533412

allTissuesPpv <- getPPV(predResponders=predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)], predNonResponders=predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)], drug="Bortezomib")PPV:  0.57
NPV:  0.59
Cutpoint:  -5.04 bloodPpv <- getPPV(predResponders=predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)], predNonResponders=predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)], drug="Bortezomib", tissue="blood")PPV:  0.63
NPV:  0.69
Cutpoint:  -4.54

df <- stack(list(NR=predictedPtype_blood[((studyResponse == "PGx_Responder = NR")& bortIndex)], R=predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)]))
ggplot(data=df, aes(y=values, x=ind)) + geom_boxplot(alpha=.3, fill=c("#CC0033", "#006633")) + theme_bw() + ylab("Predicted Bortezomib Sensitivity") + xlab("Clinical Response")

然后是CCLE数据库，

data(ccleData)

exprMatCcle是表达矩阵，里面有1037个cell line，18988个基因信息。
sensDataCcle里面有504的cell line的各种信息，包括“”CLE.Cell.Line.Name" ，"Primary.Cell.Line.Name”，“Compound”，还有最最重要的“IC50…uM.”

cvOut_pd <- pRRopheticCV("PD.0325901", cvFold=5, testExprData=exprMatCcle)
summary(cvOut_pd)

plot(cvOut_pd)

predictedPtype_ccle <- pRRopheticPredict(exprMatCcle, "PD.0325901", selection=1)11897  gene identifiers overlap between the supplied expression matrices... Found2batches
Adjusting for0covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data2380 low variabilty genes filtered.
Fitting Ridge Regression model... DoneCalculating predicted phenotype...Done

cellLinesWithCcleIc50s <- names(predictedPtype_ccle)[names(predictedPtype_ccle) %in%sensDataCcle$CCLE.Cell.Line.Name]
predCcleOrd <- predictedPtype_ccle[names(predictedPtype_ccle)]
ccleActArea_pd <- -sensDataCcle$"ActArea"[sensDataCcle$Compound == "PD-0325901"]
names(ccleActArea_pd) <- sensDataCcle$"CCLE.Cell.Line.Name"[sensDataCcle$Compound =="PD-0325901"]
ccleActAreaord <- ccleActArea_pd[cellLinesWithCcleIc50s]

cor.test(predictedPtype_ccle[cellLinesWithCcleIc50s], ccleActAreaord, method="spearman")Spearman's rank correlation rhodata:  predictedPtype_ccle[cellLinesWithCcleIc50s] and ccleActAreaord
S = 7728298, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:rho
0.5807112

df2 <- data.frame(predCcle=predictedPtype_ccle[cellLinesWithCcleIc50s], actAreaCcle=ccleActAreaord)
ggplot(data=df2, aes(y=predCcle, x=actAreaCcle)) + geom_point(alpha=0.5) + geom_smooth(method=lm) + theme_bw() + xlab("Measured Activity Area") +ylab("Predicted Drug Sensitivity")

predictedPtype_ccle_erlotinib <- pRRopheticLogisticPredict(exprMatCcle, "Erlotinib", selection=1)

Get the ActArea for the CCLE cell lines for which we have just predicted
IC50.

cellLinesWithCcleIc50s <- names(predictedPtype_ccle_erlotinib)[names(predictedPtype_ccle_erlotinib) %in%sensDataCcle$CCLE.Cell.Line.Name]
predCcleOrd <- predictedPtype_ccle_erlotinib[names(predictedPtype_ccle_erlotinib)]
ccleActArea_pd <- sensDataCcle$"ActArea"[sensDataCcle$Compound == "Erlotinib"]
names(ccleActArea_pd) <- sensDataCcle$"CCLE.Cell.Line.Name"[sensDataCcle$Compound =="Erlotinib"]
ccleActAreaord <- ccleActArea_pd[cellLinesWithCcleIc50s]

There are a very large number of cell lines resistant to Erlotinib (within
the drug screening window), so a correlation is not an appropriate measure of concordance. So lets do a t-test between some of the most sensitive and resistant cell lines to assess whether signal is being captured by the predictions.

resistant <- names(sort(ccleActAreaord))[1:55] #55 highly resistant cell lines.
sensitive <- names(sort(ccleActAreaord, decreasing=TRUE))[1:15] #15 sensitive
t.test(predictedPtype_ccle_erlotinib[resistant], predictedPtype_ccle_erlotinib[sensitive])Welch Two Sample t-testdata:  predictedPtype_ccle_erlotinib[resistant] and predictedPtype_ccle_erlotinib[sensitive]
t = 2.431, df = 32.127, p-value = 0.02081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:0.2421538 2.7431351
sample estimates:
mean of x mean of y
-5.184405 -6.677049

Despite the fact that IC50 values are not correlated for this drug between
these studies, the most sensitive/resistant samples are separated highly significantly with this logistic models.

boxplot(list(Resistant=predictedPtype_ccle_erlotinib[resistant], Sensitive=predictedPtype_ccle_erlotinib[sensitive]), pch=20, vertical=TRUE, method="jitter", ylab="Log-odds of sensitivity")

Include, is an example, prediction from the bortezomib clinical data where we try to predict CR, PR, MR, NC, PD from CR, PR, MR, NC, PD. This serves as both an example of prediction directly from clinical data and of using a dataset other than the CGP from which to predict.
First, prepare the training data and test expression data.

trainExpr <- exprDataBortezomib[, (detailedResponse %in% c(1,2,3,4,5)) & studyIndex %in% c("studyCode = 25", "studyCode = 40")]
trainPtype <- detailedResponse[(detailedResponse %in% c(1,2,3,4,5)) & studyIndex %in% c("studyCode = 25", "studyCode = 40")]
testExpr <- exprDataBortezomib[, (detailedResponse %in% c(1,2,3,4,5)) & studyIndex %in% c("studyCode = 39")]

Calculate the predicted phenotype.

ptypeOut <- calcPhenotype(trainExpr, trainPtype, testExpr, selection=1)

testPtype <- detailedResponse[(detailedResponse %in% c(1,2,3,4,5)) & studyIndex %in% c("studyCode = 39")]
cor.test(testPtype, ptypeOut, alternative="greater")

t.test(ptypeOut[testPtype %in% c(3,4,5)], ptypeOut[testPtype %in% c(1,2)], alternative="greater")

pRRophetic 通过基因表达水平预测临床化疗反应的R包相关推荐

16S预测宏基因组最强R包-Tax4Fun
之前在公众号的文章<根据16S预测微生物群落功能最全攻略>阅读人数近3000人,有需求的用户还是非常多的.其中提到了4个软件,之前已经介绍了其中非常有特点的三种,分别为: - PICRUS ...
Kaggle实例-家庭贫困水平预测
Kaggle实例-家庭贫困水平预测 **1. 数据背景** **2. 数据目标** **3. 问题和数据说明** **3.1. 目标说明** **3.2. 评估度量** **4. 数据分析** **4 ...
R语言编写自定义函数、评估回归模型预测变量的相对重要性（Relative importance）、通过在所有可能的子模型中添加一个预测变量而获得的R方的平均增加、评估预测变量的重要度、并通过点图可视化
R语言编写自定义函数.评估回归模型预测变量的相对重要性(Relative importance).通过在所有可能的子模型中添加一个预测变量而获得的R方的平均增加.来评估预测变量的重要程度.并通过点图可 ...
手把手教你用Prophet快速进行时间序列预测（附Prophet和R代码）
作者:ANKIT CHOUDHARY 翻译:王雨桐校对:丁楠雅本文约3000字,建议阅读12分钟. 本文将通过拆解Prophet的原理及代码实例来讲解如何运用Prophet进行时间序列预测. 简介 ...
R包estimate评估肿瘤组织中基质及免疫细胞浸润水平
根据表达数据,ESTIMATE为研究人员提供了肿瘤纯度.存在的基质细胞水平和肿瘤组织中免疫细胞浸润水平的分数.https://bioinformatics.mdanderson.org/estimat ...
python 时间序列prophet 模型分析_手把手教你用Prophet快速进行时间序列预测（附Prophet和R代码）...
原标题:手把手教你用Prophet快速进行时间序列预测(附Prophet和R代码) 作者:ANKIT CHOUDHARY:翻译:王雨桐:校对:丁楠雅: 本文约3000字,建议阅读12分钟. 本文将通过 ...
预测房价：回归问题——R语言
在回归问题中,我们的目标是预测连续值的输出,如价格或概率.将此与分类问题进行对比,分类的目标是预测离散标签(例如,图片包含苹果或橙色). 问题描述我们将要预测20世纪70年代中期波士顿郊区房屋价格的 ...
深耕大数据市场，所问数据打造深度学习数据分析与预测引擎
卖什么?卖多少钱? 这些是每一个线上零售卖家都会遇到的问题.在大数据时代开始之前,答案都是基于个人经验做的判断:随着近年数据分析平台纷纷上线,卖家们也渐渐开始接受多维度.不同时间粒度的数据分析服务,包 ...
得到多组单选框的值_多组学如何构建预后预测模型，还发了7分+？
今天小编解读的这篇文章是2020年发表在Molecular Therapy-Nucleic Acids上(影响因子7.032).此研究对肺腺癌进行了多组学分析,并建立预后预测模型.作者的预后预测模型可 ...

pRRophetic 通过基因表达水平预测临床化疗反应的R包

一、pRRophetic 是干什么用的？

二、使用指南

1. 包的安装

2. 使用范例

pRRophetic 通过基因表达水平预测临床化疗反应的R包相关推荐

最新文章

热门文章