R语言并行计算RC~bray-curtis~距离

R语言并行计算RC_bray-curtis距离

群落构建分析是微生物生态学分析的重要组成部分，成为目前文章发表的热点技术。之前我们介绍了计算beta-NTI（beta nearest taxon index）来进行群落构建分析。|beta-NTI| >2说明决定性过程主导，其中beta-NTI >2说明OTU的遗传距离发散，为生物交互作用主导，beta-NTI < -2则说明OUT的遗传距离收敛为环境选择主导。|beta-NTI| <2则说明随机过程主导，但随机过程包含遗传漂变、扩散限制和均质扩散，那么如何对随机过程进一步划分呢？

1. 如何理解RCbray-curtis距离？

根据OUT表获得每个物种的占居表（例如，物种1只在C群落中出现过则为1，物种2在A和B群落中出现过则为2，以此类推）和丰度表（获取物种在各个样点的丰度总和），然后根据物种占居表和丰度表进行模拟。在保证样点丰富度不变，根据占居表的概率选择出现的物种，根据丰度表的概率选择所出现物种的丰度，从而获得模拟条件下的OTU表，并计算样品间的Bray-Curtis距离。重复以上过程1000次，获得1000个模拟条件下样品间的Bray-Curtis距离。将模拟得到的Bray-Curtis距离与实际观察到的Bray-Curtis距离进行比较，记录小于实际观测值的比例，即为RCbray-curtis距离。当|beta-NTI| < 2时，可根据RCbray-curtis对随机过程进一步划分。 RC_bray-curtis > 0.95 意味着扩散限制，RCbray-curtis < 0.05意味着均值扩散，0.05 < RC_bray-curtis < 0.95 则意味着漂变。

2. 如何计算RCbray-curtis距离

#calculate RC-bray value if site which beta-NTI in (-2,2)
raup_crick_abundance = function(spXsite,plot_names_in_col1 = TRUE,classic_metric = FALSE,split_ties = TRUE,reps = 999,set_all_species_equal = FALSE,as.distance.matrix = TRUE,report_similarity = FALSE,threads = 1) {##expects a species by site matrix for spXsite, with row names for plots, or optionally plots named in column 1.  By default calculates a modification of the Raup-Crick metric (standardizing the metric to range from -1 to 1 instead of 0 to 1). Specifying classic_metric=TRUE instead calculates the original Raup-Crick metric that ranges from 0 to 1. The option split_ties (defaults to TRUE) adds half of the number of null observations that are equal to the observed number of shared species to the calculation- this is highly recommended.  The argument report_similarity defaults to FALSE so the function reports a dissimilarity (which is appropriate as a measure of beta diversity).  Setting report_similarity=TRUE returns a measure of similarity, as Raup and Crick originally specified.  If ties are split (as we recommend) the dissimilarity (default) and similarity (set report_similarity=TRUE) calculations can be flipped by multiplying by -1 (for our modification, which ranges from -1 to 1) or by subtracting the metric from 1 (for the classic metric which ranges from 0 to 1). If ties are not split (and there are ties between the observed and expected shared number of species) this conversion will not work. The argument reps specifies the number of randomizations (a minimum of 999 is recommended- default is 9999).  set_all_species_equal weights all species equally in the null model instead of weighting species by frequency of occupancy.##Note that the choice of how many plots (rows) to include has a real impact on the metric, as species and their occurrence frequencies across the set of plots is used to determine gamma and the frequency with which each species is drawn from the null model##this section moves plot names in column 1 (if specified as being present) into the row names of the matrix and drops the column of namesif (plot_names_in_col1) {row.names(spXsite) <- spXsite[, 1]spXsite <- spXsite[, -1]}## count number of sites and total species richness across all plots (gamma)library(doParallel)library(foreach)library(vegan)n_sites <- nrow(spXsite)gamma <- ncol(spXsite)##build a site by site matrix for the results, with the names of the sites in the row and col names:results <-matrix(data = NA,nrow = n_sites,ncol = n_sites,dimnames = list(row.names(spXsite), row.names(spXsite)))##make the spXsite matrix into a new, pres/abs. matrix:ceiling(spXsite / max(spXsite)) -> spXsite.inc##create an occurrence vector- used to give more weight to widely distributed species in the null model:occur <- apply(spXsite.inc, MARGIN = 2, FUN = sum)##create an abundance vector- used to give more weight to abundant species in the second step of the null model:abundance <- apply(spXsite, MARGIN = 2, FUN = sum)##make_null:##looping over each pairwise community combination:for (null.one in 1:(nrow(spXsite) - 1)) {for (null.two in (null.one + 1):nrow(spXsite)) {null_bray_curtis <- NULLcl <- makeCluster(threads)registerDoParallel(cl)null_bray_curtis <- foreach (i = 1:reps,.combine = "c") %dopar%{library(vegan)##two empty null communities of size gamma:com1 <- rep(0, gamma)com2 <- rep(0, gamma)##add observed number of species to com1, weighting by species occurrence frequencies:com1[sample(1:gamma,sum(spXsite.inc[null.one, ]),replace = FALSE,prob = occur)] <- 1com1.samp.sp = sample(which(com1 > 0),(sum(spXsite[null.one, ]) - sum(com1)),replace = TRUE,prob = abundance[which(com1 > 0)])com1.samp.sp = cbind(com1.samp.sp, 1)# head(com1.samp.sp);com1.sp.counts = as.data.frame(tapply(com1.samp.sp[, 2], com1.samp.sp[, 1], FUN = sum))colnames(com1.sp.counts) = 'counts'# head(com1.sp.counts);com1.sp.counts$sp = as.numeric(rownames(com1.sp.counts))# head(com1.sp.counts);com1[com1.sp.counts$sp] = com1[com1.sp.counts$sp] + com1.sp.counts$counts# com1;#sum(com1) - sum(spXsite[null.one,]); ## this should be zero if everything work properlyrm('com1.samp.sp', 'com1.sp.counts')##same for com2:com2[sample(1:gamma,sum(spXsite.inc[null.two, ]),replace = FALSE,prob = occur)] <- 1com2.samp.sp = sample(which(com2 > 0),(sum(spXsite[null.two, ]) - sum(com2)),replace = TRUE,prob = abundance[which(com2 > 0)])com2.samp.sp = cbind(com2.samp.sp, 1)# head(com2.samp.sp);com2.sp.counts = as.data.frame(tapply(com2.samp.sp[, 2], com2.samp.sp[, 1], FUN =sum))colnames(com2.sp.counts) = 'counts'# head(com2.sp.counts);com2.sp.counts$sp = as.numeric(rownames(com2.sp.counts))# head(com2.sp.counts);com2[com2.sp.counts$sp] = com2[com2.sp.counts$sp] + com2.sp.counts$counts# com2;# sum(com2) - sum(spXsite[null.two,]); ## this should be zero if everything work properlyrm('com2.samp.sp', 'com2.sp.counts')null.spXsite = rbind(com1, com2)# null.spXsite;##calculate null bray curtisnull_bray_curtis = vegdist(null.spXsite, method = 'bray')#print(c(i,date()))detach(package:vegan)}stopCluster(cl)# end reps loop## empirically observed bray curtisobs.bray = vegdist(spXsite[c(null.one, null.two), ], method = 'bray')##how many null observations is the observed value tied with?num_exact_matching_in_null = sum(null_bray_curtis == obs.bray)##how many null values are smaller than the observed *dissimilarity*?num_less_than_in_null = sum(null_bray_curtis < obs.bray)rc = (num_less_than_in_null) / reps# rc;if (split_ties) {rc = ((num_less_than_in_null + (num_exact_matching_in_null) / 2) / reps)}if (!classic_metric) {##our modification of raup crick standardizes the metric to range from -1 to 1 instead of 0 to 1rc = (rc - .5) * 2}results[null.two, null.one] = round(rc, digits = 2)##store the metric in the results matrixprint(c(null.one, null.two, date()))}## end null.two loop}## end null.one loopif (as.distance.matrix) {## return as distance matrix if so desiredresults <- as.dist(results)}return(results)
}
## end function
library(vegan)
#读取OTU表
otu_biom2 <- read.delim("./otu table.txt", row.names=1)
#查看样品序列条数
summary(colSums(otu_biom2))
otu_biom2  = otu_biom2[,colSums(otu_biom2) > 8500]
#将OTU表转置为行名为样品名称，列名为OTU名称的表
otu<-t(otu_biom2)
head(rowSums(otu))
#对OTU表进行抽平
otu = as.data.frame(rrarefy(otu, 8500))
head(rowSums(otu))
#去除序列条数较少的OTU
otu_niche<-otu[,log(colSums(otu)/sum(otu),10)>-4.5]
#以前500个OTU为例，度量十核计算所需的时间
system.time(raup_crick_abundance(otu_niche[,1:200],plot_names_in_col1 = FALSE,threads=1))
#以前500个OTU为例，度量单核计算所需的时间
system.time(raup_crick_abundance(otu_niche[,1:200],plot_names_in_col1 = FALSE,threads=5))

3. 多核运行可以节省多少时间？
根据下图我们可以看到，以前200个OTUs为例，十核计算一个样本对所需的时间是1秒，总共花费194秒；而单核计算一个样本对花费的时间是7-8秒，总共花费1193秒。当样本量较大时，多核计算可以节省更多时间。

测试数据和代码可以通过如下链接进行下载，两者内容相同：
面包多：https://mianbaoduo.com/o/bread/YpmTlZhu
CSDN: https://download.csdn.net/download/weixin_43367441/79594637
更多R语言分析微生物生态学的资源可参考如下链接：
https://mbd.pub/o/bread/mbd-YpmTlZpr

注意事项：

Raup-Crick index for absence-presence data. This index (Raup & Crick1979) uses a randomization (Monte Carlo) procedure, comparing the observed number of species ocurring in both associations with the distribution of co-occurrences from 1000 random replicates from the pool of samples (not from the pair of samples, as incorrectly assumed by some authors).

参考文献：
[1] Quantifying Community Assembly Processes and Identifying Features that Impose Them (https://doi.org/10.1038/ismej.2013.93)
[2] https://www.nhm.uio.no/english/research/infrastructure/past/help/similarity.html
[3] Using null models to disentangle variation in community dissimilarity from variation in α-diversity