硬核干货：如果样本量不一一样多，或者不是一一对应关系，如何做差异？相关？...

写在前面

因为我们测定的样本不总是可以匹配上的，但是最少的样本也不能太少，我们测定的三个样本做相关其实还是被质疑，但是测定了5或者6个重复，这个时候直接将样本比较多的样本过滤掉比较少的样本不就可以了吗？这样做一次是没有什么意义的，但是如果将这个过程重复多次，让每个重复都使用的上，就会大大增加结果的可行性。当然审稿人就可以承认了。

例如下面这个文章

参考原文 ismej Metabolic regulation of the maize rhizobiome by benzoxazinoids

作者信息：T. E. Anne Cotton1,2 ● Pierre Pétriacq1,2,3,5 ● Duncan D. Cameron1,2 ● Moaed Al Meselmani1,2,4 ● Roland Schwarzenbacher1,2 ● Stephen A. Rolfe 1,2 ● Jurriaan Ton 1,2

发表：2019年

作者亲自书写：

The correlations are calcuated on an average (as the data are not paired) but this will give a false sense of precision in some interactions. One way to cope with this is to randomly assign ions to specific OTU counts and calculate the correlation. If we do this multiple times and then take the average (and SD) we have a better measure of the true correlation.

To see what occurs by chance we can assign the ions randomly across all of the samples rather than just within a treatment. Some false positives will occur by chance but we can then assess whether we get more in specific interactions.

Algorithm

8 bacterial samples

6 metabolomic samples

Within a treatment assign 6 of the metabolomic samples at random to the bacterial samples

Calculate the correlation coefficient

Repeat this process x times and calculate the mean value

那么这个过程是如何实现的呢，下面我带大家一起来实现这个过程：

构建随机OTU矩阵

library(tidyverse)

library(vegan)

library (plyr)

# 构建OTU表格
otu<-matrix(sample(1:500, 100),ncol=16,nrow = 300)
row.names(otu) = paste("ASV_",1:nrow(otu),sep = "")
colnames(otu) = c(paste("A",1:4,sep = ""),paste("B",1:4,sep = ""),paste("C",1:4,sep = ""),paste("D",1:4,sep = ""))
head(otu)

##        A1  A2  A3  A4  B1  B2  B3  B4  C1  C2  C3  C4  D1  D2  D3  D4
## ASV_1  42  42  42  42  42  42  42  42  42  42  42  42  42  42  42  42
## ASV_2 259 259 259 259 259 259 259 259 259 259 259 259 259 259 259 259
## ASV_3 407 407 407 407 407 407 407 407 407 407 407 407 407 407 407 407
## ASV_4 272 272 272 272 272 272 272 272 272 272 272 272 272 272 272 272
## ASV_5 209 209 209 209 209 209 209 209 209 209 209 209 209 209 209 209
## ASV_6 304 304 304 304 304 304 304 304 304 304 304 304 304 304 304 304

#--相对丰度标准化
otunorm = t(t(otu)/colSums(otu))
otunorm = as.data.frame(t(otunorm))otunorm$Group = c(rep("A",4),rep("B",4),rep("C",4),rep("D",4))

构建代谢组随机矩阵

代谢组表格和otu表格处理相同，但是有的重复为3个，有的重复为5个，不仅仅自己不统一重复数量，同时和otu的重复数量也不一致。

# 构建代谢组矩阵
#GC<-matrix(sample(1:500, 100),ncol=16,nrow = 100)
row.names(GC) = paste("GC_",1:nrow(GC),sep = "")
colnames(GC) = c(paste("A",1:3,sep = ""),paste("B",1:5,sep = ""),paste("C",1:3,sep = ""),paste("D",1:5,sep = ""))
head(GC)

##       A1  A2  A3  B1  B2  B3  B4  B5  C1  C2  C3  D1  D2  D3  D4  D5
## GC_1 186 186 186 186 186 186 186 186 186 186 186 186 186 186 186 186
## GC_2 171 171 171 171 171 171 171 171 171 171 171 171 171 171 171 171
## GC_3  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50  50
## GC_4 321 321 321 321 321 321 321 321 321 321 321 321 321 321 321 321
## GC_5 419 419 419 419 419 419 419 419 419 419 419 419 419 419 419 419
## GC_6 199 199 199 199 199 199 199 199 199 199 199 199 199 199 199 199

#--相对丰度标准化
GCnorm = t(t(GC)/colSums(GC))
# mapGC = data.frame(ID =c(paste("A",1:3,sep = ""),paste("B",1:5,sep = ""),paste("C",1:3,sep = ""),paste("D",1:5,sep = "")),
#                     Group = c(rep("A",3),rep("B",5),rep("C",3),rep("D",5)))
GCnorm = as.data.frame(t(GCnorm))GCnorm$Group = c(rep("A",3),rep("B",5),rep("C",3),rep("D",5))

提取各自的重复数量

#make the code robust to allow different sample depths
otu_reps<-data.frame(dplyr::count(otunorm,Group))
met_reps<-data.frame(dplyr::count(GCnorm,Group))
#get the minimum number
otu_min_reps<-min(otu_reps$n)
otu_min_reps

## [1] 4

met_min_reps<-min(met_reps$n)
met_min_reps

## [1] 3

min_reps=min(otu_min_reps,met_min_reps)
min_reps

## [1] 3

准备运行参数

相关参数和运行结果存储对象，顺便做一个进度条

depth=min_reps
repeats=10
cormethod = "spearman"
pb <- txtProgressBar(min=1,max=repeats,style=3,width=50)cor_list<-vector("list",repeats)
a = 1
R = c()
p = c()

for (a in 1:repeats){Sys.sleep(0.1)setTxtProgressBar(pb,a)#this has two effects  - it randomises the order and reduces the sample number to depthotu_select<-data.frame(otunorm %>% group_by(Group) %>% sample_n(depth))otu_select$Group = NULLdistotu <- vegan::vegdist(otu_select) met_select<-data.frame(GCnorm %>% group_by(Group) %>% sample_n(depth))met_select$Group = NULLdistGC <- vegan::vegdist(otu_select,"euclid") res <- vegan::mantel(distotu, distGC, method="spearman")p[a] = res$signifR[a] = res$statistic
}

## |                                                        |                                                  |   0%

result = data.frame(R_mean = mean(R),R_sd = sd(R),p_mean = mean(p),p_sd = sd(p))

这里的R值和p值都是NA，因为通过随机矩阵并没有差异，所以距离矩阵都是0，使得两个模拟数据不存在相关或者差异。

result

##   R_mean R_sd p_mean p_sd
## 1     NA   NA     NA   NA

往期精品(点击图片直达文字对应教程)

机器学习

后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集