复现nature communication PCA原图|代码分析（一）

复现PCA原图之bulk RNA-seq

NGS系列文章包括NGS基础、转录组分析（Nature重磅综述|关于RNA-seq你想知道的全在这）、ChIP-seq分析（ChIP-seq基本分析流程）、单细胞测序分析 (重磅综述：三万字长文读懂单细胞RNA测序分析的最佳实践教程（原理、代码和评述）)、DNA甲基化分析、重测序分析、GEO数据挖掘（典型医学设计实验GEO数据分析 (step-by-step) - Limma差异分析、火山图、功能富集）等内容。

2020年4月14日，Sanger研究团队于nature communication在线发表了题为Single-cell transcriptomics identifies an effectorness gradient shaping the response of CD4+ T cells to cytokines的研究内容，作者使用蛋白质组学、bulk RNA-seq和单细胞转录组测序对人体40,000个以上的naïve and memory CD4+ T cells进行分析，发现细胞类型之间的细胞因子反应差异很大。memory T细胞不能分化为Th2表型，但可以响应iTreg极化获得类似Th17的表型。单细胞分析表明，T细胞构成了一个转录连续体（transcriptional continuum），从幼稚到中枢和效应记忆T细胞，形成了一种效应梯度，并伴随着趋化因子和细胞因子表达的增加。最后，作者表明，T细胞活化和细胞因子反应受效应梯度的影响。

我们这次复现一个Fig1cPCA图和Fig2aPCA图，这个说起来容易、做起来可就不一定容易的plot!

以上是Fig1c原图，图注为“PCA plots from the whole proteome of TN and TM cells. Different colors correspond to cell types and different shades to stimulation time points. PCA plots were derived using 47 naive and 47 memory T cell samples for RNAseq”，作者使用不同处理方式对human naive (TN) and memory (TM) CD4+ T cells进行处理，然后收集不同时间点的sample进行bulk RNA-seq。

以上为Fig 2a原图，图注为“PCA plot from the full transcriptome of TN and TM cells following five days of cytokine stimulations. Only stimulated cells were included in this analysis. PCA plots were derived using 20 naive and 21 memory T cells samples ”

我们需要复现该图之前，先需要下载数据，可以点击https://www.opentargets.org/projects/effectorness 对bulk RNA-Seq的count数据和metadata数据进行下载，然后进行以下步骤：

加载R包

library(annotables)
library(SummarizedExperiment)
library(DESeq2)
library(ggplot2)
library(cowplot)
library(rafalib)
library(dplyr)
library(reshape2)
library(RColorBrewer)
library(pheatmap)
library(limma)
library(ggrepel)

数据格式化

## 加载counts矩阵
counts <- read.table("NCOMMS-19-7936188_bulk_RNAseq_raw_counts.txt", header=T, row.names=1)
View(counts)# 行是基因的Ensembl ID，列是sample_ID，数据有94个样品## 加载sample metadata
sample.info <- read.table("NCOMMS-19-7936188_bulk_RNAseq_metadata.txt", header=T, row.names=1, stringsAsFactors = T)
View(sample.info)
## 以上数据有10列：sample_id、cell_type、cytokine_condition、stimulation_time、donor_id、sex、age、sequencing_batch、cell_culture_batch和rna_integrity_number。## 可以看出celltype有两种：naive和memory；cytokine_condition有7种：INFB，resting，iTreg，Th0，Th1，Th2，Th17;stimulation_time有两种：16h和5d。# 数据格式化
## 将所有注释转换为合适的数据类型
sample.info$sequencing_batch <- factor(sample.info$sequencing_batch)
sample.info$cell_culture_batch <- factor(sample.info$cell_culture_batch)
sample.info$activation_status <- ifelse(sample.info$cytokine_condition == "Resting", "Resting", "Activated")

从Ensembl87/GRCh38获取基因注释

## Fetching gene annotations from Ensembl87/GRCh38
features.data <- data.frame(grch38)
features.data <- features.data[!duplicated(features.data[c("ensgene")]),]
rownames(features.data) <- features.data$ensgene
features.data <- features.data[,-1]

NGS基础 - 参考基因组和基因注释文件

将基因注释转换为最合适的数据类型:

## Converting gene annotations to the most suitable data types
gene.info$Gene_id <- as.character(gene.info$Gene_id)
gene.info$Start <- as.numeric(gene.info$Start)
gene.info$End <- as.numeric(gene.info$End)
gene.info$Strand <- factor(gene.info$Strand)

创建relevant metadata的变量:

## Creating a variable with relevant metadata for the study
meta <- list(Study="Mapping cytokine induced gene expression changes in human CD4+ T cells",Experiment="RNA-seq panel of cytokine induced T cell polarisations",Laboratory="Trynka Group, Wellcome Sanger Institute",Experimenter=c("Eddie Cano-Gamez","Blagoje Soskic","Deborah Plowman"),Description="To study cytokine-induced cell polarisation, we isolated human naive and memory CD4+ T cells in triplicate from peripheral blood of healthy individuals. Next, we polarised the cells with different cytokine combinations linked to autoimmunity and performed RNA-sequencing.",Methdology=c("Library Prep: Illumina TruSeq (Poly-A capture method)", "Sequencing Platform: Illumina HiSeq 2500","Read Alignment: STAR (MAPQ > 20)","Read Quantification: featureCounts","Reference: Ensembl 87 (GRCh38/hg38)"),Characteristics="Data type: Raw counts",Date="September, 2019",URL="https://doi.org/10.1101/753731"
)

建立SummarizedExperiment object

# CREATING SUMMARIZED EXPERIMENT OBJECT
RNA.experiment <- SummarizedExperiment(assays=list(counts=as.matrix(counts)),colData=sample.info,rowData=gene.info,metadata=meta)

注意：这里我的报错了。。。

说的其实蛮明白的，于是运行以下代码：

rownames(sample.info) <- NULL

再次建立SummarizedExperiment object

# CREATING SUMMARIZED EXPERIMENT OBJECT
RNA.experiment <- SummarizedExperiment(assays=list(counts=as.matrix(counts)),colData=sample.info,rowData=gene.info,metadata=meta)

saveRDS(RNA.experiment, "bulkRNAseq_summarizedExperiment.rds")

加载数据

exp <- readRDS("bulkRNAseq_summarizedExperiment.rds")

基因过滤

根据以下条件筛选基因：

删除映射到Y染色体的基因；
删除映射到HLA区域的基因（即Ensembl 87/GRCh38中的“chr6 28,000,000-34,000,000”）；
仅保留蛋白质编码基因和lincRNA（Long intergenic non-coding RNA）；
删除低表达的基因（即至少30 reads）。

注意：DESeq2在进行统计测试时执行严格的自动过滤。

exp <- exp[!is.na(rowData(exp)$Chr)]
exp <- exp[rowData(exp)$Chr != "Y",]
exp <- exp[!(rowData(exp)$Chr == "6" & rowData(exp)$Start > 28000000 & rowData(exp)$End < 34000000),]
exp <- exp[rowData(exp)$Biotype == "protein_coding" | rowData(exp)$Biotype == "lincRNA",]
exp <- exp[rowSums(assay(exp))>30,]

将数据转换为DESeqDataSet。由于此代码中没有进行统计测试，因此design暂时为空（即〜1）。

dds <- DESeqDataSet(exp, design= ~ 1)
rowData(dds) <- rowData(exp)

DESeqDataSet是DESeq2包（DESeq2差异基因分析和批次效应移除）中储存reads count以及统计分析过程中的数据的一个“对象”，在代码中常表示为“dds”。

一个DESeqDataSet对象必须关联相应的design公式。design公式指明了要对哪些变量进行统计分析。该公式（上文中的design = ~batch + condition）以短波浪字符开头，中间用加号连接变量。design公式可以修改，但是相应的差异表达分析就需要重新做，因为design公式是用来估计统计模型的分散度以及log2 fold change的。

合并技术重复

写在前面：DESeq2 provides a function collapseReplicates which can assist in combining the counts from technical replicates into single columns of the count matrix. The term technical replicate implies multiple sequencing runs of the same library. You should NOT collapse biological replicates using this function.

将相同条件的技术复制合计到同一列中：

dds$ID <- factor(paste0(dds$donor_id, "_", dds$cell_type, "_", dds$stimulation_time, "_", dds$cytokine_condition))
dds.coll <- collapseReplicates(object=dds, groupby=dds$ID)
dds.coll <- estimateSizeFactors(dds.coll)
#在DESeq2中，通过estimateSizeFactors函数为每个样本计算一个系数，称之为sizefactor；saveRDS(dds.coll, "rawCounts_bulkRNAseq_DESeq2.rds")

数据归一化

通过log2对数据进行归一化：

rld.coll <- rlog(dds.coll, blind=TRUE)
saveRDS(rld.coll, "rlogCounts_bulkRNAseq_DESeq2.rds")

数据可视化

定义函数：执行主成分分析（PCA）（一文看懂PCA主成分分析）并返回每个样本的前6个PC值，以及样本注释和每个成分所解释的方差百分比。

get.pcs <- function(exp){pca <- prcomp(t(assay(exp)))pVar <- pca$sdev^2/sum(pca$sdev^2)intgroup=as.data.frame(colData(exp))d <- data.frame(PC1=pca$x[, 1], PC2=pca$x[, 2], PC3=pca$x[, 3],PC4=pca$x[, 4], PC5=pca$x[, 5], PC6=pca$x[, 6], intgroup)d <- list(d,pVar[1:6])names(d) <- c("PCs","pVar")return(d)
}

对rlog计数矩阵执行主成分分析：

pcs.rld <- get.pcs(rld.coll)

可视化PCA的结果

ggplot(data=pcs.rld$PCs, aes(x=PC1, y=PC2, color=cell_type, shape=activation_status, alpha = stimulation_time)) +geom_point(size = 8) +xlab(paste0("PC1:", round(pcs.rld$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.rld$pVar[2]*100), "% variance")) +scale_colour_manual(values = c("#5AB4AC","#AF8DC3")) + scale_alpha_discrete(range = c(0.5,1)) +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())

删除由PCA标识的异常样本（即262_CD4_Memory_16h_iTreg）：

dds.coll <- dds.coll[, dds.coll$ID != "262_CD4_Memory_16h_iTreg"]
rld.coll <- rld.coll[, rld.coll$ID != "262_CD4_Memory_16h_iTreg"]
saveRDS(dds.coll, "rawCounts_bulkRNAseq_DESeq2.rds")
saveRDS(rld.coll, "rlogCounts_bulkRNAseq_DESeq2.rds")

在clean data中进行PCA：

pcs.rld <- get.pcs(rld.coll)

继续可视化PCA结果

ggplot(data=pcs.rld$PCs, aes(x=PC1, y=PC2, color=cell_type, shape=activation_status, alpha = stimulation_time)) +geom_point(size = 8) +xlab(paste0("PC1:", round(pcs.rld$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.rld$pVar[2]*100), "% variance")) +scale_colour_manual(values = c("#5AB4AC","#AF8DC3")) + scale_alpha_discrete(range = c(0.5,1)) +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())

这下分布确实真的好看，去掉离群值后不同细胞的分散度更为合理。

然后我们可以看看原图是什么样子哒！

其实还蛮像的，，，其实好看的图都是“做”出来的。。。。

细胞类型和时间点特定分析

将naive T细胞和memory T细胞样本分为两个不同的DESeqDataSet。

dds.naive <- dds.coll[,dds.coll$cell_type=="CD4_Naive"]
dds.memory <- dds.coll[,dds.coll$cell_type=="CD4_Memory"]

rld.naive <- rld.coll[,rld.coll$cell_type=="CD4_Naive"]
rld.memory <- rld.coll[,rld.coll$cell_type=="CD4_Memory"]

naive T细胞

naive T细胞的早期刺激

仅对16h-stimulated naive T cells进行PCA:

rld_naive16h <- rld.naive[, (rld.naive$activation_status == "Activated") & (rld.naive$stimulation_time=="16h")]
pcs_naive16h <- get.pcs(rld_naive16h)

ggplot(data=pcs_naive16h$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())\#按照不同样本id进行分析；

去掉个体间变异性：

assay(rld_naive16h) <- removeBatchEffect(assay(rld_naive16h), batch = factor(as.vector(colData(rld_naive16h)$donor_id)))

重新计算PCA：

pcs_naive16h <- get.pcs(rld_naive16h)

ggplot(data=pcs_naive16h$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")

naive T细胞的晚期刺激

rld_naive5d <- rld.naive[, (rld.naive$activation_status == "Activated") & (rld.naive$stimulation_time=="5d")]
pcs_naive5d <- get.pcs(rld_naive5d)

ggplot(data=pcs_naive5d$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())

去掉个体间变异性：

assay(rld_naive5d) <- removeBatchEffect(assay(rld_naive5d), batch = factor(as.vector(colData(rld_naive5d)$donor_id)))

重新计算PCA：

pcs_naive5d <- get.pcs(rld_naive5d)

ggplot(data=pcs_naive5d$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")

原图

可以看出和原图还是很像的，分类也是蛮清楚的！

Memory T cells

memory T细胞的早期刺激

again……

Performing PCA on 16h-stimulated memory T cells only.

rld_memory16h <- rld.memory[, (rld.memory$activation_status == "Activated") & (rld.memory$stimulation_time=="16h")]
pcs_memory16h <- get.pcs(rld_memory16h)

ggplot(data=pcs_memory16h$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())

去掉个体间变异性：

assay(rld_memory16h) <- removeBatchEffect(assay(rld_memory16h), batch = factor(as.vector(colData(rld_memory16h)$donor_id)))

重新计算PCA：

pcs_memory16h <- get.pcs(rld_memory16h)

ggplot(data=pcs_memory16h$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs_naive16h$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs_naive16h6$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")

memory T细胞的晚期刺激

again……again…….

Performing PCA on 5 days-stimulated memory T cells only.

rld_memory5d <- rld.memory[, (rld.memory$activation_status == "Activated") & (rld.memory$stimulation_time=="5d")]
pcs_memory5d <- get.pcs(rld_memory5d)

ggplot(data=pcs_memory5d$PCs, aes(x=PC1, y=PC2, color=as.factor(donor_id))) +geom_point(size = 8) + xlab(paste0("PC1: ", round(pcs.n16$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.n16$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") + coord_fixed() +  theme_bw() +theme(panel.grid = element_blank())

assay(rld_memory5d) <- removeBatchEffect(assay(rld_memory5d), batch = factor(as.vector(colData(rld_memory5d)$donor_id)))

重新计算PCA

pcs_memory5d <- get.pcs(rld_memory5d)

ggplot(data=pcs_memory5d$PCs, aes(x=PC1, y=PC2, color=cytokine_condition)) +geom_point(size = 8) + geom_label_repel(aes(label=cytokine_condition, color=cytokine_condition)) +xlab(paste0("PC1: ", round(pcs.n16$pVar[1]*100), "% variance")) +ylab(paste0("PC2: ", round(pcs.n16$pVar[2]*100), "% variance")) +scale_colour_brewer(palette = "Dark2") +scale_fill_brewer(palette = "Dark2") +coord_fixed() +  theme_bw() +theme(panel.grid = element_blank(), legend.position = "none")

原图

基本分布还是差不多的，，，，
后面需要做什么差异分析的，就靠你自己了！

校对排版 | 生信宝典

你可能还想看

PCA主成分分析实战和可视化附R代码和测试数据
用了这么多年的PCA可视化竟然是错的！！！
什么？你做的差异基因方法不合适？
单细胞分群后，怎么找到Marker基因定义每一类群？

往期精品(点击图片直达文字对应教程)

后台回复“生信宝典福利第一波”或点击阅读原文获取教程合集