Celaref | 单细胞测序细胞类型注释工具

我导再也不用担心我不认识marker啦

我们在进行单细胞测序的时候，通常情况下是通过高变genes来辨别细胞类型（于是一大堆不认识的），除了免疫细胞可能容易分析出来，其他的细胞我是两眼一抹黑。虽然具有single cell marker基因库，但在判断细胞亚型的时候有时并不是那么solid，有一些原则上比较特异的gene不知道为什么在cell marker上出现在不同的细胞上，比如S100A8,S100A9，在cell marker基因库中多为中性粒细胞的marker，但结合生理学意义，其实在单细胞中由于中性粒细胞本身含量较少且较为脆弱等原因导致捕获中性粒细胞是极为困难的。此时如果贸然下结论为中性粒细胞其实不利于后期的分析。

单细胞分群后，怎么找到Marker基因定义每一类群？

celaref R包通过与已知细胞类型的参考数据集的相似度进行比较。输入每个细胞中每个基因的reads counts数（gene-cell matrix）和每个细胞所属的簇（cluster）信息，和每个查询组中最明显富集的基因的参考样本比较，通过排名来匹配细胞类型。

Workflow

下面是典型的celaref工作流程，根据与参考数据集的转录组相似性来表征查询数据集的细胞簇。

数据及其格式

gene-cell matrix
每个细胞所属的cluster信息

输入数据之后利用MAST包进行基因表达差异分析，每个基因被指定为从0（最富集）到1（最不存在）的rescaled rank。

比较查询数据和参考数据

得到每个查询细胞簇的Up基因列表 — 在该簇中具有显著更高表达的基因。在每个参考细胞簇的基因排名中查找这些基因，比较并绘制相似性。

输出结果

通常，查询数据中的每个细胞簇都针对参考数据（X轴）中的所有内容绘制。刻度线表示up基因，并且**中位基因（middle generank）**显示为粗条。顶部附近的偏差分布表示各组的相似性 - 基本上相同的基因代表它们各自样品中的cluster。也就是说查询组聚类分析后的代表基因如果和reference具有一定的相似性，则可以通过其相同的基因代表与其对应，也就是细胞类型的对应。

R语言 - 箱线图（小提琴图、抖动图、区域散点图）

为clusters分配标签

通过上图第二列可以发现其实可能存在两种以上的标签，这时候就需要进行综合判断了。

实例分析

# Installing BiocManager if necessary:
#通过BiocManager进行R包加载
# install.packages("BiocManager")
BiocManager::install("celaref")

我们以系统包中的dataset为例进行演示。

假设有一个新的scRNAseq数据集（查询），其聚类已经聚集成4组：组1—4。但是我们还不知道哪个组对应于哪种细胞类型。

现在有一个相同组织类型的旧数据集（参考），其他人已经确定了细胞类型：Weird subtype, Exciting, Mystery cell typeand Dunno。

library(celaref)
# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# 加载数据
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

head(de_table.demo_ref) # 参考数据格式

 head(de_table.toy_query ) # 查询数据格式

# 过滤
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

# 分别得到各自的差异表达基因
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")

以上两步的运行均需要较长的时间。

# 展示查询组top基因在参考组中的分布
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)

# 获得各个group的标签
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)

通过以上方法我们基本可以辨别细胞类型，其中我们可以看到部分group并没有给出细胞类型（如group1,2在reciprocal_matches中没有找到匹配类型），有的可能会出现多个类型，需要进一步判断。

我们再以10X数据为例进行演示|PBMCs - 10X vs Microarray Reference

10X genomic****s有几个数据集可供从他们的网站下载，比如pbmc4k datasets（https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/pbmc4k），其中包含来自健康个体的外周血细胞(PBMC)数据，下面示例是筛选后的文件Gene/cell matrix(filtered)和聚类文件Clustering analysis。

查询数据：10X query dataset

library(celaref) # 首先设置路径并且加载数据，我们以kmeans_7_clusters 的聚类结果来进行细胞定义
datasets_dir <- "~/celaref_extra_vignette_data/datasets"dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(dataset_path   = file.path(datasets_dir,'10X_pbmc4k'),dataset_genome = "GRCh38",clustering_set = "kmeans_7_clusters",id_to_use      = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7) # 进行部分数据筛选并转换格式
？trim_small_groups_and_low_expression_genes # 查询函数帮助

de_table.10X_pbmc4k_k7   <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7) ##进行差异比较

参考数据：microarray dataset

查询数据有了，那么参考数据（reference）呢？

Watkins 等人在2009年的一篇文献已经发表过合适的 PBMC reference (a HaemAtlas)，这里使用的数据是从haemosphere网站（Graaf et al.2016，https://www.haemosphere.org/datasets/show）下载得到。

因为这是微阵列数据，所以还需要一些处理步骤：

Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)

this_dataset_dir<-file.path(datasets_dir,  'haemosphere_datasets','watkins')
norm_expression_file<-file.path(this_dataset_dir,"watkins_expression.txt")
samples_file<- file.path(this_dataset_dir, "watkins_samples.txt")
norm_expression_table.full<-ead.table(norm_expression_file, sep="\t", header=TRUE,quote="",comment.char="",row.names=1,check.names=FALSE)
samples_table<- read_tsv(samples_file, col_types = cols())
samples_table$description<- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays

该数据集还包括其他组织，但作为PBMC数据的参考，我们只想考虑外周血样本。

samples_table <- samples_table[samples_table$tissue == "Peripheral Blood",]

这就是大致的流程，最难的部分是数据格式。微阵列表达值应使用经过标准化，对数转换并具有相同ID号的数据作为query dataset。从haemosphere网站能得到标准化的数据 — 但仍需要匹配ID。

该数据来自Illumina HumanWG-6 v2 Expression BeadChips，并在探针水平上给出表达。需要将这些探针转换为gene symbol以匹配PBMC数据。


library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe  > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]colnames(the_probes_table) <- c("old_id", "new_id")# Before DE, just pick the top expresed probe to represent the gene# Not perfect, but this is a ranking-based analysis.# hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.probe_expression_levels <- rowSums(expression_table)the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]the_genes_table <-  the_probes_table %>%group_by(new_id) %>%top_n(1, avgexpr)expression_table <- expression_table[the_genes_table$old_id,]rownames(expression_table) <- the_genes_table$new_idreturn(expression_table)
}# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full,probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")

# 现在读取数据并用contrast_each_group_to_the_rest_for_norm_ma_with_limma进行对比。de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(norm_expression_table = norm_expression_table.genes,sample_sheet_table    = samples_table,dataset_name          = "Watkins2009PBMCs",extra_factor_name     = 'description',sample_name           = "sampleId",group_name            = 'celltype')</pre>

将10X查询数据与参考数据进行比较

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7,de_table.ref=de_table.Watkins2009PBMCs)

# 提供分组标签
label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)

这么热的单细胞，中科院的算法开发博士带你真正玩转这项平均每个月都有多篇高IF文章的技术。（点击蓝色字体了解详细）

Reference

Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
https://bioconductor.org/packages/release/bioc/vignettes/celaref/inst/doc/celaref_doco.html