比对

The raw Drop-seq data was processed with the standard pipeline (Drop-seq tools version 1.12 from McCarroll laboratory). Reads were aligned to the ENSEMBL release 84 Mus musculus genome.

10x Genomics data was processed using the same pipeline as for Drop-seq data, adjusting the barcode locations accordingly

我还没有深入接触10x和drop-seq的数据,目前的10x数据都是用官网cellranger跑出来的。

质控

We selected cells for downstream processing in each Drop-seq run, using the quality control metrics output by the Drop-seq tools package9, as well as metrics derived from the UMI matrix.

1) We first removed cells with a low number (<700) of unique detected genes. From the remaining cells, we filtered additional outliers.

2) We removed cells for which the overall alignment rate was less than the mean minus three standard deviations.

3) We removed cells for which the total number of reads (after log10 transformation) was not within three standard deviations of the mean.

4) We removed cells for which the total number of unique molecules (UMIs, after log10 transformation) was not within three standard deviations of the mean.

5) We removed cells for which the transcriptomic alignment rate (defined by PCT_USABLE_BASES) was not within three standard deviations of the mean.

6) We removed cells that showed an unusually high or low number of UMIs given their number of reads by fitting a loess curve (span= 0.5, degree= 2) to the number of UMIs with number of reads as predictor (both after log10 transformation). Cells with a residual more than three standard deviations away from the mean were removed.

7) With the same criteria, we removed cells that showed an unusually high or low number of genes given their number of UMIs. Of these filter steps, step 1 removed the majority of cells.

Steps 2 to 7 removed only a small number of additional cells from each eminence (2% to 4%), and these cells did not exhibit unique or biologically informative patterns of gene expression.

1. 过滤掉基因数量太少的细胞;

2. 过滤基因组比对太差的细胞;

3. 过滤掉总reads数太少的细胞;

4. 过滤掉UMI太少的细胞;

5. 过滤掉转录本比对太少的细胞;

6. 根据统计分析,过滤reads过多或过少的细胞;

7. 根据统计分析,过滤UMI过低或过高的细胞;

注:连过滤都有点统计的门槛,其实也简单,应该是默认为正态分布,去掉了左右极端值。

还有一个就是简单的拟合回归,LOESS Curve Fitting (Local Polynomial Regression)

How to fit a smooth curve to my data in R?

正则化

The raw data per Drop-seq run is a UMI count matrix with genes as rows and cells as columns. The values represent the number of UMIs that were detected. The aim of normalization is to make these numbers comparable between cells by removing the effect of sequencing depth and biological sources of heterogeneity that may confound the signal of interest, in our case cell cycle stage.

目前有很多正则化的方法,但是作者还是自己开发了一个。

正则化就是去掉一些影响因素,使得我们的数据之间可以相互比较。这里就提到了两个最主要的因素:测序深度和细胞周期。

A common approach to correct for sequencing depth is to create a new normalized expression matrix x with (see Fig), in which ci,j is the molecule count of gene i in cell j and mj is the sum of all molecule counts for cell j. This approach assumes that ci,j increases linearly with mj, which is true only when the set of genes detected in each cell is roughly the same.

可以看到常规的正则化方法是不适合的,

However, for Drop-seq, in which the number of UMIs is low per cell compared to the number of genes present, the set of genes detected per cell can be quite different. Hence, we normalize the expression of each gene separately by modelling the UMI counts as coming from a generalized linear model with negative binomial distribution, the mean of which can be dependent on technical factors related to sequencing depth. Specifically, for every gene we model the expected value of UMI counts as a function of the total number of reads assigned to that cell, and the number of UMIs per detected gene (sum of UMI divided by number of unique detected genes).

这个就有些门槛了,用了广义线性回归模型来做正则化。

To solve the regression problem, we use a generalized linear model (glm function of base R package) with a regularized overdispersion parameter theta. Regularizing theta helps us to avoid overfitting which could occur for genes whose variability is mostly driven by biological processes rather than sampling noise and dropout events. To learn a regularized theta for every gene, we perform the following procedure.

1) For every gene, obtain an empirical theta using the maximum likelihood model (theta.ml function of the MASS R package) and the estimated mean vector that is obtained by a generalized linear model with Poisson error distribution.

2) Fit a line (loess, span = 0.33, degree = 2) through the variance–mean UMI count relationship (both log10 transformed) and predict regularized theta using the fit. The relationship between variance and theta and mean is given by variance= mean + (mean2/theta).
Normalized expression is then defined as the Pearson residual of the regression model, which can be interpreted as the number of standard deviations by which an observed UMI count was higher or lower than its expected value. Unless stated otherwise, we clip expression to the range [-30, 30] to prevent outliers from dominating downstream analyses.

好的是,代码人家都给出来了,你去跑跑,就能猜出大致的意思。

# for normalization
# regularized overdispersion parameter theta. Regularizing theta helps us to avoid overfitting which could occur for genes whose variability is mostly driven by biological processes rather than sampling noise and dropout events.
# divide all genes into 64 bins
theta.reg <- function(cm, regressors, min.theta=0.01, bins=64) {b.id <- (1:nrow(cm)) %% max(1, bins, na.rm=TRUE) + 1cat(sprintf('get regularized theta estimate for %d genes and %d cells\n', nrow(cm), ncol(cm)))cat(sprintf('processing %d bins with ca %d genes in each\n', bins, round(nrow(cm)/bins, 0)))theta.estimate <- rep(NA, nrow(cm))# For every gene, obtain an empirical theta using the maximum likelihood model (theta.ml function of the MASS R package)for (bin in sort(unique(b.id))) {sel.g <- which(b.id == bin)bin.theta.estimate <- unlist(mclapply(sel.g, function(i) {# estimated mean vector that is obtained by a generalized linear model with Poisson error distributionas.numeric(theta.ml(cm[i, ], glm(cm[i, ] ~ ., data = regressors, family=poisson)$fitted))}), use.names = FALSE)theta.estimate[sel.g] <- bin.theta.estimatecat(sprintf('%d ', bin))}cat('done\n')raw.mean <- apply(cm, 1, mean)log.raw.mean <- log10(raw.mean)var.estimate <- raw.mean + raw.mean^2/theta.estimate# Fit a line (loess, span = 0.33, degree = 2) through the variance–mean UMI count relationship (both log10 transformed)fit <- loess(log10(var.estimate) ~ log.raw.mean, span=0.33)# predict regularized theta using the fit. The relationship between variance and theta and mean is given by variance= mean + (mean2/theta)theta.fit <- raw.mean^2 / (10^fit$fitted - raw.mean)to.fix <- theta.fit <= min.theta | is.infinite(theta.fit)if (any(to.fix)) {cat('Fitted theta below', min.theta, 'for', sum(to.fix), 'genes, setting them to', min.theta, '\n')theta.fit[to.fix] <- min.theta}names(theta.fit) <- rownames(cm)return(theta.fit)
}nb.residuals.glm <- function(y, regression.mat, fitted.theta, gene) {fit <- 0try(fit <- glm(y ~ ., data = regression.mat, family=negative.binomial(theta=fitted.theta)), silent=TRUE)if (class(fit)[1] == 'numeric') {message(sprintf('glm and family=negative.binomial(theta=%f) failed for gene %s; falling back to scale(log10(y+1))', fitted.theta, gene))return(scale(log10(y+1))[, 1])}return(residuals(fit, type='pearson'))
}## Main function
norm.nb.reg <- function(cm, regressors, min.theta=0.01, bins=64, theta.fit=NA, pr.th=NA, save.theta.fit=c()) {cat('Normalizing data using regularized NB regression\n')cat('explanatory variables:', colnames(regressors), '\n')if (any(is.na(theta.fit))) {theta.fit <- theta.reg(cm, regressors, min.theta, bins)if (is.character(save.theta.fit)) {save(theta.fit, file=save.theta.fit)}}b.id <- (1:nrow(cm)) %% max(1, bins, na.rm=TRUE) + 1cat('Running NB regression\n')res <- matrix(NA, nrow(cm), ncol(cm), dimnames=dimnames(cm))for (bin in sort(unique(b.id))) {sel.g <- rownames(cm)[b.id == bin]expr.lst <- mclapply(sel.g, function(gene) nb.residuals.glm(cm[gene, ], regressors, theta.fit[gene], gene), mc.preschedule = TRUE)# Normalized expression is then defined as the Pearson residual of the regression model, which can be interpreted as the number of standard deviations by which an observed UMI count was higher or lower than its expected value.res[sel.g, ] <- do.call(rbind, expr.lst)cat(sprintf('%d ', bin))}cat('done\n')# clip expression to the range [-30, 30] to prevent outliers from dominating downstream analysesif (!any(is.na(pr.th))) {res[res > pr.th] <- pr.thres[res < -pr.th] <- -pr.th}attr(res, 'theta.fit') <- theta.fitreturn(res)
}

  

  

单细胞数据初步处理 | drop-seq | QC | 质控 | 正则化 normalization相关推荐

  1. 单细胞测序流程(三)质控和数据过滤——Seurat包分析,小提琴图和基因离差散点图

    质控和数据过滤 准备工具:R. 准备数据:上期经过整理的数据geneMatrix. 注意事项:R的安装目录和文件所在位置都不可有英文. R 语言所需安装的包: #if (!requireNamespa ...

  2. 整合QC质控结果的利器——MultiQC

    一.MultiQC介绍 NGS技术的进步催生了新的实验设计.分析类型和极高通量测序数据的生成.对于这些数据的质量评估,每一步分析结果的评估是后续结果可信度的衡量和保障.不少生信工具都可以给样品生成一个 ...

  3. Nature Methods | 用深度多任务神经网络探索单细胞数据

    1.研究背景 在生物医学领域,分析大规模.高维度的单细胞数据,并且处理由分批实验效应和不同制备造成的数据噪声是当前的挑战:单细胞数据的大规模.高维度处理比较困难,需要考虑数据中不同程度的噪声.分批效应 ...

  4. 让你的单细胞数据动起来!|iCellR(二)

    前情回顾:让你的单细胞数据动起来!|iCellR(一) 在上文我们已经基本完成了聚类分析和细胞在不同条件下的比例分析,下面继续开始啦,让你看看iCellR的强大! 1. 每个cluster的平均表达量 ...

  5. python sci数据_scanpy学习笔记:用Python分析单细胞数据

    Scanpy 是一个基于 Python 分析单细胞数据的软件包,内容包括预处理,可视化,聚类,拟时序分析和差异表达分析等.本文翻译自 scanpy 的官方教程 Preprocessing and cl ...

  6. 用多任务网络探索单细胞数据

    目录 摘要 引言 结果 SAUCIE的架构和layer的正则 聚类Clustering Batch correction Imputation and denoising 可视化 对感染登革热的患者的 ...

  7. SingleR包注释单细胞数据

    SingleR包通过利用纯细胞类型的参考转录组数据集独立推断每个单细胞的起源细胞,从单细胞RNA测序数据执行无偏细胞类型识别. 1. 载入单细胞数据 # if (!requireNamespace(& ...

  8. 单细胞数据整合方法 | Comprehensive Integration of Single-Cell Data

    操作代码:https://satijalab.org/seurat/ 依赖的算法 CCA CANONICAL CORRELATION ANALYSIS | R DATA ANALYSIS EXAMPL ...

  9. 推荐几个单细胞数据分享和展示平台 | 短视频演示

    Broad的单细胞数据分享和展示平台 可选择子类展示 映射单个基因的颜色到t-SNE/UMAP图 分屏展示Cluster着色图和单基因着色图 多基因热图.Dotplot.Boxplot.Violinp ...

最新文章

  1. struts.xml配置文件中result的语法
  2. git 第三天 SSH免密码登录 2
  3. 实践 Network Policy - 每天5分钟玩转 Docker 容器技术(172)
  4. 一个整数,它加上100后是一个完全平方数,再加上168又是一个完全平方数,请问该数是多少...
  5. 秋招视频攻略!13个offer,8家SSP的Q神谈算法岗秋招技巧
  6. 就要有鹤立鸡群的HTML5资本
  7. Spring Data Jpa出现Not supported for DML operations
  8. C++实验课任务(多态--容器--算法)
  9. 中专考的计算机一级b有用吗,白城计算机一级B资格证真实可查么
  10. iOS开发常用技能点(持续更新中。。。)
  11. Multilingual预训练的那些套路
  12. win7耳机插前面没声音_win7电脑音箱没声音如何解决 win7电脑音箱没声音解决方式【图解】...
  13. CHM 打开时提示 已取消到该网页的导航
  14. 信息安全之加密域可逆信息隐藏
  15. “免费午餐”成为销量第一,看明星吉杰淘宝直播如何抓取粉丝眼球
  16. 小白MacBook超级实战教程——装双系统WIN10
  17. 图片过大怎么办?如何把图片压缩到最小
  18. 适合编程初学者的开源博客系统(Python版)
  19. Flagger on ASM——基于Mixerless Telemetry实现渐进式灰度发布系列 3 渐进式灰度发布
  20. 使用 wireshark 抓本地包

热门文章

  1. 判断是否是ie浏览器 前端js_JS判断是否是IE浏览器
  2. python网页运行环境_Python小牛叔Web开发1:安装Visual Studio Code配置Python运行环境...
  3. 计算机数学基础 课程定位图形,本科《计算机数学基础》(上)课程教学设计方案.doc...
  4. linux 内核升级 网络 不能上网,Deepin Linux 无法上网
  5. mysql 5.7 centos 7_CentOS 7 下 MySQL 5.7 的安装与配置
  6. pyminifier混淆代码的使用案例
  7. webservice 函数2007不可以用_Excel出了一个新函数,太好用啦!但我不建议你们学……...
  8. python前端开发之准备开发环境(建议收藏)
  9. 【项目管理】绩效域-工件裁剪对照(绩效维度)
  10. Jvm平时用到的参数