算法,参数,输出。

Peak calling


ChIP-seq实验,从比对文件中观察到正/负链上以结合位点为中心的非对称reads
密度。
For ChIP-seq experiments, what we observe from the alignment files is a strand asymmetry with read densities on the +/- strand, centered around the binding site. The 5’ ends of the selected fragments will form groups on the positive- and negative-strand. The distributions of these groups are then assessed using statistical measures and compared against background (input or mock IP samples) to determine if the site of enrichment is likely to be a real binding site.


ChIP-seq analysis algorithms are specialized in identifying one of two types of enrichment (or have specific methods for each): broad peaks or broad domains (i.e. histone modifications that cover entire gene bodies) or narrow peaks ( transcription factor binding). Narrow peaks are easier to detect as we are looking for regions that have higher amplitude and are easier to distinguish from the background, compared to broad or dispersed marks. There are also ‘mixed’ binding profiles which can be hard for algorithms to discern. An example of this is the binding properties of PolII, which binds at promotor and across the length of the gene resulting in mixed signals (narrow and broad).

NOTE:ChIP-seq的分析方法可以鉴定两种类型的富集模式:broad domains和narrow peaks。broad domains,如组蛋白修饰在整个基因body区域的分布;narrow peak,如转录因子的结合。narrow peak相对于broad 或者分散的marks更易被检测到。也有一些混合的结合图谱,如PolII包括narrow和broad信号。

MACS2

MACS2是最常用的call peaks工具。 The MACS algorithm captures the influence of genome complexity to evaluate the significance of enriched ChIP regions。全称Model-based Analysis of ChIP-Seq,最初设计用来鉴定转录因子结合位点(also suited for larger regions),也可用于其他类型的富集方式测序。
MACS通过整合序列标签位置信息方向信息提高结合位点空间分辨率improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation)。单独使用或加对照( increases specificity of the peak calls)。

1. Remove redundancy

Why worry about duplicates? Reads with the same start position are considered duplicates. These duplicates can arise from experimental artifacts, but can also contribute to genuine ChIP-signal. (相同起点的reads被认为是duplicates,可能由于实验误差造成,也可能是ChIP信号)

  • 坏 duplicates: If initial starting material is low this can lead to overamplification of this material before sequencing. Any biases in PCR will compound this problem and can lead to artificially enriched regions. Also blacklisted (repeat) regions with ultra high signal will also be high in duplicates. Masking these regions prior to analysis can help remove this problem。 实验材料量少,过度扩增,PCR偏差,人为富集区域。blacklist(重复)区域?怎么获得,屏蔽该区域。
  • 好 duplicates: You can expect some biological duplicates with ChIP-seq since you are only sequencing a small part of the genome. This number can increase if your depth of coverage is excessive or if your protein only binds to few sites. If there are a good proportion of biological dupicates, removal can lead to an underestimation of the ChIP signal. 有ChIP-seq意义的生物学意义duplicates ,测序基因组小部分。覆盖深度过多,蛋白质与少数位点结合,duplicates会增加。去除此类会导致对ChIP信号的低估。
  • Consider your enrichment efficiency and sequencing depth. Try to discriminate via genome browser of your non-deduplicated data. Bona fide peaks will have multiple overlapping reads with offsets, while samples with only PCR duplicates will stack up perfectly without offsets. A possible solution to distinguishing biological duplicate from PCR artifact would be to include UMIs into your experimental setup. 考虑富集效率测序深度。通过基因组浏览器区别真正的峰会有多个重叠reads和偏移量PCR重复没有偏移量。如何区分:实验设计考虑UMIs(UMI将被用于合并PCR复制物)。
  • Retain duplicates for differential binding analysis. 保留duplicates 用于差异结合分析 why?retain和keep区别?
  • If you are expecting binding in repetitive regions, use paired-end sequencing and keep duplicates. 研究重复区域,使用pariedend测序并保持重复?
  • call peak之前就remove dup

2.0 Shift size d

  • 真实结合位点周围的tag density应显示双峰富集(成对峰)。MACS利用这种双峰模式 empirically model the shifting size,精确定位结合位点。
  • 为了找到成对峰构建模型,MACS首先扫描整个数据集,寻找高度显著富集区域。只利用ChIP样本,给定超声大小sonication size(bandwidth)和high-confidence fold-enrichment(mfold), MACS slides two bandwidth windows across the genome to find regions with tags more than mfold enriched relative to a random tag genome distribution.
  • MACS随机抽取1000个高质量峰分离正链和负链标签,aligns them by the midpoint between their centers。The distance between the modes of the two peaks in the alignment is defined as ‘d’ and represents the estimated fragment length. MACS shifts all the tags by d/2 toward the 3’ ends to the most likely protein-DNA interaction sites.

2. Select 1000 regions with a 10- to 30-fold enrichment relative to the genome background

3. Build model and estimate DNA fragment size d

4. Shift reads toward 3’end by d

5. Scale two libraries(if have control)

For experiments in which sequence depth differs between input and treatment samples, MACS linearly scales the total control tag count to be the same as the total ChIP tag count. The default behaviour is for the larger sample to be scaled down.

6.0 Effective genome length

To calculate λBG from tag count, MAC2 requires the effective genome size or the size of the genome that is mappable. Mappability is related to the uniqueness of the k-mers at a particular position the genome. Low-complexity and repetitive regions have low uniqueness, which means low mappability. Therefore we need to provide the effective genome length to correct for the loss of true signals in low-mappable regions. 提供有效基因组长度,以纠正低定位区域真实信号丢失

如何获得?The MACS2 software has some pre-computed values for commonly used organisms (human, mouse, worm and fly,rice?). more accurate values? The deepTools docs has additional pre-computed values for more recent builds but also has some good materials on how to go about computing it.

6. Call candidate peaks relative to genome background

After MACS shifts every tag by d/2, it then slides across the genome using a window size of 2d to find candidate peaks. The tag distribution along the genome can be modeled by a Poisson distribution. The Poisson is a one parameter model, where the parameter λ is the expected number of reads in that window.

7. Calculate dynamic λ for candidate peaks


泊松分布的参数λ是单位时间(或单位面积)内随机事件的平均发生次数。 泊松分布的期望和方差均为λ.
Instead of using a uniform λ estimated from the whole genome, MACS uses a dynamic parameter, λlocal, defined for each candidate peak. The lambda parameter is estimated from the control sample and is deduced by taking the maximum value across various window sizes:

λlocal = max(λBG, λ1k, λ5k, λ10k).

In this way lambda captures the influence of local biases, and is robust against occasional low tag counts at small local regions. Possible sources for these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation. 通过使用动态lambda捕获局部偏差的影响,并且对小局部区域中偶尔出现的low tag counts表现较好。偏差可能来源包括局部染色质结构、DNA扩增和测序偏差以及基因组拷贝数变化。

A region is considered to have a significant tag enrichment if the p-value < 10e-5 (可设置). This is a Poisson distribution p-value based on λ.

Overlapping enriched peaks are merged, and each tag position is extended ‘d’ bases? from its center. The location in the peak with the highest fragment pileup堆积, hereafter referred to as the summit峰顶, is predicted as the precise binding location. The ratio between the ChIP-seq tag count and λlocal? is reported as the fold enrichment.

8. Calculate P value and filter candidate peaks

9. Calculate FDR by exchanging treatment and control

Each peak is considered an independent test and thus, when we encounter thousands of significant peaks detected in a sample we have a multiple testing problem. In MACSv1.4, the FDR was determined empirically by exchanging the ChIP and control samples. However, in MACS2, p-values are now corrected for multiple comparison using the Benjamini-Hochberg correction.

MACS2 参数

1. Input file options

  • -t : The IP data file (this is the only REQUIRED parameter for MACS) 实验组
  • -c : Control or mock data 对照
  • -f : 输入文件格式,默认值“AUTO” ,bam sam bed
  • -g : mappable genome size which is defined as the genome size which can be sequenced; some precompiled values provided.
  • hs: 2.7e9人类基因组有效大小(UCSC human hg18 assembly)

2. Output arguments

  • -outdir : 输出文件夹
  • -n : 文件前缀
  • -B/--bdg : store the fragment pileup, control lambda, -log10pvalue and -log10qvalue scores in bedGraph files 输出bedgraph格式的文件

3. Shifting model arguments

  • -s : size of sequencing tags. Default, MACS will use the first 10 sequences from your input treatment file to determine it
  • --bw : The bandwidth which is used to scan the genome ONLY for model building. Can be set to the expected sonication fragment size.
  • --mfold : upper and lower limit for model building
  • --nomodel和extsize、shift是配套使用,有这个参数才可设置extsize和shift。
  • --extsize:当设置了nomodel时,MACS会用–extsize这个参数从5'->3'方向扩展reads修复fragments。如转录因子结合范围200bp,设置这个参数是200。
  • --shift:当设置了–nomodel,MACS用这个参数从5' 端移动剪切,然后用–extsize延伸,如果–shift是负值表示从3’端方向移动。建议ChIP-seq数据集这个值保持默认值为0?,对于检测富集剪切位点如DNAseq数据集设置为EXTSIZE的一半。
想找富集剪切位点,如DNAse-seq,所有5'端的序列reads应该从两个方向延伸,如果想设置移动的窗口是200bp,参数设置如下:
--nomodel --shift -100 --extsize 200
对nucleosome-seq数据,用核小体大小的一半进行extsize,所以参数设置如下:
--nomodel --shift 37 --extsize 73

ATAC-seq关心的是在哪切断,断点才是peak的中心,所以使用shift模型,–shift -75或-100
对人细胞系ATAC-seq 数据call peak的参数设置:

macs2 callpeak -t H1hesc.final.bam -n sample --shift -100 --extsize 200 --nomodel -B --SPMR -g hs --outdir Macs2_out 2> sample.macs2.log

4. Peak calling arguments

  • -q : q-value (minimum FDR) cutoff q value默认值是0.05,与pvalue不能同时使用。
  • -p : p-value cutoff (instead of q-value cutoff)
  • --nolambda : do not consider the local bias/lambda at peak candidate regions。不要考虑在峰值候选区域的局部偏差/λ
  • --broad : broad peak calling,narrow peak和broad peak

Relaxing the q-value does not behave as expected in this case since it is partially tied to peak widths. Ideally, if you relaxed the thresholds, you would simply get more peaks but with MACS2 relaxing thresholds also results in wider peaks. q值与峰宽有一定的联系。理想情况下,如果放宽阈值,您将简单地获得更多的峰值,但是使用MACS2放松阈值也会导致更宽的峰值。

macs2 callpeak -t bowtie2/H1hesc_Nanog_Rep1_aln.bam \-c bowtie2/H1hesc_Input_Rep1_aln.bam \-f BAM -g 1.3e+8 \-n Nanog-rep1 \--outdir macs2 2> macs2/Nanog-rep1-macs2.log

MACS2 Output files

narrowPeak

A narrowPeak (.narrowPeak) file is used by the ENCODE project to provide called peaks of signal enrichment based on pooled, normalized (interpreted) data. It is a BED 6+4 format, which means the first 6 columns of a standard BED file with 4 additional fields:

WIG format

Wiggle format (WIG) allows the display of continuous-valued data in a track format. Wiggle format is line-oriented. It is composed of declaration lines and data lines, and require a separate wiggle track definition line. There are two options for formatting wiggle data: variableStep and fixedStep. These formats were developed to allow the file to be written as compactly as possible.

BedGraph format

The BedGraph format also allows display of continuous-valued data in track format. This display type is useful for probability scores and transcriptome data. This track type is similar to the wiggle (WIG) format, but unlike the wiggle format, data exported in the bedGraph format are preserved in their original state. For the purposes of visualization, these can be interchangeable.

  • _peaks.narrowPeak: BED6+4 format file which contains the peak locations together with peak summit, pvalue and qvalue。BED6+4格式,可以上传到UCSC浏览。

    • 1:染色体号
    • 2:peak起始位点
    • 3:结束位点
    • 4:peak name
    • 5:int(-10*log10qvalue)
    • 6 :正负链
    • 7:fold change
    • 8:-log10pvalue
    • 9:-log10qvalue
    • 10:relative summit position to peak start(?)
  • _peaks.xls: a tabular file which contains information about called peaks. Additional information includes pileup and fold enrichment。含peak信息的tab分割的文件,前几行显示callpeak命令

    • 染色体号
    • peak起始位点
    • peak结束位点
    • peak区域长度
    • peak的峰值位点(summit position)
    • peak 峰值的高度(pileup height at peak summit, -log10(pvalue) for the peak summit)
    • peak的富集倍数(相对于random Poisson distribution with local lambda)
    • XLS里的坐标和bed格式的坐标还不一样,起始坐标需要减1才与narrowPeak的起始坐标一样。
  • _summits.bed: peak summits locations for every peak. To find the motifs at the binding sites, this file is recommended。BED格式的文件,包含peak的summits位置,第5列是-log10pvalue。如果想找motif,推荐使用此文件。

  • _model.R: an R script which you can use to produce a PDF image about the model based on your data and cross-correlation plot

  • _control_lambda.bdg: bedGraph format for input sample

  • _treat_pileup.bdg: bedGraph format for treatment sample。bedGraph格式,可以导入UCSC或者转换为bigwig格式。两种bfg文件:treat_pileup, and control_lambda.

  • NAME_peaks.broadPeak: BED6+3格式与narrowPeak类似,只是没有第10列。

R作图the first plot illustrates the distance between the modes from which the shift size was determined?

The second plot is the cross-correlation plot. This is a graphical representation of the Pearson correlation of positive- and negative- strand tag densities, shifting the strands relative to each other by increasing distance?

xls文件
文件包含信息还是比较多的,和narrowPeak唯一不同的是peak的起始位置需要减1才是bed格式的文件,另外还包含fold_enrichment 和narrowPeak的fold change 对应,-log10pvalue,-log10qvalue,peak长度,peak 峰值位置等。
narrowPeak文件
和xls文件信息类似
summits.bed文件
包含峰的位置信息和-log10pvalue
bdg文件
bdg文件适合导入UCSC或IGV进行谱图可视化,或者转换为bigwig格式再进行可视化。

hbctraining-05_peak_calling_macs2相关推荐

  1. ATAC-seq【Harvard FAS Informatics】

    ATAC-seq数据质量评估注意 ENCODE的ATACseq数据标准. Uniform Processing Pipeline Restrictions The read length prior ...

  2. ATAC-seq学习记录

    ATAC-seq意义 为何同样DNA序列的细胞的表型会不同,为何肝细胞是肝细胞,神经细胞是神经细胞?是什么造成了他们生产蛋白不同,决定蛋白生成的RNA不同呢?原因可以用表观遗传来解释. DNA转录成R ...

  3. 哈佛大学单细胞课程|笔记汇总(1-9)

    哈佛大学单细胞课程|笔记汇总 为什么做单细胞? 如何得到单细胞原始数据并转换成分析需要的矩阵格式 质控前的数据准备 质控 归一化和主成分分析 聚类分析与可视化 marker识别与注释 单细胞转录组测序 ...

  4. 哈佛大学单细胞课程|笔记汇总 (六)

    生物信息学习的正确姿势 NGS系列文章包括NGS基础.在线绘图.转录组分析 (Nature重磅综述|关于RNA-seq你想知道的全在这).ChIP-seq分析 (ChIP-seq基本分析流程).单细胞 ...

  5. 哈佛大学单细胞课程|笔记汇总 (三)

    生物信息学习的正确姿势 NGS系列文章包括NGS基础.在线绘图.转录组分析 (Nature重磅综述|关于RNA-seq你想知道的全在这).ChIP-seq分析 (ChIP-seq基本分析流程).单细胞 ...

  6. chip-seq三个生物学重复样品处理——IDR

    前言 ATAC-seq/ChIP-Seq中重复样本的处理 ATAC-Seq要求必须有2次或更多次生物学重复(十分珍贵或者稀有样本除外,但必须做至少2次技术重复).理论上重复样本的peaks应该有高度的 ...

  7. 单细胞分析:归一化和回归(八)

    导读 现在有了高质量的细胞,首先探索数据并确定任何不需要的变异来源.然后需要对数据进行归一化,计算方差并回归任何对数据有影响的协变量. 1. 学习目标 学会如何执行归一化,方差估计,鉴定易变基因 2. ...

  8. 送书 | 哈佛大学单细胞课程:笔记汇总前篇

    经典赏析 NGS系列文章包括NGS基础.转录组分析 (Nature重磅综述|关于RNA-seq你想知道的全在这).ChIP-seq分析 (ChIP-seq基本分析流程).单细胞测序分析 (重磅综述:三 ...

  9. 生信软件 | STAR(测序序列与参考序列比对)

    文章目录 零.介绍 一.安装 二.使用 1.建立索引 2.STAR 比对 三.原理 聚类.拼接和评分 零.介绍 STAR (Spliced Transcripts Alignment to a Ref ...

  10. 单细胞分析:marker鉴定(11)

    导读 前面我们已经确定了我们想要的簇,我们可以继续进行标记识别,这将使我们能够验证某些簇的身份并帮助推测任何未知簇的身份. 1. 学习目标 学会确定单个簇的marker 学会在聚类和marker识别间 ...

最新文章

  1. beats x连android手机吗,beats x 能连安卓手机吗?
  2. 独家 | 2018年Analytics Vidhya上最受欢迎的15篇数据科学和机器学习文章
  3. core和node开发小程序_成都小程序开发:微信小程序开发要多少钱?
  4. 关于Linux系统中用户权限问题
  5. 数风·数林 | 炉石传说中的概率(声控篇)
  6. Diverse Team(CF-988A)
  7. EPCS 无法配置FPGA的解决方法以及JTAG、AS调试总结
  8. react生命周期(自己的方式理解)
  9. GO语言学习之路25
  10. 使用 SqlDataSource 控件查询数据47
  11. 从来没见过这么多的资源~~好好找找吧,一定有你需要的
  12. HTML5期末大作业:我的家乡网站设计——我的家乡-杭州(7页) HTML+CSS+JavaScript 大学生家乡网页作品 老家网页设计作业模板 学生网页制作源代码下载
  13. 企业SOA平台 JBoss SOA
  14. 【机器学习】偏差-方差分解
  15. matlab solve和subs,【MATLAB】matlab中的subs()函数和solve()函数用法
  16. ffmpeg视频滤镜中英文对照
  17. api数据接口文档_接口文档示例(Taobao/jd/pinduoduo/开放接口调用)
  18. web前端-微信小程序开发学习
  19. AFNetworking的使用
  20. ArcEngine创建平头缓冲区的方法

热门文章

  1. ISE中使用Notepad++的关联设置以及Notepad++的护眼设置(设置背景色)
  2. java笔记 -- java运算
  3. 转载----Python的zip()函数
  4. 【转载】Unix编程艺术——Unix哲学
  5. 仿iOS中图标的抖动
  6. zqgame《每日一言》
  7. WCF分布式开发步步为赢(1):WCF分布式框架基础概念
  8. 爱因斯坦谜题解答(三种算法比较)
  9. FPGA锁存器、触发器、寄存器和缓冲器的区别
  10. Linux系统无线鼠标不能用,手把手教你win7系统无线鼠标不能用的处理方案