2020.12.10 | Cufflinks: Introduction and Usage Notes

I. Introduction

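The Cufflinks suite consists of several main programs: cufflinks, cuffmerge, cuffcompare, and cuffdiff. It is mainly used to estimate gene expression levels and to find differentially expressed genes.

The cufflinks program takes TopHat alignments and, with or without a reference GTF annotation, computes the FPKM of the isoforms of each gene and writes the assembled transcriptome to transcripts.gtf.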

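Notes:

1. Fragment length estimation: for paired-end data, Cufflinks estimates the fragment length distribution from the alignments with its own algorithm. For single-end data, it assumes a Gaussian distribution by default, or you can supply the parameters yourself (see -m and -s below).

2. Multi-mapped reads: by default, a read that maps to 10 positions contributes 10% to each position.

3. Assembling bacterial transcriptomes with Cufflinks is generally not recommended; Glimmer is suggested instead. If an annotation file is available, however, cufflinks and cuffdiff can still be used to measure expression and differential expression.

4. cufflinks/cuffdiff cannot compute FPKM values for individual exons or splicing events.

5. To analyze time-series data with cuffdiff, use the -T (--time-series) option.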

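6. A run may reach 99% and then appear to hang. Cufflinks and Cuffdiff need much more CPU time for loci with very many aligned reads, and these loci are processed only after all other loci are done. You can pass a mask file with -M to filter out rRNA or mitochondrial RNA.

7. If cufflinks or cuffdiff crashes with a 'bad_alloc' error, or takes a very long time to finish, the machine has run out of memory while assembling or quantifying a highly expressed gene. Fix: lower --max-bundle-frags. Try 500000 first, and reduce it further if the error persists.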

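8. If cuffdiff reports FPKM = 0 for every gene and transcript, the chromosome names in the GTF do not match those in the BAM file.

9. Limitations of cuffdiff and cufflinks: some reported genes and transcripts are false positives (causes: sequencing depth, sequencing quality, number of replicates, and annotation errors).

10. A large fold change does not by itself mean that a difference is significant: such genes often have many isoforms or few sequenced reads, that is, low overall expression. Genes that cuffdiff reports with large fold changes therefore carry uncertainty.

11. In the transcripts.gtf file produced by cufflinks, transcripts labeled CUFF are novel transcripts. Likewise, the CUFF label in the output of the other modules marks novel transcripts.

12. If you see the following error: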

You are using Cufflinks v2.2.1, which is the most recent release.
open: No such file or directory
File 30 doesn't appear to be a valid BAM file, trying SAM...
Error: cannot open alignment file 30 for reading
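this means that one of your options is malformed. For example, "--min-intron-length" was typed as "-min-intron-length", so its value (30 here) ended up being treated as the alignment file.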
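Note 8 (all FPKM values equal to 0) can be checked before a long run. The sketch below compares the sequence names in a GTF with the reference names in a SAM header. It is only an illustration: the helper names and the toy inputs are hypothetical, and for a BAM file the header text would first have to be dumped, for instance with `samtools view -H`.

```python
def gtf_seqnames(gtf_lines):
    """Collect sequence names (column 1) from GTF lines, skipping comments."""
    names = set()
    for line in gtf_lines:
        if line.strip() and not line.startswith("#"):
            names.add(line.split("\t", 1)[0])
    return names


def sam_header_seqnames(sam_lines):
    """Collect reference names from the SN: field of @SQ header lines."""
    names = set()
    for line in sam_lines:
        if not line.startswith("@"):
            break  # the header ends at the first alignment record
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


# Toy inputs (hypothetical): the annotation says "chr1", the alignments say "1".
gtf = ['chr1\tCufflinks\texon\t1\t1000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n']
sam = ["@HD\tVN:1.0\tSO:coordinate\n", "@SQ\tSN:1\tLN:248956422\n"]

shared = gtf_seqnames(gtf) & sam_header_seqnames(sam)
if not shared:
    print("No shared sequence names: expect FPKM = 0 everywhere")
```

An empty intersection, as in this toy case, is the situation described in note 8; renaming the sequences in one of the two files so that they agree is the usual remedy.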

II. Parameters

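1. General options

 -h | --help

 -o | --output-dir   default: ./
 Name of the output directory.

 -p | --num-threads   default: 1
 Number of CPU threads to use.

 -G | --GTF
 Supply a GTF/GFF annotation file and estimate isoform expression against it. No novel transcripts are assembled, and alignments that are incompatible with the reference transcripts are ignored.

 -g | --GTF-guide
 Supply a GTF/GFF annotation file to guide transcript assembly (RABT assembly). The output contains the reference transcripts as well as novel genes and isoforms.

 -M | --mask-file
 Supply a GTF/GFF file. Cufflinks ignores all reads that align to the transcripts in this file. The file usually holds rRNA annotations, and may also hold mitochondrial or other transcripts you want to exclude. Removing these unwanted RNAs improves the mRNA abundance estimates.

 -b | --frag-bias-correct
 Supply a FASTA file so that Cufflinks runs its bias detection and correction algorithm, which can markedly improve the accuracy of transcript abundance estimates.

 -u | --multi-read-correct
 Have Cufflinks run an initial estimation step so that reads mapping to multiple genomic loci are weighted more accurately.

 --library-type   default: fr-unstranded
 Tells Cufflinks the strandedness of the library. For strand-specific reads, the alignments carry an XS tag. Standard Illumina data is usually fr-unstranded.

 --library-norm-method
 See the official manual for details. Three methods are available:
 classic-fpkm   the default.
 geometric      the method used by DESeq.
 quartile       scales by the 75th percentile of the fragment counts instead of the total mapped count.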
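To show how the general options fit together, here is a small sketch that assembles a cufflinks command line as an argument list. It uses only the options described above; the function name and every file name except accepted_hits.bam (TopHat's usual output name) are hypothetical placeholders, and the command is printed rather than executed.

```python
import shlex


def build_cufflinks_command(bam, out_dir, threads=1, gtf=None, mask=None,
                            genome_fasta=None, multi_read_correct=False,
                            library_type="fr-unstranded"):
    """Build a cufflinks argument list from the general options above."""
    cmd = ["cufflinks", "-o", out_dir, "-p", str(threads),
           "--library-type", library_type]
    if gtf:
        cmd += ["-G", gtf]            # quantify against the annotation only
    if mask:
        cmd += ["-M", mask]           # ignore reads on rRNA / mitochondrial transcripts
    if genome_fasta:
        cmd += ["-b", genome_fasta]   # enable fragment bias correction
    if multi_read_correct:
        cmd.append("-u")              # better weighting of multi-mapped reads
    cmd.append(bam)                   # the alignment file comes last
    return cmd


cmd = build_cufflinks_command(
    "accepted_hits.bam", "cufflinks_out", threads=4,
    gtf="genes.gtf", mask="rRNA_mask.gtf",
    genome_fasta="genome.fa", multi_read_correct=True,
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd, check=True)` once the paths point at real files.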

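2. Abundance estimation options

 -m | --frag-len-mean   default: 200
 Mean fragment (insert) length. Cufflinks now learns the mean fragment length from the data, so setting this value by hand is not recommended.

 -s | --frag-len-std-dev   default: 80
 Standard deviation of the fragment length. Cufflinks now learns this value from the data as well, so setting it by hand is not recommended.

 -N | --upper-quartile-norm
 Normalize by the upper quartile (75th percentile) of the number of fragments mapping to individual loci instead of the total number of fragments. This helps when looking for differential expression among low-abundance genes and transcripts.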

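 --total-hits-norm   default: TRUE
 When computing FPKM, Cufflinks counts all mapped fragments in the denominator, including those that are not compatible with any reference transcript. This is the opposite of the next option and is on by default.

 --compatible-hits-norm
 When computing FPKM, Cufflinks counts only the fragments that are compatible with some reference transcript. This option is off by default; it only takes effect together with --GTF and has no effect in RABT or ab initio mode.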

--max-mle-iterations   default: 5000
Maximum number of iterations for the maximum likelihood estimation.

--max-bundle-frags   default: 1000000
Maximum number of fragments a locus may contain before it is skipped.

--no-effective-length-correction
Cufflinks will not employ its "effective" length normalization when computing transcript FPKM.

--no-length-correction
Cufflinks will not normalize fragment counts by transcript length at all. Use this when the fragment count is independent of the size of the features being quantified (e.g. for small RNA libraries, where no fragmentation takes place, or 3 prime end sequencing, where sampled RNA fragments are all essentially the same length). Use with caution.
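The difference between --total-hits-norm and --compatible-hits-norm is easiest to see in the FPKM formula itself, FPKM = 10^9 * C / (N * L), where C is the number of fragments assigned to a transcript, L is its length in bp, and N is the number of mapped fragments used as the denominator. The sketch below is a simplified illustration with made-up numbers: it uses the raw transcript length and ignores the effective-length and bias corrections that Cufflinks applies.

```python
def fpkm(fragments, length_bp, mapped_fragments):
    """Fragments Per Kilobase of exon model per Million mapped fragments."""
    return fragments * 1e9 / (length_bp * mapped_fragments)


# Made-up numbers for one transcript.
frags, length = 500, 2000
total_hits = 20_000_000        # every mapped fragment (--total-hits-norm)
compatible_hits = 16_000_000   # only fragments compatible with the reference
                               # annotation (--compatible-hits-norm)

print(fpkm(frags, length, total_hits))       # 12.5
print(fpkm(frags, length, compatible_hits))  # 15.625
```

Because the compatible-hits denominator is never larger than the total-hits one, the same transcript gets an equal or higher FPKM under --compatible-hits-norm, so values from the two modes should not be compared directly.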

3. Common assembly options

-L | --label   default: CUFF
Cufflinks reports transfrags in GTF format; this option sets the prefix used for their IDs.

-F/--min-isoform-fraction <0.0-1.0>   default: 0.1
After computing the isoform abundances for a gene, Cufflinks filters out very low-abundance transcripts, because they cannot be trusted. This also removes some exons supported by very few reads. The default is 0.1, i.e. 10% of the abundance of the most abundant isoform (the major isoform) of the gene.

-j/--pre-mrna-fraction <0.0-1.0>   default: 0.15
The minimum depth of coverage in the intronic region covered by the alignment is divided by the number of spliced reads; if the result is lower than this value, the intronic alignments are ignored.

-I/--max-intron-length   default: 300000
Maximum intron length. Cufflinks will not report transcripts with introns longer than this, and will ignore SAM alignments with REF_SKIP CIGAR operations longer than this.

-a/--junc-alpha <0.0-1.0>   default: 0.001
The alpha value for the binomial test used to filter false-positive spliced alignments.

-A/--small-anchor-fraction <0.0-1.0>   default: 0.09
Spliced reads with less than this fraction of their length on either side of a junction are considered suspicious and may be filtered out before assembly.

--min-frags-per-transfrag   default: 10
Assembled transfrags supported by fewer RNA-seq fragments than this are not reported.

--overhang-tolerance   default: 8
The number of bp by which a read or transcript may extend beyond an exon of a transcript and still be considered compatible with it. The default of 8 bp matches the bowtie/TopHat defaults.

--max-bundle-length   default: 3500000
Maximum genomic length allowed for a given bundle, in bp.

--min-intron-length   default: 50
Minimum intron size.

--trim-3-avgcov-thresh   default: 10
Minimum average coverage required to attempt 3' trimming.

--trim-3-dropoff-frac   default: 0.1
The fraction of average coverage below which to trim the 3' end of an assembled transcript.

--max-multiread-fraction <0.0-1.0>   default: 0.75
The fraction of a transfrag's supporting reads that may be multiply mapped to the genome. A transfrag in which more than this fraction of the reads are multireads is not reported by the assembler (75% multireads or more is suppressed).

--overlap-radius   default: 50
Transfrags separated by less than this distance are merged together.
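The effect of -F/--min-isoform-fraction described above can be illustrated with a few lines of Python. This is only a sketch of the rule as stated in the text (drop isoforms below a fraction of the major isoform's abundance), not Cufflinks's actual implementation; the function name, the abundances, and the choice to keep isoforms exactly at the cutoff are assumptions.

```python
def filter_minor_isoforms(isoform_abundance, min_isoform_fraction=0.1):
    """Keep isoforms whose abundance is at least the given fraction of the
    most abundant (major) isoform of the same gene."""
    if not isoform_abundance:
        return {}
    cutoff = max(isoform_abundance.values()) * min_isoform_fraction
    return {tid: ab for tid, ab in isoform_abundance.items() if ab >= cutoff}


# Made-up abundances for the isoforms of one gene.
gene = {"CUFF.1.1": 80.0, "CUFF.1.2": 12.0, "CUFF.1.3": 3.0}
kept = filter_minor_isoforms(gene)
print(sorted(kept))  # ['CUFF.1.1', 'CUFF.1.2']
```

With the default of 0.1 the cutoff here is 8.0, so CUFF.1.3 is dropped; lowering the fraction keeps more minor isoforms at the cost of more unreliable ones.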

Advanced Reference Annotation Based Transcript (RABT) Assembly Options: these apply when you use -g/--GTF-guide.

--3-overhang-tolerance   default: 600
The number of bp allowed to overhang the 3' end of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel).

--intron-overhang-tolerance   default: 50
The number of bp allowed to enter the intron of a reference transcript when determining if an assembled transcript should be merged with it (i.e., the assembled transcript is not novel).

--no-faux-reads
Disables tiling of the reference transcripts with faux reads. Use this if you only want to use sequencing reads in assembly but do not want to output assembled transcripts that lie within reference transcripts. All reference transcripts in the input annotation will also be included in the output.

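Other options (minor)

-v/--verbose   Print version information and other status messages.

-q/--quiet     Print nothing except warnings and errors.

--no-update-check   Turn off Cufflinks's automatic update check.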

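4. Cufflinks output

The input to cufflinks is a SAM or BAM file, and it must be sorted (The SAM file supplied to Cufflinks must be sorted by reference position.). The SAM/BAM output of TopHat is already sorted. An unsorted SAM file from another source can be sorted as follows:

sort -k 3,3 -k 4,4n hits.sam > hits.sam.sorted

1. transcripts.gtf
This file contains the isoforms assembled by Cufflinks. The first seven columns are standard GTF fields, and the last column holds the attributes. The meaning of each column: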

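Column  Name        Example     Description
1       seqname     chrX        Chromosome or contig name
2       source      Cufflinks   Name of the program that generated the file
3       type        exon        Record type, usually transcript or exon
4       start       1           Start position, 1-based
5       end         1000        End position
6       score       1000
7       strand      +           Strand of the reference that Cufflinks guesses the isoform comes from: '+', '-' or '.'
8       frame       .           Cufflinks does not predict start or stop codon frame positions
9       attributes  ...         See below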

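Each GTF record carries the following attributes:

Attribute          Example    Description
gene_id            CUFF.1     Cufflinks gene id
transcript_id      CUFF.1.1   Cufflinks transcript id
FPKM               101.267    Isoform-level abundance, in Fragments Per Kilobase of exon model per Million mapped fragments
frac               0.7647     Reserved field; ignore it, as it may be removed in the future
conf_lo            0.07       Lower bound of the 95% confidence interval of the isoform abundance: lower bound = FPKM * (1.0 - conf_lo)
conf_hi            0.1102     Upper bound of the 95% confidence interval of the isoform abundance: upper bound = FPKM * (1.0 + conf_hi)
cov                100.765    Estimated read coverage across the whole transcript
full_read_support  yes        With RABT assembly, reports whether all introns and exons are fully covered by reads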

2. isoforms.fpkm_tracking
FPKM estimates for isoforms (the individual transcripts of each gene).

3. genes.fpkm_tracking
FPKM estimates for genes.
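The attributes column of transcripts.gtf can be unpacked with a short script. The sketch below parses one record built from the example values in the tables above and applies the confidence-interval formulas exactly as this article states them. The helper names are hypothetical, and it is worth checking against your own file whether conf_lo and conf_hi are stored as fractions or as absolute FPKM values before relying on the result.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of attribute name -> string value."""
    return dict(ATTR_RE.findall(attr_field))


def fpkm_interval(attrs):
    """95% interval using the formulas from the attribute table above."""
    fpkm = float(attrs["FPKM"])
    lower = fpkm * (1.0 - float(attrs["conf_lo"]))
    upper = fpkm * (1.0 + float(attrs["conf_hi"]))
    return lower, upper


# One record assembled from the example values in the tables above.
line = ('chrX\tCufflinks\texon\t1\t1000\t1000\t+\t.\t'
        'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; '
        'frac "0.7647"; conf_lo "0.07"; conf_hi "0.1102"; cov "100.765";')

attrs = parse_attributes(line.split("\t")[8])
lower, upper = fpkm_interval(attrs)
print(attrs["transcript_id"], round(lower, 3), round(upper, 3))
```

For this record the interval comes out at roughly 94.178 to 112.427 FPKM around the point estimate of 101.267.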
