转录组分析---Hisat2+StringTie+Ballgown使用

(2016-10-10 08:14:45)

转载▼

标签:

生物信息学

转录组

 
1.Hisat2建立基因组索引:

First, using the python scripts included in the HISAT2 package, extract splice-site and exon information from the gene
annotation file:
$ extract_splice_sites.py gemome.gtf >genome.ss#得到剪接位点信息
$ extract_exons.py genome.gtf >genome.exon#得到外显子信息
Second, build a HISAT2 index:
$ hisat2-build --ss genome.ss --exon genome.exon genome.fa genome

备注extract_splice_sites.py 和 extract_exons.py 在hisat2软件包中涵盖了,这两步不是必须的,只是为了发现剪切位点,也可以直接:
$ hisat2-build  genome.fa genome
2. 利用hisat2比对到基因组:

 
hisat2 -p 8 --dta -x genome -1 file1_1.fastq.gz -2 file1_2.fastq.gz -S file1.sam
hisat2 -p 8 --dta -x chrX_data/indexes/chrX_tran -1 file2_1.fastq.gz -2 file2_2.fastq.gz -S file2.sam
备注:--dta:输出转录组型的报告文件
-x:基因组索引
-S : 输出sam文件
-p: 线程数
其他参数:
Input:
-q                 query input files are FASTQ .fq/.fastq (default)
--qseq             query input files are in Illumina's qseq format
-f                 query input files are (multi-)FASTA .fa/.mfa
-r                 query input files are raw one-sequence-per-line
-c                 , , are sequences themselves, not files
-s/--skip    skip the first reads/pairs in the input (none)
-u/--upto    stop after first reads/pairs (no limit)
-5/--trim5   trim bases from 5'/left end of reads (0)
-3/--trim3   trim bases from 3'/right end of reads (0)
--phred33          qualities are Phred+33 (default)
--phred64          qualities are Phred+64
--int-quals        qualities encoded as space-delimited integers
Alignment:
-N           max # mismatches in seed alignment; can be 0 or 1 (0)
-L           length of seed substrings; must be >3, <32 (22)
-i          interval between seed substrings w/r/t read len (S,1,1.15)
--n-ceil    func for max # non-A/C/G/Ts permitted in aln (L,0,0.15)
--dpad       include extra ref chars on sides of DP table (15)
--gbar       disallow gaps within nucs of read extremes (4)
--ignore-quals     treat all quality values as 30 on Phred scale (off)
--nofw             do not align forward (original) version of read (off)
--norc             do not align reverse-complement version of read (off)
Spliced Alignment:
--pen-cansplice              penalty for a canonical splice site (0)
--pen-noncansplice           penalty for a non-canonical splice site (12)
--pen-canintronlen          penalty for long introns (G,-8,1) with canonical splice sites
--pen-noncanintronlen       penalty for long introns (G,-8,1) with noncanonical splice sites
--min-intronlen              minimum intron length (20)
--max-intronlen              maximum intron length (500000)
--known-splicesite-infile   provide a list of known splice sites
--novel-splicesite-outfile  report a list of splice sites
--novel-splicesite-infile   provide a list of novel splice sites
--no-temp-splicesite               disable the use of splice sites found
--no-spliced-alignment             disable spliced alignment
--rna-strandness          Specify strand-specific information (unstranded)
--tmo                              Reports only those alignments within known transcriptome
--dta                              Reports alignments tailored for transcript assemblers
--dta-cufflinks                    Reports alignments tailored specifically for cufflinks
Scoring:
--ma         match bonus (0 for --end-to-end, 2 for --local)
--mp ,   max and min penalties for mismatch; lower qual = lower penalty <2,6>
--sp ,   max and min penalties for soft-clipping; lower qual = lower penalty <1,2>
--np         penalty for non-A/C/G/Ts in read/ref (1)
--rdg ,  read gap open, extend penalties (5,3)
--rfg ,  reference gap open, extend penalties (5,3)
--score-min min acceptable alignment score w/r/t read length
(L,0.0,-0.2)
Reporting:
(default)          look for multiple alignments, report best, with MAPQ
OR
-k           report up to alns per read; MAPQ not meaningful
OR
-a/--all           report all alignments; very slow, MAPQ not meaningful
Effort:
-D           give up extending after failed extends in a row (15)
-R           for reads w/ repetitive seeds, try sets of seeds (2)
Paired-end:
--fr/--rf/--ff     -1, -2 mates align fw/rev, rev/fw, fw/fw (--fr)
--no-mixed         suppress unpaired alignments for paired reads
--no-discordant    suppress discordant alignments for paired reads
Output:
-t/--time          print wall-clock time taken by search phases
--un           write unpaired reads that didn't align to
--al           write unpaired reads that aligned at least once to
--un-conc      write pairs that didn't align concordantly to
--al-conc      write pairs that aligned concordantly at least once to
(Note: for --un, --al, --un-conc, or --al-conc, add '-gz' to the option name, e.g.
--un-gz , to gzip compress output, or add '-bz2' to bzip2 compress output.)
--quiet            print nothing to stderr except serious errors
--met-file  send metrics to file at (off)
--met-stderr       send metrics to stderr (off)
--met        report internal counters & metrics every secs (1)
--no-head          supppress header lines, i.e. lines starting with @
--no-sq            supppress @SQ header lines
--rg-id     set read group id, reflected in @RG line and RG:Z: opt field
--rg        add ("lab:value") to @RG line of SAM header.
Note: @RG line only printed when --rg-id is set.
--omit-sec-seq     put '*' in SEQ and QUAL fields for secondary alignments.
Performance:
-o/--offrate override offrate of index; must be >= index's offrate
-p/--threads number of alignment threads to launch (1)
--reorder          force SAM output order to match order of input reads
--mm               use memory-mapped I/O for index; many 'bowtie's can share
Other:
--qc-filter        filter out reads that are bad according to QSEQ filter
--seed       seed for random number generator (0)
--non-deterministic seed rand. gen. arbitrarily instead of using read attributes
--version          print version information and quit
-h/--help          print this usage message

3.  将sam文件sort并转化成bam:
 
$ samtools sort -@ 8 -o file1.bam file1.sam

$ samtools sort -@ 8 -o file2.bam file2.sam
 
4. 组装转录本:
 
$ stringtie -p 8 -G genome.gtf -o file1.gtf –l file1 file1.bam
$ stringtie -p 8 -G genome.gtf -o file2.gtf –l file2 file2.bam
lncRNA (-f 0.01 -a 10 -j 1 -c 0.01)
其中:
-G reference annotation to use for guiding the assembly process (GTF/GFF3)
-l name prefix for output transcripts (default: STRG)
-f minimum isoform fraction (default: 0.1)
-m minimum assembled transcript length (default: 200)
-o output path/file name for the assembled transcripts GTF (default: stdout)
-a minimum anchor length for junctions (default: 10)
-j minimum junction coverage (default: 1)
-t disable trimming of predicted transcripts based on coverage
(default: coverage trimming is enabled)
-c minimum reads per bp coverage to consider for transcript assembly
(default: 2.5)
-v verbose (log bundle processing details)
-g gap between read mappings triggering a new bundle (default: 50)
-C output a file with reference transcripts that are covered by reads
-M fraction of bundle allowed to be covered by multi-hit reads (default:0.95)
-p number of threads (CPUs) to use (default: 1)
-A gene abundance estimation output file
-B enable output of Ballgown table files which will be created in the
same directory as the output GTF (requires -G, -o recommended)
-b enable output of Ballgown table files but these files will be
created under the directory path given as
-e only estimate the abundance of given reference transcripts (requires -G)
-x do not assemble any transcripts on the given reference sequence(s)
-h print this usage message and exit

5. 合并所有样本的gtf文件
 
$ stringtie --merge -p 8 -G genome.gtf -o stringtie_merged.gtf mergelist.txt
 
6. 新转录本的注释(lncRNA必备,普通转录组忽略)
gffcompare –r genomegtf –G –o merged stringtie_merged.gtf
备注:gffcompare 是独立软件,下载地址http://ccb.jhu.edu/software/stringtie/gff.shtml,结果如下;
= Predicted transcript has exactly the same introns as the reference transcript
c Predicted transcript is contained within the reference transcript
j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript
e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment
i Predicted transcript falls entirely within a reference intron
o Exon of predicted transcript overlaps a reference transcript
p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment)
r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence
u Predicted transcript is intergenic in comparison with known reference transcripts
x Exon of predicted transcript overlaps reference but lies on the opposite strand
s Intron of predicted transcript overlaps a reference intron on the opposite strand

7. 转录本定量和下游ballgown软件原始文件构建:
 
$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file1/file1.gtf file1.bam
$ stringtie –e –B -p 8 -G stringtie_merged.gtf -o ballgown/file2/file2.gtf file2.bam
8. Ballgown差异表达分析:
 
>library(ballgown)
>library(RSkittleBrewer)
>library(genefilter)
>library(dplyr)
>library(devtools)

>pheno_data = read.csv("geuvadis_phenodata.csv")#读取表型数据
>bg_chrX = ballgown(dataDir = "ballgown", samplePattern = "file", pData=pheno_data)#读取表达量
>bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) >1",genomesubset=TRUE)#过滤低表达量基因
>results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars =c("population"), getFC=TRUE, meas="FPKM")#差异表达分析,运用的是一般线性模型,比较组sex,影响因素:population
>results_genes = stattest(bg_chrX_filt, feature="gene",covariate="sex", adjustvars = c("population"), getFC=TRUE,meas="FPKM")#基因差异表达
>results_transcripts=data.frame(geneNames=ballgown::geneNames(bg_chrX_filt),geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)#增加基因名字,id
>results_transcripts = arrange(results_transcripts,pval)#按pval sort
>results_genes = arrange(results_genes,pval)

>write.csv(results_transcripts, "chrX_transcript_results.csv",
row.names=FALSE)
>write.csv(results_genes, "chrX_gene_results.csv",
row.names=FALSE)

>subset(results_transcripts,results_transcripts$qval<0.05)
>subset(results_genes,results_genes$qval<0.05)

9. 结果可视化:
 
>tropical= c('darkorange', 'dodgerblue',
'hotpink', 'limegreen', 'yellow')
>palette(tropical)

>fpkm = texpr(bg_chrX,meas="FPKM")
>fpkm = log2(fpkm+1)
>boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')

>ballgown::transcriptNames(bg_chrX)[12]
## 12
## "NM_012227"
>ballgown::geneNames(bg_chrX)[12]
## 12
## "GTPBP6"
>plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2),
main=paste(ballgown::geneNames(bg_chrX)[12],' : ',
ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex",
ylab='log2(FPKM+1)')
>points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)),
col=as.numeric(pheno_data$sex))

>plotTranscripts(ballgown::geneIDs(bg_chrX)[1729], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))
>plotMeans('MSTRG.56', bg_chrX_filt,groupvar="sex",legend=FALSE)

转载于:https://www.cnblogs.com/wangprince2017/p/9937579.html

转录组分析---Hisat2+StringTie+Ballgown使用相关推荐

  1. HISAT2+STRINGTIE+BALLGOWN 分析转录组数据

    师兄推荐这篇文章,按照里面的命令,先做一套转录组分析. 参考文献: Pertea M, Kim D,Pertea G M, et al. Transcript-level expression ana ...

  2. HISAT2+StringTie+Ballgown安装及使用流程

    HISAT2+StringTie+Ballgown安装及使用流程 2015年Nature Methods上面发表了一款快速比对工具hisat,作为接替tophat和bowtie的比对工具,它具有更快的 ...

  3. 转录组分析_转录组分析 | 使用Stringtie对数据进行下游处理

    TCGA | GEO | 文献阅读 | 数据库 | 理论知识 R语言 | Bioconductor | 服务器与Linux 接前文: 转录组分析 | fastqc进行质控与结果解读 转录组分析 | 使 ...

  4. 39个工具,120种组合深度评估 (转录组分析工具哪家强)

    前言 NGS系列文章包括NGS基础.转录组分析 (Nature重磅综述|关于RNA-seq你想知道的全在这).ChIP-seq分析 (ChIP-seq基本分析流程).单细胞测序分析 (重磅综述:三万字 ...

  5. HISAT2 - StringTie - DESeq2 pipeline 进行bulk RNA-seq

    软件官网: Hisat2: Manual | HISAT2 StringTie:StringTie 文章:Transcript-level expression analysis of RNA-seq ...

  6. 在Windows下完成转录组分析

    纯粹就是记录下如何在Window下少写代码完成RNA-seq,并没有详细讲解转录组分析的原理和每一步背后的含义.想深入了解可以翻生信技能树,生信宝典,组学大讲堂等等公众号.本文里的命令都是用power ...

  7. 高级转录组分析和R语言数据可视化第十三期 (线上线下同时开课)

    " 福利公告:为了响应学员的学习需求,经过易生信培训团队的讨论筹备,现决定安排扩增子16S分析.宏基因组.Python课程线上直播课.报名参加线上直播课的老师可在1年内选择参加同课程的一次线 ...

  8. 高级转录组分析和R语言数据可视化第12期 (线上线下同时开课)

    " 福利公告:为了响应学员的学习需求,经过易生信培训团队的讨论筹备,现决定安排扩增子16S分析.宏基因组.Python课程线上直播课.报名参加线上直播课的老师可在1年内选择参加同课程的一次线 ...

  9. 最后1周 | 高级转录组分析和R语言数据可视化第十一期 (报名线上课还可免费参加线下课)...

    " 福利公告:为了响应学员的学习需求,经过易生信培训团队的讨论筹备,现决定安排扩增子16S分析.宏基因组.Python课程线上直播课.报名参加线上直播课的老师可在1年内选择参加同课程的一次线 ...

最新文章

  1. 静态链表实现(A-B)+(B-A)【代码】
  2. 计算机教育中缺失的一课 · the missing semester of your cs education
  3. c#连接oracle11,C#连接Oracle 11g 无需安装Oracle客户端
  4. 洛谷 P4151 BZOJ 2115 [WC2011]最大XOR和路径
  5. 用户NT AUTHORITYNETWORK SERVICE登录失败解决方法
  6. 真正的取真实IP地址及利弊Asp.net
  7. 核心概念——节点/边/Combo——内置节点——Triangle
  8. 你以为AlphaGo只是下围棋厉害?不,它还能用来优化金融交易策略参数
  9. Matlab策略模式
  10. 利用nginx集群式部署服务器中,数据同步问题
  11. 新来的老大说,“公司以后禁止使用Lombok”,我表示反对~
  12. python之通过thread来实现多进程
  13. 随机变量的均值与样本的平均值有何区别
  14. 数理统计SPSS软件实验报告一--描述性统计
  15. tomcat配置ssl证书
  16. 七夕活动主题html邮件,七夕节活动策划方案,七夕创意活动主题
  17. 计算机xp系统ie8,WinXP系统IE8安装失败的解决方法
  18. zTree简单暴力修改图标样式
  19. PD3.1 140W双C快充解决方案
  20. 【转载】破解物联网落地困境-阿里云硬件接入最佳实践

热门文章

  1. 从头开始学习->JVM(八):运行时数据区(下)
  2. 技嘉的UEFI修复windows与Ubuntu双系统引导+老毛桃修复引导失败+No EFI system partition was found.
  3. 武汉java开发工资一般多少_武汉Java开发工资是否还会增长?工资为什么那么高?...
  4. Python爬取58同城租房数据,破解字体加密
  5. 'font:12px/1.5 Tahoma' 其中12px/1.5表示什么
  6. 创业公司期权如何运作
  7. php怎么插入图层,PS制作-把图片添加到图层的4种方法
  8. 苹果手机怎么在照片上添加文字_手机照片如何添加文字?原来方法这么简单,手把手教你学会。...
  9. 证明HITTING SET 是NP完全
  10. ValueError: Cannot have number of splits n_splits=3 greater than the number of samples: 1