RNA-seq数据上游分析流程（从原始数据开始）

数据分析的基本思路

（1）从ncbi的geo或者其它数据库中查找自己感兴趣的RNASeq数据，至少要求给出如下信息：

（2）对芯片数据进行质量控制评价及处理（如果质量差的话，每个样本都应该处理），

可以用软件Fastqc+Trimmomatic配合使用,也可以用其它软件替换

（3）用TopHat2 + Cufflinks+Hisat系列软件进行分析

（4）用Hisat、StringTie和Ballgown系列软件进行分析

（5）（3）和（4）的结果都要进行差异表达分析和聚类分析

（6）比较（3）和（4）两种系列软件结果的差异表达分析和聚类分析

（一）数据集选择

1、根据筛选标准（是否为RNA-seq数据、样本量、样本分类清晰程度、是否有sra原始数据），到GEO数据库中查找数据集。

数据ID：GSE111845

数据来源：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111845

样本介绍：Standard methods were used for RNA-sequencing library construction, EXBead preparation, and Next-Generation sequencing, based on the protocol provided for the Life Technologies SOLiD4 system. Briefly, 2μg of total RNA per sample of 30 human liver samples (fetal, pediatric, and adult; n=10) were used for library preparation. The rRNA was depleted using RiboMinus Eukaryote Kit for RNA-Seq.

分类：成人、幼儿、胎儿的肝样本各十个。

sra accession number：SRP135703

一共下载了10个样本，其中成人、胎儿各5个。

SRR6835938	√	幼儿
SRR6835935	√	成人
SRR6835936	√	成人
SRR6835940	√	幼儿
SRR6835941	√	幼儿
SRR6835942	√	幼儿
SRR6835943	√	幼儿
SRR6835944	√	成人
SRR6835945	√	成人
SRR6835947	√	成人

（二）数据下载

【参考链接】https://www.jianshu.com/p/8dca09077df3

1、使用SRAtoolkit的prefetch命令下载SRA数据。

for i in `seq 35 47`;
do
prefetch SRR68359${i}
done

2、使用SRAtoolkit的fastq-dump将SRA数据解析成fasta文件。

由于sra文件是单端测序(此处具体参考sratoolkit工具的操作说明)，故而使用fastq-dump指令为：

fastq-dump  SRR6835947.sra

（三）质量控制和预处理

1、使用fastqc（软件的安装步骤自行百度）对读段数据进行质控分析。

fastqc  -f  fastq  -o  result  SRR6835935.out.fastq

fastqc的使用说明：

fastqc是用来评估测序reads质量的软件。

使用命令：fastqc -f fastq -o result reads.fq.gz

#-f 表示的是输入文件的类型

#-o 表示的是输出的目录（事先用mkdir创建了一个文件夹用于存放结果）

#reads.fq 表示的是被评估的测序数据

#在Linux平台敲入指令：
s45@HP45:~/Biosofts/FastQC$ fastqc -f fastq -o result SRR6835935.out.fastq #屏幕回显：
Started analysis of SRR6835935.out.fastq
Approx 5% complete for SRR6835935.out.fastq
Approx 10% complete for SRR6835935.out.fastq
Approx 15% complete for SRR6835935.out.fastq
Approx 20% complete for SRR6835935.out.fastq
Approx 25% complete for SRR6835935.out.fastq
Approx 30% complete for SRR6835935.out.fastq
Approx 35% complete for SRR6835935.out.fastq
Approx 40% complete for SRR6835935.out.fastq
Approx 45% complete for SRR6835935.out.fastq
Approx 50% complete for SRR6835935.out.fastq
Approx 55% complete for SRR6835935.out.fastq
Approx 60% complete for SRR6835935.out.fastq
Approx 65% complete for SRR6835935.out.fastq
Approx 70% complete for SRR6835935.out.fastq
Approx 75% complete for SRR6835935.out.fastq
Approx 80% complete for SRR6835935.out.fastq
Approx 85% complete for SRR6835935.out.fastq
Approx 90% complete for SRR6835935.out.fastq
Approx 95% complete for SRR6835935.out.fastq
Analysis complete for SRR6835935.out.fastq

2、质控分析结果

A	B

C	D

E	F

说明：这篇博客主要是介绍RNA-seq总的分析流程。关于fastqc所得到的这许多结果图的生物学意义，稍后会另外写一篇文章详细介绍。

3、使用Trimmomatic软件对读段数据进行过滤

java -jar trimmomatic-0.38.jar SE -phred33 ./RNA-seq-data/$ip ./OUTresult/ SRR6835935.out.fastq.gz ILLUMINACLIP:TruSeq3-SE:2:30:10 LEADING:3 TRAILING:3 HEADCROP:10 SLIDINGWINDOW:4:15 MINLEN:36

注明：本次操作我所使用的序列文件，在fastqc评估时评估的各项指标都较好，认为此处不需要对数据再进行过滤，使原始数据失真。各位在实际的操作的过程，应该根据上面那一步的fastqc的结果来进行判断是否trim，以及trim哪里，trim多少。

（四）将读段比对到参考基因组

【准备工作】

（1）人类基因组序列文件的下载

下载NCBI的genome数据库中下载人类基因组文件（GRCh37/hg19，根据自己的实际情况进行选择，选择错误的话，下游的操作会报错）

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.fna.gz

（2）人类基因组注释文件的下载

在GENECODE数据库中下载chr开头的gff3人类基因组注释文件。

https://www.gencodegenes.org/human/release_29.html

本次实验下载hg19类型的Comprehensive gene annotation（Regions:CHR）人类染色体的gff3格式的注释文件。

【方案一】

1、使用TopHat程序将读段比对到参考基因组

（1）构建参考索引

nohup bowtie2-build -f  human_genomic.fna human &

#nohup ……&是将程序挂在服务器后台运行的指令。具体可自行百度。

（2）将读段比对到基因组

for i in *fastq.gzdoecho $itophat2 -p 4 -o /home/s45/index/result/  /home/s45/index/human19   /home/s45/index/data/$idone

#同样地，循环指令。根据自己的实际情况，修改参数。

# 本次操作输出文件

1.accepted_hits.bam 包含比对，为bam格式，比对根据染色体坐标被排序。

2.junctions.bed 包含发现外显子结界，为bed格式。结界由两块组成，其中每个块与跨越该结界的任何读段的最大延伸量一样长。得分数是跨越结界的比对的数目。

3.insertions.bed 包含发现的插入。

4.deletion.bed 包含发现的缺失。

5.align_summary.txt 报告

【方案二】

2、使用HiSat程序将读段比对到参考基因组

（1）构建参考索引

hisat-build  human19.fa  human3

（2）将读段比对到基因组

for i in `seq 41 47`dohisat2 -t --dta -p 8 -x human3 -U ./data/SRR68359${i}.out.fastq.gz  -S ./result/SRR68359${i}.samdone

（五）转录组组装

【方案一】

1、使用cufflinks系列软件对读段进行组组装

（1）cufflinks组装

cufflinks -p 8 -o ./   35_accepted_hits.bam

参数说明：

#-p 指定线程数

#-o ：指定输出文件所在目录

#最后是Tophat比对后的结果文档bam文件

（2）cuffmerge转录本合并

cuffmerge -g genes.gtf  –s  ./genome.fa  -p 8   assemblies.txt

# assemblies.txt文件的具体内容（指明要比对的gtf文件的路径）

./35/ transcripts.gtf
./36/ transcripts.gtf
./38/ transcripts.gtf
./40/ transcripts.gtf
./41/ transcripts.gtf
./42/ transcripts.gtf
./43/ transcripts.gtf
./44/ transcripts.gtf
./45/ transcripts.gtf
./47/ transcripts.gtf

（3）cuffquant基因和转录本表达定量

cuffquant -o  sample_quant  -p  8  35_accepted_hits.bam

这些步骤生成的文件：

【方案二】

2、使用stringtie软件对读段进行组装

（1）使用samtools将HiSeq产生的.sam文件转换为.bam文件

（2）使用samtools将.bam文件进行排序

samtools sort SRR6835936.bam SRR6835936.sort

（3）使用stringtie对读段进行组装

stringtie -p 8 -G ./hg19_refSeq.gtf -o ./gtf/36.gtf - 36 SRR6835936.sort.bam
stringtie merge -p 8 -G ./genes/chr.gtf -o stringtie_merged.gtf ./mergelist.txt

# mergelist.txt的具体内容(类似于前面的某一步)：

./35.gtf
./36.gtf
./38.gtf
./40.gtf
./41.gtf
./42.gtf
./43.gtf
./44.gtf
./45.gtf
./47.gtf

stringtie -e -B -p 8 -G ./gtf/merged.gtf -o ./ballgown/35/365gtf  SRR6835935.sort.bam

#e2t.ctab 外显子和转录本间的关系

#e_data.ctab

#i2t.ctab 内含子和转录本间的关系

#i_data.ctab

#t_data.ctab 转录本的数据

（六）差异表达分析

【方案一】

1、使用Hiseq软件对TopHat比对结果进行定量

htseq-count -f bam /media/zxx/6476-C43F2/SRR6835936_accepted_hits.bam /media/zxx/6476-C43F2/hg19_refSeq.gtf >383_sample.count

2、使用R语言，将Hiseq定量结果，汇总成表达矩阵

setwd("D:/Rworkplace/HTSeq")
Adult1 <- read.table("D:/Rworkplace/HTSeq/35_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Adult2 <- read.table("D:/Rworkplace/HTSeq/36_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Pediatric1<-read.table("D:/Rworkplace/HTSeq/38_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Pediatric2<-read.table("D:/Rworkplace/HTSeq/40_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Pediatric3<-read.table("D:/Rworkplace/HTSeq/41_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Pediatric4<-read.table("D:/Rworkplace/HTSeq/42_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Pediatric5<-read.table("D:/Rworkplace/HTSeq/43_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Adult3<-read.table("D:/Rworkplace/HTSeq/44_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Adult4<-read.table("D:/Rworkplace/HTSeq/45_sample.count", sep="\t", col.names = c("gene_id","readsSum"))
Adult5<-read.table("D:/Rworkplace/HTSeq/47_sample.count", sep="\t", col.names = c("gene_id","readsSum"))myMatrix <- merge(merge(Adult1, Adult2,Adult3,Adult4,Adult5,by="gene_id"), merge(Pediatric1,Pediatric2,Pediatric3,Pediatric4,Pediatric5,by="gene_id"))Merge_func<-function(x,y){df <- merge(x,y,by="gene_id")rownames(df)<- df$Row.namesdf$Row.names<-NULLreturn(df)
}
Adult<- Reduce(Merge_func,list(Adult1, Adult2,Adult3,Adult4,Adult5))
Pediatric<-Reduce(Merge_func,list(Pediatric1,Pediatric2,Pediatric3,Pediatric4,Pediatric5))
myMatrix<-merge(Adult,Pediatric,by="gene_id")
colnames(myMatrix)<-c("gene_id","Adult_35","Adult_36","Adult_44","Adult_45","Adult_47","Pediatric_38","Pediatric_40","Pediatric_41","Pediatric_42","Pediatric_43")
ENSEMBL <- gsub("\\.\\d*", "",myMatrix$gene_id)
head(ENSEMBL, n=7)
myMatrixS<-cbind(ENSEMBL,myMatrix)
row.names(myMatrixS)<-myMatrixS[,1]
myMatrixR<-myMatrixS[,-c(1:2)]
myMatrixE<-myMatrixR[-c(1:5),]#myMatrixE就是最后的表达谱矩阵
write.table(myMatrixE,file="HTSeqmatrix.txt",col.names = T,row.names = T)

构建出的表达矩阵如下：

"Adult_35" "Adult_36" "Adult_44" "Adult_45" "Adult_47" "Pediatric_38" "Pediatric_40" "Pediatric_41" "Pediatric_42" "Pediatric_43"

"ENSG00000000003" 5 0 0 0 0 0 3 0 1 2

"ENSG00000000005" 0 0 0 1 0 0 0 0 0 0

"ENSG00000000419" 0 0 0 0 0 0 0 0 2 0

"ENSG00000000457" 38 10 36 26 5 19 32 24 51 44

"ENSG00000000460" 33 3 9 10 0 6 11 7 7 13

"ENSG00000000938" 0 0 0 0 0 1 0 0 0 0

"ENSG00000000971" 626 216 289 209 140 202 270 0 606 345

"ENSG00000001036" 1 0 0 0 0 0 0 0 0 0

"ENSG00000001084" 5 1 0 3 0 0 0 0 3 2

"ENSG00000001167" 12 2 5 5 0 1 0 0 3 3

"ENSG00000001460" 0 0 0 0 0 0 0 0 0 0

"ENSG00000001461" 0 0 0 0 0 0 0 0 1 0

"ENSG00000001497" 0 0 0 1 0 0 0 0 0 0

"ENSG00000001561" 0 0 0 0 0 0 0 0 0 0

"ENSG00000001617" 21 4 28 19 5 29 45 0 25 23

"ENSG00000001626" 0 0 0 0 0 0 0 0 2 0

"ENSG00000001629" 0 0 0 0 0 0 0 0 0 0

"ENSG00000001630" 7 5 2 3 0 5 1 0 4 0

3、对表达矩阵进行差异表达分析，筛选出差异基因

以Pvalue<0.05&log2(Foldchange)>1为阈值，筛选差异表达基因。

setwd("D:/Rworkplace/HTseq ")
geneexprSet<-read.table("microRNA.txt",header = T,sep = " ")
dim(geneexprSet)
TTest<-function(geneexprSet,numStart,numEnd){P_value<-c()Foldchange<-c()len<-dim(geneexprSet)[1]for(n in 1:len){controlsample<-as.numeric(geneexprSet[n,2:numStart])testsample<-as.numeric(geneexprSet[n,(numStart+1):numEnd])res<-t.test(controlsample,testsample,paired=TRUE)meanControl<-mean(controlsample)meanSample<-mean(testsample)fold<-meanSample/meanControlp<-res$p.valueP_value<-append(P_value,p)Foldchange<-append(Foldchange,fold)}finalexprSet<-cbind(geneexprSet,Pvalue=P_value,Foldchange=Foldchange)write.table(finalexprSet,file="microfinalexprSet.txt",col.names = T,row.names = T)
}
TTest(geneexprSet,11,21)
geneexprSet<-read.table("exprSet.txt",header = T,sep = " ")
geneexprSet<-geneexprSet[order(geneexprSet$Pvalue), ]
diffresult<-subset(geneexprSet,geneexprSet$Pvalue<0.05&log2(geneexprSet$Foldchange)>1,select = c(-Pvalue,-Foldchange))
write.table(diffresult,file="diffresult.txt",col.names = T,row.names = T)

筛选得到的差异表达矩阵如下所示：

"Adult_35" "Adult_36" "Adult_44" "Adult_45" "Adult_47" "Pediatric_38" "Pediatric_40" "Pediatric_41" "Pediatric_42" "Pediatric_43"

"ENSG00000023839" 0 1 0 0 0 2 3 0 1 3

"ENSG00000031698" 18 14 19 22 7 37 23 19 42 47

"ENSG00000048462" 0 0 0 0 0 1 1 0 1 1

"ENSG00000066294" 4 1 4 1 0 2 9 7 7 8

"ENSG00000085999" 0 0 0 0 0 4 2 1 3 4

"ENSG00000086015" 394 236 304 552 276 522 576 610 849 1139

4、对差异表达基因可视化

（1）对矩阵进行预处理

dealdata<- read.table("diffresult.txt",header = TRUE)
#去除NA
dealdataNA<-na.omit(dealdata)
#在dealdata矩阵的基础上，绘图

（2）绘制火山图查看全部基因的表达情况

library(ggplot2)
volcano<-subset(dealdataNA,select = c(Pvalue,Foldchange))
threshold<-as.factor((log2(volcano$Foldchange)>1.5|log2(volcano$Foldchange)<(-1.5))&volcano$Pvalue<0.05)
r03=ggplot(volcano,aes(log2(Foldchange),-log2(Pvalue),colour=threshold))+geom_point()+xlim(-10,10)
r04=r03+geom_vline(xintercept=c(-1.5,1.5),linetype="dotted",size=1)+geom_hline(yintercept=-log2(0.05),col="blue")
r05=r04+labs(title="Volcanoplot")+theme(plot.title = element_text(hjust = 0.5))

（3）对筛选出的差异表达基因绘制热图

diffresult<-subset(dealdataNA,dealdataNA$Pvalue<0.05&log2(dealdataNA$Foldchange)>1,select = c(-Pvalue,-Foldchange))
write.table(diffresult,file="diffresult.txt",col.names = T,row.names = T)
library(pheatmap)
bk = unique(c(seq(-1,1, length=100)))
col_anno<-data.frame(type=c(rep("adult",5),rep("pediatric",5)),row.names=colnames(diffresult))pheatmap(diffresult,breaks=bk,main="the heatmap for diffgene expression in age samples(P<0.05)",scale='row',cutree_rows=2,cutree_cols=4,annotation_col=col_anno,fontsize_col=15,treeheight_row=120,treeheight_col=20,show_rownames = F ,color = colorRampPalette(c("red","black","green"))(100))

对差异表达基因进行计数，一共筛选出来85个差异表达基因（不分上调/下调）。

【方案二】

1、安装R包Ballgown包

#Ballgoon包的安装
BiocManager::install("Ballgoon", ask=F, update=F)
#安装成功！
#其余genefiliter/dplyr/devtools通过常规方法可以完成。
#其次安装RSkittleBrewer
library(devtools)
install_github('alyssafrazee/RSkittleBrewer')
#安装成功！
#加载包
library(RSkittleBrewer)
library(ballgown)
library(dplyr)
library(genefilter)
library(devtools)

2、使用R语言的程序包Ballgown对Stringtie的组装结果提取分析

setwd("D:\\Rworkplace\\ballgown")
sample<-read.csv("sample.csv")
bg = ballgown(dataDir='./', samplePattern='SRA', pData=sample)
bg_filt=subset(bg,"rowVars(texpr(bg))>1",genomesubset=TRUE)
result_transcripts=stattest(bg_filt,feature = "transcript",covariate="sample", getFC=TRUE,meas="FPKM")
result_genes=stattest(bg_filt,feature = "gene",covariate = "sample", getFC=TRUE,meas="FPKM")
result_transcripts=data.frame(geneNames=ballgown::geneNames(bg_filt),geneIDs=ballgown::geneIDs(bg_filt),result_transcripts)
result_transcripts=arrange(result_transcripts,pval)
result_genes=arrange(result_genes,pval)
write.csv(result_transcripts, "chrX_transcript_results.csv",row.names=FALSE)
write.csv(result_genes, "chrX_gene_results.csv",row.names=FALSE)

3、差异表达分析

difftranscript<-subset(result_transcripts,result_transcripts$qval<0.05)
diffgene<-subset(result_genes,result_genes$qval<0.05)
tropical= c("green","red")
palette(tropical)
fpkm = texpr(bg,meas="FPKM")
fpkm = log2(fpkm+1)setwd("D:\\Rworkplace\\ballgown")
gene<-read.csv("chrX_gene_results.csv")library(ggplot2)
volcano<-subset(gene,select = c(pval,fc))
threshold<-as.factor((log2(volcano$fc)>1.5|log2(volcano$fc)<(-1.5))&volcano$pval<0.05)
tropical= c("green","red")
palette(tropical)
r03=ggplot(volcano,aes(log2(fc),-log2(pval),colour=threshold))+geom_point()+xlim(-10,10)
r04=r03+geom_vline(xintercept=c(-1.5,1.5),linetype="dotted",size=1)+geom_hline(yintercept=-log2(0.05),col="blue")
r05=r04+labs(title="Volcanoplot")+theme(plot.title = element_text(hjust = 0.5))

4、可视化

（1）绘制盒氏图

boxplot(fpkm,col=as.numeric(sample$sample),las=2,ylab='log2(FPKM+1)',cex.axis=0.7,ylim=c(0,8),main="FPKM, adult in green, pediatric in red")

（2）转录本分布图

plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835935'),samples="SRA35")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835936'),samples="SRA36")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835938'),samples="SRA38")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835940'),samples="SRA40")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835942'),samples="SRA42")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835943'),samples="SRA43")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835944'),samples="SRA44")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835945'),samples="SRA45")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835947'),samples="SRA47")
plotTranscripts(ballgown::geneIDs(bg)[21808], bg, main=c('transcripts MSTRG.11191 in sample SRR6835941'),samples="SRA41")

以差异转录本MSTRG.11191为例，观察各样本在转录本上的分布。

Adult	Pediatric
A	F

B	G

C	H

D	I

E	J

plotMeans('MSTRG.11191', bg_filt,groupvar="sample",legend=T)

不同的R包的出现，一定是为了解决特定的需求。这在我们使用的过程中是尤其要注意的。

比较转录本MSTRG.11191在不同的样本组之间表达明显不同。

plot(fpkm[21808,]~sample$sample, border=c(1,2), main=paste(ballgown::geneIDs(bg)[21808],' : ', ballgown::transcriptNames(bg)[21808]),pch=19, xlab="age",ylab='log2(FPKM+1)',cex=1)
points(fpkm[21808,]~jitter(as.numeric(sample$sample)),col=c("black","blue"),pch=c(19,20))
grid()

（3）火山图

筛选得到309个差异表达基因（不分上下调）。

（七）分析与比较（相对简单）

参考链接：https://blog.csdn.net/hill_night/article/details/44829965

【HISAT与TopHat的对比】

速度：100bp，20M的reads

tophat用时26.7分钟，tophat2 1170分钟

比对正确率：hisat（99.2%），tophat2(97.4%)

敏感度和准确性：HISAT（97.3%,94.8%） Tophat2（90.6%,82.6%）

【StringTie和Cufflinks对比】

算法：

cufflinks parsimony算法 (简约算法)：生成最少的亚型

说这种算法没有考虑转录丰度，在isoforms方面算的不准。其在算表达量的时候，按照图上的说法是用了最大似然冗余算法。

stringTie先将reads分为不同的类，然后再针对每个类的reads生成一个拼接图来确定转录本，之后每个转录本产生一个流神经网络的最大流算法来评估表达水平

这个算法的意思对应过来就是在一个基因处的若干个转录本，如何分配reads的数目才能让每个转录本的数目都处在最多的状态。

组装结果：

在组装方面StringTie具有一些优势，在低表达的部分，阈值过滤5%的StringTie比阈值过滤10%的准确度和敏感度还要高。

关于组装效果，StringTie要好于cufflinks，同时他们又远远好于其他软件。

找出来的基因中，cufflink找出来的70%在StringTie中有重合，相比于cufflink，StringTie在基因重构方面对三种类型的基因更有效，分别是：低冗余，高exon数目，和多重转录本。

时间：

对于100bp的reads而言

StringTie 30min cufflink 81min更多

【cuffdiff和ballgown的对比】

测试内存而言，cuffdiff 9个小时用148G，而ballgown 18秒用8G

差异结果而言，cuffdiff更保守一些

（八）总结与心得

实验过程中遇到的错误及其解决方案

1、TopHat运行组装程序时：

Error:Could not find Bowtie 2 index files

(/home/s45/Downloads/index/HomoGRCh38.96.4.*.bt2)

（1）尝试在ensembel上下载同一来源的数据文件（gtf,fasta）,建立索引。

https://blog.csdn.net/sinat_38163598/article/details/73067785

如果是在不同网站上下载的染色体序列和基因gtf注释文件，就可能会出现错误。

所以下载数据时需要在同一网址下载，都在NCBI或者UCSC下载。

尝试在ensembel上下载同一来源的数据文件（gtf,fasta）,建立索引。

依旧出错。

tophat2-G/home/s45/Downloads/index/Homo_sapiens.GRCh38.96.4.gtf --transcriptome-index=GRCh38.96.4.tr GRCh38.96.4

屏显：

[2019-05-15 12:48:06] Building transcriptome files with TopHat v2.1.0
-----------------------------------------------
[2019-05-15 12:48:06] Checking for BowtieBowtie version:     2.2.6.0
[2019-05-15 12:48:16] Checking for Bowtie index files (genome)..
Error: Could not find Bowtie 2 index files (GRCh38.96.4.*.bt2)

（2）修改路径，依旧不行。

tophat2  -G /home/s45/Downloads/index/Homo_sapiens.GRCh38.96.4.gtf --transcriptome-index=GRCh38.96.4.tr /home/s45/Downloads/index/GRCh38.96.4

屏显：

[2019-05-15 12:49:28] Building transcriptome files with TopHat v2.1.0
-----------------------------------------------
[2019-05-15 12:49:28] Checking for BowtieBowtie version:     2.2.6.0
[2019-05-15 12:49:36] Checking for Bowtie index files (genome)..
Error: Could not find Bowtie 2 index files (/home/s45/Downloads/index/GRCh38.96.4.*.bt2)

（3）尝试从tophat官网上index的下载：

http://ccb.jhu.edu/software/tophat/igenomes.shtml

依旧不行。

最终解决方案：

tophat2 -p 4 -o ./ /home/s45/index/human19  SRR6835935.fastq.gz

2、Stringtie组装时出错。

Error: the input alignment file is not sorted!

我突然想到是不是需要建立索引。

于是，回到samtools那一步，如何对文件进行排序。

对sort后的bam文件建立索引。

samtools index SRR6835935_sort_hisat.bam

屏显：

[E::hts_idx_push] NO_COOR reads not in a single block at the end 7 -1

通过index指令，可以检验samtools文件是否经过排序。从上述的屏显，我可以推测出samtools对bam文件进行排序时，并没有排序成功。

于是，重新进行排序。

samtools sort SRR6835936.bam SRR6835936.sort

屏显：

s24@HP24:~/index/result$ samtools sort SRR6835936.bam SRR6835936.sort
[bam_sort_core] merging from 6 files…

再使用sort指令进行验证，这个sort文件能够建立索引，则推测这个文件是经过排序的。

这种情况下，我们再次尝试代码。

stringtie -p 8 -G ./hg19_refSeq.gtf -o ./36.gtf -l 36 SRR6835936.sort.bam

在文件夹中发现36.gtf文档。

推测可能原因是使用批处理指令时出错。

3、使用Hiseq处理bam文件时报错。

Error occured when reading beginning of SAM/BAM file.no BGZF EOF marker; file may be truncated

对bam文件是否完整的诊断：

samtools  view  42_align_sorted.bam|tail

4、HiSeq定量结果全为零。

ENSG00000000003    0
ENSG00000000005    0
ENSG00000000419    0
ENSG00000000457    0
ENSG00000000460    0
ENSG00000000938    0
ENSG00000000971    0
ENSG00000001036    0
ENSG00000001084    0
ENSG00000001167    0
ENSG00000001460    0
ENSG00000001461    0
ENSG00000001497    0
ENSG00000001561    0
ENSG00000001617    0
ENSG00000001626    0
ENSG00000001629    0
ENSG00000001630    0
ENSG00000001631    0
ENSG00000002016    0

gtf文件格式与tophat构建索引时的文件格式（1 or chr1）不一致。还有注意使用的注释文件的版本号。

在比对、组装等操作时一定要注意注释文档、序列文件、索引文件的文件格式和版本的一致性。否则很有可能会出现问题。

解决方案：

在GENECODE数据库中可以下载到chr开头的hg19 版本的gff3格式的人类基因组注释文件。

https://www.gencodegenes.org/human/release_29.html

本次实验我主要下载的Comprehensive gene annotation（Regions:CHR）人类染色体的注释文件。

【参考链接】（部分）：

https://www.cnblogs.com/chenpeng1024/p/9248891.html

http://www.biotrainee.com/thread-415-1-1.html

http://blog.sciencenet.cn/blog-759995-990471.html

https://www.jianshu.com/p/681e02e7f9af

https://blog.csdn.net/qq_40280759/article/details/80552193

https://blog.csdn.net/weixin_34281477/article/details/87016243

https://www.jianshu.com/p/1f5d13cc47f8?from=timeline

https://www.jianshu.com/p/7ce1add65ab8

http://blog.sina.com.cn/s/blog_9c93f6450102wcx9.html

https://www.plob.org/article/12130.html

http://blog.sciencenet.cn/blog-1509670-848266.html

https://blog.csdn.net/kongxx/article/details/88653970

http://www.360doc.com/content/18/0104/10/19913717_718924981.shtml

https://www.jianshu.com/p/681e02e7f9af

https://www.jianshu.com/p/796b5e711a26

http://blog.sina.com.cn/s/blog_83f77c940102v7wl.html

http://blog.sina.com.cn/s/blog_6836e9f40102v4hb.html

https://www.cnblogs.com/wangprince2017/p/9937495.html

https://blog.csdn.net/hill_night/article/details/44829965

https://www.jianshu.com/p/38c2406367d5