分析流程

image.png

1.上传四个样本原始就文件到服务器

Step 1. Analysing Sequence Quality with FastQC

Description

"FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis."

The first step before processing any samples is to analyze the quality of the data. Within the fastq file is quality information that refers to the accuracy (% confidence) of each base call. FastQC looks at different aspects of the sample sequences to determine any irregularies or features that make affect your results (adapter contamination, sequence duplication levels, etc.)

Installation

conda install -c bioconda fastqc --yes

Command

# Help

fastqc -h

# Run FastQC

fastqc \

-o results/1_initial_qc/ \

--noextract \

input/sample.fastq

Output

── results/1_initial_qc/

└── sample_fastqc.html

└── sample_fastqc.zip

2.质控

fastqc生成质控报告,multiqc将各个样本的质控报告整合为一个。

(rna) cheng@super-Super-Server:~/project$ ls

NALM6-ctrl_combined_R1.filtered.1.fq.gz

NALM6-ctrl_combined_R1.filtered.2.fq.gz

NALM6-treat_combined_R1.filtered.1.fq.gz

NALM6-treat_combined_R1.filtered.2.fq.gz

RS411-ctrl_combined_R1.filtered.1.fq.gz

RS411-ctrl_combined_R1.filtered.2.fq.gz

RS411-treat_combined_R1.filtered.1.fq.gz

RS411-treat_combined_R1.filtered.2.fq.gz

(rna) cheng@super-Super-Server:~/project$ ls *gz | xargs fastqc -t 2

第一步fastqc生成质控报告,-t设为2时同时分析2个原始fq数据,等待约30min完成,因为本服务器没安装multiqc,就暂时不做整合。

Step2: filter the bad quality reads and remove adaptors.

运行如下代码,得到名为config的文件,包含两列数据

mkdir $wkd/clean

cd $wkd/clean

ls /home/data/cheng/project/raw/*1.fq.gz >1

ls /home/data/cheng/project/raw/*2.fq.gz >2

paste 1 2 > config

(rna) cheng@super-Super-Server:~/project$ pwd

/home/cheng/project

(rna) cheng@super-Super-Server:~/project$ mkdir -p clean

(rna) cheng@super-Super-Server:~/project$ cd clean/

(rna) cheng@super-Super-Server:~/project/clean$ ls /home/data/cheng/project/raw/*.1.fq.gz >1

(rna) cheng@super-Super-Server:~/project/clean$ ls /home/data/cheng/project/raw/*.2.fq.gz >2

(rna) cheng@super-Super-Server:~/project/clean$ paste 1 2 > config

(rna) cheng@super-Super-Server:~/project/clean$ cat config

/home/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz

/home/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz

/home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz

/home/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz /home/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz

(rna) cheng@super-Super-Server:~/project/clean$ vi qc.sh

(rna) cheng@super-Super-Server:~/project/clean$ bash qc.sh config

打开文件 qc.sh ,并且写入如下内容

trim_galore,用于去除低质量和接头数据

source activate rna

bin_trim_galore=trim_galore

dir='/home/data/cheng/project/clean'

cat $1 |while read id

do

arr=(${id})

fq1=${arr[0]}

fq2=${arr[1]}

$bin_trim_galore -q 25 --phred33 --length 36 --stringency 3 --paired -o $dir $fq1 $fq2

done

source deactivate

运行qc.sh

bash qc.sh config #config是传递进去的参数

结果显示如下

RUN STATISTICS FOR INPUT FILE: /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz

=============================================

31046571 sequences processed in total

The length threshold of paired-end sequences gets evaluated later on (in the validation step)

Writing report to '/home/cheng/project/clean/RS411-ctrl_combined_R1.filtered.2.fq.gz_trimming_report.txt'

SUMMARISING RUN PARAMETERS

==========================

Input filename: /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz

Trimming mode: paired-end

Trim Galore version: 0.6.4_dev

Cutadapt version: 2.6

Number of cores used for trimming: 1

Quality Phred score cutoff: 25

Quality encoding type selected: ASCII+33

Adapter sequence: 'AGATCGGAAGAGC' (Illumina TruSeq, Sanger iPCR; auto-detected)

Maximum trimming error rate: 0.1 (default)

Minimum required adapter overlap (stringency): 3 bp

Minimum required sequence length for both reads before a sequence pair gets removed: 36 bp

Output file(s) will be GZIP compressed

Cutadapt seems to be fairly up-to-date (version 2.6). Setting -j -j 1

Writing final adapter and quality trimmed output to RS411-ctrl_combined_R1.filtered.2_trimmed.fq.gz

>>> Now performing quality (cutoff '-q 25') and adapter trimming in a single pass for the adapter sequence: 'AGATCGGAAGAGC' from file /home/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz <<<

Connection to 192.168.1.6 closed by remote host.

Connection to 192.168.1.6 closed.

我的程序被中断了,因为内存不足可能

管理员给我发一个提醒:

(rna) cheng@super-Super-Server:~$ cat A_Message_from_Admin

your programme almost used all the disk, so I killed it.

Please move all your data to /home/data, this directory is on the other disk and it is big enough to run your programme.

After you read this message, you can delete it by yourself. I changed the owner to you.

更换工作目录:

(rna) cheng@super-Super-Server:~$ mv project/ /home/data/

(rna) cheng@super-Super-Server:~$ cd /home/data/

(rna) cheng@super-Super-Server:/home/data$ mv project/ cheng/

(rna) cheng@super-Super-Server:/home/data$ cd cheng

(rna) cheng@super-Super-Server:/home/data/cheng$ ls

project

查看目录下软件信息

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/

genome gtf hisat index trnadb

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/biosoft/

cellranger-4.0.0 JAGS-4.3.0 ncbi-blast-2.10.1+ refdata-gex-GRCh38-2020-A.tar.gz tophat-2.1.1.Linux_x86_64.tar.gz

cellranger-4.0.0.tar.tar JAGS-4.3.0.tar.gz ncbi-blast-2.10.1+-x64-linux.tar.gz RNA-SeQC wget.out

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/biosoft/RNA-SeQC/

gencode.v7.annotation_goodContig.gtf.gz gencode.v7.gc.txt Homo_sapiens_assembly19.fasta.gz rRNA.tar.gz ThousandReads.bam

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/genome/

GRCh38_reference_genome hg19 hg38 M25 mm10

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/gtf/

ensembl gencode

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/hisat/

grcm38.tar.gz hg19.tar.gz hg38.tar.gz mm10.tar.gz

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/index/

bowtie bwa hisat tophat

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/trnadb/

hg38 mm10

本服务器可以用比对软件较多:

(rna) cheng@super-Super-Server:/home/data/cheng$ ls /home/data/reference/index/

bowtie bwa hisat tophat

先完成去去接头:

#确定工作目录

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ pwd

/home/data/cheng/project/clean

#构建 config

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls /home/data/cheng/project/raw/*1.fq.gz >1

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls /home/data/cheng/project/raw/*2.fq.gz >2

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ paste 1 2 > config

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ ls

1 2 config

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ cat config

/home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz

/home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz

/home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz

/home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz

#编写质控脚本

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ vi qc.sh

#将任务挂到后台运行

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ nohup bash qc.sh config &

[1] 16284

step4: alignment

star, hisat2, bowtie2, tophat, bwa, subread都是可以用于比到的软件

先运行一个样本,测试一下

mkdir $wkd/test

cd $wkd/test

source activate rna

ls $wkd/clean/*gz |while read id;do (zcat ${id}|head -1000> $(basename ${id} ".gz"));done

id=SRR1039508

hisat2 -p 10 -x /public/reference/index/hisat/hg38/genome -1 ${id}_1_val_1.fq -2 ${id}_2_val_2.fq -S ${id}.hisat.sam

subjunc -T 5 -i /public/reference/index/subread/hg38 -r ${id}_1_val_1.fq -R ${id}_2_val_2.fq -o ${id}.subjunc.sam

bowtie2 -p 10 -x /public/reference/index/bowtie/hg38 -1 ${id}_1_val_1.fq -2 ${id}_2_val_2.fq -S ${id}.bowtie.sam

bwa mem -t 5 -M /public/reference/index/bwa/hg38 ${id}_1_val_1.fq ${id}_2_val_2.fq > ${id}.bwa.sam

事实上,对RNA-seq数据来说,不要使用bwa和bowtie这样的软件,它需要的是能进行跨越内含子比对的工具。所以后面就只先择hisat2,STAR进行下一步,但是本服务器中hisat2的索引文件不完整,最后只用STAR进行了比对

批量比对代码

cd $wkd/clean

ls *gz|cut -d"_" -f 1 |sort -u |while read id;do

ls -lh ${id}_combined_R1.filtered.1_val_1.fq.gz ${id}_combined_R1.filtered.2_val_2.fq.gz

hisat2 -p 10 -x /home/data/reference/index/hisat/hg38/genome -1 ${id}_combined_R1.filtered.1_val_1.fq.gz -2 ${id}_combined_R1.filtered.2_val_2.fq.gz -S ${id}.hisat.sam

done

Note the two inputs for this command are the genome located in the (genome/ folder) and the annotation file located in the (annotation/ folder)

# This can take up to 30 minutes to complete

STAR \

--runMode genomeGenerate \

--genomeDir star_index \

--genomeFastaFiles /home/data/reference/genome/hg38/* \

--sjdbGTFfile /home/data/reference/gtf/ensembl/* \

--runThreadN 4

# Help

STAR -h

# Run STAR (~10min)

STAR \

--genomeDir star_index \

--readFilesIn filtered/sample_filtered.fq \

--runThreadN 4 \

--outSAMtype BAM SortedByCoordinate \

--quantMode GeneCounts

# Move the BAM file into the correct folder

mv -v results/4_aligned_sequences/sampleAligned.sortedByCoord.out.bam results/4_aligned_sequences/aligned_bam/

# Move the logs into the correct folder

mv -v results/4_aligned_sequences/${BN}Log.final.out results/4_aligned_sequences/aligned_logs/

mv -v results/4_aligned_sequences/sample*Log.out results/4_aligned_sequences/aligned_logs/

先构建STAR索引

Generating Indexes

Similar to the SortMeRNA step, we must first generate an index of the genome we want to align to, so that there tools can efficently map over millions of sequences. The star_index folder will be the location that we will keep the files necessary to run STAR and due to the nature of the program, it can take up to 30GB of space. This step only needs to be run once and can be used for any subsequent RNAseq alignment analyses.

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ STAR

Usage: STAR [options]... --genomeDir /path/to/genome/index/ --readFilesIn R1.fq R2.fq

Spliced Transcripts Alignment to a Reference (c) Alexander Dobin, 2009-2020

STAR version=2.7.5a

STAR compilation time,server,dir=Tue Jun 16 12:17:16 EDT 2020 vega:/home/dobin/data/STAR/STARcode/STAR.master/source

For more details see:

To list all parameters, run STAR --help

(rna) cheng@super-Super-Server:/home/data/cheng/project/clean$ cd /home/data/cheng

(rna) cheng@super-Super-Server:/home/data/cheng$ ls

project

(rna) cheng@super-Super-Server:/home/data/cheng$ nohup STAR \

> --runMode genomeGenerate \

> --genomeDir star_index \

> --genomeFastaFiles /home/data/reference/genome/* \

> --sjdbGTFfile /home/data/reference/gtf/* \

> --runThreadN 4 &

[2] 18522

(rna) cheng@super-Super-Server:/home/data/cheng$ nohup: ignoring input and appending output to 'nohup.out'

[2]+ Exit 104 nohup STAR --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles /home/data/reference/genome/* --sjdbGTFfile /home/data/reference/gtf/* --runThreadN 4

(rna) cheng@super-Super-Server:/home/data/cheng$ ls

nohup.out project star_index _STARtmp

STAR --runThreadN 20 --runMode genomeGenerate \

--genomeDir /home/data/cheng/star_index \

--genomeFastaFiles /home/data/reference/genome/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa \

--sjdbGTFfile /home/data/reference/gtf/gencode.v35.annotation.gtf \

–runThreadN 是指构建是使用的线程数,在没有其他数据在跑的情况下,可以满线程跑

–runMode genomeGenerate 让STAR执行基因组索引的生成工作

–genomeDir 构建好的参考基因组存放的位置,最好是单独建立的一个文件夹

–genomeFastaFiles 参考基因组序列文件

–sjdbGTFfile 基因注释文件

比对

我们来看一下比对的代码

STAR

--runThreadN 20 \

--genomeDir /home/data/cheng/star_index \

--readFilesCommand gunzip -c \

--readFilesIn /home/data/cheng/ /home/data/cheng/ \

--outSAMtype BAM SortedByCoordinate \

--outFileNamePrefix N052611_Alb \

–runThreadN 运行的线程数,根据自己的服务器合理选择

–genomeDir 构建的参考基因组位置

–readFilesCommand 对于gz压缩的文件,我们可以在后面添加 gunzip -c

–readFilesIn 输入文件的位置,对于双末端测序文件,用空格分隔开就行了

–outSAMtype 默认输出的是sam文件,我们这里的BAM SortedByCoordinate是让他输出为ban文件,并排序

–outFileNamePrefix 表示的是输出文件的位置和前缀

然后就是输出文件的问题,输出的文件不止一个,包含了比对过程中的一些信息

Aligned.out.sam或者Aligned.out.bam

它指的就是我们的比对结果

Log.progress.out

它是每分钟记录一次的对比情况

Log.out

它记录了STAR程序在运行中的各种情况,当我们的结果出现异常时,我们可以查看具体的运行情况,来查找错误

Log.final.out

它包含的是对比完以后的对比统计信息

SJ.out.tab

它包含了剪切的信息

nohup STAR --runThreadN 20 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix NALM6-ctrl &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/NALM6-treat_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix NALM6-treat &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-ctrl_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix RS411-ctrl &

nohup STAR --runThreadN 4 --genomeDir /home/data/cheng/star_index --readFilesCommand gunzip -c --readFilesIn /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.1.fq.gz /home/data/cheng/project/raw/RS411-treat_combined_R1.filtered.2.fq.gz --outSAMtype BAM SortedByCoordinate --quantMode GeneCounts --outFileNamePrefix RS411-treat &

# Move the BAM file into the correct folder

mv -v *Aligned.sortedByCoord.out.bam /home/data/cheng/project/results/4_aligned_sequences/aligned_bam/

# Move the logs into the correct folder

mv -v *.out /home/data/cheng/project/results/4_aligned_sequences/aligned_logs/

Step 5. Summarizing Gene Counts with featureCounts

Description

"featureCounts is a highly efficient general-purpose read summarization program that counts mapped reads for genomic features such as genes, exons, promoter, gene bodies, genomic bins and chromosomal locations. It can be used to count both RNA-seq and genomic DNA-seq reads. featureCounts takes as input SAM/BAM files and an annotation file including chromosomal coordinates of features. It outputs numbers of reads assigned to features (or meta-features). It also outputs stat info for the overall summrization results, including number of successfully assigned reads and number of reads that failed to be assigned due to various reasons (these reasons are included in the stat info)."

Now that we have our .BAM alignment files, we can then proceed to try and summarize these coordinates into genes and abundances. To do this we must summarize the reads using featureCounts or any other read summarizer tool, and produce a table of genes by samples with raw sequence abundances. This table will then be used to perform statistical analysis and find differentially expressed genes.

# Help

featureCounts -h

# Change directory into the aligned .BAM folder

cd /home/data/cheng/project/results/4_aligned_sequences/aligned_bam

# Store list of files as a variable

dirlist=$(ls -t ./*.bam | tr '\n' ' ')

echo $dirlist

# Run featureCounts on all of the samples (~10 minutes)

featureCounts \

-a /home/data/reference/gtf/* \

-o /home/data/cheng/project/results/5_final_counts/final_counts.txt \

-g 'gene_name' \

-T 4 \

$dirlist

(rna) cheng@super-Super-Server:/home/data/cheng/project/results/4_aligned_sequences/aligned_bam$ featureCounts -a /home/data/reference/gtf/* -o /home/data/cheng/project/results/5_final_counts/final_counts.txt -g 'gene_name' -T 4 $dirlist

========== _____ _ _ ____ _____ ______ _____

===== / ____| | | | _ \| __ \| ____| /\ | __ \

===== | (___ | | | | |_) | |__) | |__ / \ | | | |

==== \___ \| | | | _

==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |

========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/

v2.0.1

//========================== featureCounts setting ===========================\\

|| ||

|| Input files : 4 BAM files ||

|| o NALM6-treatAligned.sortedByCoord.out.bam ||

|| o RS411-ctrlAligned.sortedByCoord.out.bam ||

|| o RS411-treatAligned.sortedByCoord.out.bam ||

|| o NALM6-ctrlAligned.sortedByCoord.out.bam ||

|| ||

|| Output file : final_counts.txt ||

|| Summary : final_counts.txt.summary ||

|| Annotation : gencode.v35.annotation.gtf (GTF) ||

|| Dir for temp files : /home/data/cheng/project/results/5_final_counts ||

|| ||

|| Threads : 4 ||

|| Level : meta-feature level ||

|| Paired-end : no ||

|| Multimapping reads : not counted ||

|| Multi-overlapping reads : not counted ||

|| Min overlapping bases : 1 ||

|| ||

\\============================================================================//

//================================= Running ==================================\\

|| ||

|| Load annotation file gencode.v35.annotation.gtf ... ||

|| Features : 1398443 ||

|| Meta-features : 59609 ||

|| Chromosomes/contigs : 25 ||

|| ||

|| Process BAM file NALM6-treatAligned.sortedByCoord.out.bam... ||

|| WARNING: Paired-end reads were found. ||

|| Total alignments : 86954860 ||

|| Successfully assigned alignments : 37685067 (43.3%) ||

|| Running time : 0.57 minutes ||

|| ||

|| Process BAM file RS411-ctrlAligned.sortedByCoord.out.bam... ||

|| WARNING: Paired-end reads were found. ||

|| Total alignments : 69733102 ||

|| Successfully assigned alignments : 30363294 (43.5%) ||

|| Running time : 0.46 minutes ||

|| ||

|| Process BAM file RS411-treatAligned.sortedByCoord.out.bam... ||

|| WARNING: Paired-end reads were found. ||

|| Total alignments : 81999795 ||

|| Successfully assigned alignments : 34018798 (41.5%) ||

|| Running time : 0.56 minutes ||

|| ||

|| Process BAM file NALM6-ctrlAligned.sortedByCoord.out.bam... ||

|| WARNING: Paired-end reads were found. ||

|| Total alignments : 70082894 ||

|| Successfully assigned alignments : 32230367 (46.0%) ||

|| Running time : 0.48 minutes ||

|| ||

|| Write the final count table. ||

|| Write the read assignment summary. ||

|| ||

|| Summary of counting results can be found in file "/home/data/cheng/projec ||

|| t/results/5_final_counts/final_counts.txt.summary" ||

|| ||

\\============================================================================//

下游分析

four_sample_deg

Cheng Liangping

24 十月, 2020

1 Importing Gene Counts into R/RStudio

Once the workflow has completed, you can now use the gene count table as an input into DESeq2 for statistical analysis using the R-programming language. It is highly reccomended to use RStudio when writing R code and generating R-related analyses. You can download RStudio for your system here: https://www.rstudio.com/products/rstudio/download/

1.1 Install required R-libraries

rm(list = ls())

options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")

options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

options()$repos

## CRAN

## "https://mirrors.tuna.tsinghua.edu.cn/CRAN/"

options()$BioC_mirror

## [1] "http://mirrors.tuna.tsinghua.edu.cn/bioconductor/"

# install.packages("gage")

# BiocManager::install("org.Hs.eg.db",ask = F,update = F)

library(DESeq2)

library(ggplot2)

library(clusterProfiler)

library(biomaRt)

library(ReactomePA)

library(DOSE)

library(KEGG.db)

library(org.Hs.eg.db)

library(pheatmap)

library(genefilter)

library(RColorBrewer)

library(GO.db)

library(topGO)

library(dplyr)

library(gage)

library(ggsci)

1.1.1 Import featureCounts output

One you have an R environment appropriatley set up, you can begin to import the featureCounts table found within the 5_final_counts folder. This tutorial will use DESeq2 to normalize and perform the statistical analysis between sample groups. Be sure to know the full location of the final_counts.txt file generate from featureCounts.

Note: If you would like to use an example final_counts.txt table, look into the example/ folder.

# Import gene counts table

# - skip first row (general command info)

# - make row names the gene identifiers

countdata

# Remove .bam + '..' from column identifiers

colnames(countdata)

colnames(countdata)

colnames(countdata)

#save(countdata,file = "raw.Rdata")

# Remove length/char columns

countdata

# Make sure ID's are correct

head(countdata)

## NALM6.treat RS411.ctrl RS411.treat NALM6.ctrl

## DDX11L1 26 5 20 12

## WASH7P 1119 206 202 435

## MIR6859-1 21 6 4 9

## MIR1302-2HG 2 0 3 1

## MIR1302-2 0 0 0 0

## FAM138A 0 0 0 0

1.1.2 Import metadata text file. The SampleID’s must be the first column.

# Import metadata file

# - make row names the matching sampleID's from the countdata

metadata

# Reorder sampleID's to match featureCounts column order.

metadata

# Make sure ID's are correct

head(metadata)

## Group Replicate sampleid

## NALM6.treat Hiexp Rep2 NALM6.treat

## RS411.ctrl Loexp Rep1 RS411.ctrl

## RS411.treat Loexp Rep2 RS411.treat

## NALM6.ctrl Hiexp Rep1 NALM6.ctrl

1.1.3 Make DESeq2 object from counts and metadata

# - countData : count dataframe

# - colData : sample metadata in the dataframe with row names as sampleID's

# - design : The design of the comparisons to use.

# Use (~) before the name of the column variable to compare

ddsMat

colData = metadata,

design = ~Group)

# Find differential expressed genes

ddsMat

1.1.4 Get basic statisics about the number of significant genes

# Get results from testing with FDR adjust pvalues

results

# Generate summary of testing.

summary(results)

##

## out of 37196 with nonzero total read count

## adjusted p-value < 0.05

## LFC > 0 (up) : 2345, 6.3%

## LFC < 0 (down) : 1530, 4.1%

## outliers [1] : 0, 0%

## low counts [2] : 11169, 30%

## (mean count < 4)

## [1] see 'cooksCutoff' argument of ?results

## [2] see 'independentFiltering' argument of ?results

# Check directionality of the log2 fold changes

## Log2 fold change is set as (Loexp / Hiexp)

## Postive fold changes = Increased in Loexp

## Negative fold changes = Decreased in Loexp

mcols(results, use.names = T)

## DataFrame with 6 rows and 2 columns

## type description

##

## baseMean intermediate mean of normalized counts for all samples

## log2FoldChange results log2 fold change (MLE): Group Loexp vs Hiexp

## lfcSE results standard error: Group Loexp vs Hiexp

## stat results Wald statistic: Group Loexp vs Hiexp

## pvalue results Wald test p-value: Group Loexp vs Hiexp

## padj results fdr adjusted p-values

1.2 Annotate gene symbols

After alignment and summarization, we only have the annotated gene symbols. To get more information about significant genes, we can use annoated databases to convert gene symbols to full gene names and entrez ID’s for further analysis.

1.2.1 Gather gene annotation information

# Mouse genome database (Select the correct one)

library(org.Hs.eg.db)

# Add gene full name

results$description

keys = row.names(results),

column = "GENENAME",

keytype = "SYMBOL",

multiVals = "first")

# Add gene symbol

results$symbol

# Add ENTREZ ID

results$entrez

keys = row.names(results),

column = "ENTREZID",

keytype = "SYMBOL",

multiVals = "first")

# Add ENSEMBL

results$ensembl

keys = row.names(results),

column = "ENSEMBL",

keytype = "SYMBOL",

multiVals = "first")

# Subset for only significant genes (q < 0.05)

results_sig

head(results_sig)

## log2 fold change (MLE): Group Loexp vs Hiexp

## Wald test p-value: Group Loexp vs Hiexp

## DataFrame with 6 rows and 10 columns

## baseMean log2FoldChange lfcSE stat pvalue padj

##

## MTND1P23 2424.6057 -4.11145 0.678250 -6.06185 1.34561e-09 4.39978e-08

## MTND2P28 1813.7288 3.75228 0.738809 5.07882 3.79791e-07 8.29959e-06

## MTCO1P12 6187.1454 -3.29924 0.634548 -5.19936 1.99981e-07 4.56171e-06

## MTCO2P12 169.0384 -2.08444 0.773877 -2.69351 7.07044e-03 4.79849e-02

## MXRA8 725.5861 2.58245 0.664147 3.88837 1.00920e-04 1.26220e-03

## LINC01770 54.9444 9.28692 1.703909 5.45036 5.02681e-08 1.27891e-06

## description symbol entrez

##

## MTND1P23 MT-ND1 pseudogene 23 MTND1P23 100887749

## MTND2P28 MT-ND2 pseudogene 28 MTND2P28 100652939

## MTCO1P12 MT-CO1 pseudogene 12 MTCO1P12 107075141

## MTCO2P12 MT-CO2 pseudogene 12 MTCO2P12 107075310

## MXRA8 matrix remodeling associated 8 MXRA8 54587

## LINC01770 long intergenic non-protein coding RNA 1770 LINC01770 102724312

## ensembl

##

## MTND1P23 NA

## MTND2P28 NA

## MTCO1P12 NA

## MTCO2P12 NA

## MXRA8 ENSG00000162576

## LINC01770 NA

1.2.2 Write all the important results to .txt files

# Write normalized gene counts to a .txt file

write.table(x = as.data.frame(counts(ddsMat), normalized = T),

file = 'normalized_counts.txt',

sep = '\t',

quote = F,

col.names = NA)

# Write significant normalized gene counts to a .txt file

write.table(x = counts(ddsMat[row.names(results_sig)], normalized = T),

file = 'normalized_counts_significant.txt',

sep = '\t',

quote = F,

col.names = NA)

# Write the annotated results table to a .txt file

write.table(x = as.data.frame(results),

file = "results_gene_annotated.txt",

sep = '\t',

quote = F,

col.names = NA)

# Write significant annotated results table to a .txt file

write.table(x = as.data.frame(results_sig),

file = "results_gene_annotated_significant.txt",

sep = '\t',

quote = F,

col.names = NA)

1.3 Plotting Gene Expression Data

There are multiple ways to plot gene expression data. Below we are only listing a few popular methods, but there are many more resources (Going Further) that will walk through different R commands/packages for plotting.

1.3.1 PCA plot

# Convert all samples to rlog

ddsMat_rlog

# Plot PCA by column variable

plotPCA(ddsMat_rlog, intgroup = "Group", ntop = 500) +

theme_bw() + # remove default ggplot2 theme

geom_point(size = 5) + # Increase point size

scale_y_continuous(limits = c(-5, 5)) + # change limits to fix figure dimensions

ggtitle(label = "Principal Component Analysis (PCA)",

subtitle = "Top 500 most variable genes")

image.png

1.3.2 Heatmap

# Convert all samples to rlog

ddsMat_rlog

# Gather 30 significant genes and make matrix

mat

# Choose which column variables you want to annotate the columns by.

annotation_col = data.frame(

Group = factor(colData(ddsMat_rlog)$Group),

Replicate = factor(colData(ddsMat_rlog)$Replicate),

row.names = colData(ddsMat_rlog)$sampleid

)

# Specify colors you want to annotate the columns by.

ann_colors = list(

Group = c(Loexp = "lightblue", Hiexp = "darkorange"),

Replicate = c(Rep1 = "darkred", Rep2 = "forestgreen")

)

# Make Heatmap with pheatmap function.

## See more in documentation for customization

pheatmap(mat = mat,

color = colorRampPalette(brewer.pal(9, "YlOrBr"))(255),

scale = "row", # Scale genes to Z-score (how many standard deviations)

annotation_col = annotation_col, # Add multiple annotations to the samples

annotation_colors = ann_colors,# Change the default colors of the annotations

fontsize = 6.5, # Make fonts smaller

cellwidth = 55, # Make the cells wider

show_colnames = F)

image.png

1.3.3 Volcano Plot

# Gather Log-fold change and FDR-corrected pvalues from DESeq2 results

## - Change pvalues to -log10 (1.3 = 0.05)

data

pval = -log10(results$padj),

lfc = results$log2FoldChange)

# Remove any rows that have NA as an entry

data

# Color the points which are up or down

## If fold-change > 0 and pvalue > 1.3 (Increased significant)

## If fold-change < 0 and pvalue > 1.3 (Decreased significant)

data 0 & data$pval > 1.3 ~ "Increased",

data$lfc < 0 & data$pval > 1.3 ~ "Decreased",

data$pval < 1.3 ~ "nonsignificant"))

# Make a basic ggplot2 object with x-y values

vol

# Add ggplot2 layers

vol +

ggtitle(label = "Volcano Plot", subtitle = "Colored by fold-change direction") +

geom_point(size = 2.5, alpha = 0.8, na.rm = T) +

scale_color_manual(name = "Directionality",

values = c(Increased = "#008B00", Decreased = "#CD4F39", nonsignificant = "darkgray")) +

theme_bw(base_size = 14) + # change overall theme

theme(legend.position = "right") + # change the legend

xlab(expression(log[2]("Loexp" / "Hiexp"))) + # Change X-Axis label

ylab(expression(-log[10]("adjusted p-value"))) + # Change Y-Axis label

geom_hline(yintercept = 1.3, colour = "darkgrey") + # Add p-adj value cutoff line

scale_y_continuous(trans = "log1p") # Scale yaxis due to large p-values

image.png

1.3.4 MA Plot

plotMA(results, ylim = c(-5, 5))

image.png

1.3.5 Plot Dispersions

plotDispEsts(ddsMat)

image.png

1.3.6 Single gene plot

# Convert all samples to rlog

ddsMat_rlog

# Get gene with highest expression

top_gene

# Plot single gene

plotCounts(dds = ddsMat,

gene = top_gene,

intgroup = "Group",

normalized = T,

transform = T)

image.png

1.4 Finding Pathways from Differential Expressed Genes

Pathway enrichment analysis is a great way to generate overall conclusions based on the individual gene changes. Sometimes individiual gene changes are overwheling and are difficult to interpret. But by analyzing the pathways the genes fall into, we can gather a top level view of gene responses. You can find more information about clusterProfiler here: http://bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html

1.4.1 Set up matrix to take into account EntrezID’s and fold changes for each gene

# Remove any genes that do not have any entrez identifiers

results_sig_entrez

# Create a matrix of gene log2 fold changes

gene_matrix

# Add the entrezID's as names for each logFC entry

names(gene_matrix)

# View the format of the gene matrix

##- Names = ENTREZ ID

##- Values = Log2 Fold changes

head(gene_matrix)

## 100887749 100652939 107075141 107075310 54587 102724312

## -4.111451 3.752277 -3.299241 -2.084445 2.582450 9.286916

gene_matrix=sort(gene_matrix,decreasing = T)

1.4.2 Enrich genes using the KEGG database

kegg_enrich

organism = 'human',

pvalueCutoff = 0.05,

qvalueCutoff = 0.10)

kegg_enrich

# Plot results

barplot(kegg_enrich,

drop = TRUE,

showCategory = 10,

title = "KEGG Enrichment Pathways",

font.size = 8)

image.png

1.4.3 Enrich genes using the Gene Onotlogy

go_enrich

OrgDb = 'org.Hs.eg.db',

readable = T,

ont = "BP",

pvalueCutoff = 0.05,

qvalueCutoff = 0.10)

# Plot results

barplot(go_enrich,

drop = TRUE,

showCategory = 10,

title = "GO Biological Pathways",

font.size = 8)

image.png

1.5 Plotting KEGG Pathways

Pathview is a package that can take KEGG identifier and overlay fold changes to the genes which are found to be significantly different. Pathview also works with other organisms found in the KEGG database and can plot any of the KEGG pathways for the particular organism.

# Load pathview

library(pathview)

## ##############################################################################

## Pathview is an open source software package distributed under GNU General

## Public License version 3 (GPLv3). Details of GPLv3 is available at

## http://www.gnu.org/licenses/gpl-3.0.html. Particullary, users are required to

## formally cite the original Pathview paper (not just mention it) in publications

## or products. For details, do citation("pathview") within R.

##

## The pathview downloads and uses KEGG data. Non-academic uses may require a KEGG

## license agreement (details at http://www.kegg.jp/kegg/legal.html).

## ##############################################################################

# Plot specific KEGG pathways (with fold change)

## pathway.id : KEGG pathway identifier

pathview(gene.data = gene_matrix,

pathway.id = "00020",

species = "hsa")

## 'select()' returned 1:1 mapping between keys and columns

## Info: Working in directory /Users/cheng/Downloads/TYL1/20201023

## Info: Writing image file hsa00020.pathview.png

1.6 疾病通路

有一显著差异基因富集到一个已经疾病通路: 相关疾病为:Lewy body dementia。

x

ont = "DO",

pvalueCutoff = 0.05,

pAdjustMethod = "BH",

minGSSize = 5,

maxGSSize = 500,

qvalueCutoff = 0.05,

readable = FALSE)

x

head(x)

## ID Description GeneRatio

## DOID:0060049 DOID:0060049 autoimmune disease of urogenital tract 26/1207

## DOID:12236 DOID:12236 primary biliary cirrhosis 26/1207

## DOID:1037 DOID:1037 lymphoblastic leukemia 99/1207

## DOID:0060100 DOID:0060100 musculoskeletal system cancer 98/1207

## DOID:9119 DOID:9119 acute myeloid leukemia 33/1207

## DOID:16 DOID:16 integumentary system disease 88/1207

## BgRatio pvalue p.adjust qvalue

## DOID:0060049 73/8007 1.097364e-05 0.005426588 0.004509630

## DOID:12236 73/8007 1.097364e-05 0.005426588 0.004509630

## DOID:1037 443/8007 1.839793e-05 0.005426588 0.004509630

## DOID:0060100 439/8007 2.127302e-05 0.005426588 0.004509630

## DOID:9119 107/8007 2.550088e-05 0.005426588 0.004509630

## DOID:16 394/8007 5.539968e-05 0.009824211 0.008164164

## geneID

## DOID:0060049 SNCA/CAV1/NOS2/ICOS/ABCB4/PTPN13/IL7R/ENTPD1/ITGAL/TGFBR2/TNFSF13/TNFSF13B/TLR4/FN1/TNFSF10/TNFSF9/TGFB1/HLA-DRB1/FAS/ICAM1/HGF/CXCL10/CD27/CNR1/FCRL3/ENTPD2

## DOID:12236 SNCA/CAV1/NOS2/ICOS/ABCB4/PTPN13/IL7R/ENTPD1/ITGAL/TGFBR2/TNFSF13/TNFSF13B/TLR4/FN1/TNFSF10/TNFSF9/TGFB1/HLA-DRB1/FAS/ICAM1/HGF/CXCL10/CD27/CNR1/FCRL3/ENTPD2

## DOID:1037 NRP1/FLT3/CD44/ERG/CRY1/LAIR1/HSPB1/ADRB2/SLIT2/BLACE/FCMR/ZAP70/FOSL2/IL24/TNFRSF11A/FMOD/NOS2/WT1/CDX2/BGN/IL15/CD34/RAG1/NT5E/F13A1/CSPG4/IGFBP2/MDK/IRF4/IL1B/PTPN6/FLT4/ANPEP/SLC29A2/SDC1/CD33/RUNX1/ZFP36L1/MEF2C/TP73/ARID5B/TNFSF13/FCGR2A/TNFSF13B/LILRB1/CD9/EPHB4/JAK3/CDK6/LMO2/LEF1/CISH/RB1/BCL2/RAPGEF3/CCND3/MYB/DMD/ZMIZ1/CDKN1B/TNFRSF13C/POU2AF1/IGHD/TNFSF10/TGFB1/STAT1/HLA-DRB1/IRF8/BACH2/FAS/CD40/PDE4B/CD83/FCRL5/HGF/LAMA5/FCER2/HLF/ASNS/PRKCB/FCRL2/TNFRSF8/IKZF3/PEG10/BCL2A1/TP63/BIRC3/HES1/TTC12/EFNA5/RHD/FCRL3/MME/CDKN2B/TNFRSF10A/AICDA/MLLT3/CCL22/CDKN2A

## DOID:0060100 CCN2/PDPN/CD44/ERG/HMGA2/HSPB1/NOG/PROM1/NDRG1/SPRY1/ANGPT1/SLIT2/TIMP1/BMP2/ADAM28/CAV1/LPAR1/F2RL3/FBN2/S100A6/TNFRSF11A/NOS2/HPSE/WT1/NES/CDH11/SPP1/CD34/FSCN1/SALL2/PDGFA/RUNX2/F2R/MCAM/PDGFRB/IL1B/HBEGF/JUP/LGALS1/PLAU/ERBB2/PLAUR/TLE1/KLRK1/CD33/PRUNE2/S100A4/CCL5/PGF/CREB3L1/TP73/TNFSF13B/CD9/CD99/FBN1/DUSP1/FN1/RB1/CAPN2/BCL2/DMD/RASSF1/CDKN1B/IGF2BP2/ASS1/FLI1/TNFRSF10B/PRKCA/URGCP/DYRK1B/OBSCN/TGFB1/LRP1/GLI1/ITGB3/ID1/FAS/ICAM1/HGF/UTS2R/SMO/PRKCG/PAX2/CDH1/TP63/ACKR3/EGFR/IGFBP6/DCN/LUM/GFAP/DNASE1L3/EXT1/CDKN2B/WNT5A/UCHL1/MTAP/CDKN2A

## DOID:9119 NRP1/MEIS1/CD96/PLA2G4A/CAV1/SALL4/LYZ/WT1/CDX2/RAG1/IGFBP2/IL1B/PLAUR/SELL/ITGAL/RUNX1/MGMT/IL3RA/LMO2/RB1/BCL2/CDKN1B/LYL1/IRF8/FAS/CBFA2T3/SKI/RUNX3/CAMP/CDKN2B/TNFRSF10A/KCNH2/CDKN2A

## DOID:16 PDPN/LAMA3/HSPB1/HR/ADRB2/EDA2R/ADAM33/CCL28/FLG/RIN2/SLC16A2/HRH4/TIMP1/VIM/CAV1/CD93/IL24/SERPING1/GHR/CCL1/NLRP3/IL15/CD34/NPY/COL6A5/IL1RL1/IL9R/RETN/F2R/EPX/IRAK3/IL1B/XPNPEP2/IL7R/TBC1D4/PLAU/PLAUR/SELL/ANPEP/SLC29A2/COL7A1/NR3C2/GPX1/NTRK1/TNFRSF1B/CCL5/PGF/TNFSF13/TNFSF13B/TLR4/NOS3/BCL2/SLC29A3/NOP53/HLA-DQA1/TGFB1/HLA-DRB1/ITGB3/ITGA6/FAS/CD40/TNFSF8/ALOX5AP/ICAM1/CXCL10/LAMA5/CYP1A2/C3/ADM2/CCL3L3/CAMP/BCL2A1/TP63/CCL17/EGFR/FBLN5/LMNA/CX3CL1/ALOX5/APP/TNFRSF14/FCRL3/ST14/MME/TRPC1/CDKN2B/CCL22/CDKN2A

## Count

## DOID:0060049 26

## DOID:12236 26

## DOID:1037 99

## DOID:0060100 98

## DOID:9119 33

## DOID:16 88

1.7 GO数据可视化:

1.7.1 barplot、dotplot

用散点图展示富集到的GO terms

library(enrichplot)

library(clusterProfiler)

library(patchwork)

#(1)条带图

barplot(go_enrich,showCategory=10)

image.png

#(2)气泡图

dotplot(go_enrich)

image.png

1.7.1.1 GO有向无环图

调用topGO来实现GO有向无环图的绘制,矩形代表富集到的top10个GO terms, 颜色从黄色过滤到红色,对应p值从大到小。

plotGOgraph(go_enrich)

image.png

## $dag

## A graphNEL graph with directed edges

## Number of Nodes = 48

## Number of Edges = 83

##

## $complete.dag

## [1] "A graph with 48 nodes."

1.7.1.2 emapplot、goplot、cnetplot

#Gene-Concept Network

cnetplot(go_enrich, categorySize="pvalue", foldChange=names(gene_matrix),colorEdge = TRUE)

image.png

cnetplot(go_enrich, foldChange=names(gene_matrix), circular = TRUE, colorEdge = TRUE)

image.png

#Enrichment Map

emapplot(go_enrich)

image.png

#(4)展示通路关系

goplot(go_enrich)

image.png

#(5)Heatmap-like functional classification

heatplot(go_enrich,foldChange = gene_matrix)

image.png

#太多基因就会糊。可通过调整比例或者减少基因来控制。

1.7.1.3 04.upsetplot

upsetplot(go_enrich)

image.png

1.7.2 用所有有entrez的基因做GSEA分析

# Remove any genes that do not have any entrez identifiers

results_entrez

# Create a matrix of gene log2 fold changes

gene_matrix

# Add the entrezID's as names for each logFC entry

names(gene_matrix)

gene_matrix=sort(gene_matrix,decreasing = T)

ego_GSE_bp

geneList = gene_matrix,

OrgDb = org.Hs.eg.db,

ont = "MF",

nPerm = 1000,

minGSSize = 100,

maxGSSize = 500,

pvalueCutoff = 0.05,

verbose = FALSE)

1.7.3 ridgeplot

#gsea

ridgeplot(ego_GSE_bp)

image.png

1.7.4 gseaplot、gseaplot2及多个geneset验证结果

#gesa图

p15

title = ego_GSE_bp$Description[1])

p16

title = ego_GSE_bp$Description[1])

p17

#还有一种展示方式

p18=gseaplot2(ego_GSE_bp, geneSetID = 1, title = ego_GSE_bp$Description[1])

cowplot::plot_grid(p15, p16, p17,p18, ncol=2, labels=LETTERS[1:6])

image.png

#多条、加table(加p值)

gseaplot2(ego_GSE_bp, geneSetID = 1:3, pvalue_table = TRUE)

image.png

#上下3联(多个geneset验证结果)

pp

anno

lab

gsearank(ego_GSE_bp, i, ego_GSE_bp[i, 2]) + xlab(NULL) +ylab(NULL) +

annotate("text", 0, ego_GSE_bp[i, "enrichmentScore"] * .9, label = lab, hjust=0, vjust=0)

})

plot_grid(plotlist=pp, ncol=1)

image.png

1.7.5 KEGG数据GSEA分析

kk_gse

organism = 'hsa',

nPerm = 1000,

minGSSize = 120,

pvalueCutoff = 0.9,

verbose = FALSE)

#使用了所有基因进行KEGG的GSEA分析发现有上调16个通路得到验证,下调0个通路显著。

down_kegg

up_kegg 0,];up_kegg$group=1

#(4)可视化

source("kegg_plot_function.R")

g_kegg

g_kegg

image.png

1.8 查看kegg富集通路及差异基因可视化

1.8.1 Upsetplot for gene set enrichment analysis.

library(enrichplot)

library(clusterProfiler)

upsetplot(kegg_enrich)

image.png

1.8.2 可通过浏览器打开kegg网站查看具体信号通路的信息

head(kegg_enrich)[,1:6]

## ID Description GeneRatio

## hsa04015 hsa04015 Rap1 signaling pathway 56/1085

## hsa04060 hsa04060 Cytokine-cytokine receptor interaction 68/1085

## hsa04640 hsa04640 Hematopoietic cell lineage 30/1085

## hsa04390 hsa04390 Hippo signaling pathway 39/1085

## hsa05146 hsa05146 Amoebiasis 28/1085

## hsa04672 hsa04672 Intestinal immune network for IgA production 17/1085

## BgRatio pvalue p.adjust

## hsa04015 210/8076 1.830708e-07 0.0000583996

## hsa04060 295/8076 3.267347e-06 0.0005211418

## hsa04640 99/8076 8.949786e-06 0.0009516606

## hsa04390 157/8076 7.554828e-05 0.0060249750

## hsa05146 102/8076 1.251732e-04 0.0066839043

## hsa04672 49/8076 1.257161e-04 0.0066839043

#browseKEGG(kk.diff, 'hsa04015')

dotplot(kegg_enrich)

image.png

barplot(kegg_enrich,showCategory=10)

image.png

cnetplot(kegg_enrich, categorySize="pvalue", foldChange=gene_matrix,colorEdge = TRUE)

image.png

cnetplot(kegg_enrich, foldChange=gene_matrix, circular = TRUE, colorEdge = TRUE)

image.png

emapplot(kegg_enrich)

image.png

heatplot(kegg_enrich,foldChange = gene_matrix)

image.png

1.8.3 Going further with RNAseq analysis

You can the links below for a more in depth walk through of RNAseq analysis using R:

1.8.3.1 Citations:

Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc

Martin, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, [S.l.], v. 17, n. 1, p. pp. 10-12, may. 2011. ISSN 2226-6089. Available at: http://journal.embnet.org/index.php/embnetjournal/article/view/200. doi:http://dx.doi.org/10.14806/ej.17.1.200.

Kopylova E., Noé L. and Touzet H., “SortMeRNA: Fast and accurate filtering of ribosomal RNAs in metatranscriptomic data”, Bioinformatics (2012), doi: 10.1093/bioinformatics/bts611

Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15-21. doi:10.1093/bioinformatics/bts635.

Lassmann et al. (2010) “SAMStat: monitoring biases in next generation sequencing data.” Bioinformatics doi:10.1093/bioinformatics/btq614 [PMID: 21088025]

Liao Y, Smyth GK and Shi W (2014). featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30(7):923-30.

Love MI, Huber W and Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, pp. 550.

Yu G, Wang L, Han Y and He Q (2012). “clusterProfiler: an R package for comparing biological themes among gene clusters.” OMICS: A Journal of Integrative Biology, 16(5), pp. 284-287.

Philip Ewels, Måns Magnusson, Sverker Lundin and Max Käller. “MultiQC: Summarize analysis results for multiple tools and samples in a single report” Bioinformatics (2016). doi: 10.1093/bioinformatics/btw354. PMID: 27312411

c 语言头文件seqlsit,2020-10-24 RNAseq 从fq开始分析全流程相关推荐

  1. c语言字符型头文件,C语言头文件大全Word版

    <C语言头文件大全Word版>由会员分享,可在线阅读,更多相关<C语言头文件大全Word版(7页珍藏版)>请在人人文库网上搜索. 1.传播优秀Word版文档 ,希望对您有帮助, ...

  2. 51单片机C语言程序100例分析(1)IO+C语言+头文件

    51单片机C语言程序100例分析(1)IO+C语言+头文件 \\\插播一条:文章末尾有惊喜哟~///  P1=0xfe;//P1=11111110B,即P1.0输出低电平} 分析:通过这短短的几行代码 ...

  3. C语言头文件正确写法

    一般写法 例如这样有一个file.h头文件,一般写法如下 //file.h //条件编译 #ifndef _FILE_H_ //如果没有引入头文件file.h#define _FILE_H_ //那就 ...

  4. c语言头文件编写范例,编写自己的C语言头文件

    编写自己的C语言头文件 1. 头文件用于声明而不是用于定义 当设计头文件时,记住定义和声明的区别是很重要的.定义只可以出现一次,而声明则可以出现多次(2.3.5节).下列语句是一些定义,所以不应该放在 ...

  5. c语言头文件命名规则,C语言头文件规则.doc

    C语言头文件规则.doc 下载提示(请认真阅读)1.请仔细阅读文档,确保文档完整性,对于不预览.不比对内容而直接下载带来的问题本站不予受理. 2.下载的文档,不会出现我们的网址水印. 3.该文档所得收 ...

  6. C语言头文件路径剖析

    在一个软件项目中,如果需要在一个文件中包含另一个头文件时,一般有两种包含方式: #include <stdio.h>#include "module.h" 如果你引用的 ...

  7. c语言程序头文件作用,C语言头文件

    C语言头文件教程 C 语言的头文件一般都是 .h 做为结尾的. C语言头文件详解 语法 #include 参数 参数 描述 filename 我们需要引入的头文件的名称. 说明 C 语言的头文件一般都 ...

  8. C语言头文件深入理解

    C语言程序中,源文件通常分为两种:一种用于保存程序的声明(declaration),称为头文件:另一种用于保存程序的实现(implementation),称为定义(definition)文件. C程序 ...

  9. C语言头文件避免重复包含

    C语言头文件避免重复包含 假定有以下几个头文件及其包含关系为: File1.h,file2.h,file3.h,file4.h,file5.h,main.cpp 那么:file3.h包含file1.h ...

最新文章

  1. ERR_CONTENT_LENGTH_MISMATCH
  2. Day9-Postfix
  3. 数据结构之C语言模拟整数数组实现
  4. 框架开发之Java注解的妙用
  5. ES5、ES6、ES7、ES8
  6. org.hibernate.LazyInitializationException: failed to lazily initialize a collection of role:no sessi
  7. 检测Product 被其他business transaction 引用的小程序
  8. Selleck --- 01Cookie
  9. 插入数据前查询是否存在_异步检测数据是否存在的修订
  10. github上传本地项目代码
  11. 华为人才选拔的管理实践
  12. WDF队列分析(3)
  13. Windows2000下IE5升级到IE6
  14. Paul Graham:为什么在经济危机中创业?
  15. 参考文献的类型--参考文献里的J、M等字母都代表什么
  16. BMS(电池管理系统)第五课 ——核心!!!SOH算法开发
  17. 基于Arduino IDE开发的ESP8266(ESP-12F)项目4 ——中断及高级输入输出
  18. 大意是没有经历过贫穷的人,很难成为优秀的人才。
  19. 5000词学英语——DAY9
  20. html 5 调用手机条码扫描,vue h5页面如何实现扫一扫功能,扫条形码获取编码

热门文章

  1. 组图:1936年柏林奥运会
  2. 常用传感器讲解十四--障碍探测器(KY-032)
  3. 单片机播放WAV格式音频的理解
  4. 【嵌入式开发基础】gn ninja命令安装
  5. 这套系统,可能真的是数据分析师们未来5年的机遇!
  6. Windows驱动_文件系统微小过滤驱动之一初识MiniFilter
  7. The Indian Job
  8. 什么叫做石英表_石英表和机械表区别是什么?
  9. 2021年中山大学计算机专业学硕复试线,2021中山大学研究生分数线一览表(含2019-2020历年复试)...
  10. Python文档算法整理