16s扩增子 qiime2 实战

qiime2-2020.11

0 环境配置
- minconda安装
- qiime2安装
- 安装fastqc multiqc
1 下载文件转换
2 质检
3 Import 导入文件
- 双端
- 单端
4 过滤降噪
- ① deblur
- ② dada2
5 物种注释
6 过滤和重采样
- Table Filtering
- Sequence Filtering
7 构建系统发育树
- 多样性计算
- - α多样性
  - LEFse组间分析

0 环境配置

minconda安装

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.shbash Miniconda3-latest-Linux-x86_64.sh  #安装时注意安装到自己所需要目录source ~/.bashrc  #激活环境

qiime2安装

miniconda抛出HTTP错误可能是防火墙问题执行以下命令

conda config --set ssl_verify false

wget -c http://210.75.224.110/github/QIIME2ChineseManual/2020.11/qiime2-2020.11-py36-linux-conda.yml
# 创建虚拟环境并安装qiime2，防止影响其它己安装软件
# 我用时13m，供参考，主要由网速决定
time conda env create -n qiime2-2020.11 --file qiime2-2020.11-py36-linux-conda.yml
# 删除软件列表
rm qiime2-2020.11-py36-linux-conda.yml

此步耗时较久，这里有可能因为python版本问题报错，注意到这个软件列表是py36，也就是对应版本是python3.6

python --version
conda install python=3.6  #查看python版本 如果没有3.6 可手动安装

启用qiime环境

sudo su #如果是在本地
conda info --envs
conda activate qiime2-2020.11

关闭环境

conda deactivate

安装fastqc multiqc

# fastqc
conda install -c bioconda fastqc -y# multiqc
pip install multiqc

参考博客：
扩增子分析流程详细版
16s数据分析流程
实战qiime2020.11
qiime2流程
qiime2扩增子详细流程

1 下载文件转换

fastq-dump
fasterq-dump
fastq和fasterq异同比较

sra或者.1文件转fastq

单个文件

fastq-dump --split-3 --gzip SRR6724357.1
fasterq-dump --split-3  SRR6724357.1

fasterq没有 --gzip命令需要提取完成后自行压缩，但可以设定多线程，跑得更快
fastq比较慢，但有–gzip指令，可以边提取边压缩，不占内存

多个文件
bash文件操作相关

vi test.sh #新建批处理脚本文件

#!/bin/bash  #必须有 告诉系统用哪个解释器来执行
for i in *.1
do
echo $i   #显示运行到哪一个了
fastq-dump --split-3 --gzip $i -O ../raw_data
done

bash test.sh #运行脚本文件

2 质检

fastqc的结果包括reads各位置的碱基质量值分布、碱基的总体质量值分布、reads各个位置上碱基分布比例、GC含量分布、reads各个位置的N碱基数目、是否含有测序接头序列等。

单独质检后将质检报告合并

#设置运行核数
NCORES=1
mkdir fastqc_out  #创建一个文件夹用于存放质检文件
mkdir raw_data
mv *.fastq.gz raw_data
fastqc -t $NCORES raw_data/*.fastq.gz -o fastqc_out
#合并
cd fastqc_out
multiqc . #合并报告
cd ..

完整的报告为 multiqc_report.html ，直接用浏览器即可查看。生成可视化报告最重要的原因是要确保样品是否高质量（大多数 reads 中每碱基的质量是否大于 30），是否存在异常的样本。
合并后质检

# cat命令合并压缩过的文件会出错，合并之前需要先解压。
gunzip *.gz # 解压
# 第一种方法：合并后质检。
cat *R1* > R1.fastq #合并上游序列,并指定输出文件名为R1.fastq
cat *R2* > R2.fastq #合并下游序列,并指定输出文件名为R2.fastq
# 质检
mkdir qc #创建一个文件夹用于存放质检文件
fastqc -t 2 R1.fastq R2.fastq -o qc # -t --threads，一般有多少个样本用多少线程。-o 指定输出文件存放目录。

p.s.:也可在dada2步骤时设置合适的参数去除引物(论坛里建议在使用dada2处理数据之前先去掉引物。参考：https://forum.qiime2.org/t/lost-of-data-with-dada2/1449/5）

3 Import 导入文件

通过脚本manifest生成参考
另外也可以用python从metadata.xlsx里生成（我更习惯用这个再粘贴至 vi manifest打开的文件里）

#双端
import pandas as pddf=pd.read_excel('metadata_NanJ.xlsx')
path='$PWD'
print('sample-id'+'\t'+'forward-absolute-filepath'+'\t'+'reverse-absolute-filepath')
for i in range(231):print(df['Sample-ID'][i] +'\t'+path+'/raw_data/' + df['Run-ID'][i] + '_1.fastq.gz'+'\t'+ path +'/raw_data/'+df['Run-ID'][i] + '_2.fastq.gz')

#单端
import pandas as pddf=pd.read_excel('metadata_NanJ.xlsx')
path='$PWD'
print('sample-id,absolute-filepath,direction')
for i in range(231):print(df['Sample-ID'][i] +','+path+'/raw_data/' + df['Run-ID'][i] + ''.fastq.gz'+',forward')

双端

manifest格式：注意使用制表符

METADATA="$PWD/metadata.txt"
qiime tools import \--type 'SampleData[PairedEndSequencesWithQuality]' \--input-path pe-33-manifest \--output-path reads_qza/paired-demux.qza \--input-format PairedEndFastqManifestPhred33V2

#可视化文件
qiime demux summarize \--i-data reads_qza/paired-demux.qza \--o-visualization reads_qzv/paired-demux.qzv

单端

manifest格式：

qiime tools import \
--type 'SampleData[SequencesWithQuality]' \
--input-path manifest_se \
--output-path reads_qza/demux_se.qza  \
--input-format SingleEndFastqManifestPhred33

4 过滤降噪

用 cutadapt 插件去除引物序列
这个讲解的很清楚

v3v4区选用的341F 和 806R

nohup qiime cutadapt trim-paired \--i-demultiplexed-sequences reads_qza/paired-demux.qza \--p-cores 20 \--p-front-f CCTACGGGNGGCWGCAG \--p-front-r GACTACHVGGGTATCTAATCC \--p-discard-untrimmed \--p-no-indels \--o-trimmed-sequences reads_qza/reads_trimmed.qza &qiime demux summarize \--i-data reads_qza/reads_trimmed.qza \--o-visualization reads_qzv/reads_trimmed_summary.qzv

单V4区选用的515F 和 806R

nohup qiime cutadapt trim-paired \--i-demultiplexed-sequences reads_qza/paired-demux.qza \--p-cores 20 \--p-front-f GTGYCAGCMGCCGCGGTAA \--p-front-r GGACTACNVGGGTWTCTAAT \--p-discard-untrimmed \--p-no-indels \--o-trimmed-sequences reads_qza/reads_trimmed.qza &qiime demux summarize \--i-data reads_qza/reads_trimmed.qza \--o-visualization reads_qzv/reads_trimmed_summary.qzv

单端

nohup qiime cutadapt trim-single \--i-demultiplexed-sequences reads_qza/demux_se.qza \--p-front GTGCCAGCMGCCGCGGTAA \--o-trimmed-sequences reads_qza/reads_trimmed.qza &qiime demux summarize \--i-data reads_qza/reads_trimmed.qza \--o-visualization reads_qzv/reads_trimmed_summary.qzv

降噪可选deblur和dada2其中一种，deblur只能对合并后序列进行操作，dada2可以对双端序列进行操作

① deblur

用 deblur 将 reads 降噪为 ASV （amplicon sequence variants）

1.vsearch合并双端序列

nohup qiime vsearch join-pairs \--i-demultiplexed-seqs reads_qza/reads_trimmed.qza \--o-joined-sequences reads_qza/reads_trimmed_joined.qza &qiime demux summarize \--i-data reads_qza/reads_trimmed_joined.qza \--o-visualization reads_qzv/reads_trimmed_joined_summary.qzv

2.过滤低质量的 reads

nohup qiime quality-filter q-score\--i-demux reads_qza/reads_trimmed_joined.qza \--o-filter-stats filt_stats.qza \--o-filtered-sequences reads_qza/reads_trimmed_joined_filt.qza &

qiime demux summarize \--i-data reads_qza/reads_trimmed_joined_filt.qza \--o-visualization reads_qzv/reads_joined_filt_summary.qzv

3.运行 deblur 流程得到 ASVs


nohup qiime deblur denoise-16S \--i-demultiplexed-seqs reads_qza/reads_trimmed_joined_filt.qza \--p-trim-length 240 \--p-sample-stats \--p-jobs-to-start 20 \--output-dir deblur_output &

qiime feature-table summarize \--i-table deblur_output/table.qza \--o-visualization deblur_output/deblur_table_summary.qzv

② dada2

nohup qiime dada2 denoise-paired \--i-demultiplexed-seqs reads_qza/paired-demux.qza \--p-trim-left-f 0 \--p-trim-left-r 0 \--p-trunc-len-f 245 \--p-trunc-len-r 248 \--o-table reads_qza/table.qza \--o-representative-sequences reads_qza/rep-seqs.qza \--o-denoising-stats reads_qza/denoising-stats.qza \--p-n-threads 20 &

nohup qiime metadata tabulate \--m-input-file reads_qza/denoising-stats.qza \--o-visualization reads_qzv/denoising-stats.qzv &

特征表

METADATA="$PWD/metadata.txt"
nohup qiime feature-table summarize \--i-table deblur_output/table.qza \--o-visualization deblur_output/table.qzv \--m-sample-metadata-file $METADATA &

代表序列

nohup qiime feature-table tabulate-seqs \--i-data reads_qza/rep-seqs.qza \--o-visualization reads_qzv/rep-seqs.qzv &

5 物种注释

耗时较久一定要nohup

nohup qiime feature-classifier classify-sklearn \--i-reads deblur_output/representative_sequences.qza \--i-classifier ../classifer/silva-138-99-515-806-nb-classifier.qza \--p-n-jobs 20 \--output-dir taxa &qiime tools export \--input-path taxa/classification.qza \--output-path taxa

6 过滤和重采样

Table Filtering

过滤污染物及未分类的ASVs

qiime taxa filter-table \--i-table deblur_output/table.qza \--i-taxonomy taxa/classification.qza \--p-include d__ \--p-exclude mitochondria,chloroplast \--o-filtered-table taxa/table_filt.qza

移除频次小于10的特征

qiime feature-table filter-features \--i-table taxa/table_filt.qza \--p-min-frequency 10\--o-filtered-table taxa/table-frequency10.qza

重采样

Fr=7030
qiime feature-table rarefy \
--i-table taxa/table-frequency10.qza \
--p-sampling-depth $Fr \
--o-rarefied-table reads_qza/table-filtered-depth$Fr.qzaqiime feature-table summarize\--i-table reads_qza/table-filtered-depth$Fr.qza\--o-visualization reads_qzv/table-filtered-depth$Fr.qzv

抽取结果

qiime tools export \--input-path reads_qza/table-filtered-depth$Fr.qza \--output-path reads_qza/table-filtered-depth$Frbiom convert \
-i reads_qza/table-filtered-depth$Fr/feature-table.biom \
-o table-filtered-depth$Fr.txt  --to-tsv

分类信息绘制叠柱状图

 METADATA="$PWD/metadata.txt"#重采样前qiime taxa barplot \--i-table deblur_output/table.qza \--i-taxonomy taxa/classification.qza \--m-metadata-file $METADATA \--o-visualization reads_qzv/taxa-bar-plots.qzv
#重采样后qiime taxa barplot \--i-table reads_qza/table-filtered-depth$Fr.qza \--i-taxonomy taxa/classification.qza \--m-metadata-file $METADATA \--o-visualization reads_qzv/taxa-bar-plots-filtered$Fr.qzv

合并组内

qiime feature-table group \--i-table reads_qza/table.qza \--p-axis sample \--p-mode sum \--m-metadata-file medadata_AuLC.txt \--m-metadata-column subject \--o-grouped-table reads_qza/table_subject.qzaqiime taxa barplot \--i-table reads_qza/table_subject.qza \--i-taxonomy reads_qza/taxonomy.qza \--m-metadata-file medadata_AuLC_group.txt \--o-visualization reads_qzv/taxa-bar-plots-subject.qzv

Sequence Filtering

过滤污染物及未分类

qiime taxa filter-seqs \
--i-sequences deblur_output/representative_sequences.qza \
--i-taxonomy taxa/classification.qza \
--p-include d__ \
--p-exclude mitochondria,chloroplast \
--o-filtered-sequences deblur_output/rep-seqs-filtered.qza

重抽样

qiime feature-table filter-seqs \
--i-data deblur_output/rep-seqs-filtered.qza \
--i-table reads_qza/table-filtered-depth$Fr.qza \
--o-filtered-data reads_qza/filtered-rep-seqs-depth$Fr.qza

7 构建系统发育树

qiime phylogeny align-to-tree-mafft-fasttree   \ #使用mafft比对的fasttree构建进化树--i-sequences reads_qza/rep-seqs.qza \--o-alignment reads_qza/aligned-rep-seqs.qza \ #输出的对齐序列--o-masked-alignment reads_qza/masked-aligned-rep-seqs.qza \#屏蔽掉高可变的复杂序列--o-tree reads_qza/unrooted-tree.qza \#输出的无根树--o-rooted-tree reads_qza/rooted-tree.qza \ #输出的有根树--p-n-threads 20

纯净版

mkdir tree
nohup qiime phylogeny align-to-tree-mafft-fasttree   \--i-sequences reads_qza/filtered-rep-seqs-depth$Fr.qza \--o-alignment tree/aligned-rep-seqs.qza \--o-masked-alignment tree/masked-aligned-rep-seqs.qza \--o-tree tree/unrooted-tree.qza \--o-rooted-tree tree/rooted-tree.qza &

稀疏曲线

SamDep=15005
METADATA="$PWD/AusLC_metadata.txt"
qiime diversity alpha-rarefaction \--i-table reads_qza/table-filtered-depth$SamDep.qza \--i-phylogeny tree2/rooted-tree.qza \--p-max-depth $SamDep \--m-metadata-file $METADATA \--o-visualization reads_qzv/alpha-rarefaction-filtered-depth$SamDep.qzv

多样性计算

nohup qiime diversity core-metrics-phylogenetic \--i-phylogeny tree/rooted-tree.qza \--i-table reads_qza/table-filtered-depth$Fr.qza \--p-sampling-depth $Fr \--m-metadata-file $METADATA \--output-dir tree/core-metrics-results &

输出结果包括多种多样性结果，文件列表和解释如下：
#beta多样性bray_curtis距离矩bray_curtis_distance_matrix.qza
#alpha多样性evenness(均匀度，考虑物种和丰度)指数evenness_vector.qza
#alpha多样性faith_pd(考虑物种间进化关系)指数faith_pd_vector.qza
#beta多样性jaccard距离矩阵 jaccard_distance_matrix.qza
#alpha多样性observed_otus(OTU数量)指数 observed_otus_vector.qza
#alpha多样性香农熵(考虑物种和丰度)指数 shannon_vector.qza
#beta多样性unweighted_unifrac距离矩阵，不考虑丰度 unweighted_unifrac_distance_matrix.qza
#beta多样性unweighted_unifrac距离矩阵，考虑丰度 weighted_unifrac_distance_matrix.qza

METADATA="$PWD/medadata.txt"
# 统计faith_pd算法Alpha多样性组间差异是否显著，输入多样性值、实验设计，输出统计结果
qiime diversity alpha-group-significance \--i-alpha-diversity tree/core-metrics-results/faith_pd_vector.qza \--m-metadata-file $METADATA\--o-visualization tree/faith-pd-group-significance.qzv# 统计evenness组间差异是否显著
qiime diversity alpha-group-significance \--i-alpha-diversity tree/core-metrics-results/evenness_vector.qza \--m-metadata-file $METADATA\--o-visualization tree/evenness-group-significance.qzvqiime diversity alpha-group-significance \
--i-alpha-diversity tree/core-metrics-results/shannon_vector.qza \
--m-metadata-file  $METADATA \
--o-visualization tree/shannon-group-significance.qzv qiime diversity alpha-group-significance \
--i-alpha-diversity tree/core-metrics-results/observed_features_vector.qza \
--m-metadata-file  $METADATA \
--o-visualization tree/observed_features-group-significance.qzv

扩增子实战
分析
得到菌群丰度表后用R进行计算分析

α多样性

R画箱线图
（来自师弟的代码，需要把数据文件存为tab符分割的文本文件，与R文件放在同一文件夹下）

setwd("C:/Users/voyag/Downloads/microbiom/DATASET/Xiamen/R作图")
#设置自己的路径
library(data.table)
library(dplyr)
library(ggpubr) #在图上直接加pvalue
library(rlist)inputFiles <- list.files(pattern = "txt$",full.names = T)#目录路径将被添加到文件名前面，以给出一个相对文件路径
for(f in inputFiles){#f=inputFiles[1]dat <- fread(f,select = c(1:6))  #相当于read.table，but faster and more convenient,select选择保留的列tmp <- t(combn(unique(dat$Group), 2)) #找出group之间的两两combination,combn()为每次取2个的不同情况，这样2个组和3个组的不用单独写不同的脚本my_comparisons <- list()for(i in c(1:nrow(tmp)) ){my_comparisons <- list.append(my_comparisons, tmp[i,])}yVars <- c("observed_features", "faith_pd", "pielou_evenness","shannon_entropy")#对不同的y画图for(yVar in yVars){i=which(yVars == yVar)dat_tmp <- datcolnames(dat_tmp)[which(colnames(dat_tmp) == yVar)] <- "yVar"   #用”yVar代替下面要求的y坐标值“p<-ggboxplot(dat_tmp, x = "Group", y = "yVar",color = "Group", palette = "aaas",width = 0.6, outlier.shape = NA, add="jitter")+   # jitter添加抖动点stat_compare_means(comparisons = my_comparisons,method = "t.test",aes(label = ..p.signif..)) +  #添加p值, method 可选择 “t.test” 或 “wilcox.test”(默认) theme(legend.position="none") + ylab(yVar) +xlab("")assign(paste("P_",i,sep = ""),p, envir = .GlobalEnv)}P <- ggarrange(P_1,P_2,P_3,P_4,nrow = 1,ncol = 4)  #这里只适合有三个图要排列的情况，其他情况要手动改。ggsave(P, filename = paste(f,".jpeg",sep = ""),width = 14.0, height = 4.0)
}

LEFse组间分析

LEfSe在线工具地址：

①：https://huttenhower.sph.harvard.edu/galaxy/
哈佛hutten教授团队的，最近国内很难登

②：http://www.ehbio.com/Cloud_Platform/front/#/analysis
中科院刘永鑫老师团队的，数据格式要求更简单，gg数据库注释后的只需要把；替换成 | 即可