查询宇宙生命的家谱--TaxonKit工具详解

文章目录

遇到的问题
taxonkit 概述
taxonkit安装
- 安装
- 下载依赖数据
taonkit使用
- 1）列出给定taxonomy id的子分类树
- 2）从taxid获取完整谱系
- 3）重新构造谱系的格式
- 4）通过物种拉丁名查询taxid：name2taxid
回到问题
一个也不能少
参考

作者：余涛
email：yutao@big.ac.cn
中国科学院大学

遇到的问题

在做宏基因组分析时，通过基因注释得到一个包含10k之多种微生物物种名list(scientific name)，现在想统计这些物种在界、门、纲、目、科、属等不同分类水平的总的数量。这就是本篇推送想解决的问题，10000多种微生物的拉丁名称示例如下：

 [NeptuneYT$] head scientific_name.txt

Abiotrophia defectiva
Abiotrophia sp.
Absiella dolichum
Acaryochloris marina
Acetanaerobacterium sp.
Acetivibrio cellulolyticus
Acetoanaerobium noterae
Acetoanaerobium sticklandii
Acetobacter aceti
Acetobacter ghanensis

[NeptuneYT$] wc -l all_bacteria_genomic_fna.species

10146 all_bacteria_genomic_fna.species

打开NCBI Taxonomy输入一个拉丁名，如Acetobacter aceti，搜索之后默认获得完整的lineage信息，但我们这里只需要7个层次的，因此再点击一次Lineage获得缩略的谱系信息，如下：

得到的Lineage字段后以分号隔开的就是对应于7个分类层次的结果，后续以分号切割之后统计不同列的结果即可。
很自然的，我们想到爬虫，其搜索接口为https://www.ncbi.nlm.nih.gov/taxonomy/?term=拉丁名（空格以+号连接），如https://www.ncbi.nlm.nih.gov/taxonomy/?term=Acetobacter+aceti，然后对结果页面进行后续解析。但是10k之多的查询量，必然要设置爬取频率，否则就要被NCBI关小黑屋了，考虑时间代价，果断放弃。其实，从网上查询的原理也是基于Taxonomy后台的数据库，而这个文件在ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz，可以从解压之后的names.dmp和nodes.dmp文件写代码解析，但是其内容过于妖孽，为了少撸掉点头发，因此先看看网上是否有造好的轮子。

果然，动动手指发现有三个工具可以实现以上诉求，ETE toolkit, taxadb和 TaxonKit，这里选择最近发表的TaxonKit，优势在于其直接基于names.dmp和nodes.dmp文件的解析，本地搜索速度块，尤其是大批量的查找和格式转换，另外使用也极简单。

TaxonKit paper
相比于另外两种工具，TaxonKit在处理大批量数据时更快，占用内存也可接受

taxonkit 概述

说完废话，进入今天的主题，说说TaxonKit这个工具的使用。
TaxonKit是处理NCBI Taxonomy数据库中结构性数据的良心工具，19年1月在bioRxiv上online，作者Wei Shen, Jie Xiong，隶属于Department of ClinicalLaboratory, General Hospital of Western Theater Command，特地查了一下，原来是位于成都的中国人民解放军西部战区总医院（好牛的感觉），看来生信真是无处不在。

它是Go语言编写的，可以在Windows，Linux和Mac OS X运行，直接使用NCBI Taxonomy的数据（需手动下载）而无需构建本地数据库。

taxonkit安装

安装

选择对应系统的版本安装，推荐conda安装。详见https://github.com/shenwei356/taxonkit
conda安装：

conda install -c bioconda taxonkit

下载依赖数据

下载NCBI taxonomy数据库的taxdump.tar.gz文件，解压后将names.dmp和 nodes.dmp拷贝到家目录下的.taxonkit目录下。

wget -c ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz
mkdir -p $HOME/.taxonkit
cp names.dmp nodes.dmp $HOME/.taxonkit

确认文件完整：

[NeptuneYT]$ ll -h  ~/.taxonkit/

total 155494
-rw-r–r-- 1 xx UsersGrp 169.3M Mar 18 16:07 names.dmp
-rw-r–r-- 1 xx UsersGrp 134.4M Mar 18 16:07 nodes.dmp

配置完成，开始使用。

taonkit使用

[NeptuneYT$] taxonkit --help

Usage:
taxonkit [command]
Available Commands:
genautocomplete generate shell autocompletion script
help Help about any command
lineage query lineage of given taxids
list list taxon tree of given taxids
name2taxid query taxid by taxon scientific name
reformat reformat lineage
version print version information and check for update
…
Use “taxonkit [command] --help” for more information about a command.

taxonkit按照功能分成不同的子命令，其中最主要的功能包括4块：
1）列出给定taxonomy id(taxid)的子分类树：list
2）从taxid获取完整谱系：lineage
3）重新构造谱系的格式：reformat
4）通过物种拉丁名查询taxid：name2taxid

1）列出给定taxonomy id的子分类树

 [NeptuneYT$] taxonkit list --help   #查看list子命令使用方法

list taxon tree of given taxids
Usage:
taxonkit list [flags]
Flags:
-h, --help help for list
–ids string taxid(s), multiple values should be separated by comma
–indent string indent (default " “)
–json output in JSON format. you can save the result in file with suffix “.json” and open with modern text editor
–show-name output scientific name
–show-rank output rank
Global Flags:
–data-dir string directory containing nodes.dmp and names.dmp (default “/home/xx/.taxonkit”)
–line-buffered use line buffering on output, i.e., immediately writing to stdin/file for every line of output
-o, --out-file string out file (”-" for stdout, suffix .gz for gzipped out) (default “-”)
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
–verbose print verbose information

实例：
给定taxid：9606和10090

[NeptuneYT$] taxonkit list --ids 9606,10090 --show-name  --show-rank -j 2

#–ids 给定的taxid，多个以英文逗号分割
#–show-name 输出科学命名
#–show-rank 输出分类等级
#-j 线程数，默认是2

9606 [species] Homo sapiens
63221 [subspecies] Homo sapiens neanderthalensis
741158 [subspecies] Homo sapiens subsp. ‘Denisova’

10090 [species] Mus musculus
10091 [subspecies] Mus musculus castaneus
10092 [subspecies] Mus musculus domesticus
35531 [subspecies] Mus musculus bactrianus
39442 [subspecies] Mus musculus musculus
46456 [subspecies] Mus musculus wagneri
57486 [subspecies] Mus musculus molossinus
80274 [subspecies] Mus musculus gentilulus
116058 [subspecies] Mus musculus brevirostris
179238 [subspecies] Mus musculus homourus
477815 [subspecies] Mus musculus musculus x M. m. domesticus
477816 [subspecies] Mus musculus musculus x M. m. castaneus
947985 [subspecies] Mus musculus albula
1266728 [subspecies] Mus musculus domesticus x M. m. molossinus
1385377 [subspecies] Mus musculus gansuensis
1643390 [subspecies] Mus musculus helgolandicus
1879032 [subspecies] Mus musculus isatissus

2）从taxid获取完整谱系

[NeptuneYT$] echo 9606|taxonkit lineage  -d "-" -t -r

#-d 输出谱系树分割符，默认分号
#-t 显示包含taxid的谱系树
#-r 显示给定taxid的分类等级

9606 cellular organisms-Eukaryota-Opisthokonta-Metazoa-Eumetazoa-Bilateria-Deuterostomia-Chordata-Craniata-Vertebrata-Gnathostomata-Teleostomi-Euteleostomi-Sarcopterygii-Dipnotetrapodomorpha-Tetrapoda-Amniota-Mammalia-Theria-Eutheria-Boreoeutheria-Euarchontoglires-Primates-Haplorrhini-Simiiformes-Catarrhini-Hominoidea-Hominidae-Homininae-Homo-Homo sapiens 131567-2759-33154-33208-6072-33213-33511-7711-89593-7742-7776-117570-117571-8287-1338369-32523-32524-40674-32525-9347-1437010-314146-9443-376913-314293-9526-314295-9604-207598-9605-9606 species

3）重新构造谱系的格式

上一步通过taxid提取的谱系信息复杂，往往需要根据我们的需求重新格式化

[NeptuneYT$] echo 9606|taxonkit lineage |taxonkit reformat

9606 cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens Eukaryota;Chordata;Mammalia;Primates;Hominidae;Homo;Homo sapiens

输出结果的第三列就是重新格式化的结果，默认是(“{k};{p};{c};{o};{f};{g};{s}”)7个水平。
查询给定taxid9606的谱系，并按照门：科；属的格式输出

[NeptuneYT$] echo 9606|taxonkit lineage |taxonkit reformat -f "{p}:{f};{s}" |cut -f3

#{}内是分类等级，大括号之间是输出的连接符

Chordata:Hominidae;Homo sapiens

4）通过物种拉丁名查询taxid：name2taxid

按人的拉丁名查询taxid

[NeptuneYT$] echo "Homo sapiens" |taxonkit name2taxid

Homo sapiens 9606
批量查询

[NeptuneYT$] head scientific_name.txt |taxonkit name2taxid --show-rank

Abiotrophia defectiva 46125 species
Abiotrophia sp. 76631 species
Absiella dolichum 31971 species
Acaryochloris marina 155978 species
Acetanaerobacterium sp.
Acetivibrio cellulolyticus 35830 species
Acetoanaerobium noterae 745369 species
Acetoanaerobium sticklandii 1511 species
Acetobacter aceti 435 species
Acetobacter ghanensis 431306 species

回到问题

基于Taxonkit上述用法，回到之前的问题就好解决了
1.将scientific name先转化成taxid，便于查找lineage

time taxonkit name2taxid  scientific_name.txt >scientific_name_taxid.txt &
awk -F"\t" '$2!=""{print $2}'  scientific_name.txt >find_taxid.txt  #去掉未查到的，即空值
awk -F"\t" '$2==""'  scientific_name_taxid.txt >NotFindName.txt #输出未查到物种名
awk -F"\t" 'BEGIN{OFS="\t";print "findTaxid\tNull\tTotal"}{$2!=""?taxid++:null++}END{print taxid,null,taxid+null}' scientific_name_taxid.txt |column -t #统计查找到的和未查到的数量

real 0m6.279s
findTaxid Null Total
9606 541 10147

可以看到，查询速度相当之快。
由于待批量查询的物种名不是规范的拉丁名称，导致出现两个问题，一是输入的list是10146个,转换id后（找到和未找到）的行数却增加了1个，统计之后发现是同一个物种名有两个taxid；二是没找到的高达541个！！！为了说明完整的处理过程，541个后面再说。

[NeptuneYT$]$ awk -F"\t" '{print $1}' scientific_name_taxid.txt |sort |uniq -d

Deinococcus soli

手工查询发现是两个种，但是根据部分拉丁名Deinococcus soli可以查到俩taxid我也是醉了。

2.查找lineage

 [NeptuneYT$] time taxonkit lineage find_taxid.txt >lineage.txt&

real 0m7.860s

3.重新格式化lineage

[NeptuneYT$] time taxonkit reformat lineage.txt|cut -f3 >newformat.txt&

real 0m11.465s

结果：

现在就按照界门纲目科属种的层次变成整齐划一的格式了，通过简单处理即可统计不同层次的物种数量和分布，首先构建一个用于循环处理的分类标签

[NeptuneYT$] cat tag.txt

1 Kingdom
2 Phylum
3 Class
4 Order
5 Family
6 Genus
7 Species

详细的物种分类层次数量统计：

[NeptuneYT$] cat tag.txt|while read num lev;do  awk -v v=$num -F";" '{print $v}' newformat.txt|sort |uniq -c|sort -nr |awk -v v=$lev '{print v"\t"$0}' |awk '$3!=""';done >detail_taxonomy_range.txt
[NeptuneYT$] head detail_taxonomy_range.txt

Kingdom 6722 Bacteria
Kingdom 4 Eukaryota
Phylum 2682 Proteobacteria
Phylum 1390 Firmicutes
Phylum 1099 Actinobacteria
Phylum 729 Bacteroidetes

按7个类别统计数量：

[NeptuneYT$] cat tag.txt|while read index level;do num=$(awk -v v=$index -F";" '{print $v}' newformat.txt|sort |uniq |wc -l);printf  "${level}\t${num}\n";done |tee taxonomy_stat.txt

Kingdom 2
Phylum 118
Class 82
Order 181
Family 401
Genus 1824
Species 6519

简单画个图瞅瞅：

data<-read.table('taxonomy_stat.txt')
library('ggplot2')
ggplot(data,aes(reorder(V1,-V2),V2,fill=V1))+geom_bar(stat='identity')+xlab('Taxonomy level')+ylab('Numbers')+geom_text(aes(label=V2),position=position_dodge(.9),vjust=-1.5)+theme(text=element_text(size=16,family='Times New Roman',face='bold'))+labs(fill='level')

一个也不能少

首先看看没有找到的541个输入物种名长啥样：

Bacillus sp.
[Bacillus thuringiensis]
bacteria symbiont
Bacteriovorax sp.
bacterium 42 11
bacterium BRH c32
bacterium Candidatus
bacterium CG06 land 8 20 14 3 00 33 50
bacterium CG09 39 24
bacterium CG1 02 42 9
…

这个物种list是师妹们给的，实在太好（yao）看(nie)可了，拷问了一遍奈何就只有这样的查询list，那就只能手工挑了几个去网页上查了，发现没有找到taxid的原因包括：
1）出现多个搜索结果：原物种名Acaryochloris sp.，搜到的拉丁名：Acaryochloris sp. A4cAcaryochloris marinaAsterochloris sp. DA2Asterochloris sp. 101Asterochloris sp. 103
2）截短名称：原物种名Acetothermia bacterium，搜到结果：Candidatus Acetothermia bacterium
3）多加中括号：[Bacillus thuringiensis]
4）没有下划线：bacterium 42 11，搜到结果：bacterium 42_11
针对多加中括号和没有下划线的问题，可以先处理一下，其他的就只能手工查了。

sed -e "s/\[//g;s/\]//g"   NotFindName.txt >bracket_minus_name.txt #minus all bracket
awk '{filed1=$1;$1="";sub(/ /,"");gsub(/ /,"_");print filed1,$0}'  NotFindName.txt  >underline_plus_name.txt #plus all filed of split underline except first one

将后面查到的taxid加到之前的taxid list再按之前的流程跑一遍即可，实在查不到的那就手工吧。（此时配乐起~“这是自由的感觉，鼠标咔哒咔哒点击这些可爱的物种名称，凭着一颗永不哭泣勇敢的心”）

参考

taxonkit github
TaxonKit: a cross-platform and efficient NCBI taxonomy toolkit