NCBI BLAST是最常用的短序列局部比对软件之一,学习一下软件参数,并这记录一下我的软件操作的命令行。

NR 库分割参考Gu Kai老师的笔记:http://www.bioinfo-scrounger.com/archives/673

1. 软件版本

我部署在服务器上的是version 2.8.1 ,因为其支持物种注释以及分割的NR子库的索引作为比对的库,使用比较方便,但是需要对应使用新版的NR数据库 (BLASTDBv5,ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/ );另外官方也给了常用子NR库:

  1. env_nr_v5
  2. swissprot_v5
  3. taxdb
  4. tsa_nr_v5

2. 使用makeblastdb格式化数据库

makeblastdb的参数有很多,具体参数如下:

$ makeblastdb -help
USAGEmakeblastdb [-h] [-help] [-in input_file] [-input_type type]-dbtype molecule_type [-title database_title] [-parse_seqids][-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids][-mask_desc mask_algo_descriptions] [-gi_mask][-gi_mask_name gi_based_mask_names] [-out database_name][-blastdb_version version] [-max_file_sz number_of_bytes][-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]DESCRIPTIONApplication to create BLAST databases, version 2.8.1+REQUIRED ARGUMENTS-dbtype <String, `nucl', `prot'>Molecule type of target dbOPTIONAL ARGUMENTS-hPrint USAGE and DESCRIPTION;  ignore all other parameters-helpPrint USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters-versionPrint version number;  ignore other arguments*** Input options-in <File_In>Input file/database nameDefault = `-'-input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>Type of the data specified in input_fileDefault = `fasta'*** Configuration options-title <String>Title for BLAST databaseDefault = input file name provided to -in argument-parse_seqidsOption to parse seqid for FASTA input if set, for all other input typesseqids are parsed automatically-hash_indexCreate index of sequence hash values.*** Sequence masking options-mask_data <String>Comma-separated list of input files containing masking data as produced byNCBI masking applications (e.g. dustmasker, segmasker, windowmasker)-mask_id <String>Comma-separated list of strings to uniquely identify the masking algorithm* Requires:  mask_data* Incompatible with:  gi_mask-mask_desc <String>Comma-separated list of free form strings to describe the masking algorithmdetails* Requires:  mask_id-gi_maskCreate GI indexed masking data.* Requires:  parse_seqids* Incompatible with:  mask_id-gi_mask_name <String>Comma-separated list of masking data output files.* Requires:  mask_data, gi_mask*** Output options-out <String>Name of BLAST database to be createdDefault = input file name provided to -in argumentRequired if multiplefile(s)/database(s) are provided as input-blastdb_version <Integer, 4..5>Version of BLAST database to be createdDefault = `4'-max_file_sz <String>Maximum file size for BLAST database filesDefault = `1GB'-logfile <File_Out>File to which the program log should be redirected*** Taxonomy options-taxid <Integer, >=0>Taxonomy ID to assign to all sequences* Incompatible with:  taxid_map-taxid_map <File_In>Text file mapping sequence IDs to taxonomy IDs.Format:<SequenceId> <TaxonomyId><newline>* Requires:  parse_seqids* Incompatible with:  taxid

然后举个例子,来说明参数。输入文件是:1.)待格式化的序列;2)序列的物种信息表(可选),物种信息是NCBI taxid,可以 使用get_species_taxids.sh脚本或者提取accession2taxid文件获得

$ cat test.fsa
>seq1
MSFSTKPLDMATWPDFAALVERHNGVWGGCWCMAFHAKGSGAVGNREAKEARVREGSTHAALVFDGSACVGWCQFGPTGE
LPRIKHLRAYEDGQAVLPDWRITCFFSDKAFRGKGVAAAALAGALAEIGRLGGGTVESYPEDAQGRTVAGAFLHNGTLAM
>seq2
MKAIDLKAEEKKRLIEGIQDFFYEERNEEIGIIAAEKALDFFLSGVGKLIYNKALDESKIWFSRRLEDISLDYELLYK
>seq3
MTLAAAAQSATWTFIDGDWYEGNVAILGPRSHAMWLGTSVFDGARWFEGVAPDLELHAARVNASAIALGLAPNMTPEQIV
GLTWDGLKKFDGKTAVYIRPMYWAEHGGYMGVPADPASTRFCLCLYESPMISPTGFSVTVSPFRRPTIETMPTNAKAGCL
YPNNGRAILEAKARGFDNALVLDMLGNVAETGSSNIFLVKDGHVLTPAPNGTFLSGITRSRTMTLLGDYGFRTTEKTLSV
RDFLEADEIFSTGNHSKVVPITRIEGRDLQPGPVAKKARELYWDWAHSASVG
>seq4
MRSFFHHVAAADPASFGVAQRVLTIPIKRAHIEVTHHLTKAEVDALIAAPNPRTSRGRRDRTFLLFLARTGARVSEATGV
NANDLQLERSHPQVLLRGKGRRDRVIPIPQDLARALTALLAEHGIANHEPRPIFIGARQERLTRFGATHIVRRAAAQAVT
IKPALAHKPISPHIFRHSLAMKLLQSGVDLLTIQAWLGHAQVATTHRYAAADVEMMRKGLEKAGVSGDLGLRFRPNDAVL
QLLTSI
>seq5
MTISRVCGSRTEAMLTNGQEIAMTSILKSTGAVALLLLYTLTANATSLMISPSSIERVAPDRAAVFHLRNQMDRPISIKV
RVFRWSQKGGVEKLEPTGDVVASPISAQLSPNGNRAVRVVRVSKEPLRSEEGYRVVIDEADPTRNTPEAESLSARHVLPV
LFRPPDVLGPEIELSLTRSDGWLMLVVENKGASRLRRSDVTLAQGSAGIARREGFVGYVLPGLTRHWRVGREDSYSGGIV
TVSANSSGGAIGEQLVVSGR
>seq6
TTLLLQVPIGWGVLHQGGALVVLGFAIAHWRGFVGTYTRDTAIEMRD$ cat test_map.txt
seq1 68287
seq2 2382161
seq3 68287
seq4 382
seq5 382
seq6 382

执行格式化命令:

$ makeblastdb -in test.fsa -parse_seqids -blastdb_version 5 -taxid_map test_map.txt -title "test_demo" -dbtype prot -out test_dbBuilding a new DB, current time: 03/09/2019 17:08:14
New DB name:   test_db
New DB title:  test_demo
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 6 sequences in 0.00222588 seconds.

如果不需要物种信息的话,可以执行:(-parse_seqids 参数需要加上,根据序列ID来分裂,可以使用)

$ makeblastdb -in test.fsa -parse_seqids  -title test_demo -dbtype prot -out test_db -blastdb_version 5

3. 使用blastp, blastx或者blastn进行序列比对

blast的-outfmt参数的选择与说明,可以使用blastp -help打印每个输出格式的信息:

*** Formatting options-outfmt <String>alignment view options:0 = Pairwise,1 = Query-anchored showing identities,2 = Query-anchored no identities,3 = Flat query-anchored showing identities,4 = Flat query-anchored no identities,5 = BLAST XML,6 = Tabular,7 = Tabular with comment lines,8 = Seqalign (Text ASN.1),9 = Seqalign (Binary ASN.1),10 = Comma-separated values,11 = BLAST archive (ASN.1),12 = Seqalign (JSON),13 = Multiple-file BLAST JSON,14 = Multiple-file BLAST XML2,15 = Single-file BLAST JSON,16 = Single-file BLAST XML2,18 = Organism ReportOptions 6, 7 and 10 can be additionally configured to producea custom format specified by space delimited format specifiers.The supported format specifiers are:qseqid means Query Seq-idqgi means Query GIqacc means Query accesionqaccver means Query accesion.versionqlen means Query sequence lengthsseqid means Subject Seq-idsallseqid means All subject Seq-id(s), separated by a ';'sgi means Subject GIsallgi means All subject GIssacc means Subject accessionsaccver means Subject accession.versionsallacc means All subject accessionsslen means Subject sequence lengthqstart means Start of alignment in queryqend means End of alignment in querysstart means Start of alignment in subjectsend means End of alignment in subjectqseq means Aligned part of query sequencesseq means Aligned part of subject sequenceevalue means Expect valuebitscore means Bit scorescore means Raw scorelength means Alignment lengthpident means Percentage of identical matchesnident means Number of identical matchesmismatch means Number of mismatchespositive means Number of positive-scoring matchesgapopen means Number of gap openingsgaps means Total number of gapsppos means Percentage of positive-scoring matchesframes means Query and subject frames separated by a '/'qframe means Query framesframe means Subject framebtop means Blast traceback operations (BTOP)staxid means Subject Taxonomy IDssciname means Subject Scientific Namescomname means Subject Common Namesblastname means Subject Blast Namesskingdom means Subject Super Kingdomstaxids means unique Subject Taxonomy ID(s), separated by a ';'(in numerical order)sscinames means unique Subject Scientific Name(s), separated by a ';'scomnames means unique Subject Common Name(s), separated by a ';'sblastnames means unique Subject Blast Name(s), separated by a ';'(in alphabetical order)sskingdoms means unique Subject Super Kingdom(s), separated by a ';'(in alphabetical order) stitle means Subject Titlesalltitles means All Subject Title(s), separated by a '<>'sstrand means Subject Strandqcovs means Query Coverage Per Subjectqcovhsp means Query Coverage Per HSPqcovus means Query Coverage Per Unique Subject (blastn only)

一般常用的就是-outfmt 5, 6 或者 7 参数,“5输出XML格式,6输出TAB分割格式,“7”输出带注释的TAB分割格式;

TAB格式每列信息如下,或者使用“7”输出带注释的格式

qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

如果像输出其他信息,比如比对覆盖度信息(qcovs:Query Coverage Per Subject),需要如在outfmt 6基础上加上 “qcovs”( 覆盖度信息),添加位置和顺序就是输出TAB文件中列的排列形式:

-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovs"

如果需要blast比对返回一个最优的比对结果,需要控制-max_target_seqs , -num_alignments 和 -max_hsps 选项:

-max_target_seqs <Integer, >=1>Maximum number of aligned sequences to keepNot applicable for outfmt <= 4* Incompatible with: num_descriptions, num_alignments
-num_alignments <Integer, >=0>Number of database sequences to show alignments for* Incompatible with: max_target_seqs
$ blastp -query query.fas -db /NAS2/bio_db/test/test_pep -outfmt 6 -max_hsps 1  -num_alignments 1 -evalue 1e-5 -num_threads 20
Warning: [blastp] Examining 5 or more matches is recommended
NP_001262865.1  XP_027222871.1  59.500  200     59      2       424     622     282     460     1.27e-71        241
APW79906.1      XP_027222871.1  32.765  528     258     17      206     705     2       460     4.30e-61        213
ASV48202.1      XP_027222871.1  54.592  196     73      3       227     421     280     460     9.60e-63        211$ blastp -query query.fas -db /NAS2/bio_db/test/test_pep -outfmt 6 -max_target_seqs 5 -evalue 1e-5 -num_threads 20
NP_001262865.1  XP_027222871.1  59.500  200     59      2       424     622     282     460     1.27e-71        241
NP_001262865.1  XP_027222871.1  43.038  79      32      4       104     172     89      164     4.20e-07        53.5
NP_001262865.1  XP_027222869.1  59.500  200     59      2       424     622     323     501     4.44e-71        241
NP_001262865.1  XP_027222870.1  59.500  200     59      2       424     622     322     500     4.56e-71        241
APW79906.1      XP_027222871.1  32.765  528     258     17      206     705     2       460     4.30e-61        213
APW79906.1      XP_027222869.1  54.872  195     70      4       515     705     321     501     1.96e-60        212
APW79906.1      XP_027222869.1  54.167  48      20      1       294     341     133     178     6.30e-06        49.7
APW79906.1      XP_027222870.1  54.872  195     70      4       515     705     320     500     2.14e-60        212
APW79906.1      XP_027222870.1  54.167  48      20      1       294     341     132     177     6.29e-06        49.7
ASV48202.1      XP_027222871.1  54.592  196     73      3       227     421     280     460     9.60e-63        211
ASV48202.1      XP_027222870.1  54.592  196     73      3       227     421     320     500     2.51e-62        210
ASV48202.1      XP_027222869.1  54.592  196     73      3       227     421     321     501     2.62e-62        210

3. 分割NR子库

NCB blast-2.8版本可支持用NCBI自带代码分割的NR子库的索引作为比对的库,使用比较方便

  • Support for a new version of the BLAST database that allows you to limit search by taxonomy as well some other improvements.

当然如果用这个版本的话,NR库也要重新下载了ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/

使用方式也比较简单(至少比之前的方法方便了),如果只想比对单一物种(如人:9606的话),命令如下:

    blastp –db nr –query query.fasta –taxids 9606 –outfmt 6 –out blast.outfm6

如果想比对NR子库哺乳动物的话,需要先建个哺乳动物子库tax_id索引

    get_species_taxids.sh -t 40674 > 40674.txids

然后再将序列比对至NR哺乳动物子库

    blastp –db nr –query query.fasta –taxidlist 40674.txids –outfmt 6 –out blast.outfm6

具体说明可看:https://ftp.ncbi.nlm.nih.gov/blast/db/v5/blastdbv5.pdf

利用BLAST进行序列比对和寻找同源基因相关推荐

  1. r语言remarkdown展示图_使用R语言包circlize可视化展示blast双序列比对结果

    circlize这个包还挺强大的,R语言里用来画圈图还挺方便的. 今天这篇文章记录用circlize这个包画圈图展示blast双序列比对结果的代码 植物线粒体基因组类的文章通常会分析细胞器基因组间基因 ...

  2. MAT之PSO:利用PSO算法优化二元函数,寻找最优个体适应度

    MAT之PSO:利用PSO算法优化二元函数,寻找最优个体适应度 目录 实现结果 设计代码 实现结果 设计代码 figure [x,y] = meshgrid(-5:0.1:5,-5:0.1:5); z ...

  3. 探索:使用北鲲云平台利用Gaussian16进行HAT反应过渡态的寻找

    HAT(Hydrogen atom transfer),即氢原子转移反应,在自由基参与的反应中十分常见,对自由基活性物种的转移和淬灭有着重要的意义.近年来,随着自由基化学的复兴,实现自由基参与的选择性 ...

  4. 利用XGBoost、Information Value、SHAP寻找“小北极星“指标与分层处理

    利用XGBoost.Information Value.SHAP寻找"小北极星"指标与分层处理 前言 "小北极星"指标 Information Value(IV ...

  5. blast mysql 基因序列_转载-网页方式下利用BLAST 程序进行基因/蛋白质序列比对...

    美国国家生物技术信息中心(National Center of Biotechology Information ,NCBI) 充分利用Internet ,为用户提供了丰富的生物信息资源.NCBI 的 ...

  6. python搜索pdf内容所在页码_利用Python在pdf文档中寻找某些词出现的页码

    要研究pdf文件的页码,首先要考虑这个文件的种类.pdf可能是一本书的电子版,可能是一份简历.可能是由Word.PPT或其他文档导出的--如果不是一本书,通常页面内容里是没有页码的:如果是一本书,虽然 ...

  7. python第k序列元素查找_Python寻找第k小的元素

    更多: http://my.oschina.net/u/438371/blog/131956 1.[代码][Python]代码 # -*- coding: utf-8 -*- from random ...

  8. 利用人工势场法的最短路径寻找

    人工势场法也可以用作机器人避障.我目前思考的是使其作为全局规划器,规划全局路径,也可以做局部规划直接下达至速度计算,目前暂时先看看全局路径计算.它将整个地图环境抽象为势场,机器人同时受到目标点的引力与 ...

  9. 生信分析-利用TBtools提取序列

    准备文件 gtf/gff3 全基因组fasta 背景知识 CDS=coding sequence,是编码区,是可翻译成蛋白质的exon的集合 cDNA比CDS多了5'-UTR和3'-UTR区域,是所有 ...

  10. IslandViewer4|基因组岛在线预测

    IslandViewer 4 基因组岛(genomic island)是微生物基因组中可水平转移的基因簇.根据功能基因不同,可将其分为抗性岛.代谢岛.共生岛.毒力岛等. 基因组岛的预测主要分为两类: ...

最新文章

  1. VMware Workstation Pro 共享文件夾
  2. 集群系统实现方案详解
  3. iptables规则备份恢复,firewalld的9个zone
  4. 【干货】移动APP安全测试要点解析
  5. Crawler:基于urllib+requests库+伪装浏览器实现爬取国内知名招聘网站,上海地区与机器学习有关的招聘信息(2018.4.30之前)并保存在csv文件内
  6. 347. 前 K 个高频元素(哈希表)
  7. JavaScript及jQuery选择器(二)
  8. android下拉弹性gif,android-pulltorefresh 下拉加载中使用gif动图
  9. crontab -e 报错(E518: Unknown option: foldenable)
  10. vxetable显示html,vxe-table分页无法显示?
  11. WIN10重新下载安装MicroSoft Store的三种方法
  12. 飘云阁论坛出品汇编逆向专用记事本
  13. 解决联想笔记本电源选项 电源管理无效
  14. html预览pdf上的电子印章,移动端pdf预览-水印电子签章问题
  15. 台式计算机主板接口识别,硬件丨当前台式机主板接口知识普及与主板结构全讲解...
  16. datedif函数mysql_datedif函数怎么用
  17. Java多线程超时判断
  18. Python实现图像去噪(中值去噪和均值去噪)
  19. 47个经典java程序编程题
  20. ArcGIS平台概述

热门文章

  1. 一加手机怎么root权限_一加手机的两种ROOT权限获取教程详解
  2. 如何选择工业相机(转载)
  3. Effective c++笔记
  4. 攻防世界 ics-05
  5. 【华人学者风采】丛京生 加州大学洛杉矶分校
  6. Java实现支付宝网页支付
  7. 【Nginx 源码学习】平滑重启,源码追踪
  8. 【飞思卡尔】飞思卡尔摄像头算法基本方法
  9. java定义负数_java如何定义负数
  10. 视网膜眼底图像的一种检测方法,学习笔记(一)