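1. Motivation

Recently, for a research project, my supervisor asked me to batch-download reference genomes for different species from NCBI and to collect assembly information for each of them. With so many genomes, doing this by hand one at a time would be slow and tedious, so I decided to write a Python crawler for the job.

2. Crawler approach

2.1 Find the pages to crawl and compare their URLs

Take the pig, horse, cattle, and sheep reference genomes as examples: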

# Sus scrofa (pig)
https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.6
# Equus caballus (horse)
https://www.ncbi.nlm.nih.gov/assembly/GCF_002863925.1
# Bos taurus (cattle)
https://www.ncbi.nlm.nih.gov/assembly/GCF_002263795.1
# Ovis aries (sheep)
https://www.ncbi.nlm.nih.gov/assembly/GCF_016772045.1
......
# Summary:
urls = "https://www.ncbi.nlm.nih.gov/assembly/{assembly_ID}"

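NCBI addresses most reference genome pages by assembly accession (a GenBank GCA_ or RefSeq GCF_ accession), so once we have the accession for each species of interest, we can go straight to the page that holds that assembly's information.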
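
A minimal sketch of how the URL template above could be filled in from an accession. The function name build_assembly_url and the accession check are illustrative assumptions and are not part of the original script; the check assumes the usual shape of an assembly accession (GCA_ or GCF_, nine digits, a dot, and a version number).

```python
import re

ASSEMBLY_URL = "https://www.ncbi.nlm.nih.gov/assembly/{assembly_id}"

# Assumed accession shape: GCA_/GCF_ + 9 digits + "." + version number
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")


def build_assembly_url(assembly_id: str) -> str:
    """Return the assembly page URL for one accession (hypothetical helper)."""
    assembly_id = assembly_id.strip()
    if not ACCESSION_RE.match(assembly_id):
        raise ValueError(f"not an assembly accession: {assembly_id!r}")
    return ASSEMBLY_URL.format(assembly_id=assembly_id)


for acc in ["GCA_000003025.6", "GCF_002863925.1"]:
    print(build_assembly_url(acc))
```

Rejecting malformed lines early keeps a typo in the list file from turning into a confusing parsing error later.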

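2.2 Decide what to extract and whether a second request is needed

The information to extract falls into three parts, corresponding to the regions boxed in red in the page screenshot:

Part 1 is the basic information for each assembly. Pick the fields you need, such as assembly name, Organism name, and Genome coverage.
Part 2 is the assembly statistics, which mainly reflect the quality of the assembly. I recommend collecting all of them.
Part 3 is the FTP address for regular downloads, which holds the reference genome, CDS sequences, and annotation files such as GFF and GTF. Because it lives at its own URL, it requires a second request. [Screenshot: the FTP directory page]
This article mainly downloads the reference genome, i.e. the .fna file; protein .faa files and .gff or .gtf annotation files can be downloaded in the same way as needed.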

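2.3 Locate the required information in the page source

Open the page source via the right-click menu or with Ctrl+U, then use Ctrl+F to find where the content you want to extract sits, as follows:

Part 1: basic assembly information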

<div><div><div id="summary_cont"><div class="col margin_r0 nine_col"><div id="summary"><h1 xmlns:math="http://exslt.org/math" class="marginb0 margin_t0">Sscrofa11.1</h1><input type="hidden" value="true" id="ftp-genbank-refseq-exist" /><dl xmlns:math="http://exslt.org/math" class="assembly_summary_new margin_t0"><dt>Description: </dt><dd>Sscrofa11 with Y sequences from WTSI_X_Y_pig V2</dd><dt>Organism name: </dt><dd><a href="/Taxonomy/Browser/wwwtax.cgi?mode=Info&amp;id=9823&amp;lvl=3&amp;lin=f&amp;keep=1&amp;srchmode=1&amp;unlock"><span class="highlight" style="background-color:">Sus scrofa</span> (pig)</a></dd><dt>Infraspecific name: </dt><dd>Breed: Duroc</dd><dt>Isolate: </dt><dd>TJ Tabasco</dd><dt>Sex: </dt><dd>female</dd><dt>BioSample: </dt><dd><a href="/biosample/SAMN02953785/">SAMN02953785</a></dd><dt>BioProject: </dt><dd><a href="/bioproject/PRJNA13421/">PRJNA13421</a></dd><dt>Submitter: </dt><dd>The Swine Genome Sequencing Consortium (SGSC)</dd><dt>Date: </dt><dd>2017/02/07</dd><dt>Synonyms: </dt><dd>susScr11</dd><dt>Assembly level: </dt><dd>Chromosome</dd><dt>Genome representation: </dt><dd>full</dd><dt>RefSeq category: </dt><dd>representative genome</dd><dt>GenBank assembly accession: </dt><dd>GCA_000003025.6 (<span class="highlight" style="background-color:">latest</span>)</dd><dt>RefSeq assembly accession: </dt><dd>GCF_000003025.6 (<span class="highlight" style="background-color:">latest</span>)</dd><dt>RefSeq assembly and GenBank assembly identical: </dt><dd>no (<a href="#assembly-diff" id="assembly-diff-trigger">hide details</a>)</dd><dd id="assembly-diff"><ul><li>Only in RefSeq: chromosome MT (in non-nuclear assembly-unit)</li></ul></dd><dd class="displayed-from-refseq"><ul style="margin-left:0;"><li>Data displayed for RefSeq version</li></ul></dd><dt>WGS Project: </dt><dd><a href="/nuccore/AEMK00000000.2/">AEMK02</a></dd><dt>Assembly method: </dt><dd>Falcon v. 
OCT-2015</dd><dt>Expected final version: </dt><dd>yes</dd><dt>Genome coverage: </dt><dd>65.0x</dd><dt>Sequencing technology: </dt><dd>PacBio</dd></dl><div xmlns:math="http://exslt.org/math" style="clear:both"></div><p style="color:grey;"><span>IDs: </span><span>1004191 [UID] </span><span>4121818 [GenBank] </span><span>4192498 [RefSeq] </span></p></div></div><div class="more_genome_data-cont"><div class="more_genome_data shadow margin_r1"><h3>See <a href="/genome/?term=txid9823[orgn]">Genome</a> Information for<em><span class="highlight" style="background-color:">Sus scrofa</span></em></h3></div><div class="more_genome_data shadow margin_r1 links_to_isolate" data-accession="GCA_000003025.6"><h3>Pathogen Detection Resources</h3><ul><li><a href="#" id="link-to-isolate">Isolate Browser </a></li><li><a href="#" id="link-to-snp-tree">SNP Tree Viewer</a></li></ul></div><div class="more_genome_data genome_nav shadow margin_r1"><h3>There are 26 assemblies for this organism</h3><a href="/assembly/organism/9823/latest/">See more</a></div></div><div id="asm_history_cont"><div id="asm_history_cont"><h2 class="sec_header margin_b0 rev_history_tg" id="revision-history">History <a href="#" class="jig-ncbitoggler" data-ncbitoggler-toggles="asb_history">(Showrevision history)</a></h2><div class="jig-ncbigrid asb_history" id="asb_history" style="display: none;"><table class="margin_t0 "><thead><tr><th>GenBank Assembly<br />Accession</th><th></th><th>RefSeq Assembly<br />Accession</th><th>Assembly<br /> Name</th><th>Assembly<br />Level</th><th>Status</th></tr></thead><tbody><tr class="current_asm"><td><a href="https://www.ncbi.nlm.nih.gov/assembly/1004191/" target="_blank">GCA_000003025.6</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/1004191/" target="_blank">GCF_000003025.6</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.6/">Sscrofa11.1</a></td><td>Chromosome</td><td><span class="highlight" style="background-color:">Latest</span> 
GenBank, <span class="highlight" style="background-color:">Latest</span> RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/905331/" target="_blank">GCA_000003025.5</a></td><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.5/">Sscrofa11</a></td><td>Chromosome</td><td>Replaced GenBank</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/304498/" target="_blank">GCA_000003025.4</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/304498/" target="_blank">GCF_000003025.5</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.5/">Sscrofa10.2</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/284398/" target="_blank">GCA_000003025.3</a></td><td>≠</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/284398/" target="_blank">GCF_000003025.4</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.4/">Sscrofa10</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/111518/" target="_blank">GCA_000003025.2</a></td><td>=</td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/111518/" target="_blank">GCF_000003025.3</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.3/">Sscrofa9.2</a></td><td>Chromosome</td><td>Replaced GenBank, Replaced RefSeq</td></tr><tr><td><a href="https://www.ncbi.nlm.nih.gov/assembly/5178/" target="_blank">GCA_000003025.1</a></td><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCA_000003025.1/">Sscrofa9</a></td><td>Chromosome</td><td>Replaced GenBank</td></tr><tr><td><span class="note_gray">n/a</span></td><td><span class="note_gray">n/a</span></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/4418/" 
target="_blank">GCF_000003025.2</a></td><td><a href="https://www.ncbi.nlm.nih.gov/assembly/GCF_000003025.2/">Sscrofa5</a></td><td>Chromosome</td><td>Replaced RefSeq</td></tr></tbody></table></div></div></div><div id="asm_comment_cont"><div id="asm_comment_cont"><h2 class="sec_header margin_b0">Comment</h2><pre class="asm_comment_text"><span class="asm_comment_visible">This pig genome sequence (Sscrofa11) has been released by the International Swine Genome Sequencing Consortium under the terms of the Toronto Statement (Nature 2009, 461: 168). The Consortium is coordinating genome-wide analysis, annotation and publication.

Part 2: assembly statistics

The sequence data from </span><span class="asm_comment_dot">... </span><span class="asm_comment_more">which this assembly was constructed largely comprise 65x genome coverage in whole genome shotgun (WGS) Pacific Biosciences long reads (Pacific Biosciences RSII, with P6/C4 chemistry). Illumina HiSeq2500 WGS paired-end and mate pair reads were used for final error correction using PILON. Sanger and Oxford Nanopore sequence data from a few CHORI-242 BAC clones were used to fill gaps. <span class="highlight" style="background-color:">All</span> the WGS data were generated from a single Duroc female (TJ Tabasco, also known as Duroc 2-14) which was also the source of DNA for the CHORI-BAC library.
Sscrofa11 replaces the previous assembly, Sscrofa10.2, which was largely established from the same Duroc 2-14 DNA source. </span> <a href="#" class="asm_comment_more">more</a></pre><div></div></div></div><div id="global-stats"><div id="global-stats"><h2 class="margin_b0 sec_header">Global statistics</h2><table summary="Global statistics" class="margin_t0 jig-ncbigrid"><tbody><tr><td>Number of regions with alternate loci or patches</td><td class="align_r">2</td></tr><tr><td>Total sequence length</td><td class="align_r">2,501,912,388</td></tr><tr><td>Total ungapped length</td><td class="align_r">2,472,047,747</td></tr><tr><td>Gaps between scaffolds</td><td class="align_r">93</td></tr><tr><td>Number of scaffolds</td><td class="align_r">706</td></tr><tr><td>Scaffold N50</td><td class="align_r">88,231,837</td></tr><tr><td>Scaffold L50</td><td class="align_r">9</td></tr><tr><td>Number of contigs</td><td class="align_r">1,118</td></tr><tr><td>Contig N50</td><td class="align_r">48,231,277</td></tr><tr><td>Contig L50</td><td class="align_r">15</td></tr><tr><td>Total number of chromosomes and plasmids</td><td class="align_r">21</td></tr><tr><td>Number of component sequences (WGS or clone)</td><td class="align_r">1,308</td></tr></tbody></table></div></div></div></div><input id="asm-has-egap-annot" type="hidden" value="true" /><script type="text/javascript" src="/projects/genome/uud/js/uud.js"></script><script src="/projects/genome/trackmgr/0.7/js/tms.js"></script></div><div id="messagearea_bottom">

Part 3: the separate download URL

    <li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1">FTP directory for RefSeq assembly</a></li><li><a href="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/025/GCA_000003025.6_Sscrofa11.1">FTP directory for GenBank assembly</a></li>

Part 4: the FTP directory page behind that URL

# Page title (index):
/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1
# Page URL (using the pig RefSeq assembly as an example):
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/
# Reference genome download link
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/GCF_000003025.6_Sscrofa11.1_genomic.fna.gz
...
# The general download link can therefore be summarized as:
https://ftp.ncbi.nlm.nih.gov + /genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1 + /GCF_000003025.6_Sscrofa11.1 + _genomic.fna.gz
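
The pattern above can also be expressed directly in code. The following is only a sketch: the helper name genomic_fna_url is made up for illustration, and it assumes the genome file is always named after the last component of the FTP directory, as in the pig example.

```python
def genomic_fna_url(ftp_dir_url: str) -> str:
    """Build the *_genomic.fna.gz link from an FTP directory URL (hypothetical helper)."""
    ftp_dir_url = ftp_dir_url.rstrip("/")
    # The last path component doubles as the file name prefix
    dir_name = ftp_dir_url.rsplit("/", 1)[-1]
    return f"{ftp_dir_url}/{dir_name}_genomic.fna.gz"


pig_dir = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/003/025/GCF_000003025.6_Sscrofa11.1/"
print(genomic_fna_url(pig_dir))
```

Deriving the link this way avoids the second request, but it gives no confirmation that the file exists, which the crawl of the directory page does provide.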

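3. Implementation

3.1 Provide the input list file assemble_list.txt

Each line holds the assembly accession of one species of interest; look these up in bulk according to your own needs.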

GCA_000003025.6
GCF_002863925.1
GCF_002263795.1
GCF_016772045.1
GCF_003369695.1
GCF_000247795.1
...
# Page URL
url=str("https://www.ncbi.nlm.nih.gov/assembly/")+str(sample)
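
A small sketch of reading such a list file. The function name read_accessions is an assumption for illustration; unlike the original sample_list function, it also skips blank lines and lines starting with "#", which is an added convenience and not behavior of the original script.

```python
from pathlib import Path


def read_accessions(list_path):
    """Read one accession per line, ignoring blanks and '#' comment lines."""
    accessions = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()  # also removes stray spaces and Windows line endings
        if line and not line.startswith("#"):
            accessions.append(line)
    return accessions
```

Called as read_accessions("assemble_list.txt"), it would return a list such as ["GCA_000003025.6", "GCF_002863925.1", ...] for the file shown above.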

3.2 Request: fetch the page source

def information_collect(url):
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    page_content = response.text
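
The request above is sent once, with no retry. When crawling many pages, a transient network error would stop the whole run. The sketch below shows one possible retry wrapper; fetch_with_retry and its parameters are hypothetical names, and the wrapper takes any callable so that it does not depend on a particular HTTP library.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url) up to `retries` times, sleeping `delay` seconds between failures."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Possible use with requests (assumes `headers` is defined as in the text above):
# page_content = fetch_with_retry(
#     lambda u: requests.get(u, headers=headers, timeout=30).text, url)
```

Passing a timeout to requests.get is worth considering as well, since a request without one can hang indefinitely.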

3.3 Processing: match with regular expressions

3.3.1 Part 1: matching the assembly release information

    # Some assemblies lack sequencing platform information, so that field is matched separately
    jiexi_1_1 = re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>', re.S)
    result_1_1 = jiexi_1_1.findall(page_content)
    jiexi_1_2 = re.compile(r'Sequencing technology.*?<dd>(.*?)</dd>', re.S)
    result_1_2 = jiexi_1_2.findall(page_content)
    if not result_1_2:
        result_1_2 = ["Na"]  # keep it a list so that [0] yields the whole placeholder, not just "N"
    result_1 = list_change(result_1_1) + result_1_2[0] + "\t"
    # list_change is a helper of my own that turns the input list into a tab-separated string
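
One way to make missing fields harmless is to parse every <dt>/<dd> pair into a dictionary and look fields up with a default. This is a sketch under the assumption that the summary block keeps the <dt>label: </dt><dd>value</dd> layout shown in section 2.3; parse_summary is a hypothetical name, not a function from the original script.

```python
import re

DT_DD_RE = re.compile(r"<dt>(.*?):\s*</dt><dd>(.*?)</dd>", re.S)
TAG_RE = re.compile(r"<[^>]+>")  # used to strip inner tags such as <a> or <span>


def parse_summary(page_content):
    """Map each summary label to its text value."""
    fields = {}
    for label, value in DT_DD_RE.findall(page_content):
        fields[label.strip()] = TAG_RE.sub("", value).strip()
    return fields


# Fragment taken from the page source shown in section 2.3
sample = ('<dt>Submitter: </dt><dd>The Swine Genome Sequencing Consortium (SGSC)</dd>'
          '<dt>Date: </dt><dd>2017/02/07</dd>'
          '<dt>GenBank assembly accession: </dt><dd>GCA_000003025.6 '
          '(<span class="highlight" style="background-color:">latest</span>)</dd>')
fields = parse_summary(sample)
print(fields["Date"])
print(fields.get("Sequencing technology", "NA"))  # absent here, so the default is used
```

With a dictionary, a missing "Sequencing technology" entry becomes a default value instead of an empty match that has to be special-cased.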

3.3.2 Part 2: matching the assembly statistics

    # Assembly statistics of the reference genome
    jiexi_2 = re.compile(r'Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>', re.S)
    result_2 = jiexi_2.findall(page_content)
    result = str(result_1) + list_change(result_2)
    # Alternative: match everything with a single expression
    # jiexi = re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>.*?Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>', re.S)
    # result = jiexi.findall(page_content)
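
The single long expression above fails as a whole if any one row is absent. As an alternative sketch, each row of the "Global statistics" table can be matched on its own and stored by label. This assumes the row layout shown in section 2.3 (a label cell followed by a cell with class="align_r"); parse_global_stats is a hypothetical name.

```python
import re

# [^<]+ keeps a match inside one cell, so rows of other tables are not picked up
STAT_ROW_RE = re.compile(r'<tr><td>([^<]+)</td><td class="align_r">([^<]+)</td></tr>')


def parse_global_stats(page_content):
    """Map each statistic label to an integer value."""
    stats = {}
    for label, value in STAT_ROW_RE.findall(page_content):
        stats[label.strip()] = int(value.replace(",", ""))
    return stats


# Rows taken from the page source shown in section 2.3
sample = ('<tr><td>Total sequence length</td><td class="align_r">2,501,912,388</td></tr>'
          '<tr><td>Scaffold N50</td><td class="align_r">88,231,837</td></tr>'
          '<tr><td>Scaffold L50</td><td class="align_r">9</td></tr>')
stats = parse_global_stats(sample)
print(stats["Scaffold N50"])
```

Converting the values to integers here also removes the thousands separators, which makes later sorting or filtering easier.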

3.3.3 Part 3: matching the download link

    # Follow the FTP directory link to get the download link
    # xiazai = re.compile(r'Statistics report</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>', re.S)
    xiazai = re.compile(r'FTP directory for RefSeq assembly</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>', re.S)
    xiazai_url = xiazai.findall(page_content)
    # print(xiazai_url)
    xiazai_response = requests.get(xiazai_url[0], headers=headers)
    xiazai_page_content = xiazai_response.text
    # print(xiazai_page_content)
    xiazai_url_jiexi_1 = re.compile(r'<title>Index of (.*?)</title>', re.S)
    xiazai_url_1 = xiazai_url_jiexi_1.findall(xiazai_page_content)
    # print(xiazai_url_1)
    xiazai_url_jiexi_2 = re.compile(r'fna\.gz">(.*?)</a>', re.S)  # dot escaped so it only matches a literal "."
    xiazai_url_2 = xiazai_url_jiexi_2.findall(xiazai_page_content)
    # print(xiazai_url_2)
    # Assemble the download link
    final_url = str("https://ftp.ncbi.nlm.nih.gov" + str(xiazai_url_1[0] + '/' + xiazai_url_2[0]))
    information = str(result + final_url)
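
Taking the first link that ends in fna.gz works only if the genome file is listed first. An FTP directory can also hold other .fna.gz files (for example CDS or RNA sequences extracted from the genome), so a more explicit selection may be safer. The sketch below assumes a plain listing of <a href="file">file</a> links; pick_genomic_fna and the sample listing are illustrative, not taken from the original script.

```python
import re

HREF_RE = re.compile(r'href="([^"]+\.fna\.gz)"')


def pick_genomic_fna(listing_html):
    """Return the name of the whole-genome FASTA file, or None if it is not listed."""
    for name in HREF_RE.findall(listing_html):
        # Skip files such as *_cds_from_genomic.fna.gz and *_rna_from_genomic.fna.gz
        if name.endswith("_genomic.fna.gz") and "_from_genomic" not in name:
            return name
    return None


# Hypothetical listing fragment; only the genomic file name appears in the text above
listing = ('<a href="GCF_000003025.6_Sscrofa11.1_cds_from_genomic.fna.gz">cds</a>\n'
           '<a href="GCF_000003025.6_Sscrofa11.1_genomic.fna.gz">genomic</a>\n'
           '<a href="GCF_000003025.6_Sscrofa11.1_rna_from_genomic.fna.gz">rna</a>\n')
print(pick_genomic_fna(listing))
```

Returning None for a directory without a genome file lets the caller decide whether to skip the accession or report it.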

3.4 Main program:

# Main program
if __name__ == '__main__':
    all_sample_lists = sample_list("assemble_list.txt")  # read the list file into a Python list
    for sample in all_sample_lists:  # iterate over the accessions
        url = url_get(sample)  # build the page URL
        save_information = information_collect(url)  # fetch and parse the page
        # print(save_information)
        information_save(save_information)  # append the result to the output file
    print("over")
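
In the loop above, an exception for a single accession (for example an IndexError when a page has no FTP link) ends the whole run. A possible variation, sketched here with the hypothetical helper collect_all, records failures and carries on with the remaining accessions.

```python
def collect_all(accessions, collect_one):
    """Apply collect_one to every accession; return (results, failures)."""
    rows, failed = [], []
    for acc in accessions:
        try:
            rows.append(collect_one(acc))
        except Exception as exc:  # broad on purpose: one bad page should not stop the run
            failed.append((acc, repr(exc)))
    return rows, failed


# Possible use with the functions from this article:
# rows, failed = collect_all(all_sample_lists,
#                            lambda s: information_collect(url_get(s)))
```

The list of failed accessions can then be inspected or retried separately once the run has finished.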

4. Putting the pieces together:

# Import modules
import os
import requests
import re
import csv


def sample_list(list_path):  # path to the list of accessions to process
    with open(list_path, 'r') as slist:
        sample_lists = []
        samplelists = slist.readlines()
        for sample in samplelists:
            sample = sample.strip('\n')  # strip the trailing "\n" from each entry to avoid errors
            sample_lists.append(sample)
    return sample_lists  # return the list of accessions


def url_get(sample):  # build the page URL
    url = str("https://www.ncbi.nlm.nih.gov/assembly/") + str(sample)
    return url


def information_collect(url):  # fetch and parse one assembly page
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    page_content = response.text
    # all_information = re.compile(r'<div id="summary_cont">.*?<div id="messagearea_bottom">', re.S).findall(page_content)

    # Assembly release information
    # jiexi_1 = re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>', re.S)
    # result_1 = jiexi_1.findall(page_content)
    # Match the sequencing platform separately, because some pages lack it
    jiexi_1_1 = re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>', re.S)
    result_1_1 = jiexi_1_1.findall(page_content)
    # print(result_1_1)
    # print(list_change(result_1_1))
    jiexi_1_2 = re.compile(r'Sequencing technology.*?<dd>(.*?)</dd>', re.S)
    result_1_2 = jiexi_1_2.findall(page_content)
    # print(result_1_2)
    if not result_1_2:
        result_1_2 = [" "]  # placeholder when the field is missing
    result_1 = list_change(result_1_1) + result_1_2[0] + "\t"
    # print(result_1)

    # Assembly statistics of the reference genome
    jiexi_2 = re.compile(r'Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>', re.S)
    result_2 = jiexi_2.findall(page_content)
    result = str(result_1) + list_change(result_2)
    # jiexi = re.compile(r'<title>(.*?) - Genome.*?Organism name.*?">(.*?)</a>.*?Submitter.*?<dd>(.*?)</dd><dt>Date.*?<dd>(.*?)</dd>.*?GenBank assembly accession.*?<dd>(.*?)</dd>.*?Sequencing technology.*?<dd>(.*?)</dd>.*?Total sequence length.*?">(.*?)</td>.*?Total ungapped length.*?">(.*?)</td>.*?Gaps between scaffolds.*?">(.*?)</td>.*?Number of scaffolds.*?">(.*?)</td>.*?Scaffold N50.*?">(.*?)</td>.*?Scaffold L50.*?">(.*?)</td>.*?Number of contigs.*?">(.*?)</td>.*?Contig N50.*?">(.*?)</td>.*?Contig L50.*?">(.*?)</td>.*?Total number of chromosomes and plasmids.*?">(.*?)</td>.*?Number of component sequences \(WGS or clone\).*?">(.*?)</td>', re.S)
    # result = jiexi.findall(page_content)

    # Follow the FTP directory link to get the download link
    # xiazai = re.compile(r'Statistics report</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>', re.S)
    xiazai = re.compile(r'FTP directory for RefSeq assembly</a>.*?<a href="(.*?)">FTP directory for GenBank assembly</a>', re.S)
    xiazai_url = xiazai.findall(page_content)
    # print(xiazai_url)
    xiazai_response = requests.get(xiazai_url[0], headers=headers)
    xiazai_page_content = xiazai_response.text
    # print(xiazai_page_content)
    xiazai_url_jiexi_1 = re.compile(r'<title>Index of (.*?)</title>', re.S)
    xiazai_url_1 = xiazai_url_jiexi_1.findall(xiazai_page_content)
    # print(xiazai_url_1)
    xiazai_url_jiexi_2 = re.compile(r'fna\.gz">(.*?)</a>', re.S)  # dot escaped so it only matches a literal "."
    xiazai_url_2 = xiazai_url_jiexi_2.findall(xiazai_page_content)
    # print(xiazai_url_2)
    # Assemble the download link
    final_url = str("https://ftp.ncbi.nlm.nih.gov" + str(xiazai_url_1[0] + '/' + xiazai_url_2[0]))
    information = str(result + final_url)
    return information


def information_save(save_information):  # append one record to the output file
    with open('./information_collection.txt', mode='a') as file:
        file.writelines(save_information)
        file.writelines('\n')


def list_change(save_list):  # flatten the list of tuples from findall into a tab-separated string
    list2 = []
    for o in save_list:
        for i in o:
            list2.append(i)
    inf = ''
    for i in list2:
        inf = inf + i + '\t'
    # print(inf)
    return inf


# Main program
if __name__ == '__main__':
    all_sample_lists = sample_list("assemble_list.txt")  # read the list file into a Python list
    for sample in all_sample_lists:  # iterate over the accessions
        url = url_get(sample)  # build the page URL
        save_information = information_collect(url)  # fetch and parse the page
        # print(save_information)
        information_save(save_information)  # append the result to the output file
    print("over")
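
The combined script imports csv but builds each output line by hand. As a sketch of what using the module could look like, the snippet below writes tab-separated rows with a header line. The column names are my own reading of the capture order of the regular expressions above, and write_rows is a hypothetical helper.

```python
import csv
import io

# Assumed column order, following the capture groups used in this article
HEADER = [
    "Assembly name", "Organism name", "Submitter", "Date",
    "GenBank assembly accession", "Sequencing technology",
    "Total sequence length", "Total ungapped length", "Gaps between scaffolds",
    "Number of scaffolds", "Scaffold N50", "Scaffold L50", "Number of contigs",
    "Contig N50", "Contig L50", "Total number of chromosomes and plasmids",
    "Number of component sequences (WGS or clone)", "Download URL",
]


def write_rows(handle, rows):
    """Write a header line followed by the given rows as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


buffer = io.StringIO()  # stands in for an open file in this sketch
write_rows(buffer, [["Sscrofa11.1", "Sus scrofa (pig)"]])
print(buffer.getvalue())
```

A header line makes the file self-describing when it is later opened in a spreadsheet.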

5. Results

$/实验记录本/爬虫实战/NCBI信息获取/NCBI_collection.py
ARS-UCD1.2 - bosTau9    Bos taurus (cattle)    USDA ARS    2018/04/11    GCA_002263795.2 (latest)    PacBio; Illumina NextSeq 500; Illumina HiSeq; Illumina GAII    2,715,853,792    2,715,825,630    0    2,211    103,308,737    12    2,597    25,896,116    32    31    2,211    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/263/795/GCA_002263795.2_ARS-UCD1.2/GCA_002263795.2_ARS-UCD1.2_genomic.fna.gz
UOA_Brahman_1    Bos indicus x Bos taurus (hybrid cattle)    University of Adelaide    2018/11/30    GCA_003369695.2 (latest)    PacBio Sequel; PacBio RSII; Illumina NextSeq    2,680,953,056    2,679,316,559    0    1,250    104,466,507    11    1,552    26,764,281    32    30    1,250    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/369/695/GCA_003369695.2_UOA_Brahman_1/GCA_003369695.2_UOA_Brahman_1_genomic.fna.gz
Bos_indicus_1.0    Bos indicus (zebu cattle)    Genoa Biotecnologia SA    2014/11/25    GCA_000247795.2 (latest)    SOLiD    2,673,965,444    2,475,828,999    0    32    106,310,653    11    253,770    28,375    25,227    32    253,770    https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/247/795/GCA_000247795.2_Bos_indicus_1.0/GCA_000247795.2_Bos_indicus_1.0_genomic.fna.gz
...
over

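The final output can be imported directly into Excel for further processing, and the collected download links can be passed to wget on Linux for batch downloading, which saves a lot of time. The code still has plenty of room for improvement, though. For example, the regular-expression matching does not handle missing information well: if an assembly page lacks one of the fields, the program raises an error. Matching each field separately and adding conditional checks would fix that. I taught myself Python crawling purely out of interest. I started out knowing nothing, learned and practiced on other people's projects, and finally built something of my own, which brings a real sense of accomplishment. There is always more to learn. Keep going!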
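
For the batch download mentioned above, the links can be pulled out of the output file and saved one per line, which is the input format wget accepts through its -i option. This is a sketch: extract_urls is a hypothetical helper, and it assumes the download link is the last tab-separated field of each record, as in the output shown in section 5.

```python
def extract_urls(tsv_lines):
    """Return the last field of every line that ends with an https link."""
    urls = []
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[-1].startswith("https://"):
            urls.append(fields[-1])
    return urls


# Possible use with the file written by information_save:
# with open("information_collection.txt") as src, open("urls.txt", "w") as dst:
#     dst.write("\n".join(extract_urls(src)) + "\n")
# Then, on Linux:  wget -i urls.txt
```

Lines that do not end in a link, such as a header or a partially written record, are simply skipped.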


[Crawler in practice] Using Python to quickly scrape reference genome assembly information from NCBI