根据氨基酸变化，从NCBI - ClinVar数据库抓取信息（基于python

写在前面

NCBI 网址中提供有各种数据库，这里使用’ClinVar’数据库。从ClinVar数据库搜索氨基酸变化信息后，获取搜索结果的相关信息。

biopython 提供了访问NCBI Entrez数据库的模块bio.Entrez ，中文文档在这里。

代码实现

需求：在ClinVar数据库中，根据输入的氨基酸变化信息，获取染色体的位置信息。
实现：1) 获取检索结果各条目的ID列表；2) 根据ID，获取条目信息，格式为xml格式；3) 解析xml，获取染色体位置信息等；4) 将获取的信息保存为字典；

使用NCBI Entrez数据库前，除了导入相应模块，还需要指定使用者的邮箱地址。

from Bio import Entrez
Entrez.email = "xxx@xxx.com"  # 邮箱地址

第一步：获取检索内容列表

指定要检索的数据库，最大检索条目数等。这里是检索"ClinVar"数据库，设定最大条目数为5。获取检索内容条目的ID列表使用Entrez.esearch(db, term, retmax)，其中，db即为检索的数据库名称，一般为小写，term为检索的内容，retmax为检索条目最大值。返回的结果为字典形式，包括

dict_keys([‘Count’, ‘RetMax’, ‘RetStart’, ‘IdList’, ‘TranslationSet’, ‘TranslationStack’, ‘QueryTranslation’])

这里我们仅使用IdList获取检索内容ID列表即可。

QUERY_DB = "clinvar"  # 要检索的数据库，一般都是小写
MAX_TERM = 5  # 检索获取的结果条目数最大值def get_id_lst(query_info):  # 获取查询内容的信息ID"""query_info: 'GeneName ProChg', eg: 'EGFR p.Gly719Ala'"""term_handle = Entrez.esearch(db=QUERY_DB, term=query_info, retmax=MAX_TERM)records = Entrez.read(term_handle) # records.keys()id_lst = records['IdList']return id_lst

第二步：根据ID，获取xml格式条目信息

使用Entrez.efetch模块可获取指定ID列表的信息，使用read()方法获取文本内容(xml格式)。这里需要指定的四个参数：db还是ClinVar数据库，id为第一步获取的id列表，rettype指定为variation，retmode格式为xml。

一开始不知道指定rettype的值是什么，最终在Biostars上的问答区找到了，回答者还有提供在ClinVar数据库使用"variation"获取的具体内容示例（这里）。

def query_info(id_list):handle = Entrez.efetch(db=QUERY_DB, id=id_list, rettype="variation", retmode="xml")handle_info = handle.read()handle.close()return handle_info

第三步：解析xml，获取染色体位置信息

上一步已获取检索信息的xml格式文本，使用python中bs4.BeautifulSoup模块处理（bs4.BeautifulSoup的英文文档和中文文档）。获取的内容如下图，

这里用到的方法有：

find_all(name): 搜索当前tag中，名称为"name"的所有tag字节点。这里就是获取名称为VariationReport的tag的节点及其子节点的内容。该方法还有属性等其他参数： find_all( name , attrs , recursive , string , **kwargs )，这里暂时只用到名称。
attrs: 使用find_all获取的节点内容后，获取VariationID和VariationName属性，“VariationName"也可使用节点"Name"获取，由于"Name"仅有一个子节点，则使用report.Name.string获取。对于变异类型和氨基酸变化，也使用”.string"获取 “VariantType” 和 “ProteinChange”。
.children：这里使用的是Gene以获取基因相关信息，即可直接指定字节点的名称，并使用"attrs"获取相应的属性值。
对于获取位置信息，类似的使用find_all(‘SequenceLocation’).

代码如下：

from bs4 import BeautifulSoup
from collections import OrderedDictREF_VS = "GRCh37"  # 获取染色体位置信息，使用GRCH37/hg19. 若为hg38, 请使用"GRCh38"def report_info_dict(handle_query_xml):reports_dict = OrderedDict()soup = BeautifulSoup(handle_query_xml, 'xml')  # lxmlreports = soup.find_all('VariationReport')if not reports:return {}for report in reports:var_id = report.attrs['VariationID']  # 变异信息IDvar_name = report.attrs['VariationName']  # 变异名称，包含转录本号，基因名称，核苷酸变化和氨基酸变化gene_name = report.Gene.attrs['Symbol']  # 基因名称var_type = report.VariantType.string  # 变异类型if report.ProteinChange:pro_change_abbr = report.ProteinChange.string  # 可能有多种else:pro_change_abbr = "NotFound ProteinChange"locations = report.find_all('SequenceLocation')  # 位置信息for seq_loc in locations:if seq_loc.attrs['Assembly'] == REF_VS:chrom = seq_loc.attrs['Chr']  # 染色体号start = seq_loc.attrs['display_start']  # display_start; innerStart, startstop = seq_loc.attrs['display_stop']reports_dict[var_id] = [var_name, var_type, gene_name, pro_change_abbr, chrom, start, stop]return reports_dict

问题：检索的结果中有多条，有的结果可能不是对应输入检索的氨基酸变化信息，只是包含输入的内容。
解决：可通过最后获取的字典，进行正则匹配。（暂无实现代码）
注：
– 对SNP的氨基酸变化，匹配方式固定，eg: 匹配p.Arg225Ter，p.([A-Za-z]{3})([0-9]+)([A-Za-z]{3})。
– 但对INDEL或者有frameshift的氨基酸变化，匹配的格式可能较多，有：p.Ser719del，p.Thr1018fs，p.Leu747_Pro753delinsSer等，如果格式与ClinVar数据库中提供的格式相同，检索结果相对一致，如果格式有差异，则检索结果的准确性则会降低。
– 如果氨基酸名称简写为1个字母，可考虑转换为3个字母写法，或者匹配获取内容的氨基酸变化信息，但后者在INDEL时ClinVar检索结果中可能没有氨基酸变化信息。因此，最好还是转换为3个字母的格式。对于氨基酸变化的格式标准应该遵循HGVS的标准规范。

汇总以上代码：

from Bio import Entrez
from bs4 import BeautifulSoup
from collections import OrderedDictEntrez.email = "xxx@xxx.com"  # 输入邮箱地址
QUERY_DB = "clinvar"  # 要检索的数据库，一般都是小写
REF_VS = "GRCh37"  # 获取染色体位置信息，使用GRCH37/hg19. 若为hg38, 请使用"GRCh38"
MAX_TERM = 5  # 检索获取的结果条目数最大值query_i = 'EGFR p.Gly719Ala'  # 示例 要查询的信息
query_i1 = 'EGFR p.Gly719'
query_info2 = 'BRCA1 p.Q1756fs'
q_i1 = 'EGFR p.D770_N771insG'
q_i2 = 'EGFR p.E746_A750delELREA'def get_id_lst(query_info):  # 获取查询内容的信息ID"""query_info: 'GeneName ProChg', eg: 'EGFR p.Gly719Ala'"""term_handle = Entrez.esearch(db=QUERY_DB, term=query_info, retmax=MAX_TERM)records = Entrez.read(term_handle) # records.keys()id_lst = records['IdList']return id_lstdef query_info(id_list):handle = Entrez.efetch(db=QUERY_DB, id=id_list, rettype="variation", retmode="xml")handle_info = handle.read()handle.close()return handle_infodef report_info_dict(handle_query_xml):reports_dict = OrderedDict()soup = BeautifulSoup(handle_query_xml, 'xml')  # lxmlreports = soup.find_all('VariationReport')if not reports:return {}for report in reports:var_id = report.attrs['VariationID']var_name = report.attrs['VariationName']gene_name = report.Gene.attrs['Symbol']var_type = report.VariantType.stringif report.ProteinChange:pro_change_abbr = report.ProteinChange.stringelse:pro_change_abbr = "NotFound ProteinChange"locations = report.find_all('SequenceLocation')for seq_loc in locations:if seq_loc.attrs['Assembly'] == REF_VS:chrom = seq_loc.attrs['Chr']start = seq_loc.attrs['display_start']  # display_start; innerStart, startstop = seq_loc.attrs['display_stop']reports_dict[var_id] = [var_name, var_type, gene_name, pro_change_abbr, chrom, start, stop]return reports_dictdef simple_report_dict(gene, pro_change):query_item = str(gene) + ' ' + str(pro_change)ids = get_id_lst(query_item)h_q_info = query_info(ids)reports_d = report_info_dict(h_q_info)return reports_dprint(simple_report_dict('EGFR', 'p.Gly719Ala'))

<完>