病原菌基因组快速搜索算法实现

算法来自(Wellcome Trust Centre for Human Genetics, University of Oxford)19年发表在NBT上的一篇文章

这是文章的算法的示意图

以及文章中的说明

这个算法主要基于布隆过滤器(BloomFilter)，如文章中所示，首先我们要拟定几种不同的hash生成器，但是需要注意不能用python的hash()模块，因为hash()在每次重新调用脚本时生成的值都不一样。

import hashlibdef my_hash(k_mer, index_length):tmp_hash = hashlib.md5(k_mer.encode())tmp_hash = tmp_hash.hexdigest()tmp_hash = abs(int(tmp_hash, 16))def hash1(_hash):_tmp = 0for _num in str(_hash):_tmp += int(_num)return _tmp % index_lengthdef hash2(_hash):return _hash % index_lengthdef hash3(_hash):_hash = str(_hash)[:5]return int(_hash) % index_lengthhash_list = [hash1(tmp_hash), hash2(tmp_hash), hash3(tmp_hash)]return hash_list

我们需要把生成的hash值固定在一个范围内，这个参数就由index_length来控制。然后需要一个按kmer长度切割fastq文件的方法，并且把切割后的字符串都进行hash转换

def cut_fastq(fastq, index_length, kmer):kmer_dict = {}for i in range(0, len(fastq)-kmer+1):seds = fastq[i:i+5]if seds not in kmer_dict:kmer_dict[seds] = my_hash(seds, index_length)return kmer_dict

接下来用几个fastq来构建用于索引的表(以字典形式传入几个样本的fastq)，返回pandas的dataframe

def main(fastq_dict, index_length, kmer=5):total_kmer = {}sample_dict = {}for _sample, _fastq in fastq_dict.items():sample_kmer = cut_fastq(_fastq, index_length, kmer)total_kmer.update(sample_kmer)tmp_sample = []for key, value in sample_kmer.items():tmp_sample += valuetmp_sample = list(set(tmp_sample))sample_dict[_sample] = [0 if i not in tmp_sample else 1 for i in range(0, index_length)]return pd.DataFrame(sample_dict)

测试函数:

if __name__ == "__main__":fastq_dic = {"fastq1": "tcgatcgatgcgc","fastq2": "gtcgaaaatcgacg","fastq3": "gtcgtcgagatttt","fastq4": "atcgtgcagggagc","fastq5": "tgtgcacacatcgtatg"}index_length = 60test_kmer = "atttt"index_data = main(fastq_dic, index_length, kmer=5)value1, value2, value3 = my_hash(test_kmer, index_length)output_data = index_data.iloc[value1] & index_data.iloc[value2] & index_data.iloc[value3]match_fastq = output_data[output_data == 1].index.tolist()print(match_fastq)

打印结果为[‘fastq3’]，说明只有fastq3中包含有"atttt"字段。这个算法可以超高速的从大规模细菌/病毒等数据库中检索到对应的条目，同时数据库更新时也只需要竖向添加索引。
但是这个算法的缺点就是容易出现误算，举个例子，比如fastq1中"tcgat"算出的hash值为[1, 2, 3]、"cgatc"算出的hash值为[1, 2, 5]，那么fastq1的索引就是[1, 2, 3, 5]，fastq2中"gtcga"算出的hash值为[2, 3, 5]就会认为fastq1中也存在这个字符串，所以首先需要确保index数目足够多(但是过多会影响效率，最好根据样本容量来确定)；其次不管index有多少，理论上都会出现误匹配，所以拿到样本编号后还需要做进一步的比对。

下面是整体代码：

#!/usr/local/bin/python3
# -*- coding: UTF-8 -*-import pandas as pd
import hashlibdef my_hash(k_mer, index_length):tmp_hash = hashlib.md5(k_mer.encode())tmp_hash = tmp_hash.hexdigest()tmp_hash = abs(int(tmp_hash, 16))def hash1(_hash):_tmp = 0for _num in str(_hash):_tmp += int(_num)return _tmp % index_lengthdef hash2(_hash):return _hash % index_lengthdef hash3(_hash):_hash = str(_hash)[:5]return int(_hash) % index_lengthhash_list = [hash1(tmp_hash), hash2(tmp_hash), hash3(tmp_hash)]return hash_listdef cut_fastq(fastq, index_length, kmer):kmer_dict = {}for i in range(0, len(fastq)-kmer+1):seds = fastq[i:i+5]if seds not in kmer_dict:kmer_dict[seds] = my_hash(seds, index_length)return kmer_dictdef main(fastq_dict, index_length, kmer=5):total_kmer = {}sample_dict = {}for _sample, _fastq in fastq_dict.items():sample_kmer = cut_fastq(_fastq, index_length, kmer)total_kmer.update(sample_kmer)tmp_sample = []for key, value in sample_kmer.items():tmp_sample += valuetmp_sample = list(set(tmp_sample))sample_dict[_sample] = [0 if i not in tmp_sample else 1 for i in range(0, index_length)]return pd.DataFrame(sample_dict)if __name__ == "__main__":fastq_dic = {"fastq1": "tcgatcgatgcgc","fastq2": "gtcgaaaatcgacg","fastq3": "gtcgtcgagatttt","fastq4": "atcgtgcagggagc","fastq5": "tgtgcacacatcgtatg"}index_length = 60test_kmer = "atttt"index_data = main(fastq_dic, index_length, kmer=5)value1, value2, value3 = my_hash(test_kmer, index_length)output_data = index_data.iloc[value1] & index_data.iloc[value2] & index_data.iloc[value3]match_fastq = output_data[output_data == 1].index.tolist()print(match_fastq)

参考文献:
Bradley Phelim, den Bakker Henk C, Rocha Eduardo P C et al. Ultrafast search of all deposited bacterial and viral genomic data[J]. Nat. Biotechnol., 2019, 37: 152-159.

病原菌基因组快速搜索算法实现相关推荐

drep：微生物基因组快速去冗余-文章解读+帮助文档+实战教程
在微生物分离培养.分箱中获得的大量的基因组.宏基因组拼接的基因组(MAG),如何确定到底有多少种非冗余的细菌基因组呢? 来自加州大学伯克利分校Jillian F Banfield组开发的dRep可以帮 ...
drep：微生物基因组快速去冗余-文章解读+帮助文档+实战
在微生物分离培养.分箱中获得的大量的基因组.宏基因组拼接的基因组(MAG),如何确定到底有多少种非冗余的细菌基因组呢? 来自加州大学伯克利分校Jillian F Banfield组开发的dRep可以帮 ...
Kraken2：宏基因组快速物种注释神器
简介 kraken是基于k-mer精确比对,并采用最LCA投票结果快速宏基因组DNA序列进行物种注释的软件. 图. Kraken2分类基本原理该文章于2014年发表于Genome Biology,目 ...
Microbiome：人类肠道和病原菌的可移动抗性组驱动环境中抗生素抗性增长
点击蓝字关注我们编译:张璐校稿:张慧林论文ID 原名:Mobile resistome of human gut and pathogen drives anthropogenic b ...
EST：西湖大学鞠峰组-污水厂病原菌与土著反硝化细菌是多重抗生素耐药基因的活跃表达者...
点击蓝字关注我们编译:祝新宇校稿:鞠峰.袁凌论文ID 原名:Pathogenic and Indigenous Denitrifying Bacteria areTranscriptio ...
你想要的宏基因组-微生物组知识全在这(2023.01)
欢迎点击上方蓝色"宏基因组"关注我们! 宏基因组/微生物组是当今世界科研最热门的研究领域之一,为加强宏基因组学技术和成果交流传播,推动全球华人微生物组领域发展,中科院青年科研人员创 ...
你想要的宏基因组-微生物组知识全在这(2022.12)
欢迎点击上方蓝色"宏基因组"关注我们! 宏基因组/微生物组是当今世界科研最热门的研究领域之一,为加强宏基因组学技术和成果交流传播,推动全球华人微生物组领域发展,中科院青年科研人员创 ...
你想要的宏基因组-微生物组知识全在这(2023.3)
欢迎点击上方蓝色"宏基因组"关注我们! 宏基因组/微生物组是当今世界科研最热门的研究领域之一,为加强宏基因组学技术和成果交流传播,推动全球华人微生物组领域发展,中科院青年科研人员创 ...
你想要的宏基因组-微生物组知识全在这(2023.7)
欢迎点击上方蓝色"宏基因组"关注我们! 宏基因组/微生物组是当今世界科研最热门的研究领域之一,为加强宏基因组学技术和成果交流传播,推动全球华人微生物组领域发展,中科院青年科研人员创 ...

病原菌基因组快速搜索算法实现

病原菌基因组快速搜索算法实现相关推荐

最新文章

热门文章