Ultrafast search of all deposited bacterial and viral genomic data

Nature Biotechnology [IF 35.724]

DOI: 10.1038/s41587-018-0010-1

Original article: https://www.nature.com/articles/s41587-018-0010-1

First author: Phelim Bradley

Corresponding author: Zamin Iqbal

Other authors: Henk C den Bakker, Eduardo P C Rocha, Gil McVean

Main affiliation: Wellcome Trust Centre for Human Genetics, University of Oxford

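Foreword

Nature Biotechnology (NBT, IF 35.7) published eight research papers in its February 2019 issue (https://www.nature.com/nbt/volumes/37/issues/2): three Letters, three Articles, and two Resources. Four of them (two Articles and two Resources) report advances in metagenomics research. The paper on ultrafast bacterial genome search is the cover article of the issue.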

A brief introduction to all four papers is given in the post "NBT: four IF-35 papers in the new year focus on metagenomics research".

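This paper comes from the team of Professor Zamin Iqbal at the Wellcome Trust Centre for Human Genetics, University of Oxford, and reports a breakthrough in ultrafast search of microbial genomic data. The method makes it possible to integrate, update, and rapidly index bacterial and viral genomes from around the world, and the new index uses four orders of magnitude less storage than previous methods. As the cover paper of this issue of Nature Biotechnology, it is recommended to readers.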

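Highlights (热心肠日报)

① Bacterial and viral genome data are growing exponentially worldwide. They matter for metagenomic analysis and epidemiological surveillance, but searching them is very difficult.

② The paper proposes a computational approach that combines population genomics with web search techniques to build a quickly searchable data structure, the BItsliced Genomic Signature Index (BIGSI).

③ Indexing all of the roughly 440,000 bacterial and viral genomes currently deposited worldwide takes four orders of magnitude less storage than previous methods, and the index supports applications such as rapid gene search and quantification.

④ The index scales well and can grow to millions of datasets.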

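The growth of biological big data creates challenges for storage and search. The new genome indexing scheme described in Nature Biotechnology makes all of the roughly 440,000 bacterial and viral genomes currently deposited searchable while using four orders of magnitude less space than previous methods.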

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.

Fig. 1 | Sequence matching methods.

a, Mapping of sequence reads to a reference genome from the same species, assuming relatively low divergence; requirement to map millions of reads in acceptable time and return an alignment and mapping score. Common tools: bwa and bowtie.

b, BLAST compares a query string with a database of reference genomes (in the figure we show RefSeq genomes in a dotted box) covering a massive phylogenetic range. BLAST takes k-mers from the query, and for each k-mer it creates a ‘neighborhood’ of k-mers within a fixed edit distance (edits are shown in red, b(iii)), and searches for these in the reference genome database. Alignment is only done by extending from these hits. BLAST can be applied to nucleotide and protein searches and can find close and remote homology matches.

c, MASH stores a tiny fingerprint of each reference in the database (in this case RefSeq). Querying with an assembly, the fingerprint of the assembly is compared with that of RefSeq to find the closest reference.

d, Sequence Bloom Tree (ref. 13) was the first scalable method to search through raw unassembled readsets (unassembled readsets are shown as ‘piles’ of reads (short lines), all in same color to signify same species), by indexing the k-mers in the data and then compressing the index. Designed for human data, SBT can be applied to find which RNA-seq datasets contain a given transcript.

e, BIGSI can search the complete set of raw sequence data for bacteria and viruses. RefSeq is shown in a dotted box amongst unassembled readsets; different colors to signify the massive range of species and phyla. The different input data for SBT and BIGSI mean that these methods have different speed and compression trade-offs.
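Panel c describes MASH as storing a tiny fingerprint of each reference. As a rough illustration of how such fingerprints can be compared, the sketch below builds a bottom-s MinHash sketch, the technique Mash is generally described as using. The function names, the BLAKE2b hash, k = 5, the sketch size of 16, and the toy sequences are all assumptions made for this example; none of them are Mash's actual parameters or code.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=5, size=16):
    """Bottom-s sketch: the `size` smallest k-mer hash values, sorted."""
    return sorted(kmer_hash(km) for km in kmers(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Toy "reference database" (made-up sequences) and a query taken from ref_a.
references = {
    "ref_a": "ACGTACGTTGCAAGCTTAGGCTAACGT",
    "ref_b": "TTTTGGGGCCCCAAAATTTTGGGGCCC",
}
query = references["ref_a"][:20]
closest = max(
    references,
    key=lambda name: jaccard_estimate(sketch(query), sketch(references[name])),
)
print(closest)
```

Only the small sketches are compared at query time, which is why a fingerprint of this kind stays tiny no matter how long the genome is.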

Fig. 2 | BIGSI encoding.

a, In step 1, each input dataset (raw sequence data in FASTQ format or assembly) is converted to a non-redundant list of k-mers (with an optional de-noising step to remove sequencing errors, detailed in Methods). A fixed set of η hash functions (h1, h2,…) is applied to each k-mer (η = 3 in this figure), giving a tuple of positions which are all set to 1 in a bit-vector (a Bloom filter).

b, In step 2, each dataset is stored as a fixed-length Bloom filter, as a column in a rectangular matrix. To query the BIGSI for k-mer AAT, the η hash functions are applied to the query k-mer, returning η rows to be checked (namely 3,7,5 here). All columns (datasets) that have 1 in all of those η rows contain the query k-mer; these rows that are checked are called “bitslices.” A hash, mapping from row index to corresponding bit-array is stored to allow fast, i.e., O(1), access to each row when needed. Adding a new dataset requires adding a new column.

c, Naïve encoding is shown to contrast with the BIGSI approach. A complete list of all k-mers in all datasets form the rows of a large matrix, and columns are datasets. For any given k-mer, entries are set to one for datasets containing that k-mer. When a new dataset is added, the matrix grows vertically (new k-mers added) and horizontally (new column for new dataset).
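The encoding in panels a and b can be sketched in a few lines. This is a toy illustration and not the authors' implementation: the class name `BigsiSketch`, the salted BLAKE2b hashes, the 1,024-row filter, and the use of Python integers as bit-arrays are all assumptions. Each row of the matrix is held as one integer whose bit j belongs to dataset (column) j, so a k-mer query is a bitwise AND over η rows.

```python
import hashlib


class BigsiSketch:
    """Toy bit-sliced index: rows are Bloom filter positions, columns are datasets."""

    def __init__(self, num_rows=1024, num_hashes=3, k=3):
        self.num_rows = num_rows
        self.num_hashes = num_hashes  # eta in the figure
        self.k = k
        self.rows = [0] * num_rows    # rows[r]: bit j is set if dataset j has a 1 in row r
        self.names = []               # column index -> dataset name

    def _positions(self, kmer):
        """Apply the eta hash functions to a k-mer and return eta row indices."""
        positions = []
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            positions.append(int.from_bytes(digest, "big") % self.num_rows)
        return positions

    def add(self, name, sequence):
        """Adding a dataset only adds a new column (one more bit per row)."""
        column = len(self.names)
        self.names.append(name)
        for i in range(len(sequence) - self.k + 1):
            for row in self._positions(sequence[i:i + self.k]):
                self.rows[row] |= 1 << column

    def query_kmer(self, kmer):
        """Return datasets whose column has a 1 in every checked row (bitslice)."""
        hits = (1 << len(self.names)) - 1
        for row in self._positions(kmer):
            hits &= self.rows[row]
        return [n for j, n in enumerate(self.names) if (hits >> j) & 1]


index = BigsiSketch()
index.add("sample_1", "ACGTAAT")
index.add("sample_2", "GGGCCC")
print(index.query_kmer("AAT"))
```

Because each column is a Bloom filter, a query never misses a dataset that contains the k-mer, but it can occasionally report a dataset that does not. The number of rows stays fixed as datasets are added, which is the contrast with the naïve encoding in panel c, where the matrix also grows vertically.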

Fig. 3 | Speed and space trade-offs as index grows.

Benchmarking of BIGSI against fast and small versions of each of SBT and SSBT, using a set of 2,157 antimicrobial resistance genes as a query dataset. We performed an inexact search (T = 40%) and show query speed versus peak disk size when searching databases of sizes from 10–10,000 microbial datasets. Both axes are on a log scale; the diameter of a dot represents the number of datasets indexed. To compare two methods, it is necessary to compare dots of the same size. The ideal method would produce dots toward the bottom left. For database sizes greater than 2,000, we were unable to build the SBT-fast or SSBT-fast, as their uncompressed disk usage exceeded available space; triangles signify estimated values based on a calculated lower bound for disk use (as k-mer content is known) and extrapolated query times (Methods).
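The caption does not define T. The sketch below assumes it is the minimum fraction of the query's k-mers that must be found in a dataset for that dataset to be reported, which is how threshold searches over k-mer indexes are usually described. Plain Python sets stand in for the index, and the function names, k = 4, and the toy sequences are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def inexact_search(query, datasets, k=4, threshold=0.4):
    """Report datasets containing at least `threshold` of the query k-mers.

    Returns a dict mapping dataset name to the fraction of query k-mers found.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        raise ValueError("query is shorter than k")
    results = {}
    for name, sequence in datasets.items():
        fraction = len(query_kmers & kmers(sequence, k)) / len(query_kmers)
        if fraction >= threshold:
            results[name] = fraction
    return results


# Made-up datasets: the query differs from dataset_a in its last base only.
datasets = {"dataset_a": "ACGTACGTAC", "dataset_b": "TTTTTTTTTT"}
hits = inexact_search("ACGTACGG", datasets)
print(hits)
```

With threshold=1.0 the same function performs an exact search, since every query k-mer must then be present.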

Fig. 4 | Phylogenetic distribution of plasmid sequences.

Plasmid sequences (37) that are found at least five times in more than one genus in the all-microbial index are shown. The heat map shows the frequency of each plasmid in each genus. The plasmids and genera were hierarchically clustered using the UPGMA algorithm and Euclidean distance metric. The plasmid on the left (AF012911) with extremely wide phylogenetic distribution is a known cloning vector. The large amount of sharing between Escherichia, Salmonella and Shigella is consistent with known promiscuity within Enterobacteriaceae.
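For readers unfamiliar with UPGMA, the following is a minimal pure-Python sketch of average-linkage clustering with Euclidean distances, the combination named in the caption. It is not the code used for the figure. The frequency vectors are invented, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def upgma(vectors):
    """Cluster labelled vectors by average linkage; return a nested-tuple tree."""
    # Each cluster is (tree, member labels); start with one cluster per label.
    clusters = [(label, [label]) for label in vectors]

    def average_distance(members_a, members_b):
        total = sum(
            euclidean(vectors[x], vectors[y]) for x in members_a for y in members_b
        )
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: average_distance(
                clusters[pair[0]][1], clusters[pair[1]][1]
            ),
        )
        tree_i, members_i = clusters[i]
        tree_j, members_j = clusters[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


# Invented plasmid frequencies per genus (one vector entry per plasmid).
genus_profiles = {
    "Escherichia": [10, 8, 0],
    "Shigella": [9, 7, 0],
    "Staphylococcus": [0, 0, 5],
}
tree = upgma(genus_profiles)
print(tree)
```

A heat map clustered on both axes applies the same procedure twice: once to the genus vectors, as here, and once to the plasmid vectors of the transposed table.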

Fig. 5 | Plasmid spread and antibiotic resistance genes.

A comparison of phylogenetic spread (median of pairwise distances between all pairs of genera in which a plasmid is seen) of plasmids containing at least three antibiotic resistance genes (n = 98, purple) with those bearing none (n = 665, peach) is shown. Histograms are normalized to allow comparison (probability densities). The 95% quantiles of the distributions are 1.11 (no resistance genes) and 1.99 (≥ 3 genes). Distance measured on the large-subunit rRNA tree from SILVA. Units of phylogenetic spread are substitutions per site; it is possible to have a distance > 1 because it is measured up to the common ancestor and back down again.
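The spread statistic defined in the caption, the median of pairwise distances between all genera in which a plasmid is seen, can be written down directly. In the sketch below the distance values are invented placeholders rather than SILVA tree distances, the function name is hypothetical, and returning 0.0 for a plasmid seen in a single genus is an assumption.

```python
from itertools import combinations
from statistics import median


def phylogenetic_spread(genera, distance):
    """Median tree distance over all pairs of genera carrying a plasmid.

    `distance` maps frozenset({genus_a, genus_b}) to a distance in
    substitutions per site.
    """
    pairs = list(combinations(sorted(set(genera)), 2))
    if not pairs:
        return 0.0  # assumption: a single genus has no spread
    return median(distance[frozenset(pair)] for pair in pairs)


# Placeholder distances, not real values from the rRNA tree.
tree_distance = {
    frozenset({"Escherichia", "Shigella"}): 0.02,
    frozenset({"Escherichia", "Staphylococcus"}): 1.20,
    frozenset({"Shigella", "Staphylococcus"}): 1.25,
}
spread = phylogenetic_spread(
    ["Escherichia", "Shigella", "Staphylococcus"], tree_distance
)
print(spread)
```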

Fig. 6 | Antibiotic resistance gene prevalence in ENA over time.

a, Counts of samples in the all-microbial index containing a range of antibiotic resistance genes; each gene was treated independently, so a single dataset containing both CTX-M and OXA, for example, is counted twice.

b, Year-by-year frequency (defined by date of public availability) in Staphylococcus (dominated by S. aureus) of mecA, and all tet and aac genes, which encode resistance to methicillin, tetracycline and aminoglycosides, respectively.

c, Year-by-year frequency in Klebsiella of various antibiotic resistance genes; increase in prevalence since 2014 may be due to increased extended spectrum β-lactamase surveillance and sampling of KPC-resistant Klebsiella globally.

d, Year-by-year breakdown of M. tuberculosis datasets, classified by genotypes as resistant (R), pan-susceptible (S), multiple-drug resistant (MDR) or extensively drug resistant (XDR), as follows: all datasets were genotyped for variants from the resistance catalog from ref. 27, then classified as resistant or susceptible to 12 antibiotics based on their genotype. Datasets were classed as MDR (multi-drug resistant) if resistant to isoniazid and rifampicin; as XDR (extensively drug-resistant) if MDR and also resistant to a fluoroquinolone and any of capreomycin, kanamycin and amikacin; as resistant if resistant to any antibiotic but not MDR or XDR; and susceptible otherwise. Error bars around the estimated frequency (mean) in b–d show the 95% confidence interval calculated using the Wilson binomial confidence test.
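The classification rules in panel d and the Wilson interval used for the error bars can both be expressed compactly. The sketch below is illustrative only. It assumes per-drug resistance calls are already available as a set of lowercase drug names, treats "fluoroquinolone" as a single class label, and uses z = 1.96 for a 95% interval. The function names are hypothetical.

```python
import math

INJECTABLES = {"capreomycin", "kanamycin", "amikacin"}


def classify(resistant_to):
    """Classify a dataset as XDR, MDR, R or S from its set of resistance calls."""
    calls = set(resistant_to)
    is_mdr = {"isoniazid", "rifampicin"} <= calls
    if is_mdr and "fluoroquinolone" in calls and calls & INJECTABLES:
        return "XDR"
    if is_mdr:
        return "MDR"
    if calls:
        return "R"  # resistant to something, but neither MDR nor XDR
    return "S"      # no resistance calls: pan-susceptible


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% when z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denominator = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denominator
    return centre - half_width, centre + half_width


print(classify({"isoniazid", "rifampicin", "fluoroquinolone", "amikacin"}))
print(wilson_interval(5, 10))
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] even when a gene is absent or nearly absent in a given year, which makes it a reasonable choice for yearly frequencies based on few datasets.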

Reference

Bradley P, den Bakker HC, Rocha EPC, et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat. Biotechnol. 2019, 37: 152-159.
