Advances in high-throughput genome sequencing have made relatively inexpensive genome projects possible for almost any organism. Projects such as the ‘Genome 10K Project’, which aims to sequence 10,000 vertebrate genomes [1], and the ‘Bird 10K’ project, which aims to sequence 10,500 bird species [2], have accelerated the production of draft genome sequences. Although attempts have been made to establish standards for declaring a genome sequence ‘complete’ [3], the quality of draft genomes varies dramatically. The limitations of using these draft genomes for downstream analyses have been documented [4, 5]. Still, it is clear that such draft genomes will continue to be the basis for genetic research on many species for the foreseeable future.

Short read sequencing technologies are appealing because the cost per base is relatively low [6]. However, short reads (up to several hundred bp) make de novo assembly more difficult when the genome contains repeats that exceed the read length, which is typical of even relatively small genomes [7]. In addition, sequencing coverage biases caused by variation in base composition and PCR amplification further complicate the task of the assembler [8, 9]. Many molecular biology and computational techniques have been developed to circumvent the problems associated with short read length while keeping the cost of genome sequencing projects low. One such technique is the use of paired-end and mate-pair jumping libraries. Its power was demonstrated when a usable human draft genome assembly was produced using a combination of differently sized short read jumping libraries (180 bp to 40 kb) with the ALLPATHS-LG assembler [10].

The Assemblathon2 contest was organized as a friendly competition to assess current methods and evaluate the state of genome assembly by providing datasets of primarily short reads for three different vertebrate genomes. Assemblathon2 demonstrated that there was considerable variability among submitted assemblies and still plenty of room for improvement [11]. One of the three species used in Assemblathon2 was the Lake Malawi cichlid fish, Metriaclima zebra. African cichlid fish are an ideal system for studying evolutionary mechanisms due to their phenotypic diversity and rapid speciation [12]. Draft genomes of M. zebra and four other African cichlid fish were recently published [13]. According to most assembly metrics, this M. zebra draft assembly (‘M_zebra_v0’) was among the best entries submitted to Assemblathon2. However, our extensive use of this assembly has revealed problems with gene models in or near assembly gaps, misassemblies encountered during the course of chromosome walks, and spurious spikes in differentiation statistics near gap and scaffold edges. These problems are not unique to this genome project, and they complicate the use of many other draft genomes.

To improve the M. zebra draft assembly, we generated a 16.5× coverage set of Pacific Biosciences SMRT (Single Molecule, Real-Time) sequencing reads. These ‘long’ PacBio reads can be used to improve draft assemblies by spanning gaps around repetitive regions and joining contigs and scaffolds [14]. Here we set out to improve the M_zebra_v0 genome assembly, both to create a better reference assembly for the cichlid research community and to explore the improvements made possible by adding 16.5× coverage of PacBio reads to even a relatively good draft vertebrate genome assembly.
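As a rough illustration of what a figure such as 16.5× means, average sequencing depth is the total number of sequenced bases divided by the genome size. The sketch below is a minimal Python example. The function name `fold_coverage`, the read lengths, and the genome size are hypothetical values chosen for illustration and are not taken from the M. zebra data set.

```python
def fold_coverage(read_lengths, genome_size):
    """Return average sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# Hypothetical toy numbers (not from the M. zebra project):
# 33 reads of 5,000 bp each, sequenced from a 10,000 bp genome.
toy_reads = [5_000] * 33
toy_genome_size = 10_000

depth = fold_coverage(toy_reads, toy_genome_size)
print(f"{depth:.1f}x")  # prints "16.5x"
```

Depth computed this way is an average. Coverage at any individual position can be much lower or higher, which is one reason gaps remain even in well-covered assemblies.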

Pacific Biosciences SMRT sequencing

The Qiagen MagAttract HMW DNA kit was used to extract high-molecular-weight DNA from a nucleated blood cell sample from a new individual from the same population used for the Broad Institute sequencing project. Size selection was performed at the University of Maryland Genomics Resource Center using a Blue Pippin pulsed-field gel electrophoresis instrument. A library was constructed and 24 SMRT cells were sequenced on their PacBio RS II using the P5-C3 chemistry.

Proovread error correction

Proovread is a hybrid error correction pipeline for correcting PacBio SMRT reads using short read data [18]. This step is important as the raw PacBio subreads are only ~85 % accurate [19] and contain chimeric reads at a rate of 1–2 % [20].
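At roughly 85 % accuracy, about 15 of every 100 bases in a raw subread are wrong, which is why accurate short reads are useful as a correcting signal. The toy Python sketch below illustrates the general idea of hybrid correction by majority vote. It is not Proovread's algorithm: the function `correct_by_majority`, the `min_support` threshold, and the example sequences are hypothetical, the short-read positions are assumed to be already known, and only substitution errors are handled, whereas real PacBio errors are dominated by insertions and deletions.

```python
from collections import Counter


def correct_by_majority(long_read, placed_short_reads, min_support=2):
    """Toy hybrid correction of a noisy long read.

    placed_short_reads is a list of (offset, sequence) pairs whose offsets
    on the long read are assumed to be known. At each position, the long-read
    base is replaced by the most common short-read base if at least
    min_support short reads agree on it; otherwise it is left unchanged.
    """
    votes = [Counter() for _ in long_read]
    for offset, seq in placed_short_reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < len(long_read):
                votes[pos][base] += 1

    corrected = []
    for original, counter in zip(long_read, votes):
        if counter:
            base, count = counter.most_common(1)[0]
            corrected.append(base if count >= min_support else original)
        else:
            corrected.append(original)
    return "".join(corrected)


# Hypothetical example: the true sequence is ACGTACGTAC and the long read
# carries one substitution error at index 4 (T instead of A).
noisy_long_read = "ACGTTCGTAC"
short_reads = [(0, "ACGTACG"), (2, "GTACGTAC"), (3, "TACGTAC")]
print(correct_by_majority(noisy_long_read, short_reads))  # prints "ACGTACGTAC"
```

Requiring a minimum number of agreeing short reads keeps a single mismatching short read from overwriting a correct long-read base.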

An improved genome reference for the African cichlid, Metriaclima zebra