An improved genome reference for the African cichlid, Metriaclima zebra

Advances in high-throughput genome sequencing have made relatively inexpensive genome projects feasible for almost any organism. Projects such as the ‘Genome 10K Project’, which aims to sequence 10,000 vertebrate genomes [1], and the ‘Bird 10K’ project, which aims to sequence 10,500 bird species [2], have accelerated the production of draft genome sequences. Although attempts have been made to establish standards for declaring a genome sequence ‘complete’ [3], the quality of draft genomes varies dramatically. The limitations of using these draft genomes for downstream analyses have been documented [4, 5]. Still, it is clear that such draft genomes will remain the basis for genetic research on many species for the foreseeable future.

Short-read sequencing technologies are appealing because the cost per base is relatively low [6]. However, short reads (up to several hundred bp) make de novo assembly more difficult when the genome contains repeats that exceed the read length, which is typical of even relatively small genomes [7]. In addition, sequencing coverage biases caused by variation in base composition and PCR amplification further complicate the task of the assembler [8, 9]. Many molecular biology and computational techniques have been developed to circumvent the problems associated with short read length while keeping the cost of genome sequencing projects low. One technique is the use of paired-end and mate-pair jumping libraries. The power of this technique was demonstrated when a usable human draft genome assembly was produced using a combination of differently sized short-read jumping libraries (180 bp to 40 kb) with the ALLPATHS-LG assembler [10].
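The difficulty that repeats pose for short reads can be illustrated with a toy sketch. The sequences, segment names, and the `read_spectrum` helper below are hypothetical and chosen only for illustration; the sketch assumes error-free reads sampled from every start position on one strand, which real data never provide. It builds two different "genomes" that share the same repeat copies but order the unique segments between them differently, then compares the reads each would yield.

```python
from collections import Counter


def read_spectrum(genome: str, read_len: int) -> Counter:
    """Multiset of all error-free reads of length read_len, one per start position."""
    return Counter(
        genome[i:i + read_len] for i in range(len(genome) - read_len + 1)
    )


# Hypothetical toy sequences: one 8 bp repeat and four unique segments.
REPEAT = "ACGTACGT"
SEG_A, SEG_B, SEG_C, SEG_D = "GGTTG", "CCCCC", "AAAAA", "TTGGT"

# Two different genomes: only the order of the segments between repeat copies differs.
genome_1 = SEG_A + REPEAT + SEG_B + REPEAT + SEG_C + REPEAT + SEG_D
genome_2 = SEG_A + REPEAT + SEG_C + REPEAT + SEG_B + REPEAT + SEG_D

# Reads too short to cover the repeat plus one base on each side.
short_len = len(REPEAT) + 1
# Reads just long enough to span the repeat together with both flanking bases.
long_len = len(REPEAT) + 2

same_with_short = read_spectrum(genome_1, short_len) == read_spectrum(genome_2, short_len)
same_with_long = read_spectrum(genome_1, long_len) == read_spectrum(genome_2, long_len)

print(same_with_short, same_with_long)
```

In this toy case the short reads are identical for both genomes, so no assembler could tell the two arrangements apart from those reads alone, whereas reads only slightly longer than the repeat already distinguish them. Jumping libraries and long reads address the same ambiguity by linking the unique sequence on either side of a repeat.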

The Assemblathon2 contest was organized as a friendly competition to assess current methods and evaluate the state of genome assembly by providing datasets of primarily short reads for three different vertebrate genomes. Assemblathon2 demonstrated substantial variability among the submitted assemblies and considerable room for improvement [11]. One of the three species used in Assemblathon2 was the Lake Malawi cichlid fish, Metriaclima zebra. African cichlid fish are an ideal system for studying evolutionary mechanisms because of their phenotypic diversity and rapid speciation [12]. Draft genomes of M. zebra and four other African cichlid fish were recently published [13]. According to most assembly metrics, this M. zebra draft assembly (‘M_zebra_v0’) was among the best entries submitted to Assemblathon2. However, our extensive use of this assembly has revealed problems with gene models in or near assembly gaps, misassemblies encountered during chromosome walks, and spurious spikes in differentiation statistics near gaps and scaffold edges. These problems are not unique to this genome project, and they complicate the use of many other draft genomes.

To improve the M. zebra draft assembly, we generated 16.5× coverage of Pacific Biosciences SMRT (Single Molecule, Real-Time) sequencing reads. These ‘long’ PacBio reads can be used to improve draft assemblies by spanning gaps around repetitive regions and joining contigs and scaffolds [14]. Here we set out to improve the M_zebra_v0 genome assembly, both to create a better reference assembly for the cichlid research community and to explore the improvements made possible by adding 16.5× coverage of PacBio reads to even a relatively good draft vertebrate genome assembly.

Pacific Biosciences SMRT sequencing

The Qiagen MagAttract HMW DNA kit was used to extract high-molecular-weight DNA from a nucleated blood cell sample taken from a new individual of the same population used for the Broad Institute sequencing project. Size selection was performed at the University of Maryland Genomics Resource Center using a Blue Pippin pulsed-field gel electrophoresis instrument. A library was constructed and 24 SMRT cells were sequenced on the center's PacBio RS II using the P5-C3 chemistry.

Proovread error correction

Proovread is a hybrid error correction pipeline that corrects PacBio SMRT reads using short-read data [18]. This step is important because the raw PacBio subreads are only ~85 % accurate [19] and contain chimeric reads at a rate of 1–2 % [20].
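To give a rough sense of what these rates mean in practice, the sketch below works through the arithmetic. It is not part of Proovread; the function names, the 10 kb read length, the 20 bp window, and the read count are hypothetical values chosen for illustration, and the model assumes that errors fall independently and uniformly along a read, which is a simplification.

```python
def expected_errors(read_length: int, accuracy: float) -> float:
    """Expected number of erroneous bases in a read (independent-error assumption)."""
    return read_length * (1.0 - accuracy)


def error_free_window_probability(window: int, accuracy: float) -> float:
    """Probability that a stretch of `window` consecutive bases contains no error."""
    return accuracy ** window


def expected_chimeras(n_reads: int, chimera_rate: float) -> float:
    """Expected number of chimeric reads among n_reads."""
    return n_reads * chimera_rate


# ~85 % accuracy and 1-2 % chimeras are the figures quoted in the text;
# every other number here is a hypothetical input.
errors = expected_errors(read_length=10_000, accuracy=0.85)
clean_20mer = error_free_window_probability(window=20, accuracy=0.85)
chimeras_low = expected_chimeras(n_reads=100_000, chimera_rate=0.01)
chimeras_high = expected_chimeras(n_reads=100_000, chimera_rate=0.02)

print(f"expected errors in a 10 kb read: {errors:.0f}")
print(f"chance a 20 bp window is error-free: {clean_20mer:.3f}")
print(f"expected chimeras per 100,000 reads: {chimeras_low:.0f} to {chimeras_high:.0f}")
```

Under these assumptions a raw 10 kb read carries on the order of 1,500 errors, and only a few percent of 20 bp windows are error-free, which is why correcting the reads with accurate short-read data before using them for gap filling and scaffolding is worthwhile.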
