Ratatosk - Hybrid error correction of long reads enables accurate variant calling and assembly

Date: 15th July 2020 | Source: bioRxiv

Authors: Guillaume Holley, Doruk Beyter, Helga Ingimundardottir, Snædis Kristmundsdottir, Hannes Pétur Eggertsson, Bjarni Halldorsson.

Motivation
Long Read Sequencing (LRS) technologies are becoming essential to complement Short Read Sequencing (SRS) technologies for routine whole genome sequencing. LRS platforms produce DNA fragment reads thousands to millions of bases long, allowing the resolution of numerous uncertainties left by SRS reads in genome reconstruction and analysis. In particular, LRS characterizes long and complex structural variants that SRS cannot detect because of its short read length. Furthermore, assemblies produced with LRS reads are considerably more contiguous than those produced with SRS reads, and they span previously inaccessible telomeric and centromeric regions. However, a major challenge to the adoption of LRS reads is their error rate of up to 15%, much higher than that of SRS reads, which introduces obstacles in downstream analysis pipelines.

Results
We present Ratatosk, a new error correction method for erroneous long reads based on a compacted and colored de Bruijn graph built from accurate short reads. Short and long reads color paths in the graph while vertices are annotated with candidate Single Nucleotide Polymorphisms. Long reads are subsequently anchored to the graph using exact and inexact k-mer matches to find paths corresponding to corrected sequences.
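The anchoring idea described above can be illustrated with a deliberately simplified sketch. It is not Ratatosk's implementation: it uses a plain, uncompacted and uncolored k-mer graph, exact k-mer matches only, no SNP annotation, and a breadth-first search that picks a shortest path between anchors. All function names (`build_kmer_graph`, `find_anchors`, `path_between`, `correct`) and the `slack` parameter are hypothetical and chosen for illustration.

```python
from collections import deque


def build_kmer_graph(short_reads, k):
    """Collect every k-mer of the short reads; these are the graph's vertices."""
    kmers = set()
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmer, kmers):
    """Vertices that overlap `kmer` by its last k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def find_anchors(long_read, kmers, k):
    """Positions of the long read whose k-mer occurs exactly in the graph."""
    return [i for i in range(len(long_read) - k + 1)
            if long_read[i:i + k] in kmers]


def path_between(src, dst, kmers, max_len):
    """Breadth-first search; returns the sequence spelled by a shortest
    path from src to dst, or None if none is found within max_len bases."""
    queue = deque([(src, src)])
    seen = {src}
    while queue:
        node, seq = queue.popleft()
        if node == dst:
            return seq
        if len(seq) >= max_len:
            continue
        for nxt in successors(node, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + nxt[-1]))
    return None


def correct(long_read, kmers, k, slack=5):
    """Replace the stretch between consecutive anchors with a graph path."""
    anchors = find_anchors(long_read, kmers, k)
    if len(anchors) < 2:
        return long_read  # nothing to anchor on; leave the read untouched
    out = long_read[:anchors[0] + k]
    for a, b in zip(anchors, anchors[1:]):
        raw = long_read[a + k:b + k]  # bases following the anchor at a
        src, dst = long_read[a:a + k], long_read[b:b + k]
        path = None
        if b - a > 1 and src != dst:
            path = path_between(src, dst, kmers, k + (b - a) + slack)
        # Keep the raw bases when anchors are adjacent or no path exists.
        out += path[k:] if path else raw
    return out + long_read[anchors[-1] + k:]


# Toy data: three short reads covering a 16-base reference.
short_reads = ["ATGCGTACC", "GTACCTGAAG", "CTGAAGTC"]
graph = build_kmer_graph(short_reads, 4)
print(correct("ATGCGTAGCTGAAGTC", graph, 4))  # long read with one substitution
```

On this toy input the substituted base falls between two anchors, and the path through the short-read k-mers restores the reference sequence. A real corrector must additionally choose among many candidate paths, which is where the coloring and inexact matches mentioned above come in.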

We demonstrate that Ratatosk can reduce the raw error rate of Oxford Nanopore reads 6-fold on average, with a median error rate as low as 0.28%. Ratatosk-corrected data maintain nearly 99% accurate SNP calls and increase indel call accuracy by up to about 40% compared to the raw data. An assembly of the Ashkenazi individual HG002 created from Ratatosk-corrected Oxford Nanopore reads yields a contig N50 of 43.22 Mbp and outperforms high-quality LRS assemblies using PacBio HiFi reads.

In particular, the assembly of Ratatosk-corrected reads contains about 2.5 times fewer errors than an assembly created from PacBio HiFi reads.
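For readers unfamiliar with the contig N50 metric quoted above, here is a minimal sketch of how it is commonly computed: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` and the example lengths are illustrative and are not taken from the paper.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half
    of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


# Made-up contig lengths: total is 100, and 40 + 30 = 70 reaches half of it.
print(n50([10, 20, 30, 40]))  # 30
```

A higher N50 means that fewer, longer contigs make up most of the assembly, which is why it is used as a measure of contiguity.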

Availability: https://github.com/DecodeGenetics/Ratatosk.

