Comparative assessment of long-read error correction software applied to Nanopore RNA-sequencing data

Abstract
Motivation: Nanopore long-read sequencing technology offers promising alternatives to high-throughput short read
sequencing, especially in the context of RNA-sequencing. However, this technology is currently hindered by high error rates
in the output data that affect analyses such as the identification of isoforms, exon boundaries, open reading frames and
creation of gene catalogues. Due to the novelty of such data, computational methods are still actively being developed and
options for the error correction of Nanopore RNA-sequencing long reads remain limited.
Results: In this article, we evaluate the extent to which existing long-read DNA error correction methods are capable of
correcting cDNA Nanopore reads. We provide an automatic and extensive benchmark tool that not only reports classical
error correction metrics but also the effect of correction on gene families, isoform diversity, bias toward the major isoform
and splice site detection. We find that long read error correction tools that were originally developed for DNA are also
suitable for the correction of Nanopore RNA-sequencing data, especially in terms of increasing base pair accuracy. Yet
investigators should be warned that the correction process perturbs gene family sizes and isoform diversity. This work
provides guidelines on which (or whether) error correction tools should be used, depending on the application type.
Benchmarking software: https://gitlab.com/leoisl/LR_EC_analyser

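In principle, the high error rate of these data complicates the analysis of transcriptomes, especially the accurate detection of exon boundaries and the quantification of similar isoforms and paralogous genes. Reads need to be aligned unambiguously and with high base-pair accuracy to a reference genome or transcriptome. Indels (i.e. insertions/deletions) are the main type of error produced by long-read technologies, and they confuse aligners more than substitution errors do [22]. Many methods exist for correcting errors in RNA-seq reads, most of them dating from the short-read era [23,24]. They no longer apply to long reads because they were designed to handle low error rates consisting mainly of substitutions. However, a new set of methods has been proposed to correct genomic long reads. There are two types of long-read error correction algorithms: those that use information from the long reads only (self-correction or non-hybrid correction), and those that use short reads to correct the long reads (hybrid correction). In this article, we report on the extent to which state-of-the-art tools are able to correct the long, noisy RNA-seq reads produced by Nanopore sequencers.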
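To make the distinction between substitutions and indels concrete, the following minimal Python sketch aligns a read to a reference with a unit-cost global alignment and counts each error type. It is an illustration only: the function names `classify_errors` and `error_rate` are hypothetical, the toy sequences are made up, and real evaluations rely on a dedicated aligner rather than this quadratic-time dynamic program.

```python
def classify_errors(reference: str, read: str) -> dict:
    """Count matches, substitutions, insertions and deletions in `read`
    relative to `reference` using a unit-cost global alignment."""
    n, m = len(reference), len(read)
    # dist[i][j] = edit distance between reference[:i] and read[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion: base missing from the read
                dist[i][j - 1] + 1,         # insertion: extra base in the read
            )

    # Trace back one optimal path and tally the operations on it.
    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == read[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                counts["match" if cost == 0 else "substitution"] += 1
                i -= 1
                j -= 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


def error_rate(counts: dict) -> float:
    """Errors divided by the number of alignment columns."""
    errors = counts["substitution"] + counts["insertion"] + counts["deletion"]
    total = errors + counts["match"]
    return errors / total if total else 0.0


# Toy example (made-up sequences): the read lacks one base of the reference.
deletion_case = classify_errors("ACGTACGT", "ACGACGT")
substitution_case = classify_errors("ACGT", "ACCT")
```

In the first toy case the alignment contains a single deletion, which shifts every downstream base; in the second, a single substitution leaves the coordinates of the remaining bases unchanged. This shift is one intuition for why indel-rich reads are harder to align.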

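Several tools exist for correcting errors in long reads, including ONT reads. Even though the error profiles of Nanopore and PacBio reads differ, their error rates are quite similar, so it is reasonable to expect that tools originally designed for PacBio data will also perform well on recent Nanopore data. To the best of our knowledge, little prior work specifically addresses the error correction of RNA-seq long reads. Notable exceptions include: (i) LSC [25], which is designed to correct PacBio RNA-seq long reads using Illumina RNA-seq short reads; (ii) PBcR [26] and (iii) HALC [27], which were designed mainly for genomes but have also been evaluated on transcriptomic data. Here, we take the stance of evaluating long-read error correction tools on RNA-seq data, even though most of them were designed to process DNA sequencing data.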

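We evaluate the following DNA hybrid correction tools: HALC [27], LoRDEC [28], NaS [29], PBcR [26] and proovread [30];

and the following DNA self-correction tools: Canu [31], daccord [32], LoRMA [33], MECAT [34] and pbdagcon [35].

We also evaluate one additional hybrid tool, LSC [25], which is the only one specifically designed to correct (PacBio) RNA-seq long reads.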

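Most hybrid correction methods use a mapping strategy: they place short reads on the long reads and correct each long-read region using the short-read sequences mapped to it. Some of them, however, rely on graphs to build a consensus that is then used for correction. These graphs are either k-mer graphs (de Bruijn graphs) or nucleotide graphs resulting from multiple sequence alignment (partial order alignment). For self-correction methods, strategies based on these graphs are the most common. We also considered evaluating nanocorrect [36], nanopolish [36], Falcon_sense [37] and LSCPlus [38], but these tools were either deprecated, not suitable for read correction, or unavailable. Our detailed justification can be found in Section S1.12 of the Supplementary Material. We selected what we believe is a representative set of tools, but other tools exist that were not considered in this study, such as HG-CoLoR [39], HECIL [40], MIRCA [41], Jabba [42], nanocorr [43] and Racon [44].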
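The k-mer graph idea can be sketched in a few lines of Python. The example below is a simplified illustration under stated assumptions, not the implementation of any tool listed above: the function names, the value of `k` and the `min_count` threshold are hypothetical, and the reads are toy sequences. It counts k-mers in the short reads, keeps the frequent ("solid") ones as edges of a de Bruijn graph, and flags the positions of a long read whose k-mers are absent from that set. A graph-based corrector would then search the graph for a path to replace each flagged region; that step is omitted here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(kmer_counts, min_count):
    """Build a de Bruijn graph from solid k-mers.

    Nodes are (k-1)-mers; each solid k-mer adds an edge from its
    prefix to its suffix."""
    graph = {}
    for kmer, count in kmer_counts.items():
        if count >= min_count:
            graph.setdefault(kmer[:-1], set()).add(kmer[1:])
    return graph


def weak_positions(long_read, kmer_counts, k, min_count):
    """Start positions of long-read k-mers that are not solid,
    i.e. that are likely to overlap a sequencing error."""
    return [
        i
        for i in range(len(long_read) - k + 1)
        if kmer_counts[long_read[i:i + k]] < min_count
    ]


# Toy data (made up): short reads from an error-free repeat of "ACGT",
# and a long read in which one base has been altered.
short_reads = ["ACGTAC", "CGTACG", "GTACGT"]
k = 3
kmer_counts = count_kmers(short_reads, k)
graph = build_de_bruijn(kmer_counts, min_count=2)
weak = weak_positions("ACGTTCGT", kmer_counts, k, min_count=2)
```

A single erroneous base typically invalidates up to `k` consecutive k-mers, which is why weak k-mers cluster around each error in this sketch.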

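Other works have evaluated error correction tools in the context of DNA sequencing. LRCstats [45] and, more recently, ELECTOR [46] provide automated evaluations of genomic long-read correction using a simulation framework. A technical report [47] provides an extensive evaluation of PacBio/Nanopore error correction tools. This analysis is complemented by the more recent results on hybrid correction methods in [48]. Perhaps the closest work to ours is the AlignQC software [21], which provides a set of metrics for assessing the quality of RNA-sequencing long-read datasets. In [21], Nanopore and PacBio RNA-sequencing datasets are compared in terms of error patterns, isoform identification and quantification. Although [21] did not compare error correction tools, we use and extend the AlignQC metrics for this purpose.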
