Evaluation of Error-Correction Algorithms and Splice Site Identification Based on PacBio Sequencing Data

Abstract
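The emergence and development of high-throughput sequencing technology has given rise to
many large-scale genome sequencing projects, such as the international 1000 Genomes
Project, the UK10K project in the United Kingdom, and China's million-person genome
sequencing program. These projects have sequenced, or plan to sequence, millions of
individuals, so the volume of sequence data is growing exponentially. Genome sequencing
provides detailed basic data for studying human genetic information, interpreting gene
function and its association with various diseases, and analyzing the pathogenesis of
human diseases. Against this background, this paper takes PacBio, currently the newest
sequencing platform, as its subject. It focuses on error-correction algorithms that
address the high error rate of third-generation sequencing reads, and it designs and
implements a high-performance splice site prediction method based on DNA sequencing data.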
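Read error correction and splice site identification are two key modules in the analysis
of PacBio sequencing data. To address the inherently high base-calling error rate of
PacBio sequencing, this paper first analyzes the principles and underlying algorithmic
architectures of all the error-correction tools, and systematically compares and
evaluates all currently available self-correction and hybrid correction tools. Then, for
the problem of identifying splice sites in new eukaryotic genes, it integrates multiple
feature generation methods and uses machine learning to explore the macroscopic and
microscopic regularities of gene splicing, with the aim of predicting splice sites
accurately. The main research content and contributions are as follows: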
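First, for PacBio error-correction tools, building on an analysis of their correction
principles, this paper proposes a systematic method for evaluating them and designs a
large number of experiments that apply the existing self-correction and hybrid correction
tools uniformly to public PacBio datasets of E. coli and yeast at different sequencing
depths. The experimental results show that almost no tool performs well on all metrics
covering performance, efficiency, and impact on downstream analysis; each tool has its own
strengths, weaknesses, and suitable sequencing depths. Finally, the optimal selection
strategy is given for each correction tool, together with tool selection schemes for
datasets of different sequencing depths. The metrics in this paper can serve as a basis
for users choosing a suitable correction tool, and they point out directions for the
development of new tools.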
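Second, for the problem of detecting genomic splice sites, this paper proposes a machine
learning identification method based on multi-feature extraction that exploits the
sequence pattern information near canonical splice sites. The method first applies
several gene sequence feature generation methods to capture the sequence pattern
information near canonical donor and acceptor splice sites. It then models the splicing
patterns with machine learning methods such as random forests and support vector
machines, and uses these models to distinguish true splice sites from false ones. The
experimental results show that multi-feature machine learning identifies splice sites
with relatively high accuracy, with the AUC reaching up to 0.904 for donor and acceptor
site identification. The method can efficiently help researchers accurately detect true
splice sites and other related functional sites in the genome, and it can advance the
annotation of new genes and a clear understanding of gene coding regions and structure.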
Keywords: PacBio; self-correction; hybrid correction; splice site; machine learning

Evaluation of error-correction algorithms and identification
of splice sites based on PacBio sequencing data
Abstract
The emergence and development of high-throughput sequencing technology has led to many
large-scale genome sequencing projects, such as the international 1000 Genomes Project,
the UK10K project, and China's million-person genome sequencing project. These projects
have sequenced, or plan to sequence, millions of individuals, leading to exponential
growth in the amount of sequencing data. Genome sequencing provides detailed basic data
for studying human genetic information, interpreting gene function and its association
with various diseases, and analyzing the pathogenesis of human diseases. In this context,
this paper takes PacBio, currently the newest sequencing platform, as its subject. It
focuses on error-correction algorithms that address the high error rate of
third-generation sequencing, and it designs and implements a high-performance method for
predicting splice sites from DNA sequencing data.
Error correction and splice site recognition are two key modules in PacBio sequencing
data analysis. To address the intrinsically high base-calling error rate of PacBio
sequencing, this paper first analyzes the principles and underlying algorithmic
architectures of the error-correction tools, and then systematically compares and
evaluates all currently available self-correction and hybrid error-correction tools.
Next, to identify splice sites in new eukaryotic genes, it applies machine learning
methods built on the integration of multiple feature generation methods to explore the
macroscopic and microscopic regularities of gene splicing, with the goal of predicting
splice sites accurately. The main research contents and contributions of this paper are
as follows:
First, based on an analysis of the principles of PacBio error-correction tools, this
paper proposes a systematic method for evaluating them and designs a large number of
experiments that apply the existing self-correction and hybrid error-correction tools
uniformly to public PacBio datasets of E. coli and yeast (S. cerevisiae) at different
sequencing depths. The experimental results show that almost no tool performs well on all
metrics, which cover performance, efficiency, and impact on subsequent analysis. Each
tool has its own advantages, disadvantages, and applicable sequencing depths. Finally,
the optimal selection strategy is given for each error-correction tool, together with
tool selection schemes for datasets of different sequencing depths. The metrics in this
paper can serve as a basis for users selecting an appropriate error-correction tool, and
they indicate directions for the future development of new tools.
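As a rough illustration of one kind of accuracy metric such an evaluation could include, the sketch below compares a read before and after correction against its true source sequence using edit distance. It is a minimal sketch under stated assumptions, not the evaluation procedure used in this work: the function names `edit_distance` and `error_rate` and the toy sequences are hypothetical, and the true sequence of a read is assumed to be known, which holds for simulated reads or after aligning reads to a reference.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def error_rate(read: str, truth: str) -> float:
    """Edits per true base (hypothetical metric, assumes truth is known)."""
    return edit_distance(read, truth) / len(truth)


# Toy data (made up for illustration): one insertion and one deletion.
truth = "ACGTACGTAC"
raw_read = "ACGTTACGAC"
corrected_read = "ACGTACGTAC"

raw_rate = error_rate(raw_read, truth)
corrected_rate = error_rate(corrected_read, truth)
print(f"raw: {raw_rate:.2f}, corrected: {corrected_rate:.2f}")
```

A per-read rate like this captures only correction accuracy; the efficiency and downstream-analysis metrics mentioned above would need separate measurements such as run time, memory use, and assembly quality.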
Second, for the problem of detecting genomic splice sites, this paper proposes a machine
learning recognition method based on multi-feature extraction that exploits the sequence
pattern information near canonical splice sites. The method first uses a variety of gene
sequence feature generation methods to obtain information about the sequence patterns
and physicochemical properties in the vicinity of canonical donor and acceptor splice
sites. It then models these sequence patterns with random forests and support vector
machines in order to distinguish true splice sites from false ones. The experimental
results show that multi-feature machine learning achieves relatively high accuracy in
recognizing splice sites, with the AUC reaching up to 0.904 for donor and acceptor sites.
This method can efficiently help researchers accurately detect true splice sites and
other related functional sites in the genome, and it can advance the annotation of new
genes and a clear understanding of the coding regions and structures of genes.
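The feature generation and AUC evaluation steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the feature set or pipeline of this work: one-hot encoding and k-mer frequencies are stand-ins for the unspecified feature generation methods, the helper names (`one_hot`, `kmer_frequencies`, `features`, `auc`) are hypothetical, and the classifier itself (random forest or support vector machine) is omitted, so the scores passed to `auc` are made-up values.

```python
from itertools import product

BASES = "ACGT"


def one_hot(window: str) -> list[float]:
    """Position-specific encoding: four indicator values per base."""
    return [1.0 if base == b else 0.0 for base in window for b in BASES]


def kmer_frequencies(window: str, k: int = 2) -> list[float]:
    """Relative frequency of every possible k-mer in the window."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    n_positions = max(len(window) - k + 1, 1)
    for i in range(len(window) - k + 1):
        kmer = window[i:i + k]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
    return [counts[m] / n_positions for m in kmers]


def features(window: str, k: int = 2) -> list[float]:
    """Concatenate both feature types into one vector for a classifier."""
    return one_hot(window) + kmer_frequencies(window, k)


def auc(scores: list[float], labels: list[int]) -> float:
    """AUC as the fraction of (true, false) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy window around a candidate donor site (illustrative sequence only).
vector = features("AGGTAAGT")
# Made-up classifier scores for two true (1) and two false (0) sites.
toy_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(len(vector), toy_auc)
```

In a full pipeline, vectors like `vector` would be computed for windows around both true and decoy sites and passed to the chosen classifier, whose predicted scores would then be evaluated with `auc`.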
Keywords: PacBio; Self-correction; Hybrid error-correction; Splicing site; Machine
learning
