A Simple Introduction to Read Simulators

Read simulators are widely used within the research community to create synthetic and mock datasets for analysis. In this article, I will introduce some recently proposed, commonly used read simulators.

DNA Sequencing and Reads

If you have come across my previous article on DNA Sequence Data Analysis, you may have read about DNA sequencing. Sequencing is the process that determines the precise order of nucleotides in a given DNA molecule, that is, the order of the four bases adenine, guanine, cytosine and thymine in a strand of DNA. DNA sequencing is used to determine the sequence of individual genes, full chromosomes or entire genomes of an organism.

Sequencing machines extract short, random DNA sequences from the genome we wish to determine (the target genome). Current DNA sequencing technologies cannot read a whole genome at once. They read small pieces of between 100 and 30,000 bases, depending on the technology used. These short pieces are called reads.

Read Simulators

Sequencing machines may not be available as we wish and we may not be able to get hold of real-world samples to sequence. This is where read simulators come in handy for research purposes. Read simulators can mimic sequencing machines to simulate reads. They have pre-defined statistical models to mimic the error rates relevant to the particular sequencing machines. Furthermore, we can provide our own error models as well (different rates of insertions, deletions and substitutions).
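
To make the idea concrete, here is a minimal toy sketch of what a read simulator does: it samples fixed-length reads from random positions of a genome and injects substitution errors at a chosen rate. This is not how any of the tools in this article are implemented; the function name `simulate_reads` and its parameters are made up for illustration, and real simulators also model insertions, deletions, quality scores and read-length distributions.

```python
import random

BASES = "ACGT"

def simulate_reads(genome, n_reads, read_length, sub_rate=0.01, seed=0):
    """Sample fixed-length reads uniformly from `genome` and add substitution errors.

    Returns a list of (start_position, read_sequence) tuples.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Pick a random start so the read fits inside the genome.
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            # With probability sub_rate, replace the base with a different one.
            if rng.random() < sub_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append((start, "".join(read)))
    return reads

# A random 1,000-base "genome" used only for this demonstration.
genome = "".join(random.Random(42).choice(BASES) for _ in range(1000))
reads = simulate_reads(genome, n_reads=5, read_length=50, sub_rate=0.05)

for start, read in reads:
    original = genome[start:start + 50]
    mismatches = sum(a != b for a, b in zip(read, original))
    print(start, read[:20] + "...", f"{mismatches} substitutions")
```

Setting `sub_rate=0.0` returns error-free reads, which is a quick way to check that the sampling step itself is correct.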

Estimating Sequencing Coverage

Sequencing coverage is defined as the average number of reads that cover each base of the reference genome. Estimating the sequencing coverage is very important when you are simulating datasets. The coverage equation is defined as follows.

C = LN / G

C is the sequencing coverage
G is the length of the genome
L is the read length
N is the number of reads

For example, if you have a genome of length 5 Mbp and you simulate 1,000,000 HiSeq 2000 reads (read length 100 bp), you will get a sequencing coverage of 20x as follows.

C = LN / G = 100 * 1,000,000 / 5,000,000 = 20x

Here, each position of the reference genome is covered by 20 reads on average. Because reads are sampled at random, individual positions may be covered by more or fewer reads.
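
The coverage equation can be sketched in a few lines of Python. The helper names `coverage` and `reads_needed` are illustrative only; the second one simply rearranges C = LN / G to find how many reads give a target coverage.

```python
import math

def coverage(read_length, n_reads, genome_length):
    """Average sequencing coverage: C = L * N / G."""
    return read_length * n_reads / genome_length

def reads_needed(target_coverage, read_length, genome_length):
    """Number of reads required for a target coverage: N = C * G / L, rounded up."""
    return math.ceil(target_coverage * genome_length / read_length)

# The worked example from the text: 1,000,000 reads of 100 bp on a 5 Mbp genome.
print(coverage(100, 1_000_000, 5_000_000))   # 20.0

# How many 100 bp reads would give 30x on the same genome?
print(reads_needed(30, 100, 5_000_000))      # 1500000
```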

Estimating Abundance

The abundance of a species in a dataset is the fraction of reads that belong to that species. For example, if a dataset has 10,000,000 reads and 1,000,000 of them belong to E. coli, then the abundance of E. coli is 0.1.

Note that coverage and abundance are not the same.
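
A small sketch shows why the two differ: two genomes sequenced at the same coverage contribute different fractions of reads if their lengths differ. The genome names and lengths below are made-up values, and the sketch assumes the same read length for every genome.

```python
def abundances_from_coverage(genome_lengths, coverages, read_length):
    """Convert per-genome coverage into read-fraction abundance.

    Uses N = C * G / L for each genome, then divides by the total read count.
    """
    n_reads = {
        name: coverages[name] * genome_lengths[name] / read_length
        for name in genome_lengths
    }
    total = sum(n_reads.values())
    return {name: n / total for name, n in n_reads.items()}

# Hypothetical genomes: same 20x coverage, different lengths.
lengths = {"genome_a": 5_000_000, "genome_b": 1_000_000}
covs = {"genome_a": 20, "genome_b": 20}

result = abundances_from_coverage(lengths, covs, read_length=100)
for name, abundance in result.items():
    print(name, round(abundance, 3))
```

Even though both genomes have 20x coverage, the longer genome accounts for roughly five sixths of the reads, so its abundance is about five times higher.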

Short Read Simulators

With the popularity of next-generation sequencing (NGS) technologies, many NGS read simulators have been developed. Currently, many of the popular short read simulators are designed to mimic reads from Illumina, 454 and SOLiD platforms. Listed below are some popular short read simulators. Links to their publications are provided as well.

MetaSim
wgsim
SimNGS
ArtificialFastqGenerator
InSilicoSeq

Long Read Simulators

With the advancements in sequencing technologies, scientists have shown an increasing interest in using third-generation sequencing (TGS) technologies. Currently, many of the popular long read simulators are designed to simulate reads mimicking the two main TGS technologies: (1) Pacific Biosciences (PacBio) and (2) Oxford Nanopore (ONT). Listed below are some of the popular and recently introduced PacBio and ONT simulators. Links to their publications are provided as well.

PacBio Simulators

PBSIM
LongISLND
SimLoRD
NPBSS
PaSS

ONT Simulators

NanoSim
Nanopore SimulatION
DeepSimulator
DeepSimulator1.5

InSilicoSeq

I have been using InSilicoSeq in my work a lot and I find it very intuitive and easy to use. I will walk you through some sample commands to simulate reads. You can easily install InSilicoSeq using conda or pip.

conda install -c bioconda insilicoseq
OR
pip install InSilicoSeq

Simulate reads by providing the number of reads

Assume that you have a single reference genome and you want to simulate 1 million Illumina MiSeq reads. Given below is a sample command you can run using InSilicoSeq.

iss generate --model miseq --genomes ref.fasta --n_reads 1M --cpus 8 --output reads

Simulate reads by providing the coverage

Assume that you have two reference genome files ref1.fasta and ref2.fasta. You want to simulate 30x coverage from ref1 and 10x coverage from ref2. You will need to create a tab-separated file named coverages.tsv and add the coverage details as follows.

ref1_id     30
ref2_id     10

ref1_id and ref2_id refer to the identifiers of the sequences in the files ref1.fasta and ref2.fasta. If you download the reference genomes from NCBI, the identifiers will consist of letters and numbers and may look something like NC_007712.1 or CP001844.2. These identifiers are the NCBI accession numbers provided for each reference genome.
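
If you are unsure what the identifiers in your FASTA files are, a short script can pull them out and write the tab-separated file for you. This is a minimal sketch: `fasta_ids` and `write_tsv` are hypothetical helpers, not part of InSilicoSeq, and the sketch assumes the identifier is the first whitespace-delimited token of each FASTA header line. The demo FASTA content is a placeholder, not real sequence data.

```python
import csv
import tempfile
from pathlib import Path

def fasta_ids(path):
    """Return the identifier (first token after '>') of every record in a FASTA file."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:  # skip malformed headers with no identifier
                    ids.append(tokens[0])
    return ids

def write_tsv(path, values):
    """Write {identifier: value} pairs as a two-column, tab-separated file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for seq_id, value in values.items():
            writer.writerow([seq_id, value])

# Demo with a placeholder FASTA file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
demo_fasta = workdir / "ref1.fasta"
demo_fasta.write_text(">NC_007712.1 placeholder description\nACGTACGT\n")

ids = fasta_ids(demo_fasta)
write_tsv(workdir / "coverages.tsv", {ids[0]: 30})
tsv_text = (workdir / "coverages.tsv").read_text()
print(tsv_text)
```

Writing the file from a script also guarantees that the columns are separated by a real tab character rather than spaces.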

Now you can simulate the reads using the following command.

iss generate --model miseq --genomes ref1.fasta ref2.fasta --coverage_file coverages.tsv --cpus 8 --output reads

Simulate reads by providing the abundance

Assume that you have two reference genome files ref1.fasta and ref2.fasta. You want to simulate 0.4 abundance from ref1 and 0.6 abundance from ref2. Note that the sum of all the abundance values should be 1.0. Similar to coverage, you will need to create a tab-separated file named abundance.tsv and add the abundance details as follows.

ref1_id     0.4
ref2_id     0.6
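
Since the abundance values should sum to 1.0, it can help to start from relative weights and normalise them before writing the file. The sketch below is only an illustration; `normalise` is a hypothetical helper and the weights are arbitrary.

```python
import math

def normalise(weights):
    """Scale non-negative weights so that they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {seq_id: weight / total for seq_id, weight in weights.items()}

# Relative weights 2:3 become abundances 0.4 and 0.6.
raw_weights = {"ref1_id": 2, "ref2_id": 3}
abundances = normalise(raw_weights)

# Sanity check before handing the values to a simulator.
assert math.isclose(sum(abundances.values()), 1.0)

for seq_id, value in abundances.items():
    print(f"{seq_id}\t{value}")
```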

Now you can simulate the reads using the following command.

iss generate --model miseq --genomes ref1.fasta ref2.fasta --abundance_file abundance.tsv --cpus 8 --output reads

You can read more details in the InSilicoSeq documentation.

PBSIM

PBSIM is a PacBio read simulator which provides both sampling-based and model-based simulations. I will walk you through some sample commands to simulate reads using PBSIM.

Model-based simulation

For model-based simulation, you can run the following command.

pbsim --data-type CLR --depth 100 --length-min 10000 --length-max 20000 --prefix test --model_qc data/model_qc_clr ref.fasta

The model can be found in the PBSIM folder PBSIM-PacBio-Simulator/data/model_qc_clr. The data type CLR refers to Continuous Long Read, which simulates long reads with high error rates. The other data type, CCS, refers to circular consensus reads, which are shorter and have low error rates.

Sampling-based simulation

For sampling-based simulation, you can run the following command.

pbsim --data-type CLR --depth 100 --sample-fastq sample/sample.fastq sample/sample.fasta

The sample FASTQ file can be found in the PBSIM folder PBSIM-PacBio-Simulator/sample/sample.fastq. You can use your own FASTQ file as well.

You can read more details in the PBSIM documentation.

SimLoRD

SimLoRD is a TGS read simulator based on the Pacific Biosciences SMRT error model. I have frequently used SimLoRD to simulate PacBio datasets for my work. I will walk you through some sample commands to simulate reads using SimLoRD.

Simulate fixed-length reads by providing the coverage

Assume that you have a reference genome and you want to simulate fixed-length reads with 60x coverage. Given below is a sample command you can run using SimLoRD.

simlord --read-reference ref.fasta --coverage 60 --fixed-readlength 5000 output_prefix

Simulate fixed-length reads by providing the number of reads

Assume that you have a reference genome and you want to simulate 2000 fixed-length reads. Given below is a sample command you can run using SimLoRD.

simlord --read-reference ref.fasta --num-reads 2000 --fixed-readlength 5000 output_prefix

You can also set a minimum length for the reads using the --min-readlength parameter during the simulation. You can read more from the SimLoRD documentation.

Final Thoughts

Read simulators have given us the opportunity to simulate reads ranging from zero errors to very high error rates. Also, they have allowed us to create synthetic and mock datasets mimicking different sequencing machines and different species compositions.

I hope you found this article useful and informative as a starting point for using read simulators. Feel free to use these tools for your projects and research work, as they are freely available.

Cheers, and stay safe!

You can read my previous articles related to bioinformatics and DNA analysis.

Translated from: https://medium.com/computational-biology/a-simple-introduction-to-read-simulators-bbeff4f0c0c6