Genomic Data Processing 47: The ART Gene Sequence Data Generator (Simulation)

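1. Concept:
ART is a gene sequence data generator: it simulates sequencing reads from a reference sequence.
For details, see the paper 【1】
and the official website 【2】.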

2. Download:
ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz

http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz

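3. Configuration:
Unpack the archive, then use sudo cp to copy the executables into the user's bin directory so that art_illumina is on the PATH.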
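
The unpack-and-copy step can be sketched as a small shell function. This is only a sketch under assumptions: install_art is a hypothetical helper name, the destination directory is whatever the caller chooses, and the function searches the unpacked tree for a file named art_illumina rather than assuming a particular directory layout inside the archive.

```shell
#!/usr/bin/env bash
# install_art ARCHIVE DEST_DIR   (hypothetical helper, not part of ART)
# Unpacks ARCHIVE into a temporary directory and copies the art_illumina
# executable found inside it into DEST_DIR.
install_art() {
    local archive="$1" dest="$2" tmp bin rc
    tmp="$(mktemp -d)" || return 1
    if ! tar -xzf "$archive" -C "$tmp"; then
        rm -rf "$tmp"
        return 1
    fi
    # Search by file name so the unpacked directory name does not matter.
    bin="$(find "$tmp" -type f -name art_illumina | head -n 1)"
    if [ -z "$bin" ]; then
        echo "art_illumina not found in $archive" >&2
        rm -rf "$tmp"
        return 1
    fi
    mkdir -p "$dest" && cp "$bin" "$dest/art_illumina" && chmod +x "$dest/art_illumina"
    rc=$?
    rm -rf "$tmp"
    return "$rc"
}
```

For example, `install_art ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz "$HOME/bin"` would place the binary in a per-user directory. Copying into a system-wide directory needs root privileges for the copy, which is why the step above uses sudo cp.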

4. Usage:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina

Running art_illumina without arguments prints the full usage text, which is reproduced in the appendix.

5. Example:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20
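
As a rough sanity check on a run like this, the usual definition of fold coverage implies about (coverage × reference bases / read length) reads. The sketch below applies that formula to a FASTA file. The helper names count_bases and expected_reads are hypothetical, and the figure is only an estimate: ART's actual read count can differ, for instance because regions containing 'N' are masked by default.

```shell
#!/usr/bin/env bash
# count_bases FASTA
# Counts sequence characters in a FASTA file, skipping '>' header lines
# and line terminators.
count_bases() {
    grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# expected_reads FASTA READ_LEN FOLD
# Estimates the number of single-end reads as FOLD * bases / READ_LEN.
# Integer arithmetic only, so FOLD must be a whole number here.
expected_reads() {
    local bases
    bases="$(count_bases "$1")"
    echo $(( bases * $3 / $2 ))
}
```

For the command above this would be called as `expected_reads GRCH38chr1L3556522.fna 100 20`. With a reference of roughly 250 million bases, that works out to about 50 million reads.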

Result:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20

====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.1 (Apr 17, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------

The simulation was still running at this point, so the output file sizes in the listing below are not final.

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ ll
total 9443836
drwxrwxr-x 2 hadoop hadoop       4096  6月  2 23:10 ./
drwxrwxr-x 6 hadoop hadoop       4096  6月  2 22:59 ../
-rw-rw-r-- 1 hadoop hadoop 4635232124  6月  2 23:11 G38L100F20Nhs20.aln
-rw-rw-r-- 1 hadoop hadoop 4347022003  6月  2 23:11 G38L100F20Nhs20.fq
-rw-r--r-- 1 hadoop hadoop  252513055  6月  2 23:00 GRCH38chr1L3556522.fna
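
The listing shows the two files ART writes for this command: the reads (.fq) and the ALN alignment file (.aln). To see how many reads have been written so far, the FASTQ lines can be counted, assuming the conventional four lines per record. The helper name count_fastq_reads is hypothetical.

```shell
#!/usr/bin/env bash
# count_fastq_reads FASTQ
# Prints the number of records in a FASTQ file, assuming each record
# occupies exactly four lines (header, sequence, '+', qualities).
count_fastq_reads() {
    local lines
    lines="$(wc -l < "$1" | tr -d ' ')"
    echo $(( lines / 4 ))
}
```

For example, `count_fastq_reads G38L100F20Nhs20.fq`. If the ALN file is not needed, the usage text in the appendix lists -na (--noALN), which turns it off.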

References
【1】 http://bioinformatics.oxfordjournals.org/content/28/4/593.short
【2】 http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
【3】 http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz

Appendix:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina

====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.1 (Apr 17, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------

===== USAGE =====

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

===== PARAMETERS =====

-1   --qprof1   the first-read quality profile
-2   --qprof2   the second-read quality profile
-amp --amplicon amplicon sequencing simulation
-c   --rcount   number of reads/read pairs to be generated per sequence/amplicon (not be used together with -f/--fcov)
-d   --id       the prefix identification tag for read ID
-ef  --errfree  indicate to generate the zero sequencing errors SAM file as well the regular one
                NOTE: the reads in the zero-error SAM file have the same alignment positions
                as those in the regular SAM file, but have no sequencing errors
-f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
-h   --help     print out usage information
-i   --in       the filename of input DNA/RNA reference
-ir  --insRate  the first-read insertion rate (default: 0.00009)
-ir2 --insRate2 the second-read insertion rate (default: 0.00015)
-dr  --delRate  the first-read deletion rate (default:  0.00011)
-dr2 --delRate2 the second-read deletion rate (default: 0.00023)
-l   --len      the length of reads to be simulated
-m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations
-mp  --matepair indicate a mate-pair read simulation
-M  --cigarM    indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
-nf  --maskN    the cutoff frequency of 'N' in a window size of the read length for masking genomic regions
                NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking
-na  --noALN    do not output ALN alignment file
-o   --out      the prefix of output filename
-p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons
                NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
-q   --quiet    turn off end of run summary
-qL  --minQ     the minimum base quality score
-qU  --maxQ     the maxiumum base quality score
-qs  --qShift   the amount to shift every first-read quality score by
-qs2 --qShift2  the amount to shift every second-read quality score by
                NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93)
                that reduce substitution sequencing errors and a negative number will shift down
                quality scores that increase sequencing errors. If shifting scores by x, the error
                rate will be 1/(10^(x/10)) of the default profile.
-rs  --rndSeed  the seed for random number generator (default: system time in second)
                NOTE: using a fixed seed to generate two identical datasets from different runs
-s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.
-sam --samout   indicate to generate SAM alignment file
-sp  --sepProf  indicate to use separate quality profiles for different bases (ATGC)
-ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation
                NOTE: sequencing system ID names are:
                GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
                HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
                HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
                MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)

===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.
* For single-end simulation, ART requires input sequence file, output file prefix, read length, and read count/fold coverage.
* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of
  the mean and standard deviation of DNA/RNA fragment lengths

===== EXAMPLES =====

1) single-end read simulation
   art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat
2) paired-end read simulation
   art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat
3) mate-pair read simulation
   art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat
4) amplicon sequencing simulation with 5' end single-end reads
   art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat
5) amplicon sequencing simulation with paired-end reads
   art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat
6) amplicon sequencing simulation with matepair reads
   art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat
7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
   art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat
8) reduce the substitution error rate to one 10th of the default profile
   art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error
9) turn off the masking of genomic regions with unknown nucleotides 'N'
   art_illumina -ss HS20 -nf 0  -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask
10) masking genomic regions with >=5 'N's within the read length 50
   art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5
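
The note on -qs/-qs2 in the usage text states that shifting quality scores by x scales the error rate to 1/(10^(x/10)) of the default profile. A minimal sketch of that arithmetic follows. The function name qshift_factor is hypothetical, and awk is used only because the shell has no floating-point math.

```shell
#!/usr/bin/env bash
# qshift_factor X
# Prints the error-rate multiplier 1 / 10^(X/10) that the usage text gives
# for a quality shift of X (positive X lowers the error rate).
qshift_factor() {
    awk -v x="$1" 'BEGIN { printf "%.4f\n", 1 / (10 ^ (x / 10)) }'
}
```

`qshift_factor 10` prints 0.1000, which matches example 8 in the usage text (-qs 10 -qs2 10 reduces the substitution error rate to one tenth of the default profile).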
