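1. Concept:
ART is a gene sequence data generator: a simulator that produces synthetic sequencing reads from a reference sequence.
For details, see the paper 【1】 and the official website 【2】.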

2. Download:
ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz

http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz

3. Configuration:
Extract the archive, then use sudo cp to copy the executables into a bin directory on your PATH.
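
The copy step can be sketched as a small shell function. This is only a sketch under stated assumptions: `install_art` is a hypothetical helper name, the executables inside the archive are assumed to have names starting with `art_`, and the default destination `/usr/local/bin` is just an example. None of these details come from the original notes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ART): unpack the downloaded archive and
# copy the ART executables into a directory on the PATH.
# Assumption: the executables inside the archive are named art_* .
install_art() {
    local tarball=$1
    local dest=${2:-/usr/local/bin}   # example destination, adjust as needed
    local workdir
    workdir=$(mktemp -d) || return 1

    # Unpack into a scratch directory so the archive's internal layout does not matter.
    tar -xzf "$tarball" -C "$workdir" || { rm -rf "$workdir"; return 1; }

    # Copy every executable file whose name starts with art_ .
    # Set SUDO=sudo when the destination needs root permissions.
    find "$workdir" -type f -name 'art_*' -perm -u+x -exec ${SUDO:-} cp {} "$dest"/ \;

    rm -rf "$workdir"
}
```

A possible call would be `SUDO=sudo install_art ART-bin-GreatSmokyMountains-04.17.16-Linux64.tgz /usr/bin`, which mirrors the sudo cp step described above.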

4. Usage:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina 

Running art_illumina without arguments prints the full usage text; see the appendix for details.

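5. Example:

Simulate single-end HiSeq 2000 (HS20) reads of length 100 at 20-fold coverage from the reference GRCH38chr1L3556522.fna, writing the output with the prefix G38L100F20Nhs20: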

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20

Result:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina -ss HS20 -i GRCH38chr1L3556522.fna -l 100 -f 20 -o G38L100F20Nhs20

====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.1 (Apr 17, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------

The simulation was still running when the directory was listed:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ ll
total 9443836
drwxrwxr-x 2 hadoop hadoop       4096 Jun  2 23:10 ./
drwxrwxr-x 6 hadoop hadoop       4096 Jun  2 22:59 ../
-rw-rw-r-- 1 hadoop hadoop 4635232124 Jun  2 23:11 G38L100F20Nhs20.aln
-rw-rw-r-- 1 hadoop hadoop 4347022003 Jun  2 23:11 G38L100F20Nhs20.fq
-rw-r--r-- 1 hadoop hadoop  252513055 Jun  2 23:00 GRCH38chr1L3556522.fna
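
Once a run like this has finished, a rough sanity check is to compare the number of FASTQ records with the expected read count, which for a single-end run is approximately reference bases × fold coverage / read length. The helpers below are a minimal sketch with hypothetical names; they are not part of ART. The estimate is only an upper-end approximation, because the usage text in the appendix states that regions containing 'N' are masked by default (`-nf 1`), so the real count may be lower.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ART) for sanity-checking a single-end run.

# Count the bases in a FASTA file: skip header lines, sum the remaining line lengths.
count_bases() {
    awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

# Rough expectation: reads ~= bases * fold_coverage / read_length (integer division).
expected_reads() {
    local bases=$1 fold=$2 read_len=$3
    echo $(( bases * fold / read_len ))
}

# Count the records in a FASTQ file, assuming the usual 4 lines per record.
count_fastq_records() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Worked arithmetic: 1,000,000 bases at 20x coverage with 100 bp reads.
expected_reads 1000000 20 100   # prints 200000
```

For the run above, one could compare `expected_reads "$(count_bases GRCH38chr1L3556522.fna)" 20 100` with `count_fastq_records G38L100F20Nhs20.fq`.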

References
【1】 http://bioinformatics.oxfordjournals.org/content/28/4/593.short
【2】 http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
【3】 http://www.niehs.nih.gov/research/resources/assets/docs/artbingreatsmokymountains041716linux64tgz.tgz
Appendix:

hadoop@Master:~/cloud/adam/xubo/data/GRCH38Sub/cs-bwamem$ art_illumina

====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.1 (Apr 17, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------

===== USAGE =====

art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -f <fold_coverage> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>
art_illumina [options] -ss <sequencing_system> -sam -i <seq_ref_file> -l <read_length> -c <num_reads_per_sequence> -m <mean_fragsize> -s <std_fragsize> -o <outfile_prefix>

===== PARAMETERS =====

-1   --qprof1   the first-read quality profile
-2   --qprof2   the second-read quality profile
-amp --amplicon amplicon sequencing simulation
-c   --rcount   number of reads/read pairs to be generated per sequence/amplicon (not be used together with -f/--fcov)
-d   --id       the prefix identification tag for read ID
-ef  --errfree  indicate to generate the zero sequencing errors SAM file as well the regular one
                NOTE: the reads in the zero-error SAM file have the same alignment positions
                as those in the regular SAM file, but have no sequencing errors
-f   --fcov     the fold of read coverage to be simulated or number of reads/read pairs generated for each amplicon
-h   --help     print out usage information
-i   --in       the filename of input DNA/RNA reference
-ir  --insRate  the first-read insertion rate (default: 0.00009)
-ir2 --insRate2 the second-read insertion rate (default: 0.00015)
-dr  --delRate  the first-read deletion rate (default:  0.00011)
-dr2 --delRate2 the second-read deletion rate (default: 0.00023)
-l   --len      the length of reads to be simulated
-m   --mflen    the mean size of DNA/RNA fragments for paired-end simulations
-mp  --matepair indicate a mate-pair read simulation
-M  --cigarM    indicate to use CIGAR 'M' instead of '=/X' for alignment match/mismatch
-nf  --maskN    the cutoff frequency of 'N' in a window size of the read length for masking genomic regions
                NOTE: default: '-nf 1' to mask all regions with 'N'. Use '-nf 0' to turn off masking
-na  --noALN    do not output ALN alignment file
-o   --out      the prefix of output filename
-p   --paired   indicate a paired-end read simulation or to generate reads from both ends of amplicons
                NOTE: art will automatically switch to a mate-pair simulation if the given mean fragment size >= 2000
-q   --quiet    turn off end of run summary
-qL  --minQ     the minimum base quality score
-qU  --maxQ     the maxiumum base quality score
-qs  --qShift   the amount to shift every first-read quality score by
-qs2 --qShift2  the amount to shift every second-read quality score by
                NOTE: For -qs/-qs2 option, a positive number will shift up quality scores (the max is 93)
                that reduce substitution sequencing errors and a negative number will shift down
                quality scores that increase sequencing errors. If shifting scores by x, the error
                rate will be 1/(10^(x/10)) of the default profile.
-rs  --rndSeed  the seed for random number generator (default: system time in second)
                NOTE: using a fixed seed to generate two identical datasets from different runs
-s   --sdev     the standard deviation of DNA/RNA fragment size for paired-end simulations.
-sam --samout   indicate to generate SAM alignment file
-sp  --sepProf  indicate to use separate quality profiles for different bases (ATGC)
-ss  --seqSys   The name of Illumina sequencing system of the built-in profile used for simulation
                NOTE: sequencing system ID names are:
                GA1 - GenomeAnalyzer I (36bp,44bp), GA2 - GenomeAnalyzer II (50bp, 75bp)
                HS10 - HiSeq 1000 (100bp),          HS20 - HiSeq 2000 (100bp),      HS25 - HiSeq 2500 (125bp, 150bp)
                HSXn - HiSeqX PCR free (150bp),     HSXt - HiSeqX TruSeq (150bp),   MinS - MiniSeq TruSeq (50bp)
                MSv1 - MiSeq v1 (250bp),            MSv3 - MiSeq v3 (250bp),        NS50 - NextSeq500 v2 (75bp)

===== NOTES =====

* ART by default selects a built-in quality score profile according to the read length specified for the run.
* For single-end simulation, ART requires input sequence file, output file prefix, read length, and read count/fold coverage.
* For paired-end simulation (except for amplicon sequencing), ART also requires the parameter values of
  the mean and standard deviation of DNA/RNA fragment lengths

===== EXAMPLES =====

1) single-end read simulation
   art_illumina -ss HS25 -sam -i reference.fa -l 150 -f 10 -o single_dat
2) paired-end read simulation
   art_illumina -ss HS25 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_dat
3) mate-pair read simulation
   art_illumina -ss HS10 -sam -i reference.fa -mp -l 100 -f 20 -m 2500 -s 50 -o matepair_dat
4) amplicon sequencing simulation with 5' end single-end reads
   art_illumina -ss GA2 -amp -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_5end_dat
5) amplicon sequencing simulation with paired-end reads
   art_illumina -ss GA2 -amp -p -sam -na -i amp_reference.fa -l 50 -f 10 -o amplicon_pair_dat
6) amplicon sequencing simulation with matepair reads
   art_illumina -ss MSv1 -amp -mp -sam -na -i amp_reference.fa -l 150 -f 10 -o amplicon_mate_dat
7) generate an extra SAM file with zero-sequencing errors for a paired-end read simulation
   art_illumina -ss HSXn -ef -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_twosam_dat
8) reduce the substitution error rate to one 10th of the default profile
   art_illumina -i reference.fa -qs 10 -qs2 10 -l 50 -f 10 -p -m 500 -s 10 -sam -o reduce_error
9) turn off the masking of genomic regions with unknown nucleotides 'N'
   art_illumina -ss HS20 -nf 0  -sam -i reference.fa -p -l 100 -f 20 -m 200 -s 10 -o paired_nomask
10) masking genomic regions with >=5 'N's within the read length 50
   art_illumina -ss HSXt -nf 5 -sam -i reference.fa -p -l 150 -f 20 -m 200 -s 10 -o paired_maskN5
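
The usage text notes that a fixed seed (`-rs`) makes two runs produce identical datasets. Combining that with the paired-end example gives the sketch below. `art_paired_cmd` is a hypothetical wrapper name and the default values (including the seed 42) are arbitrary illustrations; every flag it emits is taken from the usage text above. The function only prints the command so that it can be inspected before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of ART): print a paired-end art_illumina
# command with a fixed random seed so that repeated runs give identical data.
# All flags used here (-ss, -sam, -i, -p, -l, -f, -m, -s, -rs, -o) appear in
# the usage text above; the default values are arbitrary examples.
art_paired_cmd() {
    local ref=$1 prefix=$2
    local seq_sys=${3:-HS25} read_len=${4:-150} fold=${5:-20}
    local mean_frag=${6:-200} sd_frag=${7:-10} seed=${8:-42}
    echo art_illumina -ss "$seq_sys" -sam -i "$ref" -p \
        -l "$read_len" -f "$fold" -m "$mean_frag" -s "$sd_frag" \
        -rs "$seed" -o "$prefix"
}

# Dry run: show the command that would be executed.
art_paired_cmd reference.fa paired_dat
```

Once the printed line looks right, it can be copied into the shell and run as is.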
