Sambamba: process your BAM data faster!

  对于很大的(>100G)的bam文件,排序时间很慢不说,往往需要1天或更多的时间,但结果还会出错。如下边的错误. 经测试Sambamba表现较好,能够节省很多时间。随着接触的数据越来越多,感觉很简单的事情也需要花很多时间。不仅仅是数据多了的问题!

[bam_sort_core] merging from 3288 files...
[E::hts_open_format] fail to open file 'CS_RNA_seq.sorted.bam.tmp.1020.bam'
[bam_merge_core] fail to open file CS_RNA_seq.sorted.bam.tmp.1020.bam

1、下载地址

https://github.com/lomereiter/sambamba/releases

2、安装

拷贝至全局环境变量路径即可

3、使用

4、view

sambamba-view - tool for extracting information from SAM/BAM/CRAM files

sambamba view [OPTIONS] <input.bam | input.sam | input.cram> [region1 […]]

sambamba view allows to efficiently filter SAM/BAM/CRAM files for alignments satisfying various conditions, as well as access its SAM header and information about reference sequences. In order to make these data readily available for consumption by scripts in Perl/Python/Ruby, JSON output is provided.

By default, the tool expects BAM file as an input. In order to work with CRAM, specify -C and for SAM, specify -S|--sam-input as a command-line option, the tool does NOT try to guess file format from the extension. Beware that when reading SAM, the tool will skip tags which don’t conform to the SAM/BAM specification, and set invalid fields to their default values.

FILTERING

Filtering is presented in two ways. First, you can specify a condition with -F option, using a special language for filtering, described at

https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

Second, if you have an indexed BAM file, several regions can be specified as well. The syntax for regions is the same as in samtools: chr:beg-end where beg and end are 1-based start and end of a closed-end interval on the reference chr.

JSON

Alignment record JSON representation is a hash with keys ‘qname’, ‘flag’, ‘rname’, ‘pos’, ‘mapq’, ‘cigar’, ‘rnext’, ‘qual’, ‘tags’, e.g.

{“qname”:”EAS56_57:6:190:289:82”,”flag”:69,”rname”:”chr1”,”pos”:100,
“mapq”:0,”cigar”:”*”,”rnext”:”=”,”pnext”:100,”tlen”:0,
“seq”:”CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA”,
“qual”:[27,27,27,22,27,27,27,26,27,27,27,27,27,27,27,27,23,26,26,27,
22,26,19,27,26,27,26,26,26,26,26,24,19,27,26],”tags”:{“MF”:192}}

JSON representation mimics SAM format except quality is given as an array of integers.

Postprocessing JSON output is best accomplished with https://stedolan.github.io/jq/

The output is one line per read, for building a proper JSON array pipe the output into jq --slurp.

OPTIONS

  • -F, --filter=FILTER Set custom filter for alignments.
  • -f, --format=FORMAT Specify output format. FORMAT must be one of sam, bam, cram, or json (in lowercase). Default is SAM.
  • -h, --with-header Print SAM header before reads. This is always done for BAM output.
  • -H, --header Print only SAM header to STDOUT. If FORMAT is sam or bam, its text version is printed, otherwise JSON object is written.
  • -I, --reference-info Output to STDOUT reference sequence names and lengths in JSON (see EXAMPLES).
  • -L, --regions=BEDFILE Intersect a file with regions specified in the BED file.
  • -c, --count Output to STDOUT only the number of matching records, -hHI options are ignored.
  • -v, --valid Output only valid reads.
  • -S, --sam-input Specify that the input is SAM file (default is BAM for all operations).
  • -C, --cram-input Specify that input is in CRAM format
  • -p, --show-progress Show progressbar in STDERR. Works only for BAM files, and with no regions specified, i.e. only when reading full file.
  • -l, --compression-level=COMPRESSION_LEVEL Set compression level for BAM output, a number from 0 to 9.
  • -o, --output-filename=FILENAME Specify output filename (by default everything is written to STDOUT).
  • -t, --nthreads=NTHREADS Number of threads to use.
EXAMPLES

Print basic reference sequence information:

 $ sambamba view --reference-info ex1_header.bam[{"name":"chr1","length":1575},{"name":"chr2","length":1584}]

Count reads with mapping quality not less than 50:

 $ sambamba view -c -F "mapping_quality >= 50" ex1_header.bam3124

Count properly paired reads overlapping 100..200 on chr1:

 $ sambamba view -c -F "proper_pair" ex1_header.bam chr1:100-20039

Output header in JSON format:

 $ sambamba view --header --format=json ex1_header.bam{"format_version":"1.3","rg_lines":[],  "sq_lines":[{"sequence_length":1575,"species":"","uri":"",  "sequence_name":"chr1","assembly":"","md5":""},  {"sequence_length":1584,"species":"","uri":"",  "sequence_name":"chr2","assembly":"","md5":""}],  "sorting_order":"coordinate","pg_lines":[]}

5、sort

sambamba-sort - tool for sorting BAM files

sambamba sort [OPTIONS] <input.bam>

BAM files can have either ‘coordinate’ sort order, or ‘qname’ one.

The first one means to sort the file by (integer) reference ID, and for each reference sort corresponding reads by start coordinate.

‘qname’ sorting order is when reads are sorted lexicographically by their names.

sambamba sort does an external stable-type sort on input file. That means it reads the source BAM file in chunks that fit into memory, sorts them and writes to a temporary directory, and then merges them. After merging temporary files are removed automatically.

Both sorting orders are supported. Default one is ‘coordinate’ because this is the one used for building index later. In order to switch to ‘qname’ sorting order, use -n|--sort-by-name flag.

OPTIONS

  • -m, --memory-limit=LIMIT

    Sets an upper bound for used memory. However, this is very approximate. Default memory limit is 2GB. Increasing it will allow to make chunk sizes larger and also reduce amount of I/O seeks thus improving the overall performance.LIMIT must be a number with an optional suffix specyfying unit of measumerent. The following endings are recognized: K, KiB, KB, M, MiB, MB, G, GiB, GB.

  • --tmpdir=TMPDIR

    Use TMPDIR to output sorted chunks. Default behaviour is to use system temporary directory.

  • -o, --out=OUTPUTFILE Output file name. If not provided, the result is written to a file with .sorted.bam extension.

  • -n, --sort-by-name Sort by read name lexicographically

  • -N, --natural-sortSort by read name using natural order (like samtools)

  • -F, --filter=FILTER Keep only reads that pass FILTER. More on that in sambamba-view documentation.

  • -l, --compression-level=COMPRESSION_LEVEL

    Compression level to use for sorted BAM, from 0 (known as uncompressed BAM in samtools) to 9.

  • -u, --uncompressed-chunks

    Write sorted chunks as uncompressed BAM. Default behaviour is to write them with compression level 1, because that reduces time spent on I/O, but in some cases using this option can give you a better speed. Note, however, that the disk space needed for sorting will typically be 3-4 times more than without enabling this option.

  • -p, --show-progressShow wget-like progressbar in STDERR (in fact, two of them one after another, first one for sorting, and then another one for merging).

  • -t, --nthreads=NTHREADS Number of threads to use.

6、merge

sambamba-merge - tool for merging several BAM files into one

sambamba merge [OPTIONS] <output.bam> <input1.bam> <input2.bam> […]

DESCRIPTION

sambamba merge is used for merging several sorted BAM files into one. The sorting order of all the files must be the same, and it is maintained in the output file.

SAM headers are merged automatically like in Picard merging tool.

OPTIONS

  • -t, --nthreads=NTHREADS

    Specify number of threads to use for compression/decompression tasks.

  • -l, --compression-level=COMPRESSION_LEVEL

    Specify compression level of output file as a number from 0 to 9

  • -H, --header Output only merged header to stdout in SAM format (e.g. for debugging purposes). Other options are ignored.

  • -p, --show-progressShow progressbar in STDERR.

7、slice

sambamba-slice - copying a slice of BAM file

sambamba slice OPTIONS <input.bam> region

DESCRIPTION

Outputs reads overlapping specified region into new BAM file. (Default destination is STDOUT.) Input file must be coordinate-sorted and indexed.

While the same can be done with sambamba-view, that would be much slower due to a lot of compression/decompression work. Instead of naive method, sambamba-slice leverages knowledge about structure of BAM file and recompresses only a few BGZF blocks at the beginning and the end of the region, while the BGZF blocks in the middle are copied as is. As such, this tool doesn’t offer any options related to number of threads or compression level - most of the time is spent on I/O operations.

OPTIONS

  • -o, --output-filename=OUTPUTFILE

    Name of output file

8、flagstat

sambamba-flagstat - getting flag statistics from BAM file

sambamba flagstat OPTIONS

8、markdup

sambamba-markdup - finding duplicate reads in BAM file

sambamba markdup OPTIONS <input.bam> <output.bam>

Marks (by default) or removes duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.

OPTIONS

  • -r, --remove-duplicates

    remove duplicates instead of just marking them

  • -t, --nthreads=NTHREADS

    number of threads to use

  • -l, --compression-level=N

    specify compression level of the resulting file (from 0 to 9)”);

  • -p, --show-progress

    show progressbar in STDERR

  • --tmpdir=TMPDIR

    specify directory for temporary files; default is /tmp

  • --hash-table-size=HASHTABLESIZE

    size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance

  • --overflow-list-size=OVERFLOWLISTSIZE

    size of the overflow list where reads, thrown away from the hash table, get a second chance to meet their pairs (default is 200000 reads); increasing the size reduces the number of temporary files created

  • --io-buffer-size=BUFFERSIZE

    controls sizes of two buffers of BUFFERSIZE megabytes each, used for reading and writing BAM during the second pass (default is 128)

BUGS

External sort is not implemented. Thus, memory consumption grows by 2Gb per each 100M reads. Check that you have enough RAM before running the tool.

Sambamba: process your BAM data faster!相关推荐

  1. RPA之家手把手带你入门Blue Prism教程系列6_运行Process并认识Data Item

    RPA之家手把手带你入门Blue Prism 练习一: 打开并运行一段Process Data Item -本文章由RPA之家(rpazj.com)提供, 学习交流群QQ群465620839 微信交流 ...

  2. 《计算机网络》第三章:数据链路层(The Data Link Layer)

    Copyright(C)肖文栋教授@北京科技大学自动化学院 内容安排 3.1 Data Link Layer Design Issues数据链路层设计要点 3.2 Error Detection an ...

  3. SAP MM Vendor Rebate Process and Settings

    SAP MM Vendor Rebate Process and Settings Abstract – This document covers – Overview of SAP Vendor R ...

  4. [SD2.0大会]王坚:Data–centric Computing

    1969年 APAR网仅有4个节点 人类登月 一.未来会很不同 PC会不同 Software会不同 Internet会不同 Software is Different Value of 传统软件 功能 ...

  5. 《Credit Risk Scorecard》第四章:Data Review and Project Parameters

    第四章:Scorecard Development Process, Stage 2: Data Review and Project Parameters 一: data avaliablity a ...

  6. 大数据(big data)_如何使用Big Query&Data Studio处理和可视化Google Cloud上的财务数据...

    大数据(big data) 介绍 (Introduction) This article will show you one of the ways you can process stock pri ...

  7. Node.js进程管理之Process模块

    在前面Node.js事件运行机制也有提到,Node.js应用在单个线程运行,但是现在大部分服务器都是多处理器,为了方便使用多个进程,Node.js提供了3个模块.Process模块提供了访问正在运行的 ...

  8. node process

    nodejs的process是一个全局对象,他提供了一些方法和属性,node.js官方的API说的很简单,并没有把一些详细的应用方法和作用写出来,下面结合我自己的学习,做一下小结吧. 1.Event: ...

  9. sql azure 语法_深入了解Azure Data Studio:扩展和Azure SQL DB开发

    sql azure 语法 In the previous articles listed below, we went through the Azure Data Studio tool, star ...

最新文章

  1. Java Swing 探索(一)LayoutManager
  2. Windchill的web中的Spring
  3. 上海虹桥站 启动建设5G网络,DMA助力5G加速
  4. 查看数据库、表、索引的物理存储情况
  5. 锤子科技犯过的构图错误你一定也犯过
  6. eclipse远程开发
  7. 1. 赋值运算符函数
  8. 170329、用 Maven 部署 war 包到远程 Tomcat 服务器
  9. mysql操作语句(简单笔记)
  10. mac自动生成路径问题
  11. facebook 广告目标详解
  12. OpenCV入门: Mat数据类型及其转换,访问
  13. 高通模式9008模式linux,重磅干货!高通9008模式与数据提取
  14. 期权与期货有哪些不同?
  15. 总有些中文情歌,让我莫明的感动了
  16. 高效能人士的七个习惯总结
  17. HTML/CSS-花样边框案例
  18. 1月第1周业务风控关注 | 四部门联合印发App违法违规收集使用个人信息行为认定方法
  19. wafw00f--一款基于python识别网站WAF的工具
  20. 若依前后端分离框架——初始化参数功能源码学习

热门文章

  1. 干旱生态系统中土壤真菌与细菌群落构建的关系
  2. 在计算机网络中 dte,在计算机网络中,DTE设备兼备()的作用
  3. layui时间选择器选择周和季度
  4. 人工智能伦理无法回避的5个问题,生物进化是否有方向是关键
  5. 蓝桥杯:铁塔尼号营救(Java版本)
  6. keepalived原理 mysql_高可用实现KeepAlived原理简介
  7. 你是否还对pycharm的高级使用一头雾水? 手把手教你快速高效使用pycharm英文界面各种实用功能, 附带常见英文解释
  8. 【摘录】NVRAM\FLASH\NVM的区别
  9. C++笔试题(一)【高级C++开发工程师综合测试题(风林火山)】
  10. 推荐一款 java 小程序 saas 新零售数字化产品 - weiit-saas