Sambamba: process your BAM data faster!

　　对于很大的(>100G)的bam文件，排序时间很慢不说，往往需要1天或更多的时间，但结果还会出错。如下边的错误. 经测试Sambamba表现较好，能够节省很多时间。随着接触的数据越来越多，感觉很简单的事情也需要花很多时间。不仅仅是数据多了的问题！

[bam_sort_core] merging from 3288 files...
[E::hts_open_format] fail to open file 'CS_RNA_seq.sorted.bam.tmp.1020.bam'
[bam_merge_core] fail to open file CS_RNA_seq.sorted.bam.tmp.1020.bam

1、下载地址

https://github.com/lomereiter/sambamba/releases

2、安装

拷贝至全局环境变量路径即可

3、使用

4、view

sambamba-view - tool for extracting information from SAM/BAM/CRAM files

sambamba view [OPTIONS] <input.bam | input.sam | input.cram> [region1 […]]

sambamba view allows to efficiently filter SAM/BAM/CRAM files for alignments satisfying various conditions, as well as access its SAM header and information about reference sequences. In order to make these data readily available for consumption by scripts in Perl/Python/Ruby, JSON output is provided.

By default, the tool expects BAM file as an input. In order to work with CRAM, specify -C and for SAM, specify -S|--sam-input as a command-line option, the tool does NOT try to guess file format from the extension. Beware that when reading SAM, the tool will skip tags which don’t conform to the SAM/BAM specification, and set invalid fields to their default values.

FILTERING

Filtering is presented in two ways. First, you can specify a condition with -F option, using a special language for filtering, described at

https://github.com/lomereiter/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

Second, if you have an indexed BAM file, several regions can be specified as well. The syntax for regions is the same as in samtools: chr:beg-end where beg and end are 1-based start and end of a closed-end interval on the reference chr.

JSON

Alignment record JSON representation is a hash with keys ‘qname’, ‘flag’, ‘rname’, ‘pos’, ‘mapq’, ‘cigar’, ‘rnext’, ‘qual’, ‘tags’, e.g.

{“qname”:”EAS56_57:6:190:289:82”,”flag”:69,”rname”:”chr1”,”pos”:100,
“mapq”:0,”cigar”:”*”,”rnext”:”=”,”pnext”:100,”tlen”:0,
“seq”:”CTCAAGGTTGTTGCAAGGGGGTCTATGTGAACAAA”,
“qual”:[27,27,27,22,27,27,27,26,27,27,27,27,27,27,27,27,23,26,26,27,
22,26,19,27,26,27,26,26,26,26,26,24,19,27,26],”tags”:{“MF”:192}}

JSON representation mimics SAM format except quality is given as an array of integers.

Postprocessing JSON output is best accomplished with https://stedolan.github.io/jq/

The output is one line per read, for building a proper JSON array pipe the output into jq --slurp.

OPTIONS

-F, --filter=FILTER Set custom filter for alignments.
-f, --format=FORMAT Specify output format. FORMAT must be one of sam, bam, cram, or json (in lowercase). Default is SAM.
-h, --with-header Print SAM header before reads. This is always done for BAM output.
-H, --header Print only SAM header to STDOUT. If FORMAT is sam or bam, its text version is printed, otherwise JSON object is written.
-I, --reference-info Output to STDOUT reference sequence names and lengths in JSON (see EXAMPLES).
-L, --regions=BEDFILE Intersect a file with regions specified in the BED file.
-c, --count Output to STDOUT only the number of matching records, -hHI options are ignored.
-v, --valid Output only valid reads.
-S, --sam-input Specify that the input is SAM file (default is BAM for all operations).
-C, --cram-input Specify that input is in CRAM format
-p, --show-progress Show progressbar in STDERR. Works only for BAM files, and with no regions specified, i.e. only when reading full file.
-l, --compression-level=COMPRESSION_LEVEL Set compression level for BAM output, a number from 0 to 9.
-o, --output-filename=FILENAME Specify output filename (by default everything is written to STDOUT).
-t, --nthreads=NTHREADS Number of threads to use.

EXAMPLES

Print basic reference sequence information:

 $ sambamba view --reference-info ex1_header.bam[{"name":"chr1","length":1575},{"name":"chr2","length":1584}]

Count reads with mapping quality not less than 50:

 $ sambamba view -c -F "mapping_quality >= 50" ex1_header.bam3124

Count properly paired reads overlapping 100..200 on chr1:

 $ sambamba view -c -F "proper_pair" ex1_header.bam chr1:100-20039

Output header in JSON format:

 $ sambamba view --header --format=json ex1_header.bam{"format_version":"1.3","rg_lines":[],  "sq_lines":[{"sequence_length":1575,"species":"","uri":"",  "sequence_name":"chr1","assembly":"","md5":""},  {"sequence_length":1584,"species":"","uri":"",  "sequence_name":"chr2","assembly":"","md5":""}],  "sorting_order":"coordinate","pg_lines":[]}

5、sort

sambamba-sort - tool for sorting BAM files

sambamba sort [OPTIONS] <input.bam>

BAM files can have either ‘coordinate’ sort order, or ‘qname’ one.

The first one means to sort the file by (integer) reference ID, and for each reference sort corresponding reads by start coordinate.

‘qname’ sorting order is when reads are sorted lexicographically by their names.

sambamba sort does an external stable-type sort on input file. That means it reads the source BAM file in chunks that fit into memory, sorts them and writes to a temporary directory, and then merges them. After merging temporary files are removed automatically.

Both sorting orders are supported. Default one is ‘coordinate’ because this is the one used for building index later. In order to switch to ‘qname’ sorting order, use -n|--sort-by-name flag.

OPTIONS

-m, --memory-limit=LIMIT

Sets an upper bound for used memory. However, this is very approximate. Default memory limit is 2GB. Increasing it will allow to make chunk sizes larger and also reduce amount of I/O seeks thus improving the overall performance.LIMIT must be a number with an optional suffix specyfying unit of measumerent. The following endings are recognized: K, KiB, KB, M, MiB, MB, G, GiB, GB.
--tmpdir=TMPDIR

Use TMPDIR to output sorted chunks. Default behaviour is to use system temporary directory.
-o, --out=OUTPUTFILE Output file name. If not provided, the result is written to a file with .sorted.bam extension.
-n, --sort-by-name Sort by read name lexicographically
-N, --natural-sortSort by read name using natural order (like samtools)
-F, --filter=FILTER Keep only reads that pass FILTER. More on that in sambamba-view documentation.
-l, --compression-level=COMPRESSION_LEVEL

Compression level to use for sorted BAM, from 0 (known as uncompressed BAM in samtools) to 9.
-u, --uncompressed-chunks

Write sorted chunks as uncompressed BAM. Default behaviour is to write them with compression level 1, because that reduces time spent on I/O, but in some cases using this option can give you a better speed. Note, however, that the disk space needed for sorting will typically be 3-4 times more than without enabling this option.
-p, --show-progressShow wget-like progressbar in STDERR (in fact, two of them one after another, first one for sorting, and then another one for merging).
-t, --nthreads=NTHREADS Number of threads to use.

6、merge

sambamba-merge - tool for merging several BAM files into one

sambamba merge [OPTIONS] <output.bam> <input1.bam> <input2.bam> […]

DESCRIPTION

sambamba merge is used for merging several sorted BAM files into one. The sorting order of all the files must be the same, and it is maintained in the output file.

SAM headers are merged automatically like in Picard merging tool.

OPTIONS

-t, --nthreads=NTHREADS

Specify number of threads to use for compression/decompression tasks.
-l, --compression-level=COMPRESSION_LEVEL

Specify compression level of output file as a number from 0 to 9
-H, --header Output only merged header to stdout in SAM format (e.g. for debugging purposes). Other options are ignored.
-p, --show-progressShow progressbar in STDERR.

7、slice

sambamba-slice - copying a slice of BAM file

sambamba slice OPTIONS <input.bam> region

DESCRIPTION

Outputs reads overlapping specified region into new BAM file. (Default destination is STDOUT.) Input file must be coordinate-sorted and indexed.

While the same can be done with sambamba-view, that would be much slower due to a lot of compression/decompression work. Instead of naive method, sambamba-slice leverages knowledge about structure of BAM file and recompresses only a few BGZF blocks at the beginning and the end of the region, while the BGZF blocks in the middle are copied as is. As such, this tool doesn’t offer any options related to number of threads or compression level - most of the time is spent on I/O operations.

OPTIONS

-o, --output-filename=OUTPUTFILE

Name of output file

8、flagstat

sambamba-flagstat - getting flag statistics from BAM file

sambamba flagstat OPTIONS

8、markdup

sambamba-markdup - finding duplicate reads in BAM file

sambamba markdup OPTIONS <input.bam> <output.bam>

Marks (by default) or removes duplicate reads. For determining whether a read is a duplicate or not, the same criteria as in Picard are used.

OPTIONS

-r, --remove-duplicates

remove duplicates instead of just marking them
-t, --nthreads=NTHREADS

number of threads to use
-l, --compression-level=N

specify compression level of the resulting file (from 0 to 9)”);
-p, --show-progress

show progressbar in STDERR
--tmpdir=TMPDIR

specify directory for temporary files; default is /tmp
--hash-table-size=HASHTABLESIZE

size of hash table for finding read pairs (default is 262144 reads); will be rounded down to the nearest power of two; should be > (average coverage) * (insert size) for good performance
--overflow-list-size=OVERFLOWLISTSIZE

size of the overflow list where reads, thrown away from the hash table, get a second chance to meet their pairs (default is 200000 reads); increasing the size reduces the number of temporary files created
--io-buffer-size=BUFFERSIZE

controls sizes of two buffers of BUFFERSIZE megabytes each, used for reading and writing BAM during the second pass (default is 128)

BUGS

External sort is not implemented. Thus, memory consumption grows by 2Gb per each 100M reads. Check that you have enough RAM before running the tool.