How metaWRAP refines bins

- Introduction
- Processing steps
  - bin_refinement.sh
  - binning_refiner.py
  - consolidate_two_sets_of_bins.py
  - dereplicate_contigs_in_bins.py
- Summary

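Introduction

metaWRAP is a highly modular, easy-to-use integrated toolkit for metagenomic analysis. Bins refined by metaWRAP are often better than those produced by any single binning method, or by DAS_Tool, another consensus binning tool.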

Processing steps

bin_refinement.sh

metaWRAP first combines several bin sets to produce a new “refined” bin set:
```
${SOFT}/binning_refiner.py -1 binsA -2 binsB -o Refined_AB
```
The “refined” set is the minimal consensus intersection of the input bin sets.

It then runs CheckM on every original and refined bin set and reads the results:

if [ "$quick" == "true" ]; then
    comm "Note: running with --reduced_tree option"
    checkm lineage_wf -x fa $bin_set ${bin_set}.checkm -t $threads --tmpdir ${bin_set}.tmp --pplacer_threads $p_threads --reduced_tree
else
    checkm lineage_wf -x fa $bin_set ${bin_set}.checkm -t $threads --tmpdir ${bin_set}.tmp --pplacer_threads $p_threads
fi
if [[ ! -s ${bin_set}.checkm/storage/bin_stats_ext.tsv ]]; then error "Something went wrong with running CheckM. Exiting..."; fi
${SOFT}/summarize_checkm.py ${bin_set}.checkm/storage/bin_stats_ext.tsv $bin_set | (read -r; printf "%s\n" "$REPLY"; sort) > ${bin_set}.stats

The table written by “summarize_checkm.py” has the following columns:
- bin, completeness, contamination, GC, lineage, N50, size, binner

Finally, the bin sets are merged and dereplicated:

for bins in $(ls | grep .stats | grep -v binsM); do
    comm "merging $bins and binsM"
    ${SOFT}/consolidate_two_sets_of_bins.py binsM ${bins%.*} binsM.stats $bins binsM1 $comp $cont
    if [[ $? -ne 0 ]]; then error "Something went wrong with merging two sets of bins"; fi
    rm -r binsM binsM.stats
    mv binsM1 binsM; mv binsM1.stats binsM.stats
done

Only bins that overlap by more than 80% are dereplicated; all other bins are left as they are.

comm "Scanning to find duplicate contigs between bins and only keep them in the best bin..."
${SOFT}/dereplicate_contigs_in_bins.py binsM.stats binsM binsO

Before writing the output, low-quality MAGs are removed:

${SOFT}/summarize_checkm.py binsO.checkm/storage/bin_stats_ext.tsv manual binsM.stats | (read -r; printf "%s\n" "$REPLY"; sort -rn -k2) > binsO.stats
if [[ $? -ne 0 ]]; then error "Cannot make checkm summary file. Exiting."; fi
rm -r binsO.checkm
num=$(cat binsO.stats | awk -v c="$comp" -v x="$cont" '{if ($2>=c && $2<=100 && $3>=0 && $3<=x) print $1 }' | wc -l)
comm "There are $num 'good' bins found in binsO.checkm! (>${comp}% completion and <${cont}% contamination)"
comm "Removing bins that are inadequate quality..."
for bin_name in $(cat binsO.stats | grep -v compl | awk -v c="$comp" -v x="$cont" '{if ($2<c || $2>100 || $3<0 || $3>x) print $1 }' | cut -f1); do
    echo "${bin_name} will be removed because it fell below the quality threshhold after de-replication of contigs..."
    rm binsO/${bin_name}.fa
done
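
The awk filter above can be sketched in Python as follows. This is only an illustration of the threshold test, not part of metaWRAP: the function name `split_by_quality`, the example rows, and the default thresholds of 70 and 10 are assumptions, and the column names follow the “summarize_checkm.py” columns listed earlier.

```python
import csv
import io


def split_by_quality(stats_text, min_completion=70.0, max_contamination=10.0):
    """Split bins into (kept, removed) with the same test as the awk filter.

    The thresholds are illustrative defaults standing in for $comp and $cont.
    """
    kept, removed = [], []
    reader = csv.DictReader(io.StringIO(stats_text), delimiter="\t")
    for row in reader:
        completion = float(row["completeness"])
        contamination = float(row["contamination"])
        if min_completion <= completion <= 100 and 0 <= contamination <= max_contamination:
            kept.append(row["bin"])
        else:
            removed.append(row["bin"])
    return kept, removed


# Hypothetical stats table with only the three columns the filter needs.
example = (
    "bin\tcompleteness\tcontamination\n"
    "bin.1\t95.2\t1.3\n"
    "bin.2\t64.0\t0.5\n"
    "bin.3\t88.7\t12.4\n"
)
kept, removed = split_by_quality(example)
print("kept:", kept)
print("removed:", removed)
```

Here bin.2 fails on completeness and bin.3 on contamination, so only bin.1 would survive the final cleanup.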

binning_refiner.py

First, it builds “folder_bins_dict”:
- Dict[str, List[str]]: {"binsA": ["Genome1.fa", "Genome2.fa"]}

It then combines the contig names with the file names.

Renaming:

each_contig_new_id = '%s%s%s%s%s' % (each_folder, separator, bin_file_name, separator, each_contig.id)
# binsA__metabat.1.fa__k141_1

Writing the new files:

os.system('cat %s/%s/%s_new/*.fasta > %s/%s/combined_%s_bins.fa' % (wd, output_folder, each_folder, wd, output_folder, each_folder))

os.system('cat %s %s > %s' % (pwd_combined_folder_1_bins, pwd_combined_folder_2_bins, combined_all_bins_file))

This intermediate step is effectively redundant.

It then records, for each contig name, the bins that contain it:

if contig_id not in contig_bin_dict:
    contig_bin_dict[contig_id] = ['%s%s%s' % (folder_name, separator, bin_name)]
    contig_length_dict[contig_id] = length
elif contig_id in contig_bin_dict:
    contig_bin_dict[contig_id].append('%s%s%s' % (folder_name, separator, bin_name))
# {"k141_1": ["binsA__metabat.1.fa", "binsB__maxbin.6.fa"]}

Only contigs that appear in a bin of every input bin set are considered:

for each in contig_bin_dict:
    if len(contig_bin_dict[each]) == len(input_bin_folder_list):
        contig_assignments.write('%s\t%s\t%s\n' % ('\t'.join(contig_bin_dict[each]), each, contig_length_dict[each]))
# binsA__metabat.1.fa | binsB__maxbin.6.fa | k141_1 | 35000 (| := \t)

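The lines are then sorted by their combination of bins, which is required for the grouping into new bins that follows.

The full intersection is taken across all bin sets, and only new bins that are long enough are kept: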

refined_bin_name = 'refined_bin%s' % n
if current_length_total >= bin_size_cutoff:
    contig_assignments_sorted_one_line.write('Refined_%s\t%s\t%sbp\t%s\n' % (n, current_match, current_length_total,'\t'.join(current_match_contigs)))
    n += 1
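
The steps above can be condensed into a small in-memory sketch. It is not the actual binning_refiner.py code, which works through intermediate files: the function `refine_bins`, its arguments, and the contig names and lengths are hypothetical, and the sketch assumes every contig occurs at most once per bin set.

```python
from collections import defaultdict

SEPARATOR = "__"


def refine_bins(bin_sets, contig_lengths, bin_size_cutoff=0):
    """Group contigs by the combination of bins that contain them.

    bin_sets: {set_name: {bin_name: [contig, ...]}}
    Returns {refined_bin_name: [contig, ...]}.
    """
    # Record, per contig, every "set__bin" label it belongs to.
    contig_bins = defaultdict(list)
    for set_name, bins in bin_sets.items():
        for bin_name, contigs in bins.items():
            for contig in contigs:
                contig_bins[contig].append(set_name + SEPARATOR + bin_name)

    # Keep only contigs that were binned in every input set.
    groups = defaultdict(list)
    for contig, labels in contig_bins.items():
        if len(labels) == len(bin_sets):
            groups[tuple(labels)].append(contig)

    # Each distinct combination becomes one refined bin if it is long enough.
    refined = {}
    n = 1
    for combination in sorted(groups):
        contigs = groups[combination]
        if sum(contig_lengths[c] for c in contigs) >= bin_size_cutoff:
            refined["Refined_%s" % n] = sorted(contigs)
            n += 1
    return refined


bin_sets = {
    "binsA": {"bin_1.x": ["contig_1", "contig_2"], "bin_1.y": ["contig_3", "contig_4"]},
    "binsB": {"bin_2.c": ["contig_2", "contig_3", "contig_4"]},
}
contig_lengths = {"contig_1": 20000, "contig_2": 36000, "contig_3": 18000, "contig_4": 14000}
refined = refine_bins(bin_sets, contig_lengths)
print(refined)
```

In this toy input contig_1 is dropped because only one bin set contains it, and the remaining contigs split into two refined bins according to which pair of bins they share.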

consolidate_two_sets_of_bins.py

First, only bins that are good enough are kept:

if float(cut[1])>c and float(cut[2])<x: good_bins_1[cut[0]+'.fa']=None

The contigs of each bin and their lengths are recorded:

bins_1[bin_file][contig_name] = contig_len

For bins that share contigs, the overlap ratio is computed relative to each bin, and the larger value is used:

# choose the highest % ID, depending on which bin is a subset of the other
ratio_1=100*match_1_length/(match_1_length+mismatch_1_length)
ratio_2=100*match_2_length/(match_2_length+mismatch_2_length)
all_bin_pairs[bin_1][bin_2]=max([ratio_1, ratio_2])

For bin pairs that overlap by more than 80%, the better bin is chosen; otherwise both bins are kept.
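
A minimal sketch of this decision for a single pair of bins is shown below. The helper names and the example data are hypothetical, and the scoring rule (completeness minus five times contamination) is an assumption made for illustration; the text above only says that the better bin is chosen.

```python
def overlap_percent(bin_1, bin_2):
    """Largest share of either bin's length that lies on shared contigs.

    bin_1, bin_2: {contig_name: contig_length}, both assumed non-empty.
    """
    shared = bin_1.keys() & bin_2.keys()
    ratio_1 = 100 * sum(bin_1[c] for c in shared) / sum(bin_1.values())
    ratio_2 = 100 * sum(bin_2[c] for c in shared) / sum(bin_2.values())
    return max(ratio_1, ratio_2)


def score(stats):
    """Assumed quality score: completeness penalised by contamination."""
    completeness, contamination = stats
    return completeness - 5 * contamination


def consolidate_pair(name_1, bin_1, stats_1, name_2, bin_2, stats_2, min_overlap=80):
    """Return the names of the bins to keep for this pair."""
    if overlap_percent(bin_1, bin_2) > min_overlap:
        return [name_1] if score(stats_1) >= score(stats_2) else [name_2]
    return [name_1, name_2]


# Hypothetical bins: contig lengths and (completeness, contamination) in percent.
bin_2c = {"contig_2": 36000, "contig_3": 18000, "contig_4": 14000}
bin_ABm = {"contig_2": 36000}
bin_1x = {"contig_1": 20000, "contig_2": 36000}
bin_ABn = {"contig_3": 18000, "contig_4": 14000}

kept_overlapping = consolidate_pair("bin 2.c", bin_2c, (78.6, 0.0), "bin AB.m", bin_ABm, (57.1, 0.0))
kept_disjoint = consolidate_pair("bin 1.x", bin_1x, (64.3, 14.3), "bin AB.n", bin_ABn, (21.4, 0.0))
print(kept_overlapping, kept_disjoint)
```

Taking the maximum of the two ratios means that a bin which is a subset of a larger bin counts as a full overlap, which is why the first pair collapses to one bin while the second pair, with no shared contigs, is kept whole.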

dereplicate_contigs_in_bins.py

It records which bins each contig belongs to and, based on bin quality, keeps the contig only in the highest-quality bin.
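
The idea can be sketched as follows. This is not the script itself: `dereplicate_contigs`, the bin names, and the single numeric score per bin are hypothetical stand-ins for whatever quality ranking is derived from binsM.stats.

```python
def dereplicate_contigs(bins, bin_scores):
    """Keep every contig only in the best-scoring bin that contains it.

    bins: {bin_name: [contig, ...]}
    bin_scores: {bin_name: score}, higher is better (assumed ranking).
    """
    best_bin = {}
    for bin_name, contigs in bins.items():
        for contig in contigs:
            current = best_bin.get(contig)
            if current is None or bin_scores[bin_name] > bin_scores[current]:
                best_bin[contig] = bin_name

    # Rebuild each bin with only the contigs it "won".
    return {
        bin_name: [c for c in contigs if best_bin[c] == bin_name]
        for bin_name, contigs in bins.items()
    }


bins = {"bin.1": ["k141_1", "k141_2"], "bin.2": ["k141_2", "k141_3"]}
bin_scores = {"bin.1": 90.0, "bin.2": 70.0}
dereplicated = dereplicate_contigs(bins, bin_scores)
print(dereplicated)
```

The shared contig k141_2 stays in bin.1 and is removed from bin.2, which is why the final CheckM run and quality filter have to be repeated on the dereplicated bins.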

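Summary

metaWRAP intersects the contigs of up to three bin sets at a time to build consensus combinations, then uses CheckM to compute quality metrics for every bin. In essence, it relies on CheckM markers to tell whether the regions on which the binning methods disagree carry marker genes or are redundant.

An example: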

-------------------| 1. Assembly to genomes
assembled contigs  | [---- contig 1 ----] [------------ contig 2 ------------] [--- contig 3 ---] [- contig 4 -] ...
checkm markers     |         A   B   C      B   C   D   E   F   G   H   I          J  K  L                       M N
-------------------| 2. Now binning by two different methods: method1, method2
method1.bin.x      | [                        bin 1.x                        ] [            bin 1.y            ]
method2.bin.c      |                      [                           bin 2.c                                  ]
-------------------| 3. Now refine with metaWRAP
refineAB.bin.m     |                      [            bin AB.m              ]
refineAB.bin.n     |                                                           [            bin AB.n           ]

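Suppose four contigs are binned. Method 1 puts contig 1 and contig 2 into one bin (bin 1.x) and contig 3 and contig 4 into another (bin 1.y); method 2 puts contig 2, contig 3 and contig 4 into one bin (bin 2.c).
metaWRAP first takes the intersection, which gives the new bins of refineAB: bin AB.m contains only contig 2, and bin AB.n contains contig 3 and contig 4 (step 3 in the diagram).

It then calls CheckM to compute the completeness and contamination of each bin:

Suppose CheckM assigns all of these bins to the same lineage, which has 14 marker genes denoted A-N, of which A-L lie on contigs 1-3.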

bins	markers	uniq genes	completeness	duplicate genes	contamination
bin 1.x	ABCBCDEFGHI	A-I (9)	9/14	BC	2/14
bin 1.y	JKL	3	3/14		0
bin 2.c	BCDEFGHIJKL	B-L (11)	11/14		0
bin AB.m	BCDEFGHI	B-I (8)	8/14		0
bin AB.n	JKL	3	3/14		0
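
The counts in the table above can be reproduced with a simplified counting model. This is only a sketch of the toy example, not CheckM's actual estimate, which is based on collocated marker sets; the dictionaries and the `evaluate` helper are hypothetical.

```python
from collections import Counter

TOTAL_MARKERS = 14  # markers A-N of the assumed lineage

# Marker genes carried by each contig in the diagram.
CONTIG_MARKERS = {
    "contig 1": "ABC",
    "contig 2": "BCDEFGHI",
    "contig 3": "JKL",
    "contig 4": "",
}

BINS = {
    "bin 1.x": ["contig 1", "contig 2"],
    "bin 1.y": ["contig 3", "contig 4"],
    "bin 2.c": ["contig 2", "contig 3", "contig 4"],
    "bin AB.m": ["contig 2"],
    "bin AB.n": ["contig 3", "contig 4"],
}


def evaluate(contigs):
    """Return (unique markers, extra marker copies) for a list of contigs."""
    counts = Counter(m for contig in contigs for m in CONTIG_MARKERS[contig])
    unique = len(counts)
    duplicated = sum(n - 1 for n in counts.values())
    return unique, duplicated


for name, contigs in BINS.items():
    unique, duplicated = evaluate(contigs)
    print("%s\tcompleteness %d/%d\tcontamination %d/%d"
          % (name, unique, TOTAL_MARKERS, duplicated, TOTAL_MARKERS))
```

Completeness here is the number of distinct markers found, and contamination is the number of extra copies, both relative to the 14 markers of the lineage.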

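Based on completeness and contamination, bin 2.c is selected in the end.
- If the marker genes on contig 3 were also A, B and C, the bin selected in the end would be bin AB.m.
- So whether contig 4 is kept depends on contig 3, which is grouped with it.

However, CheckM is currently the de facto standard for bin quality. Comparing other binning results with metaWRAP's results therefore makes CheckM both “referee” and “player”, which can easily lead to overestimating metaWRAP's results. For example, recombinant strains (a bin that contains two fragments of different origin from closely related taxa) may affect the analysis. In addition, the metaWRAP code is cumbersome and outdated and contains many repeated computation steps, so there is a lot of room for optimization.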
