VCF | 多等位基因位点如何拆分?InDel变异如何标准化?
统一标准
多等位基因位点 (Multiallelic sites) 的拆分,左对齐标准化InDel,对ANNOVAR注释、MAF文件转换、映射人群频率、映射致病性等在少数特定位点上有影响,除非不分析多等位基因 (可能是因为分析起来过于复杂,或者这些位点不是那么重要,很多研究和应用领域不考虑这些位点)。
工具:bcftools norm
bcftools norm
About:
Left-align and normalize indels;
check if REF alleles match the reference;
split multiallelic sites into multiple rows;
recover multiallelics frommultiple rows.
Usage: bcftools norm [options] <in.vcf.gz>
Options:
-c, --check-ref <e|w|x|s> check REF alleles and exit (e), warn (w), exclude (x), or set (s) bad sites [e]
-D, --remove-duplicates remove duplicate lines of the same type.
-d, --rm-dup <type> remove duplicate snps|indels|both|all|none
-f, --fasta-ref <file> reference sequence (MANDATORY)
-m, --multiallelics <-|+>[type] split multiallelics (-) or join biallelics (+), type: snps|indels|both|any [both]
--no-version do not append version and command line to the header
-N, --do-not-normalize do not normalize indels (with -m or -c s)
-o, --output <file> write output to a file [standard output]
-O, --output-type <type> 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
-r, --regions <region> restrict to comma-separated list of regions
-R, --regions-file <file> restrict to regions listed in a file
-s, --strict-filter when merging (-m+), merged site is PASS only if all sites being merged PASS
-t, --targets <region> similar to -r but streams rather than index-jumps
-T, --targets-file <file> similar to -R but streams rather than index-jumps
--threads <int> number of extra (de)compression threads [0]
-w, --site-win <int> buffer for sorting lines which changed position during realignment [1000]
位点测试
位点1 - 含Del
# 拆分前
zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs34364959" | sed 's/\t/ /g'
# 拆分后
bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs34364959" | sed 's/\t/ /g'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 2411436 rs34364959 CG TG,C 652.11 PASS AC=1,1;AF=0.125,0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1,1;MLEAF=0.125,0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:28,22,0:50:99:.:.:473,0,498,554,564,1118:. 0|2:33,0,7:40:99:0|1:2411398_C_A:195,294,1680,0,1386,1365:2411398 0/0:63,0,0:63:0:.:.:0,0,1220,0,1220,1220:. 0/0:71,0,0:71:59:.:.:0,59,1675,59,1675,1675:.
拆分后
Lines total/split/realigned/skipped:1439/51/9/0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 2411436 rs34364959 C T 652.11 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1;MLEAF=0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121 GT:AD:DP:GQ:PGT:PID:PL:PS 0/1:28,22:50:99:.:.:473,0,498:. 0|0:33,0:40:99:0|1:2411398_C_A:195,294,1680:2411398 0/0:63,0:63:0:.:.:0,0,1220:. 0/0:71,0:71:59:.:.:0,59,1675:.
chr11 2411436 rs34364959 CG C 652.11 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.92;DB;DP=227;ExcessHet=3.6798;FS=3.302;MLEAC=1;MLEAF=0.125;MQ=59.87;MQRankSum=0;QD=7.25;ReadPosRankSum=-1.212;SOR=1.121 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:28,0:50:99:.:.:473,554,1118:. 0|1:33,7:40:99:0|1:2411398_C_A:195,0,1365:2411398 0/0:63,0:63:0:.:.:0,0,1220:. 0/0:71,0:71:59:.:.:0,59,1675:.
还不错,基本该做到的、都做到了;难以做到的、也都没做到
位点2 - 纯SNP
# 拆分前
zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs752579667" | sed 's/\t/ /g'
# 拆分后
bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs752579667" | sed 's/\t/ /g'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 3361836 rs752579667 C T,A 69.04 PASS AC=1,1;AF=0.125,0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1,1;MLEAF=0.125,0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0,0:17:23:.:.:0,23,365,23,365,365:. 0/0:35,0,0:35:78:.:.:0,78,1170,78,1170,1170:. 0/1:13,2,1:17:42:.:.:45,0,540,42,504,585:. 0|2:14,0,2:16:42:0|1:3361828_A_G:42,84,657,0,573,567:3361828
拆分后
Lines total/split/realigned/skipped:1439/51/9/0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 3361836 rs752579667 C T 69.04 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1;MLEAF=0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0:17:23:.:.:0,23,365:. 0/0:35,0:35:78:.:.:0,78,1170:. 0/1:13,2:17:42:.:.:45,0,540:. 0|0:14,0:16:42:0|1:3361828_A_G:42,84,657:3361828
chr11 3361836 rs752579667 C A 69.04 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=2.24;DB;DP=86;ExcessHet=3.6798;FS=12.725;MLEAC=1;MLEAF=0.125;MQ=52.03;MQRankSum=-1.799;QD=2.16;ReadPosRankSum=1.59;SOR=1.057 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:17,0:17:23:.:.:0,23,365:. 0/0:35,0:35:78:.:.:0,78,1170:. 0/0:13,1:17:42:.:.:45,42,585:. 0|1:14,2:16:42:0|1:3361828_A_G:42,0,567:3361828
位点3 - 含Del
# 拆分前
zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs1050228\|rs749512034\|rs558809956" | sed 's/\t/ /g'
# 拆分后
bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep "CHROM\|rs1050228\|rs749512034\|rs558809956" | sed 's/\t/ /g'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 6390705 rs1050228;rs749512034;rs558809956 TGCTGGCGCTGGCGCTGGC TGCTGGCGCTGGC,CGCTGGCGCTGGCGCTGGC,T 2139.3 PASS AC=4,2,1;AF=0.5,0.25,0.125;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=4,2,1;MLEAF=0.5,0.25,0.125;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 1/2:0,6,7,0:13:99:372,166,199,220,0,254,387,198,250,433 1/2:0,4,3,0:7:86:245,86,114,168,0,159,252,109,168,270 0/1:10,14,0,0:24:99:546,0,331,576,375,951,576,375,951,951 1/3:0,13,0,13:26:99:1007,516,476,1048,548,1237,503,0,729,878
拆分后
Lines total/split/realigned/skipped: 1439/51/9/0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 6390705 rs1050228;rs749512034;rs558809956 TGCTGGC T 2139.3 PASS AC=4;AF=0.5;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=4;MLEAF=0.5;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 1/0:0,6:13:99:372,166,199 1/0:0,4:7:86:245,86,114 0/1:10,14:24:99:546,0,331 1/0:0,13:26:99:1007,516,476
chr11 6390705 rs1050228;rs749512034;rs558809956 T C 2139.3 PASS AC=2;AF=0.25;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=2;MLEAF=0.25;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 0/1:0,7:13:99:372,220,254 0/1:0,3:7:86:245,168,159 0/0:10,0:24:99:546,576,951 0/0:0,0:26:99:1007,1048,1237
chr11 6390705 rs1050228;rs749512034;rs558809956 TGCTGGCGCTGGCGCTGGC T 2139.3 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=0.105;DB;DP=126;ExcessHet=3.0103;FS=1.475;MLEAC=1;MLEAF=0.125;MQ=59.95;MQRankSum=0;QD=30.56;ReadPosRankSum=-1.026;SOR=0.399 GT:AD:DP:GQ:PL 0/0:0,0:13:99:372,387,433 0/0:0,0:7:86:245,252,270 0/0:10,0:24:99:546,576,951 0/1:0,13:26:99:1007,503,878
位点4 - 含Ins
# chr11:281137
# 拆分前
zcat ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep -P 'CHROM|chr11\t281137' | sed 's/\t/ /g'
# 拆分后
bcftools norm -f /db/gatk/hg38/Homo_sapiens_assembly38.fasta \--multiallelics -both -Ov ~/wgs/result/Genotype.cohort.dbSNP.g.vcf.gz | \grep -P 'CHROM|chr11\t281137' | sed 's/\t/ /g'
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 281137 . T TTTTCTCGCAATACGGCCGTATCATCACCTCACGAATCCTGGTTGATCAAGTCACA,TTTTCTCGCAATAC 1406.78 PASS AC=2,1;AF=0.25,0.125;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=2,1;MLEAF=0.25,0.125;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS 0|2:25,0,26:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,1092,2108,0,1016,938:281137 0/0:56,0,0:56:90:.:.:0,90,1350,90,1350,1350:. 0/1:143,15,0:158:99:.:.:195,0,5834,630,5879,6509:. 0/1:122,14,0:136:99:.:.:215,0,5041,588,5083,5671:.
拆分后
Lines total/split/realigned/skipped:1439/51/9/0
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHB1 CHB2 GWD1 GWD2
chr11 281137 . T TTTTCTCGCAATACGGCCGTATCATCACCTCACGAATCCTGGTTGATCAAGTCACA 1406.78 PASS AC=2;AF=0.25;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=2;MLEAF=0.25;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS 0|0:25,0:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,1092,2108:281137 0/0:56,0:56:90:.:.:0,90,1350:. 0/1:143,15:158:99:.:.:195,0,5834:. 0/1:122,14:136:99:.:.:215,0,5041:.
chr11 281137 . T TTTTCTCGCAATAC 1406.78 PASS AC=1;AF=0.125;AN=8;BaseQRankSum=-0.471;DP=417;ExcessHet=5.4407;FS=157.296;MLEAC=1;MLEAF=0.125;MQ=57.36;MQRankSum=-11.61;QD=4.08;ReadPosRankSum=-0.004;SOR=8.051 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:25,26:51:99:0|1:281137_T_TTTTCTCGCAATAC:1017,0,938:281137 0/0:56,0:56:90:.:.:0,90,1350:. 0/0:143,0:158:99:.:.:195,630,6509:. 0/0:122,0:136:99:.:.:215,588,5671:.
使用方法
# 变异标准化
# 多等位基因位点 (Multiallelic sites) 的拆分
bcftools norm -f ${ref} --multiallelics -both ${result}/Genotype.cohort.dbSNP.g.vcf \-Oz -o ${result}/Genotype.cohort.dbSNP.g.vcf.gz
变异标准化及左对齐等相关理论
下面的例子中,多核苷酸多态性(Multi Nucleotide Polymorphism, MNP)的前 3 个核苷酸是多余的 (Represented superfluously),而从第 4 个核苷酸开始表示时,是简约化的 (Parsimoniously)。
Variant normalization
当一个变异在左侧有多余的核苷酸时,它被定义为"非左简约化",因为需要向左修剪 (Defined as not being left parsimonious as there is a need to left trim)。相对于这个概念有:Right parsimony and trimming。
Parsimony 也适用于 Indel,及左对齐 (Left alignment)。
更多解释:
https://genome.sph.umich.edu/wiki/Variant_Normalization
Adrian Tan, Gonçalo R. Abecasis and Hyun Min Kang. (2015) Unified Representation of Genetic Variants. Bioinformatics.
VCF | 多等位基因位点如何拆分?InDel变异如何标准化?相关推荐
- R语言实现GWAS结果显著SNP位点归类提取与变异类型转化
GWAS结果显著SNP位点归类提取与变异类型转化 根据GWAS得到的Rresult文件信息,能够找出每个snp位点对应的显著性情况和基因变异信息,接下来,需要根据表格中的信息进行归纳总结,对不同显著性 ...
- vcf 文件拼接(snp、indel)
bgzip -c raw_snp_vcf > raw_snp_bgzip.gzbcftools index raw_snp_bgzip.gzbgzip -c raw_indel_vcf > ...
- GATK教程 / 体细胞短变异检测 (SNV+InDel)流程概览
体细胞短变体检测 (SNV + InDel) Somatic short variant discovery (SNVs + Indels) 目的 在单个个体的一个或多个肿瘤样本中,识别体细胞短变异( ...
- 处理vcf文本设计多态性indel标记--GATK、vcf
1.下载GTAK并解压 下载GATK的安装包. 参考链接:Releases · broadinstitute/gatk (github.com) 注意解压指令是unzip.没有需要安装 unzip g ...
- 【Bioconductor系列】利用Bioconductor包进行基因组变异位点注释
基因组变异位点注释 安装工作流程所需的biconductor包 source("http://bioconductor.org/workflows.R") workflowInst ...
- 用GATK进行二代测序数据 SNP Calling 流程:(四)变异过滤
GATK推荐的最好的过滤方式是用 VQSR功能,它通过机器学习算法来判断SNP的优劣,因此至少需要两个已存在的 SNP 数据集,一个是经过验证的高质量 SNP 数据集作为真集(如 HapMap),还需 ...
- 实操 | 合并VCF文件的几种方法及注意事项
背 景 在基因组分析领域的很多不同场景中,需要合并VCF文件. VCF文件.简单来说,就是记录样本基因型的文件.但多数VCF文件不只记录了基因型,也包含有关该基因型的来源的细节. 其它文件.VCF文件 ...
- biostar handbook(十)|如何进行变异检测
变异检测流程 什么是基因组变异 基因组变异是一个定义比较模糊的概念. 所谓的变异是相对于一个完美的"参考基因组"而言.但是其实完美的"参考基因组"并不存在,因为 ...
- 突变检测软件 测试数据库,测序数据比对和变异检测
全基因组重测序是对已知基因组序列的物种进行不同个体的基因组测序,并在此基础上对个体或群体进行差异性分析.所以首先我们需要某个物种的全基因组序列和该物种的某个个体的基因组序列. 重测序数据分析流程 重测 ...
最新文章
- 合理估算线程池线程数量
- chrome/FF/safari浏览器下input和textarea的默认样式outline和resize
- 计算机二级c语言程序,二级C语言考试系统
- 用python编写的无线AP扫描器
- android 单位转换工具,Android单位转换----常用单位转换工具类
- mysql8碰到 ERROR 1064 (42000)
- 魔幻艰难的2020上半年!
- 明年的现在我也想去“双选会”应聘!
- aws rds监控慢sql_探索AWS RDS SQL Server上SQL Server集成服务(SSIS)
- Oracle传统基本体系结构初步介绍(2)
- 学习廖雪峰的Python教程之Python基础
- java完成crm系统ppt,客户关系管理系统答辩稿.ppt
- 必备24个宝藏工具,赶紧收藏,在家做自媒体8天收益4100
- php定义一个矩形类rectangle,c#定义一个类圆Circle或者定义一个矩形类Rectangle,分别计算它们的周长和面积....
- win10重装系统后出现的0xc0000225问题解决
- 摩斯密码php,普及一下LOL中的摩斯密码 绝对的干货
- 技术Leader的30条军规
- linux tcp repair及tcp热迁移
- JAVA架构师面试分享—链家网
- LXD 2.0 系列(八):LXD中的LXD
热门文章
- jQuery从入门到进阶视频教程-汤小洋-专题视频课程
- 大型医院影像PACS系统三维重建技术(获取数据、预处理、配准、重建和可视化)
- python + snownlp 正负面分析
- ccflow 代码分析
- 计算机为什么不读500g硬盘,为什么我的500g硬盘的实际容量只有46 5. 1G
- 某学生的记录由学号、5门课程成绩和平均分组成,学号和5门课程的成绩已在主函数中给出。请编写函数fun,它的功能是:求出该学生的平均分,并放在记录的ave成员中。
- 加勒比海盗1英文剧本
- 软件 黑苹果盒盖不休眠_怎么解决苹果电脑合盖自动休眠问题?
- 【华为OD机试真题 python】 比赛【2022 Q4 | 100分】
- Java GC 介绍