基因组数据质控中：先进行SNP缺失质控还是样本缺失质控？

这个问题，我之前没有测试过，所以我自以为是等价的，毫无疑问，我以为的是错误的。

答案是：先进行SNP缺失质控，在进行样本缺失质控。

错误的做法：

先进行样本缺失质控，在进行SNP缺失质控
同时进行SNP和样本的缺失质控

1. 测试数据

测试数据：

样本数：165
SNP数：1457897

$ wc -l test_data.map test_data.ped1457897 test_data.map165 test_data.ped1458062 total

2. 正确做法，先SNP后样本

先对SNP进行缺失质控：
这里--geno 0.02是plink中对SNP进行的缺失质控，质控标准为0.02，即删除缺失率大于2%的SNP。

 plink --file test_data --geno 0.02 --recode --out test1

运行日志：

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test1.log.
Options in effect:--file test_data--geno 0.02--out test1--recode15236 MB RAM detected; reserving 7618 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (1457897 variants, 165 people).
--file: test1-temporary.bed + test1-temporary.bim + test1-temporary.fam
written.
1457897 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997378.
27454 variants removed due to missing genotype data (--geno).
1430443 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (53 phenotypes
are missing.)
--recode ped to test1.ped + test1.map ... done.
Warning: 225 het. haploid genotypes present (see test1.hh ); many commands
treat these as missing.

可以看到：

SNP质控掉：27454
SNP剩余未点：1430443

再对样本进行缺失质控：

这里--mind 0.02是plink中对样本进行的缺失质控，质控标准为0.02，即删除缺失率大于2%的样本。

plink --file test1 --mind 0.02 --recode --out test2

运行日志：

$ plink --file test1 --mind 0.02 --recode --out test2
PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to test2.log.
Options in effect:--file test1--mind 0.02--out test2--recode15236 MB RAM detected; reserving 7618 MB for main workspace.
.ped scan complete (for binary autoconversion).
Performing single-pass .bed write (1430443 variants, 165 people).
--file: test2-temporary.bed + test2-temporary.bim + test2-temporary.fam
written.
1430443 variants loaded from .bim file.
165 people (80 males, 85 females) loaded from .fam.
112 phenotype values loaded from .fam.
0 people removed due to missing genotype data (--mind).
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 112 founders and 53 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.997899.
1430443 variants and 165 people pass filters and QC.
Among remaining phenotypes, 56 are cases and 56 are controls.  (53 phenotypes
are missing.)
--recode ped to test2.ped + test2.map ... done.
Warning: 179 het. haploid genotypes present (see test2.hh ); many commands
treat these as missing.

正确的结果：

剩余SNP：1430443
剩余样本：165

3. 错误做法1，先样本后SNP

下面演示，先对样本进行缺失质控，再对SNP进行缺失质控。

plink --file test_data --mind 0.02 --recode --out test3
plink --file test3 --geno 0.02 --recode --out test4

运行结果：

剩余SNP：1431211
剩余样本：164

$ wc -l test4.map test4.ped1431211 test4.map164 test4.ped1431375 total

和正确的结果对比：

可以看到，顺序不一样，SNP位点不一样，剩余样本也不一样。

4. 错误做法2，SNP和样本同时质控

 plink --file test_data --geno 0.02 --mind 0.02 --recode --out test5

结果是错误的：

$ wc -l test5.map test5.ped1431211 test5.map164 test5.ped1431375 total

如果是–geno和–mind顺序反呢？

 plink --file test_data  --mind 0.02 --geno 0.02 --recode --out test6

结果也是错误的：

$ wc -l test6.map test6.ped1431211 test6.map164 test6.ped1431375 total

5. 为何先质控SNP后质控样本？

SNP的数据来自实验室，无论是芯片数据，GBS数据，二代重测序数等，DNA 与阵列的杂交不佳、基因型探针性能不佳以及样本混淆或污染，都会导致数据质量差。

无论是SNP的缺失率，还是样本的缺失率，都是针对检出率进行的质控。如果一个群体中有些亚群对某些片段没有分型（片段缺失），这种情况下，对于样本进行质控或者样本和SNP同时质控，会将样本删除，而这些样本不是由于DNA质量差或者实验室的原因导致的缺失，而是由于这些样本本身的片段缺失导致的缺失，这种情况下，先对样本或者样本和SNP同时质控，会导致较大的错误。

为了避免这种情况，可以先对SNP的缺失率进行质控，这样由于某些亚群片段缺失导致的缺失，就会在SNP质控时将其删除，就不会影响后续的样本缺失质控的结果。

上面案例中，有一个样本，如果先进行SNP缺失质控再进行样本质控就不会被删除。而先进行样本质控或者同时质控，就会被删除。

6. 参考文献

该篇的缘由是因为有老师提出前后顺序对他的数据影响较大，在这里十分感谢这位老师。我这里总结一下，希望大家少走弯路。

文献地址：

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6001694/

文中截图：

感觉对您有帮助，三连走起吧！

欢迎关注我的公众号：育种数据分析之放飞自我。主要分享R语言，Python，育种数据分析，生物统计，数量遗传学，混合线性模型，GWAS和GS相关的知识。