PRSice-2 Installation Tutorial

1. Downloading the software

https://choishingwan.github.io/PRS-Tutorial/prsice/

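The latest release is available for both Mac and Linux; the walkthrough below uses Linux.

Installation package:

Test data:

Note: if the files above cannot be downloaded, reply "PRS" in the backend of the WeChat public account 育种数据分析之放飞自我 to get the software package and the test data.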

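2. Installing the software

This walkthrough was run on CentOS 7 (Linux).

The downloaded installation package and test data:

Create a new folder named prs, put the archive in it, and unzip the software: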

$ unzip PRSice_linux.zip
Archive:  PRSice_linux.zip
  inflating: PRSice.R
  inflating: TOY_BASE_GWAS.assoc
  inflating: TOY_TARGET_DATA.bed
  inflating: TOY_TARGET_DATA.bim
  inflating: TOY_TARGET_DATA.fam
  inflating: TOY_TARGET_DATA.pheno
  inflating: PRSice_linux

Create a new folder, put the test data archive in it, and unzip:

$ unzip post-qc.zip
Archive:  post-qc.zip
  inflating: Height.QC.gz
  inflating: EUR.QC.bed
  inflating: EUR.QC.bim
  inflating: EUR.QC.fam
  inflating: EUR.height
  inflating: EUR.cov
  inflating: EUR.eigenvec

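3. Overview of the files

3.1 PRSice_linux

A binary executable that can be run directly. It is the core program that computes the PRS.

3.2 PRSice.R

An R script used mainly for plotting: bar plot, high-resolution plot, and quantile plot.

3.3 BASE file

The base data file; here it holds GWAS summary statistics.

It contains results for 91,062 SNPs, the number of variants PRSice reports for this file in the run log.

TOY_BASE_GWAS.assoc

The columns are:

SNP: SNP name
CHR: chromosome
BP: physical position
A1: the effect allele
A2: the non-effect allele
P: p-value
OR: odds ratio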

$ head TOY_BASE_GWAS.assoc
SNP CHR BP A1 A2 P OR
SNP_22857 4 103593179 1 2 0.2852 13.29
SNP_13879 2 237416793 1 2 0.8784 21.624
SNP_20771 4 16957461 1 2 0.1994 91.265
SNP_13787 2 235355721 1 2 0.7234 3.178
SNP_25383 4 189927377 1 2 0.3309 3.167
SNP_25290 4 187995996 1 2 0.6327 0.427
SNP_21478 4 40161304 1 2 0.06454 5.066
SNP_12129 2 176643771 1 2 0.9378 1.276
SNP_22809 4 101441465 1 2 0.8111 0.004
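
To make the column layout concrete, here is a minimal Python sketch that parses a few rows of the base file and derives log(OR) for each SNP. The function name `parse_base` and the `LOG_OR` field are illustrative, not part of PRSice. Using the natural log of the odds ratio as the per-SNP weight is the usual convention for OR-based scores and is assumed here, not read out of PRSice itself.

```python
import math

# First rows of TOY_BASE_GWAS.assoc, copied from the listing above.
SAMPLE = """\
SNP CHR BP A1 A2 P OR
SNP_22857 4 103593179 1 2 0.2852 13.29
SNP_13879 2 237416793 1 2 0.8784 21.624
SNP_20771 4 16957461 1 2 0.1994 91.265
"""


def parse_base(text):
    """Parse whitespace-delimited summary statistics into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        rec["P"] = float(rec["P"])
        rec["OR"] = float(rec["OR"])
        # Assumption: the effect size on the additive scale is ln(OR).
        rec["LOG_OR"] = math.log(rec["OR"])
        rows.append(rec)
    return rows


rows = parse_base(SAMPLE)
for r in rows:
    print(r["SNP"], r["P"], round(r["LOG_OR"], 4))
```

To read the real file, pass the contents of TOY_BASE_GWAS.assoc to `parse_base` in place of `SAMPLE`.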

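3.4 TARGET file

The target data: PLINK binary files plus the matching phenotype file.

They cover 2,000 individuals.

$ ls TOY_TARGET_DATA.*
TOY_TARGET_DATA.bed  TOY_TARGET_DATA.bim  TOY_TARGET_DATA.fam  TOY_TARGET_DATA.pheno

The PLINK binary fileset consists of the bim, bed, and fam files.
The phenotype file: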

FID IID Pheno
CAS_1 CAS_1 0.687940475297167
CAS_2 CAS_2 -0.156139175886002
CAS_3 CAS_3 -0.690876014335686
CAS_4 CAS_4 -0.147899250768441
CAS_5 CAS_5 -0.66034095162489
CAS_6 CAS_6 0.0438675950151819
CAS_7 CAS_7 -0.472359275893301
CAS_8 CAS_8 1.49482653529014
CAS_9 CAS_9 1.01876296041241

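4. PRSice-2 parameters

View the help text:

./PRSice_linux

The help text is organized into these groups:

Base file: options for the base file, such as the column names for a1, a2, beta, bp, chr, and p-value, plus INFO and MAF filters
Target file: options for the target file; PLINK binary and BGEN formats are supported, and you can define phenotypes, MAF quality control, and so on
Dosage: options for dosage (imputed genotype) data, such as the dosage and hard-call thresholds (can be omitted)
Clumping: clumping parameters, all of which have defaults (can be omitted)
Covariates: covariate definitions, covering numeric and categorical (factor) covariates
P-value Thresholding: definition of the p-value thresholds
PRSet: options for gene-set (pathway) analysis
Misc: miscellaneous options, including run resources such as threads (--thread) and memory (--memory), and the output prefix (--out)

Parameter details: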

$ ./PRSice_linux
usage: PRSice [options] <-b base_file> <-t target_file>

Base File:
    --a1                    Column header containing allele 1 (effective allele)
                            Default: A1
    --a2                    Column header containing allele 2 (non-effective allele)
                            Default: A2
    --base          | -b    Base association file
    --base-info             Base INFO score filtering. Format should be
                            <Column name>:<Threshold>. SNPs with info
                            score less than <Threshold> will be ignored
                            Column name default: INFO
                            Threshold default: 0.9
    --base-maf              Base MAF filtering. Format should be
                            <Column name>:<Threshold>. SNPs with maf
                            less than <Threshold> will be ignored. An
                            additional column can also be added (e.g.
                            also filter MAF for cases), using the
                            following format:
                            <Column name>:<Threshold>,<Column name>:<Threshold>
    --beta                  Whether the test statistic is in the form of
                            BETA or OR. If set, test statistic is assume
                            to be in the form of BETA. Mutually exclusive
                            from --or
    --bp                    Column header containing the SNP coordinate
                            Default: BP
    --chr                   Column header containing the chromosome
                            Default: CHR
    --index                 If set, assume the INDEX instead of NAME for
                            the corresponding columns are provided. Index
                            should be 0-based (start counting from 0)
    --no-default            Remove all default options. If set, PRSice
                            will not set any default column name and you
                            must manually provide all required columns
                            (--snp, --stat, --A1, --pvalue)
    --or                    Whether the test statistic is in the form of
                            BETA or OR. If set, test statistic is assume
                            to be in the form of OR. Mutually exclusive
                            from --beta
    --pvalue        | -p    Column header containing the p-value
                            Default: P
    --snp                   Column header containing the SNP ID
                            Default: SNP
    --stat                  Column header containing the summary statistic
                            If --beta is set, default as BETA. Otherwise,
                            will search for OR or BETA from the header
                            of the base file

Target File:
    --binary-target         Indicate whether the target phenotype
                            is binary or not. Either T or F should be
                            provided where T represent a binary phenotype.
                            For multiple phenotypes, the input should be
                            separated by comma without space.
                            Default: T if --beta and F if --beta is not
    --geno                  Filter SNPs based on gentype missingness
    --info                  Filter SNPs based on info score. Only used
                            for imputed target
    --keep                  File containing the sample(s) to be extracted from
                            the target file. First column should be FID and
                            the second column should be IID. If --ignore-fid is
                            set, first column should be IID
                            Mutually exclusive from --remove
    --maf                   Filter SNPs based on minor allele frequency (MAF)
    --nonfounders           Keep the nonfounders in the analysis
                            Note: They will still be excluded from LD calculation
    --pheno         | -f    Phenotype file containing the phenotype(s).
                            First column must be FID of the samples and
                            the second column must be IID of the samples.
                            When --ignore-fid is set, first column must
                            be the IID of the samples.
                            Must contain a header if --pheno-col is
                            specified
    --pheno-col     | -F    Headers of phenotypes to be included from the
                            phenotype file
    --prevalence    | -k    Prevalence of all binary trait. If provided
                            will adjust the ascertainment bias of the R2.
                            Note that when multiple binary trait is found,
                            prevalence information must be provided for
                            all of them
    --remove                File containing the sample(s) to be removed from
                            the target file. First column should be FID and
                            the second column should be IID. If --ignore-fid is
                            set, first column should be IID
                            Mutually exclusive from --keep
    --target        | -t    Target genotype file. Currently support
                            both BGEN and binary PLINK format. For
                            multiple chromosome input, simply substitute
                            the chromosome number with #. PRSice will
                            automatically replace # with 1-22
                            For binary plink format, you can also specify
                            a seperate fam file by <prefix>,<fam file>
    --target-list           File containing prefix of target genotype
                            files. Similar to --target but allow more
                            flexibility. Do not support external fam file
                            at the moment
    --type                  File type of the target file. Support bed
                            (binary plink) and bgen format. Default: bed

Dosage:
    --allow-inter           Allow the generate of intermediate file. This will
                            speed up PRSice when using dosage data as clumping
                            reference and for hard coding PRS calculation
    --dose-thres            Translate any SNPs with highest genotype probability
                            less than this threshold to missing call
    --hard-thres            A hardcall is saved when the distance to the nearest
                            hardcall is less than the hardcall threshold.
                            Otherwise a missing code is saved
                            Default is: 0.1
    --hard                  Use hard coding instead of dosage for PRS construction.
                            Default is to use dosage instead of hard coding

Clumping:
    --clump-kb              The distance for clumping in kb
                            Default: 250
    --clump-r2              The R2 threshold for clumping
                            Default: 0.1 (1mb for PRSet)
    --clump-p               The p-value threshold use for clumping.
                            Default: 1
    --ld            | -L    LD reference file. Use for LD calculation. If not
                            provided, will use the post-filtered target genotype
                            for LD calculation. Support multiple chromosome input
                            Please see --target for more information
    --ld-dose-thres         Translate any SNPs with highest genotype probability
                            less than this threshold to missing call
    --ld-geno               Filter SNPs based on genotype missingness
    --ld-hard-thres         A hardcall is saved when the distance to the nearest
                            hardcall is less than the hardcall threshold.
                            Otherwise a missing code is saved
                            Default is: 0.1
    --ld-info               Filter SNPs based on info score. Only used
                            for imputed LD reference
    --ld-keep               File containing the sample(s) to be extracted from
                            the LD reference file. First column should be FID and
                            the second column should be IID. If --ignore-fid is
                            set, first column should be IID
                            Mutually exclusive from --ld-remove
                            No effect if --ld was not provided
    --ld-list               File containing prefix of LD reference files.
                            Similar to --ld but allow more
                            flexibility. Do not support external fam file
                            at the moment
    --ld-maf                Filter SNPs based on minor allele frequency
    --ld-remove             File containing the sample(s) to be removed from
                            the LD reference file. First column should be FID and
                            the second column should be IID. If --ignore-fid is
                            set, first column should be IID
                            Mutually exclusive from --ld-keep
    --ld-type               File type of the LD file. Support bed (binary plink)
                            and bgen format. Default: bed
    --no-clump              Stop PRSice from performing clumping
    --proxy                 Proxy threshold for index SNP to be considered
                            as part of the region represented by the clumped
                            SNP(s). e.g. --proxy 0.8 means the index SNP will
                            represent region of any clumped SNP(s) that has a
                            R2>=0.8 even if the index SNP does not physically
                            locate within the region

Covariate:
    --cov           | -C    Covariate file. First column should be FID and
                            the second column should be IID. If --ignore-fid
                            is set, first column should be IID
    --cov-col       | -c    Header of covariates. If not provided, will use
                            all variables in the covariate file. By adding
                            @ in front of the string, any numbers within [
                            and ] will be parsed. E.g. @PC[1-3] will be
                            read as PC1,PC2,PC3. Discontinuous input are also
                            supported: @cov[1.3-5] will be parsed as
                            cov1,cov3,cov4,cov5
    --cov-factor            Header of categorical covariate(s). Dummy variable
                            will be automatically generated. Any items in
                            --cov-factor must also be found in --cov-col
                            Also accept continuous input (start with @).

P-value Thresholding:
    --bar-levels            Level of barchart to be plotted. When --fastscore
                            is set, PRSice will only calculate the PRS for
                            threshold within the bar level. Levels should be
                            comma separated without space
    --fastscore             Only calculate threshold stated in --bar-levels
    --no-full               By default, PRSice will include the full model,
                            i.e. p-value threshold = 1. Setting this flag will
                            disable that behaviour
    --interval      | -i    The step size of the threshold. Default: 5e-05
    --lower         | -l    The starting p-value threshold. Default: 5e-08
    --model                 Genetic model use for regression. The genetic
                            encoding is based on the base data where the
                            encoding represent number of the coding allele
                            Available models include:
                            add - Additive model, code as 0/1/2 (default)
                            dom - Dominant model, code as 0/1/1
                            rec - Recessive model, code as 0/0/1
                            het - Heterozygous only model, code as 0/1/0
    --missing               Method to handle missing genotypes. By default,
                            final scores are averages of valid per-allele
                            scores with missing genotypes contribute an amount
                            proportional to imputed allele frequency. To throw
                            out missing observations instead (decreasing the
                            denominator in the final average when this happens),
                            use the 'SET_ZERO' modifier. Alternatively,
                            you can use the 'CENTER' modifier to shift all scores
                            to mean zero.
    --no-regress            Do not perform the regression analysis and simply
                            output all PRS.
    --score                 Method to calculate the polygenic score.
                            Available methods include:
                            avg     - Take the average effect size (default)
                            std     - Standardize the effect size
                            con-std - Standardize the effect size using mean
                                      and sd derived from control samples
                            sum     - Direct summation of the effect size
    --upper         | -u    The final p-value threshold. Default: 0.5

PRSet:
    --background            String to indicate a background file. This string
                            should have the format of Name:Type where type can be
                            bed   - 0-based range with 3 column. Chr Start End
                            range - 1-based range with 3 column. Chr Start End
                            gene  - A file contain a column of gene name
    --bed           | -B    Bed file containing the selected regions.
                            Name of bed file will be used as the region
                            identifier. WARNING: Bed file is 0-based
    --feature               Feature(s) to be included from the gtf file.
                            Default: exon,CDS,gene,protein_coding.
    --full-back             Use the whole genome as background for competitive
                            p-value calculation
    --gtf           | -g    GTF file containing gene boundaries. Required
                            when --msigdb is used
    --msigdb        | -m    MSIGDB file containing the pathway information.
                            Require the gtf file
    --snp-set               Provide a SNP set file containing the snp set(s).
                            Two different file format is allowed:
                            SNP list format - A file containing a single
                                              column of SNP ID. Name of the
                                              set will be the file name or
                                              can be provided using
                                              --snp-set File:Name
                            MSigDB format   - Each row represent a single SNP
                                              set with the first column
                                              containing the name of the SNP
                                              set.
    --wind-3                Add N base(s) to the 3' region of each feature(s)
    --wind-5                Add N base(s) to the 5' region of each feature(s)

Misc:
    --all-score             Output PRS for ALL threshold. WARNING: This
                            will generate a huge file
    --chr-id                Try to construct an RS ID for SNP based on its
                            chromosome, coordinate, effective allele and
                            non-effective allele.
                            e.g. c:L-aBd is translated to:
                            <chr>:<coordinate>-<effective><noneffective>d
                            This is always true for target file, whereas for
                            base file, this is only used if the RS ID
                            wasn't provided
    --exclude               File contains SNPs to be excluded from the
                            analysis
    --extract               File contains SNPs to be included in the
                            analysis
    --id-delim              This parameter causes sample IDs to be parsed as
                            <FID><delimiter><IID>; the default delimiter
                            is '_'.
    --ignore-fid            Ignore FID for all input. When this is set,
                            first column of all file will be assume to
                            be IID instead of FID
    --keep-ambig            Keep ambiguous SNPs. Only use this option
                            if you are certain that the base and target
                            has the same A1 and A2 alleles
    --logit-perm            When performing permutation, still use logistic
                            regression instead of linear regression. This
                            will substantially slow down PRSice
    --memory                Maximum memory usage allowed (in Mb). PRSice will try
                            its best to honor this setting
    --non-cumulate          Calculate non-cumulative PRS. PRS will be reset
                            to 0 for each new P-value threshold instead of
                            adding up
    --out           | -o    Prefix for all file output
    --perm                  Number of permutation to perform. This swill
                            generate the empirical p-value. Recommend to
                            use value larger than 10,000
    --print-snp             Print all SNPs that remains in the analysis
                            after clumping is performed. For PRSet, Y
                            indicate the SNPs falls within the gene set
                            of interest and N otherwise. If only PRSice
                            is performed, a single "gene set" called
                            "Base" will be presented with all entries
                            marked as Y
    --seed          | -s    Seed used for permutation. If not provided,
                            system time will be used as seed. When same
                            seed and same input is provided, same result
                            can be generated
    --thread        | -n    Number of thread use
    --use-ref-maf           When specified, missingness imputation will be
                            performed based on the reference samples
    --ultra                 Ultra aggressive memory usage. When this is enabled
                            PRSice and PRSet will try to load all genotypes into
                            memory after clumping is performed. This should
                            drastically speed up PRSice and PRSet at the expense
                            of higher memory consumption.
                            Has no effect for dosage score
    --x-range               Range of SNPs to be excluded from the whole
                            analysis. It can either be a single bed file
                            or a comma seperated list of range. Range must
                            be in the format of chr:start-end or chr:coordinate
    --help          | -h    Display this help message

Please provide the required parameters
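
The P-value Thresholding defaults in the help text (--lower 5e-08, --interval 5e-05, --upper 0.5, plus the full model at threshold 1 unless --no-full is set) suggest a regular grid of thresholds. The Python sketch below builds such a grid. The function `threshold_grid` is hypothetical and only illustrates the arithmetic; how PRSice actually bins SNPs internally is not shown in this tutorial. Thresholds that appear in the results file in section 6.1, such as 0.00025005 and 0.00030005, do fall on this grid.

```python
def threshold_grid(lower=5e-08, interval=5e-05, upper=0.5, include_full=True):
    """Return thresholds lower, lower + interval, ... not exceeding upper.

    The defaults mirror --lower, --interval and --upper from the help text.
    include_full appends the full model (threshold 1), i.e. no --no-full.
    """
    # Small epsilon guards against floating-point error in the division.
    n_steps = int((upper - lower) / interval + 1e-9)
    grid = [lower + k * interval for k in range(n_steps + 1)]
    if include_full:
        grid.append(1.0)
    return grid


grid = threshold_grid()
print(len(grid))   # number of candidate thresholds
print(grid[:3])    # the first few thresholds
print(grid[5])     # compare with the first row of the results table
```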

5. Computing PRS for a binary trait

5.1 Command

Command:

Rscript PRSice.R --dir . --prsice ./PRSice_linux --base TOY_BASE_GWAS.assoc --target TOY_TARGET_DATA --thread 1 --stat OR --binary-target T

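Rscript: runs the script with R
--dir .: set to the current directory; per the PRSice documentation, this is where PRSice.R installs missing R packages such as ggplot2
--prsice: path to the PRSice executable
--base: the base data, here the GWAS results in TOY_BASE_GWAS.assoc
--target: the prefix of the PLINK binary fileset
--thread 1: use one thread
--stat OR: the column holding the summary statistic, here OR (a binary-trait GWAS reports odds ratios; a continuous-trait GWAS reports beta values)
--binary-target T: the target phenotype is binary

5.2 Run log

Log output: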
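
For scripted runs, the same command can be assembled as an argument list. The Python sketch below does so using only the flags from the command in this section. The helper name `build_prsice_command` and its parameter names are made up for illustration. Nothing is executed: the command is only printed, and running it is left to `subprocess.run`.

```python
import shlex


def build_prsice_command(base, target, stat="OR", binary_target=True,
                         threads=1, rscript="Rscript", wrapper="PRSice.R",
                         prsice="./PRSice_linux", directory="."):
    """Assemble the PRSice.R call from this section as an argument list."""
    return [
        rscript, wrapper,
        "--dir", directory,
        "--prsice", prsice,
        "--base", base,
        "--target", target,
        "--thread", str(threads),
        "--stat", stat,
        "--binary-target", "T" if binary_target else "F",
    ]


cmd = build_prsice_command("TOY_BASE_GWAS.assoc", "TOY_TARGET_DATA")
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
# To run it from the folder holding the unzipped files:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a file prefix contains spaces.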

$ Rscript PRSice.R --dir . --prsice ./PRSice_linux --base TOY_BASE_GWAS.assoc --target TOY_TARGET_DATA --thread 1 --stat OR --binary-target T
PRSice 2.3.3 (2020-08-05)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2022-10-21 11:07:18
./PRSice_linux \
    --a1 A1 \
    --a2 A2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base TOY_BASE_GWAS.assoc \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --interval 5e-05 \
    --lower 5e-08 \
    --num-auto 22 \
    --or  \
    --out PRSice \
    --pvalue P \
    --seed 2668735456 \
    --snp SNP \
    --stat OR \
    --target TOY_TARGET_DATA \
    --thread 1 \
    --upper 0.5

Initializing Genotype file: TOY_TARGET_DATA (bed)

Start processing TOY_BASE_GWAS
==================================================

Base file: TOY_BASE_GWAS.assoc
Header of file is:
SNP CHR BP A1 A2 P OR

Reading 100.00%
91062 variant(s) observed in base file, with:
2226 variant(s) located on haploid chromosome
88836 total variant(s) included from base file

Loading Genotype info from target
==================================================

2000 people (1024 male(s), 976 female(s)) observed
2000 founder(s) included

Warning: Currently not support haploid chromosome and sex
chromosomes

88836 variant(s) included

There are a total of 1 phenotype to process

Start performing clumping

Clumping Progress: 100.00%
Number of variant(s) after clumping : 88836

Processing the 1 th phenotype

Phenotype is a binary phenotype
1000 control(s)
1000 case(s)

Start Processing
Processing 100.00%
There are 1 region(s) with p-value less than 1e-5. Please
note that these results are inflated due to the overfitting
inherent in finding the best-fit PRS (but it's still best
to find the best-fit PRS!).
You can use the --perm option (see manual) to calculate an
empirical P-value.

Begin plotting
Current Rscript version = 2.3.3
Plotting Bar Plot
Plotting the high resolution plot

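5.3 Output

Output files:

6. Interpreting the binary-trait results

6.1 The PRSice.prsice file

This file lists, for each p-value threshold, the number of SNPs selected (Num_SNP), the variance explained (R2), and related statistics.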

Pheno    Set Threshold   R2  P   Coefficient Standard.Error  Num_SNP
-   Base    0.00025005  0.0133696   8.43169e-06 -0.197266   0.0442903   2
-   Base    0.00030005  0.00824473  0.000456434 -0.225204   0.0642503   3
-   Base    0.00040005  0.0089725   0.000256089 -0.350267   0.0958035   5
-   Base    0.00045005  0.0101339   0.000102845 -0.445497   0.114707    6
-   Base    0.00065005  0.00532975  0.004775    -0.402003   0.142462    8
-   Base    0.00070005  0.00876654  0.00030122  -0.549246   0.151967    9
-   Base    0.00080005  0.00233607  0.061455    -0.369219   0.197422    13
-   Base    0.00085005  0.00153157  0.129826    -0.342923   0.226384    15
-   Base    0.00095005  0.000124324 0.665873    -0.100725   0.233258    16
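
A table like this is easy to inspect programmatically. The Python sketch below parses a few of the rows shown above and picks the threshold with the largest R2 among them. The helper `read_prsice_table` is illustrative, and because only an excerpt is used, the result is not the overall best-fit threshold reported in PRSice.summary.

```python
# Excerpt of PRSice.prsice, copied from the rows shown above.
SAMPLE = """\
Pheno Set Threshold R2 P Coefficient Standard.Error Num_SNP
- Base 0.00025005 0.0133696 8.43169e-06 -0.197266 0.0442903 2
- Base 0.00030005 0.00824473 0.000456434 -0.225204 0.0642503 3
- Base 0.00040005 0.0089725 0.000256089 -0.350267 0.0958035 5
- Base 0.00045005 0.0101339 0.000102845 -0.445497 0.114707 6
"""

FLOAT_COLUMNS = ("Threshold", "R2", "P", "Coefficient", "Standard.Error")


def read_prsice_table(text):
    """Parse a whitespace-delimited .prsice table into a list of dicts."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        rec = dict(zip(header, line.split()))
        for col in FLOAT_COLUMNS:
            rec[col] = float(rec[col])
        rec["Num_SNP"] = int(rec["Num_SNP"])
        rows.append(rec)
    return rows


rows = read_prsice_table(SAMPLE)
# Highest R2 among the excerpted rows only.
best = max(rows, key=lambda r: r["R2"])
print(best["Threshold"], best["R2"], best["Num_SNP"])
```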

6.2 The PRSice.best file

This file gives the PRS computed for each individual at the best-fit threshold.

FID IID In_Regression PRS
CAS_1 CAS_1 Yes -0.00599501328
CAS_2 CAS_2 Yes -0.00631017938
CAS_3 CAS_3 Yes -0.00227495325
CAS_4 CAS_4 Yes -0.00204360007
CAS_5 CAS_5 Yes -0.000830676955
CAS_6 CAS_6 Yes -0.00224943517
CAS_7 CAS_7 Yes -0.000687589983
CAS_8 CAS_8 Yes -0.00413102565
CAS_9 CAS_9 Yes 0.00256661049

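6.3 The PRSice.summary file

This file reports the best-fit model: the p-value threshold, the number of SNPs included, R2, the regression coefficient, the P-value, and so on.

$ head PRSice.summary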
Phenotype   Set Threshold   PRS.R2  Full.R2 Null.R2 Prevalence  Coefficient Standard.Error  P   Num_SNP
-   Base    0.4463  0.0520082   0.0520082   0   -   86.288  9.96331 4.69368e-18 36759

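6.4 PRSice_BARPLOT_*.png

The bar plot is the most widely used figure. The x-axis shows the p-value thresholds and the y-axis shows the proportion of phenotypic variance explained by the PRS (R2); the tallest bar marks the best model. In the figure, the model is best at a p-value threshold of 0.4463, where the PRS explains about 5% of the variance with P = 4.7e-18, which is highly significant.

6.5 PRSice_HIGH-RES_PLOT_*.png

In this plot the x-axis is the p-value threshold and the y-axis is the significance of the model fit (-log10 of the P-value). The most significant threshold lies at about 0.45 (0.4463 in PRSice.summary).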

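7. Computing PRS for a continuous trait

7.1 Command

Note: the base data here contains OR values; the run below forces the BETA setting purely for demonstration: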

Rscript PRSice.R --dir . --prsice ./PRSice_linux --base TOY_BASE_GWAS.assoc --target TOY_TARGET_DATA --thread 1  --beta --binary-target T

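--beta: declares the base statistic to be a BETA (a regression coefficient, as reported for continuous traits) rather than an OR. Note that the command still passes --binary-target T, so the target phenotype is analysed as binary, as the log below confirms; per the help text, a continuous target phenotype corresponds to --binary-target F.

7.2 Run log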

$ Rscript PRSice.R --dir . --prsice ./PRSice_linux --base TOY_BASE_GWAS.assoc --target TOY_TARGET_DATA --thread 1  --beta --binary-target T
PRSice 2.3.3 (2020-08-05)
https://github.com/choishingwan/PRSice
(C) 2016-2020 Shing Wan (Sam) Choi and Paul F. O'Reilly
GNU General Public License v3
If you use PRSice in any published work, please cite:
Choi SW, O'Reilly PF.
PRSice-2: Polygenic Risk Score Software for Biobank-Scale Data.
GigaScience 8, no. 7 (July 1, 2019)
2022-10-21 11:45:37
./PRSice_linux \
    --a1 A1 \
    --a2 A2 \
    --bar-levels 0.001,0.05,0.1,0.2,0.3,0.4,0.5,1 \
    --base TOY_BASE_GWAS.assoc \
    --beta  \
    --binary-target T \
    --bp BP \
    --chr CHR \
    --clump-kb 250kb \
    --clump-p 1.000000 \
    --clump-r2 0.100000 \
    --interval 5e-05 \
    --lower 5e-08 \
    --num-auto 22 \
    --out PRSice \
    --pvalue P \
    --seed 752466145 \
    --snp SNP \
    --stat OR \
    --target TOY_TARGET_DATA \
    --thread 1 \
    --upper 0.5

Initializing Genotype file: TOY_TARGET_DATA (bed)

Start processing TOY_BASE_GWAS
==================================================

Base file: TOY_BASE_GWAS.assoc
Header of file is:
SNP CHR BP A1 A2 P OR

Reading 100.00%
91062 variant(s) observed in base file, with:
2226 variant(s) located on haploid chromosome
88836 total variant(s) included from base file

Loading Genotype info from target
==================================================

2000 people (1024 male(s), 976 female(s)) observed
2000 founder(s) included

Warning: Currently not support haploid chromosome and sex
chromosomes

88836 variant(s) included

There are a total of 1 phenotype to process

Start performing clumping

Clumping Progress: 100.00%
Number of variant(s) after clumping : 88836

Processing the 1 th phenotype

Phenotype is a binary phenotype
1000 control(s)
1000 case(s)

Start Processing
Processing 100.00%
There are 1 region(s) with p-value between 0.1 and 1e-5
(may not be significant).

Begin plotting
Current Rscript version = 2.3.3
Plotting Bar Plot
Plotting the high resolution plot

7.3 Output

8. Continuous-trait results

The output has essentially the same layout as for the binary trait.

8.1 The PRSice.prsice file