This function calculates the polygenic risk score, which is the total of allele counts (genotypes) weighted by estimated effect sizes from genome-wide association studies. It uses C+T filtering techniques. The users could perform clumping procedure choromosome-wise and genome-wide. Also, the function offers the choice of including several genetic principal components along with other covariates. Using this function, users have the freedom to experiment with various clumping and thresholding arrangements to test a wide range of various parameter values.
Usage
ComputePRS(
DataDir,
ResultDir = tempdir(),
finput,
summarystat,
phenofile,
covarfile = NULL,
effectsize = c("BETA", "OR"),
ldclump = FALSE,
LDreference,
clump_p1,
clump_p2,
clump_r2,
clump_kb,
byCHR = TRUE,
pthreshold = c(0.001, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
highLD_regions,
ld_prunning = FALSE,
window_size = 50,
step_size = 5,
r2_threshold = 0.02,
nPC = 6,
pheno_type = "binary"
)
Arguments
- DataDir
A character string for the file path of the all the input files.
- ResultDir
A character string for the file path where all output files will be stored. The default is tempdir().
- finput
Character string, specifying the prefix of the input PLINK binary files for the genotype data i.e., the target data based on which clumping procedure will be performed. This file needs to be in DataDir. If your target data are small (e.g. N < 500) then you can use the 1000 Genomes Project samples. Make sure to use the population that most closely reflects represents the base sample.
- summarystat
A dataframe object with GWAS summary statistics.
The mandatory column headers in this dataframe are:
CHR
(Chromosome code)BP
(Basepair position)A1
(effect allele)SNP
(i.e., SNP identifier)BETA
orOR
(i.e., effect-size or logarithm of odds ratio)P
(i.e., p-values).
Special Notes: The first three columns needed to be
SNP
,A1
andBETA
orOR
.- phenofile
A character string, specifying the name of the mandatory phenotype file. This is a plain text file with no header line; columns family ID, individual ID and phenotype columns. For binary trait, the phenotypic value should be coded as 0 or 1, then it will be recognized as a case-control study (0 for controls and 1 for cases). Missing value should be represented by "-9" or "NA". The interested phenotype column should be labeled as "Pheno1". This file needs to be in
DataDir
.- covarfile
A character string, specifying the name of the covariate file which is a plain .text file with no header line; columns are family ID, individual ID and the covariates. The default is
NULL
. This file needs to be inDataDir
.- effectsize
Boolean value,
BETA
orOR
, specifying the type of the GWAS effectsize. The default isBETA
.- ldclump
Boolean value,
TRUE
orFALSE
, specifying whether to perform clumping or not.- LDreference
A character string, specifying the prefix of the PLINK files of the population reference panel of the same ancestry, and ideally the one that was used for imputing your target dataset. These files should be in
DataDir
.- clump_p1
Numeric value, specifying the significance threshold for index SNPs if
ldclump
was set to beTRUE
. The default is 0.0001.- clump_p2
Numeric value, specifying the secondary significance threshold for clumped SNPs if
ldclump
was set to beTRUE
. The default is 0.01- clump_r2
Numeric value, specifying the linkage disequilibrium (LD) threshold for clumping if
ldclump
was set to beTRUE
. The default is 0.50.- clump_kb
Integer value, specifying the physical distance threshold in base-pair for clumping if
ldclump
was set to beTRUE
. The default is 250.- byCHR
Boolean value, 'TRUE' or 'FALSE', specifying chromosome-wise clumping procedure if
ldclump
was set to beTRUE
. The default isTRUE
- pthreshold
Numeric vector, containing several p value thresholds to maximize predictive ability of the derived polygenic scores.
- highLD_regions
Character string, specifying the .txt file name with known genomic regions with high LD. The default is
NULL
.- ld_prunning
Boolean value,
TRUE
orFALSE
for LD-based filtering for computing genetic PC as covariates.- window_size
Integer value, specifying a window size in variant count or kilobase for LD-based filtering in computing genetic PC. The default is 50.
- step_size
Integer value, specifying a variant count to shift the window at the end of each step for LD filtering in computing genetic PCs. The default is 5.
- r2_threshold
Numeric value between 0 to 1 of pairwise \(r^2\) threshold for LD-based filtering in computing genetic PCs. The default is 0.02.
- nPC
Positive integer value, specifying the number of genetic PCs to be included as predictor in the PRS model fit. The default is 6.
- pheno_type
Boolean value, ‘binary’ or ‘quantitative’, specifying the type of the trait. The default is ‘binary’.
Value
A list object containing a dataframe a numeric value, a GeneticPC plot (if requested), and a PRS plot. The dataframe,PRS, contains four mandatory columns, such as, IID (i.e., Individual ID), FID (i.e., Family ID), Pheno1 (i.e., the trait for PRS) and Score (i.e., the best PRS). Other columns of covariates could be there. The numeric value, BestP contains the threshold of of the best p-value for the best pRS model fit.
Also, the function produces several plots such as p-value thresholds vs PRS model fit and PRS distribution among male and females. For case-control data, it shows PRS distribution among cases and controls and ROC curves as well.
Examples
data("Summary_Stat_Ex1", package = "GXwasR")
data("Example_phenofile", package = "GXwasR")
data("Example_covarfile", package = "GXwasR")
data("Example_pthresoldfile", package = "GXwasR")
data("highLD_hg19", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
summarystat <- Summary_Stat_Ex1[, c(2, 4, 7, 1, 3, 12)]
phenofile <- Example_phenofile # Cannot be NULL
# The interested phenotype column should be labeled as "Pheno1".
covarfile <- Example_covarfile
clump_p1 <- 0.0001
clump_p2 <- 0.0001
clump_kb <- 500
clump_r2 <- 0.5
byCHR <- TRUE
pthreshold <- Example_pthresoldfile$Threshold
ld_prunning <- TRUE
highLD_regions <- highLD_hg19
window_size <- 50
step_size <- 5
r2_threshold <- 0.02
nPC <- 6 # We can incorporate PCs into our PRS analysis to account for population stratification.
pheno_type <- "binary"
PRSresult <- ComputePRS(DataDir, ResultDir, finput, summarystat, phenofile, covarfile,
effectsize = "BETA", LDreference = "GXwasR_example", ldclump = FALSE, clump_p1, clump_p2,
clump_r2, clump_kb, byCHR = TRUE, pthreshold = pthreshold, highLD_regions = highLD_regions,
ld_prunning = TRUE, window_size = 50, step_size = 5, r2_threshold = 0.02, nPC = 6,
pheno_type = "binary"
)
#> ℹ 0.001
#> • Computing PRS for threshold 0.001
#> ℹ 0.05
#> • Computing PRS for threshold 0.05
#> ℹ 0.1
#> • Computing PRS for threshold 0.1
#> ℹ 0.2
#> • Computing PRS for threshold 0.2
#> ℹ 0.3
#> • Computing PRS for threshold 0.3
#> ℹ 0.4
#> • Computing PRS for threshold 0.4
#> ℹ 0.5
#> • Computing PRS for threshold 0.5
## This table shows 10 samples with phenotype, covariates and a PRS column.
PRS <- PRSresult$PRS
PRS[seq_len(10), ]
#> FID IID Pheno1 AGE testcovar PC1 PC2 PC3
#> 1 EUR_FIN HG00171 1 36 1 0.0964618 -0.03825140 0.00903796
#> 2 EUR_FIN HG00173 1 81 1 0.0743932 -0.05672710 0.03969940
#> 3 EUR_FIN HG00174 1 83 1 0.0645349 -0.05610350 -0.01457860
#> 4 EUR_FIN HG00176 2 75 0 0.0832472 -0.05602080 0.03026540
#> 5 EUR_FIN HG00177 1 88 1 0.0775687 -0.00946444 0.04517850
#> 6 EUR_FIN HG00178 1 24 1 0.0661714 -0.05049060 0.02485390
#> 7 EUR_FIN HG00179 2 78 1 0.0806110 -0.01154090 -0.00750723
#> 8 EUR_FIN HG00180 1 39 1 0.0840638 0.00117830 0.01286460
#> 9 EUR_FIN HG00182 1 50 1 0.1029760 -0.04300460 0.01999710
#> 10 EUR_FIN HG00183 1 58 0 0.0827103 -0.03865300 0.05093660
#> PC4 PC5 PC6 SCORE
#> 1 -0.08952930 -0.02522910 0.01465430 0.04842120
#> 2 0.02053430 0.02304100 -0.03603800 0.02937420
#> 3 -0.02643430 -0.00180949 0.01445210 0.00996742
#> 4 -0.00110770 -0.05303710 -0.03358030 0.10420600
#> 5 -0.03732150 -0.06174310 -0.07199510 0.03143030
#> 6 -0.08280660 0.04094180 0.00559525 0.07226890
#> 7 -0.03288230 0.03559730 0.04245120 0.11652400
#> 8 0.05915400 0.05476670 -0.01015950 0.06171590
#> 9 -0.03023650 -0.09019110 -0.03785150 0.02837030
#> 10 -0.00153775 -0.10456100 0.06236340 0.03862970
## The best threshold
BestPvalue <- PRSresult$BestP$Threshold
BestPvalue
#> [1] 0.05