ComputePRS: Computing polygenic risk score (PRS)

This function calculates the polygenic risk score, which is the total of allele counts (genotypes) weighted by estimated effect sizes from genome-wide association studies. It uses C+T filtering techniques. The users could perform clumping procedure choromosome-wise and genome-wide. Also, the function offers the choice of including several genetic principal components along with other covariates. Using this function, users have the freedom to experiment with various clumping and thresholding arrangements to test a wide range of various parameter values.

Usage

ComputePRS(
  DataDir,
  ResultDir = tempdir(),
  finput,
  summarystat,
  phenofile,
  covarfile = NULL,
  effectsize = c("BETA", "OR"),
  ldclump = FALSE,
  LDreference,
  clump_p1,
  clump_p2,
  clump_r2,
  clump_kb,
  byCHR = TRUE,
  pthreshold = c(0.001, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
  highLD_regions,
  ld_prunning = FALSE,
  window_size = 50,
  step_size = 5,
  r2_threshold = 0.02,
  nPC = 6,
  pheno_type = "binary"
)

Arguments

DataDir

A character string for the file path of the all the input files.

ResultDir

A character string for the file path where all output files will be stored. The default is tempdir().

finput

Character string, specifying the prefix of the input PLINK binary files for the genotype data i.e., the target data based on which clumping procedure will be performed. This file needs to be in DataDir. If your target data are small (e.g. N < 500) then you can use the 1000 Genomes Project samples. Make sure to use the population that most closely reflects represents the base sample.

summarystat

A dataframe object with GWAS summary statistics.

The mandatory column headers in this dataframe are:

CHR(Chromosome code)
BP(Basepair position)
A1 (effect allele)
SNP (i.e., SNP identifier)
BETA or OR (i.e., effect-size or logarithm of odds ratio)
P (i.e., p-values).

Special Notes: The first three columns needed to be SNP, A1 and BETA or OR.

phenofile

A character string, specifying the name of the mandatory phenotype file. This is a plain text file with no header line; columns family ID, individual ID and phenotype columns. For binary trait, the phenotypic value should be coded as 0 or 1, then it will be recognized as a case-control study (0 for controls and 1 for cases). Missing value should be represented by "-9" or "NA". The interested phenotype column should be labeled as "Pheno1". This file needs to be in DataDir.

covarfile

A character string, specifying the name of the covariate file which is a plain .text file with no header line; columns are family ID, individual ID and the covariates. The default is NULL. This file needs to be in DataDir.

effectsize

Boolean value, BETA or OR, specifying the type of the GWAS effectsize. The default is BETA.

ldclump

Boolean value, TRUE or FALSE, specifying whether to perform clumping or not.

LDreference

A character string, specifying the prefix of the PLINK files of the population reference panel of the same ancestry, and ideally the one that was used for imputing your target dataset. These files should be in DataDir.

clump_p1

Numeric value, specifying the significance threshold for index SNPs if ldclump was set to be TRUE. The default is 0.0001.

clump_p2

Numeric value, specifying the secondary significance threshold for clumped SNPs if ldclump was set to be TRUE. The default is 0.01

clump_r2

Numeric value, specifying the linkage disequilibrium (LD) threshold for clumping if ldclump was set to be TRUE. The default is 0.50.

clump_kb

Integer value, specifying the physical distance threshold in base-pair for clumping if ldclump was set to be TRUE. The default is 250.

byCHR

Boolean value, 'TRUE' or 'FALSE', specifying chromosome-wise clumping procedure if ldclump was set to be TRUE. The default is TRUE

pthreshold

Numeric vector, containing several p value thresholds to maximize predictive ability of the derived polygenic scores.

highLD_regions

Character string, specifying the .txt file name with known genomic regions with high LD. The default is NULL.

ld_prunning

Boolean value, TRUE or FALSE for LD-based filtering for computing genetic PC as covariates.

window_size

Integer value, specifying a window size in variant count or kilobase for LD-based filtering in computing genetic PC. The default is 50.

step_size

Integer value, specifying a variant count to shift the window at the end of each step for LD filtering in computing genetic PCs. The default is 5.

r2_threshold

Numeric value between 0 to 1 of pairwise \(r^2\) threshold for LD-based filtering in computing genetic PCs. The default is 0.02.

nPC

Positive integer value, specifying the number of genetic PCs to be included as predictor in the PRS model fit. The default is 6.

pheno_type

Boolean value, ‘binary’ or ‘quantitative’, specifying the type of the trait. The default is ‘binary’.

Value

A list object containing a dataframe a numeric value, a GeneticPC plot (if requested), and a PRS plot. The dataframe,PRS, contains four mandatory columns, such as, IID (i.e., Individual ID), FID (i.e., Family ID), Pheno1 (i.e., the trait for PRS) and Score (i.e., the best PRS). Other columns of covariates could be there. The numeric value, BestP contains the threshold of of the best p-value for the best pRS model fit.

Also, the function produces several plots such as p-value thresholds vs PRS model fit and PRS distribution among male and females. For case-control data, it shows PRS distribution among cases and controls and ROC curves as well.

Author

Banabithi Bose

Examples

data("Summary_Stat_Ex1", package = "GXwasR")
data("Example_phenofile", package = "GXwasR")
data("Example_covarfile", package = "GXwasR")
data("Example_pthresoldfile", package = "GXwasR")
data("highLD_hg19", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
summarystat <- Summary_Stat_Ex1[, c(2, 4, 7, 1, 3, 12)]
phenofile <- Example_phenofile # Cannot be NULL
# The interested phenotype column should be labeled as "Pheno1".
covarfile <- Example_covarfile
clump_p1 <- 0.0001
clump_p2 <- 0.0001
clump_kb <- 500
clump_r2 <- 0.5
byCHR <- TRUE
pthreshold <- Example_pthresoldfile$Threshold
ld_prunning <- TRUE
highLD_regions <- highLD_hg19
window_size <- 50
step_size <- 5
r2_threshold <- 0.02
nPC <- 6 # We can incorporate PCs into our PRS analysis to account for population stratification.
pheno_type <- "binary"

PRSresult <- ComputePRS(DataDir, ResultDir, finput, summarystat, phenofile, covarfile,
    effectsize = "BETA", LDreference = "GXwasR_example", ldclump = FALSE, clump_p1, clump_p2,
    clump_r2, clump_kb, byCHR = TRUE, pthreshold = pthreshold, highLD_regions = highLD_regions,
    ld_prunning = TRUE, window_size = 50, step_size = 5, r2_threshold = 0.02, nPC = 6,
    pheno_type = "binary"
)
#> ℹ 0.001
#> • Computing PRS for threshold 0.001
#> ℹ 0.05
#> • Computing PRS for threshold 0.05
#> ℹ 0.1
#> • Computing PRS for threshold 0.1
#> ℹ 0.2
#> • Computing PRS for threshold 0.2
#> ℹ 0.3
#> • Computing PRS for threshold 0.3
#> ℹ 0.4
#> • Computing PRS for threshold 0.4
#> ℹ 0.5
#> • Computing PRS for threshold 0.5

## This table shows 10 samples with phenotype, covariates and a PRS column.
PRS <- PRSresult$PRS
PRS[seq_len(10), ]
#>        FID     IID Pheno1 AGE testcovar       PC1         PC2         PC3
#> 1  EUR_FIN HG00171      1  36         1 0.0964618 -0.03825140  0.00903796
#> 2  EUR_FIN HG00173      1  81         1 0.0743932 -0.05672710  0.03969940
#> 3  EUR_FIN HG00174      1  83         1 0.0645349 -0.05610350 -0.01457860
#> 4  EUR_FIN HG00176      2  75         0 0.0832472 -0.05602080  0.03026540
#> 5  EUR_FIN HG00177      1  88         1 0.0775687 -0.00946444  0.04517850
#> 6  EUR_FIN HG00178      1  24         1 0.0661714 -0.05049060  0.02485390
#> 7  EUR_FIN HG00179      2  78         1 0.0806110 -0.01154090 -0.00750723
#> 8  EUR_FIN HG00180      1  39         1 0.0840638  0.00117830  0.01286460
#> 9  EUR_FIN HG00182      1  50         1 0.1029760 -0.04300460  0.01999710
#> 10 EUR_FIN HG00183      1  58         0 0.0827103 -0.03865300  0.05093660
#>            PC4         PC5         PC6      SCORE
#> 1  -0.08952930 -0.02522910  0.01465430 0.04842120
#> 2   0.02053430  0.02304100 -0.03603800 0.02937420
#> 3  -0.02643430 -0.00180949  0.01445210 0.00996742
#> 4  -0.00110770 -0.05303710 -0.03358030 0.10420600
#> 5  -0.03732150 -0.06174310 -0.07199510 0.03143030
#> 6  -0.08280660  0.04094180  0.00559525 0.07226890
#> 7  -0.03288230  0.03559730  0.04245120 0.11652400
#> 8   0.05915400  0.05476670 -0.01015950 0.06171590
#> 9  -0.03023650 -0.09019110 -0.03785150 0.02837030
#> 10 -0.00153775 -0.10456100  0.06236340 0.03862970
## The best threshold
BestPvalue <- PRSresult$BestP$Threshold
BestPvalue
#> [1] 0.05