ComputePGS: Computing polygenic score (PGS)

This function calculates the polygenic risk score, which summarizes the estimated effect of many genetic variants on an individual’s phenotype. It is calculated as the sum of the allele counts (genotypes), each weighted by their estimated phenotypic effect sizes from genome-wide association studies. It uses C+T filtering techniques. Users can perform the clumping procedure chromosome-wise or genome-wide. Also, the function offers the choice of including genetic principal components and other covariates. Using this function, users have freedom to experiment with various clumping and thresholding arrangements to test a wide range of various parameter values.

Usage

ComputePGS(
  DataDir,
  ResultDir = tempdir(),
  finput,
  summarystat,
  phenofile,
  covarfile = NULL,
  effectsize = c("BETA", "OR"),
  ldclump = FALSE,
  LDreference,
  clump_p1,
  clump_p2,
  clump_r2,
  clump_kb,
  byCHR = TRUE,
  pthreshold = c(0.001, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5),
  highLD_regions,
  ld_prunning = FALSE,
  window_size = 50,
  step_size = 5,
  r2_threshold = 0.02,
  nPC = 6,
  pheno_type = "binary"
)

Arguments

DataDir

A character string specifying file path of the all the input files.

ResultDir

A character string specifying file path where all output files will be stored. The default is tempdir().

finput

Character string, specifying the prefix of the input PLINK binary files containing the genotype data i.e., the target data based on which the clumping procedure will be performed. This file needs to be in DataDir. If the sample size of the target dataset is small (e.g., N < 500 individuals) then users can utilize the 1000 Genomes Project samples, ensuring the use of the population that most closely represents the target sample.

summarystat

A dataframe object containing GWAS summary statistics.

The mandatory column headers in this dataframe are:

CHR(Chromosome code)
BP(Basepair position)
A1 (effect allele)
SNP (i.e., SNP identifier)
BETA or OR (i.e., effect-size or logarithm of odds ratio)
P (i.e., p-values).

Special Notes: The first three columns needed to be SNP, A1 and BETA or OR.

phenofile

A character string, specifying the name of the mandatory phenotype file. This is a plain text file with no header line; columns are: family ID, individual ID, and phenotype. For a binary trait, the phenotypic value should be coded as 0 or 1, then it will be recognized as a case-control study (0 for controls and 1 for cases). The missing value should be represented by "-9" or "NA". The desired phenotype column should be labeled as "Pheno1". This file needs to be in 'DataDir'.

covarfile

A character string, specifying the name of the covariate file which is a plain .text file with no header line; columns are family ID, individual ID, and the covariates. The default is 'NULL'. This file needs to be in 'DataDir'.

effectsize

Boolean value, 'BETA' or 'OR', specifying the type of the GWAS effect size. The default is 'BETA'.

ldclump

Boolean value, TRUE or FALSE, specifying whether to perform clumping or not.

LDreference

A character string, specifying the prefix of the PLINK files of the genetic similarity reference panel, (ideally the same that was used to impute the target dataset). These files should be in 'DataDir'.

clump_p1

Numeric value, specifying the significance threshold for index SNPs if ldclump was set to be TRUE. The default is 0.0001.

clump_p2

Numeric value, specifying the secondary significance threshold for clumped SNPs if ldclump was set to be TRUE. The default is 0.01

clump_r2

Numeric value, specifying the linkage disequilibrium (LD) threshold for clumping if ldclump was set to be TRUE. The default is 0.50.

clump_kb

Integer value, specifying the physical distance threshold in base-pair for clumping if ldclump was set to be TRUE. The default is 250.

byCHR

Boolean value, 'TRUE' or 'FALSE', specifying chromosome-wise clumping if 'ldclump' was set to be 'TRUE'. The default is 'TRUE'.

pthreshold

Numeric vector, containing several p value thresholds to maximize predictive ability of the derived polygenic scores.

highLD_regions

Character string, specifying the .txt file name with known genomic regions with high LD. The default is NULL.

ld_prunning

Boolean value, TRUE or FALSE for LD-based filtering for computing genetic PC as covariates.

window_size

Integer value, specifying a window size in variant count or kilobase for LD-based filtering in computing genetic PC. The default is 50.

step_size

Integer value, specifying a variant count to shift the window at the end of each step for LD filtering in computing genetic PCs. The default is 5.

r2_threshold

Numeric value between 0 to 1 of pairwise \(r^2\) threshold for LD-based filtering in computing genetic PCs. The default is 0.02.

nPC

Positive integer value, specifying the number of genetic PCs to be included as predictor in the PGS model fit. The default is 6.

pheno_type

Boolean value, ‘binary’ or ‘quantitative’, specifying the type of the trait. The default is ‘binary’.

Value

A list object containing a dataframe, a numeric value, a GeneticPC plot (if requested), and a PGS plot. The dataframe, PGS, contains four mandatory columns: IID (i.e., Individual ID), FID (i.e., Family ID), Pheno1 (i.e., the trait for PGS) and Score (i.e., the best PGS). Other columns of covariates could be there. The numeric value, BestP contains the threshold of the best p-value for the best PGS model fit. Also, the function produces several plots, including p-value thresholds vs PGS model fit and PGS distribution among male and females. For case-control data, it also plots the PGS distribution among cases and controls, and plots ROC curves.

Also, the function produces several plots such as p-value thresholds vs PGS model fit and PGS distribution among male and females. For case-control data, it shows PGS distribution among cases and controls and ROC curves as well.

Author

Banabithi Bose

Examples

data("Summary_Stat_Ex1", package = "GXwasR")
data("Example_phenofile", package = "GXwasR")
data("Example_covarfile", package = "GXwasR")
data("Example_pthresoldfile", package = "GXwasR")
data("highLD_hg19", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
summarystat <- Summary_Stat_Ex1[, c(2, 4, 7, 1, 3, 12)]
phenofile <- Example_phenofile # Cannot be NULL
# The interested phenotype column should be labeled as "Pheno1".
covarfile <- Example_covarfile
clump_p1 <- 0.0001
clump_p2 <- 0.0001
clump_kb <- 500
clump_r2 <- 0.5
byCHR <- TRUE
pthreshold <- Example_pthresoldfile$Threshold
ld_prunning <- TRUE
highLD_regions <- highLD_hg19
window_size <- 50
step_size <- 5
r2_threshold <- 0.02
nPC <- 6 # We can incorporate PCs into our PGS analysis to account for population stratification.
pheno_type <- "binary"

PGSresult <- ComputePGS(DataDir, ResultDir, finput, summarystat, phenofile, covarfile,
    effectsize = "BETA", LDreference = "GXwasR_example", ldclump = FALSE, clump_p1, clump_p2,
    clump_r2, clump_kb, byCHR = TRUE, pthreshold = pthreshold, highLD_regions = highLD_regions,
    ld_prunning = TRUE, window_size = 50, step_size = 5, r2_threshold = 0.02, nPC = 6,
    pheno_type = "binary"
)
#> ℹ 0.001
#> • Computing PGS for threshold 0.001
#> ℹ 0.05
#> • Computing PGS for threshold 0.05
#> ℹ 0.1
#> • Computing PGS for threshold 0.1
#> ℹ 0.2
#> • Computing PGS for threshold 0.2
#> ℹ 0.3
#> • Computing PGS for threshold 0.3
#> ℹ 0.4
#> • Computing PGS for threshold 0.4
#> ℹ 0.5
#> • Computing PGS for threshold 0.5

## This table shows 10 samples with phenotype, covariates and a PGS column.
PGS <- PGSresult$PGS
PGS[seq_len(10), ]
#>        FID     IID Pheno1 AGE testcovar       PC1         PC2         PC3
#> 1  EUR_FIN HG00171      1  36         1 0.0964618 -0.03825140  0.00903796
#> 2  EUR_FIN HG00173      1  81         1 0.0743932 -0.05672710  0.03969940
#> 3  EUR_FIN HG00174      1  83         1 0.0645349 -0.05610350 -0.01457860
#> 4  EUR_FIN HG00176      2  75         0 0.0832472 -0.05602080  0.03026540
#> 5  EUR_FIN HG00177      1  88         1 0.0775687 -0.00946444  0.04517850
#> 6  EUR_FIN HG00178      1  24         1 0.0661714 -0.05049060  0.02485390
#> 7  EUR_FIN HG00179      2  78         1 0.0806110 -0.01154090 -0.00750723
#> 8  EUR_FIN HG00180      1  39         1 0.0840638  0.00117830  0.01286460
#> 9  EUR_FIN HG00182      1  50         1 0.1029760 -0.04300460  0.01999710
#> 10 EUR_FIN HG00183      1  58         0 0.0827103 -0.03865300  0.05093660
#>            PC4         PC5         PC6      SCORE
#> 1  -0.08952930 -0.02522910  0.01465430 0.04842120
#> 2   0.02053430  0.02304100 -0.03603800 0.02937420
#> 3  -0.02643430 -0.00180949  0.01445210 0.00996742
#> 4  -0.00110770 -0.05303710 -0.03358030 0.10420600
#> 5  -0.03732150 -0.06174310 -0.07199510 0.03143030
#> 6  -0.08280660  0.04094180  0.00559525 0.07226890
#> 7  -0.03288230  0.03559730  0.04245120 0.11652400
#> 8   0.05915400  0.05476670 -0.01015950 0.06171590
#> 9  -0.03023650 -0.09019110 -0.03785150 0.02837030
#> 10 -0.00153775 -0.10456100  0.06236340 0.03862970
## The best threshold
BestPvalue <- PGSresult$BestP$Threshold
BestPvalue
#> [1] 0.05