EstimateHerit: Computing SNP heritability i.e., the proportion of phenotypic variance explained by SNPs.

This functions performs two types of heritability estimation, (i)GREML:Genomic relatedness matrix (GRM) restricted maximum likelihood-based method following GCTA (Yang et al. 2011) and (ii)LDSC: LD score regression-based method following (Bulik-Sullivan et al. 2014; Privé et al. 2020) . For the details, please follow the associated paper.

Prior to using this function, it is recommended to apply QCsnp and QCsample to ensure data quality control.

Usage

EstimateHerit(
  DataDir = NULL,
  ResultDir = tempdir(),
  finput = NULL,
  precomputedLD = NULL,
  indepSNPs = NULL,
  summarystat = NULL,
  ncores = 2,
  model = c("LDSC", "GREML"),
  computeGRM = TRUE,
  grmfile_name = NULL,
  byCHR = FALSE,
  r2_LD = 0,
  LDSC_blocks = 20,
  intercept = NULL,
  chi2_thr1 = 30,
  chi2_thr2 = Inf,
  REMLalgo = c(0, 1, 2),
  nitr = 100,
  cat_covarfile = NULL,
  quant_covarfile = NULL,
  prevalence = NULL,
  partGRM = FALSE,
  autosome = TRUE,
  Xsome = TRUE,
  nGRM = 3,
  cripticut = 0.025,
  minMAF = NULL,
  maxMAF = NULL,
  hg = c("hg19", "hg38"),
  PlotIndepSNP = TRUE,
  IndepSNP_window_size = 50,
  IndepSNP_step_size = 5,
  IndepSNP_r2_threshold = 0.02,
  highLD_regions = NULL,
  plotjpeg = TRUE,
  plotname = "Heritability_Plots"
)

Arguments

DataDir

A character string for the file path of the all the input files. The default is NULL.

ResultDir

A character string for the file path where all output files will be stored. The default is tempdir().

finput

Character string, specifying the prefix of the input PLINK binary files for the genotype data. This file needs to be in DataDir. For LDSC model, if the original genotype data is not available, Hapmap 3 or 1000Genome data can be used. If use NULL, then you need to provide precomputedLD argument. See below.

precomputedLD

A dataframe object as LD matrix with columns: CHR, SNP, BP, ld_size, MAF, ld_score. . The default is NULL.

indepSNPs

A dataframe with independent SNP ids with column name "rsid". The default is NULL.

summarystat

A dataframe object with GWAS summary statistics. The mandatory column headers in this dataframe are

chr (Chromosome code),
pos (Basepair position)
a1 (First allele code)
rsid (i.e., SNP identifier)
beta (i.e., effect-size or logarithm of odds ratio)
beta_se (i.e., standard error of beta)
P (i.e., p-values)
n_eff (i.e., effective sample size)

For case-control study, effective sample size should be \(4 / (1/<# of cases> + 1/<# of controls>)\). The default is NULL.

ncores

Integer value, specifying the number of cores to be used for running LDSC model. The default is 2.

model

Character string, specifying the heritability estimation model. There are two options, “GREML” or “LDSC”. The default is “GREML”.

Note: argument For LDSC, DataDir and finput can be NULL.

computeGRM

Boolean value, TRUE or FALSE, specifying whether to compute GRM matrices or not. The default is TRUE.

grmfile_name

A string of characters specifying the prefix of autosomal .grm.bin file. Users need to provide separate GRM files for autosomes and X chromosome in ResultDir.

The X chromosomal GRM file should have "x" added in the autosomal prefix as file name. For instance, if autosomal file is "ABC.grm.bin", then X chromosomal file should be "xABC.grm.bim".

If you are providing chromosome-wise GRMs, then the prefix should add "ChrNumber_" at the start of the prefix like, "Chr1_ABC.grm.bin". The default is NULL.

byCHR

Boolean value, TRUE or FALSE, specifying whether the analysis will be performed chromosome wise or not. The default is FALSE.

r2_LD

Numeric value, specifying the LD threshold for clumping in LDSC model. The default is 0.

LDSC_blocks

Integer value, specifying the block size for performing jackknife variance estimator in LDSC model following (Privé et al. 2020) . The default is 200.

intercept

Numeric value to constrain the intercept to some value (e.g. 1) in LDSC regression. Default is NULL.

chi2_thr1

Numeric value for threshold on chi2 in step 1 of LDSC regression. Default is 30.

chi2_thr2

Numeric value for threshold on chi2 in step 2. Default is Inf (none).

REMLalgo

Integer value of 0, 1 or 2, specifying the algorithm to run REML iterations, 0 for average information (AI), 1 for Fisher-scoring and 2 for EM. The default option is 0, i.e. AI-REML (Yang et al. 2011) .

nitr

Integer value, specifying the number of iterations for performing the REML. The default is 100.

cat_covarfile

A character string, specifying the name of the categorical covariate file which is a plain text file with no header line; columns are family ID, individual ID and discrete covariates. The default is NULL. This file needs to be in DataDir.

quant_covarfile

A character string, specifying the name of the quantitative covariate file which is a plain text file with no header line; columns are family ID, individual ID and continuous covariates. The default is NULL. This file needs to be in DataDir.

prevalence

Numeric value, specifying the disease prevalence. The default is NULL.

Note: for the continuous trait value, users should use the default.

partGRM

Boolean value, TRUE or FALSE, specifying whether the GRM will be partitioned into n parts (by row) in GREML model. The default is FALSE.

autosome

Boolean value, TRUE or FALSE, specifying whether estimate of heritability will be done for autosomes or not. The default is TRUE.

Xsome

Boolean value, TRUE or FALSE, specifying whether estimate of heritability will be done for X chromosome or not. The default is TRUE.

nGRM

Integer value, specifying the number of the partition of the GRM in GREML model. The default is 3.

cripticut

Numeric value, specifying the threshold to create a new GRM of "unrelated" individuals in GREML model. The default is arbitrary chosen as 0.025 following (Yang et al. 2011) .

minMAF

Positive numeric value (0,1), specifying the minimum threshold for the MAF filter of the SNPs in the GREML model. This value cannot be greater than maxMAF parameter. The default is NULL. For NULL, maximum MAF value of the genotype data will be computed and printed on the plot.

maxMAF

Positive numeric value (0,1), specifying the maximum threshold for the MAF filter of the SNPs in the GREML model. This value cannot be less than minMAF parameter. The default is NULL. For NULL, minimum MAF value of the genotype data will be computed and printed on the plot.

hg

Boolean value, specifying the genome built, “hg19” or “hg38” to use chromosome length from UCSC genome browser and getting genes and proteins according to this built. The default is “hg19”.

PlotIndepSNP

Boolean value, TRUE or FALSE, specifying whether to use independent SNPs i.e., chromosome-wise LD pruned SNPs in the plots or not. The default is TRUE.

IndepSNP_window_size

Integer value, specifying a window size in variant count or kilobase for LD-based filtering. The default is 50.

IndepSNP_step_size

Integer value, specifying a variant count to shift the window at the end of each step for LD filtering for pruned SNPs in the plots. The default is 5.

IndepSNP_r2_threshold

Numeric value between 0 to 1 of pairwise \(r^2\) threshold for LD-based filtering for pruned SNPs in the plots. The default is 0.02.

highLD_regions

Character string, specifying the .txt file name with genomic regions with high LD for using in finding pruned SNPs in the plots. This file needs to be in DataDir.

plotjpeg

Boolean value, TRUE or FALSE, specifying whether to save the plots in jpeg file in ResultDir. The default is TRUE.

plotname

String of character value specifying the name of the jpeg file with the plots. The default is "Heritability_Plots".

Value

A dataframe with maximum eight columns for GREML (here, three columns if running genome-wide) and ten columns for LDSC model if byCHR is TRUE. The columns, such as, "chromosome"(i.e., chromosome code),"snp_proportion" (i.e.,chromosome-wise SNP proportion)", "no.of.genes" (i.e., number of genes per chromosome), "no.of.proteins" (i.e., number of genes per chromosome),"size_mb" (i.e., chromosome length), "Source" (i.e., source of heritability), "Variance" (i.e., estimated heritability), and "SE" (i.e., standard error of the estimated heritability) are common for both GREML and LDSC model. The column, "Intercept" (i.e., LDSC regression intercept) and "Int_SE" (i.e., standard error of the intercept) will be two extra columns for LDSC models. Source column will have rows, such as V(1) (i.e., name of genetic variance), V(e) (i.e., residual variance), V(p) (i.e., phenotypic variance), V(1)/Vp (i.e., ratio of genetic variance to phenotypic variance), and V(1)/Vp_L (i.e., ratio of genetic variance to phenotypic variance in liability scale for binary phenotypes). If byCHR is FALSE, then the first five columns will not be reported in the dataframe.

References

Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P, Consortium R, Consortium PG, of the Wellcome Trust Case Control Consortium GCfAN, Perry JRB, Patterson N, Robinson EB, Daly MJ, Price AL, Neale BM (2014). “LD score regression distinguishes confounding from polygenicity in genome-wide association studies.” Nature Genetics, 47, 291–295.

Privé F, Arbel J, Vilhjálmsson BJ (2020). “LDpred2: better, faster, stronger.” Bioinformatics, 36(22-23), 5424–5431. doi:10.1093/bioinformatics/btaa1029 .

Yang J, Lee SH, Goddard ME, Visscher PM (2011). “GCTA: a tool for Genome-wide Complex Trait Analysis.” American Journal of Human Genetics, 88(1), 76–82.

Author

Banabithi Bose

Examples

data("Summary_Stat_Ex1", package = "GXwasR")
data("highLD_hg19", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
precomputedLD <- NULL
finput <- "GXwasR_example"
test.sumstats <- na.omit(Summary_Stat_Ex1[Summary_Stat_Ex1$TEST == "ADD", c(seq_len(4), 6:8)])
colnames(test.sumstats) <- c("chr", "rsid", "pos", "a1", "n_eff", "beta", "beta_se")
summarystat <- test.sumstats
ncores <- 3
model <- "GREML"
byCHR <- FALSE
r2_LD <- 0
LDSC_blocks <- 20
REMLalgo <- 0
nitr <- 3
cat_covarfile <- NULL
quant_covarfile <- NULL
prevalence <- 0.01
partGRM <- FALSE
autosome <- TRUE
Xsome <- TRUE
nGRM <- 3
cripticut <- 0.025
minMAF <- NULL
maxMAF <- NULL
hg <- "hg19"
PlotIndepSNP <- TRUE
IndepSNP_window_size <- 50
IndepSNP_step_size <- 5
IndepSNP_r2_threshold <- 0.02
highLD_regions <- highLD_hg19
H2 <- EstimateHerit(
    DataDir = DataDir, ResultDir = ResultDir, finput = finput,
    summarystat = NULL, ncores, model = "GREML", byCHR = TRUE, r2_LD = 0,
    LDSC_blocks = 20, REMLalgo = 0, nitr = 100, cat_covarfile = NULL, quant_covarfile = NULL,
    prevalence = 0.01, partGRM = FALSE, autosome = TRUE, Xsome = TRUE, nGRM = 3,
    cripticut = 0.025, minMAF = NULL, maxMAF = NULL, hg = "hg19", PlotIndepSNP = TRUE,
    IndepSNP_window_size = 50, IndepSNP_step_size = 5, IndepSNP_r2_threshold = 0.02,
    highLD_regions = highLD_hg19
)
#> Processing chromosome 1
#> Using GCTA version v1.94.4 Mac
#> This message is displayed once every 8 hours.
#> Processing chromosome 2
#> Processing chromosome 3
#> Processing chromosome 4
#> Processing chromosome 5
#> Processing chromosome 6
#> Processing chromosome 7
#> Processing chromosome 8
#> Processing chromosome 9
#> Processing chromosome 10
#> Processing chromosome 23
#> ✔ GRM has been saved in the file [/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//Rtmp0uYjFT/xGXwasR.grm.bin]
#> ℹ Number of SNPs in each pair of individuals has been saved in the file [/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//Rtmp0uYjFT/xGXwasR.grm.N.bin]
#> 
#> Processing chromosome 24
#> ✔ Plots are saved in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//Rtmp0uYjFT with name Heritability_Plots.jpeg
#> ✔ All GRM related files are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//Rtmp0uYjFT