EstimateHerit: Computing SNP heritability i.e., the proportion of phenotypic variance explained by SNPs.
Source:R/GXwasR_main_functions.R
EstimateHerit.Rd
This functions performs two types of heritability estimation, (i)GREML:Genomic relatedness matrix (GRM) restricted maximum likelihood-based method following GCTA (Yang et al. 2011) and (ii)LDSC: LD score regression-based method following (Bulik-Sullivan et al. 2014; Privé et al. 2020) . For the details, please follow the associated paper.
Prior to using this function, it is recommended to apply QCsnp and QCsample to ensure data quality control.
Usage
EstimateHerit(
DataDir = NULL,
ResultDir = tempdir(),
finput = NULL,
precomputedLD = NULL,
indepSNPs = NULL,
summarystat = NULL,
ncores = 2,
model = c("LDSC", "GREML"),
computeGRM = TRUE,
grmfile_name = NULL,
byCHR = FALSE,
r2_LD = 0,
LDSC_blocks = 20,
intercept = NULL,
chi2_thr1 = 30,
chi2_thr2 = Inf,
REMLalgo = c(0, 1, 2),
nitr = 100,
cat_covarfile = NULL,
quant_covarfile = NULL,
prevalence = NULL,
partGRM = FALSE,
autosome = TRUE,
Xsome = TRUE,
nGRM = 3,
cripticut = 0.025,
minMAF = NULL,
maxMAF = NULL,
hg = c("hg19", "hg38"),
PlotIndepSNP = TRUE,
IndepSNP_window_size = 50,
IndepSNP_step_size = 5,
IndepSNP_r2_threshold = 0.02,
highLD_regions = NULL,
plotjpeg = TRUE,
plotname = "Heritability_Plots"
)
Arguments
- DataDir
A character string for the file path of the all the input files. The default is
NULL
.- ResultDir
A character string for the file path where all output files will be stored. The default is
tempdir()
.- finput
Character string, specifying the prefix of the input PLINK binary files for the genotype data. This file needs to be in
DataDir
. For LDSC model, if the original genotype data is not available, Hapmap 3 or 1000Genome data can be used. If use NULL, then you need to provideprecomputedLD
argument. See below.- precomputedLD
A dataframe object as LD matrix with columns:
CHR
,SNP
,BP
,ld_size
,MAF
,ld_score
. . The default isNULL
.- indepSNPs
A dataframe with independent SNP ids with column name "rsid". The default is
NULL
.- summarystat
A dataframe object with GWAS summary statistics. The mandatory column headers in this dataframe are
chr
(Chromosome code),pos
(Basepair position)a1
(First allele code)rsid
(i.e., SNP identifier)beta
(i.e., effect-size or logarithm of odds ratio)beta_se
(i.e., standard error of beta)P
(i.e., p-values)n_eff
(i.e., effective sample size)
For case-control study, effective sample size should be \(4 / (1/<# of cases> + 1/<# of controls>)\). The default is
NULL
.- ncores
Integer value, specifying the number of cores to be used for running LDSC model. The default is 2.
- model
Character string, specifying the heritability estimation model. There are two options, “GREML” or “LDSC”. The default is “GREML”.
Note: argument For LDSC, DataDir and finput can be
NULL
.- computeGRM
Boolean value,
TRUE
orFALSE
, specifying whether to compute GRM matrices or not. The default isTRUE
.- grmfile_name
A string of characters specifying the prefix of autosomal .grm.bin file. Users need to provide separate GRM files for autosomes and X chromosome in
ResultDir
.The X chromosomal GRM file should have "x" added in the autosomal prefix as file name. For instance, if autosomal file is "ABC.grm.bin", then X chromosomal file should be "xABC.grm.bim".
If you are providing chromosome-wise GRMs, then the prefix should add "ChrNumber_" at the start of the prefix like, "Chr1_ABC.grm.bin". The default is NULL.
- byCHR
Boolean value,
TRUE
orFALSE
, specifying whether the analysis will be performed chromosome wise or not. The default isFALSE
.- r2_LD
Numeric value, specifying the LD threshold for clumping in LDSC model. The default is 0.
- LDSC_blocks
Integer value, specifying the block size for performing jackknife variance estimator in LDSC model following (Privé et al. 2020) . The default is 200.
- intercept
Numeric value to constrain the intercept to some value (e.g. 1) in LDSC regression. Default is
NULL
.- chi2_thr1
Numeric value for threshold on chi2 in step 1 of LDSC regression. Default is 30.
- chi2_thr2
Numeric value for threshold on chi2 in step 2. Default is
Inf
(none).- REMLalgo
Integer value of 0, 1 or 2, specifying the algorithm to run REML iterations, 0 for average information (AI), 1 for Fisher-scoring and 2 for EM. The default option is 0, i.e. AI-REML (Yang et al. 2011) .
- nitr
Integer value, specifying the number of iterations for performing the REML. The default is 100.
- cat_covarfile
A character string, specifying the name of the categorical covariate file which is a plain text file with no header line; columns are family ID, individual ID and discrete covariates. The default is
NULL
. This file needs to be inDataDir
.- quant_covarfile
A character string, specifying the name of the quantitative covariate file which is a plain text file with no header line; columns are family ID, individual ID and continuous covariates. The default is
NULL
. This file needs to be inDataDir
.- prevalence
Numeric value, specifying the disease prevalence. The default is
NULL
.Note: for the continuous trait value, users should use the default.
- partGRM
Boolean value,
TRUE
orFALSE
, specifying whether the GRM will be partitioned into n parts (by row) in GREML model. The default isFALSE
.- autosome
Boolean value,
TRUE
orFALSE
, specifying whether estimate of heritability will be done for autosomes or not. The default isTRUE
.- Xsome
Boolean value,
TRUE
orFALSE
, specifying whether estimate of heritability will be done for X chromosome or not. The default isTRUE
.- nGRM
Integer value, specifying the number of the partition of the GRM in GREML model. The default is 3.
- cripticut
Numeric value, specifying the threshold to create a new GRM of "unrelated" individuals in GREML model. The default is arbitrary chosen as 0.025 following (Yang et al. 2011) .
- minMAF
Positive numeric value (0,1), specifying the minimum threshold for the MAF filter of the SNPs in the GREML model. This value cannot be greater than
maxMAF
parameter. The default isNULL
. ForNULL
, maximum MAF value of the genotype data will be computed and printed on the plot.- maxMAF
Positive numeric value (0,1), specifying the maximum threshold for the MAF filter of the SNPs in the GREML model. This value cannot be less than
minMAF
parameter. The default isNULL
. ForNULL
, minimum MAF value of the genotype data will be computed and printed on the plot.- hg
Boolean value, specifying the genome built, “hg19” or “hg38” to use chromosome length from UCSC genome browser and getting genes and proteins according to this built. The default is “hg19”.
- PlotIndepSNP
Boolean value,
TRUE
orFALSE
, specifying whether to use independent SNPs i.e., chromosome-wise LD pruned SNPs in the plots or not. The default isTRUE
.- IndepSNP_window_size
Integer value, specifying a window size in variant count or kilobase for LD-based filtering. The default is 50.
- IndepSNP_step_size
Integer value, specifying a variant count to shift the window at the end of each step for LD filtering for pruned SNPs in the plots. The default is 5.
- IndepSNP_r2_threshold
Numeric value between 0 to 1 of pairwise \(r^2\) threshold for LD-based filtering for pruned SNPs in the plots. The default is 0.02.
- highLD_regions
Character string, specifying the .txt file name with genomic regions with high LD for using in finding pruned SNPs in the plots. This file needs to be in
DataDir
.- plotjpeg
Boolean value,
TRUE
orFALSE
, specifying whether to save the plots in jpeg file inResultDir
. The default isTRUE
.- plotname
String of character value specifying the name of the jpeg file with the plots. The default is "Heritability_Plots".
Value
A dataframe with maximum eight columns for GREML (here, three columns if running genome-wide) and ten columns for LDSC model if byCHR is TRUE
.
The columns, such as, "chromosome"(i.e., chromosome code),"snp_proportion" (i.e.,chromosome-wise SNP proportion)", "no.of.genes" (i.e., number of genes per chromosome),
"no.of.proteins" (i.e., number of genes per chromosome),"size_mb" (i.e., chromosome length), "Source" (i.e., source of heritability),
"Variance" (i.e., estimated heritability), and "SE" (i.e., standard error of the estimated heritability) are common for both GREML and LDSC model.
The column, "Intercept" (i.e., LDSC regression intercept) and "Int_SE" (i.e., standard error of the intercept) will be two extra columns for LDSC models.
Source column will have rows, such as V(1)
(i.e., name of genetic variance), V(e)
(i.e., residual variance), V(p)
(i.e., phenotypic variance), V(1)/Vp
(i.e.,
ratio of genetic variance to phenotypic variance), and V(1)/Vp_L
(i.e., ratio of genetic variance to phenotypic variance in liability scale for binary phenotypes).
If byCHR
is FALSE
, then the first five columns will not be reported in the dataframe.
References
Bulik-Sullivan B, Finucane HK, Anttila V, Gusev A, Day FR, Loh P, Consortium R, Consortium PG, of the Wellcome Trust Case Control Consortium GCfAN, Perry JRB, Patterson N, Robinson EB, Daly MJ, Price AL, Neale BM (2014).
“LD score regression distinguishes confounding from polygenicity in genome-wide association studies.”
Nature Genetics, 47, 291–295.
Privé F, Arbel J, Vilhjálmsson BJ (2020).
“LDpred2: better, faster, stronger.”
Bioinformatics, 36(22-23), 5424–5431.
doi:10.1093/bioinformatics/btaa1029
.
Yang J, Lee SH, Goddard ME, Visscher PM (2011).
“GCTA: a tool for Genome-wide Complex Trait Analysis.”
American Journal of Human Genetics, 88(1), 76–82.
Examples
data("Summary_Stat_Ex1", package = "GXwasR")
data("highLD_hg19", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
precomputedLD <- NULL
finput <- "GXwasR_example"
test.sumstats <- na.omit(Summary_Stat_Ex1[Summary_Stat_Ex1$TEST == "ADD", c(seq_len(4), 6:8)])
colnames(test.sumstats) <- c("chr", "rsid", "pos", "a1", "n_eff", "beta", "beta_se")
summarystat <- test.sumstats
ncores <- 3
model <- "GREML"
byCHR <- FALSE
r2_LD <- 0
LDSC_blocks <- 20
REMLalgo <- 0
nitr <- 3
cat_covarfile <- NULL
quant_covarfile <- NULL
prevalence <- 0.01
partGRM <- FALSE
autosome <- TRUE
Xsome <- TRUE
nGRM <- 3
cripticut <- 0.025
minMAF <- NULL
maxMAF <- NULL
hg <- "hg19"
PlotIndepSNP <- TRUE
IndepSNP_window_size <- 50
IndepSNP_step_size <- 5
IndepSNP_r2_threshold <- 0.02
highLD_regions <- highLD_hg19
H2 <- EstimateHerit(
DataDir = DataDir, ResultDir = ResultDir, finput = finput,
summarystat = NULL, ncores, model = "GREML", byCHR = TRUE, r2_LD = 0,
LDSC_blocks = 20, REMLalgo = 0, nitr = 100, cat_covarfile = NULL, quant_covarfile = NULL,
prevalence = 0.01, partGRM = FALSE, autosome = TRUE, Xsome = TRUE, nGRM = 3,
cripticut = 0.025, minMAF = NULL, maxMAF = NULL, hg = "hg19", PlotIndepSNP = TRUE,
IndepSNP_window_size = 50, IndepSNP_step_size = 5, IndepSNP_r2_threshold = 0.02,
highLD_regions = highLD_hg19
)
#> Processing chromosome 1
#> Processing chromosome 2
#> Processing chromosome 3
#> Processing chromosome 4
#> Processing chromosome 5
#> Processing chromosome 6
#> Processing chromosome 7
#> Processing chromosome 8
#> Processing chromosome 9
#> Processing chromosome 10
#> Processing chromosome 23
#> ✔ GRM has been saved in the file [/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8/xGXwasR.grm.bin]
#> ℹ Number of SNPs in each pair of individuals has been saved in the file [/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8/xGXwasR.grm.N.bin]
#>
#> Processing chromosome 24
#> ✔ Plots are saved in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8 with name Heritability_Plots.jpeg
#> ✔ All GRM related files are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8