TestXGene: Performing gene-based association test using GWAS/XWAS summary statistics.
Source:R/GXwasR_main_functions.R
TestXGene.Rd
This function performs gene-based association tests using GWAS/XWAS summary statistics and SNP-SNP correlation matrices. For SNP-SNP correlation matrices, users have the flexibility to use either the base genotype data or 1000 Genomes Phase 3 reference genotype data. Users also have options to define the regional positions of genes to include the SNPs according to their investigation.
This function computes gene-wise SNP-SNP correlation matrices and can perform nine different gene-based tests, such as, “BT" (burden test), "SKAT" (sequence kernel association test), "SKATO" (combination of BT and SKAT), "sumchi" (sum of χ2-statistics), "ACAT" (aggregated Cauchy association test for combining P values), "PCA"(principal component approach), "FLM"( functional multiple linear regression model), "simpleM" (Bonferroni correction test), "minp" (minimum P-value) leveraging PLINK1.9 (Purcell et al. 2007) and sumFREGAT (Svishcheva et al. 2019; Belonogova et al. 2022) tools.
Though this function implicitly performs X-linked gene-based test, it is flexible to perform this analysis genome-wide. For the details about the different tests, please follow the associated paper.
Usage
TestXGene(
DataDir,
ResultDir = tempdir(),
finput,
sumstat,
gene_file,
gene_range = 5e+05,
score_file,
ref_data = NULL,
max_gene = NULL,
sample_size = NULL,
genebasedTest = c("SKAT", "SKATO", "sumchi", "ACAT", "BT", "PCA", "FLM", "simpleM",
"minp"),
gene_approximation = TRUE,
beta_par,
weights_function,
geno_variance_weights,
kernel_p_method = "kuonen",
acc_devies = 1e-08,
lim_devies = 1e+06,
rho = TRUE,
skato_p_threshold = 0.8,
anno_type = "",
mac_threshold,
reference_matrix_used,
regularize_fun,
pca_var_fraction = 0.85,
flm_basis_function = "fourier",
flm_num_basis = 25,
flm_poly_order = 4,
flip_genotypes = FALSE,
omit_linear_variant = FALSE
)
Arguments
- DataDir
A character string for the file path of the all the input files.
- ResultDir
A character string for the file path where all output files will be stored. The default is
tempdir()
.- finput
Character string, specifying the prefix of the input PLINK binary files for the genotype data. This file is used to compute the correlation between the SNPs. This file needs to be in DataDir. If the base genotype data is unavailable, then users can use the 1000 Genomes Project samples. Users should use the population that most closely represents the base sample. For ACAT model, this parameter is not mandatory and could be set
NULL
.- sumstat
A dataframe object with GWAS summary statistics. When the base-genotype data is used to compute genetic correlations, the mandatory columns are:
Column 1:
CHROM
(i.e., chromosome code),Column 2:
POS
(i.e., base-pair position),Column 3:
ID
(i.e. SNP IDs),Column 4:
P
(i.e., p-values),Column 5:
BETA
(i.e., effect-size),Column 6:
A1
(i.e., effect allele),Column 7:
A2
(i.e., alternative allele) andColumn 8:
EAF
(i.e., the effect allele frequency)
These are mandatory when base-genotype data is used to compute genetic correlations. Otherwise, if the users are using reference data, then columns 5 to 8 are optional. Also, in that case, columns, such as
REF
(i.e., reference allele), andALT
(i.e., alternative allele) could be present to compare alleles with those in the reference file and exclude genetic variants if alleles do not match. There could be an additional column,ANNO
with functional annotations (like "intron_variant", "synonymous", "missense" etc.)- gene_file
Character string, specifying the prefix of the name of a .txt file listing genes in refFlat format. This file needs to be in
DataDir
. The X-linked gene files, "Xlinkedgenes_hg19.txt" and "Xlinkedgenes_hg38.txt" and autosomal gene files, “Autosomes_hg19.txt” and “Autosomes_hg38.txt” can be specified. The default is "Xlinkedgenes_hg19.txt". The genome built should be in agreement with the analysis.- gene_range
Integer value, specifying the up stream and down stream range (in kilo base) of a gene for SNPs to be considered. The default is 500000.
- score_file
Character string, specifying the prefix of a file which will be used to produce score files with Z scores from P values and beta input from GWAS summary statistics.
- ref_data
Character string, specifying the path to a reference dataframe with additional data needed to recode user data according to correlation matrices that will be used. It contains
ID
column with names of SNPs,REF
andALT
columns with alleles that were coded as 0 and 1, respectively. Effect sizes from data will be inverted for variants with effect alleles different fromALT
alleles in reference data. If presented,REF
andALT
columns from the input data will be used to sort out variants with alleles different from those in reference data. This dataframe can also be a source of map data and allele frequencies if they are not present in data.AF
column in the reference file represents the allele frequency ofALT
allele. The default is "ref1KG.MAC5.EUR_AF.RData".- max_gene
Positive integer value, specifying the number of genes for which the gene-based test will be performed. The default is NULL to consider all the genes.
- sample_size
Positive integer value, specifying the sample size of the GWAS. Only needed for FLM and PCA models.
- genebasedTest
Character string, specifying the name of the gene-based test. Nine different tests can be specified, "SKAT","SKATO","sumchi","ACAT","BT","PCA", "FLM","simpleM","minp". The default is "SKAT".
- gene_approximation
Boolean value,
TRUE
orFALSE
, specifying whether approximation for large genes (>= 500 SNPs) should be used. Applicable for SKAT, SKATO, sumchi, PCA, FLM. The default isTRUE
for these methods).- beta_par
Boolean value,
TRUE
orFALSE
, specifying whether approximation for large genes (>= 500 SNPs) should be used. Applicable for SKAT, SKATO, sumchi, PCA, FLM (default =TRUE
for these methods).- weights_function
A function of MAF to assign weights for each genetic variant. By default is
NULL
. In this case the weights will be calculated using the beta distribution.- geno_variance_weights
Character string, indicating whether scores should be weighted by the variance of genotypes: "none" (i.e., no weights applied, resulting in a sum chi-square test); "se.beta" (i.e., scores weighted by variance of genotypes estimated from P values and effect sizes); "af" (i.e., scores weighted by variance of genotypes calculated as \(AF * (1 - AF)\), where AF is allele frequency.
- kernel_p_method
Character string, specifying the method for computing P value in kernel-based tests, such as SKAT, SKATO and sumchi. Available methods are "kuonen" (Belonogova et al. 2022) "davies" (Belonogova et al. 2022) and "hybrid" (Belonogova et al. 2022) . The default is "kuonen".
- acc_devies
Positive numeric value, specifying the accuracy parameter for "davies" method. The default is 1e-8.
- lim_devies
Positive numeric value, specifying the limit parameter for "davies" method. The default is 1e+6.
- rho
Logical value, 'TRUE' or 'FALSE' or can be a vector of grid values from 0 to 1. If TRUE, the optimal test (SKAT-O) is performed (12). The default grid is c(0, 0.1^2, 0.2^2, 0.3^2, 0.4^2, 0.5^2, 0.5, 1).
- skato_p_threshold
Positive numeric value, specifying the largest P value that will be considered as important when performing computational optimization in SKAT-O. All P values larger than skato_p_threshold will be processed via burden test. The default is 0.8
- anno_type
A character (or character vector) indicating annotation types to be used. The default is "" (i.e, nothing).
- mac_threshold
Integer value, specifying the threshold of MACs (Minor allele content) calculated from MAFs. In ACAT, scores with MAC <= 10 will be combined using Burden test.
- reference_matrix_used
Boolean value,
TRUE
orFALSE
logical indicating whether the correlation matrices were generated using the reference matrix. The default isFALSE
. IfTRUE
, regularization algorithms will be applied to ensure the invertibility and numerical stability of the matrices.- regularize_fun
Character string, specifying the one of two regularization algorithms if ‘reference_matrix’ is TRUE: 'LH' (default) or 'derivLH'. Currently, both give similar results.
- pca_var_fraction
P ositive numeric value, specifying the minimal proportion of genetic variance within the region that should be explained by principal components used in PCA method. This is also valid in 'simpleM'. The default is 0.85.
- flm_basis_function
Character string, specifying the name of a basis function type for beta-smooth in FLM method. Can be set to "bspline" (B-spline basis) or "fourier" (Fourier basis, default).
- flm_num_basis
Positive integer value, specifying the number of basis functions to be used for beta-smooth in FLM method. The default is 25.
- flm_poly_order
Positive integer value, specifying the polynomial order to be used in "bspline" for FLM model. The default = 4, which corresponds to the cubic B-splines. This has no effect if only Fourier bases are used
- flip_genotypes
L ogical value,
TRUE
orFALSE
, indicating whether the genotypes of some genetic variants should be flipped (relabelled) for their better functional representation (13). The default isFALSE
.- omit_linear_variant
Logical value,
TRUE
orFALSE
, indicating whether to omit linearly dependent genetic variants. It was done in the FLM test (4). The default isFALSE
.
Value
A data frame with columns:
gene
chrom
start
end
markers (i.e., numbers of SNPs),
filtered.markers (i.e. filtered SNPs)
pvalue (i.e., p-value).
Additionally, for “BT”, there will be “beta” (i.e., gene-level estimates of betas) and “beta.se” (i.e., standard errors of betas). For “FLM”, there will be the “model” column with the names of the functional models used for each region. Names shortly describe the functional basis and the number of basis functions used. E.g., "F25" means 25 Fourier basis functions, "B15" means 15 B-spline basis functions. For “PCA”, there will be the “ncomponents” (the number of the principal components used for each region) and “explained.variance.fraction” (i.e., the proportion of genetic variance they make up) columns.
References
Belonogova NM, Svishcheva GR, Kirichenko AV, Zorkoltseva IV, Tsepilov YA, Axenovich TI (2022).
“sumSTAAR: A flexible framework for gene-based association studies using GWAS summary statistics.”
PLoS Computational Biology, 18(6), e1010172.
doi:10.1371/journal.pcbi.1010172
, http://www.ncbi.nlm.nih.gov/pubmed/35653402.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, others (2007).
“PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses.”
The American Journal of Human Genetics, 81(3), 559–575.
doi:10.1086/519795
.
Svishcheva GR, Belonogova NM, Zorkoltseva IV, Kirichenko AV, Axenovich TI (2019).
“Gene-based association tests using GWAS summary statistics.”
Bioinformatics, 35(19), 3701–3708.
doi:10.1093/bioinformatics/btz172
, http://www.ncbi.nlm.nih.gov/pubmed/30860568.
Examples
if (!(Sys.getenv("CI") == "true" && Sys.info()[["sysname"]] == "Darwin")) {
data("XWAS_Summary_Example", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
sumstat <- XWAS_Summary_Example
ref_data <- NULL
gene_file <- "Xlinkedgenes_hg19.txt"
gene_range <- 500000
max_gene <- 10
gene_approximation <- TRUE
beta_par <- c(1, 25)
weights_function <- NULL
geno_variance_weights <- "se.beta"
method <- "kuonen"
acc_devies <- 1e-8
lim_devies <- 1e+6
rho <- TRUE
skato_p_threshold <- 0.8
mac_threshold <- 3
sample_size <- 4000
reference_matrix_used <- FALSE
regularize_fun <- "LH"
pca_var_fraction <- 0.85
flm_basis_function <- "fourier"
flm_num_basis <- 25
flm_poly_order <- 4
flip_genotypes <- FALSE
omit_linear_variant <- FALSE
kernel_p_method <- "kuonen"
anno_type <- ""
GenetestResult <- TestXGene(DataDir, ResultDir, finput, sumstat, gene_file,
gene_range, score_file, ref_data, max_gene, sample_size,
genebasedTest = "SKAT",
gene_approximation, beta_par, weights_function, geno_variance_weights,
kernel_p_method, acc_devies, lim_devies, rho, skato_p_threshold, anno_type,
mac_threshold, reference_matrix_used, regularize_fun, pca_var_fraction,
flm_basis_function, flm_num_basis, flm_poly_order, flip_genotypes,
omit_linear_variant
)
}
#> ℹ 80 of 80 variants found in reference
#> ℹ Effect sizes recoded for 0 variant(s)
#> ✔ Allele frequencies found and linked
#> ✔ File /private/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T/RtmpO7c0S8/gene.test.score.file.vcf.gz has been created and indexed
#> ℹ 157 genes are having 321 SNPs
#> ! Warning: '/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8/cormatrix' already exists
#> • SNP-SNP correlation matrices are being created...
#> ✔ SNP-SNP correlation matrices are done.
#> ! Warning: No variants to analyze in gene uc004cqy.3, skipped
#> ! Warning: No variants to analyze in gene uc004csf.3, skipped
#> ! Warning: No variants to analyze in gene uc004csr.3, skipped
#> ! Warning: No variants to analyze in gene uc004cst.2, skipped
#> ! Warning: No variants to analyze in gene uc004csu.1, skipped
#> ! Warning: No variants to analyze in gene uc004csx.4, skipped
#> ! Warning: No variants to analyze in gene uc004cuw.3, skipped