ClumpLD: Clumping SNPs using linkage disequilibrium between SNPs

This function groups SNP-based results across one or more datasets or analyses, based on empirical estimates of linkage disequilibrium (LD) between SNPs. This approach can be used in two basic scenarios: (i) to summarize the top X single-SNP findings from a genome-wide scan as fewer clusters of connected SNPs (i.e., to assess how many independent loci are associated); (ii) to give researchers a simple way to merge results from multiple studies when those studies may have used different marker sets for genotyping.

Clumping begins with the index SNPs, that is, SNPs that are significant at threshold clump_p1 and have not yet been clumped. For each index SNP, a clump is formed from all other SNPs that lie within a specified distance (clump_kb) of the index SNP and are in linkage disequilibrium with it, based on an r-squared threshold (clump_r2). These SNPs are then filtered by their own association result (the secondary threshold clump_p2). Because the method is greedy (Purcell et al. 2007), each SNP appears in at most one clump. If the same SNP appears in several input files in the SNPdata argument, its P value and ALLELES are always taken, arbitrarily, from the first input file. When reporting the best proxy, the function uses the SNP in strongest LD with the index SNP rather than the SNP with the best p-value. Because LD is computed from the genotype data, the SNP with the highest LD is the same for all input files.
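The greedy procedure described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code that ClumpLD or PLINK actually runs: the helper greedy_clump(), the toy table stats (with a hypothetical BP column for base-pair position), and the hand-made r-squared matrix r2 are all invented for this example.

```r
# Toy greedy LD clumping. Illustrative only; not the ClumpLD/PLINK implementation.
# greedy_clump(), stats, and r2 are hypothetical names made up for this sketch.
greedy_clump <- function(stats, r2, p1 = 1e-4, p2 = 0.01, min_r2 = 0.5, kb = 250) {
  stats <- stats[order(stats$P), ]           # most significant SNPs first
  clumped <- character(0)                    # SNPs already assigned to a clump
  clumps <- list()
  for (i in seq_len(nrow(stats))) {
    index <- stats$SNP[i]
    # An index SNP must pass p1 and must not already belong to a clump.
    if (stats$P[i] > p1 || index %in% clumped) next
    near  <- abs(stats$BP - stats$BP[i]) <= kb * 1000   # within kb kilobases
    in_ld <- r2[index, stats$SNP] >= min_r2             # LD with the index SNP
    keep  <- near & in_ld & stats$P <= p2 &             # secondary p-value filter
      !(stats$SNP %in% clumped) & stats$SNP != index
    clumps[[index]] <- stats$SNP[keep]
    clumped <- c(clumped, index, stats$SNP[keep])
  }
  clumps
}

stats <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  BP  = c(100000, 150000, 900000, 160000),
  P   = c(1e-8, 1e-3, 1e-6, 0.2),
  stringsAsFactors = FALSE
)

# Hand-made symmetric r-squared matrix; real values come from genotype data.
r2 <- matrix(0, 4, 4, dimnames = list(stats$SNP, stats$SNP))
diag(r2) <- 1
r2["rs1", "rs2"] <- r2["rs2", "rs1"] <- 0.8
r2["rs1", "rs4"] <- r2["rs4", "rs1"] <- 0.9

clumps <- greedy_clump(stats, r2)
print(clumps)
```

With this toy input, rs2 joins the clump of rs1, rs4 is in LD with rs1 but fails the p2 filter, and rs3 is too far away to join and so forms a clump of its own with no members.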

Usage

ClumpLD(
  DataDir,
  finput,
  SNPdata,
  ResultDir = tempdir(),
  clump_p1,
  clump_p2,
  clump_r2,
  clump_kb,
  byCHR = TRUE,
  clump_best = TRUE,
  clump_index_first = TRUE
)

Arguments

DataDir: A character string for the file path of the input PLINK binary files.
finput: Character string, specifying the prefix of the input PLINK binary files used to calculate linkage disequilibrium between the SNPs. The genotype data may or may not be the same dataset that was used to generate the summary statistics. These files must be in DataDir.
SNPdata: A list of R dataframes, each holding one set of summary statistics. Every dataframe must contain the columns SNP and P (p-values); other columns may be present.
ResultDir: A character string for the file path where all output files will be stored. The default is tempdir().
clump_p1: Numeric value, specifying the significance threshold for index SNPs. The default is 0.0001.
clump_p2: Numeric value, specifying the secondary significance threshold for clumped SNPs. The default is 0.01.
clump_r2: Numeric value, specifying the LD threshold for clumping. The default is 0.50.
clump_kb: Integer value, specifying the physical distance threshold in kilobases (kb) for clumping. The default is 250.
byCHR: Boolean value, TRUE or FALSE, specifying whether to perform the clumping chromosome-wise. Default is TRUE.
clump_best: Boolean value, TRUE or FALSE, specifying whether to select and output the best SNP from each clump. Default is TRUE.
clump_index_first: Boolean value, TRUE or FALSE, specifying whether to force the index SNP to appear first in each clump. This option should typically be TRUE if clump_best is TRUE. Default is TRUE.
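
As a hedged illustration of the SNPdata argument, the sketch below builds a list of two small summary-statistics dataframes and checks that each one has the mandatory SNP and P columns. The helper check_snpdata() and the toy values are hypothetical and are not part of GXwasR.

```r
# check_snpdata() is a hypothetical helper, not a GXwasR function.
check_snpdata <- function(SNPdata, required = c("SNP", "P")) {
  # SNPdata must be a plain list of dataframes, not a single dataframe.
  stopifnot(is.list(SNPdata), !is.data.frame(SNPdata))
  ok <- vapply(SNPdata, function(df) {
    is.data.frame(df) && all(required %in% names(df)) && is.numeric(df$P)
  }, logical(1))
  if (!all(ok)) {
    stop("Elements without valid SNP/P columns: ", paste(which(!ok), collapse = ", "))
  }
  invisible(TRUE)
}

# Toy summary statistics; extra columns such as BETA are allowed.
study1 <- data.frame(SNP = c("rs1", "rs2"), P = c(1e-6, 0.03), BETA = c(0.2, -0.1))
study2 <- data.frame(SNP = c("rs2", "rs3"), P = c(0.004, 2e-5))

SNPdata <- list(study1, study2)
valid <- check_snpdata(SNPdata)
```

A check of this kind can catch a missing or misnamed column before the list is passed on for clumping.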

Value

A list with two dataframes.

BestClump: a dataframe showing the single best proxy SNP for each index SNP, with columns "INDEX" (index SNP identifier), "PSNP" (best proxy SNP), "RSQ LD" (r-squared between index and proxy), "KB" (physical distance between index and proxy), "P" (p-value for the proxy SNP), "ALLELES" (the associated haplotypes for the index and proxy SNP), and "F" (the input file from which this result came).

AllClump: a dataframe with eight columns providing a detailed summary of each clump identified by PLINK. It includes "INDEX_SNP" (the identifier for the index SNP that represents the clump), "SNP" (the SNP being reported, which for the index SNP is the same as INDEX_SNP), "DISTANCE" (the physical distance in base pairs between the index SNP and the reported SNP, with 0.0 indicating the index itself), "RSQ" (the r-squared value showing the degree of linkage disequilibrium between the index SNP and the SNP in the clump), "ALLELES" (the allele information, which in some cases may appear misaligned if the data isn’t formatted as expected), "F" (a statistic or indicator related to the association test, which may be NA when not applicable), "P" (the p-value for the association test of the SNP), and "CHR" (the chromosome on which the SNP is located).
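
One possible way to use a table shaped like AllClump for counting independent loci is to tally rows per index SNP. The sketch below uses a hand-made dataframe with a subset of the column names documented above; the values are invented for illustration and do not come from any real run.

```r
# Toy dataframe shaped like AllClump (documented column names, invented values).
AllClump <- data.frame(
  INDEX_SNP = c("rs1", "rs1", "rs1", "rs9"),
  SNP       = c("rs1", "rs2", "rs5", "rs9"),
  DISTANCE  = c(0, 50000, 12000, 0),
  RSQ       = c(1, 0.8, 0.65, 1),
  P         = c(1e-8, 1e-3, 5e-4, 1e-6),
  CHR       = c(1, 1, 1, 23),
  stringsAsFactors = FALSE
)

# Number of independent loci = number of distinct index SNPs.
n_loci <- length(unique(AllClump$INDEX_SNP))

# Clump size per index SNP (the index SNP itself is counted).
clump_size <- table(AllClump$INDEX_SNP)

# Strongest non-index proxy (highest r-squared) within each clump.
proxies <- AllClump[AllClump$SNP != AllClump$INDEX_SNP, ]
proxies <- proxies[order(proxies$INDEX_SNP, -proxies$RSQ), ]
best_proxy <- proxies[!duplicated(proxies$INDEX_SNP), c("INDEX_SNP", "SNP", "RSQ")]

print(n_loci)
print(clump_size)
print(best_proxy)
```

Index SNPs that have no other SNP in their clump (rs9 in this toy table) contribute to the locus count but have no row in best_proxy.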

References

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, others (2007). “PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses.” The American Journal of Human Genetics, 81(3), 559–575. doi:10.1086/519795.

Author

Banabithi Bose

Examples

data("Summary_Stat_Ex1", package = "GXwasR")
data("Summary_Stat_Ex2", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
SNPdata <- list(Summary_Stat_Ex1, Summary_Stat_Ex2)
clump_p1 <- 0.0001
clump_p2 <- 0.001
clump_r2 <- 0.5
clump_kb <- 250
byCHR <- TRUE
clumpedResult <- ClumpLD(
    DataDir, finput, SNPdata, ResultDir, clump_p1,
    clump_p2, clump_r2, clump_kb, byCHR
)
#> • Processing summary statistics 1
#> • Processing summary statistics 2
#> • Running LD clumping for chromosome 1
#> ℹ No significant clump results for chromosome 1
#> • Running LD clumping for chromosome 2
#> ℹ No significant clump results for chromosome 2
#> • Running LD clumping for chromosome 3
#> ℹ No significant clump results for chromosome 3
#> • Running LD clumping for chromosome 4
#> ℹ No significant clump results for chromosome 4
#> • Running LD clumping for chromosome 5
#> ℹ No significant clump results for chromosome 5
#> • Running LD clumping for chromosome 6
#> ℹ No significant clump results for chromosome 6
#> • Running LD clumping for chromosome 7
#> ℹ No significant clump results for chromosome 7
#> • Running LD clumping for chromosome 8
#> ℹ No significant clump results for chromosome 8
#> • Running LD clumping for chromosome 9
#> ℹ No significant clump results for chromosome 9
#> • Running LD clumping for chromosome 10
#> ℹ No significant clump results for chromosome 10
#> • Running LD clumping for chromosome 23
#> • Running LD clumping for chromosome 24
#> ℹ No significant clump results for chromosome 24