This function performs QC of genotype data from PLINK binary files. It can filter based on minor allele frequency, Hardy-Weinberg equilibrium, call rate, and differential missingness between cases and controls. It can also perform linkage disequilibrium-based filtering.
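As a rough illustration of what the MAF, call-rate, and monomorphic-SNP filters compute, the sketch below derives per-SNP statistics from a toy genotype matrix in base R. It is not how QCsnp works internally (the function operates on PLINK binary files). The matrix, the variable names, and the reading of geno as a maximum per-SNP missing rate (the convention of PLINK's --geno flag) are assumptions made for the example.

```r
# Toy genotype matrix (hypothetical data): rows are samples, columns are SNPs.
# Entries count copies of one allele (0, 1 or 2); NA marks a missing call.
geno_mat <- matrix(
  c(0, 1, 2, 0,
    0, 0, 1, NA,
    0, 1, 2, NA,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("snp1", "snp2", "snp3", "snp4"))
)

# Per-SNP missing rate: fraction of samples without a genotype call.
miss_rate <- colMeans(is.na(geno_mat))

# Per-SNP frequency of the counted allele, then fold to the minor allele.
allele_freq <- colMeans(geno_mat, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

# Monomorphic SNPs carry only one allele, so their MAF is zero.
monomorphic <- names(maf)[maf == 0]

# Assumed thresholds, mirroring the defaults maf = 0.05 and geno = 0.1.
keep <- maf >= 0.05 & miss_rate <= 0.1
passed <- names(keep)[keep]
passed
```

Here snp1 fails because it is monomorphic and snp4 fails because half of its calls are missing, so only snp2 and snp3 pass.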

Usage

QCsnp(
  DataDir,
  ResultDir = tempdir(),
  finput,
  foutput = "FALSE",
  casecontrol = TRUE,
  hweCase = NULL,
  hweControl = NULL,
  hwe = NULL,
  maf = 0.05,
  geno = 0.1,
  monomorphicSNPs = FALSE,
  caldiffmiss = FALSE,
  diffmissFilter = FALSE,
  dmissX = FALSE,
  dmissAutoY = FALSE,
  highLD_regions = NULL,
  ld_prunning = FALSE,
  window_size = 50,
  step_size = 5,
  r2_threshold = 0.02
)

Arguments

DataDir

A character string for the file path of the input PLINK binary files.

ResultDir

A character string for the file path where all output files will be stored. The default is tempdir().

finput

Character string, specifying the prefix of the input PLINK binary files with both male and female samples. This file needs to be in DataDir.

foutput

Character string, specifying the prefix of the output PLINK binary files if the filtering option for the SNPs is chosen. The default is "FALSE".

casecontrol

Boolean value, TRUE or FALSE, indicating whether the input PLINK files have case-control status. The default is TRUE.

hweCase

Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in cases. The default is NULL.

hweControl

Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in controls. The default is NULL.

hwe

Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in the entire dataset. The default is NULL.

maf

Numeric value between 0 and 1 for removing SNPs with minor allele frequency less than the specified threshold. The default is 0.05.

geno

Numeric value between 0 and 1 for removing SNPs that have less than the specified call rate. The default is 0.1.

Users can set this as NULL to not apply this filter.

monomorphicSNPs

Boolean value, TRUE or FALSE, for filtering out monomorphic SNPs. The default is FALSE.

caldiffmiss

Boolean value, TRUE or FALSE, specifying whether to compute differential missingness between cases and controls for each SNP. The significance threshold is \(0.05 / m\), where \(m\) is the number of SNPs in the test. The default is FALSE.

diffmissFilter

Boolean value, TRUE or FALSE, specifying whether to filter out SNPs with differential missingness in cases vs controls (TRUE) or only flag them (FALSE). The default is FALSE.

dmissX

Boolean value, TRUE or FALSE for computing differential missingness between cases and controls for X chromosome SNPs only. The default is FALSE. The diffmissFilter will work for all these SNPs.

dmissAutoY

Boolean value, TRUE or FALSE for computing differential missingness between cases and controls for SNPs on autosomes and Y chromosome only. The default is FALSE.

If dmissX and dmissAutoY are both FALSE, then this will be computed genome-wide. The diffmissFilter will work for all these SNPs.

highLD_regions

A dataframe with known high LD regions (Anderson et al. 2010) is provided with the package.

ld_prunning

Boolean value, TRUE or FALSE, for applying linkage disequilibrium (LD)-based filtering. The default is FALSE.

window_size

Integer value, specifying the window size, in variant counts, for LD-based filtering. The default is 50.

step_size

Integer value, specifying the number of variants by which the window is shifted at the end of each step of LD-based filtering. The default is 5.

r2_threshold

Numeric value between 0 and 1 giving the pairwise \(r^2\) threshold for LD-based filtering. The default is 0.02.
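To make the interplay of window_size, step_size, and r2_threshold concrete, here is a simplified sliding-window pruning sketch in base R. It is an illustration under stated assumptions, not the routine QCsnp runs: the function name ld_prune_sketch and the toy data are hypothetical, \(r^2\) is taken as the squared Pearson correlation of allele counts, and the later SNP of each correlated pair is dropped, which may differ from the choice a PLINK-based pipeline makes.

```r
# Hypothetical helper: greedy LD pruning over sliding windows of SNPs.
ld_prune_sketch <- function(geno_mat, window_size = 50, step_size = 5,
                            r2_threshold = 0.02) {
  n_snps <- ncol(geno_mat)
  keep <- rep(TRUE, n_snps)
  start <- 1
  while (start <= n_snps) {
    end <- min(start + window_size - 1, n_snps)
    # Indices of SNPs in this window that have not been pruned yet.
    idx <- which(keep[start:end]) + start - 1
    if (length(idx) > 1) {
      for (a in seq_len(length(idx) - 1)) {
        i <- idx[a]
        if (!keep[i]) next
        for (j in idx[(a + 1):length(idx)]) {
          if (!keep[j]) next
          r2 <- cor(geno_mat[, i], geno_mat[, j])^2
          # Drop the later SNP when the pair exceeds the threshold.
          if (!is.na(r2) && r2 > r2_threshold) keep[j] <- FALSE
        }
      }
    }
    start <- start + step_size  # shift the window by step_size variants
  }
  colnames(geno_mat)[keep]
}

# Toy data: snpB duplicates snpA, snpD mirrors snpC, and the two
# pairs are uncorrelated with each other.
toy <- cbind(
  snpA = c(0, 1, 2, 0, 1, 2),
  snpB = c(0, 1, 2, 0, 1, 2),
  snpC = c(0, 0, 0, 2, 2, 2),
  snpD = c(2, 2, 2, 0, 0, 0)
)
pruned <- ld_prune_sketch(toy)
pruned
```

With the default thresholds, one SNP from each perfectly correlated pair is retained, leaving snpA and snpC.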

Value

A list of two objects, MonomorSNPs and DiffMissSNPs, containing the monomorphic SNPs and the SNPs with differential missingness in cases vs controls, respectively. The output PLINK binary files are saved in ResultDir.
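The differential-missingness threshold given for caldiffmiss (0.05 divided by the number of SNPs in the test) is a Bonferroni correction. The sketch below shows how such a threshold could be applied to flag SNPs. The counts are invented, and the use of Fisher's exact test on missing vs observed calls is an assumption for illustration, not a statement of what QCsnp runs internally.

```r
# Hypothetical per-SNP counts of missing and observed genotype calls,
# split by case/control status.
counts <- data.frame(
  snp          = c("snp1", "snp2", "snp3"),
  case_miss    = c(2, 30, 5),
  case_obs     = c(98, 70, 95),
  control_miss = c(3, 1, 4),
  control_obs  = c(97, 99, 96),
  stringsAsFactors = FALSE
)

# One 2x2 test per SNP: missingness (missing/observed) by status (case/control).
p_values <- mapply(
  function(cm, co, tm, to) {
    fisher.test(matrix(c(cm, co, tm, to), nrow = 2))$p.value
  },
  counts$case_miss, counts$case_obs, counts$control_miss, counts$control_obs
)

# Bonferroni-corrected threshold: 0.05 divided by the number of SNPs tested.
threshold <- 0.05 / nrow(counts)

# SNPs whose missingness differs significantly between cases and controls.
flagged <- counts$snp[p_values < threshold]
flagged
```

Only snp2, where 30% of case calls but 1% of control calls are missing, falls below the corrected threshold.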

References

Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT (2010). “Data quality control in genetic case-control association studies.” Nature Protocols, 5(9), 1564–1573. doi:10.1038/nprot.2010.116, http://www.ncbi.nlm.nih.gov/pubmed/21085122.

Author

Banabithi Bose

Examples

DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
foutput <- "Test_output"
geno <- NULL
maf <- 0.05
casecontrol <- FALSE
hweCase <- NULL
hweControl <- NULL
monomorphicSNPs <- FALSE
caldiffmiss <- FALSE
ld_prunning <- FALSE
x <- QCsnp(
    DataDir = DataDir, ResultDir = ResultDir, finput = finput, foutput = foutput,
    geno = geno, maf = maf, hweCase = hweCase, hweControl = hweControl,
    ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs,
    caldiffmiss = caldiffmiss
)
#>  4214 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#>  Thresholds for maf, geno and hwe worked.
#>  5467 variants removed due to minor allele threshold(s)
#>  No filter based on differential missingness will be applied.
#>  Output PLINK files prefixed as ,Test_output, with passed SNPs are saved in ResultDir.
#>  Input file has 26527 SNPs.
#>  Output file has 16846 SNPs after filtering.
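The counts reported in the log above can be reconciled with a little arithmetic, shown here as a small R check. The variable names are illustrative; the numbers are copied from the messages.

```r
# Counts taken from the log messages of the example run.
input_snps        <- 26527  # "Input file has 26527 SNPs."
removed_ambiguous <- 4214   # ambiguous SNPs, indels etc.
removed_maf       <- 5467   # removed by the minor allele threshold

# SNPs expected to remain after both removals.
remaining <- input_snps - removed_ambiguous - removed_maf
remaining
#> [1] 16846
```

The result matches the "Output file has 16846 SNPs after filtering." message, so in this run no other filter removed SNPs.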