This function performs quality control (QC) of genotype data from PLINK binary files. It can filter SNPs based on minor allele frequency, Hardy-Weinberg equilibrium, call rate, and differential missingness between cases and controls. It can also perform linkage disequilibrium-based filtering.
Usage
QCsnp(
  DataDir,
  ResultDir = tempdir(),
  finput,
  foutput = "FALSE",
  casecontrol = TRUE,
  hweCase = NULL,
  hweControl = NULL,
  hwe = NULL,
  maf = 0.05,
  geno = 0.1,
  monomorphicSNPs = FALSE,
  caldiffmiss = FALSE,
  diffmissFilter = FALSE,
  dmissX = FALSE,
  dmissAutoY = FALSE,
  highLD_regions = NULL,
  ld_prunning = FALSE,
  window_size = 50,
  step_size = 5,
  r2_threshold = 0.02
)
Arguments
- DataDir
A character string for the file path of the input PLINK binary files.
- ResultDir
A character string for the file path where all output files will be stored. The default is tempdir().
- finput
Character string, specifying the prefix of the input PLINK binary files with both male and female samples. This file needs to be in DataDir.
- foutput
Character string, specifying the prefix of the output PLINK binary files if the filtering option for the SNPs is chosen. The default is "FALSE".
- casecontrol
Boolean value, TRUE or FALSE, indicating whether the input PLINK files have case-control status. The default is TRUE.
- hweCase
Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in cases. The default is NULL.
- hweControl
Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in controls. The default is NULL.
- hwe
Numeric value between 0 and 1, or NULL, for removing SNPs that fail Hardy-Weinberg equilibrium in the entire dataset. The default is NULL.
- maf
Numeric value between 0 and 1 for removing SNPs with a minor allele frequency below the specified threshold. The default is 0.05.
- geno
Numeric value between 0 and 1 for removing SNPs that have less than the specified call rate. The default is 0.1. Users can set this to NULL to skip this filter.
- monomorphicSNPs
Boolean value, TRUE or FALSE, for filtering out monomorphic SNPs. The default is FALSE.
- caldiffmiss
Boolean value, TRUE or FALSE, specifying whether to compute differential missingness between cases and controls for each SNP (the significance threshold is \(0.05 / N\), where \(N\) is the number of unique SNPs in the test). The default is FALSE.
- diffmissFilter
Boolean value, TRUE or FALSE, specifying whether to filter out SNPs with differential missingness in cases vs controls or only flag them. The default is FALSE.
- dmissX
Boolean value, TRUE or FALSE, for computing differential missingness between cases and controls for X chromosome SNPs only. The default is FALSE. The diffmissFilter will work for all these SNPs.
- dmissAutoY
Boolean value, TRUE or FALSE, for computing differential missingness between cases and controls for SNPs on autosomes and the Y chromosome only. The default is FALSE. If dmissX and dmissAutoY are both FALSE, then this will be computed genome-wide. The diffmissFilter will work for all these SNPs.
- highLD_regions
A dataframe with known high LD regions (Anderson et al. 2010) is provided with the package.
- ld_prunning
Boolean value, TRUE or FALSE, for applying linkage disequilibrium (LD)-based filtering. The default is FALSE.
- window_size
Integer value, specifying the window size, in variant counts, for LD-based filtering. The default is 50.
- step_size
Integer value, specifying a variant count to shift the window at the end of each step for LD filtering. The default is 5.
- r2_threshold
Numeric value between 0 and 1 giving the pairwise \(r^2\) threshold for LD-based filtering. The default is 0.02.
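To make the maf, geno and monomorphicSNPs arguments concrete, the following is a minimal base-R sketch on a made-up genotype matrix. It is not the package's implementation, which operates on PLINK binary files. The matrix, the variable names, and the reading of geno as a maximum per-SNP missing rate (as PLINK's --geno option treats it) are assumptions made for illustration only.

```r
# Hypothetical genotypes: 20 samples (rows) by 4 SNPs (columns).
# Each entry counts copies of one allele (0, 1 or 2); NA is a missing call.
geno_mat <- cbind(
  snp1 = rep(c(0, 1, 2, 1), 5),                       # common, fully called
  snp2 = c(1, rep(0, 19)),                            # rare allele
  snp3 = rep(0, 20),                                  # monomorphic
  snp4 = c(NA, NA, NA, rep(c(0, 1), length.out = 17)) # 3 of 20 calls missing
)

# Per-SNP missing rate and minor allele frequency
missing_rate <- colMeans(is.na(geno_mat))
allele_freq <- colMeans(geno_mat, na.rm = TRUE) / 2
maf <- pmin(allele_freq, 1 - allele_freq)

# SNPs with no variation at all
monomorphic <- maf == 0

# Thresholds mirroring the documented defaults (maf = 0.05, geno = 0.1).
# Assumption: geno is applied as a maximum allowed missing rate.
maf_threshold <- 0.05
geno_threshold <- 0.1
keep <- maf >= maf_threshold & missing_rate <= geno_threshold

names(which(keep))        # "snp1"
names(which(monomorphic)) # "snp3"
```

In this toy data snp2 fails the MAF threshold (0.025), snp3 is monomorphic, and snp4 has a missing rate of 0.15, so only snp1 passes.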
Value
A list of two objects, namely MonomorSNPs and DiffMissSNPs, containing monomorphic SNPs and SNPs with differential missingness in cases vs controls, respectively. Output PLINK binary files are saved in ResultDir.
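As an illustration of how SNPs could end up in DiffMissSNPs, here is a small base-R sketch of a per-SNP differential missingness test with the \(0.05 / N\) threshold described under caldiffmiss. The counts are invented, and the use of Fisher's exact test is an assumption; this page does not state which test the function applies.

```r
# Hypothetical counts of missing and called genotypes per SNP, by phenotype
counts <- data.frame(
  snp = c("snpA", "snpB", "snpC"),
  miss_case = c(2, 30, 5),
  called_case = c(98, 70, 95),
  miss_ctrl = c(3, 2, 4),
  called_ctrl = c(97, 98, 96)
)

# Fisher's exact test on the 2 x 2 table (missing/called by case/control)
p_values <- vapply(seq_len(nrow(counts)), function(i) {
  tab <- matrix(
    c(counts$miss_case[i], counts$called_case[i],
      counts$miss_ctrl[i], counts$called_ctrl[i]),
    nrow = 2
  )
  fisher.test(tab)$p.value
}, numeric(1))
names(p_values) <- counts$snp

# Bonferroni-style threshold: 0.05 divided by the number of SNPs tested
threshold <- 0.05 / nrow(counts)
flagged <- names(p_values)[p_values < threshold]

flagged # "snpB"
```

Only snpB, with 30% missing calls in cases against 2% in controls, falls below the threshold. Whether such a SNP is removed or only reported would correspond to the diffmissFilter setting.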
References
Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT (2010). “Data quality control in genetic case-control association studies.” Nature Protocols, 5(9), 1564–1573. doi:10.1038/nprot.2010.116 , http://www.ncbi.nlm.nih.gov/pubmed/21085122.
Examples
# Directory holding the package's example PLINK files
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
foutput <- "Test_output"

# Filter settings: apply only the MAF filter; call-rate and HWE filters are off
geno <- NULL
maf <- 0.05
casecontrol <- FALSE
hweCase <- NULL
hweControl <- NULL
hwe <- NULL
monomorphicSNPs <- FALSE
caldiffmiss <- FALSE
ld_prunning <- FALSE

x <- QCsnp(
  DataDir = DataDir, ResultDir = ResultDir, finput = finput, foutput = foutput,
  geno = geno, maf = maf, hweCase = hweCase, hweControl = hweControl, hwe = hwe,
  ld_prunning = ld_prunning, casecontrol = casecontrol,
  monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss
)
#> ℹ 4214 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> ℹ 5467 variants removed due to minor allele threshold(s)
#> ℹ No filter based on differential missingness will be applied.
#> ✔ Output PLINK files prefixed as ,Test_output, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 26527 SNPs.
#> ℹ Output file has 16846 SNPs after filtering.
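The example above leaves ld_prunning off. To show what window_size, step_size and r2_threshold control, here is a simplified base-R sketch of window-based pruning on a made-up genotype matrix. It is an illustration only: the helper prune_ld is hypothetical, it always keeps the earlier SNP of a correlated pair, and it is not the algorithm used by the package or by PLINK.

```r
# Hypothetical genotypes: 8 samples by 4 SNPs; snp2 duplicates snp1 (r^2 = 1),
# while snp3 and snp4 are uncorrelated with every other SNP (r^2 = 0).
g <- cbind(
  snp1 = c(0, 0, 0, 0, 2, 2, 2, 2),
  snp2 = c(0, 0, 0, 0, 2, 2, 2, 2),
  snp3 = c(0, 0, 2, 2, 0, 0, 2, 2),
  snp4 = c(0, 2, 0, 2, 0, 2, 0, 2)
)

# Simplified greedy pruning: slide a window of `window_size` SNPs forward by
# `step_size` SNPs; within each window, drop the later SNP of any retained
# pair whose squared correlation exceeds `r2_threshold`.
prune_ld <- function(g, window_size, step_size, r2_threshold) {
  keep <- rep(TRUE, ncol(g))
  start <- 1
  while (start <= ncol(g)) {
    idx <- start:min(start + window_size - 1, ncol(g))
    idx <- idx[keep[idx]]
    if (length(idx) > 1) {
      r2 <- cor(g[, idx])^2
      for (a in seq_along(idx)) {
        for (b in seq_along(idx)) {
          if (a < b && keep[idx[a]] && keep[idx[b]] &&
              isTRUE(r2[a, b] > r2_threshold)) {
            keep[idx[b]] <- FALSE
          }
        }
      }
    }
    start <- start + step_size
  }
  colnames(g)[keep]
}

kept <- prune_ld(g, window_size = 3, step_size = 1, r2_threshold = 0.2)
kept # "snp1" "snp3" "snp4"
```

A larger window compares SNPs that are further apart, a smaller step re-examines more overlapping windows, and a lower r2_threshold prunes more aggressively.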