QCsample: Quality control for samples in the PLINK binary files.
Source: R/GXwasR_main_functions.R
QCsample.Rd
This function identifies outlier individuals for heterozygosity and/or missing genotype rates, which aids in the detection of samples with subpar DNA quality and/or concentration that should be removed from the study. Individuals missing more than 3-7% of their genotype calls are often excluded from the analysis.
Correct sex designation is important for obtaining accurate genotype rate estimates and for avoiding the incorrect removal of samples. Details are available in the paper.
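The flagging rule described above can be illustrated with a small self-contained sketch. It is not the package's implementation: the data frame `samples`, its values, and the column name `het_rate` are invented for illustration, and the two thresholds simply mirror the documented defaults of the `imiss` and `het` arguments.

```r
# Toy per-sample statistics; all values are invented for illustration.
samples <- data.frame(
  IID      = paste0("S", 1:20),
  F_MISS   = c(rep(0.01, 4), 0.08, rep(0.01, 15)),
  het_rate = c(rep(c(0.29, 0.30, 0.31), length.out = 19), 0.45)
)

imiss <- 0.03  # maximum tolerated missing call rate
het   <- 3     # allowed number of standard deviations from the mean

# Flag samples whose missing call rate exceeds the threshold.
failed_missingness <- samples$F_MISS > imiss

# Flag samples whose heterozygosity rate lies more than `het`
# standard deviations from the mean heterozygosity rate.
het_mean <- mean(samples$het_rate)
het_sd   <- sd(samples$het_rate)
failed_heterozygosity <- abs(samples$het_rate - het_mean) > het * het_sd

# Samples failing either criterion.
flagged <- samples$IID[failed_missingness | failed_heterozygosity]
print(flagged)
```

In this toy data, S5 fails on missingness and S20 fails on heterozygosity. Note that with very few samples a single outlier cannot reach three standard deviations from the mean, which is one reason a sketch like this needs a reasonably sized cohort.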
Usage
QCsample(
  DataDir,
  ResultDir,
  finput,
  foutput = NULL,
  imiss,
  het,
  small_sample_mod = FALSE,
  IBD,
  IBDmatrix = FALSE,
  ambi_out = TRUE,
  legend_text_size = 8,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  filterSample = TRUE
)
Arguments
- DataDir
Character string, specifying the file path of the input PLINK binary files. The default is NULL.
- ResultDir
A character string for the file path where all output files will be stored. The default is tempdir().
- finput
Character string, specifying the prefix of the input PLINK binary files with both male and female samples. This file needs to be in DataDir.
- foutput
Character string, specifying the prefix of the output PLINK binary files if the option to filter samples is chosen.
- imiss
Numeric value between 0 and 1; samples with missingness above this value are removed. The default is 0.03.
- het
Positive numeric value, specifying the number of standard deviations (sd) from the mean heterozygosity rate. Samples whose rates are more than the specified number of sd from the mean heterozygosity rate are removed. The default is 3, which removes individuals whose heterozygosity rate is more than three sd from the mean rate (1).
- small_sample_mod
Boolean value indicating whether to apply modifications for small sample sizes. Default is FALSE.
- IBD
Numeric value for setting the threshold for Identity by Descent (IBD) analysis. Default is NULL.
- IBDmatrix
Boolean value indicating whether to generate an entire IBD matrix. Default is FALSE. In this case, the filtered IBD matrix will be stored.
- ambi_out
Boolean value indicating whether to process ambiguous samples.
- legend_text_size
Integer, specifying the size for legend text in the plot.
- legend_title_size
Integer, specifying the size for the legend title in the plot.
- axis_text_size
Integer, specifying the size for axis text in the plot.
- axis_title_size
Integer, specifying the size for the axis title in the plot.
- title_size
Integer, specifying the size of the title of the plot of heterozygosity estimate vs. missingness across samples.
- filterSample
Boolean value, TRUE or FALSE, indicating whether samples are filtered out (TRUE) or only flagged (FALSE). The default is TRUE.
Value
A plot of heterozygosity estimate vs. missingness across samples, and a list containing five R data frame objects: HM (samples with outlying heterozygosity and/or missing genotype rates), Failed_Missingness (samples failing the missing genotype rate threshold), Failed_heterozygosity (samples with outlying heterozygosity), Missingness_results (missingness results), and Heterozygosity_results (heterozygosity results). Output PLINK files are written to ResultDir if the option to filter out samples is chosen.
Missingness_results contains missingness results for each individual in six columns: FID (family ID), IID (within-family ID), MISS_PHENO (phenotype missing? Y/N), N_MISS (number of missing genotype calls, not including obligatory missings or heterozygous haploids), N_GENO (number of potentially valid calls), and F_MISS (missing call rate).
Heterozygosity_results contains heterozygosity results for each individual in six columns: FID (family ID), IID (within-family ID), O(HOM) (observed number of homozygotes), E(HOM) (expected number of homozygotes), N(NM) (number of non-missing, non-monomorphic autosomal genotype observations), and F (method-of-moments F coefficient estimate).
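The relationships among these columns can be sketched as follows. This is an illustration under stated assumptions, not output from QCsample: the rows are invented, the column names `O_HOM`, `E_HOM`, and `N_NM` stand in for `O(HOM)`, `E(HOM)`, and `N(NM)`, and the derived `het_rate` column is a commonly used heterozygosity rate that may differ from what the package computes internally. The formulas follow the usual PLINK definitions of the missing call rate and the method-of-moments F coefficient.

```r
# Hypothetical rows mimicking the documented columns; numbers are invented.
miss_res <- data.frame(
  IID    = c("S1", "S2"),
  N_MISS = c(50, 400),
  N_GENO = c(10000, 10000)
)
het_res <- data.frame(
  IID   = c("S1", "S2"),
  O_HOM = c(6500, 7200),  # stands in for O(HOM)
  E_HOM = c(6400, 6400),  # stands in for E(HOM)
  N_NM  = c(9900, 9600)   # stands in for N(NM)
)

# Missing call rate: missing calls over potentially valid calls.
miss_res$F_MISS <- miss_res$N_MISS / miss_res$N_GENO

# Observed heterozygosity rate: non-homozygous calls over non-missing calls.
het_res$het_rate <- (het_res$N_NM - het_res$O_HOM) / het_res$N_NM

# Method-of-moments F coefficient: excess homozygosity relative to expectation.
het_res$F <- (het_res$O_HOM - het_res$E_HOM) / (het_res$N_NM - het_res$E_HOM)

print(miss_res)
print(het_res)
```

A strongly positive F indicates more homozygous calls than expected, and a strongly negative F indicates fewer; both ends of the distribution are candidates for the outlier check controlled by `het`.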
Examples
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
foutput <- "Test_output"
imiss <- 0.01
het <- 2
small_sample_mod <- FALSE
IBD <- 0.2
IBDmatrix <- FALSE
ambi_out <- TRUE
x <- QCsample(
  DataDir = DataDir, ResultDir = ResultDir, finput = finput,
  foutput = foutput, imiss = imiss, het = het, IBD = IBD,
  ambi_out = ambi_out
)
#> • Plots are initiated.
#> ℹ No. of samples filtered/flagged for missingness: 0
#> ℹ No. of samples filtered/flagged for heterozygosity threshold: 3
#> ℹ No. of samples filtered/flagged for missingness and heterozygosity: 3
#> ℹ No. of samples marked to be filtered out for IDB after missingness and heterozygosity filter: 2
#> ℹ No. of samples in input PLINK files: 276
#> ℹ No. of samples in output PLINK files: 271
#> ✔ Output PLINK files, Test_output with final samples are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8.