This function identifies outlier individuals for heterozygosity and/or missing genotype rates, which aids in the detection of samples with subpar DNA quality and/or concentration that should be removed from the study. Individuals missing more than 3-7% of their genotype calls are often excluded from the analysis.

Correct sex designation is important for obtaining accurate genotype rate estimates and for avoiding the incorrect removal of samples. Details can be found in the paper.

Usage

QCsample(
  DataDir,
  ResultDir,
  finput,
  foutput = NULL,
  imiss,
  het,
  small_sample_mod = FALSE,
  IBD,
  IBDmatrix = FALSE,
  ambi_out = TRUE,
  legend_text_size = 8,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  filterSample = TRUE
)

Arguments

DataDir

Character string, specifying the file path of the input PLINK binary files. The default is NULL.

ResultDir

A character string for the file path where all output files will be stored. The default is tempdir().

finput

Character string, specifying the prefix of the input PLINK binary files with both male and female samples. This file needs to be in DataDir.

foutput

Character string, specifying the prefix of the output PLINK binary files written when the sample filtering option is chosen (filterSample = TRUE). The default is NULL.

imiss

Numeric value between 0 and 1. Samples with a missing genotype rate greater than this value are removed. The default is 0.03.

het

Positive numeric value, specifying the number of standard deviations (sd) from the mean heterozygosity rate used as the cutoff. Samples whose rates are more than the specified number of sd from the mean heterozygosity rate are removed. The default is 3, which removes individuals whose heterozygosity rate is more than three sd away from the mean rate (1).

small_sample_mod

Boolean value indicating whether to apply modifications for small sample sizes. Default is FALSE.

IBD

Numeric value for setting the threshold for Identity by Descent (IBD) analysis. Default is NULL.

IBDmatrix

Boolean value indicating whether to generate the entire IBD matrix. The default is FALSE, in which case only the filtered IBD matrix is stored.

ambi_out

Boolean value indicating whether to process ambiguous samples. The default is TRUE.

legend_text_size

Integer, specifying the size for legend text in the plot.

legend_title_size

Integer, specifying the size for the legend title in the plot.

axis_text_size

Integer, specifying the size for axis text in the plot.

axis_title_size

Integer, specifying the size for the axis title in the plot.

title_size

Integer, specifying the size of the title of the plot of heterozygosity estimate vs. missingness across samples.

filterSample

Boolean value: TRUE to filter out the failing samples, FALSE to only flag them. The default is TRUE.
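The het cutoff described above can be illustrated with a minimal sketch in base R. It assumes a symmetric rule (a sample is flagged when its rate lies more than het sd from the mean, in either direction) and uses invented rates for eight hypothetical samples; it is not taken from the QCsample source, and the names het_rate, lower, upper and flagged are illustrative only.

```r
# Toy heterozygosity rates for eight hypothetical samples (invented values).
het_rate <- c(0.30, 0.31, 0.29, 0.30, 0.32, 0.31, 0.30, 0.45)

# Number of sd from the mean used as the cutoff (same role as the het argument).
het <- 2

center <- mean(het_rate)
spread <- sd(het_rate)

# Assumed rule: flag a sample when its rate is more than `het` sd away
# from the mean, on either side.
lower <- center - het * spread
upper <- center + het * spread
flagged <- het_rate < lower | het_rate > upper

# Indices of the flagged samples.
which(flagged)
```

With very few samples a single extreme value inflates the sd, so a cutoff of 3 may flag nothing at all; in this toy vector only a cutoff of 2 catches the last sample.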

Value

A plot of heterozygosity estimate vs. missingness across samples, and a list containing five R dataframe objects: HM (samples with outlying heterozygosity and/or missing genotype rates), Failed_Missingness (samples that failed the missing genotype rate threshold), Failed_heterozygosity (samples with outlying heterozygosity), Missingness_results (missingness results) and Heterozygosity_results (heterozygosity results). If the option to filter out samples is chosen, the output PLINK files are also written to ResultDir.

Missingness_results contains missingness results for each individual, with six columns: FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (number of missing genotype calls, not including obligatory missings or heterozygous haploids), N_GENO (number of potentially valid calls), and F_MISS (missing call rate).

Heterozygosity_results contains heterozygosity results for each individual, with six columns: FID (Family ID), IID (Within-family ID), O(HOM) (observed number of homozygotes), E(HOM) (expected number of homozygotes), N(NM) (number of non-missing, non-monomorphic autosomal genotype observations), and F (method-of-moments F coefficient estimate).
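As a rough illustration of how the two result tables relate to the plot axes, the sketch below builds toy data frames with the column names listed above and derives a per-sample heterozygosity rate. The formula (N(NM) - O(HOM)) / N(NM) is a commonly used definition and is an assumption here, not confirmed from the QCsample source; all values and the helper names miss, hetero and qc are invented for illustration.

```r
# Toy tables using the documented column names (all values are invented).
miss <- data.frame(
  FID = c("F1", "F2", "F3"), IID = c("I1", "I2", "I3"),
  MISS_PHENO = c("N", "N", "N"),
  N_MISS = c(10, 50, 5), N_GENO = c(1000, 1000, 1000),
  F_MISS = c(0.010, 0.050, 0.005)
)
# check.names = FALSE keeps names such as "O(HOM)" intact.
hetero <- data.frame(
  FID = c("F1", "F2", "F3"), IID = c("I1", "I2", "I3"),
  `O(HOM)` = c(700, 650, 720), `E(HOM)` = c(690, 690, 690),
  `N(NM)` = c(990, 950, 995), F = c(0.033, -0.154, 0.098),
  check.names = FALSE
)

# Assumed definition: share of non-missing autosomal genotypes
# that are heterozygous.
hetero$het_rate <- (hetero[["N(NM)"]] - hetero[["O(HOM)"]]) / hetero[["N(NM)"]]

# One row per individual, holding both quantities shown in the plot.
qc <- merge(
  miss[, c("FID", "IID", "F_MISS")],
  hetero[, c("FID", "IID", "het_rate")],
  by = c("FID", "IID")
)

# Flag samples whose missing call rate exceeds the imiss threshold.
imiss <- 0.03
qc$fail_missing <- qc$F_MISS > imiss
qc
```

On a real run, the same steps would presumably start from x$Missingness_results and x$Heterozygosity_results, where x is the list returned by QCsample.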

Author

Banabithi Bose

Examples

DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
foutput <- "Test_output"
imiss <- 0.01
het <- 2
small_sample_mod <- FALSE
IBD <- 0.2
IBDmatrix <- FALSE
ambi_out <- TRUE

x <- QCsample(
    DataDir = DataDir, ResultDir = ResultDir, finput = finput,
    foutput = foutput, imiss = imiss, het = het, IBD = IBD,
    ambi_out = ambi_out
)
#>  Plots are initiated.
#>  No. of samples filtered/flagged for missingness: 0
#>  No. of samples filtered/flagged for heterozygosity threshold: 3
#>  No. of samples filtered/flagged for missingness and heterozygosity: 3
#>  No. of samples marked to be filtered out for IDB after missingness and heterozygosity filter: 2
#>  No. of samples in input PLINK files: 276
#>  No. of samples in output PLINK files: 271
#>  Output PLINK files, Test_output with final samples are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpO7c0S8.