Pre-impute Quality Control Pipeline Using GXwasR

This document outlines the pre-imputation quality control (QC) steps, which represent our recommended best practices. Users have the flexibility to create their own pipelines using the GXwasR functions based on their specific needs.

Pre-imputation QC pipeline:

We will run the pre-imputation QC pipeline in this vignette utilizing the functions from GXwasR package.

Example Datasets

The PLINK bed, bim, and fam files are the three mandatory files representing a genotype dataset to run this pipeline. To know about these file extensions, please check https://www.cog-genomics.org/plink/1.9/formats.

GXwasR_example:

GXwasR_example.bed
GXwasR_example.bim
GXwasR_example.fam

These plink files contain genotypes for 276 individuals (males and females) simulated from 1000Genome from European decent with 26515 variants across twelve chromosomes (1-10,23,24). This dataset contains 125 males, 151 females, 108 cases and 168 controls. We will utilize this set of plink files as proxy of pre-imputated genotype data.

Pre-imputation QC steps

Filtering Multi-Allelic Variants: This step is essential for simplifying the genetic data analysis by removing variants that have more than two alleles. Multi-allelic variants can complicate analyses and interpretations, making this filtering crucial for a streamlined dataset.
Removal of Ambiguous SNPs and Indels: We specifically target SNPs with ambiguous mappings (like AT and GC pairings) and indels (insertions and deletions), which can lead to inaccuracies in genotype calling. Removing these elements helps in maintaining the integrity of our genetic data.
QC of SNPs with Relaxed Genotype Call Rate: In this phase, we filter SNPs based on a less stringent genotype call rate. This step aims to balance between retaining enough informative SNPs and excluding those with too much missing data.
Filtering Samples with Relaxed Missing Rate Threshold: Similar to SNPs, we apply a relaxed threshold for sample missingness. This approach ensures that we don’t prematurely exclude samples that might be informative despite having a slightly higher rate of missing data.
Application of Sex Check: We verify the consistency between reported and genetic sex. This check helps identify potential sample mix-ups or issues with data integrity.
Filter for Needed Chromosomes: Focus is narrowed to autosomes and sex chromosomes, as these are typically the most relevant for genetic studies. This step streamlines the dataset to the chromosomes of interest.
Second Round of SNP Filtering: This more stringent phase includes filters for minor allele frequency (MAF) < 0.01, call rate < 0.02, case-control differential missingness, Hardy-Weinberg equilibrium (cases < e-10; controls < e-06), and the removal of monomorphic SNPs. This rigorous filtering ensures that only the most reliable SNPs are retained for analysis.
Second Round of Sample Filtering: Here, we apply stringent thresholds for call rate (< 0.02), heterozygosity, and relatedness (IBD F-stat < 0.2). This step further refines the sample pool, ensuring that only high-quality samples are included.
Applying Ancestry Check: We assess the ancestry of the samples to identify and exclude outliers from different ancestries, which is vital for controlling population stratification in genetic association studies.
Third Round of SNP Filtering: This final filtering step ensures that all previously applied SNP thresholds are maintained even after the rigorous sample filtering. It’s a crucial step to guarantee the overall quality and consistency of the SNP data set.

Loading the GXwasR library

## Call some libraries
library(GXwasR)
#> 
#> GXwasR: Genome-wide and x-chromosome wide association analyses applying
#> best practices of quality control over genetic data
#> Version 0.99.0 () installed
#> Author: c(
#>     person(given = "Banabithi",
#>            family = "Bose",
#>            role = c("cre", "aut"),
#>            email = "banabithi.bose@gmail.com",
#>            comment = c(ORCID = "0000-0003-0842-8768"))
#>            )
#> Maintainer: Banabithi Bose <banabithi.bose@gmail.com>
#> Tutorial: https://github.com
#> Use citation("GXwasR") to know how to cite this work.
library(printr)
#> Registered S3 method overwritten by 'printr':
#>   method                from     
#>   knit_print.data.frame rmarkdown
library(rmarkdown)
#> 
#> Attaching package: 'rmarkdown'
#> The following objects are masked from 'package:BiocStyle':
#> 
#>     html_document, md_document, pdf_document

Learn about all the GXwasR functions

Before beginning the pre-imputation quality control (QC), it is recommended that users familiarize themselves with all the functions in the GXwasR package. This knowledge will help them understand the required input arguments, how to execute the functions, and what outputs to expect.

To explore all available functions, users should first load the GXwasR library and then use the browseVignettes() function:

browseVignettes("GXwasR")

Example Dataset Summary

Dataset: GXwasR_example

DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
x <- PlinkSummary(DataDir, ResultDir, finput)
#> ℹ Dataset: GXwasR_example
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 12
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#>   - Chr: 24
#> ℹ Total number of SNPs: 26527
#> ℹ Total number of samples: 276

Step 1: Filtering multi-allelic variants

Firstly, users need to ensure that their input plink files are bi-allelic. To ensure that, it is advised to run the FilterAllele() function with the input dataset.

foutput <- "PreimputeEX_QC1"
x <- FilterAllele(DataDir, ResultDir, finput, foutput)
#> ℹ There is no multi-allelic SNP present in the input dataset.

Since there was no multi-allelic SNPs, we will proceed with the same input data.

Step 2: QC of SNPs for genotype call rate

It removes variants that are missing genotype data for more than 20% of the individuals in your dataset. This function implicitly remove ambiguous SNPs (i.e., AT<>GC), indels etc. Users can set their own threshold.

# Running
foutput <- "PreimputeEX_QC1"
geno <- 0.2
maf <- NULL
casecontrol <- FALSE
caldiffmiss <- FALSE
diffmissFilter <- FALSE
dmissX <- FALSE
dmissAutoY <- FALSE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
casecontrol <- FALSE
hweCase <- NULL
hweControl <- NULL
monomorphicSNPs <- FALSE
ld_prunning <- FALSE
x <- QCsnp(DataDir = DataDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> Using PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
#> ℹ 4214 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> ℹ 11 variants removed due to missing genotype data (--geno).
#> ℹ No filter based on differential missingness will be applied.
#> ✔ Output PLINK files prefixed as ,PreimputeEX_QC1, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 26527 SNPs.
#> ℹ Output file has 22302 SNPs after filtering.
#> This message is displayed once every 8 hours.

Copying the plink files from ResultDir to DataDir.

ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC1")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC1")
#> ℹ Dataset: PreimputeEX_QC1
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#> ℹ Total number of SNPs: 22302
#> ℹ Total number of samples: 276

Step 3: Filtering samples with high missing rate threshold.

It excludes individuals who have missing genotype data for more than 20% of the genetic variants in the dataset.

# Running
finput <- "PreimputeEX_QC1"
foutput <- "PreimputeEX_QC2"
imiss = 0.2
het = NULL
IBD = NULL
filterSample <- TRUE
ambi_out <- TRUE
x = QCsample(DataDir = DataDir,ResultDir = ResultDir, finput = finput,foutput = foutput, imiss = imiss,het = het, IBD = NULL, filterSample = filterSample, ambi_out = ambi_out)
#> ℹ Missingness and heterogygosity thresholds are NULL.
#> ℹ There will be no sample to be filtered for IBD with 'IBD' threshold.
#> Warning: cannot open file '/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpAgOfTC/PreimputeEX_QC2.fam': No such file or directory

Step 4: Apply sex check

finput <- "PreimputeEX_QC1" # Using the same input file
LD = TRUE
LD_window_size <- 50
LD_step_size <- 5
LD_r2_threshold <- 0.02
fmax_F <- 0.3
mmin_F <- 0.7
# We will not infer sex
infer_sex <- FALSE
compute_freq <- FALSE
SexCheckResult <-SexCheck(DataDir=DataDir,ResultDir = ResultDir, finput=finput, infer_sex = infer_sex,compute_freq =compute_freq,LD_window_size=LD_window_size,LD_step_size=LD_step_size,LD_r2_threshold=0.02,fmax_F = fmax_F,mmin_F=mmin_F)
#> ℹ There are no Y chromosomes in the input PLINK files. Estimates will
#> be based solely on the X chromosome.
problematic_sex <- SexCheckResult[SexCheckResult$STATUS != "OK",]
problematic_sex <- problematic_sex[,1:2]

Number of samples with problematic sex assignment: 4

write.table(problematic_sex,file = paste0(DataDir,"/problematic_sex_samples"),quote = F, row.names = F,col.names = F)
SexCheckResult[1:5,]

FID	IID	PEDSEX	SNPSEX	STATUS	F
EUR_FIN	HG00171	2	2	OK	0.081780
EUR_FIN	HG00173	2	2	OK	-0.031060
EUR_FIN	HG00174	2	2	OK	0.058430
EUR_FIN	HG00176	2	2	OK	-0.023280
EUR_FIN	HG00177	2	2	OK	-0.003822

Note, there is no filtering of the samples based on sex prediction.

This function uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data).

A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a female call is made if F is less than 0.2.

SNPSEX Sex as determined by X chromosome

F The actual X chromosome inbreeding (homozygosity) estimate

Users have the ability to impute the sex for their dataset.

Step 5: Filter for needed chromosomes.

We prefer to keep only chromosome 1 to 23.

test1 <- read.table(paste0(DataDir,"/",finput,".bim"))
unique(test1$V1)
#>  [1]  1  2  3  4  5  6  7  8  9 10 23

foutput <- "PreimputeEX_QC2"
x <- FilterRegion(DataDir = DataDir, ResultDir = ResultDir, finput = finput, foutput = foutput, CHRX = FALSE, CHRY = FALSE, filterPAR = FALSE, filterXTR = FALSE, filterAmpliconic = FALSE, regionfile = FALSE, filterCHR = c(24, 25, 26),Hg = "38", exclude = TRUE)

Copying the plink files from ResultDir to DataDir.

ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC2")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC2")
#> ℹ Dataset: PreimputeEX_QC2
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#> ℹ Total number of SNPs: 22302
#> ℹ Total number of samples: 276

Step 6: Second round of SNP filtering.

library(GXwasR)
finput <- "PreimputeEX_QC2"
foutput <- "PreimputeEX_QC3"
geno <- 0.02
maf <- 0.01
casecontrol <- TRUE
caldiffmiss <- TRUE
diffmissFilter <- TRUE
dmissX <- TRUE
dmissAutoY <- TRUE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
hwe = NULL
hweCase <- 1e-10
hweControl <- 1e-06
x <- QCsnp(DataDir = DataDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> ℹ 0 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> 0 variants removed due to missing genotype data (--geno).
#> 1930 variants removed due to minor allele threshold(s)
#> • In cases,  1 variant removed due to Hardy-Weinberg exact test.
#> • In controls,  19 variants removed due to Hardy-Weinberg exact test.
#> ✔ Merging is done using the common SNPs between the input genotype files.
#> ✔ Plink files with merged regions are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpAgOfTC prefixed as filtered_temp2
#> ℹ There are no monomorphic SNPs.
#> ℹ No SNP with differential missingness between cases and controls.
#> ✔ Output PLINK files prefixed as ,PreimputeEX_QC3, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 22302 SNPs.
#> ℹ Output file has 20353 SNPs after filtering.

Copying the plink files from ResultDir to DataDir.

ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC3")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC3")
#> ℹ Dataset: PreimputeEX_QC3
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#> ℹ Total number of SNPs: 20353
#> ℹ Total number of samples: 276

Step 7: Second round of sample filtering

Here we are using stringent thresholds for sample filter.

finput <- "PreimputeEX_QC3"
foutput <- "PreimputeEX_QC4"
imiss = 0.02
het = 3
IBD = 0.2
IBDmatrix = FALSE
small_sample_mod = TRUE
filterSample = TRUE
ambi_out = TRUE
x = QCsample(DataDir = DataDir,ResultDir = ResultDir, finput = finput,foutput = foutput, imiss = imiss,het = het, IBD = IBD, IBDmatrix = IBDmatrix, filterSample = filterSample, ambi_out = ambi_out)
#> • Plots are initiated.
#> ℹ No. of samples filtered/flagged for missingness: 0
#> ℹ No. of samples filtered/flagged for heterozygosity: 0
#> ℹ No. of samples filtered for missingness and heterozygosity: 0
#> ℹ No. of samples marked to be filtered out for IDB after missingness and heterozygosity filter: 2
#> ℹ No. of samples in input PLINK files: 276
#> ℹ No. of samples in output PLINK files: 274
#> ✔ Output PLINK files, PreimputeEX_QC4 with final samples are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpAgOfTC.

Copying the plink files from ResultDir to DataDir.

ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC4")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC4")
#> ℹ Dataset: PreimputeEX_QC4
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 124
#> ℹ Number of females: 150
#> ℹ This is case-control data
#> ℹ Number of cases: 106
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 52
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 54
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#> ℹ Total number of SNPs: 20353
#> ℹ Total number of samples: 274

Step 8: Ancestry check

Prior to use this function, users are highly encouraged to read our tutorial ““Decoding Ancestry: A Guide to Using GXwasR for Genetic Ancestry Estimation”.

data("highLD_hg19", package = "GXwasR")
data("example_data_study_sample_ancestry", package = "GXwasR")
finput <- "PreimputeEX_QC4"
reference <- "HapMapIII_NCBI36"
highLD_regions <- highLD_hg19
study_pop <- example_data_study_sample_ancestry
studyLD_window_size = 50
studyLD_step_size = 5
studyLD_r2_threshold = 0.02
filterSNP = TRUE
studyLD = TRUE
referLD = TRUE
referLD_window_size = 50
referLD_step_size = 5
referLD_r2_threshold = 0.02
outlier = TRUE
outlier_threshold = 3
outlierOf = "EUR"
x <- AncestryCheck(DataDir = DataDir,
    ResultDir = ResultDir,
    finput = finput,
    reference = reference,
    filterSNP = TRUE,
    studyLD = TRUE,
    studyLD_window_size = 50,
    studyLD_step_size = 5,
    studyLD_r2_threshold = 0.02,
    referLD = FALSE,
    referLD_window_size = 50,
    referLD_step_size = 5,
    referLD_r2_threshold = 0.02,
    highLD_regions = highLD_regions,
    study_pop = study_pop,
    outlier = TRUE,
    outlierOf = "EUR",
    outlier_threshold = 3)
#> ℹ 'HapMapIII_NCBI36' reference data found at /Users/mayerdav/Downloads/HapMapIII_NCBI36.
#> No A-T or G-C SNPs found in study data.
#> 111854 ambiguous SNPs removed from reference data.
#> ! LD pruning is recommended for reference dataset. Set referLD = TRUE.
#> ✔ LD pruning was performed for study dataset.
#> 
#> ℹ Number of overlapping SNPs between study and reference data using rsID: 1032
#> ✔ PCA done.
#> ℹ No allele flips between study and reference data.
#> ℹ 7 samples are outliers of selected reference population.
#> ℹ 179 samples are NOT outliers of selected reference population.
#> This message is displayed once every 8 hours.

Samples_with_predicted_ancestry <- x[["Samples_with_predicted_ancestry"]]

outlier <- x[["Outlier_samples"]]

outlier

FID	IID
EUR_FIN	HG00280
EUR_FIN	HG00315
EUR_FIN	HG00321
EUR_FIN	HG00350
EUR_FIN	HG00369
EUR_FIN	HG00377
EUR_FIN	HG00383


Samples_with_predicted_ancestry[1:5,]

FID	IID	Predicted Ancestry
EUR_GBR	HG00096	EUR
EUR_GBR	HG00097	EUR
EUR_GBR	HG00099	EUR
EUR_GBR	HG00100	EUR
EUR_GBR	HG00101	EUR

There was no outlier samples.

Step 9: Third round of SNP filtering

To make sure all the filtering thresholds for SNPs are maintained after sample filtering.

library(GXwasR)
finput <- "PreimputeEX_QC4"
foutput <- "Preimpute_Final"
geno <- 0.02
maf <- 0.01
casecontrol <- TRUE
caldiffmiss <- TRUE
diffmissFilter <- TRUE
dmissX <- TRUE
dmissAutoY <- TRUE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
 hwe = NULL
hweCase <- 1e-10
hweControl <- 1e-06
x <- QCsnp(DataDir = ResultDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> ℹ 0 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> 0 variants removed due to missing genotype data (--geno).
#> 3 variants removed due to minor allele threshold(s)
#> • In cases,  0 variants removed due to Hardy-Weinberg exact test.
#> • In controls,  0 variants removed due to Hardy-Weinberg exact test.
#> ✔ Merging is done using the common SNPs between the input genotype files.
#> ✔ Plink files with merged regions are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpAgOfTC prefixed as filtered_temp2
#> ℹ There are no monomorphic SNPs.
#> ℹ No SNP with differential missingness between cases and controls.
#> ✔ Output PLINK files prefixed as ,Preimpute_Final, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 20353 SNPs.
#> ℹ Output file has 20350 SNPs after filtering.

Final Summary of final Qc-ed dataset.

PlinkSummary(ResultDir,ResultDir,foutput)
#> ℹ Dataset: Preimpute_Final
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 124
#> ℹ Number of females: 150
#> ℹ This is case-control data
#> ℹ Number of cases: 106
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 52
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 54
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#>   - Chr: 1
#>   - Chr: 2
#>   - Chr: 3
#>   - Chr: 4
#>   - Chr: 5
#>   - Chr: 6
#>   - Chr: 7
#>   - Chr: 8
#>   - Chr: 9
#>   - Chr: 10
#>   - Chr: 23
#> ℹ Total number of SNPs: 20350
#> ℹ Total number of samples: 274

Citation

If you are using GXwasR package and/or following this pipeline in your study, please cite it as:

citation("GXwasR")
#> To cite package 'GXwasR' in publications use:
#> 
#>   Bose B, Blostein F, Kim J, Winters J, Actkins KV, Mayer D, Congivaram
#>   H, Niarchou M, Edwards DV, Davis LK, Stranger BE (2025). "GXwasR: A
#>   Toolkit for Investigating Sex-Differentiated Genetic Effects on
#>   Complex Traits." _medRxiv 2025.06.10.25329327_.
#>   doi:10.1101/2025.06.10.25329327
#>   <https://doi.org/10.1101/2025.06.10.25329327>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits},
#>     author = {Banabithi Bose and Freida Blostein and Jeewoo Kim and Jessica Winters and Ky’Era V. Actkins and David Mayer and Harrsha Congivaram and Maria Niarchou and Digna Velez Edwards and Lea K. Davis and Barbara E. Stranger},
#>     journal = {medRxiv 2025.06.10.25329327},
#>     year = {2025},
#>     doi = {10.1101/2025.06.10.25329327},
#>   }

Reproducibility

The GXwasR package (Bose, Blostein, Kim, Winters, Actkins, Mayer, Congivaram, Niarchou, Edwards, Davis, and Stranger, 2025) was made possible thanks to:

R (R Core Team, 2025)
BiocStyle (Oleś, 2025)
knitr (Xie, 2025)
RefManageR (McLean, 2017)
rmarkdown (Allaire, Xie, Dervieux, McPherson, Luraschi, Ushey, Atkins, Wickham, Cheng, Chang, and Iannone, 2025)
sessioninfo (Wickham, Chang, Flight, Müller, and Hester, 2025)
testthat (Wickham, 2011)

This package was developed using biocthis.

R session information.

#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.1 (2025-06-13)
#>  os       macOS Sequoia 15.7.1
#>  system   aarch64, darwin24.4.0
#>  ui       unknown
#>  language en-US
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/New_York
#>  date     2025-10-16
#>  pandoc   3.6.3 @ /Applications/Positron.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>  quarto   1.8.25 @ /usr/local/bin/quarto
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  package              * version   date (UTC) lib source
#>  abind                  1.4-8     2024-09-12 [1] CRAN (R 4.5.0)
#>  backports              1.5.0     2024-05-23 [1] CRAN (R 4.5.1)
#>  bibtex                 0.5.1     2023-01-26 [1] CRAN (R 4.5.0)
#>  bigassertr             0.1.7     2025-06-27 [1] CRAN (R 4.5.1)
#>  bigparallelr           0.3.2     2021-10-02 [1] CRAN (R 4.5.0)
#>  bigsnpr                1.12.21   2025-08-21 [1] CRAN (R 4.5.1)
#>  bigsparser             0.7.3     2024-09-06 [1] CRAN (R 4.5.1)
#>  bigstatsr              1.6.2     2025-07-29 [1] CRAN (R 4.5.1)
#>  Biobase                2.68.0    2025-04-15 [1] Bioconduc~
#>  BiocGenerics           0.54.0    2025-04-15 [1] Bioconduc~
#>  BiocIO                 1.18.0    2025-04-15 [1] Bioconduc~
#>  BiocManager            1.30.26   2025-06-05 [1] CRAN (R 4.5.0)
#>  BiocParallel           1.42.2    2025-09-14 [1] Bioconductor 3.21 (R 4.5.1)
#>  BiocStyle            * 2.36.0    2025-04-15 [1] Bioconduc~
#>  Biostrings             2.76.0    2025-04-15 [1] Bioconduc~
#>  bit                    4.6.0     2025-03-06 [1] CRAN (R 4.5.1)
#>  bit64                  4.6.0-1   2025-01-16 [1] CRAN (R 4.5.1)
#>  bitops                 1.0-9     2024-10-03 [1] CRAN (R 4.5.0)
#>  bookdown               0.45      2025-10-03 [1] CRAN (R 4.5.1)
#>  broom                  1.0.10    2025-09-13 [1] CRAN (R 4.5.1)
#>  BSgenome               1.76.0    2025-04-15 [1] Bioconduc~
#>  bslib                  0.9.0     2025-01-30 [1] CRAN (R 4.5.0)
#>  cachem                 1.1.0     2024-05-16 [1] CRAN (R 4.5.0)
#>  calibrate              1.7.7     2020-06-19 [1] CRAN (R 4.5.0)
#>  car                    3.1-3     2024-09-27 [1] CRAN (R 4.5.0)
#>  carData                3.0-5     2022-01-06 [1] CRAN (R 4.5.0)
#>  cli                    3.6.5     2025-04-23 [1] CRAN (R 4.5.0)
#>  codetools              0.2-20    2024-03-31 [3] CRAN (R 4.5.1)
#>  cowplot                1.2.0     2025-07-07 [1] CRAN (R 4.5.1)
#>  crayon                 1.5.3     2024-06-20 [1] CRAN (R 4.5.0)
#>  curl                   7.0.0     2025-08-19 [1] CRAN (R 4.5.1)
#>  data.table             1.17.8    2025-07-10 [1] CRAN (R 4.5.1)
#>  DelayedArray           0.34.1    2025-04-17 [1] Bioconduc~
#>  desc                   1.4.3     2023-12-10 [1] CRAN (R 4.5.0)
#>  digest                 0.6.37    2024-08-19 [1] CRAN (R 4.5.0)
#>  doParallel             1.0.17    2022-02-07 [1] CRAN (R 4.5.0)
#>  doRNG                  1.8.6.2   2025-04-02 [1] CRAN (R 4.5.0)
#>  dplyr                  1.1.4     2023-11-17 [1] CRAN (R 4.5.0)
#>  evaluate               1.0.5     2025-08-27 [1] CRAN (R 4.5.1)
#>  farver                 2.1.2     2024-05-13 [1] CRAN (R 4.5.0)
#>  fastmap                1.2.0     2024-05-15 [1] CRAN (R 4.5.0)
#>  flock                  0.7       2016-11-12 [1] CRAN (R 4.5.1)
#>  foreach                1.5.2     2022-02-02 [1] CRAN (R 4.5.0)
#>  Formula                1.2-5     2023-02-24 [1] CRAN (R 4.5.0)
#>  fs                     1.6.6     2025-04-12 [1] CRAN (R 4.5.0)
#>  gdsfmt                 1.44.1    2025-07-09 [1] Bioconduc~
#>  generics               0.1.4     2025-05-09 [1] CRAN (R 4.5.0)
#>  GenomeInfoDb           1.44.3    2025-09-21 [1] Bioconductor 3.21 (R 4.5.1)
#>  GenomeInfoDbData       1.2.14    2025-04-21 [1] Bioconductor
#>  GenomicAlignments      1.44.0    2025-04-15 [1] Bioconduc~
#>  GenomicRanges          1.60.0    2025-04-15 [1] Bioconduc~
#>  ggplot2                4.0.0     2025-09-11 [1] CRAN (R 4.5.1)
#>  ggpubr                 0.6.1     2025-06-27 [1] CRAN (R 4.5.1)
#>  ggrepel                0.9.6     2024-09-07 [1] CRAN (R 4.5.1)
#>  ggsignif               0.6.4     2022-10-13 [1] CRAN (R 4.5.0)
#>  glue                   1.8.0     2024-09-30 [1] CRAN (R 4.5.0)
#>  gridExtra              2.3       2017-09-09 [1] CRAN (R 4.5.0)
#>  gtable                 0.3.6     2024-10-25 [1] CRAN (R 4.5.0)
#>  GXwasR               * 0.99.0    2025-10-02 [1] Bioconductor
#>  hms                    1.1.3     2023-03-21 [1] CRAN (R 4.5.0)
#>  htmltools              0.5.8.1   2024-04-04 [1] CRAN (R 4.5.0)
#>  htmlwidgets            1.6.4     2023-12-06 [1] CRAN (R 4.5.0)
#>  httr                   1.4.7     2023-08-15 [1] CRAN (R 4.5.0)
#>  IRanges                2.42.0    2025-04-15 [1] Bioconduc~
#>  iterators              1.0.14    2022-02-05 [1] CRAN (R 4.5.0)
#>  jquerylib              0.1.4     2021-04-26 [1] CRAN (R 4.5.0)
#>  jsonlite               2.0.0     2025-03-27 [1] CRAN (R 4.5.0)
#>  knitr                  1.50      2025-03-16 [1] CRAN (R 4.5.0)
#>  lattice                0.22-7    2025-04-02 [3] CRAN (R 4.5.1)
#>  lifecycle              1.0.4     2023-11-07 [1] CRAN (R 4.5.0)
#>  lubridate              1.9.4     2024-12-08 [1] CRAN (R 4.5.1)
#>  magrittr               2.0.4     2025-09-12 [1] CRAN (R 4.5.1)
#>  MASS                   7.3-65    2025-02-28 [3] CRAN (R 4.5.1)
#>  mathjaxr               1.8-0     2025-04-30 [1] CRAN (R 4.5.1)
#>  Matrix                 1.7-4     2025-08-28 [3] CRAN (R 4.5.1)
#>  MatrixGenerics         1.20.0    2025-04-15 [1] Bioconduc~
#>  matrixStats            1.5.0     2025-01-07 [1] CRAN (R 4.5.0)
#>  memoise                2.0.1     2021-11-26 [1] CRAN (R 4.5.0)
#>  pillar                 1.11.1    2025-09-17 [1] CRAN (R 4.5.1)
#>  pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.5.0)
#>  pkgdown                2.1.3     2025-05-25 [2] CRAN (R 4.5.0)
#>  plyr                   1.8.9     2023-10-02 [1] CRAN (R 4.5.1)
#>  plyranges              1.28.0    2025-04-15 [1] Bioconduc~
#>  poolr                  1.2-0     2025-05-07 [1] CRAN (R 4.5.0)
#>  prettyunits            1.2.0     2023-09-24 [1] CRAN (R 4.5.0)
#>  printr               * 0.3       2023-03-08 [1] CRAN (R 4.5.0)
#>  progress               1.2.3     2023-12-06 [1] CRAN (R 4.5.0)
#>  purrr                  1.1.0     2025-07-10 [1] CRAN (R 4.5.1)
#>  qqman                  0.1.9     2023-08-23 [1] CRAN (R 4.5.0)
#>  R.methodsS3            1.8.2     2022-06-13 [1] CRAN (R 4.5.0)
#>  R.oo                   1.27.1    2025-05-02 [1] CRAN (R 4.5.0)
#>  R.utils                2.13.0    2025-02-24 [1] CRAN (R 4.5.0)
#>  R6                     2.6.1     2025-02-15 [1] CRAN (R 4.5.0)
#>  ragg                   1.5.0     2025-09-02 [2] CRAN (R 4.5.1)
#>  rbibutils              2.3       2024-10-04 [1] CRAN (R 4.5.1)
#>  RColorBrewer           1.1-3     2022-04-03 [1] CRAN (R 4.5.0)
#>  Rcpp                   1.1.0     2025-07-02 [1] CRAN (R 4.5.1)
#>  RCurl                  1.98-1.17 2025-03-22 [1] CRAN (R 4.5.0)
#>  Rdpack                 2.6.4     2025-04-09 [1] CRAN (R 4.5.0)
#>  RefManageR           * 1.4.0     2022-09-30 [1] CRAN (R 4.5.1)
#>  regioneR               1.40.1    2025-06-01 [1] Bioconductor 3.21 (R 4.5.0)
#>  restfulr               0.0.16    2025-06-27 [1] CRAN (R 4.5.1)
#>  rjson                  0.2.23    2024-09-16 [1] CRAN (R 4.5.0)
#>  rlang                  1.1.6     2025-04-11 [1] CRAN (R 4.5.0)
#>  rmarkdown            * 2.30      2025-09-28 [1] CRAN (R 4.5.1)
#>  rmio                   0.4.0     2022-02-17 [1] CRAN (R 4.5.0)
#>  rngtools               1.5.2     2021-09-20 [1] CRAN (R 4.5.0)
#>  Rsamtools              2.24.1    2025-09-07 [1] Bioconductor 3.21 (R 4.5.1)
#>  rstatix                0.7.2     2023-02-01 [1] CRAN (R 4.5.0)
#>  rtracklayer            1.68.0    2025-04-15 [1] Bioconduc~
#>  S4Arrays               1.8.1     2025-06-01 [1] Bioconductor 3.21 (R 4.5.0)
#>  S4Vectors              0.46.0    2025-04-15 [1] Bioconduc~
#>  S7                     0.2.0     2024-11-07 [1] CRAN (R 4.5.1)
#>  sass                   0.4.10    2025-04-11 [1] CRAN (R 4.5.0)
#>  scales                 1.4.0     2025-04-24 [1] CRAN (R 4.5.0)
#>  sessioninfo          * 1.2.3     2025-02-05 [1] CRAN (R 4.5.1)
#>  SNPRelate              1.42.0    2025-04-15 [1] Bioconduc~
#>  SparseArray            1.8.1     2025-07-23 [1] Bioconduc~
#>  stringi                1.8.7     2025-03-27 [1] CRAN (R 4.5.0)
#>  stringr                1.5.2     2025-09-08 [1] CRAN (R 4.5.1)
#>  sumFREGAT              1.2.5     2022-06-07 [1] CRAN (R 4.5.1)
#>  SummarizedExperiment   1.38.1    2025-04-30 [1] Bioconductor 3.21 (R 4.5.0)
#>  sys                    3.4.3     2024-10-04 [1] CRAN (R 4.5.0)
#>  systemfonts            1.3.1     2025-10-01 [1] CRAN (R 4.5.1)
#>  textshaping            1.0.3     2025-09-02 [1] CRAN (R 4.5.1)
#>  tibble                 3.3.0     2025-06-08 [1] CRAN (R 4.5.0)
#>  tidyr                  1.3.1     2024-01-24 [1] CRAN (R 4.5.1)
#>  tidyselect             1.2.1     2024-03-11 [1] CRAN (R 4.5.0)
#>  timechange             0.3.0     2024-01-18 [1] CRAN (R 4.5.1)
#>  tzdb                   0.5.0     2025-03-15 [1] CRAN (R 4.5.1)
#>  UCSC.utils             1.4.0     2025-04-15 [1] Bioconduc~
#>  vctrs                  0.6.5     2023-12-01 [1] CRAN (R 4.5.0)
#>  vroom                  1.6.6     2025-09-19 [1] CRAN (R 4.5.1)
#>  withr                  3.0.2     2024-10-28 [1] CRAN (R 4.5.0)
#>  xfun                   0.53      2025-08-19 [1] CRAN (R 4.5.1)
#>  XML                    3.99-0.19 2025-08-22 [1] CRAN (R 4.5.1)
#>  xml2                   1.4.0     2025-08-20 [1] CRAN (R 4.5.1)
#>  XVector                0.48.0    2025-04-15 [1] Bioconduc~
#>  yaml                   2.3.10    2024-07-26 [1] CRAN (R 4.5.0)
#> 
#>  [1] /Users/mayerdav/Library/R/arm64/4.5/library
#>  [2] /opt/homebrew/lib/R/4.5/site-library
#>  [3] /opt/homebrew/Cellar/r/4.5.1/lib/R/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Bibliography

This vignette was generated using BiocStyle (Oleś, 2025) with knitr (Xie, 2025) and rmarkdown (Allaire, Xie, Dervieux et al., 2025) running behind the scenes.

Citations made with RefManageR (McLean, 2017).

[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.30. 2025. URL: https://github.com/rstudio/rmarkdown.

[2] B. Bose, F. Blostein, J. Kim, et al. “GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits”. In: medRxiv 2025.06.10.25329327 (2025). DOI: 10.1101/2025.06.10.25329327.

[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.

[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.36.0. 2025. DOI: 10.18129/B9.bioc.BiocStyle. URL: https://bioconductor.org/packages/BiocStyle.

[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2025. URL: https://www.R-project.org/.

[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.3. 2025. DOI: 10.32614/CRAN.package.sessioninfo. URL: https://CRAN.R-project.org/package=sessioninfo.

[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50. 2025. URL: https://yihui.org/knitr/.

Banabithi Bose

16 October 2025

Pre-imputation QC pipeline:

Example Datasets

Pre-imputation QC steps

Loading the GXwasR library

Learn about all the GXwasR functions

Example Dataset Summary

Step 1: Filtering multi-allelic variants

Step 2: QC of SNPs for genotype call rate

Step 3: Filtering samples with high missing rate threshold.

Step 4: Apply sex check

Step 5: Filter for needed chromosomes.

Step 6: Second round of SNP filtering.

Step 7: Second round of sample filtering

Step 8: Ancestry check

Step 9: Third round of SNP filtering

Final Summary of final Qc-ed dataset.

Citation

Reproducibility

Bibliography