Pre-impute Quality Control Pipeline Using GXwasR
Banabithi Bose
Northwestern UniversityUniversity of Colorado Anschutz Medical Campusbanabithi.bose@gmail.com
8 August 2025
Source:vignettes/preimputationQC.Rmd
preimputationQC.Rmd
This document outlines the pre-imputation quality control (QC) steps, which represent our recommended best practices. Users have the flexibility to create their own pipelines using the GXwasR functions based on their specific needs.
Pre-imputation QC pipeline:
We will run the pre-imputation QC pipeline in this vignette utilizing the functions from GXwasR package.
Example Datasets
The PLINK bed, bim, and fam files are the three mandatory files representing a genotype dataset to run this pipeline. To know about these file extensions, please check https://www.cog-genomics.org/plink/1.9/formats.
GXwasR_example:
- GXwasR_example.bed
- GXwasR_example.bim
- GXwasR_example.fam
These plink files contain genotypes for 276 individuals (males and females) simulated from 1000Genome from European decent with 26515 variants across twelve chromosomes (1-10,23,24). This dataset contains 125 males, 151 females, 108 cases and 168 controls. We will utilize this set of plink files as proxy of pre-imputated genotype data.
Pre-imputation QC steps
Filtering Multi-Allelic Variants: This step is essential for simplifying the genetic data analysis by removing variants that have more than two alleles. Multi-allelic variants can complicate analyses and interpretations, making this filtering crucial for a streamlined dataset.
Removal of Ambiguous SNPs and Indels: We specifically target SNPs with ambiguous mappings (like AT and GC pairings) and indels (insertions and deletions), which can lead to inaccuracies in genotype calling. Removing these elements helps in maintaining the integrity of our genetic data.
QC of SNPs with Relaxed Genotype Call Rate: In this phase, we filter SNPs based on a less stringent genotype call rate. This step aims to balance between retaining enough informative SNPs and excluding those with too much missing data.
Filtering Samples with Relaxed Missing Rate Threshold: Similar to SNPs, we apply a relaxed threshold for sample missingness. This approach ensures that we don’t prematurely exclude samples that might be informative despite having a slightly higher rate of missing data.
Application of Sex Check: We verify the consistency between reported and genetic sex. This check helps identify potential sample mix-ups or issues with data integrity.
Filter for Needed Chromosomes: Focus is narrowed to autosomes and sex chromosomes, as these are typically the most relevant for genetic studies. This step streamlines the dataset to the chromosomes of interest.
Second Round of SNP Filtering: This more stringent phase includes filters for minor allele frequency (MAF) < 0.01, call rate < 0.02, case-control differential missingness, Hardy-Weinberg equilibrium (cases < e-10; controls < e-06), and the removal of monomorphic SNPs. This rigorous filtering ensures that only the most reliable SNPs are retained for analysis.
Second Round of Sample Filtering: Here, we apply stringent thresholds for call rate (< 0.02), heterozygosity, and relatedness (IBD F-stat < 0.2). This step further refines the sample pool, ensuring that only high-quality samples are included.
Applying Ancestry Check: We assess the ancestry of the samples to identify and exclude outliers from different ancestries, which is vital for controlling population stratification in genetic association studies.
Third Round of SNP Filtering: This final filtering step ensures that all previously applied SNP thresholds are maintained even after the rigorous sample filtering. It’s a crucial step to guarantee the overall quality and consistency of the SNP data set.
Learn about all the GXwasR functions
Before beginning the pre-imputation quality control (QC), it is recommended that users familiarize themselves with all the functions in the GXwasR package. This knowledge will help them understand the required input arguments, how to execute the functions, and what outputs to expect.
To explore all available functions, users should first load the GXwasR library and then use the browseVignettes() function:
browseVignettes("GXwasR")
Example Dataset Summary
Dataset: GXwasR_example
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
x <- PlinkSummary(DataDir, ResultDir, finput)
#> ℹ Dataset: GXwasR_example
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 12
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> - Chr: 24
#> ℹ Total number of SNPs: 26527
#> ℹ Total number of samples: 276
Step 1: Filtering multi-allelic variants
Firstly, users need to ensure that their input plink files are bi-allelic. To ensure that, it is advised to run the FilterAllele() function with the input dataset.
foutput <- "PreimputeEX_QC1"
x <- FilterAllele(DataDir, ResultDir, finput, foutput)
#> ℹ There is no multi-allelic SNP present in the input dataset.
Since there was no multi-allelic SNPs, we will proceed with the same input data.
Step 2: QC of SNPs for genotype call rate
It removes variants that are missing genotype data for more than 20% of the individuals in your dataset. This function implicitly remove ambiguous SNPs (i.e., AT<>GC), indels etc. Users can set their own threshold.
# Running
foutput <- "PreimputeEX_QC1"
geno <- 0.2
maf <- NULL
casecontrol <- FALSE
caldiffmiss <- FALSE
diffmissFilter <- FALSE
dmissX <- FALSE
dmissAutoY <- FALSE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
casecontrol <- FALSE
hweCase <- NULL
hweControl <- NULL
monomorphicSNPs <- FALSE
ld_prunning <- FALSE
x <- QCsnp(DataDir = DataDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> ℹ 4214 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> ℹ 11 variants removed due to missing genotype data (--geno).
#> ℹ No filter based on differential missingness will be applied.
#> ✔ Output PLINK files prefixed as ,PreimputeEX_QC1, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 26527 SNPs.
#> ℹ Output file has 22302 SNPs after filtering.
Copying the plink files from ResultDir to DataDir.
ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC1")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC1")
#> ℹ Dataset: PreimputeEX_QC1
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> ℹ Total number of SNPs: 22302
#> ℹ Total number of samples: 276
Step 3: Filtering samples with high missing rate threshold.
It excludes individuals who have missing genotype data for more than 20% of the genetic variants in the dataset.
# Running
finput <- "PreimputeEX_QC1"
foutput <- "PreimputeEX_QC2"
imiss = 0.2
het = NULL
IBD = NULL
filterSample <- TRUE
ambi_out <- TRUE
x = QCsample(DataDir = DataDir,ResultDir = ResultDir, finput = finput,foutput = foutput, imiss = imiss,het = het, IBD = NULL, filterSample = filterSample, ambi_out = ambi_out)
#> ℹ Missingness and heterogygosity thresholds are NULL.
#> ℹ There will be no sample to be filtered for IBD with 'IBD' threshold.
#> Warning: cannot open file '/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpcRh9ZZ/PreimputeEX_QC2.fam': No such file or directory
Step 4: Apply sex check
finput <- "PreimputeEX_QC1" # Using the same input file
LD = TRUE
LD_window_size <- 50
LD_step_size <- 5
LD_r2_threshold <- 0.02
fmax_F <- 0.3
mmin_F <- 0.7
# We will not impute sex
impute_sex <- FALSE
compute_freq <- FALSE
SexCheckResult <-SexCheck(DataDir=DataDir,ResultDir = ResultDir, finput=finput,impute_sex=impute_sex,compute_freq =compute_freq,LD_window_size=LD_window_size,LD_step_size=LD_step_size,LD_r2_threshold=0.02,fmax_F = fmax_F,mmin_F=mmin_F)
#> ℹ There are no Y chromosomes in the input PLINK files. Estimates will be based solely on the X chromosome.
problematic_sex <- SexCheckResult[SexCheckResult$STATUS != "OK",]
problematic_sex <- problematic_sex[,1:2]
Number of samples with problematic sex assignment: 4
write.table(problematic_sex,file = paste0(DataDir,"/problematic_sex_samples"),quote = F, row.names = F,col.names = F)
SexCheckResult[1:5,]
FID | IID | PEDSEX | SNPSEX | STATUS | F |
---|---|---|---|---|---|
EUR_FIN | HG00171 | 2 | 2 | OK | 0.081780 |
EUR_FIN | HG00173 | 2 | 2 | OK | -0.031060 |
EUR_FIN | HG00174 | 2 | 2 | OK | 0.058430 |
EUR_FIN | HG00176 | 2 | 2 | OK | -0.023280 |
EUR_FIN | HG00177 | 2 | 2 | OK | -0.003822 |
Note, there is no filtering of the samples based on sex prediction.
This function uses X chromosome data to determine sex (i.e. based on heterozygosity rates) and flags individuals for whom the reported sex in the PED file does not match the estimated sex (given genomic data).
A PROBLEM arises if the two sexes do not match, or if the SNP data or pedigree data are ambiguous with regard to sex. A male call is made if F is more than 0.8; a female call is made if F is less than 0.2.
SNPSEX Sex as determined by X chromosome
F The actual X chromosome inbreeding (homozygosity) estimate
Users have the ability to impute the sex for their dataset.
Step 5: Filter for needed chromosomes.
We prefer to keep only chromosome 1 to 23.
test1 <- read.table(paste0(DataDir,"/",finput,".bim"))
unique(test1$V1)
#> [1] 1 2 3 4 5 6 7 8 9 10 23
foutput <- "PreimputeEX_QC2"
x <- FilterRegion(DataDir = DataDir, ResultDir = ResultDir, finput = finput, foutput = foutput, CHRX = FALSE, CHRY = FALSE, filterPAR = FALSE, filterXTR = FALSE, filterAmpliconic = FALSE, regionfile = FALSE, filterCHR = c(24, 25, 26),Hg = "38", exclude = TRUE)
Copying the plink files from ResultDir to DataDir.
ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC2")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC2")
#> ℹ Dataset: PreimputeEX_QC2
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> ℹ Total number of SNPs: 22302
#> ℹ Total number of samples: 276
Step 6: Second round of SNP filtering.
library(GXwasR)
finput <- "PreimputeEX_QC2"
foutput <- "PreimputeEX_QC3"
geno <- 0.02
maf <- 0.01
casecontrol <- TRUE
caldiffmiss <- TRUE
diffmissFilter <- TRUE
dmissX <- TRUE
dmissAutoY <- TRUE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
hwe = NULL
hweCase <- 1e-10
hweControl <- 1e-06
x <- QCsnp(DataDir = DataDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> ℹ 0 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> 0 variants removed due to missing genotype data (--geno).
#> 1930 variants removed due to minor allele threshold(s)
#> • In cases, 1 variant removed due to Hardy-Weinberg exact test.
#> • In controls, 19 variants removed due to Hardy-Weinberg exact test.
#> ✔ Merging is done using the common SNPs between the input genotype files.
#> ✔ Plink files with merged regions are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpcRh9ZZ prefixed as filtered_temp2
#> ℹ There are no monomorphic SNPs.
#> ℹ No SNP with differential missingness between cases and controls.
#> ✔ Output PLINK files prefixed as ,PreimputeEX_QC3, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 22302 SNPs.
#> ℹ Output file has 20353 SNPs after filtering.
Copying the plink files from ResultDir to DataDir.
ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC3")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC3")
#> ℹ Dataset: PreimputeEX_QC3
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> ℹ Total number of SNPs: 20353
#> ℹ Total number of samples: 276
Step 7: Second round of sample filtering
Here we are using stringent thresholds for sample filter.
finput <- "PreimputeEX_QC3"
foutput <- "PreimputeEX_QC4"
imiss = 0.02
het = 3
IBD = 0.2
IBDmatrix = FALSE
small_sample_mod = TRUE
filterSample = TRUE
ambi_out = TRUE
x = QCsample(DataDir = DataDir,ResultDir = ResultDir, finput = finput,foutput = foutput, imiss = imiss,het = het, IBD = IBD, IBDmatrix = IBDmatrix, filterSample = filterSample, ambi_out = ambi_out)
#> • Plots are initiated.
#> ℹ No. of samples filtered/flagged for missingness: 0
#> ℹ No. of samples filtered/flagged for heterozygosity: 0
#> ℹ No. of samples filtered for missingness and heterozygosity: 0
#> ℹ No. of samples marked to be filtered out for IDB after missingness and heterozygosity filter: 2
#> ℹ No. of samples in input PLINK files: 276
#> ℹ No. of samples in output PLINK files: 274
#> ✔ Output PLINK files, PreimputeEX_QC4 with final samples are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpcRh9ZZ.
Copying the plink files from ResultDir to DataDir.
ftemp <- list.files(paste0(ResultDir,"/"),pattern = "PreimputeEX_QC4")
file.copy(paste0(ResultDir,"/",ftemp),DataDir)
#> [1] TRUE TRUE TRUE TRUE
PlinkSummary(DataDir, ResultDir, finput = "PreimputeEX_QC4")
#> ℹ Dataset: PreimputeEX_QC4
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 124
#> ℹ Number of females: 150
#> ℹ This is case-control data
#> ℹ Number of cases: 106
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 52
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 54
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> ℹ Total number of SNPs: 20353
#> ℹ Total number of samples: 274
Step 8: Ancestry check
Prior to use this function, users are highly encouraged to read our tutorial ““Decoding Ancestry: A Guide to Using GXwasR for Genetic Ancestry Estimation”.
data("highLD_hg19", package = "GXwasR")
data("example_data_study_sample_ancestry", package = "GXwasR")
finput <- "PreimputeEX_QC4"
reference <- "HapMapIII_NCBI36"
highLD_regions <- highLD_hg19
study_pop <- example_data_study_sample_ancestry
studyLD_window_size = 50
studyLD_step_size = 5
studyLD_r2_threshold = 0.02
filterSNP = TRUE
studyLD = TRUE
referLD = TRUE
referLD_window_size = 50
referLD_step_size = 5
referLD_r2_threshold = 0.02
outlier = TRUE
outlier_threshold = 3
outlierOf = "EUR"
x <- AncestryCheck(DataDir = DataDir,
ResultDir = ResultDir,
finput = finput,
reference = reference,
filterSNP = TRUE,
studyLD = TRUE,
studyLD_window_size = 50,
studyLD_step_size = 5,
studyLD_r2_threshold = 0.02,
referLD = FALSE,
referLD_window_size = 50,
referLD_step_size = 5,
referLD_r2_threshold = 0.02,
highLD_regions = highLD_regions,
study_pop = study_pop,
outlier = TRUE,
outlierOf = "EUR",
outlier_threshold = 3)
#> No A-T or G-C SNPs found in study data.
#> 111854 ambiguous SNPs removed from reference data.
#> ! LD pruning is recommended for reference dataset. Set referLD = TRUE.
#> ✔ LD pruning was performed for study dataset.
#>
#> ℹ Number of overlapping SNPs between study and reference data using rsID: 1032
#> ✔ PCA done.
#> ℹ No allele flips between study and reference data.
#> ℹ 7 samples are outliers of selected reference population.
#> ℹ 179 samples are NOT outliers of selected reference population.
Samples_with_predicted_ancestry <- x[["Samples_with_predicted_ancestry"]]
outlier <- x[["Outlier_samples"]]
outlier
FID | IID |
---|---|
EUR_FIN | HG00280 |
EUR_FIN | HG00315 |
EUR_FIN | HG00321 |
EUR_FIN | HG00350 |
EUR_FIN | HG00369 |
EUR_FIN | HG00377 |
EUR_FIN | HG00383 |
Samples_with_predicted_ancestry[1:5,]
FID | IID | Predicted Ancestry |
---|---|---|
EUR_GBR | HG00096 | EUR |
EUR_GBR | HG00097 | EUR |
EUR_GBR | HG00099 | EUR |
EUR_GBR | HG00100 | EUR |
EUR_GBR | HG00101 | EUR |
There was no outlier samples.
Step 9: Third round of SNP filtering
To make sure all the filtering thresholds for SNPs are maintained after sample filtering.
library(GXwasR)
finput <- "PreimputeEX_QC4"
foutput <- "Preimpute_Final"
geno <- 0.02
maf <- 0.01
casecontrol <- TRUE
caldiffmiss <- TRUE
diffmissFilter <- TRUE
dmissX <- TRUE
dmissAutoY <- TRUE
monomorphicSNPs <- TRUE
ld_prunning <- FALSE
hwe = NULL
hweCase <- 1e-10
hweControl <- 1e-06
x <- QCsnp(DataDir = ResultDir, ResultDir = ResultDir, finput = finput,foutput = foutput, geno = geno, maf = maf,hweCase = hweCase, hweControl = hweControl, ld_prunning = ld_prunning, casecontrol = casecontrol, monomorphicSNPs = monomorphicSNPs, caldiffmiss = caldiffmiss, dmissX = dmissX, dmissAutoY = dmissAutoY, diffmissFilter = diffmissFilter)
#> ℹ 0 Ambiguous SNPs (A-T/G-C), indels etc. were removed.
#> ✔ Thresholds for maf, geno and hwe worked.
#> 0 variants removed due to missing genotype data (--geno).
#> 3 variants removed due to minor allele threshold(s)
#> • In cases, 0 variants removed due to Hardy-Weinberg exact test.
#> • In controls, 0 variants removed due to Hardy-Weinberg exact test.
#> ✔ Merging is done using the common SNPs between the input genotype files.
#> ✔ Plink files with merged regions are in /var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T//RtmpcRh9ZZ prefixed as filtered_temp2
#> ℹ There are no monomorphic SNPs.
#> ℹ No SNP with differential missingness between cases and controls.
#> ✔ Output PLINK files prefixed as ,Preimpute_Final, with passed SNPs are saved in ResultDir.
#> ℹ Input file has 20353 SNPs.
#> ℹ Output file has 20350 SNPs after filtering.
Final Summary of final Qc-ed dataset.
PlinkSummary(ResultDir,ResultDir,foutput)
#> ℹ Dataset: Preimpute_Final
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 124
#> ℹ Number of females: 150
#> ℹ This is case-control data
#> ℹ Number of cases: 106
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 52
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 54
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 11
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> ℹ Total number of SNPs: 20350
#> ℹ Total number of samples: 274
Citation
If you are using GXwasR package and/or following this pipeline in your study, please cite it as:
citation("GXwasR")
#> To cite package 'GXwasR' in publications use:
#>
#> Bose B, Blostein F, Kim J, Winters J, Actkins KV, Mayer D, Congivaram H, Niarchou M, Edwards DV, Davis
#> LK, Stranger BE (2025). "GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on
#> Complex Traits." _medRxiv 2025.06.10.25329327_. doi:10.1101/2025.06.10.25329327
#> <https://doi.org/10.1101/2025.06.10.25329327>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits},
#> author = {Banabithi Bose and Freida Blostein and Jeewoo Kim and Jessica Winters and Ky’Era V. Actkins and David Mayer and Harrsha Congivaram and Maria Niarchou and Digna Velez Edwards and Lea K. Davis and Barbara E. Stranger},
#> journal = {medRxiv 2025.06.10.25329327},
#> year = {2025},
#> doi = {10.1101/2025.06.10.25329327},
#> }
Reproducibility
The GXwasR package (Bose, Blostein, Kim et al., 2025) was made possible thanks to:
- R (R Core Team, 2025)
- BiocStyle (Oleś, 2025)
- knitr (Xie, 2025)
- RefManageR (McLean, 2017)
- rmarkdown (Allaire, Xie, Dervieux et al., 2024)
- sessioninfo (Wickham, Chang, Flight et al., 2025)
- testthat (Wickham, 2011)
This package was developed using biocthis.
R
session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.5.1 (2025-06-13)
#> os macOS Sequoia 15.6
#> system aarch64, darwin24.4.0
#> ui unknown
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2025-08-08
#> pandoc 3.6.3 @ /Applications/Positron.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> quarto 1.7.33 @ /usr/local/bin/quarto
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [2] CRAN (R 4.5.0)
#> backports 1.5.0 2024-05-23 [2] CRAN (R 4.5.1)
#> bibtex 0.5.1 2023-01-26 [2] CRAN (R 4.5.0)
#> bigassertr 0.1.7 2025-06-27 [2] CRAN (R 4.5.1)
#> bigparallelr 0.3.2 2021-10-02 [2] CRAN (R 4.5.0)
#> bigsnpr 1.12.18 2024-11-26 [2] CRAN (R 4.5.1)
#> bigsparser 0.7.3 2024-09-06 [2] CRAN (R 4.5.1)
#> bigstatsr 1.6.2 2025-07-29 [2] CRAN (R 4.5.1)
#> Biobase 2.68.0 2025-04-15 [2] Bioconduc~
#> BiocGenerics 0.54.0 2025-04-15 [2] Bioconduc~
#> BiocIO 1.18.0 2025-04-15 [2] Bioconduc~
#> BiocManager 1.30.26 2025-06-05 [2] CRAN (R 4.5.0)
#> BiocParallel 1.42.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> BiocStyle 2.36.0 2025-04-15 [2] Bioconduc~
#> Biostrings 2.76.0 2025-04-15 [2] Bioconduc~
#> bit 4.6.0 2025-03-06 [2] CRAN (R 4.5.1)
#> bit64 4.6.0-1 2025-01-16 [2] CRAN (R 4.5.1)
#> bitops 1.0-9 2024-10-03 [2] CRAN (R 4.5.0)
#> brio 1.1.5 2024-04-24 [2] CRAN (R 4.5.1)
#> broom 1.0.9 2025-07-28 [2] CRAN (R 4.5.1)
#> BSgenome 1.76.0 2025-04-15 [2] Bioconduc~
#> cachem 1.1.0 2024-05-16 [2] CRAN (R 4.5.0)
#> calibrate 1.7.7 2020-06-19 [2] CRAN (R 4.5.0)
#> callr 3.7.6 2024-03-25 [2] CRAN (R 4.5.0)
#> car 3.1-3 2024-09-27 [2] CRAN (R 4.5.0)
#> carData 3.0-5 2022-01-06 [2] CRAN (R 4.5.0)
#> cli 3.6.5 2025-04-23 [2] CRAN (R 4.5.0)
#> codetools 0.2-20 2024-03-31 [4] CRAN (R 4.5.1)
#> cowplot 1.2.0 2025-07-07 [2] CRAN (R 4.5.1)
#> crayon 1.5.3 2024-06-20 [2] CRAN (R 4.5.0)
#> curl 6.4.0 2025-06-22 [2] CRAN (R 4.5.1)
#> data.table 1.17.8 2025-07-10 [2] CRAN (R 4.5.1)
#> DelayedArray 0.34.1 2025-04-17 [2] Bioconduc~
#> devtools * 2.4.5 2022-10-11 [3] CRAN (R 4.5.0)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.5.0)
#> doParallel 1.0.17 2022-02-07 [2] CRAN (R 4.5.0)
#> doRNG 1.8.6.2 2025-04-02 [2] CRAN (R 4.5.0)
#> doSNOW 1.0.20 2022-02-04 [2] CRAN (R 4.5.0)
#> dplyr 1.1.4 2023-11-17 [2] CRAN (R 4.5.0)
#> ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.5.0)
#> evaluate 1.0.4 2025-06-18 [2] CRAN (R 4.5.1)
#> farver 2.1.2 2024-05-13 [2] CRAN (R 4.5.0)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.5.0)
#> flock 0.7 2016-11-12 [2] CRAN (R 4.5.1)
#> foreach 1.5.2 2022-02-02 [2] CRAN (R 4.5.0)
#> Formula 1.2-5 2023-02-24 [2] CRAN (R 4.5.0)
#> fs 1.6.6 2025-04-12 [2] CRAN (R 4.5.0)
#> gdsfmt 1.44.1 2025-07-09 [2] Bioconduc~
#> generics 0.1.4 2025-05-09 [2] CRAN (R 4.5.0)
#> GenomeInfoDb 1.44.1 2025-07-23 [2] Bioconduc~
#> GenomeInfoDbData 1.2.14 2025-04-21 [2] Bioconductor
#> GenomicAlignments 1.44.0 2025-04-15 [2] Bioconduc~
#> GenomicRanges 1.60.0 2025-04-15 [2] Bioconduc~
#> ggplot2 3.5.2 2025-04-09 [2] CRAN (R 4.5.0)
#> ggpubr 0.6.1 2025-06-27 [2] CRAN (R 4.5.1)
#> ggrepel 0.9.6 2024-09-07 [2] CRAN (R 4.5.1)
#> ggsignif 0.6.4 2022-10-13 [2] CRAN (R 4.5.0)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.5.0)
#> gridExtra 2.3 2017-09-09 [2] CRAN (R 4.5.0)
#> gtable 0.3.6 2024-10-25 [2] CRAN (R 4.5.0)
#> GXwasR * 0.99.0 2025-08-08 [1] Bioconductor
#> hms 1.1.3 2023-03-21 [2] CRAN (R 4.5.0)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.5.0)
#> httpuv 1.6.16 2025-04-16 [2] CRAN (R 4.5.1)
#> httr 1.4.7 2023-08-15 [2] CRAN (R 4.5.0)
#> IRanges 2.42.0 2025-04-15 [2] Bioconduc~
#> iterators 1.0.14 2022-02-05 [2] CRAN (R 4.5.0)
#> jsonlite 2.0.0 2025-03-27 [2] CRAN (R 4.5.0)
#> knitr 1.50 2025-03-16 [2] CRAN (R 4.5.0)
#> labeling 0.4.3 2023-08-29 [2] CRAN (R 4.5.0)
#> later 1.4.2 2025-04-08 [2] CRAN (R 4.5.1)
#> lattice 0.22-7 2025-04-02 [4] CRAN (R 4.5.1)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.5.0)
#> lubridate 1.9.4 2024-12-08 [2] CRAN (R 4.5.1)
#> magrittr * 2.0.3 2022-03-30 [2] CRAN (R 4.5.0)
#> MASS 7.3-65 2025-02-28 [4] CRAN (R 4.5.1)
#> mathjaxr 1.8-0 2025-04-30 [2] CRAN (R 4.5.1)
#> Matrix 1.7-3 2025-03-11 [4] CRAN (R 4.5.1)
#> MatrixGenerics 1.20.0 2025-04-15 [2] Bioconduc~
#> matrixStats 1.5.0 2025-01-07 [2] CRAN (R 4.5.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.5.0)
#> mgcv 1.9-3 2025-04-04 [4] CRAN (R 4.5.1)
#> mime 0.13 2025-03-17 [2] CRAN (R 4.5.0)
#> miniUI 0.1.2 2025-04-17 [3] CRAN (R 4.5.0)
#> nlme 3.1-168 2025-03-31 [4] CRAN (R 4.5.1)
#> pillar 1.11.0 2025-07-04 [2] CRAN (R 4.5.1)
#> pkgbuild 1.4.8 2025-05-26 [2] CRAN (R 4.5.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.5.0)
#> pkgdev 0.1.0.9060 2025-08-04 [2] Github (dieghernan/pkgdev@e56f2a8)
#> pkgload 1.4.0 2024-06-28 [2] CRAN (R 4.5.0)
#> plyr 1.8.9 2023-10-02 [2] CRAN (R 4.5.1)
#> plyranges 1.28.0 2025-04-15 [2] Bioconduc~
#> polynom 1.4-1 2022-04-11 [2] CRAN (R 4.5.0)
#> poolr 1.2-0 2025-05-07 [2] CRAN (R 4.5.0)
#> prettyunits 1.2.0 2023-09-24 [2] CRAN (R 4.5.0)
#> printr * 0.3 2023-03-08 [2] CRAN (R 4.5.0)
#> processx 3.8.6 2025-02-21 [2] CRAN (R 4.5.1)
#> profvis 0.4.0 2024-09-20 [3] CRAN (R 4.5.0)
#> progress 1.2.3 2023-12-06 [2] CRAN (R 4.5.0)
#> promises 1.3.3 2025-05-29 [2] CRAN (R 4.5.0)
#> ps 1.9.1 2025-04-12 [2] CRAN (R 4.5.1)
#> purrr 1.1.0 2025-07-10 [2] CRAN (R 4.5.1)
#> qqman 0.1.9 2023-08-23 [2] CRAN (R 4.5.0)
#> R.methodsS3 1.8.2 2022-06-13 [2] CRAN (R 4.5.0)
#> R.oo 1.27.1 2025-05-02 [2] CRAN (R 4.5.0)
#> R.utils 2.13.0 2025-02-24 [2] CRAN (R 4.5.0)
#> R6 2.6.1 2025-02-15 [2] CRAN (R 4.5.0)
#> ragg 1.4.0 2025-04-10 [3] CRAN (R 4.5.0)
#> rbibutils 2.3 2024-10-04 [2] CRAN (R 4.5.1)
#> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.5.0)
#> Rcpp 1.1.0 2025-07-02 [2] CRAN (R 4.5.1)
#> RCurl 1.98-1.17 2025-03-22 [2] CRAN (R 4.5.0)
#> Rdpack 2.6.4 2025-04-09 [2] CRAN (R 4.5.0)
#> RefManageR * 1.4.0 2022-09-30 [2] CRAN (R 4.5.1)
#> regioneR 1.40.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> remotes 2.5.0 2024-03-17 [2] CRAN (R 4.5.0)
#> restfulr 0.0.16 2025-06-27 [2] CRAN (R 4.5.1)
#> rjson 0.2.23 2024-09-16 [2] CRAN (R 4.5.0)
#> rlang 1.1.6 2025-04-11 [2] CRAN (R 4.5.0)
#> rmarkdown * 2.29 2024-11-04 [2] CRAN (R 4.5.0)
#> rmio 0.4.0 2022-02-17 [2] CRAN (R 4.5.0)
#> rngtools 1.5.2 2021-09-20 [2] CRAN (R 4.5.0)
#> rprojroot 2.1.0 2025-07-12 [2] CRAN (R 4.5.1)
#> Rsamtools 2.24.0 2025-04-15 [2] Bioconduc~
#> rstatix 0.7.2 2023-02-01 [2] CRAN (R 4.5.0)
#> rtracklayer 1.68.0 2025-04-15 [2] Bioconduc~
#> S4Arrays 1.8.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> S4Vectors 0.46.0 2025-04-15 [2] Bioconduc~
#> scales 1.4.0 2025-04-24 [2] CRAN (R 4.5.0)
#> seqminer 9.7 2024-10-02 [2] CRAN (R 4.5.1)
#> sessioninfo * 1.2.3 2025-02-05 [2] CRAN (R 4.5.1)
#> shiny 1.11.1 2025-07-03 [2] CRAN (R 4.5.1)
#> snow 0.4-4 2021-10-27 [2] CRAN (R 4.5.0)
#> SNPRelate 1.42.0 2025-04-15 [2] Bioconduc~
#> SparseArray 1.8.1 2025-07-23 [2] Bioconduc~
#> stringi 1.8.7 2025-03-27 [2] CRAN (R 4.5.0)
#> stringr 1.5.1 2023-11-14 [2] CRAN (R 4.5.0)
#> sumFREGAT 1.2.5 2022-06-07 [2] CRAN (R 4.5.1)
#> SummarizedExperiment 1.38.1 2025-04-30 [2] Bioconductor 3.21 (R 4.5.0)
#> sys 3.4.3 2024-10-04 [2] CRAN (R 4.5.0)
#> systemfonts 1.2.3 2025-04-30 [2] CRAN (R 4.5.0)
#> testthat * 3.2.3 2025-01-13 [2] CRAN (R 4.5.1)
#> textshaping 1.0.1 2025-05-01 [3] CRAN (R 4.5.0)
#> tibble 3.3.0 2025-06-08 [2] CRAN (R 4.5.0)
#> tidyr 1.3.1 2024-01-24 [2] CRAN (R 4.5.1)
#> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.5.0)
#> timechange 0.3.0 2024-01-18 [2] CRAN (R 4.5.1)
#> tzdb 0.5.0 2025-03-15 [2] CRAN (R 4.5.1)
#> UCSC.utils 1.4.0 2025-04-15 [2] Bioconduc~
#> urlchecker 1.0.1 2021-11-30 [3] CRAN (R 4.5.0)
#> usethis * 3.1.0 2024-11-26 [2] CRAN (R 4.5.0)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.5.0)
#> vroom 1.6.5 2023-12-05 [2] CRAN (R 4.5.1)
#> withr 3.0.2 2024-10-28 [2] CRAN (R 4.5.0)
#> xfun 0.52 2025-04-02 [2] CRAN (R 4.5.0)
#> XML 3.99-0.18 2025-01-01 [2] CRAN (R 4.5.0)
#> xml2 1.3.8 2025-03-14 [2] CRAN (R 4.5.1)
#> xtable 1.8-4 2019-04-21 [2] CRAN (R 4.5.0)
#> XVector 0.48.0 2025-04-15 [2] Bioconduc~
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.5.0)
#>
#> [1] /private/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T/RtmpcRh9ZZ/temp_libpath4e4d53c96c8f
#> [2] /Users/mayerdav/Library/R/arm64/4.5/library
#> [3] /opt/homebrew/lib/R/4.5/site-library
#> [4] /opt/homebrew/Cellar/r/4.5.1/lib/R/library
#> * ── Packages attached to the search path.
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bibliography
This vignette was generated using BiocStyle (Oleś, 2025) with knitr (Xie, 2025) and rmarkdown (Allaire, Xie, Dervieux et al., 2024) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. URL: https://github.com/rstudio/rmarkdown.
[2] B. Bose, F. Blostein, J. Kim, et al. “GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits”. In: medRxiv 2025.06.10.25329327 (2025). DOI: 10.1101/2025.06.10.25329327.
[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.36.0. 2025. DOI: 10.18129/B9.bioc.BiocStyle. URL: https://bioconductor.org/packages/BiocStyle.
[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2025. URL: https://www.R-project.org/.
[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.3. 2025. DOI: 10.32614/CRAN.package.sessioninfo. URL: https://CRAN.R-project.org/package=sessioninfo.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50. 2025. URL: https://yihui.org/knitr/.