Decoding Ancestry: A Guide to Using GXwasR for Genetic Ancestry Estimation
Banabithi Bose
Northwestern UniversityUniversity of Colorado Anschutz Medical Campusbanabithi.bose@gmail.com
8 August 2025
Source:vignettes/decoding_ancestry.Rmd
decoding_ancestry.Rmd
Introduction
The pursuit of ancestral origins using genetic data has become
increasingly accessible and sophisticated with the advent of genomics
technology. Our R package GXwasR enables the identification of
individuals of divergent ancestry through genotype datasets. This
vignette provides an in-depth look at how GXwasR facilitates ancestry
estimation, particularly focusing on its AncestryCheck
()
function.
Identifying Divergent Ancestry
The core of ancestry estimation lies in combining the genotypes of a study population with those from a reference dataset, such as the HapMap or 1000 Genomes studies. This integration allows for the detection of population structure down to the level of the reference dataset, revealing continental-scale ancestry differences.
Accessing GXwasR and Simulated Genotype Datasets
GXwasR is readily available on Bioconductor and can be installed using BiocManager::install(“GXwasR”). The package includes a simulated genotype dataset prefixed as GXwasR_example as plink bed files (i.e., .bed, .fam, and .bim), ideal for users to experiment with. This simulated dataset consists of genotypes from 276 individuals across 26,527 genetic markers, providing a realistic scenario for testing the functionality.
Let’s see more details of this genotype dataset that we will call ‘Study data’.
library(GXwasR)
#>
#> GXwasR: Genome-wide and x-chromosome wide association analyses applying best practices of quality control over genetic data
#> Version 0.99.0 () installed
#> Author: c(
#> person(given = "Banabithi",
#> family = "Bose",
#> role = c("cre", "aut"),
#> email = "banabithi.bose@gmail.com",
#> comment = c(ORCID = "0000-0003-0842-8768"))
#> )
#> Maintainer: Banabithi Bose <banabithi.bose@gmail.com>
#> Tutorial: https://github.com
#> Use citation("GXwasR") to know how to cite this work.
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
x <- PlinkSummary(DataDir, ResultDir, finput)
#> ℹ Dataset: GXwasR_example
#> ℹ Number of missing phenotypes: 0
#> ℹ Number of males: 125
#> ℹ Number of females: 151
#> ℹ This is case-control data
#> ℹ Number of cases: 108
#> ℹ Number of controls: 168
#> ℹ Number of cases in males: 53
#> ℹ Number of controls in males: 72
#> ℹ Number of cases in females: 55
#> ℹ Number of controls in females: 96
#> ℹ Number of chromosomes: 12
#> - Chr: 1
#> - Chr: 2
#> - Chr: 3
#> - Chr: 4
#> - Chr: 5
#> - Chr: 6
#> - Chr: 7
#> - Chr: 8
#> - Chr: 9
#> - Chr: 10
#> - Chr: 23
#> - Chr: 24
#> ℹ Total number of SNPs: 26527
#> ℹ Total number of samples: 276
AncestryCheck Function Overview
AncestryCheck
is a comprehensive R function designed for
comprehensive ancestry analysis in genetic studies. By leveraging plink
software via R platform,it seamlessly integrates several critical steps:
SNP filtering for data integrity, linkage disequilibrium (LD) pruning to
reduce data complexity, correction of chromosome mismatches, and
checking for allele flips to ensure data accuracy. Additionally, it
adeptly merges study and reference datasets, conducts Principal
Component Analysis (PCA) to unravel population structures, and
incorporates outlier detection and visualization. This function is
pivotal in comparing study samples’ self-reported ancestry against
reference populations, enabling a detailed analysis of population
structure and the identification of divergent ancestries.
Implemented workflow of ancestry estimation in AncestryCheck
The AncestryCheck
() function in the GXwasR package is a
sophisticated tool designed for ancestry analysis in genetic studies. It
provides a comprehensive approach, from data preparation to advanced
statistical analysis, for estimating ancestral backgrounds. Let’s delve
into the technical aspects and workflow of this function:
Data Preparation:
Directories and initial quality check: The ‘
AncestryCheck
’ function requires specific directories for inputting PLINK binary files and for saving output files. At this stage, the input genotype dataset, i.e., the study dataset, should have already undergone thorough quality control. GXwasR offers a range of functions to facilitate this preliminary quality check (QC) of genotype datasets, essential for accurate ancestry estimation. This QC process typically includes filtering out multi-allelic variants, selecting SNPs based on genotype call rate, filtering for specific chromosomes, SNP filtering grounded in minor allele frequency, adherence to Hardy-Weinberg equilibrium, and sample selection based on criteria like call rate, low missing rate, and heterozygosity thresholds. It’s important to apply these filters in the appropriate sequence for optimal dataset preparation. For a more detailed understanding of this QC pipeline, users can refer to the comprehensive guide available at https://boseb.github.io/GXwasR/articles/preimputationQC.html.Genomic Build and X-Chromosome Compatibility: Users of the
AncestryCheck
function must ensure that both their study and reference datasets are aligned on the same genomic build, with consistent chromosome coding across the entire genome. This alignment is crucial, particularly since sex chromosomes (X and Y) often have different encoding conventions in genomic datasets. TheAncestryCheck
function is equipped to implicitly handle variations in sex chromosome encoding. However, users are advised to employ tools like Liftover or CrossMap for verifying and aligning genomic builds and chromosome codes between their study and reference datasets, thus ensuring compatibility and accuracy in the ancestry estimation process.Reference Dataset Selection: The function offers users the flexibility to select from two established reference populations, such as, HapMap phase III[2] or the 1000 Genomes phase 3[3] Projects. It is specifically designed to work seamlessly with the preprocessed genotype datasets of these reference panels. For users seeking to download and process these datasets, a comprehensive guidance was provided through a dedicated vignette. This detailed resource, which can be accessed at https://boseb.github.io/Processing_1000Genome/tutorial.html, offers step-by-step instructions on how to effectively prepare these reference datasets for use in ancestry analysis, ensuring users can make the most out of their genetic studies with the selected reference populations.
SNP Filtering:
The function incorporates a crucial step of SNP filtering for both
study and reference datasets. This filtering process is particularly
focused on excluding SNPs that do not conform to A-T or G-C base
pairings. By implementing this filtering, AncestryCheck
ensures the overall quality and reliability of the genetic data from
both study and reference datasets. This step is vital in maintaining the
integrity of the analysis, as it helps in eliminating potentially
problematic SNPs that could skew the ancestry estimation results.
LD Pruning:
The function includes a critical step of LD-based filtering for both
the study and reference datasets, which is essential in simplifying the
genetic data by removing variants in high linkage disequilibrium. This
process is especially significant for conducting accurate PCA on the
study dataset SNPs. The default setting targets variants in LD with an
r2
value greater than 0.2 within a 50kb window. However,
AncestryCheck
offers users the flexibility to adjust these
parameters according to their specific research needs. Users can choose
whether or not to apply LD-based filtering and can modify parameters
like the r2
threshold and the window size
.
This adaptability ensures that the function can be tailored to a variety
of study designs and research objectives, making it a versatile tool for
ancestry analysis.
The ’AncestryCheck
then utilizes the list of pruned
variants from the study data to proportionately adjust the reference
dataset, aligning it to the scale of the study data. This harmonization
is crucial for accurate comparative analysis and ensures consistency in
the ancestral estimation process.
Filter for known high-LD structure.
The ‘AncestryCheck
’ provides an additional layer of
refinement by offering the option to filter out genomic ranges
characterized by high-LD [4]. This feature is particularly beneficial
for genomic builds hg19 and hg38, as it targets and removes regions
within these builds that are known for their high-LD structures. These
specific high-LD region files are already incorporated into the GXwasR
package, readily accessible for use in AncestryCheck
. This
targeted filtering is crucial in further enhancing the quality and
accuracy of the genetic data being analyzed, ensuring that the resulting
ancestry estimation is as precise and reliable as possible.
Genotype Data Processing for Merge:
The function includes a meticulous process to prepare genotype data for merging, addressing key issues that could otherwise compromise the accuracy of the analysis:
Correcting Mismatches: The function systematically
corrects mismatches in chromosome IDs, positions, and allele
annotations. This step is essential to ensure that the study and
reference data align accurately, providing a reliable basis for
subsequent analysis. By rectifying these discrepancies,
AncestryCheck
safeguards against potential errors that
could arise from inconsistencies in the datasets.
Handling Allele Flips: Allele flips are a frequent complication in genotype data, where alleles may be reported in reverse order. The function is adept at identifying and rectifying these flips, thereby preserving the integrity of the genetic information. Given the importance of accurately matching SNPs, the ability to correct allele flips is crucial for error-free dataset merging.
Dealing with Sex Chromosomes: The encoding of sex
chromosomes often differs from autosomal chromosomes, which can
introduce additional complexity in matching study and reference
datasets. AncestryCheck
is designed to handle these
variations effectively.
Removing Mismatches Post-Flipping: After the allele flipping process, the function performs a thorough check to identify any remaining mismatches. Any alleles that are still not aligned correctly after the flipping process are identified and removed from the reference dataset. This final step ensures that only perfectly matched SNPs are retained for the merged dataset, thereby enhancing the accuracy of the ancestral analysis.
Through these comprehensive steps, AncestryCheck
ensures
that the genotype datasets are optimally prepared for merging, laying a
solid foundation for accurate and reliable ancestry estimation.
Principal Component Analysis (PCA):
Merging Datasets: After preprocessing, the study and reference datasets are merged.
Executing PCA: PCA is then performed on the combined dataset, providing insights into the genetic variance and ancestral components.
By implementing PCA on the harmonized genetic data,
AncestryCheck
effectively translates the nuanced genetic
variations into interpretable patterns, facilitating a deeper
understanding of the ancestral backgrounds inherent in the study
population. This step is fundamental in transforming raw genetic data
into meaningful insights about human ancestry and genetic diversity.
Reference Populations:
HapMap: We combined and used HapMap data for the following populations: AFR = ASW (African Ancestry in SW USA), LWK (Luhya in Webuye, Kenya), MKK (Maasai in Kinyawa, Kenya), and YRI (Yoruba in Ibadan, Nigeria). EAS = CHB (Han Chinese in Beijing, China), CHD (Chinese in Metropolitan Denver, CO, USA), and JPT (Japanese in Tokyo, Japan). EUR = CEU (CEPH/Utah Collection) and TSI (Toscani in Italia). SAS = GIH (Gujarati Indians in Houston, Texas, USA).
1000Genomes: We used 1000Genomes data for the following super populations: AFR = ASW (African Ancestry in Southwest US), ACB (African Caribbean in Barbados), ESN (Esan in Nigeria), GWD (Gambian in Western Division, The Gambia), LWK (Luhya in Webuye, Kenya), MSL (Mende in Sierra Leone), and YRI (Yoruba in Ibadan, Nigeria). EAS = CDX (Chinese Dai in Xishuangbanna, China), CHB (Han Chinese in Bejing, China), JPT (Japanese in Tokyo, Japan), KHV (Kinh in Ho Chi Minh City, Vietnam), and CHS (Southern Han Chinese, China). EUR = GBR (British in England and Scotland), FIN (Finnish in Finland), IBS (Iberian populations in Spain), TSI (Toscani in Italy), and CEU (Utah residents with Northern and Western European ancestry). SAS = BEB (Bengali in Bangladesh), GIH (Gujarati Indian in Houston,TX), ITU (Indian Telugu in the UK), PJL (Punjabi in Lahore,Pakistan), and STU (Sri Lankan Tamil in the UK).
More information can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20131219.populations.tsv
Outlier Detection and Ancestry Assignment Analysis:
Flexibility in Reference Population Selection: A key
feature of the function is its versatility in handling various reference
populations. Users have the freedom to select a specific reference
population against which the study samples’ ancestry labels are
compared. For instance, setting the parameter outlierOf
to
EUR (i.e., Ref_EUR
) enables the identification of
non-European outliers in a European context. Other reference populations
like AFR (African), EAS (East Asian), and SAS (South Asian) can also be
used, allowing for a wide range of comparative ancestral analyses. This
flexibility ensures that AncestryCheck
can cater to diverse
research needs and objectives.
Setting Outlier/Prediction Thresholds: The function
allows users to define specific thresholds for outlier
detection/ancestry assignement, adding a layer of customization to the
analysis. It employs a methodical approach by calculating PC1 and PC2
Z-scores for each subject in relation to mean and standard deviation of
PC1 and PC2 for each reference population. Samples with PC1 and PC2
Z-scores within the user-defined threshold are predicted as having the
same ancestry as that reference population. For example, in the case of
European reference samples, the function computes the mean and standard
deviation of the first two principal components
(mean(PC1 of Ref_EUR)
, sd(PC1 of Ref_EUR
),
mean(PC2 of Ref_EUR)
, sd(PC2 of Ref_EUR)
). It
then calculates Z-scores for PC1 and PC2
(Z-score PC1 of Ref_EUR <- (PC1 - mean(PC1 of Ref_EUR))/sd(PC1 of Ref_EUR)
,
Z-score PC2 of Ref_EUR <- (PC2 - mean(PC2 of Ref_EUR))/sd(PC2 of Ref_EUR)
) for the samples in relation to the European Reference. Any study
sample whose PC1 and PC2 Z-score is considered an outlier of the
European population. This approach is particularly effective in
pinpointing samples with ancestral backgrounds that significantly
deviate from the selected reference population, providing a clear and
quantifiable method to identify and analyze outliers.
Visualization and Reporting:
PCA Plot Visualization: One of the most insightful
aspects of AncestryCheck
is its ability to visually present
the results of the PCA. This is done through a color-coded scatter plot
that displays the first two principal components. This graphical
representation is instrumental in highlighting the genetic similarities
and differences among the analyzed populations. By plotting individuals
on this scatter plot, researchers can easily discern the underlying
genetic structure and observe how study samples cluster with or diverge
from various reference populations.
Resultant Dataframe: Following the PCA and outlier
analysis, AncestryCheck
generates a dataframe that lists
the identified outlier samples. This output is particularly useful for
researchers, as it provides a clear and organized view of the samples
that significantly differ genetically from the chosen reference
population. This information is crucial for further detailed
investigations and interpretations of genetic ancestry.
Comprehensive Reporting: In addition to the visual
and tabular outputs, the function also produces detailed reports
covering various aspects of the analysis. These reports include
information on allele flips and a thorough breakdown of the outlier
analysis. Such comprehensive reporting enhances the overall
understanding of the genetic structure and ancestry estimations,
providing a multi-dimensional view of the results. This feature of
AncestryCheck
is particularly valuable for researchers who
require in-depth documentation and interpretation of their genetic
data.
AncestryCheck
parameters in brief
The AncestryCheck
function takes several arguments to
customize its operations:
-
DataDir
: Directory containing input PLINK files. -
ResultDir
: Directory for storing output files; defaults to a temporary directory. -
finput
: Prefix of PLINK binary files for study samples. -
reference
: Choice of reference population data, either ‘HapMapIII_NCBI36’ or ‘ThousandGenome’. -
filterSNP
: Flag to enable or disable SNP filtering. -
studyLD and referLD
: Flags for applying LD-based filtering on study and reference data, respectively. -
studyLD_window_size
,studyLD_step_size
,studyLD_r2_threshold
,referLD_window_size
,referLD_step_size
,referLD_r2_threshold
: Parameters for LD pruning. -
highLD_regions
: Dataframe of known high LD regions. -
study_pop
: Dataframe containing sample IDs and ancestry labels for study samples. -
outlier
: Flag to enable or disable outlier detection. -
outlierOf
: Reference ancestry name for detecting outlier samples. -
outlier_threshold
: Numeric threshold for outlier detection.
Detailed parameter description
Before exploring the practical application of the
AncestryCheck
function, it is essential to understand the
details of the above arguments and their implications:
-
DataDir
: This parameter requires a character string specifying the file path for the input PLINK binary files. It is essential for indicating where the study’s genetic data is stored. -
ResultDir
(default = tempdir()): This argument is a character string for the file path where all output files will be stored. If not specified, it defaults to R’s temporary directory. -
finput
: This parameter specifies the prefix of the input PLINK binary files for the study samples. -
reference
(default = c(“HapMapIII_NCBI36”, “ThousandGenome”)): A vector indicating the choice of reference population. The options include HapMap Phase 3 and 1000 Genomes Phase III. -
filterSNP
(default = TRUE): Indicates whether to filter out SNPs. It’s set to TRUE by default but can be changed to FALSE if direct joining of study and reference samples is possible. -
studyLD
(default = TRUE): This Boolean value determines whether LD-based filtering is applied to study genotype data. -
studyLD_window_size
(default = 50): An integer defining the window size in variant count or kilobase for LD-based filtering of the variants for the study data. -
studyLD_step_size
(default = 5): Specifies the variant count to shift the window at the end of each step for LD filtering of the study data. -
studyLD_r2_threshold
(default = 0.02): A numeric value between 0 and 1 setting the pairwise r2 threshold for LD-based filtering of the study data. -
referLD
(default = FALSE): Determines whether LD-based filtering is applied to reference genotype data. -
referLD_window_size
(default = 50), referLD_step_size (default = 5), referLD_r2_threshold (default = 0.02): These parameters define the window size, step size, and r2 threshold for LD filtering of the reference data, similar to the study data settings. -
highLD_regions
: This parameter requires a dataframe containing known high LD regions. A dataframe from Anderson et. al.[4], ‘highLD_hg19’ from hg19 genome built is provided in the package. -
study_pop
: A dataframe containing two columns for the study: sample ID (IID) and the ancestry label. An example dataframe prefixed as ‘example_data_study_sample_ancestry’ for ‘GXwasR_example’ data. -
outlier
(default = FALSE): A Boolean value specifying whether outlier detection will be performed. -
outlierOf
(default = “EUR”): This parameter specifies the reference ancestry name for detecting outlier samples. -
outlier_threshold
(default = 3): Specifies the numeric threshold for outlier detection. This threshold is multiplied with the Euclidean distance from the center of the PC1 and PC2 to the maximum Euclidean distance of the reference samples to determine outliers.
These parameters and their default values allow for a significant
degree of customization in the AncestryCheck
function,
enabling users to tailor the analysis to their specific research
needs.
AncestryCheck Usage using HapMapIII Reference
The function can be used in the following manner with our study data and reference data as HapMapIII:
# Define parameters
library(GXwasR)
data("highLD_hg19", package = "GXwasR")
data("example_data_study_sample_ancestry", package = "GXwasR")
# Call the AncestryCheck function
hapmap_results <- AncestryCheck(
DataDir = GXwasR:::GXwasR_data(),
ResultDir = tempdir(),
finput = "GXwasR_example",
reference = "HapMapIII_NCBI36",
filterSNP = TRUE,
studyLD = TRUE,
studyLD_window_size = 50,
studyLD_step_size = 5,
studyLD_r2_threshold = 0.02,
referLD = FALSE,
referLD_window_size = 50,
referLD_step_size = 5,
referLD_r2_threshold = 0.02,
highLD_regions = highLD_hg19,
study_pop = example_data_study_sample_ancestry,
outlier = TRUE,
outlierOf = "EUR",
outlier_threshold = 3
)
#> ℹ 'HapMapIII_NCBI36' reference data found at /Users/mayerdav/Downloads/HapMapIII_NCBI36.
#> Using PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
#> 4214 ambiguous SNPs removed from study data.
#> 111854 ambiguous SNPs removed from reference data.
#> ! LD pruning is recommended for reference dataset. Set referLD = TRUE.
#> ✔ LD pruning was performed for study dataset.
#>
#> ℹ Number of overlapping SNPs between study and reference data using rsID: 1096
#> ℹ No allele flips between study and reference data.
#> ℹ 9 samples are outliers of selected reference population.
#> ℹ 179 samples are NOT outliers of selected reference population.
#> This message is displayed once every 8 hours.
hapmap_results$pca_plot

Samples_with_predicted_ancestry <- hapmap_results[["Samples_with_predicted_ancestry"]]
outlier <- hapmap_results[["Outlier_samples"]]
outlier
#> FID IID
#> 1 EUR_FIN HG00189
#> 2 EUR_FIN HG00280
#> 3 EUR_FIN HG00315
#> 4 EUR_FIN HG00332
#> 5 EUR_FIN HG00349
#> 6 EUR_FIN HG00350
#> 7 EUR_FIN HG00377
#> 8 EUR_FIN HG00383
#> 9 EUR_TSI NA20537
Samples_with_predicted_ancestry[seq_len(5), ]
#> FID IID Predicted Ancestry
#> 1 EUR_GBR HG00096 EUR
#> 2 EUR_GBR HG00097 EUR
#> 3 EUR_GBR HG00099 EUR
#> 4 EUR_GBR HG00100 EUR
#> 5 EUR_GBR HG00101 EUR
The plot shows the distribution of individual genetic samples along
the first two principal components (PC1 and PC2). Each point represents
a genetic sample, and they are color-coded based on reference
populations (Ref_ASIAN
, Ref_CEU
, and
Ref_YRI
) and study populations (Study_EUR_FIN
,
Study_EUR_GBR
, and Study_EUR_TSI
).
The points are spread across the PCA plot, with each cluster likely
representing genetic similarity within a group and differences between
the groups. The reference populations, Ref_AFR
,
Ref_EAS
, Ref_EUR
, and Ref_SAS
clustered distinctively, suggesting clear genetic differentiation
between these populations. The study populations overlap with the
Ref_EUR
(Euoprean
) reference population,
indicating European ancestry, with each study subgroup
(FIN, GBR, TSI
) forming subclusters, suggesting finer-scale
differences in ancestry within Europe.
Since the study dataset was simulated from CEU 1000 Genome dataset, it is not surprising that we do see some outlier samples. Specifically, TSI population is not highly similar to CEU.
AncestryCheck Using ThousandGenome Reference
The function can be used in the following manner with our study data and reference data as ThousandGenome:
# Define parameters
data("highLD_hg19", package = "GXwasR")
data("example_data_study_sample_ancestry", package = "GXwasR")
thousandgenome_results <- AncestryCheck(
DataDir = GXwasR:::GXwasR_data(),
ResultDir = tempdir(),
finput = "GXwasR_example",
reference = "ThousandGenome",
highLD_regions = highLD_hg19,
study_pop = example_data_study_sample_ancestry,
studyLD = TRUE,
referLD = TRUE,
outlierOf = "EUR",
outlier = TRUE,
outlier_threshold = 5
)
#> 4214 ambiguous SNPs removed from study data.
#> No A-T or G-C SNPs found in reference data.
#> ✔ LD pruning was performed for reference dataset.
#> ✔ LD pruning was performed for study dataset.
#>
#> ℹ Number of overlapping SNPs between study and reference data using rsID: 1005
#> ℹ No allele flips between study and reference data.
#> ✖ There is no outlier sample.
#> ✖ There are no non-outlier samples.
thousandgenome_results$pca_plot

Samples_with_predicted_ancestry <- thousandgenome_results[["Samples_with_predicted_ancestry"]]
outlier <- thousandgenome_results[["Outlier_samples"]]
outlier
#> [1] FID IID
#> <0 rows> (or 0-length row.names)
Samples_with_predicted_ancestry[seq_len(5), ]
#> FID IID Predicted Ancestry
#> NA <NA> <NA> <NA>
#> NA.1 <NA> <NA> <NA>
#> NA.2 <NA> <NA> <NA>
#> NA.3 <NA> <NA> <NA>
#> NA.4 <NA> <NA> <NA>
The lack of outlier samples—those that would be significantly distanced from the reference group—suggests homogeneity in the ancestry of the study population. Given that our study data was simulated based on the EUR population from the 1000 Genomes project data, the absence of outliers aligns with expectations.
The precise clustering and absence of outliers not only reaffirm the simulation parameters but also shows the reliability of the ancestry estimation methods employed. Such accuracy is crucial in genetic studies, as it reinforces confidence in the PCA’s ability to discern and delineate ancestry with precision.
Conclusion
The combination of genotype datasets from study populations with comprehensive reference datasets, followed by PCA, provides a nuanced view of ancestry. The GXwasR package simplifies this process, offering researchers and enthusiasts a robust tool for ancestry estimation. This efficient approach not only enhances our understanding of genetic diversity but also strengthens our connection to our ancestral past.
References
Purcell et. al.,(2007). PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. The American Journal of Human Genetics, 81(3), 559-575. https://doi.org/10.1086/519795.
The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (2010). https://doi.org/10.1038/nature09298
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393
Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nat Protoc. 2010 Sep;5(9):1564-73. doi: 10.1038/nprot.2010.116. Epub 2010 Aug 26. PMID: 21085122; PMCID: PMC3025522.
Citing GXwasR
We hope that GXwasR will be useful for your research. Please use the following information to cite the package and the overall approach. Thank you!
## Citation info
citation("GXwasR")
#> To cite package 'GXwasR' in publications use:
#>
#> Bose B, Blostein F, Kim J, Winters J, Actkins KV, Mayer D, Congivaram H, Niarchou M, Edwards DV, Davis LK, Stranger BE (2025). "GXwasR: A Toolkit
#> for Investigating Sex-Differentiated Genetic Effects on Complex Traits." _medRxiv 2025.06.10.25329327_. doi:10.1101/2025.06.10.25329327
#> <https://doi.org/10.1101/2025.06.10.25329327>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits},
#> author = {Banabithi Bose and Freida Blostein and Jeewoo Kim and Jessica Winters and Ky’Era V. Actkins and David Mayer and Harrsha Congivaram and Maria Niarchou and Digna Velez Edwards and Lea K. Davis and Barbara E. Stranger},
#> journal = {medRxiv 2025.06.10.25329327},
#> year = {2025},
#> doi = {10.1101/2025.06.10.25329327},
#> }
Reproducibility
The GXwasR package (Bose, Blostein, Kim, Winters, Actkins, Mayer, Congivaram, Niarchou, Edwards, Davis, and Stranger, 2025) was made possible thanks to:
- R (R Core Team, 2025)
- BiocStyle (Oleś, 2025)
- knitr (Xie, 2025)
- RefManageR (McLean, 2017)
- rmarkdown (Allaire, Xie, Dervieux, McPherson, Luraschi, Ushey, Atkins, Wickham, Cheng, Chang, and Iannone, 2024)
- sessioninfo (Wickham, Chang, Flight, Müller, and Hester, 2025)
- testthat (Wickham, 2011)
This package was developed using biocthis.
R
session information.
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.5.1 (2025-06-13)
#> os macOS Sequoia 15.6
#> system aarch64, darwin24.4.0
#> ui unknown
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2025-08-08
#> pandoc 3.6.3 @ /Applications/Positron.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#> quarto 1.7.33 @ /usr/local/bin/quarto
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> abind 1.4-8 2024-09-12 [2] CRAN (R 4.5.0)
#> backports 1.5.0 2024-05-23 [2] CRAN (R 4.5.1)
#> bibtex 0.5.1 2023-01-26 [2] CRAN (R 4.5.0)
#> bigassertr 0.1.7 2025-06-27 [2] CRAN (R 4.5.1)
#> bigparallelr 0.3.2 2021-10-02 [2] CRAN (R 4.5.0)
#> bigsnpr 1.12.18 2024-11-26 [2] CRAN (R 4.5.1)
#> bigsparser 0.7.3 2024-09-06 [2] CRAN (R 4.5.1)
#> bigstatsr 1.6.2 2025-07-29 [2] CRAN (R 4.5.1)
#> Biobase 2.68.0 2025-04-15 [2] Bioconduc~
#> BiocGenerics 0.54.0 2025-04-15 [2] Bioconduc~
#> BiocIO 1.18.0 2025-04-15 [2] Bioconduc~
#> BiocManager 1.30.26 2025-06-05 [2] CRAN (R 4.5.0)
#> BiocParallel 1.42.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> BiocStyle 2.36.0 2025-04-15 [2] Bioconduc~
#> Biostrings 2.76.0 2025-04-15 [2] Bioconduc~
#> bit 4.6.0 2025-03-06 [2] CRAN (R 4.5.1)
#> bit64 4.6.0-1 2025-01-16 [2] CRAN (R 4.5.1)
#> bitops 1.0-9 2024-10-03 [2] CRAN (R 4.5.0)
#> brio 1.1.5 2024-04-24 [2] CRAN (R 4.5.1)
#> broom 1.0.9 2025-07-28 [2] CRAN (R 4.5.1)
#> BSgenome 1.76.0 2025-04-15 [2] Bioconduc~
#> cachem 1.1.0 2024-05-16 [2] CRAN (R 4.5.0)
#> calibrate 1.7.7 2020-06-19 [2] CRAN (R 4.5.0)
#> callr 3.7.6 2024-03-25 [2] CRAN (R 4.5.0)
#> car 3.1-3 2024-09-27 [2] CRAN (R 4.5.0)
#> carData 3.0-5 2022-01-06 [2] CRAN (R 4.5.0)
#> cli 3.6.5 2025-04-23 [2] CRAN (R 4.5.0)
#> codetools 0.2-20 2024-03-31 [4] CRAN (R 4.5.1)
#> cowplot 1.2.0 2025-07-07 [2] CRAN (R 4.5.1)
#> crayon 1.5.3 2024-06-20 [2] CRAN (R 4.5.0)
#> curl 6.4.0 2025-06-22 [2] CRAN (R 4.5.1)
#> data.table 1.17.8 2025-07-10 [2] CRAN (R 4.5.1)
#> DelayedArray 0.34.1 2025-04-17 [2] Bioconduc~
#> devtools * 2.4.5 2022-10-11 [3] CRAN (R 4.5.0)
#> digest 0.6.37 2024-08-19 [2] CRAN (R 4.5.0)
#> doParallel 1.0.17 2022-02-07 [2] CRAN (R 4.5.0)
#> doRNG 1.8.6.2 2025-04-02 [2] CRAN (R 4.5.0)
#> dplyr 1.1.4 2023-11-17 [2] CRAN (R 4.5.0)
#> ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.5.0)
#> evaluate 1.0.4 2025-06-18 [2] CRAN (R 4.5.1)
#> farver 2.1.2 2024-05-13 [2] CRAN (R 4.5.0)
#> fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.5.0)
#> flock 0.7 2016-11-12 [2] CRAN (R 4.5.1)
#> foreach 1.5.2 2022-02-02 [2] CRAN (R 4.5.0)
#> Formula 1.2-5 2023-02-24 [2] CRAN (R 4.5.0)
#> fs 1.6.6 2025-04-12 [2] CRAN (R 4.5.0)
#> gdsfmt 1.44.1 2025-07-09 [2] Bioconduc~
#> generics 0.1.4 2025-05-09 [2] CRAN (R 4.5.0)
#> GenomeInfoDb 1.44.1 2025-07-23 [2] Bioconduc~
#> GenomeInfoDbData 1.2.14 2025-04-21 [2] Bioconductor
#> GenomicAlignments 1.44.0 2025-04-15 [2] Bioconduc~
#> GenomicRanges 1.60.0 2025-04-15 [2] Bioconduc~
#> ggplot2 3.5.2 2025-04-09 [2] CRAN (R 4.5.0)
#> ggpubr 0.6.1 2025-06-27 [2] CRAN (R 4.5.1)
#> ggrepel 0.9.6 2024-09-07 [2] CRAN (R 4.5.1)
#> ggsignif 0.6.4 2022-10-13 [2] CRAN (R 4.5.0)
#> glue 1.8.0 2024-09-30 [2] CRAN (R 4.5.0)
#> gridExtra 2.3 2017-09-09 [2] CRAN (R 4.5.0)
#> gtable 0.3.6 2024-10-25 [2] CRAN (R 4.5.0)
#> GXwasR * 0.99.0 2025-08-08 [1] Bioconductor
#> hms 1.1.3 2023-03-21 [2] CRAN (R 4.5.0)
#> htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
#> htmlwidgets 1.6.4 2023-12-06 [2] CRAN (R 4.5.0)
#> httpuv 1.6.16 2025-04-16 [2] CRAN (R 4.5.1)
#> httr 1.4.7 2023-08-15 [2] CRAN (R 4.5.0)
#> IRanges 2.42.0 2025-04-15 [2] Bioconduc~
#> iterators 1.0.14 2022-02-05 [2] CRAN (R 4.5.0)
#> jsonlite 2.0.0 2025-03-27 [2] CRAN (R 4.5.0)
#> knitr 1.50 2025-03-16 [2] CRAN (R 4.5.0)
#> labeling 0.4.3 2023-08-29 [2] CRAN (R 4.5.0)
#> later 1.4.2 2025-04-08 [2] CRAN (R 4.5.1)
#> lattice 0.22-7 2025-04-02 [4] CRAN (R 4.5.1)
#> lifecycle 1.0.4 2023-11-07 [2] CRAN (R 4.5.0)
#> lubridate 1.9.4 2024-12-08 [2] CRAN (R 4.5.1)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.5.0)
#> MASS 7.3-65 2025-02-28 [4] CRAN (R 4.5.1)
#> mathjaxr 1.8-0 2025-04-30 [2] CRAN (R 4.5.1)
#> Matrix 1.7-3 2025-03-11 [4] CRAN (R 4.5.1)
#> MatrixGenerics 1.20.0 2025-04-15 [2] Bioconduc~
#> matrixStats 1.5.0 2025-01-07 [2] CRAN (R 4.5.0)
#> memoise 2.0.1 2021-11-26 [2] CRAN (R 4.5.0)
#> mime 0.13 2025-03-17 [2] CRAN (R 4.5.0)
#> miniUI 0.1.2 2025-04-17 [3] CRAN (R 4.5.0)
#> pillar 1.11.0 2025-07-04 [2] CRAN (R 4.5.1)
#> pkgbuild 1.4.8 2025-05-26 [2] CRAN (R 4.5.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.5.0)
#> pkgdev 0.1.0.9060 2025-08-04 [2] Github (dieghernan/pkgdev@e56f2a8)
#> pkgload 1.4.0 2024-06-28 [2] CRAN (R 4.5.0)
#> plyr 1.8.9 2023-10-02 [2] CRAN (R 4.5.1)
#> plyranges 1.28.0 2025-04-15 [2] Bioconduc~
#> poolr 1.2-0 2025-05-07 [2] CRAN (R 4.5.0)
#> prettyunits 1.2.0 2023-09-24 [2] CRAN (R 4.5.0)
#> processx 3.8.6 2025-02-21 [2] CRAN (R 4.5.1)
#> profvis 0.4.0 2024-09-20 [3] CRAN (R 4.5.0)
#> progress 1.2.3 2023-12-06 [2] CRAN (R 4.5.0)
#> promises 1.3.3 2025-05-29 [2] CRAN (R 4.5.0)
#> ps 1.9.1 2025-04-12 [2] CRAN (R 4.5.1)
#> purrr 1.1.0 2025-07-10 [2] CRAN (R 4.5.1)
#> qqman 0.1.9 2023-08-23 [2] CRAN (R 4.5.0)
#> R.methodsS3 1.8.2 2022-06-13 [2] CRAN (R 4.5.0)
#> R.oo 1.27.1 2025-05-02 [2] CRAN (R 4.5.0)
#> R.utils 2.13.0 2025-02-24 [2] CRAN (R 4.5.0)
#> R6 2.6.1 2025-02-15 [2] CRAN (R 4.5.0)
#> rbibutils 2.3 2024-10-04 [2] CRAN (R 4.5.1)
#> RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.5.0)
#> Rcpp 1.1.0 2025-07-02 [2] CRAN (R 4.5.1)
#> RCurl 1.98-1.17 2025-03-22 [2] CRAN (R 4.5.0)
#> Rdpack 2.6.4 2025-04-09 [2] CRAN (R 4.5.0)
#> RefManageR * 1.4.0 2022-09-30 [2] CRAN (R 4.5.1)
#> regioneR 1.40.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> remotes 2.5.0 2024-03-17 [2] CRAN (R 4.5.0)
#> restfulr 0.0.16 2025-06-27 [2] CRAN (R 4.5.1)
#> rjson 0.2.23 2024-09-16 [2] CRAN (R 4.5.0)
#> rlang 1.1.6 2025-04-11 [2] CRAN (R 4.5.0)
#> rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.5.0)
#> rmio 0.4.0 2022-02-17 [2] CRAN (R 4.5.0)
#> rngtools 1.5.2 2021-09-20 [2] CRAN (R 4.5.0)
#> rprojroot 2.1.0 2025-07-12 [2] CRAN (R 4.5.1)
#> Rsamtools 2.24.0 2025-04-15 [2] Bioconduc~
#> rstatix 0.7.2 2023-02-01 [2] CRAN (R 4.5.0)
#> rtracklayer 1.68.0 2025-04-15 [2] Bioconduc~
#> S4Arrays 1.8.1 2025-06-01 [2] Bioconductor 3.21 (R 4.5.0)
#> S4Vectors 0.46.0 2025-04-15 [2] Bioconduc~
#> scales 1.4.0 2025-04-24 [2] CRAN (R 4.5.0)
#> sessioninfo * 1.2.3 2025-02-05 [2] CRAN (R 4.5.1)
#> shiny 1.11.1 2025-07-03 [2] CRAN (R 4.5.1)
#> SNPRelate 1.42.0 2025-04-15 [2] Bioconduc~
#> SparseArray 1.8.1 2025-07-23 [2] Bioconduc~
#> stringi 1.8.7 2025-03-27 [2] CRAN (R 4.5.0)
#> stringr 1.5.1 2023-11-14 [2] CRAN (R 4.5.0)
#> sumFREGAT 1.2.5 2022-06-07 [2] CRAN (R 4.5.1)
#> SummarizedExperiment 1.38.1 2025-04-30 [2] Bioconductor 3.21 (R 4.5.0)
#> sys 3.4.3 2024-10-04 [2] CRAN (R 4.5.0)
#> testthat * 3.2.3 2025-01-13 [2] CRAN (R 4.5.1)
#> tibble 3.3.0 2025-06-08 [2] CRAN (R 4.5.0)
#> tidyr 1.3.1 2024-01-24 [2] CRAN (R 4.5.1)
#> tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.5.0)
#> timechange 0.3.0 2024-01-18 [2] CRAN (R 4.5.1)
#> tzdb 0.5.0 2025-03-15 [2] CRAN (R 4.5.1)
#> UCSC.utils 1.4.0 2025-04-15 [2] Bioconduc~
#> urlchecker 1.0.1 2021-11-30 [3] CRAN (R 4.5.0)
#> usethis * 3.1.0 2024-11-26 [2] CRAN (R 4.5.0)
#> vctrs 0.6.5 2023-12-01 [2] CRAN (R 4.5.0)
#> vroom 1.6.5 2023-12-05 [2] CRAN (R 4.5.1)
#> withr 3.0.2 2024-10-28 [2] CRAN (R 4.5.0)
#> xfun 0.52 2025-04-02 [2] CRAN (R 4.5.0)
#> XML 3.99-0.18 2025-01-01 [2] CRAN (R 4.5.0)
#> xml2 1.3.8 2025-03-14 [2] CRAN (R 4.5.1)
#> xtable 1.8-4 2019-04-21 [2] CRAN (R 4.5.0)
#> XVector 0.48.0 2025-04-15 [2] Bioconduc~
#> yaml 2.3.10 2024-07-26 [2] CRAN (R 4.5.0)
#>
#> [1] /private/var/folders/d6/gtwl3_017sj4pp14fbfcbqjh0000gp/T/RtmpcRh9ZZ/temp_libpath4e4d53c96c8f
#> [2] /Users/mayerdav/Library/R/arm64/4.5/library
#> [3] /opt/homebrew/lib/R/4.5/site-library
#> [4] /opt/homebrew/Cellar/r/4.5.1/lib/R/library
#> * ── Packages attached to the search path.
#>
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Bibliography
This vignette was generated using BiocStyle (Oleś, 2025) with knitr (Xie, 2025) and rmarkdown (Allaire, Xie, Dervieux et al., 2024) running behind the scenes.
Citations made with RefManageR (McLean, 2017).
[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.29. 2024. URL: https://github.com/rstudio/rmarkdown.
[2] B. Bose, F. Blostein, J. Kim, et al. “GXwasR: A Toolkit for Investigating Sex-Differentiated Genetic Effects on Complex Traits”. In: medRxiv 2025.06.10.25329327 (2025). DOI: 10.1101/2025.06.10.25329327.
[3] M. W. McLean. “RefManageR: Import and Manage BibTeX and BibLaTeX References in R”. In: The Journal of Open Source Software (2017). DOI: 10.21105/joss.00338.
[4] A. Oleś. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.36.0. 2025. DOI: 10.18129/B9.bioc.BiocStyle. URL: https://bioconductor.org/packages/BiocStyle.
[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2025. URL: https://www.R-project.org/.
[6] H. Wickham. “testthat: Get Started with Testing”. In: The R Journal 3 (2011), pp. 5–10. URL: https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.
[7] H. Wickham, W. Chang, R. Flight, et al. sessioninfo: R Session Information. R package version 1.2.3. 2025. DOI: 10.32614/CRAN.package.sessioninfo. URL: https://CRAN.R-project.org/package=sessioninfo.
[8] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.50. 2025. URL: https://yihui.org/knitr/.