AncestryCheck: Evaluation of the samples' ancestry label.

This function displays the result of the ancestry analysis in a color-coded scatter plot of the first two principal components for samples of the reference populations and the study population. Specifically, it compares the study samples' ancestry labels to a panel representing a reference population, and it also flags the outlier samples with respect to a chosen reference population.

Users are required to provide SNP IDs or rsIDs in the input PLINK files.

The function first filters the reference and study data to keep only non-ambiguous SNPs, i.e., it removes A-T and G-C SNPs. It next conducts LD pruning (when requested via studyLD and referLD), fixes chromosome mismatches between the reference and study datasets, checks for allele flips, updates the SNP positions, and flips the alleles where needed. The two datasets are then joined, and the resulting genotype dataset is subjected to Principal Component Analysis (PCA).
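The ambiguous-SNP filter described above can be illustrated with a small, self-contained sketch. This is not the package's internal code: the toy table and the column names (snp, a1, a2) are assumptions made for the illustration, loosely mimicking the allele columns of a PLINK .bim file.

```r
# Toy variant table; column names are hypothetical, chosen for this sketch.
bim <- data.frame(
  snp = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  a1  = c("A", "A", "G", "C", "T"),
  a2  = c("T", "G", "C", "T", "A"),
  stringsAsFactors = FALSE
)

# A SNP is strand-ambiguous when its two alleles are complementary bases
# (A/T or G/C): reading the other strand gives the same allele pair, so a
# strand flip between datasets cannot be detected.
is_ambiguous <- function(a1, a2) {
  paste0(a1, a2) %in% c("AT", "TA", "GC", "CG")
}

ambiguous <- is_ambiguous(bim$a1, bim$a2)

removed_ids <- bim$snp[ambiguous]   # SNPs that would be dropped
kept_ids <- bim$snp[!ambiguous]     # SNPs that would be retained
```

In this toy table, rs1, rs3, and rs5 are ambiguous and would be removed, while rs2 and rs4 are kept.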

The detection of population structure down to the level of the reference dataset can then be accomplished using PCA on this combined genotyping panel. For instance, the center of the European reference samples is determined from principal components 1 and 2 as (median(PC1 europeanRef), median(PC2 europeanRef)). The function then determines the maximum Euclidean distance (maxDist) of the European reference samples from this center.

Study samples whose Euclidean distance from the center is greater than or equal to the radius r = outlier_threshold * maxDist are considered non-European, i.e., outliers. This function uses the HapMap phase 3 data in NCBI36 and the 1000 Genomes phase III data in GRCh37. The study and reference datasets should be of the same genome build. If they are not, users need to lift over one of the datasets to the build of the other.
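The outlier rule can be sketched in a few lines of base R. The sketch below only mirrors the description in this section and is not the package's implementation: the PC scores are made-up numbers, and the names ref_eur, study, and dist_from_center are hypothetical.

```r
# Made-up PC scores for illustration; real values would come from the PCA
# of the merged study and reference genotypes.
ref_eur <- data.frame(
  PC1 = c(-0.02, -0.01, 0, 0.01, 0.02),
  PC2 = c(0.01, -0.01, 0, 0.02, -0.02)
)
study <- data.frame(
  IID = c("S1", "S2", "S3", "S4"),
  PC1 = c(0.03, 0.5, -0.05, 0.09),
  PC2 = c(0.02, 0.4, 0.01, 0)
)
outlier_threshold <- 3

# Center of the reference cluster: the per-component medians.
center <- c(median(ref_eur$PC1), median(ref_eur$PC2))

# Euclidean distance of each point from the center in the PC1-PC2 plane.
dist_from_center <- function(pc1, pc2) {
  sqrt((pc1 - center[1])^2 + (pc2 - center[2])^2)
}

# Largest distance of any reference sample from the center, and the radius.
max_dist <- max(dist_from_center(ref_eur$PC1, ref_eur$PC2))
radius <- outlier_threshold * max_dist

# Study samples at or beyond the radius are flagged as outliers.
study$dist <- dist_from_center(study$PC1, study$PC2)
study$outlier <- study$dist >= radius
outlier_ids <- study$IID[study$outlier]
```

Here the reference center is (0, 0) and maxDist is about 0.028, so the radius is about 0.085. Samples S2 and S4 lie beyond it and are flagged, while S1 and S3 are not.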

Usage

AncestryCheck(
  DataDir,
  ResultDir = tempdir(),
  finput,
  reference = c("HapMapIII_NCBI36", "ThousandGenome"),
  filterSNP = TRUE,
  studyLD = TRUE,
  studyLD_window_size = 50,
  studyLD_step_size = 5,
  studyLD_r2_threshold = 0.02,
  referLD = FALSE,
  referLD_window_size = 50,
  referLD_step_size = 5,
  referLD_r2_threshold = 0.02,
  highLD_regions,
  study_pop,
  outlier = FALSE,
  outlierOf = "EUR",
  outlier_threshold = 3
)

Arguments

DataDir: A character string for the file path of the input PLINK binary files.
ResultDir: A character string for the file path where all output files will be stored. The default is tempdir().
finput: Character string, specifying the prefix of the input PLINK binary files for the study samples.
reference: Character string, either 'HapMapIII_NCBI36' or 'ThousandGenome', specifying the HapMap phase 3 (International HapMap 3 Consortium 2010) or the 1000 Genomes phase III (1000 Genomes Project Consortium 2015) reference population, respectively. The default is 'HapMapIII_NCBI36'.
filterSNP: Boolean value, TRUE or FALSE, for filtering out the ambiguous (A-T and G-C) SNPs. The default is TRUE. We recommend setting it to FALSE only when users are sure that the study and reference samples can be joined directly.
studyLD: Boolean value, TRUE or FALSE, for applying linkage disequilibrium (LD)-based filtering on the study genotype data. The default is TRUE.
studyLD_window_size: Integer value, specifying a window size in variant count or kilobase for LD-based filtering of the variants for the study data.
studyLD_step_size: Integer value, specifying a variant count to shift the window at the end of each step for LD filtering for the study data.
studyLD_r2_threshold: Numeric value between 0 and 1, specifying the pairwise \(r^2\) threshold for LD-based filtering for the study data.
referLD: Boolean value, TRUE or FALSE, for applying linkage disequilibrium (LD)-based filtering on the reference genotype data. The default is FALSE.
referLD_window_size: Integer value, specifying a window size in variant count or kilobase for LD-based filtering of the variants for the reference data.
referLD_step_size: Integer value, specifying a variant count to shift the window at the end of each step for LD filtering for the reference data.
referLD_r2_threshold: Numeric value between 0 and 1, specifying the pairwise \(r^2\) threshold for LD-based filtering for the reference data.
highLD_regions: A data frame of known high-LD regions (Anderson et al. 2010). One such data frame is provided with the package.
study_pop: A data frame with two columns for the study samples: the sample ID (i.e., IID) in the first column and the ancestry label in the second column.
outlier: Boolean value, TRUE or FALSE, specifying whether outlier detection will be performed. The default is FALSE.
outlierOf: Character string, specifying the reference ancestry name for detecting outlier samples. The default is "EUR".
outlier_threshold: Numeric value, specifying the threshold used to detect outlier samples. This threshold is multiplied by the maximum Euclidean distance, in the PC1-PC2 plane, of the reference samples from their center. Study samples at or beyond the resulting distance from the center are considered outliers. The default is 3.
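As a minimal sketch of the study_pop argument described above, the following builds a two-column data frame by hand. The sample IDs, the ancestry labels, and the column names (IID, Ancestry) are hypothetical; only the column order (sample ID first, ancestry label second) is taken from the argument description.

```r
# Hypothetical sample IDs and ancestry labels, for illustration only.
study_pop <- data.frame(
  IID = c("ID001", "ID002", "ID003"),
  Ancestry = c("EUR", "EUR", "AFR"),
  stringsAsFactors = FALSE
)

# Basic shape checks before passing the table on: exactly two columns,
# no duplicated sample IDs, and no missing ancestry labels.
stopifnot(
  ncol(study_pop) == 2,
  !anyDuplicated(study_pop$IID),
  !anyNA(study_pop$Ancestry)
)
```

In practice, the IDs in the first column would need to match the IIDs in the study PLINK files.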

Value

A list containing three data frames: one with the IDs of outlier samples (Outlier_samples), another with samples annotated with predicted ancestry (Samples_with_predicted_ancestry), and one with the IDs of non-outlier samples (Non_outlier_samples). A PCA plot is also returned.

References

The 1000 Genomes Project Consortium (2015). “A global reference for human genetic variation.” Nature, 526, 68–74. doi:10.1038/nature15393.

The International HapMap 3 Consortium (2010). “Integrating common and rare genetic variation in diverse human populations.” Nature, 467, 52–58. doi:10.1038/nature09298.

Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT (2010). “Data quality control in genetic case-control association studies.” Nature Protocols, 5(9), 1564–1573. doi:10.1038/nprot.2010.116, http://www.ncbi.nlm.nih.gov/pubmed/21085122.

Author

Banabithi Bose

Examples

data("highLD_hg19", package = "GXwasR")
data("example_data_study_sample_ancestry", package = "GXwasR")
DataDir <- GXwasR:::GXwasR_data()
ResultDir <- tempdir()
finput <- "GXwasR_example"
reference <- "HapMapIII_NCBI36"
highLD_regions <- highLD_hg19
study_pop <- example_data_study_sample_ancestry # PreimputeEX
studyLD_window_size <- 50
studyLD_step_size <- 5
studyLD_r2_threshold <- 0.02
filterSNP <- TRUE
studyLD <- FALSE
referLD <- FALSE
referLD_window_size <- 50
referLD_step_size <- 5
referLD_r2_threshold <- 0.02
outlier <- TRUE
outlier_threshold <- 3
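# Pass every setting defined above explicitly; the LD window, step, and r2
# values equal the function defaults.
x <- AncestryCheck(
    DataDir = DataDir, ResultDir = ResultDir, finput = finput,
    reference = reference, filterSNP = filterSNP,
    studyLD = studyLD, studyLD_window_size = studyLD_window_size,
    studyLD_step_size = studyLD_step_size,
    studyLD_r2_threshold = studyLD_r2_threshold,
    referLD = referLD, referLD_window_size = referLD_window_size,
    referLD_step_size = referLD_step_size,
    referLD_r2_threshold = referLD_r2_threshold,
    highLD_regions = highLD_regions, study_pop = study_pop,
    outlier = outlier, outlierOf = "EUR", outlier_threshold = outlier_threshold
)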
#> ℹ 'HapMapIII_NCBI36' reference data found at /Users/mayerdav/Downloads/HapMapIII_NCBI36.
#> This message is displayed once every 8 hours.
#> Using PLINK v1.9.0-b.7.7 64-bit (22 Oct 2024)
#> This message is displayed once every 8 hours.
#> 4214 ambiguous SNPs removed from study data.
#> 111854 ambiguous SNPs removed from reference data.
#> ! LD pruning is recommended for reference dataset. Set referLD = TRUE.
#> ! LD pruning is recommended for study dataset. Set studyLD = TRUE.
#> 
#> ℹ Number of overlapping SNPs between study and reference data using rsID: 3722
#> ℹ No allele flips between study and reference data.
#> ℹ 20 samples are outliers of selected reference population.
#> ℹ 168 samples are NOT outliers of selected reference population.