Processing 1000 Genomes Phase 3 Data Set
Introduction
Hey there, genomic researchers!
When diving into genomic research, one of the first things we often need to tackle is understanding and working with reference genome datasets. A key resource in this field is the 1000 Genomes Project. In this tutorial, I’ll show you how to download and process the 1000 Genomes Phase 3 dataset with PLINK2, using either R or the command line. But before we get into the technical stuff, let’s take a moment to understand what the 1000 Genomes Project is and why it’s so important.
What is the 1000 Genomes Project?
The 1000 Genomes Project was a groundbreaking international effort aimed at creating the most detailed catalog of human genetic variation. Kicking off in 2008, the project’s goal was to sequence the genomes of at least 1,000 anonymous participants from diverse populations around the world. As the project progressed, it grew, and the Phase 3 release covers 2,504 individuals from 26 different populations.
The Impact of Sequencing Costs
Back in 2008, sequencing a human genome was incredibly expensive. But thanks to rapid advancements in sequencing technology, the costs have plummeted. This dramatic decrease has made genomic analysis more accessible and paved the way for large-scale projects like the 1000 Genomes Project. This affordability is opening doors for researchers everywhere, allowing us to explore genetic variation on a much larger scale than ever before.
Why is the 1000 Genomes Project Important?
Ancestry Evaluation: The 1000 Genomes dataset is a goldmine for understanding genetic diversity among different populations. It’s crucial for studies on ancestry and evolution. By analyzing genetic variations, researchers can trace lineage and migration patterns of various ancestries, including African, European, East Asian, South Asian, and American populations. Curious to learn more? Check out the project website at https://www.internationalgenome.org/.
Disease Association: This dataset provides a comprehensive map of genetic variation, which is essential for identifying genetic variants linked to diseases. It’s a key resource for genome-wide association studies (GWAS) that hunt for genetic markers tied to specific diseases.
Genetic Research: Researchers use this data to delve into population genetics and human evolution, and to create new methods for genetic analysis. The extensive dataset is invaluable for developing algorithms for genotype imputation, which predicts unobserved genotypes in study samples. This boosts the power of genetic studies and is fundamental for large-scale genomic research. For a deep dive into genotype imputation, check out this article: https://www.nature.com/articles/nrg2796.
Tutorial Overview
In this tutorial, we’ll walk through the following steps to process the 1000 Genomes Phase 3 dataset:
Set up the working environment.
Download and prepare the data.
Convert the data to PLINK binary format.
Let’s dive in!
Main Steps
This document provides detailed steps to download and process the 1000 Genomes Phase 3 dataset using PLINK2, driven either from R/RStudio or from the command line. The workflow involves downloading the necessary tools, preparing the data, and converting it to PLINK binary format. Each step is explained with the functions and parameters used.
This page is designed to be comprehensive, providing detailed descriptions of each step along with the corresponding code chunks. You can copy these code chunks and run them in R/RStudio or in a terminal.
Setup: The setup section includes defining the working directory and downloading and setting up PLINK2. Each code chunk is followed by a detailed explanation of the functions and parameters used.
Download and Prepare 1000 Genomes Data: This section explains how to download the necessary data files and decompress the .zst files.
Process the Data with PLINK2: Details the steps to convert the data to PLINK binary format, including explanations of each PLINK2 parameter.
Setup
Define Working Directory
First, we need to define the working directory where all files will be downloaded and processed. This directory will be used to store PLINK2 and the 1000 Genomes data files.
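FOR R USERS:
# Define the working directory
wdir <- tempdir() # Use a temporary directory for this example
dir.create(wdir, showWarnings = FALSE)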
FOR COMMAND LINE USERS:
# Define the working directory
wdir=$(mktemp -d)
mkdir -p "$wdir"
Download and Setup PLINK2
PLINK2 is a versatile tool for whole-genome association and population-based linkage analyses. We will download and unzip PLINK2, and set the appropriate permissions to make it executable.
FOR R USERS:
# Download PLINK2
plink2_url <- "https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_x86_64_20221024.zip"
plink2_dest <- file.path(wdir, "plink2_linux_x86_64_20221024.zip")
utils::download.file(plink2_url, destfile = plink2_dest, quiet = TRUE)
# Unzip PLINK2
utils::unzip(plink2_dest, exdir = wdir)
# Set executable permissions
plink2_path <- file.path(wdir, "plink2")
Sys.chmod(plink2_path, mode = "0777")
Explanation:
utils::download.file: Downloads the specified file from the URL to the destination path.
utils::unzip: Unzips the downloaded file into the specified directory.
Sys.chmod: Sets the file permissions to make PLINK2 executable.
FOR COMMAND LINE USERS:
#Set working directory:
wdir=your/working/directory
cd "$wdir"
#Download PLINK2:
wget -q -O plink2_linux_x86_64_20221024.zip https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_x86_64_20221024.zip
#Unzip PLINK2:
unzip -q plink2_linux_x86_64_20221024.zip
#Set executable permissions:
chmod 0777 plink2
Download and Prepare 1000 Genomes Data
Download Phase 3 Data
The 1000 Genomes Phase 3 dataset is a valuable resource for genetic research. We will download the necessary data files (.psam, .pvar.zst, and .pgen.zst) from the 2016-05-05 release (genome build 37, 2,504 samples) listed on the PLINK2 resources page (https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg).
Here’s a brief overview of each file type:
PSAM File:
Link: https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1
Description: The .psam file is a sample information file. It contains metadata about the individuals (samples) in the dataset. This typically includes columns for sample IDs, sex, phenotype information, and potentially other covariates or identifiers.
PVAR.ZST File:
Link: https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst?dl=1
Description: The .pvar.zst file is a compressed variant information file. The .pvar extension stands for “PLINK variant,” and this file contains information about the genetic variants (e.g., SNPs) included in the dataset. The .zst extension indicates that the file is compressed using the Zstandard (zstd) compression algorithm, which is efficient and widely used for compressing large genomic data files.
PGEN.ZST File:
Link: https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst?dl=1
Description: The .pgen.zst file is a compressed genotype data file. The .pgen file holds the genotype data for the individuals in the dataset, specifying the alleles at each genetic variant for each individual. Like the .pvar.zst file, it is compressed with the Zstandard algorithm.
These three files are used together in PLINK 2 to perform various genetic analyses. The .psam file provides the sample metadata, the .pvar.zst file provides the variant information, and the .pgen.zst file provides the actual genotype data. In this tutorial, the .pgen.zst file is decompressed before use, while the .pvar.zst file stays compressed and is read directly through the vzs modifier of --pfile.
Remember to check these links against the PLINK2 resources page (https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg), as they change from time to time.
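To make the .psam description a bit more concrete, here is a small sketch of how you might peek at such a file from the shell. It runs on a toy file, not on the real download: the sample IDs (S1 to S3) are made up, and the column names (#IID, SEX, SuperPop) are an assumption about the layout, so check the header of your own file before relying on them.

```shell
#!/bin/sh
set -eu

# Toy stand-in for a .psam file. The IDs and the column layout are
# assumptions for illustration only, not the real 1000 Genomes data.
demo_dir=$(mktemp -d)
psam="$demo_dir/toy.psam"
printf '%s\t%s\t%s\n' \
  '#IID' 'SEX' 'SuperPop' \
  'S1' '1' 'EUR' \
  'S2' '2' 'EUR' \
  'S3' '2' 'AFR' > "$psam"

# Count samples: every line that does not start with '#'.
n_samples=$(grep -vc '^#' "$psam")

# Tally samples per SuperPop value; the column is located by its header name.
per_pop=$(awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "SuperPop") col = i; next }
  { count[$col]++ }
  END { for (p in count) print p, count[p] }
' "$psam" | sort)

echo "samples: $n_samples"
echo "$per_pop"

rm -rf "$demo_dir"
```

Pointing the same two commands at a downloaded .psam file (instead of the toy file) gives a quick sanity check on the sample count before any heavier processing.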
FOR R USERS:
# Define URLs for 1000 Genomes Phase 3 data
urls <- list(
  psam = "https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1",
  pvar_zst = "https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst?dl=1",
  pgen_zst = "https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst?dl=1"
)
# Define destination file paths
files <- list(
  psam = file.path(wdir, "all_phase3.psam"),
  pvar_zst = file.path(wdir, "all_phase3.pvar.zst"),
  pgen_zst = file.path(wdir, "all_phase3.pgen.zst")
)
# Raise the download timeout (60 seconds by default), which is too short for files this large
options(timeout = max(3600, getOption("timeout")))
# Download the files
lapply(names(urls), function(x) {
  utils::download.file(urls[[x]], destfile = files[[x]], quiet = TRUE, mode = "wb")
})
Explanation:
urls: A list of URLs where the 1000 Genomes data files can be downloaded.
files: A list of destination file paths where the downloaded files will be saved.
utils::download.file: Downloads each file from the specified URL to the corresponding destination path.
FOR COMMAND LINE USERS:
#Set working directory:
wdir=your/working/directory
cd "$wdir"
#Define URLs for 1000 Genomes Phase 3 data:
declare -A urls=(
  ["psam"]="https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1"
  ["pvar_zst"]="https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst?dl=1"
  ["pgen_zst"]="https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst?dl=1"
)
#Define destination file paths (the .psam is saved as all_phase3.psam so all three files share one prefix):
declare -A files=(
  ["psam"]="$wdir/all_phase3.psam"
  ["pvar_zst"]="$wdir/all_phase3.pvar.zst"
  ["pgen_zst"]="$wdir/all_phase3.pgen.zst"
)
#Download the files:
for key in "${!urls[@]}"; do
  wget -q -O "${files[$key]}" "${urls[$key]}"
done
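Since these files are large, you probably don’t want to fetch them again every time you rerun the steps. Here is a minimal sketch of a helper that skips a download when the destination file is already there. The function name fetch_if_missing and the .part suffix are my own choices for illustration, not part of PLINK2 or wget.

```shell
# Download <url> to <dest> unless <dest> already exists and is non-empty.
# fetch_if_missing is a hypothetical helper name used for this sketch.
fetch_if_missing() {
  url=$1
  dest=$2
  if [ -s "$dest" ]; then
    echo "skip: $dest already present"
    return 0
  fi
  # Download under a temporary name first, so an interrupted transfer
  # never leaves a partial file under the final name.
  wget -q -O "$dest.part" "$url" && mv "$dest.part" "$dest"
}

# Example call, assuming $wdir is set as in the steps above:
# fetch_if_missing "https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1" "$wdir/all_phase3.psam"
```

The existence check only looks at whether the file is non-empty, so it cannot tell a complete file from a corrupted one. Delete the file by hand if you suspect a bad download.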
Decompress .zst Files
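Decompress the .pgen.zst file to a .pgen file using PLINK2. The .pvar.zst file can stay compressed, because the conversion step reads it directly.
FOR R USERS:
# Decompress .pgen.zst to .pgen
system(paste(shQuote(plink2_path), "--zst-decompress", shQuote(files$pgen_zst), ">", shQuote(file.path(wdir, "all_phase3.pgen"))))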
Explanation:
system: Executes the PLINK2 command to decompress the .pgen.zst file into a .pgen file.
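Before moving on to the conversion, it can save some head-scratching to confirm that all three input files are in place and non-empty. The sketch below shows one way to do that from the shell. The function name check_fileset is hypothetical, and the expected extensions (.pgen, .pvar.zst, .psam) simply follow the file names used in this tutorial.

```shell
# Return 0 only if <prefix>.pgen, <prefix>.pvar.zst and <prefix>.psam
# all exist and are non-empty; report each problem file on stderr.
# check_fileset is a hypothetical helper name used for this sketch.
check_fileset() {
  prefix=$1
  status=0
  for ext in pgen pvar.zst psam; do
    if [ ! -s "$prefix.$ext" ]; then
      echo "missing or empty: $prefix.$ext" >&2
      status=1
    fi
  done
  return "$status"
}

# Example call, assuming $wdir is set as in the steps above:
# check_fileset "$wdir/all_phase3" && echo "inputs look complete"
```

An empty .pgen file is the typical symptom of a decompression command that failed partway, since the shell creates the redirect target before PLINK2 writes anything to it.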
Process the Data with PLINK2
Convert to PLINK Binary Format
Use PLINK2 to process the data and convert it to PLINK binary format (.bed, .bim, .fam).
FOR R USERS:
# Define output file prefix
output_prefix <- file.path(wdir, "ThousandGenome")
# Run PLINK2 to make bed file (requires the sys package)
sys::exec_wait(
  plink2_path,
  args = c(
    "--pfile", file.path(wdir, "all_phase3"), "vzs",
    "--output-chr", "26",
    "--max-alleles", "2",
    "--rm-dup", "exclude-mismatch",
    "--make-bed",
    "--out", output_prefix
  )
)
Explanation:
sys::exec_wait: Runs the PLINK2 executable with the given arguments and waits for it to finish. It comes from the sys package, which must be installed.
--pfile: Specifies the prefix of the input files (.psam, .pvar, .pgen).
vzs: Modifier of --pfile indicating that the variant file is Zstandard-compressed (.pvar.zst).
--output-chr 26: Writes chromosome codes as numbers from 1 to 26, with X = 23, Y = 24, XY = 25, and MT = 26.
--max-alleles 2: Excludes variants with more than two alleles.
--rm-dup exclude-mismatch: Handles variants that share the same ID. One copy is kept when the duplicated records are identical, and all copies are removed when they differ.
--make-bed: Creates PLINK binary files (.bed, .bim, .fam).
--out: Specifies the output file prefix.
FOR COMMAND LINE USERS:
#Decompress .pgen.zst to .pgen:
"$wdir/plink2" --zst-decompress "$wdir/all_phase3.pgen.zst" > "$wdir/all_phase3.pgen"
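For command line users, the conversion itself can be run with the same flags that the R call passes to PLINK2. Here is a sketch that wraps them in a small function. The function name convert_to_bed and its three positional arguments are my own choices for illustration; the flags are exactly the ones explained above.

```shell
# Convert a PLINK 2 fileset to PLINK binary format (.bed, .bim, .fam),
# using the same flags as the R call in this tutorial.
# convert_to_bed is a hypothetical helper name used for this sketch.
# Usage: convert_to_bed <plink2 binary> <input prefix> <output prefix>
convert_to_bed() {
  "$1" \
    --pfile "$2" vzs \
    --output-chr 26 \
    --max-alleles 2 \
    --rm-dup exclude-mismatch \
    --make-bed \
    --out "$3"
}

# Example call, assuming the files from the earlier steps sit in $wdir:
# convert_to_bed "$wdir/plink2" "$wdir/all_phase3" "$wdir/ThousandGenome"
```

Passing the binary and both prefixes as arguments keeps the paths out of the function body, so the same helper works whether PLINK2 lives in the working directory or elsewhere on your system.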
Entire R script for downloading and processing 1000 Genomes data
# Define the working directory
wdir <- tempdir() # Use a temporary directory for this example
dir.create(wdir, showWarnings = FALSE)
# Define URLs for 1000 Genomes Phase 3 data
urls <- list(
  psam = "https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1",
  pvar_zst = "https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst?dl=1",
  pgen_zst = "https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst?dl=1"
)
# Define destination file paths
# (the .psam is saved as all_phase3.psam so all three files share one prefix)
files <- list(
  psam = file.path(wdir, "all_phase3.psam"),
  pvar_zst = file.path(wdir, "all_phase3.pvar.zst"),
  pgen_zst = file.path(wdir, "all_phase3.pgen.zst")
)
# Raise the download timeout (60 seconds by default), which is too short for files this large
options(timeout = max(3600, getOption("timeout")))
# Download the files
lapply(names(urls), function(x) {
  utils::download.file(urls[[x]], destfile = files[[x]], quiet = TRUE, mode = "wb")
})
# Download PLINK2
plink2_url <- "https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_x86_64_20221024.zip"
plink2_dest <- file.path(wdir, "plink2_linux_x86_64_20221024.zip")
utils::download.file(plink2_url, destfile = plink2_dest, quiet = TRUE)
# Unzip PLINK2
utils::unzip(plink2_dest, exdir = wdir)
# Set executable permissions
plink2_path <- file.path(wdir, "plink2")
Sys.chmod(plink2_path, mode = "0777")
# Decompress .pgen.zst to .pgen
system(paste(shQuote(plink2_path), "--zst-decompress", shQuote(files$pgen_zst), ">", shQuote(file.path(wdir, "all_phase3.pgen"))))
Entire command line script for downloading and processing 1000 Genomes data
#Define the working directory:
wdir=$(mktemp -d)
mkdir -p "$wdir"
#Define URLs for 1000 Genomes Phase 3 data:
declare -A urls=(
  ["psam"]="https://www.dropbox.com/s/6ppo144ikdzery5/phase3_corrected.psam?dl=1"
  ["pvar_zst"]="https://www.dropbox.com/s/odlexvo8fummcvt/all_phase3.pvar.zst?dl=1"
  ["pgen_zst"]="https://www.dropbox.com/s/y6ytfoybz48dc0u/all_phase3.pgen.zst?dl=1"
)
#Define destination file paths (the .psam is saved as all_phase3.psam so all three files share one prefix):
declare -A files=(
  ["psam"]="$wdir/all_phase3.psam"
  ["pvar_zst"]="$wdir/all_phase3.pvar.zst"
  ["pgen_zst"]="$wdir/all_phase3.pgen.zst"
)
#Download the files:
for key in "${!urls[@]}"; do
  wget -q -O "${files[$key]}" "${urls[$key]}"
done
#Download PLINK2:
plink2_url="https://s3.amazonaws.com/plink2-assets/alpha3/plink2_linux_x86_64_20221024.zip"
plink2_dest="$wdir/plink2_linux_x86_64_20221024.zip"
wget -q -O "$plink2_dest" "$plink2_url"
#Unzip PLINK2:
unzip -q "$plink2_dest" -d "$wdir"
#Set executable permissions:
chmod 0777 "$wdir/plink2"
#Decompress .pgen.zst to .pgen:
"$wdir/plink2" --zst-decompress "${files["pgen_zst"]}" > "$wdir/all_phase3.pgen"