https://github.com/broadinstitute/gnomad_local_ancestry
Hail batch pipeline and scripts for local ancestry inference
Repository
Basic Info
- Host: GitHub
- Owner: broadinstitute
- License: mit
- Language: Python
- Default Branch: main
- Size: 159 KB
Statistics
- Stars: 6
- Watchers: 10
- Forks: 0
- Open Issues: 32
- Releases: 1
Metadata Files
README.md
gnomAD Local Ancestry Inference (LAI) Pipeline
This repository provides a streamlined pipeline for performing Local Ancestry Inference (LAI) on gnomAD samples. Because of gnomAD's scale, we implemented the pipeline with the Hail Batch Python module, which is still in beta testing and exclusive to the Broad. However, the tools used within the pipeline are all publicly available.
The pipeline leverages Eagle for phasing, RFMix v2 for ancestry painting, Tractor for extracting ancestry-specific allele frequencies, and generate_output_vcf.py for generating a joint VCF with ancestry-specific allele count (AC), allele number (AN), and allele frequency (AF) annotations. Below, we outline the steps to run this pipeline on your dataset using Hail/Python scripts.
Getting Started
To run the LAI pipeline, set up your environment and install the required dependencies. Follow the step-by-step instructions below.
1. Pipeline Overview
This pipeline processes genomic data by:
- Phasing haplotypes using Eagle
- Inferring local ancestry using RFMix v2
- Extracting ancestry-specific allele frequencies from phased and painted data using Tractor
- Generating a joint VCF with ancestry-specific calls using generate_output_vcf.py
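The first three steps chain together: each tool's output file is the next tool's input. A minimal sketch of that hand-off is below. The function name `build_lai_commands` and its parameters are hypothetical and are not part of this repository. The command-line flags are copied from the commands in section 3 of this README and have not been checked against each tool's own documentation.

```python
from typing import List


def build_lai_commands(
    input_vcf: str,
    reference_vcf: str,
    sample_map: str,
    genetic_map: str,
    chromosome: str,
    num_ancs: int,
    prefix: str = "painted_lai",
) -> List[List[str]]:
    """Return the Eagle, RFMix v2, and Tractor commands in run order.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    phased_vcf = "phased_data.vcf.gz"  # Eagle output, reused by both later steps
    return [
        # Step 1: phasing
        ["eagle", "--vcf", input_vcf, "--out", phased_vcf],
        # Step 2: local ancestry painting on the phased VCF
        [
            "rfmix",
            "-f", phased_vcf,
            "-r", reference_vcf,
            "-m", sample_map,
            "-g", genetic_map,
            "-o", prefix,
            f"--chromosome={chromosome}",
        ],
        # Step 3: Tractor reads the phased VCF plus the RFMix .msp.tsv output
        [
            "python3", "extract_tracts.py",
            "--vcf", phased_vcf,
            "--msp", f"{prefix}.msp.tsv",
            "--num-ancs", str(num_ancs),
        ],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` so that a failing step stops the chain before the next tool starts.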
2. Installation & Setup
Step 1: Clone the Repository
```bash
git clone https://github.com/broadinstitute/gnomad_local_ancestry.git
```
Step 2: Install Dependencies
The pipeline requires Python 3, Hail, and several additional tools. Install the necessary dependencies using:
```bash
pip install hail
pip install numpy pandas
```
Make sure you have Eagle and RFMix v2 installed. You can find installation instructions and a toy dataset in the Tractor Tutorial.
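A quick way to confirm the external tools are reachable before starting a run is to look them up on `PATH`. This is a sketch only. The executable names `eagle` and `rfmix` are taken from the commands later in this README, and your installation may use different names or locations. The helper names are hypothetical.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def find_tools(names: Iterable[str] = ("eagle", "rfmix")) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = ("eagle", "rfmix")) -> List[str]:
    """Return the executable names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]
```

Calling `missing_tools()` returns an empty list when both executables are found.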
3. Running the LAI Pipeline
Step 1: Phasing
To phase your genotype data using Eagle, run:
```bash
eagle --vcf input_data.vcf.gz --out phased_data.vcf.gz
```
Refer to the Tractor wiki for a detailed guide on phasing.
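Because the later steps expect phased input, it can help to spot-check a few records of the Eagle output. In VCF, phased genotypes use `|` between alleles and unphased genotypes use `/`. The sketch below checks a single data line. The function names are hypothetical, and it assumes `GT` is the first FORMAT sub-field, as the VCF specification requires when `GT` is present.

```python
def is_phased_genotype(sample_field: str) -> bool:
    """True if the GT sub-field of one VCF sample column uses the phased separator '|'."""
    gt = sample_field.split(":", 1)[0]  # GT is assumed to be the first sub-field
    return "|" in gt and "/" not in gt


def record_is_phased(vcf_line: str) -> bool:
    """True if every sample genotype on one tab-delimited VCF data line is phased."""
    fields = vcf_line.rstrip("\n").split("\t")
    samples = fields[9:]  # the first nine columns are CHROM through FORMAT
    return bool(samples) and all(is_phased_genotype(s) for s in samples)
```

Haploid calls such as `0` contain no separator and are reported as not phased by this check, so treat it as a sanity check rather than a validator.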
Step 2: Local Ancestry Inference using RFMix
After phasing, run RFMix v2 to infer local ancestry:
```bash
rfmix \
  -f phased_data.vcf.gz \
  -r reference_panel.vcf.gz \
  -m samplemap.txt \
  -g geneticmap.txt \
  -o painted_lai \
  --chromosome=22
```
See the Tractor wiki for additional instructions on reference panels and sample maps.
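The sample map passed with `-m` is a tab-delimited text file that assigns each reference sample to a reference population, one sample per line. A minimal sketch for writing one is below. The function name, the sample IDs, and the population labels are placeholders rather than values from this pipeline, so check the RFMix documentation for the exact requirements of your version.

```python
import csv
from typing import Mapping


def write_sample_map(reference_labels: Mapping[str, str], path: str = "samplemap.txt") -> None:
    """Write one 'sample_id<TAB>population' line per reference sample, with no header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, population in reference_labels.items():
            writer.writerow([sample_id, population])
```

For example, `write_sample_map({"ref_sample_1": "AFR", "ref_sample_2": "EUR"})` writes a two-line file in the current directory.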
Step 3: Extracting Ancestry-Specific Allele Frequencies
Once local ancestry inference is complete, extract ancestry-specific allele frequencies using Tractor:
```bash
python3 extract_tracts.py \
  --vcf phased_data.vcf.gz \
  --msp painted_lai.msp.tsv \
  --num-ancs 2
```
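The value given to `--num-ancs` should match the number of reference populations RFMix painted with. One way to check this is to read the first line of the `.msp.tsv` file. The sketch below assumes that line has the form `#Subpopulation order/codes: AFR=0<TAB>EUR=1`, and it will not work if your RFMix version writes a different header. The function name is hypothetical.

```python
from typing import Dict


def parse_msp_codes(first_line: str) -> Dict[str, int]:
    """Parse the population-to-code mapping from the first line of an RFMix .msp.tsv file.

    Assumes the form '#Subpopulation order/codes: NAME=CODE NAME=CODE ...'
    with whitespace between entries and no whitespace inside names.
    """
    _, _, codes = first_line.strip().partition(":")
    mapping = {}
    for token in codes.split():
        name, _, code = token.partition("=")
        mapping[name] = int(code)
    return mapping
```

The length of the returned mapping is the ancestry count. With the example header above, `len(parse_msp_codes(line))` is 2, matching `--num-ancs 2`.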
Step 4: Generating a Joint VCF with Ancestry-Specific Annotations
The generate_lai_vcf function calls generate_output_vcf.py, a standalone Python/Hail script. This script outputs an annotated VCF containing ancestry-specific allele frequency data:
```bash
python3 generate_output_vcf.py \
  --msp-file painted_lai.msp.tsv \
  --tractor-output tractor_output_path \
  --output-path output_lai \
  --is-zipped \
  --mt-path-for-adj pipeline_input.mt \
  --add-gnomad-af
```
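The three ancestry-specific annotations are related by AF = AC / AN, computed separately for each ancestry. The sketch below illustrates that relationship for a single variant. It is not the code in `generate_output_vcf.py`. The function name and the ancestry labels are placeholders, and returning `None` when AN is zero is one possible convention, not necessarily the script's.

```python
from typing import Dict, Optional


def ancestry_af(ac: Dict[str, int], an: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Compute allele frequency per ancestry as AC / AN.

    Returns None for an ancestry with AN == 0, where the frequency is undefined.
    An ancestry missing from `ac` is treated as having AC == 0.
    """
    return {
        ancestry: (ac.get(ancestry, 0) / number if number > 0 else None)
        for ancestry, number in an.items()
    }
```

For example, an allele count of 5 out of 100 haplotypes painted as one ancestry gives a frequency of 0.05 for that ancestry, regardless of the counts on other ancestral backgrounds.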
4. Additional Resources
For detailed explanations of phasing, local ancestry painting, and extracting tracts, refer to:
For our Hail Batch Python pipeline, refer to:
5. Citation
If you use this pipeline in your research, please cite:
Kore, P., Wilson, M. et al., Improved Allele Frequencies in gnomAD through Local Ancestry Inference.
Please direct questions to pragati.kore@bcm.edu or mwilson@broadinstitute.org.
Owner
- Name: Broad Institute
- Login: broadinstitute
- Kind: organization
- Location: Cambridge, MA
- Website: http://www.broadinstitute.org/
- Twitter: broadinstitute
- Repositories: 1,083
- Profile: https://github.com/broadinstitute
Broad Institute of MIT and Harvard
GitHub Events
Total
- Release event: 1
- Watch event: 3
- Issue comment event: 1
- Push event: 9
- Pull request review comment event: 3
- Pull request event: 2
- Create event: 1
Last Year
- Release event: 1
- Watch event: 3
- Issue comment event: 1
- Push event: 9
- Pull request review comment event: 3
- Pull request event: 2
- Create event: 1
Issues and Pull Requests
Last synced: about 1 year ago
All Time
- Total issues: 178
- Total pull requests: 6
- Average time to close issues: 5 months
- Average time to close pull requests: 2 months
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 0.57
- Average comments per pull request: 1.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mike-w-wilson (73)
- gtiao (24)
Pull Request Authors
- mike-w-wilson (7)
- KoalaQin (1)
- pragatikore (1)