https://github.com/broadinstitute/gnomad_local_ancestry

Hail batch pipeline and scripts for local ancestry inference

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Academic email domains
  • Institutional organization owner
    Organization broadinstitute has institutional domain (www.broadinstitute.org)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: broadinstitute
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 159 KB
Statistics
  • Stars: 6
  • Watchers: 10
  • Forks: 0
  • Open Issues: 32
  • Releases: 1
Created over 5 years ago · Last pushed 8 months ago
Metadata Files
Readme License

README.md

gnomAD Local Ancestry Inference (LAI) Pipeline

This repository provides a streamlined pipeline for performing Local Ancestry Inference (LAI) on gnomAD samples. Because of the large scale of gnomAD, we implemented the pipeline with the Hail Batch Python module, which is still in beta testing and exclusive to the Broad. The tools used within the pipeline, however, are all publicly available.

The pipeline leverages Eagle for phasing, RFMix v2 for ancestry painting, Tractor for extracting ancestry-specific allele frequencies, and generate_output_vcf.py for generating a joint VCF with ancestry-specific allele count (AC), allele number (AN), and allele frequency (AF) annotations. Below, we outline the steps to run this pipeline on your dataset using Hail/Python scripts.


Getting Started

To run the LAI pipeline, set up your environment and install the required dependencies. Follow the step-by-step instructions below.


1. Pipeline Overview

This pipeline processes genomic data by:

  • Phasing haplotypes using Eagle
  • Inferring local ancestry using RFMix v2
  • Extracting ancestry-specific allele frequencies from phased and painted data using Tractor
  • Generating a joint VCF with ancestry-specific calls using generate_output_vcf.py

2. Installation & Setup

Step 1: Clone the Repository

```bash
git clone https://github.com/broadinstitute/gnomad_local_ancestry.git
```

Step 2: Install Dependencies

The pipeline requires Python 3, Hail, and several additional tools. Install the necessary dependencies using:

```bash
pip install hail
pip install numpy pandas
```

Make sure you have Eagle and RFMix v2 installed. You can find installation instructions and a toy dataset in the Tractor Tutorial.


3. Running the LAI Pipeline

Step 1: Phasing

To phase your genotype data using Eagle, run:

```bash
eagle --vcf input_data.vcf.gz --out phased_data.vcf.gz
```

Refer to the Tractor wiki for a detailed guide on phasing.
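Before moving on to ancestry painting, it can help to confirm that the phasing output really is phased. The following is a minimal sketch, not part of this repository: it relies only on the VCF convention that phased genotypes use `|` and unphased genotypes use `/`. The function name `count_genotype_phase` is hypothetical.

```python
import gzip


def count_genotype_phase(vcf_path):
    """Return (phased, unphased) genotype counts for a gzip/bgzip-compressed VCF.

    Hypothetical helper for illustration; it is not part of the pipeline.
    """
    phased = 0
    unphased = 0
    with gzip.open(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information lines and the column header
            fields = line.rstrip("\n").split("\t")
            # Columns 0-8 are fixed (CHROM through FORMAT); samples start at column 9.
            gt_index = fields[8].split(":").index("GT")
            for sample in fields[9:]:
                genotype = sample.split(":")[gt_index]
                if "|" in genotype:
                    phased += 1
                elif "/" in genotype:
                    # Includes missing diploid calls written as "./."
                    unphased += 1
    return phased, unphased
```

Calling `count_genotype_phase("phased_data.vcf.gz")` on a fully phased file should report zero unphased genotypes. Haploid calls without a separator are counted in neither bucket.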

Step 2: Local Ancestry Inference using RFMix

After phasing, run RFMix v2 to infer local ancestry:

```bash
rfmix \
  -f phased_data.vcf.gz \
  -r reference_panel.vcf.gz \
  -m samplemap.txt \
  -g geneticmap.txt \
  -o painted_lai \
  --chromosome=22
```

See the Tractor wiki for additional instructions on reference panels and sample maps.
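The RFMix command above needs a sample map and is run one chromosome at a time. The sketch below shows one way to prepare both from Python. It assumes the sample map is a two-column, tab-delimited file (reference sample ID, then population label); check the RFMix v2 documentation for the exact format your version expects. The helper names and the per-chromosome output prefix are hypothetical, and only the flags shown in the command above are used.

```python
import csv


def write_sample_map(sample_to_population, path):
    """Write a tab-delimited sample map: one reference sample per line.

    Assumed layout (verify against the RFMix v2 docs): sample ID, population.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, population in sorted(sample_to_population.items()):
            writer.writerow([sample_id, population])


def rfmix_command(
    chromosome,
    query_vcf="phased_data.vcf.gz",
    reference_vcf="reference_panel.vcf.gz",
    sample_map="samplemap.txt",
    genetic_map="geneticmap.txt",
    out_prefix="painted_lai",
):
    """Build the argument list for one chromosome (suitable for subprocess.run)."""
    return [
        "rfmix",
        "-f", query_vcf,
        "-r", reference_vcf,
        "-m", sample_map,
        "-g", genetic_map,
        # Hypothetical naming scheme: one output prefix per chromosome.
        "-o", f"{out_prefix}_chr{chromosome}",
        f"--chromosome={chromosome}",
    ]
```

For example, `[rfmix_command(c) for c in range(1, 23)]` yields one command per autosome. The chromosome value must match the contig naming in your VCFs (for instance `22` versus `chr22`).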

Step 3: Extracting Ancestry-Specific Allele Frequencies

Once local ancestry inference is complete, extract ancestry-specific allele frequencies using Tractor:

```bash
python3 extract_tracts.py \
  --vcf phased_data.vcf.gz \
  --msp painted_lai.msp.tsv \
  --num-ancs 2
```
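The value passed to `--num-ancs` should match the number of reference populations RFMix painted with. As a rough sketch, that number can be read from the MSP file header. This assumes the usual RFMix v2 layout, in which the first line lists the population codes (for example `#Subpopulation order/codes: AFR=0	EUR=1`) and the second line holds six positional columns followed by one column per haplotype. The function name `parse_msp_header` is hypothetical; verify the layout against your own output.

```python
def parse_msp_header(lines):
    """Return (ancestry_codes, haplotype_columns) from the first two MSP lines.

    `lines` is any iterable of text lines, such as an open file handle.
    The header layout is an assumption; check it against your RFMix output.
    """
    lines = iter(lines)
    codes_line = next(lines).rstrip("\n")
    columns_line = next(lines).rstrip("\n")

    # Everything after the first colon is a whitespace-separated list of NAME=CODE.
    _, _, code_text = codes_line.partition(":")
    ancestry_codes = {}
    for entry in code_text.split():
        name, _, code = entry.partition("=")
        ancestry_codes[int(code)] = name

    # The first six columns describe the window; the rest are haplotypes.
    haplotype_columns = columns_line.split("\t")[6:]
    return ancestry_codes, haplotype_columns
```

With a file handle opened on `painted_lai.msp.tsv`, `len(ancestry_codes)` gives the value to compare against `--num-ancs`.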

Step 4: Generating a Joint VCF with Ancestry-Specific Annotations

The generate_lai_vcf function calls generate_output_vcf.py, a standalone Python/Hail script. This script outputs an annotated VCF containing ancestry-specific allele frequency data:

```bash
python3 generate_output_vcf.py \
  --msp-file painted_lai.msp.tsv \
  --tractor-output tractor_output_path \
  --output-path output_lai \
  --is-zipped \
  --mt-path-for-adj pipeline_input.mt \
  --add-gnomad-af
```
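To make the AC, AN, and AF annotations concrete, here is a small illustrative calculation for a single ancestry. It is not the logic of `generate_output_vcf.py`, which runs in Hail. It assumes two per-ancestry matrices of the kind Tractor can produce: alternate-allele dosages and haplotype counts, each indexed by variant and then by sample. The function name is hypothetical.

```python
def ancestry_allele_stats(dosages, hapcounts):
    """Compute per-variant AC, AN, and AF for one ancestry.

    dosages[i][j]:   alternate alleles carried on this ancestry's tracts
                     by sample j at variant i (0, 1, or 2).
    hapcounts[i][j]: number of sample j's haplotypes assigned to this
                     ancestry at variant i (0, 1, or 2).
    """
    stats = []
    for dosage_row, hapcount_row in zip(dosages, hapcounts):
        ac = sum(dosage_row)    # ancestry-specific allele count
        an = sum(hapcount_row)  # ancestry-specific allele number
        af = ac / an if an > 0 else None  # undefined when no haplotypes carry the ancestry
        stats.append({"AC": ac, "AN": an, "AF": af})
    return stats
```

For instance, dosages `[[1, 0, 2]]` with haplotype counts `[[2, 1, 2]]` give AC = 3 and AN = 5 for that variant. Note that AN is the number of haplotypes painted with the ancestry, not twice the sample count.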


4. Additional Resources

For detailed explanations of phasing, local ancestry painting, and extracting tracts, refer to:

For our Hail Batch Python pipeline, refer to:


5. Citation

If you use this pipeline in your research, please cite:

Kore, P., Wilson, M. et al., Improved Allele Frequencies in gnomAD through Local Ancestry Inference.

Please direct questions to pragati.kore@bcm.edu or mwilson@broadinstitute.org.

Owner

  • Name: Broad Institute
  • Login: broadinstitute
  • Kind: organization
  • Location: Cambridge, MA

Broad Institute of MIT and Harvard

GitHub Events

Total
  • Release event: 1
  • Watch event: 3
  • Issue comment event: 1
  • Push event: 9
  • Pull request review comment event: 3
  • Pull request event: 2
  • Create event: 1
Last Year
  • Release event: 1
  • Watch event: 3
  • Issue comment event: 1
  • Push event: 9
  • Pull request review comment event: 3
  • Pull request event: 2
  • Create event: 1

Issues and Pull Requests

Last synced: about 1 year ago

All Time
  • Total issues: 178
  • Total pull requests: 6
  • Average time to close issues: 5 months
  • Average time to close pull requests: 2 months
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 0.57
  • Average comments per pull request: 1.0
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mike-w-wilson (73)
  • gtiao (24)
Pull Request Authors
  • mike-w-wilson (7)
  • KoalaQin (1)
  • pragatikore (1)
Top Labels
Issue Labels
  • Epic (7)
Pull Request Labels