the-real-genotools

GenoTools: Advanced Genotype Data Analysis A robust suite for processing genotype data, offering genotype calling (.idat to PLINK), comprehensive sample/variant QC, and ancestry estimation. Ideal for computational biology and genetics research.

https://github.com/dvitale199/genotools

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.7%) to scientific vocabulary

Keywords

ancestry-estimation genotype python quality-control
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: dvitale199
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 43.8 MB
Statistics
  • Stars: 28
  • Watchers: 4
  • Forks: 7
  • Open Issues: 10
  • Releases: 13
Created about 5 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

GenoTools

Published in G3: https://www.biorxiv.org/content/10.1101/2024.03.26.586362v1.full.pdf

Documentation

You can find the full documentation at the following links:

- GenoTools Command Line Arguments
- Default Pipeline Overview
- Package Function Guide (for developers)
- JSON output guide

Getting Started

GenoTools is a suite of automated genotype data processing steps written in Python. The core pipeline was built for quality control and ancestry estimation of data in the Global Parkinson's Genetics Program (GP2).

To install the most current version from pip:

    pip install the-real-genotools

Alternatively, to install from GitHub:

    git clone https://github.com/dvitale199/GenoTools.git
    cd GenoTools
    pip install .

You can pull the most current references by running:

    genotools-download

By default, the reference panel is downloaded to ~/.genotools/ref, but it can be downloaded to a location of your choice with --destination.

To download specific references/models, run the download with the following options:

    genotools-download --ref 1kg_30x_hgdp_ashk_ref_panel --model nba_v1 --destination /path/to/download_directory/

Currently, 1kg_30x_hgdp_ashk_ref_panel is the only available reference panel. Available models are nba_v1 for the NeuroBooster array and neurochip_v1 for the NeuroChip array; both are in GRCh38. If you are using a different array, we suggest training a new model by running the standard command below. Please ensure the reference panel and your genotypes are in the same build. If you're using our reference panel, your genotypes must be in GRCh38.

Modify the paths in the following command to run the standard GP2 pipeline:

    genotools \
      --pfile /path/to/genotypes/for/qc \
      --out /path/to/qc/output \
      --ancestry \
      --ref_panel /path/to/reference/panel \
      --ref_labels /path/to/reference/ancestry/labels \
      --all_sample \
      --all_variant

This will find common SNPs between your genotype data and the reference panel, run PCA, UMAP-transform the PCs, and train a new XGBoost classifier specific to your data/reference panel.
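
The first of those steps, finding SNPs shared by the study genotypes and the reference panel, can be pictured as a set intersection over variant keys. The sketch below is only an illustration and is not GenoTools' implementation: the `(chrom, pos, ref, alt)` key and the `common_variants` helper are assumptions made for the example.

```python
from typing import Iterable, List, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def common_variants(study: Iterable[Variant], reference: Iterable[Variant]) -> List[Variant]:
    """Return variants present in both inputs, keeping the study's order."""
    reference_set = set(reference)
    seen = set()
    shared = []
    for variant in study:
        # Keep each shared variant once, even if the study lists it twice.
        if variant in reference_set and variant not in seen:
            seen.add(variant)
            shared.append(variant)
    return shared


# Toy data, invented for the example.
study = [("1", 1000, "A", "G"), ("1", 2000, "C", "T"), ("2", 500, "G", "A")]
reference = [("2", 500, "G", "A"), ("1", 1000, "A", "G"), ("3", 42, "T", "C")]
shared = common_variants(study, reference)
print(shared)
```

Matching on position and alleles rather than on variant IDs alone is one reason the reference panel and the genotypes need to be in the same build.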

If you'd like to run the pipeline using an existing model, you can do that like so (take note of the --model option):

    genotools \
      --pfile /path/to/genotypes/for/qc \
      --out /path/to/qc/output \
      --ancestry \
      --ref_panel /path/to/reference/panel \
      --ref_labels /path/to/reference/ancestry/labels \
      --all_sample \
      --all_variant \
      --model /path/to/nba_v1/model
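
When the pipeline is launched from a script rather than typed by hand, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only the flags shown in this README; `build_genotools_command` is a hypothetical helper and is not part of the GenoTools package.

```python
import shlex
from typing import List, Optional


def build_genotools_command(
    pfile: str,
    out: str,
    ref_panel: str,
    ref_labels: str,
    model: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for the standard pipeline run."""
    command = [
        "genotools",
        "--pfile", pfile,
        "--out", out,
        "--ancestry",
        "--ref_panel", ref_panel,
        "--ref_labels", ref_labels,
        "--all_sample",
        "--all_variant",
    ]
    if model is not None:
        # Reuse an existing model instead of training a new one.
        command += ["--model", model]
    return command


command = build_genotools_command(
    pfile="/path/to/genotypes/for/qc",
    out="/path/to/qc/output",
    ref_panel="/path/to/reference/panel",
    ref_labels="/path/to/reference/ancestry/labels",
    model="/path/to/nba_v1/model",
)
print(shlex.join(command))
# To run it (requires genotools on PATH):
# import subprocess; subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths contain spaces.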

If you'd like to run the pipeline using the default nba_v1 model in a Docker container, you can do that like so:

```
genotools \
  --pfile /path/to/genotypes/for/qc \
  --out /path/to/qc/output \
  --ancestry \
  --ref_panel /path/to/reference/panel \
  --ref_labels /path/to/reference/ancestry/labels \
  --container \
  --all_sample \
  --all_variant
```

Note: add the `--singularity` flag to run containerized ancestry predictions on HPC.

genotools accepts --pfile, --bfile, or --vcf. Any bfile or vcf will be converted to a pfile before running any steps.
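
A wrapper script can choose among the three input flags by checking which files exist. The sketch below assumes uncompressed PLINK 2 (`.pgen`/`.pvar`/`.psam`) and PLINK 1 (`.bed`/`.bim`/`.fam`) filesets addressed by a shared prefix, and a VCF addressed by its full file name. `detect_input_flag` is a hypothetical helper, and the README does not state whether `--vcf` expects a prefix or a file name.

```python
import tempfile
from pathlib import Path

PFILE_EXTS = (".pgen", ".pvar", ".psam")  # PLINK 2 fileset
BFILE_EXTS = (".bed", ".bim", ".fam")     # PLINK 1 binary fileset


def detect_input_flag(path: str) -> str:
    """Guess which genotools input flag matches the files found at `path`."""
    if path.endswith((".vcf", ".vcf.gz")) and Path(path).is_file():
        return "--vcf"
    if all(Path(path + ext).is_file() for ext in PFILE_EXTS):
        return "--pfile"
    if all(Path(path + ext).is_file() for ext in BFILE_EXTS):
        return "--bfile"
    raise FileNotFoundError(f"No PLINK 1/2 fileset or VCF found for {path!r}")


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    prefix = str(Path(tmp) / "cohort")
    for ext in BFILE_EXTS:
        Path(prefix + ext).touch()
    detected = detect_input_flag(prefix)
print(detected)
```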

Note: multiallelic pfiles will be converted to biallelic format by excluding multiallelic variants before running the `--ancestry` steps. If you would prefer not to remove multiallelic SNPs, please pre-split them using bcftools before running genotools.
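
The README does not give the exact bcftools command for pre-splitting. One common way is `bcftools norm -m -any`, which writes each alternate allele of a multiallelic site as its own biallelic record. The sketch below builds that call; the file names and the `build_split_command` helper are hypothetical.

```python
import shlex
from typing import List


def build_split_command(input_vcf: str, output_vcf: str) -> List[str]:
    """Build a bcftools call that splits multiallelic sites into biallelic records."""
    return [
        "bcftools", "norm",
        "-m", "-any",   # split multiallelic sites (SNPs and indels)
        "-O", "z",      # write a bgzip-compressed VCF
        "-o", output_vcf,
        input_vcf,
    ]


split_command = build_split_command("cohort.vcf.gz", "cohort.split.vcf.gz")
print(shlex.join(split_command))
# To run it (requires bcftools on PATH):
# import subprocess; subprocess.run(split_command, check=True)
```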

Please consult the docs links listed at the top of the README for the full argument guide, function guide, Default pipeline overview, and guide for navigating the output JSON.
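
Before reading the JSON output guide, it can help to list what a given output file actually contains. The sketch below is generic: it maps each top-level key of a JSON object to the type of its value. The keys in the demonstration file are placeholders invented for the example and say nothing about the real GenoTools output layout.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def summarize_top_level(json_path: str) -> Dict[str, str]:
    """Map each top-level key of a JSON object to the type name of its value."""
    with open(json_path) as handle:
        data = json.load(handle)
    return {key: type(value).__name__ for key, value in data.items()}


# Stand-in file with made-up keys; a real GenoTools output will differ.
with tempfile.TemporaryDirectory() as tmp:
    example_path = Path(tmp) / "example.json"
    example_path.write_text(
        json.dumps({"placeholder_table": [], "placeholder_counts": {}})
    )
    summary = summarize_top_level(str(example_path))
print(summary)
```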

Acknowledgements

GenoTools was developed as the core genotype and WGS processing pipeline for the Global Parkinson's Genetics Program (GP2) at the Center for Alzheimer's and Related Dementias (CARD) at the National Institutes of Health.

This tool relies on PLINK, a whole genome association analysis toolset, for various genetic data processing functionalities. We gratefully acknowledge the developers of PLINK for their foundational contributions to the field of genetics. More about PLINK can be found at their website.

Owner

  • Name: Dan Vitale
  • Login: dvitale199
  • Kind: user
  • Location: Boulder, CO
  • Company: Data Tecnica International

Data Scientist @ DTi ~Neurogenetics, Machine Learning, Computational Biology~

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use GenoTools, please cite it as below."
authors:
- family-names: "Vitale"
  given-names: "Dan"
  orcid: "https://orcid.org/0000-0002-0637-3671"
- family-names: "Koretsky"
  given-names: "Mathew"
  orcid: "https://orcid.org/0000-0003-4341-3991"
- family-names: "Kuznetsov"
  given-names: "Nicole"
  orcid: "https://orcid.org/0009-0005-9847-0282"
- family-names: "Hong"
  given-names: "Samantha"
  orcid: "https://orcid.org/0009-0001-8968-6461"
- family-names: "Leonard"
  given-names: "Hampton"
  orcid: "https://orcid.org/0000-0003-2390-8110"
- family-names: "Song"
  given-names: "Yeajin"
  orcid: "https://orcid.org/0009-0002-3127-2634"
- family-names: "Levine"
  given-names: "Kristin"
  orcid: "https://orcid.org/0000-0002-5702-0980"
title: "GenoTools"
version: 1.0.2
doi: 10.5281/zenodo.10443258
date-released: 2023-12-29
url: "https://github.com/dvitale199/GenoTools"

GitHub Events

Total
  • Create event: 27
  • Issues event: 2
  • Release event: 8
  • Watch event: 7
  • Delete event: 24
  • Push event: 57
  • Pull request event: 50
  • Pull request review event: 15
  • Fork event: 1
Last Year
  • Create event: 27
  • Issues event: 2
  • Release event: 8
  • Watch event: 7
  • Delete event: 24
  • Push event: 57
  • Pull request event: 50
  • Pull request review event: 15
  • Fork event: 1

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 74 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 13
  • Total maintainers: 1
pypi.org: the-real-genotools

A collection of tools for genotype quality control and analysis

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 74 Last month
Rankings
  • Dependent packages count: 10.1%
  • Average: 38.5%
  • Dependent repos count: 66.9%