genipe

Genome-wide imputation pipeline

https://github.com/pgxcentre/genipe

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.3%) to scientific vocabulary

Keywords

bioinformatics genetics genomics

Repository


Basic Info
  • Host: GitHub
  • Owner: pgxcentre
  • License: gpl-3.0
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 39.8 MB
Statistics
  • Stars: 32
  • Watchers: 7
  • Forks: 7
  • Open Issues: 7
  • Releases: 14
Created about 11 years ago · Last pushed over 1 year ago

README.mkd

[![PyPI version](https://badgen.net/pypi/v/genipe)](https://pypi.org/project/genipe/)


# genipe - A Python module to perform genome-wide imputation analysis

The `genipe` module (standing for **GEN**ome-wide **I**mputation
**P**ipelin**E**) includes a script (named `genipe-launcher`) that
automatically runs a genome-wide imputation pipeline using *Plink*, *shapeit*
and *impute2*.

If you use `genipe` in any published work, please cite the paper describing the
tool:

> Lemieux Perreault LP, Legault MA, Asselin G, Dubé MP:
> **genipe: an automated genome-wide imputation pipeline with automatic reporting
> and statistical tools.** *Bioinformatics* 2016, **32** (23): 3661-3663
> (DOI:[10.1093/bioinformatics/btw487](http://dx.doi.org/10.1093/bioinformatics/btw487)).


## Documentation

Full documentation is available at
[http://pgxcentre.github.io/genipe/](http://pgxcentre.github.io/genipe/).


## Installation

Version 1.5.0 of `genipe` should work with the most recent versions of its
dependencies.

We recommend installing the package in a Python 3.7 (or later) virtual
environment. There are two ways to install it: with `pip` or with `conda`.

```bash
# Using pip
pip install genipe
```

```bash
# Using conda
conda install genipe -c http://statgen.org/wp-content/uploads/Softwares/genipe
```

The installation process should install all required dependencies to run the
main imputation pipeline. Optional dependencies can also be installed manually
in order to perform statistical analysis and data management (see below).

The complete installation procedure is available in the
[documentation](http://pgxcentre.github.io/genipe/installation.html).


### Dependencies

The tool requires a standard [Python](http://python.org/) 3.4 (or later)
installation with the following modules:

* `numpy` version 1.11.3 or later
* `Jinja2` version 2.9 or later
* `pandas` version 0.19.2 or later
* `setuptools` version 12.0.5 or later

The tool requires the binaries for
[Plink](http://pngu.mgh.harvard.edu/~purcell/plink/download.shtml),
[shapeit](https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#download)
and [impute2](https://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download).
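
Before launching the pipeline, it can be useful to confirm that the three
external tools are reachable. The following is a minimal sketch, not part of
`genipe`: the executable names `plink`, `shapeit` and `impute2` are assumptions
and may differ on your system. If a binary is not on the `PATH`, its location
can be given with the `--plink-bin`, `--shapeit-bin` or `--impute2-bin` option
of `genipe-launcher`.

```python
import shutil

# Assumed executable names; the actual names depend on how each tool
# was installed on your system.
REQUIRED_BINARIES = ("plink", "shapeit", "impute2")


def find_binaries(names=REQUIRED_BINARIES):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries().items():
        print(f"{name}: {path or 'not found on PATH'}")
```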


### Optional dependencies

In order to perform data management and statistical analysis (linear, logistic
and Cox's regressions), `genipe` requires the following Python modules:

* `Matplotlib`
* `scipy`
* `patsy`
* `statsmodels`
* `lifelines`
* `Biopython`
* `pyfaidx`
* `drmaa`
* `pyplink`

Note that `statsmodels` version 0.6 (specifically its MixedLM analysis) **is
not compatible** with `numpy` version 1.12 and later.

Finally, the tool requires a LaTeX installation to compile the automatically
generated report in PDF format.
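
To see which of the optional modules are available in the current environment,
something like the following sketch may help. It is not part of `genipe`. It
uses the import names of the packages, which can differ from the names listed
above (for example, `Biopython` is imported as `Bio`), and it only checks that
a module can be located, not that its version is suitable.

```python
from importlib.util import find_spec

# Import names of the optional dependencies (Biopython is imported as "Bio").
OPTIONAL_MODULES = (
    "matplotlib", "scipy", "patsy", "statsmodels", "lifelines",
    "Bio", "pyfaidx", "drmaa", "pyplink",
)


def missing_modules(names=OPTIONAL_MODULES):
    """Return the module names that cannot be located in this environment."""
    # find_spec() locates a top-level module without importing it.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing optional modules: " + ", ".join(missing))
    else:
        print("All optional modules were found.")
```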


### Testing

Basic testing was implemented for the script to merge *impute2* files resulting
from different segments of the same chromosome along with some utility
functions of the main package. To test the package (once installed), launch
Python and execute the following command:

```python
>>> import genipe
>>> genipe.test()
```

Some functionalities are difficult to test, since they mostly rely on external
tools to perform the analysis (*i.e.* *Plink*, *shapeit* and *impute2*), and
it is assumed that these tools were properly tested by their authors.

We will try to add further tests in the future.


## Basic usage

The following options are available when launching a genome-wide imputation
analysis.

```console
$ genipe-launcher --help
usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
                       [--reference FILE] [--chrom CHROM [CHROM ...]]
                       [--output-dir DIR] [--bgzip] [--use-drmaa]
                       [--drmaa-config FILE] [--preamble FILE]
                       [--shapeit-bin BINARY] [--shapeit-thread INT]
                       [--shapeit-extra OPTIONS] [--plink-bin BINARY]
                       [--hap-template TEMPLATE] [--legend-template TEMPLATE]
                       [--map-template TEMPLATE] --sample-file FILE
                       [--hap-nonPAR FILE] [--hap-PAR1 FILE] [--hap-PAR2 FILE]
                       [--legend-nonPAR FILE] [--legend-PAR1 FILE]
                       [--legend-PAR2 FILE] [--map-nonPAR FILE]
                       [--map-PAR1 FILE] [--map-PAR2 FILE]
                       [--impute2-bin BINARY] [--segment-length BP]
                       [--filtering-rules RULE [RULE ...]]
                       [--impute2-extra OPTIONS] [--probability FLOAT]
                       [--completion FLOAT] [--info FLOAT]
                       [--report-number NB] [--report-title TITLE]
                       [--report-author AUTHOR]
                       [--report-background BACKGROUND]

Execute the genome-wide imputation pipeline. This script is part of the
'genipe' package, version 1.4.2.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  --debug               set the logging level to debug
  --thread THREAD       number of threads [1]

Input Options:
  --bfile PREFIX        The prefix of the binary pedfiles (input data).
  --reference FILE      The human reference to perform an initial strand check
                        (useful for genotyped markers not in the IMPUTE2
                        reference files) (optional).

Output Options:
  --chrom CHROM [CHROM ...]
                        The chromosomes to process. It is possible to write
                        'autosomes' to process all the autosomes (from
                        chromosome 1 to 22, inclusively).
  --output-dir DIR      The name of the output directory. [genipe]
  --bgzip               Use bgzip to compress the impute2 files.

HPC Options:
  --use-drmaa           Launch tasks using DRMAA.
  --drmaa-config FILE   The configuration file for tasks (use this option when
                        launching tasks using DRMAA). This file should
                        describe the walltime and the number of
                        nodes/processors to use for each task.
  --preamble FILE       This option should be used when using DRMAA on a HPC
                        to load required module and set environment variables.
                        The content of the file will be added between the
                        'shebang' line and the tool command.

SHAPEIT Options:
  --shapeit-bin BINARY  The SHAPEIT binary if it's not in the path.
  --shapeit-thread INT  The number of thread for phasing. [1]
  --shapeit-extra OPTIONS
                        SHAPEIT extra parameters. Put extra parameters between
                        single or normal quotes (e.g. --shapeit-extra '--
                        states 100 --window 2').

Plink Options:
  --plink-bin BINARY    The Plink binary if it's not in the path.

IMPUTE2 Autosomal Reference:
  --hap-template TEMPLATE
                        The template for IMPUTE2's haplotype files (replace
                        the chromosome number by '{chrom}', e.g.
                        '1000GP_Phase3_chr{chrom}.hap.gz').
  --legend-template TEMPLATE
                        The template for IMPUTE2's legend files (replace the
                        chromosome number by '{chrom}', e.g.
                        '1000GP_Phase3_chr{chrom}.legend.gz').
  --map-template TEMPLATE
                        The template for IMPUTE2's map files (replace the
                        chromosome number by '{chrom}', e.g.
                        'genetic_map_chr{chrom}_combined_b37.txt').
  --sample-file FILE    The name of IMPUTE2's sample file.

IMPUTE2 Chromosome X Reference:
  --hap-nonPAR FILE     The IMPUTE2's haplotype file for the non-
                        pseudoautosomal region of chromosome 23.
  --hap-PAR1 FILE       The IMPUTE2's haplotype file for the first
                        pseudoautosomal region of chromosome 23.
  --hap-PAR2 FILE       The IMPUTE2's haplotype file for the second
                        pseudoautosomal region of chromosome 23.
  --legend-nonPAR FILE  The IMPUTE2's legend file for the non-pseudoautosomal
                        region of chromosome 23.
  --legend-PAR1 FILE    The IMPUTE2's legend file for the first
                        pseudoautosomal region of chromosome 23.
  --legend-PAR2 FILE    The IMPUTE2's legend file for the second
                        pseudoautosomal region of chromosome 23.
  --map-nonPAR FILE     The IMPUTE2's map file for the non-pseudoautosomal
                        region of chromosome 23.
  --map-PAR1 FILE       The IMPUTE2's map file for the first pseudoautosomal
                        region of chromosome 23.
  --map-PAR2 FILE       The IMPUTE2's map file for the second pseudoautosomal
                        region of chromosome 23.

IMPUTE2 Options:
  --impute2-bin BINARY  The IMPUTE2 binary if it's not in the path.
  --segment-length BP   The length of a single segment for imputation. [5e+06]
  --filtering-rules RULE [RULE ...]
                        IMPUTE2 filtering rules (optional).
  --impute2-extra OPTIONS
                        IMPUTE2 extra parameters. Put the extra parameters
                        between single or normal quotes (e.g. --impute2-extra
                        '-buffer 250 -Ne 20000').

IMPUTE2 Merger Options:
  --probability FLOAT   The probability threshold for no calls. [<0.9]
  --completion FLOAT    The completion rate threshold for site exclusion.
                        [<0.98]
  --info FLOAT          The measure of the observed statistical information
                        associated with the allele frequency estimate
                        threshold for site exclusion. [<0.00]

Automatic Report Options:
  --report-number NB    The report number. [genipe automatic report]
  --report-title TITLE  The report title. [genipe: Automatic genome-wide
                        imputation]
  --report-author AUTHOR
                        The report author. [Automatically generated by genipe]
  --report-background BACKGROUND
                        The report background section (can either be a string
                        or a file containing the background. [General
                        background]
```
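
The `{chrom}` placeholder in the reference templates can be illustrated with a
short sketch. It assumes the placeholder is substituted in the same way as by
Python's `str.format`, which the help text does not state explicitly. The
template file names come from the help text above, while the `my_study` prefix
and the `1000GP_Phase3.sample` file name are hypothetical. The command is only
assembled and printed, not executed, and it uses only options listed in the
help output.

```python
import shlex

AUTOSOMES = range(1, 23)  # chromosomes 1 to 22, inclusive


def expand_template(template, chromosomes=AUTOSOMES):
    """Replace '{chrom}' in a reference file template for each chromosome."""
    return [template.format(chrom=chrom) for chrom in chromosomes]


def build_command(bfile, sample_file, hap, legend, genetic_map,
                  output_dir="genipe"):
    """Assemble a genipe-launcher command line for all autosomes."""
    return [
        "genipe-launcher",
        "--bfile", bfile,
        "--chrom", "autosomes",
        "--output-dir", output_dir,
        "--hap-template", hap,
        "--legend-template", legend,
        "--map-template", genetic_map,
        "--sample-file", sample_file,
    ]


if __name__ == "__main__":
    command = build_command(
        bfile="my_study",                    # hypothetical Plink prefix
        sample_file="1000GP_Phase3.sample",  # hypothetical file name
        hap="1000GP_Phase3_chr{chrom}.hap.gz",
        legend="1000GP_Phase3_chr{chrom}.legend.gz",
        genetic_map="genetic_map_chr{chrom}_combined_b37.txt",
    )
    # Quote each argument so the line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in command))
    # Show the haplotype files the template stands for.
    for name in expand_template("1000GP_Phase3_chr{chrom}.hap.gz"):
        print(name)
```

Passing the command as a list of arguments (for instance to
`subprocess.run`) avoids shell quoting issues with the braces in the
templates.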


### Real life example

The documentation provides a quick and easy tutorial for the complete pipeline.
Refer to [this page](http://pgxcentre.github.io/genipe/tutorials.html).


### Automatic report

The pipeline provides a report containing relevant imputation statistics. This
report is located in the `report` directory. It can be compiled into a PDF file
using the following `make` command:

```console
$ make && make clean
...
```
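
The same two steps can be scripted, for example when several reports need to
be compiled. This is a minimal sketch rather than something `genipe` provides.
The `genipe/report` path in the usage example is an assumption that combines
the default output directory with the `report` directory, and `make` and LaTeX
must be installed.

```python
import subprocess
from pathlib import Path


def report_commands():
    """The two make invocations shown above, in order."""
    return [["make"], ["make", "clean"]]


def compile_report(report_dir):
    """Run 'make' then 'make clean' inside the report directory."""
    report_dir = Path(report_dir)
    if not report_dir.is_dir():
        raise FileNotFoundError(f"no such report directory: {report_dir}")
    for command in report_commands():
        # check=True stops at the first failing step, like '&&' in the shell.
        subprocess.run(command, cwd=report_dir, check=True)
```

For example, `compile_report("genipe/report")` would compile the report of a
run that used the default output directory, under the assumption stated above.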

This report uses LaTeX, and the `*.tex` file can be modified to add
project-specific information.


### Statistical analysis

Once the genome-wide imputation analysis is performed and the quality metrics
provided by the automatic report have been reviewed, it is possible to
perform different statistical analyses (e.g. linear or logistic regression,
or survival analysis using Cox's proportional hazard model) using the provided
script named `imputed-stats`.

```console
$ imputed-stats --help
usage: imputed-stats [-h] [-v] {cox,linear,logistic,mixedlm,skat} ...

Performs statistical analysis on imputed data (either SKAT analysis, or
linear, logistic or survival regression). This script is part of the 'genipe'
package, version 1.4.2.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Statistical Analysis Type:
  The type of statistical analysis to be performed on the imputed data.

  {cox,linear,logistic,mixedlm,skat}
    cox                 Cox's proportional hazard model (survival regression).
    linear              Linear regression (ordinary least squares).
    logistic            Logistic regression (GLM with binomial distribution).
    mixedlm             Linear mixed effect model (random intercept).
    skat                SKAT analysis.
```

See the tutorials for more information:
[http://pgxcentre.github.io/genipe/tutorials.html](http://pgxcentre.github.io/genipe/tutorials.html)


# About

This project was initiated at the
[Beaulieu-Saucier Pharmacogenomics Centre](http://www.pharmacogenomics.ca/) of
the [Montreal Heart Institute](https://www.icm-mhi.org/). The aim was to speed
up (and automate) the imputation process for the whole genome.







Owner

  • Name: Beaulieu-Saucier Université de Montréal Pharmacogenomics Centre
  • Login: pgxcentre
  • Kind: organization
  • Location: Montreal, Canada

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: about 2 years ago

All Time
  • Total Commits: 650
  • Total Committers: 3
  • Avg Commits per committer: 216.667
  • Development Distribution Score (DDS): 0.048
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Louis-Philippe Lemieux Perreault l****t@s****g 619
Marc-André Legault l****c@g****m 30
Brian Sebastian Cole b****s@g****m 1

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 56
  • Total pull requests: 1
  • Average time to close issues: 2 months
  • Average time to close pull requests: 4 days
  • Total issue authors: 12
  • Total pull request authors: 1
  • Average comments per issue: 1.07
  • Average comments per pull request: 1.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lemieuxl (28)
  • dlippold (10)
  • legaultmarc (8)
  • asseling (2)
  • AmmyDK (1)
  • gpcr (1)
  • arvkevi (1)
  • LeoBman (1)
  • klevdiamanti (1)
  • dgarrimar (1)
  • Sathish9 (1)
  • ashaked (1)
Pull Request Authors
  • bryketos (1)
Top Labels
Issue Labels
enhancement (25) bug (11) wontfix (7) new feature (1) question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 152 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 3
  • Total versions: 13
  • Total maintainers: 1
pypi.org: genipe

An automatic genome-wide imputation pipeline.

  • Versions: 13
  • Dependent Packages: 0
  • Dependent Repositories: 3
  • Downloads: 152 Last month
Rankings
Dependent repos count: 8.9%
Dependent packages count: 10.1%
Stargazers count: 11.3%
Forks count: 12.6%
Average: 12.9%
Downloads: 21.6%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy *
  • pandas *