nf-gwas-pipeline

nf-gwas-pipeline: A Nextflow Genome-Wide Association Study Pipeline - Published in JOSS (2021)

https://github.com/montilab/nf-gwas-pipeline

Science Score: 95.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in JOSS metadata
  • Academic publication links
  • Committers with academic emails
    1 of 3 committers (33.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
    Published in Journal of Open Source Software

Keywords

bioinformatics-pipeline nextflow

Scientific Fields

Earth and Environmental Sciences Physical Sciences - 40% confidence
Materials Science Physical Sciences - 40% confidence

Repository

A Nextflow Genome-Wide Association Study (GWAS) Pipeline

Basic Info
  • Host: GitHub
  • Owner: montilab
  • License: gpl-3.0
  • Language: R
  • Default Branch: master
  • Homepage:
  • Size: 5.25 MB
Statistics
  • Stars: 35
  • Watchers: 3
  • Forks: 21
  • Open Issues: 1
  • Releases: 1
Topics
bioinformatics-pipeline nextflow
Created about 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License

README.Rmd

---
output: rmarkdown::github_document
---



```{r echo=FALSE, message=FALSE}
knitr::opts_chunk$set(message=FALSE, comment="#>")
```

# nf-gwas-pipeline
A Nextflow Genome-Wide Association Study (GWAS) Pipeline

[![Built With](https://img.shields.io/badge/Built+With-Nextflow-brightgreen.svg)](https://www.nextflow.io/)
![Compatibility](https://img.shields.io/badge/Compatibility-Linux+%2F+OSX-blue.svg)
[![GitHub Issues](https://img.shields.io/github/issues/montilab/nf-gwas-pipeline.svg)](https://github.com/montilab/nf-gwas-pipeline/issues)

# Installation

### Clone Repository
```bash
$ git clone https://github.com/montilab/nf-gwas-pipeline
```

### Initialize Paths to Test Data

We have provided multiple toy datasets for testing the pipeline and ensuring that all paths and dependencies are properly set up. To point the toy data paths to your local directory, run the following script.

```bash
$ cd nf-gwas-pipeline
$ python utils/paths.py  
```

### Download Nextflow Executable

Nextflow requires a POSIX compatible system (Linux, OS X, etc.) and Java 8 (or later, up to 11) to be installed. Once downloaded, optionally make the nextflow file accessible by your $PATH variable so you do not have to specify the full path to nextflow each time.

```
$ curl -s https://get.nextflow.io | bash
```
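
Before downloading, it can help to confirm that the installed Java falls in the range stated above (Java 8 up to 11). The following is a minimal sketch rather than part of the pipeline: `java_major` is a hypothetical helper, and it assumes the usual version strings (`1.8.0_292` for Java 8, `11.0.11` for Java 11).

```bash
# Sketch: extract the major Java version from a version string.
# java_major is a hypothetical helper, not part of Nextflow or this pipeline.
java_major() {
  local v="$1"
  # Legacy scheme: "1.8.0_292" means Java 8; modern scheme: "11.0.11" means Java 11.
  case "$v" in
    1.*) v="${v#1.}" ;;
  esac
  # Keep only the leading digits.
  printf '%s\n' "${v%%[!0-9]*}"
}

# Usage: read the installed version (java prints it on stderr) and compare.
if command -v java >/dev/null 2>&1; then
  installed="$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"
  major="$(java_major "$installed")"
  if [ -n "$major" ] && [ "$major" -ge 8 ] && [ "$major" -le 11 ]; then
    echo "Java $major is within the range stated above"
  else
    echo "Java version '$installed' is outside the 8-11 range stated above" >&2
  fi
fi
```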

### Quick Start with Docker

We have created a pre-built Docker image with all of the dependencies installed. To get started, first make sure [Docker is installed](https://docs.docker.com/get-docker/). Then pull down the image onto your local machine.

```
$ docker pull montilab/gwas:latest
```

#### or

Optionally, you can build this image yourself from the Dockerfile, which specifies all of the required dependencies. *Note: This might take a while!*

```
$ docker build --tag montilab/gwas:latest .
```

### Run with Docker
```
$ ./nextflow gwas.nf -c gwas.config -with-docker montilab/gwas
```

### Expected Output
```bash
N E X T F L O W  ~  version 19.04.1
Launching `gwas.nf` [jolly_fermi] - revision: 46311ebd05
-

G W A S  ~  P I P E L I N E

================================
indir     : /data/
outdir    : /results

vcf       : /data/toy_vcf.csv
pheno     : /data/pheno_file_logistic.csv
snpset    : /data/snpset.txt

phenotype : outcome
covars    : age,sex,PC1,PC2,PC3,PC4
model     : logistic
test      : Score
ref       : hg19

-
[warm up] executor > local
executor >  local (141)
[60/b5b95e] process > qc_miss                   [100%] 22 of 22 ✔
[11/fa0fbd] process > annovar_ref               [100%] 1 of 1 ✔
[8f/25f8fa] process > qc_mono                   [100%] 22 of 22 ✔
[82/069a6d] process > vcf_to_gds                [100%] 22 of 22 ✔
[3e/819e86] process > merge_gds                 [100%] 1 of 1 ✔
[c3/f23390] process > nullmod_skip_pca_grm      [100%] 1 of 1 ✔
[ed/91344b] process > gwas_skip_pca_grm         [100%] 22 of 22 ✔
[b4/3aea3e] process > caf_by_group_skip_pca_grm [100%] 22 of 22 ✔
[e2/3c778d] process > merge_by_chr              [100%] 22 of 22 ✔
[fe/33ebd4] process > combine_results           [100%] 1 of 1 ✔
[8b/2020d3] process > annovar_input             [100%] 1 of 1 ✔
[61/3a373f] process > plot                      [100%] 1 of 1 ✔
[66/6f4246] process > annovar                   [100%] 1 of 1 ✔
[85/d4266b] process > add_annovar               [100%] 1 of 1 ✔
[9e/4fc2fe] process > report                    [100%] 1 of 1 ✔
Completed at: 15-Oct-2020 17:30:28
Duration    : 44.1s
CPU hours   : 0.1
Succeeded   : 141
```


### Alternative to Docker

If you are running the pipeline on an HPC cluster that does not support Docker (e.g., BU's Shared Computing Cluster), you can load the dependencies as modules and run the pipeline as follows. In addition, you need to install the following R packages: SeqArray, GENESIS, Biobase, SeqVarTools, dplyr, SNPRelate, ggplot2, data.table, reshape2, latex2exp, knitr, EBImage, GenomicRanges, TxDb.Hsapiens.UCSC.hg19.knownGene, GMMAT, ezknitr.

```
$ module load R/4.1.1
$ module load vcftools/0.1.16
$ module load bcftools/1.10.2
$ module load plink/2.00a1LM
$ module load annovar/2018apr
$ module load pandoc/2.5

nextflow gwas.nf -c gwas.config
```

```{r include=FALSE}
library(knitr)
library(data.table)
library(EBImage)
```

# Underlying Structure and Output folder

```{r, echo=FALSE}
display(readImage("media/gwas_pipeline.png"))
```

# Inputs and Configuration

### Mandatory Input File Formats

#### 1. Phenotype file: csv file

1. The first column should contain the unique subject ID
2. The names and the number of the remaining columns are not fixed
3. The group variable is optional, but it must be categorical if it is used
4. A longitudinal phenotype file should be in long format
5. If the pca_grm process is turned off, the PCs must be present in the phenotype file if they are to be used in the model

```
example: ./data/pheno_file_linear.csv
         ./data/pheno_file_logistic.csv
         ./data/1KG_pheno_linear.csv
         ./data/1KG_pheno_logistic.csv
         ./data/1KG_pheno_longitudinal.csv
```

```{r}
pheno.dat <- read.csv("data/pheno_file_linear.csv")
kable(head(pheno.dat))
```
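
Rule 1 above can be checked from the shell before launching the pipeline. This is a minimal sketch under stated assumptions: `check_unique_ids` is a hypothetical helper, the file has a header row, and fields contain no quoted commas. For a longitudinal (long-format) file, repeated IDs are expected by design, so the check only makes sense for one-row-per-subject files.

```bash
# Sketch: report duplicated subject IDs in the first column of a phenotype csv.
# check_unique_ids is a hypothetical helper; it assumes a header row and
# simple comma-separated fields without quoted commas.
check_unique_ids() {
  # Print an ID the second time it is seen, so each duplicate is listed once.
  awk -F',' 'NR > 1 && seen[$1]++ == 1 { print $1 }' "$1"
}

# Example with a throwaway file (column names are illustrative only).
tmp="$(mktemp)"
printf 'id,outcome,age\nS1,1.2,60\nS2,0.7,71\nS1,0.9,65\n' > "$tmp"
check_unique_ids "$tmp"   # prints S1
rm -f "$tmp"
```

An empty result means no subject ID appears more than once.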

#### 2. Genotype file: vcf.gz file

1. Each vcf.gz file must contain at least the GT field
2. The ID column becomes the snpID in the final output
3. The vcf.gz file must also contain the DS field if dosages are to be used in the GWAS (imputed=T)

```
example: ./data/vcf/vcf_file1.vcf.gz
         ./data/1KG_vcf/1KG_phase3_subset_chr1.vcf.gz
```
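
Whether a file carries GT or DS can be inspected without extra tools, because the VCF format lists the per-sample keys in the ninth tab-separated column (FORMAT) of each record. The helpers below are a hypothetical sketch, not part of the pipeline; they look only at the first variant record and assume the remaining records use the same FORMAT.

```bash
# Sketch: print the FORMAT keys of the first variant record in a vcf.gz file.
# vcf_format_keys and vcf_has_key are hypothetical helpers; they rely only on
# the VCF layout, where FORMAT is the 9th tab-separated column of a record.
vcf_format_keys() {
  gzip -dc "$1" | awk -F'\t' '!/^#/ { print $9; exit }'
}

# vcf_has_key FILE KEY succeeds if KEY (e.g. GT or DS) is among the FORMAT keys.
vcf_has_key() {
  vcf_format_keys "$1" | tr ':' '\n' | grep -qx "$2"
}

# Usage on one of the toy files, if it is present in the working directory.
vcf="./data/vcf/vcf_file1.vcf.gz"
if [ -f "$vcf" ]; then
  vcf_has_key "$vcf" GT && echo "GT present"
  vcf_has_key "$vcf" DS || echo "no DS field: dosages (imputed=T) would not be available"
fi
```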

#### 3. Mapping file: csv file

1. Two-column csv file mapping the prefix to the vcf.gz files
2. The results for each chromosome will be named by the corresponding prefix
3. NO header

```
example: ./data/toy_vcf.csv
         ./data/1KG_vcf.csv
```

```{r}
map.dat <- read.csv("./data/toy_vcf.csv", header=F)
kable(head(map.dat))
```
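
The mapping file rules can be sanity-checked in the shell. The sketch below uses two hypothetical helpers: `check_mapping` flags any line that does not have exactly two comma-separated fields, and `missing_vcfs` lists paths in the second column that do not exist. It assumes the paths contain no commas and are resolvable from the current directory.

```bash
# Sketch: validate a two-column, header-less mapping csv (prefix,file_path).
# check_mapping and missing_vcfs are hypothetical helpers, not pipeline code.
check_mapping() {
  awk -F',' '
    NF != 2 { printf "line %d: expected 2 columns, found %d\n", NR, NF; bad = 1 }
    END { exit bad }
  ' "$1"
}

# Print every path in the second column that is not an existing file.
missing_vcfs() {
  while IFS=',' read -r prefix path || [ -n "$prefix" ]; do
    [ -f "$path" ] || printf '%s\n' "$path"
  done < "$1"
}

# Usage on the toy mapping file, if it is present in the working directory.
map="./data/toy_vcf.csv"
if [ -f "$map" ]; then
  check_mapping "$map" && missing_vcfs "$map"
fi
```

Because the file has no header, a header row left in by mistake would show up in the `missing_vcfs` output as a path that does not exist.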

### Optional Input File Formats

#### 1. SNP set
1. Two-column txt file separated by ","
2. The first column should be the chromosome and the second column the physical position, with the fixed header "chr,pos"

```
example: ./data/snpset.txt
```

```{r}
snp.dat <- fread("./data/snpset.txt")
kable(head(snp.dat))
```
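
A short check of the SNP set layout, written as a sketch: `check_snpset` is a hypothetical helper that verifies the fixed `chr,pos` header and that every following line has two fields with an integer position. It assumes Unix line endings; a file saved with Windows line endings would fail the header comparison.

```bash
# Sketch: validate the SNP set file (header "chr,pos", then chr,position rows).
# check_snpset is a hypothetical helper, not part of the pipeline.
check_snpset() {
  awk -F',' '
    NR == 1 {
      if ($0 != "chr,pos") { print "header must be chr,pos"; bad = 1 }
      next
    }
    NF != 2 || $2 !~ /^[0-9]+$/ {
      printf "line %d: expected chr,pos with an integer position\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1"
}

# Usage on the toy SNP set, if it is present in the working directory.
snps="./data/snpset.txt"
if [ -f "$snps" ]; then
  check_snpset "$snps" && echo "snpset looks well-formed"
fi
```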

#### 2. Genetic relationship matrix

1. A symmetric matrix saved in rds format, with both rows and columns corresponding to subjects
2. Can be replaced by 2 * the kinship matrix

```{r}
grm <- readRDS("./data/grm.rds")
kable(grm[1:5,1:5])
```

### GWAS example
#### Input file:

##### 1. Phenotype csv

```{r}
pheno.dat <- read.csv("./data/1KG_pheno_logistic.csv")
kable(head(pheno.dat))
```

##### 2. Mapping file
```{r}
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
```

##### 3. Genotype file
See mapping file

#### Execution:
```bash
# Run with a .config file:
nextflow run gwas.nf -c $PWD/configs/gwas_1KG_logistic.config

# Or run with the equivalent command-line options:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_logistic.csv --phenotype outcome --covars sex,PC1,PC2,PC3,PC4 --pca_grm --model logistic --test Score --gwas --group Population --min_maf 0.1 --max_pval_manhattan 0.5 --max_pval 0.05 --ref_genome hg19
```
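
The long command is easier to review when split one option per line. The wrapper below is only an illustrative sketch: `run_gwas_example` and the `NEXTFLOW` variable are hypothetical names, while the options are copied unchanged from the command above. By default it performs a dry run that prints the command instead of launching Nextflow.

```bash
# Sketch: the GWAS example command, one option per line.
# run_gwas_example and NEXTFLOW are illustrative names, not pipeline features.
run_gwas_example() {
  data_dir="${1:-$PWD/data}"
  # Unquoted on purpose: the default expands to the two words "echo nextflow",
  # which prints the command instead of running it (a dry run).
  ${NEXTFLOW:-echo nextflow} run gwas.nf \
    --vcf_list "$data_dir/1KG_vcf.csv" \
    --pheno "$data_dir/1KG_pheno_logistic.csv" \
    --phenotype outcome \
    --covars sex,PC1,PC2,PC3,PC4 \
    --pca_grm \
    --model logistic \
    --test Score \
    --gwas \
    --group Population \
    --min_maf 0.1 \
    --max_pval_manhattan 0.5 \
    --max_pval 0.05 \
    --ref_genome hg19
}

run_gwas_example "$PWD/data"            # dry run: prints the full command
# NEXTFLOW=./nextflow run_gwas_example  # would launch the pipeline
```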

### Gene-based example
#### Input file:
##### 1. Phenotype csv

```{r}
pheno.dat <- read.csv("./data/1KG_pheno_linear.csv")
kable(head(pheno.dat))
```

##### 2. Mapping file

```{r}
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
```

##### 3. Genotype file
See mapping file

#### Execution:
```bash
# Run with a .config file:
nextflow run gwas.nf -c $PWD/configs/gene_1KG_linear.config

# Or run with the equivalent command-line options:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_linear.csv --phenotype outcome --covars PC1,PC2,PC3,PC4 --pca_grm --model linear --test Score --gene_based --group Population --max_pval 0.01 --ref_genome hg19
```

### GWLA example
#### Input file:
##### 1. Phenotype csv

```{r}
pheno.dat <- read.csv("./data/1KG_pheno_longitudinal.csv")
kable(head(pheno.dat))
```

##### 2. Mapping file

```{r}
map.dat <- read.csv("./data/1KG_vcf.csv", header=F)
kable(head(map.dat))
```

##### 3. Genotype file
See mapping file

#### Execution:

```bash
# Run with a .config file:
nextflow run gwas.nf -c $PWD/configs/gwla_1KG_linear_slope.config

# Or run with the equivalent command-line options:
nextflow run gwas.nf --vcf_list $PWD/data/1KG_vcf.csv --pheno $PWD/data/1KG_pheno_longitudinal.csv --phenotype outcome --covars sex,age,PC1,PC2,PC3,PC4 --pca_grm --model linear --test Score --longitudinal --random_slope delta.age --group Population --min_maf 0.1 --max_pval_manhattan 0.5 --max_pval 0.01 --ref_genome hg19
```

# Help command
```bash
# You can see explanations for all parameters with the help command:
nextflow gwas.nf --help

N E X T F L O W  ~  version 19.04.1
Launching `gwas.nf` [tiny_venter] - revision: c9ded642f7
USAGE:
Mandatory arguments:
--vcf_list                 String        Path to the two-column mapping csv file: id , file_path 
--pheno                    String        Path to the phenotype file
--phenotype                String        Name of the phenotype column
Optional arguments:
--gds_input                Logical       If true, ignore vcf input, start with GDS files and skip qc_miss, qc_mono, vcf_to_gds steps
--gds_list                 String        Path to the two-column mapping gds file: id , file_path 
--outdir                   String        Path to the master folder to store all results
--covars                   String        Name of the covariates to include in analysis model separated by comma (e.g. "age,sex,educ")
--qc                       Logical       If true, run qc_miss(filter genotypes called below max_missing) and qc_mono (drop monomorphic SNPs)
--max_missing              Numeric       Threshold for qc_miss (filter genotypes called below this value)
--pca_grm                  Logical       If true, run PCAiR (generate PCA in Related individuals) and PCRelate (generate genomic relationship matrix)
--snpset                   String        Path to the two column txt file separated by comma: chr,pos (can only be effective when pca_grm = true)
--grm                      String        Path to the genomic relationship matrix (can only be effective when pca_grm = false)
--model                    String        Name of regression model for gwas: "linear" or "logistic"
--test                     String        Name of statistical test for significance: "Score", "Score.SPA", "BinomiRare" and "CMP" (details see https://rdrr.io/bioc/GENESIS/man/assocTestSingle.html) 
--gwas                     Logical       If true, run gwas
--imputed                  Logical       If true, use dosages in regression model (DS columns needed in input vcf files)
--gene_based               Logical       If true, run aggregate test for genes based on hg19 reference genome
--max_maf                  Numeric       Threshold for maximum minor allele frequencies of SNPs to be aggregated
--method                   String        Name of aggregation test method: "Burden", "SKAT", "fastSKAT", "SMMAT" or "SKATO"
--longitudinal             Logical       If true, run genome-wide longitudinal analysis
--random_slope             String        if set to "null", random intercept only model is run; else run random slope and random intercept model
--group                    String        Name of the group variable based on which the allele frequencies in each subgroup is calculated (can be left empty)
--dosage                   Logical       If true, also calculate dosages in addition to allele frequencies (can be very slow with large single gds input)
--min_maf                  Numeric       Threshold for minimum minor allele frequencies of SNPs to include in QQ- and Manhattan-plot
--max_pval_manhattan       Numeric       Threshold for maximum p-value of SNPs to show in Manhattan-plot
--max_pval                 Numeric       Threshold for maximum p-value of SNPs to annotate
--ref_genome               String        Name of the reference genome for annotation: hg19 or hg38
```

Owner

  • Name: Monti Lab
  • Login: montilab
  • Kind: organization
  • Email: montilab@bu.edu

JOSS Publication

nf-gwas-pipeline: A Nextflow Genome-Wide Association Study Pipeline
Published
March 02, 2021
Volume 6, Issue 59, Page 2957
Authors
Zeyuan Song
Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue 3rd Floor, Boston, MA 02218, USA
Anastasia Gurinovich
Department of Biostatistics, Boston University School of Public Health, 801 Massachusetts Avenue 3rd Floor, Boston, MA 02218, USA
Anthony Federico
Section of Computational Biomedicine, Boston University School of Medicine, 72 East Concord St., Boston, MA 02218, USA, Bioinformatics Program, Boston University, 24 Cummington Mall, Boston, MA 02215, USA
Stefano Monti
Section of Computational Biomedicine, Boston University School of Medicine, 72 East Concord St., Boston, MA 02218, USA, Bioinformatics Program, Boston University, 24 Cummington Mall, Boston, MA 02215, USA
Paola Sebastiani
Institute for Clinical Research and Health Policy Studies, Tufts Medical Center, 800 Washington Street, Boston, MA 02111, USA
Editor
Lorena Pantano ORCID
Tags
Nextflow GWAS gene-based analysis longitudianl GWAS

GitHub Events

Total
  • Watch event: 3
  • Push event: 2
Last Year
  • Watch event: 3
  • Push event: 2

Committers

Last synced: 5 months ago

All Time
  • Total Commits: 156
  • Total Committers: 3
  • Avg Commits per committer: 52.0
  • Development Distribution Score (DDS): 0.051
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
ZeyuanSong 5****g 148
anfederico a****o 7
Daniel S. Katz d****z@i****g 1

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 3
  • Total pull requests: 3
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 10 minutes
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • mgalland (1)
  • rspirgel (1)
  • PavitaKae (1)
Pull Request Authors
  • ZeyuanSong (2)
  • danielskatz (1)

Dependencies

Dockerfile docker
  • bioconductor/bioconductor_docker latest build