mungesumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics

https://github.com/al-murphy/mungesumstats

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    2 of 12 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

bioconductor-package bioinformatics database-api genomics gwas qtl r r-package standardisation summary-statistics vcf-files

Keywords from Contributors

rnaseq recount3 recount mouse junction illumina human gene exon derfinder
Last synced: 6 months ago

Repository

Statistics
  • Stars: 86
  • Watchers: 3
  • Forks: 18
  • Open Issues: 10
  • Releases: 0
Created almost 5 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog

README.Rmd

---
title: "`MungeSumstats`: Standardise the format of GWAS summary statistics"
author: "Authors: Alan Murphy, Brian Schilder and Nathan Skene"
date: "Updated: `r format(Sys.Date(), '%b-%d-%Y')`"
bibliography: vignettes/MungeSumstats.bib
csl: vignettes/nature.csl
output: rmarkdown::github_document
vignette: >
    %\VignetteIndexEntry{MungeSumstats}
    %\VignetteEngine{knitr::rmarkdown}
    %\usepackage[utf8]{inputenc}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(tidy = FALSE, warning = FALSE, message = FALSE)
```

`r badger::badge_bioc_release(color = "black")`
`r badger::badge_github_version(color = "black")`
`r badger::badge_last_commit(branch = "master")`
`r badger::badge_bioc_download(by = "total", color = "blue")`
`r badger::badge_license()`
`r badger::badge_doi(doi = "https://doi.org/10.1093/bioinformatics/btab665", color="blue")`

# Introduction

The `MungeSumstats` package is designed to facilitate the standardisation of GWAS summary statistics.

## Overview

The package is designed to handle the lack of standardisation of output files by the GWAS community. The [MRC IEU Open GWAS](https://gwas.mrcieu.ac.uk/) team have provided full summary statistics for >10k GWAS, which are API-accessible via the [`ieugwasr`](https://mrcieu.github.io/ieugwasr/) and [`gwasvcf`](https://github.com/MRCIEU/gwasvcf) packages. But these GWAS are only standardised in the sense that they are VCF format, and can be fully standardised with `MungeSumstats`.

`MungeSumstats` provides a framework to standardise the format for any GWAS summary statistics, including those in VCF format, enabling downstream integration and analysis. It addresses the most common discrepancies across summary statistic files, and offers a range of adjustable Quality Control (QC) steps.

## Citation

If you use `MungeSumstats`, please cite the original authors of the GWAS as well as:

> Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics.
*Bioinformatics*, btab665, https://doi.org/10.1093/bioinformatics/btab665

# Installing `MungeSumstats`

`MungeSumstats` is available on [Bioconductor](https://bioconductor.org/packages/MungeSumstats). To install `MungeSumstats` from Bioconductor, run:

```R
if (!require("BiocManager")) install.packages("BiocManager")

BiocManager::install("MungeSumstats")
```

You can then load the package:

```R
library(MungeSumstats)
```

Note that there is also a [docker image for MungeSumstats](https://hub.docker.com/r/neurogenomicslab/mungesumstats).

Note that a number of the checks employed by `MungeSumstats` use a reference genome. If your GWAS summary statistics file of interest relates to *GRCh38*, you will need to install `SNPlocs.Hsapiens.dbSNP155.GRCh38` and `BSgenome.Hsapiens.NCBI.GRCh38` from Bioconductor as follows:

```R
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
```

If your GWAS summary statistics file of interest relates to *GRCh37*, you will need to install `SNPlocs.Hsapiens.dbSNP155.GRCh37` and `BSgenome.Hsapiens.1000genomes.hs37d5` from Bioconductor as follows:

```R
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```

These may take some time to install and are not included in the package as some users may only need one of *GRCh37*/*GRCh38*. If you are unsure of the genome build, `MungeSumstats` can also infer this information from your data.

# Getting started

See the [Getting started vignette website](https://al-murphy.github.io/MungeSumstats/articles/MungeSumstats.html) for up-to-date instructions on usage.

See the [OpenGWAS vignette website](https://al-murphy.github.io/MungeSumstats/articles/OpenGWAS.html) for information on how to use `MungeSumstats` to access, standardise and perform quality control on GWAS summary statistics from the MRC IEU [Open GWAS Project](https://gwas.mrcieu.ac.uk/).
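
Before working through the vignettes, it can help to confirm that the reference packages for your genome build are installed. The sketch below encodes only the package names listed in the installation section. `reference_packages()` and `missing_reference_packages()` are hypothetical helper names used for illustration and are not part of `MungeSumstats`.

```R
# Hypothetical helpers (not part of MungeSumstats): map a genome build to the
# Bioconductor reference packages named in the installation section.
reference_packages <- function(build = c("GRCh37", "GRCh38")) {
  build <- match.arg(build)
  switch(build,
    GRCh37 = c("SNPlocs.Hsapiens.dbSNP155.GRCh37",
               "BSgenome.Hsapiens.1000genomes.hs37d5"),
    GRCh38 = c("SNPlocs.Hsapiens.dbSNP155.GRCh38",
               "BSgenome.Hsapiens.NCBI.GRCh38")
  )
}

# Return the required packages that are not yet installed.
# system.file() returns "" for a missing package and does not load anything.
missing_reference_packages <- function(build) {
  pkgs <- reference_packages(build)
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

to_install <- missing_reference_packages("GRCh38")
if (length(to_install) > 0) {
  message("Still to install: ", paste(to_install, collapse = ", "))
  # BiocManager::install(to_install)  # uncomment to install
}
```

Checking with `system.file()` rather than loading each package keeps the check cheap, since the reference packages are large.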

**NOTE**: to authenticate, you need to generate a token from the OpenGWAS website. The token behaves like a password, and it will be used to authorise the requests you make to the OpenGWAS API. Here are the steps to generate the token and then have `ieugwasr` automatically use it for your queries:

1. Log in to https://api.opengwas.io/profile/
2. Generate a new token
3. Add `OPENGWAS_JWT=` followed by your token to your `.Renviron` file. This file can be edited in R by running `usethis::edit_r_environ()`
4. Restart your R session
5. To check that your token is being recognised, run `ieugwasr::get_opengwas_jwt()`. If it returns a long random string then you are authenticated.
6. To check that your token is working, run `ieugwasr::user()`. It will make a request to the API for your user information using your token. It should return a list with your user information. If it returns an error, then your token is not working.
7. Make sure you have submitted use

Please read carefully through the [FAQ website](https://github.com/Al-Murphy/MungeSumstats/wiki/FAQ) for any queries about running `MungeSumstats`. If you have any problems outside of these, please file an [Issue](https://github.com/al-murphy/MungeSumstats/issues) here on GitHub.

# Future Enhancements

The `MungeSumstats` package aims to be able to handle the most common summary statistic file formats, including VCF. If your file cannot be formatted by `MungeSumstats`, feel free to report the [Issue](https://github.com/al-murphy/MungeSumstats/issues) on GitHub along with your summary statistics file header. We also encourage people to edit the code to resolve their particular issues and are happy to incorporate these through pull requests on GitHub.
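
When reporting a file that cannot be formatted, the header can be pulled out with a few lines of base R and pasted into the issue. This is a minimal sketch: `read_sumstats_header()` is a hypothetical helper, not a `MungeSumstats` function, and the toy file below stands in for a real summary statistics file.

```R
# Hypothetical helper (not part of MungeSumstats): read the first line(s) of a
# summary statistics file so the header can be pasted into a GitHub issue.
# gzfile() also reads uncompressed files, so plain and .gz inputs both work.
read_sumstats_header <- function(path, n_lines = 1L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  readLines(con, n = n_lines, warn = FALSE)
}

# Toy file standing in for a real summary statistics file.
tmp <- tempfile(fileext = ".tsv")
writeLines(
  c("MarkerName\tCHR\tPOS\tA1\tA2\tEAF\tBeta\tSE\tPval",
    "rs1\t1\t12345\tA\tG\t0.3\t0.01\t0.005\t0.04"),
  tmp
)

header <- read_sumstats_header(tmp)
cat(header, "\n")
```

For files with a multi-line preamble, such as VCF, increase `n_lines` so that the column line is included.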

If your summary statistic file headers are not recognised by `MungeSumstats` but correspond to one of

```
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ,
```

feel free to update the `data("sumstatsColHeaders")` following the approach in the *data.R* file and add your mapping. Then use a [Pull Request](https://github.com/al-murphy/MungeSumstats/pulls) on GitHub and we will incorporate this change into the package.

# Contributors

We would like to acknowledge all those who have contributed to `MungeSumstats` development:

* [Alan Murphy](https://github.com/Al-Murphy)
* [Nathan Skene](https://github.com/NathanSkene)
* [Brian Schilder](https://github.com/bschilder)
* [Shea Andrews](https://github.com/sjfandrews)
* [Jonathan Griffiths](https://github.com/jonathangriffiths)
* [Kitty Murphy](https://github.com/KittyMurphy)
* [Mykhaylo Malakhov](https://github.com/MykMal)
* [Alasdair Warwick](https://github.com/rmgpanw)
* [Ao Lu](https://github.com/leoarrow1)
* [Sufyan Sualeman](https://github.com/sufyansuleman)
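
Returning to the header mapping described under Future Enhancements, the idea can be illustrated with a toy lookup table. The sketch assumes a two-column layout with `Uncorrected` and `Corrected` columns, which is how `sumstatsColHeaders` is understood to be organised; inspect `data("sumstatsColHeaders")` to confirm before relying on it. The rows shown and the `standardise_headers()` helper are hypothetical, written in base R, and not taken from the package.

```R
# Toy mapping in the two-column layout assumed above; these rows are
# illustrative and are not copied from the package.
toy_mapping <- data.frame(
  Uncorrected = c("MARKERNAME", "RSID", "POS", "PVAL", "EAF"),
  Corrected   = c("SNP", "SNP", "BP", "P", "FRQ"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rename recognised headers (case-insensitively) and
# leave unrecognised ones untouched.
standardise_headers <- function(headers, mapping) {
  idx <- match(toupper(headers), mapping$Uncorrected)
  ifelse(is.na(idx), headers, mapping$Corrected[idx])
}

standardise_headers(
  c("MarkerName", "CHR", "pos", "A1", "A2", "Pval"),
  toy_mapping
)

# Adding a new mapping is then a one-row append.
toy_mapping <- rbind(
  toy_mapping,
  data.frame(Uncorrected = "EFFECT_SIZE", Corrected = "BETA",
             stringsAsFactors = FALSE)
)
```

A pull request would add the equivalent row to the package data rather than to a local copy like this one.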

Owner

  • Name: Alan Murphy
  • Login: Al-Murphy
  • Kind: user
  • Location: London
  • Company: UK DRI at Imperial College London

Computational Biologist, PhD Student, Neurogenomics

GitHub Events

Total
  • Issues event: 21
  • Watch event: 8
  • Issue comment event: 22
  • Push event: 16
  • Pull request event: 2
  • Gollum event: 9
  • Fork event: 3
  • Create event: 3
Last Year
  • Issues event: 21
  • Watch event: 8
  • Issue comment event: 22
  • Push event: 16
  • Pull request event: 2
  • Gollum event: 9
  • Fork event: 3
  • Create event: 3

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 399
  • Total Committers: 12
  • Avg Commits per committer: 33.25
  • Development Distribution Score (DDS): 0.411
Past Year
  • Commits: 24
  • Committers: 3
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.167
Top Committers
| Name | Email | Commits |
|---|---|---|
| Al-Murphy | a****4@h****m | 235 |
| Brian M. Schilder | 3****r | 120 |
| sjfandrews | s****s@g****m | 13 |
| J Wokaty | j****y@s****u | 10 |
| Nitesh Turaga | n****a@g****m | 6 |
| Jonathan Griffiths | 7****s | 5 |
| Brian Fulton-Howard | f****1@g****m | 4 |
| A Wokaty | a****y@s****u | 2 |
| rmgpanw | a****6@g****m | 1 |
| lshep | l****d@r****g | 1 |
| cfbeuchel | c****l@c****e | 1 |
| MykMal | m****v@d****m | 1 |

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 7
  • Total pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 2 days
  • Total issue authors: 6
  • Total pull request authors: 1
  • Average comments per issue: 1.29
  • Average comments per pull request: 0.33
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 7
  • Pull requests: 3
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 2 days
  • Issue authors: 6
  • Pull request authors: 1
  • Average comments per issue: 1.29
  • Average comments per pull request: 0.33
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Al-Murphy (3)
  • shaman-yellow (2)
  • LouChen-med (2)
  • jksull (1)
  • songlyzz (1)
  • rmgpanw (1)
  • AhmedArslan (1)
  • htqqdd (1)
  • svenkatesh25 (1)
  • bschilder (1)
  • jaqueytw (1)
  • JacksonDU630 (1)
  • karpat90 (1)
  • AndyYang0924 (1)
  • DaXuanGarden (1)
Pull Request Authors
  • sufyansuleman (3)
  • Al-Murphy (2)
  • MC4R (1)
Top Labels
Issue Labels
  • bug (7)
  • help wanted (2)
  • enhancement (1)
Pull Request Labels
  • help wanted (1)

Dependencies

DESCRIPTION cran
  • R >= 4.1 depends
  • BSgenome * imports
  • Biostrings * imports
  • GenomeInfoDb * imports
  • GenomicRanges * imports
  • IRanges * imports
  • R.utils * imports
  • RCurl * imports
  • VariantAnnotation * imports
  • data.table * imports
  • dplyr * imports
  • googleAuthR * imports
  • httr * imports
  • jsonlite * imports
  • magrittr * imports
  • methods * imports
  • parallel * imports
  • rtracklayer * imports
  • stats * imports
  • stringr * imports
  • utils * imports
  • BSgenome.Hsapiens.1000genomes.hs37d5 * suggests
  • BSgenome.Hsapiens.NCBI.GRCh38 * suggests
  • BiocGenerics * suggests
  • BiocParallel * suggests
  • BiocStyle * suggests
  • GenomicFiles * suggests
  • MatrixGenerics * suggests
  • Rsamtools * suggests
  • S4Vectors * suggests
  • SNPlocs.Hsapiens.dbSNP144.GRCh37 * suggests
  • SNPlocs.Hsapiens.dbSNP144.GRCh38 * suggests
  • SNPlocs.Hsapiens.dbSNP155.GRCh37 * suggests
  • SNPlocs.Hsapiens.dbSNP155.GRCh38 * suggests
  • UpSetR * suggests
  • badger * suggests
  • covr * suggests
  • knitr * suggests
  • markdown * suggests
  • rmarkdown * suggests
  • seqminer * suggests
  • testthat >= 3.0.0 suggests
.github/workflows/rworkflows.yml actions
  • neurogenomics/rworkflows master composite