mungesumstats

Rapid standardisation and quality control of GWAS or QTL summary statistics

https://github.com/al-murphy/mungesumstats

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 4 DOI reference(s) in README
○ Academic publication links
✓ Committers with academic emails: 2 of 12 committers (16.7%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary

Keywords

bioconductor-package bioinformatics database-api genomics gwas qtl r r-package standardisation summary-statistics vcf-files

Keywords from Contributors

rnaseq recount3 recount mouse junction illumina human gene exon derfinder

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: Al-Murphy
Language: R
Default Branch: master
Homepage: https://al-murphy.github.io/MungeSumstats/
Size: 9.41 MB

Statistics

Stars: 86
Watchers: 3
Forks: 18
Open Issues: 10
Releases: 0

Created almost 5 years ago · Last pushed 6 months ago

Metadata Files

Readme Changelog

README.Rmd

---
title: "`MungeSumstats`: Standardise the format of GWAS summary statistics"
author: "Authors: Alan Murphy, Brian Schilder and Nathan Skene"
date: "Updated: `r format(Sys.Date(), '%b-%d-%Y')`"
bibliography: vignettes/MungeSumstats.bib
csl: vignettes/nature.csl
output: rmarkdown::github_document
vignette: >
  %\VignetteIndexEntry{MungeSumstats}
  %\VignetteEngine{knitr::rmarkdown}
  %\usepackage[utf8]{inputenc}    
---



```{r, echo=FALSE}
knitr::opts_chunk$set(tidy = FALSE,
                      warning = FALSE, 
                      message = FALSE)
```


`r badger::badge_bioc_release(color = "black")`
`r badger::badge_github_version(color = "black")`
`r badger::badge_last_commit(branch = "master")`
`r badger::badge_bioc_download(by = "total", color = "blue")`
`r badger::badge_license()`
`r badger::badge_doi(doi = "https://doi.org/10.1093/bioinformatics/btab665", color="blue")`




# Introduction

The `MungeSumstats` package is designed to facilitate the standardisation of GWAS summary statistics. 

## Overview

The package is designed to handle the lack of standardisation of output files by the GWAS community. The [MRC IEU Open GWAS](https://gwas.mrcieu.ac.uk/) team have 
provided full summary statistics for >10k GWAS, which are API-accessible via the [`ieugwasr`](https://mrcieu.github.io/ieugwasr/) and [`gwasvcf`](https://github.com/MRCIEU/gwasvcf) packages. However, these GWAS are only standardised in the sense that they are in VCF format; they can be 
fully standardised with `MungeSumstats`.

`MungeSumstats` provides a framework to standardise the format for any GWAS summary statistics, including those in VCF format, enabling downstream integration and analysis. It addresses the most common discrepancies across summary statistic files, and offers a range of adjustable Quality Control (QC) steps.

## Citation

If you use `MungeSumstats`, please cite the original authors of the GWAS 
as well as:  

> Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) 
MungeSumstats: A Bioconductor package for the
standardisation and quality control of many GWAS summary
statistics. 
*Bioinformatics*, btab665, https://doi.org/10.1093/bioinformatics/btab665


# Installing `MungeSumstats`

`MungeSumstats` is available on 
[Bioconductor](https://bioconductor.org/packages/MungeSumstats). 
To install `MungeSumstats` from Bioconductor, run:

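```R
# Install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")

# Install MungeSumstats from Bioconductor
BiocManager::install("MungeSumstats")
```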

You can then load the package:

```R
library(MungeSumstats)
```

Note that there is also a 
[docker image for MungeSumstats](https://hub.docker.com/r/neurogenomicslab/mungesumstats).

Note that a number of the checks employed by `MungeSumstats` use a reference 
genome. If your GWAS summary statistics file of interest relates to
*GRCh38*, you will need to install `SNPlocs.Hsapiens.dbSNP155.GRCh38` and 
`BSgenome.Hsapiens.NCBI.GRCh38` from Bioconductor as follows:

```R
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh38")
BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38")
```

If your GWAS summary statistics file of interest relates to *GRCh37*, you will 
need to install `SNPlocs.Hsapiens.dbSNP155.GRCh37` and 
`BSgenome.Hsapiens.1000genomes.hs37d5` from Bioconductor as follows:

```R
BiocManager::install("SNPlocs.Hsapiens.dbSNP155.GRCh37")
BiocManager::install("BSgenome.Hsapiens.1000genomes.hs37d5")
```

These may take some time to install and are not included in the package as some 
users may only need one of *GRCh37*/*GRCh38*. If you are unsure of the genome 
build, MungeSumstats can also infer this information from your data.
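
Before running a long job, it can help to confirm that the reference packages for your build are actually installed. The snippet below is a minimal sketch in base R; `ref_packages` and `missing_ref_packages()` are hypothetical names introduced here for illustration and are not part of `MungeSumstats`. The package names are the ones listed above.

```R
# Reference packages named in the installation notes above, keyed by build.
ref_packages <- list(
  GRCh37 = c("SNPlocs.Hsapiens.dbSNP155.GRCh37",
             "BSgenome.Hsapiens.1000genomes.hs37d5"),
  GRCh38 = c("SNPlocs.Hsapiens.dbSNP155.GRCh38",
             "BSgenome.Hsapiens.NCBI.GRCh38")
)

# Hypothetical helper (not part of MungeSumstats): return the reference
# packages for `build` that are not installed in the current library paths.
missing_ref_packages <- function(build = c("GRCh37", "GRCh38")) {
  build <- match.arg(build)
  pkgs <- ref_packages[[build]]
  # system.file() returns "" when a package cannot be found; this avoids
  # loading the (large) reference packages just to test for them.
  installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

to_install <- missing_ref_packages("GRCh38")
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
}
```

Anything reported in `to_install` can be passed to `BiocManager::install()` as shown in the commands above.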

# Getting started

See the [Getting started vignette website](https://al-murphy.github.io/MungeSumstats/articles/MungeSumstats.html)
for up-to-date instructions on usage.
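
As a rough orientation only, the sketch below standardises the example summary statistics file that ships with the package by calling `format_sumstats()`. It assumes the example data are on *GRCh37* and that the *GRCh37* reference packages listed above are installed; the call is skipped otherwise. The vignette remains the authoritative source for arguments and defaults.

```R
# Packages this sketch needs: MungeSumstats plus the GRCh37 references.
needed <- c("MungeSumstats",
            "SNPlocs.Hsapiens.dbSNP155.GRCh37",
            "BSgenome.Hsapiens.1000genomes.hs37d5")
have_all <- all(vapply(needed, function(p) nzchar(system.file(package = p)),
                       logical(1)))

reformatted <- NULL
if (have_all) {
  # Example file bundled with MungeSumstats
  path <- system.file("extdata", "eduAttainOkbay.txt",
                      package = "MungeSumstats")
  # Run the standardisation and QC steps with default settings
  reformatted <- MungeSumstats::format_sumstats(path = path,
                                                ref_genome = "GRCh37")
} else {
  message("Skipping: install MungeSumstats and the GRCh37 references first.")
}
```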

See the [OpenGWAS vignette website](https://al-murphy.github.io/MungeSumstats/articles/OpenGWAS.html)
for information on how to use MungeSumstats to access, standardise and perform
quality control on GWAS Summary Statistics from the MRC IEU [Open GWAS Project](https://gwas.mrcieu.ac.uk/).

**NOTE** to authenticate, you need to generate a token from the OpenGWAS website. 
The token behaves like a password, and it will be used to authorise the requests 
you make to the OpenGWAS API. Here are the steps to generate the token and then 
have `ieugwasr` automatically use it for your queries:
  
1. Log in to https://api.opengwas.io/profile/
2. Generate a new token
3. Add `OPENGWAS_JWT=` followed by your token to your .Renviron file; this can be edited in R by 
running `usethis::edit_r_environ()`
4. Restart your R session
5. To check that your token is being recognised, run 
`ieugwasr::get_opengwas_jwt()`. If it returns a long random string then you are 
authenticated.
6. To check that your token is working, run `ieugwasr::user()`. It will make a 
request to the API for your user information using your token. It should return 
a list with your user information. If it returns an error, then your token is 
not working.
7. Make sure you have submitted use
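
The checks in steps 3 to 6 can be sketched as a short script. `opengwas_token_set()` is a hypothetical helper written for this illustration; the `ieugwasr` calls are the ones named in the steps above and only run if that package is installed and a token is present. Note that `ieugwasr::user()` makes a network request.

```R
# Hypothetical helper (not part of ieugwasr or MungeSumstats): report whether
# the OPENGWAS_JWT environment variable is set in this R session.
opengwas_token_set <- function() {
  nzchar(Sys.getenv("OPENGWAS_JWT", unset = ""))
}

if (!opengwas_token_set()) {
  message("OPENGWAS_JWT is not set; add it to .Renviron and restart R.")
} else if (requireNamespace("ieugwasr", quietly = TRUE)) {
  # Step 5: confirm that ieugwasr picks up the token
  token <- ieugwasr::get_opengwas_jwt()
  # Step 6: confirm that the API accepts the token
  info <- try(ieugwasr::user(), silent = TRUE)
  if (inherits(info, "try-error")) {
    message("Token found, but the request for user information failed.")
  }
}
```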

Please read carefully through the [FAQ website](https://github.com/Al-Murphy/MungeSumstats/wiki/FAQ) 
for any queries about running MungeSumstats. If you have any problems outside 
of this, please file an [Issue](https://github.com/al-murphy/MungeSumstats/issues) 
here on GitHub.

# Future Enhancements

The `MungeSumstats` package aims to handle the most common
summary statistic file formats, including VCF. If your file cannot be
formatted by `MungeSumstats`, feel free to report the [Issue](https://github.com/al-murphy/MungeSumstats/issues) 
on GitHub along with your summary statistics file header. 

We also encourage people to edit the code to resolve their particular issues, 
and we are happy to incorporate these changes through pull requests on GitHub. If your
summary statistic file headers are not recognised by `MungeSumstats` but 
correspond to one of 

```
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, 
NSTUDY, INFO or FRQ
```

feel free to update the `data("sumstatsColHeaders")` mapping, following the 
approach in the *data.R* file, and add your mapping. Then open a [Pull Request](https://github.com/al-murphy/MungeSumstats/pulls) on 
GitHub and we will incorporate this change into the package.
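
To illustrate what adding a header mapping amounts to, here is a self-contained sketch that uses a toy table rather than the package data. The column names `Uncorrected` and `Corrected`, the upper-casing of headers before matching, and the helpers `add_header_mapping()` and `standardise_headers()` are all assumptions made for this example; check `data("sumstatsColHeaders")` and *data.R* for the actual layout before opening a pull request.

```R
# Toy stand-in for the package's mapping table. The column names
# "Uncorrected" and "Corrected" are an assumption about its layout.
toy_map <- data.frame(
  Uncorrected = c("RSID", "POS", "PVAL"),
  Corrected   = c("SNP",  "BP",  "P"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: append one header mapping unless it already exists.
add_header_mapping <- function(map, uncorrected, corrected) {
  uncorrected <- toupper(uncorrected)
  if (uncorrected %in% map$Uncorrected) return(map)
  rbind(map, data.frame(Uncorrected = uncorrected, Corrected = corrected,
                        stringsAsFactors = FALSE))
}

# Hypothetical helper: rename recognised headers, leave the rest untouched.
standardise_headers <- function(headers, map) {
  idx <- match(toupper(headers), map$Uncorrected)
  ifelse(is.na(idx), headers, map$Corrected[idx])
}

# Register a new, previously unrecognised header and apply the mapping
toy_map <- add_header_mapping(toy_map, "my_pvalue", "P")
new_headers <- standardise_headers(c("rsid", "pos", "my_pvalue", "BETA"),
                                   toy_map)
```

Here `new_headers` holds the renamed headers, with the unrecognised-but-already-standard `BETA` passed through unchanged.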

# Contributors

We would like to acknowledge all those who have contributed to `MungeSumstats` 
development:

 * [Alan Murphy](https://github.com/Al-Murphy)
 * [Nathan Skene](https://github.com/NathanSkene)
 * [Brian Schilder](https://github.com/bschilder)
 * [Shea Andrews](https://github.com/sjfandrews)
 * [Jonathan Griffiths](https://github.com/jonathangriffiths)
 * [Kitty Murphy](https://github.com/KittyMurphy)
 * [Mykhaylo Malakhov](https://github.com/MykMal)
 * [Alasdair Warwick](https://github.com/rmgpanw)
 * [Ao Lu](https://github.com/leoarrow1)
 * [Sufyan Sualeman](https://github.com/sufyansuleman)

Owner

Name: Alan Murphy
Login: Al-Murphy
Kind: user
Location: London
Company: UK DRI at Imperial College London

Website: https://al-murphy.github.io/
Twitter: Al_Murphy_
Repositories: 1
Profile: https://github.com/Al-Murphy

Computational Biologist, PhD Student, Neurogenomics

GitHub Events

Total

Issues event: 21
Watch event: 8
Issue comment event: 22
Push event: 16
Pull request event: 2
Gollum event: 9
Fork event: 3
Create event: 3

Last Year

Issues event: 21
Watch event: 8
Issue comment event: 22
Push event: 16
Pull request event: 2
Gollum event: 9
Fork event: 3
Create event: 3

Committers

Last synced: 9 months ago

All Time

Total Commits: 399
Total Committers: 12
Avg Commits per committer: 33.25
Development Distribution Score (DDS): 0.411

Past Year

Commits: 24
Committers: 3
Avg Commits per committer: 8.0
Development Distribution Score (DDS): 0.167

Top Committers

Name	Email	Commits
Al-Murphy	a**4@h**m	235
Brian M. Schilder	3****r	120
sjfandrews	s**s@g**m	13
J Wokaty	j**y@s**u	10
Nitesh Turaga	n**a@g**m	6
Jonathan Griffiths	7****s	5
Brian Fulton-Howard	f**1@g**m	4
A Wokaty	a**y@s**u	2
rmgpanw	a**6@g**m	1
lshep	l**d@r**g	1
cfbeuchel	c**l@c**e	1
MykMal	m**v@d**m	1

Committer Domains (Top 20 + Academic)

sph.cuny.edu: 2
dnli.com: 1
charite.de: 1
roswellpark.org: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 7
Total pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: 2 days
Total issue authors: 6
Total pull request authors: 1
Average comments per issue: 1.29
Average comments per pull request: 0.33
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 7
Pull requests: 3
Average time to close issues: about 2 months
Average time to close pull requests: 2 days
Issue authors: 6
Pull request authors: 1
Average comments per issue: 1.29
Average comments per pull request: 0.33
Merged pull requests: 2
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

Al-Murphy (3)
shaman-yellow (2)
LouChen-med (2)
jksull (1)
songlyzz (1)
rmgpanw (1)
AhmedArslan (1)
htqqdd (1)
svenkatesh25 (1)
bschilder (1)
jaqueytw (1)
JacksonDU630 (1)
karpat90 (1)
AndyYang0924 (1)
DaXuanGarden (1)

Pull Request Authors

sufyansuleman (3)
Al-Murphy (2)
MC4R (1)

Top Labels

Issue Labels

bug (7) help wanted (2) enhancement (1)

Pull Request Labels

help wanted (1)

Dependencies

DESCRIPTION cran

R >= 4.1 depends
BSgenome * imports
Biostrings * imports
GenomeInfoDb * imports
GenomicRanges * imports
IRanges * imports
R.utils * imports
RCurl * imports
VariantAnnotation * imports
data.table * imports
dplyr * imports
googleAuthR * imports
httr * imports
jsonlite * imports
magrittr * imports
methods * imports
parallel * imports
rtracklayer * imports
stats * imports
stringr * imports
utils * imports
BSgenome.Hsapiens.1000genomes.hs37d5 * suggests
BSgenome.Hsapiens.NCBI.GRCh38 * suggests
BiocGenerics * suggests
BiocParallel * suggests
BiocStyle * suggests
GenomicFiles * suggests
MatrixGenerics * suggests
Rsamtools * suggests
S4Vectors * suggests
SNPlocs.Hsapiens.dbSNP144.GRCh37 * suggests
SNPlocs.Hsapiens.dbSNP144.GRCh38 * suggests
SNPlocs.Hsapiens.dbSNP155.GRCh37 * suggests
SNPlocs.Hsapiens.dbSNP155.GRCh38 * suggests
UpSetR * suggests
badger * suggests
covr * suggests
knitr * suggests
markdown * suggests
rmarkdown * suggests
seqminer * suggests
testthat >= 3.0.0 suggests

.github/workflows/rworkflows.yml actions

neurogenomics/rworkflows master composite
