cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.

https://github.com/biopsyk/cleansumstats

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: ncbi.nlm.nih.gov, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

data-cleaning gwas nextflow pipeline

Repository

Basic Info
Statistics
  • Stars: 19
  • Watchers: 3
  • Forks: 3
  • Open Issues: 33
  • Releases: 2
Created about 6 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Citation

README.md

cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles

Introduction

The cleansumstats pipeline takes a typical genomic sumstat file as input (normally the output from a GWAS), together with specifiers for chr, pos and available stats.

Quick Start

To run a quick test using the provided example and test data, use either Singularity or Docker, depending on what is available on your system. Note that Singularity has been renamed Apptainer.

```bash
# Make sure git and either singularity or docker are installed
git --version
singularity --version
docker --version

# Clone and enter the cleansumstats GitHub project
git clone https://github.com/BioPsyk/cleansumstats.git
cd cleansumstats
```
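The version checks above fail noisily when a tool is missing. As a minimal sketch, the same check can be scripted so that it reports which container runtime is available. The helper name `first_available` is hypothetical and not part of cleansumstats, and `apptainer` is included only because Singularity has been renamed; whether `cleansumstats.sh` works with an `apptainer` binary is not confirmed here.

```bash
# Print the first command from the argument list that is found on PATH.
# Returns non-zero if none of the candidates is installed.
# (Hypothetical helper, not part of cleansumstats.)
first_available() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# git is required; one container runtime is enough.
command -v git >/dev/null 2>&1 || echo "git not found" >&2
runtime=$(first_available singularity apptainer docker) \
  || echo "no container runtime found" >&2
echo "container runtime: ${runtime:-none}"
```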

Singularity

Using Singularity (pass the path to the image with -j):

```bash
# Pull the singularity image for AMD64/x86_64 systems (most common)
mkdir -p sif
singularity pull sif/ibp-cleansumstats-base_version-1.3.1.sif docker://biopsyk/ibp-cleansumstats:1.3.1-amd64

# Clean a sumstat using downsized example data for dbsnp and 1kgp (-e flag)
./cleansumstats.sh \
  -j sif/ibp-cleansumstats-base_version-1.3.1.sif \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -o out_example \
  -e 1
```

Docker

Using the Docker image (use the tag: dockerhub_biopsyk):

```bash
# Pull the docker image for AMD64/x86_64 systems (most common)
docker pull biopsyk/ibp-cleansumstats:1.3.1-amd64

# Run with docker (selected with the -j flag)
./cleansumstats.sh \
  -j dockerhub_biopsyk \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -o out_example \
  -e 1
```

Note: For ARM64 systems (e.g., Apple Silicon Macs), append -arm64 to the version tag instead of -amd64. For example: 1.3.1-arm64.
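The choice between the two suffixes can be derived from `uname -m`. The following is a minimal sketch; the helper name `tag_suffix_for` is hypothetical, and it is an assumption that an image exists for every version and architecture combination, so check the registry before pulling.

```bash
# Map a machine name (as printed by `uname -m`) to the image tag suffix.
# (Hypothetical helper, not part of cleansumstats.)
tag_suffix_for() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

version="1.3.1"   # version used in the quick-start examples
suffix=$(tag_suffix_for "$(uname -m)") \
  && echo "biopsyk/ibp-cleansumstats:${version}-${suffix}"
```

The printed name can then be passed to `docker pull`, or prefixed with `docker://` for `singularity pull`.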

Add full size reference data

During cleaning, all positions are compared to a reference to confirm existing annotation or to add missing annotation.

dbsnp reference

The dbsnp reference only has to be prepared once and can be reused for all sumstats that need cleaning.

```bash
# i. Download the dbsnp reference and supplemental files (size: 25 GB)
mkdir -p dbsnp
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/CHECKSUMS
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.md5
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz.tbi
wget -P dbsnp https://ftp.ncbi.nlm.nih.gov/snp/archive/b156/VCF/GCF_000001405.40.gz

# ii. If you are on an HPC, start an interactive session first
#     (with the SLURM settings below, the run took about 5h)
srun --mem=400g --ntasks 1 --cpus-per-task 60 --time=10:00:00 --account ibppipelinecleansumstats --pty /bin/bash

./cleansumstats.sh \
  prepare-dbsnp \
  -i dbsnp/GCF_000001405.40.gz \
  -o outdbsnp
```
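Because the download is large and the preparation step runs for hours, it can be worth checking the archive against the downloaded `.md5` file first. The following is a minimal sketch; the helper `verify_md5` is hypothetical, it relies on `md5sum` from GNU coreutils, and it assumes the `.md5` file starts with the hex digest (md5sum-style layout), which has not been verified here.

```bash
# Compare the MD5 of a file with the first field of its .md5 companion file.
# Assumes the .md5 file starts with the hex digest (md5sum-style layout).
# (Hypothetical helper, not part of cleansumstats.)
verify_md5() {
  local file="$1" md5file="$2" expected actual
  expected=$(awk 'NR==1 {print $1}' "$md5file")
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}

# Only runs when the download from step i is present.
if [ -f dbsnp/GCF_000001405.40.gz ]; then
  verify_md5 dbsnp/GCF_000001405.40.gz dbsnp/GCF_000001405.40.gz.md5
fi
```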

1000 genomes project reference

```bash
# i. Download
mkdir -p 1kgp
wget -P 1kgp https://ftp.ensembl.org/pub/release-112/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
wget -P 1kgp https://ftp.ensembl.org/pub/release-112/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz.csi

# ii. If you are on an HPC, start an interactive session first
#     (with the SLURM settings below, the run took about 5min)
srun --mem=80g --ntasks 1 --cpus-per-task 5 --time=1:00:00 --account ibppipelinecleansumstats --pty /bin/bash

./cleansumstats.sh \
  prepare-1kgp \
  -i 1kgp/1000GENOMES-phase_3.vcf.gz \
  -d outdbsnp \
  -o out1kgptest
```

Prepare meta data files

After the reference data (dbsnp and 1000 genomes) has been created, it is time to prepare the input for the actual cleaning. This file is called the meta file. It contains paths to other important files, such as the actual sumstats, README, article pdf, etc., all of which need to be in the same folder as their corresponding meta file. The meta file has to be filled in manually; see tests/example_data/sumstat_1/sumstat_1_raw_meta.txt for an example of what it looks like.

You can also use the web interface to generate a meta file. Again, remember that all files referred to by the meta file have to be in the same directory as the meta file when you run cleansumstats. See tests/example_data and sumstats 1-5 for an example of how you can structure your input folders.

Relative links are not supported in the meta file, which means all referenced files have to be in the same folder as the meta file. However, you can provide paths to folders containing associated files with the -p flag, for example: -p path/to/folder1,path/to/folder2
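Since a misplaced file only surfaces once the pipeline runs, a quick check of the folder layout beforehand can save a failed run. The following is a minimal sketch; the helper `check_same_dir` and the file names in the usage comment are hypothetical. It does not parse the meta file, so the file names to check are passed explicitly.

```bash
# Return 0 if every named file sits next to the meta file, non-zero otherwise.
# Usage: check_same_dir <metafile> <filename>...
# (Hypothetical helper, not part of cleansumstats; it does not parse the meta file.)
check_same_dir() {
  local metafile="$1" dir missing=0 name
  dir=$(dirname "$metafile")
  shift
  for name in "$@"; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing next to meta file: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call with made-up file names:
# check_same_dir my_study/my_meta.txt my_sumstats.gz README.txt
```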

Run a fully operational cleaning pipeline

This takes longer than the quick-start run, because variants are now mapped against the full dbsnp reference of more than 600 million rows.

Once you have prepared your meta data files, replace the example meta file passed to -i with your own.

```bash
# i. If you are on an HPC, start an interactive session first
#    (with the SLURM settings below, the run took about 10min)
srun --mem=40g --ntasks 1 --cpus-per-task 6 --time=1:00:00 --account ibppipelinecleansumstats --pty /bin/bash

# -d and -k must point to the output folders of prepare-dbsnp and prepare-1kgp
./cleansumstats.sh \
  -i tests/example_data/sumstat_1/sumstat_1_raw_meta.txt \
  -d outdbsnp \
  -k out1kgp \
  -o outclean

# For additional flags, see:
./cleansumstats.sh -h
```

More documentation

Credits

cleansumstats was originally written by Jesper R. Gådin.

Owner

  • Name: Institute of Biological Psychiatry
  • Login: BioPsyk
  • Kind: organization
  • Location: Boserupvej 2, 4000 Roskilde, Denmark

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Gadin"
  given-names: "Jesper Robert"
  orcid: "https://orcid.org/0000-0002-9210-9534"
- family-names: "Zetterberg"
  given-names: "Richard"
  orcid: "https://orcid.org/0000-0002-4284-4063"
- family-names: "Meijsen"
  given-names: "Joeri"
  orcid: "https://orcid.org/0000-0002-4161-2199"
- family-names: "Schork"
  given-names: "Andrew Joseph"
  orcid: "https://orcid.org/0000-0003-4164-9335"
title: "Cleansumstats: Converting GWAS sumstats to a common format to facilitate downstream applications"
version: 1.5.4
date-released: 2022-12-15
url: "https://github.com/BioPsyk/cleansumstats"

GitHub Events

Total
  • Issues event: 11
  • Watch event: 6
  • Delete event: 9
  • Issue comment event: 6
  • Push event: 47
  • Pull request event: 24
  • Fork event: 1
  • Create event: 9
Last Year
  • Issues event: 11
  • Watch event: 6
  • Delete event: 9
  • Issue comment event: 6
  • Push event: 47
  • Pull request event: 24
  • Fork event: 1
  • Create event: 9

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 10
  • Average time to close issues: 12 days
  • Average time to close pull requests: less than a minute
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.4
  • Average comments per pull request: 0.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 10
  • Average time to close issues: 12 days
  • Average time to close pull requests: less than a minute
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 0.4
  • Average comments per pull request: 0.0
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • pappewaio (13)
  • ofrei (2)
  • Hugolyu (1)
  • gl-q1an (1)
Pull Request Authors
  • pappewaio (29)
Top Labels
Issue Labels
feature (1) bugfix (1)
Pull Request Labels

Dependencies

docker/Dockerfile docker
  • adoptopenjdk@sha256 477d0c53aca999692d2432e529af1f7abd715205fcfc36534ac9ff490f4da0e8 build
  • gradle@sha256 7e07e513b83e6a7790f0cb30820f4142b96ab7ceaac67865965b2127734c2c3d build
  • rust@sha256 f72949bcf1daf8954c0e0ed8b7e10ac4c641608f6aa5f0ef7c172c49f35bd9b5 build