ggfastman

Fast Manhattan plots using ggplot2

https://github.com/roman-tremmel/ggfastman

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 3 DOI reference(s) in README
✓ Academic publication links: Links to zenodo.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.2%) to scientific vocabulary

Keywords

fast ggplot2 gwas manhattan manhattan-plot plot plotting-pvalues pvalues qqplot r snp


Repository

Fast Manhattan plots using ggplot2

Basic Info

Host: GitHub
Owner: roman-tremmel
License: gpl-3.0
Language: R
Default Branch: master
Homepage:
Size: 2.54 MB

Statistics

Stars: 9
Watchers: 2
Forks: 1
Open Issues: 0
Releases: 2

Topics

fast ggplot2 gwas manhattan manhattan-plot plot plotting-pvalues pvalues qqplot r snp

Created over 5 years ago · Last pushed over 2 years ago

Metadata Files

Readme License Citation

Introduction

This is a very fast and easily customizable plotting function for GWAS results, e.g. p-values. Since I use ggplot2 a lot, I adopted the idea from a very nice project and combined it with the super fast plotting approach of the scattermore project.

A Manhattan plot displays chromosomal positions against the results (mostly -log10 p-values) of genome-wide association studies between single nucleotide variants (SNV) or polymorphisms (SNP) and an endpoint, e.g. expression, enzyme activity or case-control data.
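To make the two axes concrete, the following base R sketch computes the quantities a Manhattan plot shows: -log10(p) for the y-axis and a genome-wide cumulative position for the x-axis. The toy data and the helper name `cumulative_position()` are assumptions for illustration only; ggfastman derives positions from the genome build and may do so differently.

```r
# Toy GWAS results; the values are made up for illustration.
toy <- data.frame(
  chr    = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  pos    = c(100, 500, 50, 300, 200),
  pvalue = c(0.5, 1e-8, 0.01, 1e-3, 0.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: shift each chromosome by the summed lengths of the
# chromosomes before it (length approximated by the largest observed position).
cumulative_position <- function(chr, pos) {
  chr_levels <- unique(chr)                          # keep input order
  chr_len    <- tapply(pos, factor(chr, levels = chr_levels), max)
  offset     <- c(0, cumsum(as.numeric(chr_len))[-length(chr_len)])
  names(offset) <- chr_levels
  unname(pos + offset[chr])
}

toy$x <- cumulative_position(toy$chr, toy$pos)       # x-axis
toy$y <- -log10(toy$pvalue)                          # y-axis
toy
```

With these two columns, any scatter plot of `y` against `x` is already a bare-bones Manhattan plot.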

One of the first R packages offering Manhattan as well as QQ plots was qqman by Stephen Turner, and nowadays many different packages and approaches are available for R and Python. But a very fast one, which stays fast when plotting billions of data points, was missing.

The ggfastman package tries to fill this gap.

Installation

So far the package has been tested on Windows and macOS. It is not on CRAN, so you need to install it from GitHub:

devtools::install_github("roman-tremmel/ggfastman", upgrade = "never")

The package depends on ggplot2, scattermore and a few other packages. If there are problems, try to install at least scattermore manually using:

devtools::install_github('exaexa/scattermore', dependencies = FALSE, force = TRUE, upgrade = "never")

Citation

You can cite the package using

Usage

The normal one

As an example, you can load some data that is included in the package and run the following code. More information on the data set is provided here.

```{r}
library(ggfastman)
data("gwas_data")
head(gwas_data)
```

The data must contain the following three required columns:

chr
pos
pvalue

The chr column should be in the format c("chr1", "chr2", "chr3", "chrX"...), the pos column must be a numeric vector of base pair positions, and the pvalue column contains the p-values.
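If your results use different column names or plain chromosome numbers, they need to be brought into this shape first. The following base R sketch shows one way to do that. The helper `check_gwas_input()`, the toy column names and the coding of chromosome 23 as X are assumptions for illustration and are not part of ggfastman.

```r
# Hypothetical helper (not part of ggfastman): check the three required columns.
check_gwas_input <- function(df) {
  required <- c("chr", "pos", "pvalue")
  absent   <- setdiff(required, names(df))
  if (length(absent) > 0) {
    stop("missing column(s): ", paste(absent, collapse = ", "))
  }
  stopifnot(
    is.numeric(df$pos),
    is.numeric(df$pvalue),
    # Assumption: p-values lie in (0, 1]; a p-value of 0 is infinite on the -log10 scale.
    all(df$pvalue > 0 & df$pvalue <= 1, na.rm = TRUE)
  )
  invisible(TRUE)
}

# Made-up results that use plain chromosome numbers and other column names.
raw <- data.frame(
  chromosome = c(1, 1, 23),
  bp         = c(1500, 98000, 42),
  p          = c(0.03, 2e-9, 0.7)
)

# Rename the columns and add the "chr" prefix.
gwas_input <- data.frame(
  chr    = paste0("chr", ifelse(raw$chromosome == 23, "X", raw$chromosome)),
  pos    = raw$bp,
  pvalue = raw$p,
  stringsAsFactors = FALSE
)

check_gwas_input(gwas_input)
gwas_input$chr
```

Calling `check_gwas_input(raw)` instead stops with an error that names the missing columns.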

We can plot the manhattan figure with the speed option "slow" using only ggplot2 functions as follows.

```{r}
fast_manhattan(gwas_data, build='hg18', speed = "slow")
```

The fast way

Depending on your system this takes a while, particularly when plotting pvalues of more than 1,000,000 SNVs. Therefore, we replace the geom_point() function with the scattermore::geom_scattermore() function by calling the manhattan function using the "fast" option.

```{r}
fast_manhattan(gwas_data, build='hg18', speed = "fast")
# or
fast_manhattan(gwas_data, build='hg18', speed = "f")
```

Zooooom, that was fast, right? How does it work? For the full explanation I refer to the `scattermore` package. In short, the speed is achieved with some C code, rasterization and some magic.
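The rasterization part can be illustrated in base R: instead of drawing one glyph per SNP, the points are binned into a fixed pixel grid, so the drawing cost depends on the number of pixels rather than on the number of points. This is only a conceptual sketch with simulated data and a hypothetical helper `rasterize_points()`; it is not scattermore's implementation.

```r
set.seed(1)
n <- 1e6
x <- runif(n)                 # stand-in for genomic position
y <- -log10(runif(n))         # stand-in for -log10(pvalue)

# Hypothetical helper: count points per pixel of a width x height grid.
rasterize_points <- function(x, y, width = 512, height = 512) {
  # Map coordinates to integer pixel indices in 1..width and 1..height.
  col <- findInterval(x, seq(min(x), max(x), length.out = width + 1),
                      rightmost.closed = TRUE)
  row <- findInterval(y, seq(min(y), max(y), length.out = height + 1),
                      rightmost.closed = TRUE)
  # Column-major linear index into a height x width matrix, then count hits per cell.
  counts <- tabulate((col - 1) * height + row, nbins = width * height)
  matrix(counts, nrow = height, ncol = width)
}

img <- rasterize_points(x, y)
dim(img)          # one cell per pixel, regardless of n
sum(img) == n     # every point landed in exactly one pixel
```

The resulting matrix can be drawn in a single call, for example with `image(t(img > 0))`, which is why the cost no longer grows with every additional point.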

Of course you can increase the point size and the resolution at the cost of some speed.

```{r}
fast_manhattan(gwas_data, build='hg18', speed = "fast", pointsize = 3, pixels = c(1000, 1000))
```

The insane way

The fastest option is speed = "ultrafast". The price is that the data is plotted in pure black only, but I think it is worth it. Benchmarks are analysed below.

```{r}
# some big data file with >10^6 rows
big_gwas_data <- do.call(rbind, replicate(15, gwas_data, simplify = FALSE))
fast_manhattan(big_gwas_data, build='hg18', speed = "ultrafast")

# compare with
fast_manhattan(big_gwas_data, build='hg18', speed = "fast")

# do not compare with this one, unless you want to wait some minutes
fast_manhattan(big_gwas_data, build='hg18', speed = "slow")
```

Individualization

Of course you can customize the plot using standard ggplot2 functions.

xy-scales

```{r}
fast_manhattan(gwas_data, build='hg18', speed = "fast", y_scale = F) + ggplot2::ylim(2, 10)
# Of note, set `y_scale = F` to avoid the error of a second y-scale.

# distinct chromosomes on x-axis
fast_manhattan(gwas_data[gwas_data$chr %in% c("chr1", "chr10", "chr22"),], build='hg18', speed = "fast")
```

color

Add color globally or highlight individual SNPs only. Note that this also works for shape in "slow" mode.

```{r}
gwas_data2 <- gwas_data
gwas_data2$color <- as.character(factor(gwas_data$chr, labels = 1:22))
fast_manhattan(gwas_data2, build = "hg18", speed = "fast")
```

man 1

Highlight only some SNPs

```{r}
gwas_data2$color <- NA
gwas_data2[gwas_data2$pvalue < 1e-5, ]$color <- "red"
fast_manhattan(gwas_data2, build = "hg18", speed = "fast")
```

man 2

add significance line(s) and snp annotation(s)

```{r}
library(tidyverse)
library(ggrepel)
fast_manhattan(gwas_data, build='hg18', speed = "fast", color1 = "pink", color2 = "turquoise",
               pointsize = 3, pixels = c(1000, 500)) +
  geom_hline(yintercept = -log10(5e-08), linetype = 2, color = "darkgrey") + # genomewide significance line
  geom_hline(yintercept = -log10(1e-5), linetype = 2, color = "grey") +      # suggestive significance line
  ggrepel::geom_text_repel(data = . %>% group_by(chr) %>%  # ggrepel to avoid overplotting
                             top_n(1, -pvalue) %>%          # extract highest y values
                             slice(1) %>%                   # if there are ties, choose the first one
                             filter(pvalue <= 5e-08),       # filter for significant ones
                           aes(label = rsid), color = 1)    # add top rsid
```

Resulting manhattan plot
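The two thresholds and the label selection used above can be checked without plotting. The sketch below uses base R only; 5e-08 and 1e-5 are the thresholds from the example, while the toy p-values and rsids are made up for illustration.

```r
# Thresholds on the -log10 scale, as passed to geom_hline() above.
genomewide <- -log10(5e-08)   # about 7.3
suggestive <- -log10(1e-5)    # 5

toy <- data.frame(
  chr    = c("chr1", "chr1", "chr2", "chr2", "chr3"),
  rsid   = c("rs1", "rs2", "rs3", "rs4", "rs5"),      # made-up identifiers
  pvalue = c(3e-9, 4e-6, 2e-10, 1e-12, 0.2),
  stringsAsFactors = FALSE
)

# Smallest p-value per chromosome; after sorting, the first row per chr is the top hit.
sorted <- toy[order(toy$chr, toy$pvalue), ]
top    <- sorted[!duplicated(sorted$chr), ]

# Keep only the top SNPs that pass the genome-wide line; these would get a label.
labelled <- top[top$pvalue <= 5e-08, ]
labelled$rsid
```

Here chr3 has a top SNP as well, but it does not reach genome-wide significance and is therefore not labelled.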

Faceting

```{r}
library(tidyverse)
gwas_data %>%
  mutate(gr = "Study 1") %>%
  # rbind a second study
  bind_rows(., mutate(., gr = "Study 2", pvalue = runif(n()))) %>%
  fast_manhattan(., build = "hg18", speed = "fast", pointsize = 2.1, pixels = c(1000, 500)) +
  geom_hline(yintercept = -log10(5e-08), linetype = 2, color = "deeppink") +
  geom_hline(yintercept = -log10(1e-5), linetype = 2, color = "grey") +
  facet_wrap(~gr, nrow = 2, scales = "free_y") +
  theme_bw(base_size = 16) +
  theme(panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())
```

Resulting manhattan plot

Zoom using ggforce

```{r}
fast_manhattan(gwas_data, build = "hg18", speed = "fast", pointsize = 3.2, pixels = c(1000, 500)) +
  geom_hline(yintercept = -log10(5e-08), linetype = 2, color = "deeppink") +
  geom_hline(yintercept = -log10(1e-5), linetype = 2, color = "grey") +
  ggforce::facet_zoom(x = chr == "chr9", zoom.size = 1)
```

Resulting manhattan plot

The package can also create locus plots with linkage data. Note that you have to register here to get your token.

```{r}
fast_locusplot(gwas_data, token = "replace with your token", show_MAF = TRUE, show_regulom = TRUE)
```

locus plot

In addition, the package includes a fast way to create QQ-plots:

```{r}
fast_qq(pvalue = runif(10^6), speed = "fast")
```

Resulting manhattan plot
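For readers unfamiliar with QQ-plots of p-values: the sorted observed -log10 p-values are plotted against the values expected if the p-values were uniformly distributed. A minimal base R sketch follows; the expected quantiles `(i - 0.5) / n` are one common convention and an assumption here, not necessarily what `fast_qq()` uses internally.

```r
set.seed(42)
pvalue <- runif(1e5)                        # simulated p-values under the null

n        <- length(pvalue)
observed <- -log10(sort(pvalue))            # smallest p-value first
expected <- -log10((seq_len(n) - 0.5) / n)  # uniform quantiles, same order

# Under the null the points lie close to the identity line.
plot(expected, observed, pch = ".",
     xlab = "Expected -log10(p)", ylab = "Observed -log10(p)")
abline(0, 1, col = "red")
```

Systematic deviation of the points above the red line over the whole range would hint at inflated test statistics rather than true associations.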

Benchmarks

The benchmark analysis includes all steps of generating a plot: evaluating the code, plotting, and saving a .png file using png() for base R plots and ggsave() for ggplot figures. For a better comparison, the same parameters were chosen for both approaches, e.g. width = 270, height = 100 and units = "mm" as well as res = 300 and dpi = 300, respectively. We compared the three speed options included in this package with the fastman::fastman() and qqman::manhattan() functions using bench::mark() with a minimum of 10 iterations. The complete code can be found here: benchmark_plot

The first comparison was performed using the example GWAS data of approx. 80k p-values/rows. As illustrated below, all three speed options were significantly faster than the two base R functions, although in terms of user experience the "slow" option performed rather similarly to the base R functions.

```{r}
gwas_data$chrom <- as.numeric(gsub("chr", "", gwas_data$chr))
res_small_manhattan <- bench_plot(gwas_data)
plot_bench(res_small_manhattan)
```

speed1

In the next step we created Manhattan plots of really big data with more than nine million data points by replicating the example data 120 times. The benchmark ran on a system with an i7-9700 CPU (3 GHz) and 32 GB RAM.

```{r}
big_gwas_data <- do.call(rbind, replicate(120, gwas_data, simplify = FALSE))
nrow(big_gwas_data)
# 9495360
res_big_manhattan <- bench_plot(big_gwas_data)
```

There were again significant differences between the three analysed methods. Interestingly, the fastman function performed very well. Its speed is achieved by cropping the data in the non-significant p-value ranges, e.g. using only 20k p-values for p > 0.1, for 0.01 < p < 0.1, ... Nevertheless, the perceived performance in the RStudio plotting window is slower than with the "fast" option. But if you want to stick to base R, the fastman package seems to be the choice for fast plotting of more than 9x10^6 p-values.

speed2
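The cropping idea described above can be sketched in base R: p-values are grouped into bins, bins of non-significant values are randomly thinned to a fixed maximum size, and small p-values are all kept. The bin edges, the cap of 20,000 per bin and the helper name `thin_pvalues()` are assumptions taken loosely from the description; this is not fastman's code.

```r
set.seed(7)
pvalue <- runif(1e6)

# Hypothetical helper: keep every p-value up to min(breaks) and at most
# `cap` randomly chosen p-values in each bin above it.
thin_pvalues <- function(p, breaks = c(0.01, 0.1, 1), cap = 20000) {
  idx  <- seq_along(p)
  keep <- idx[p <= min(breaks)]              # never thin small p-values
  bin  <- cut(p, breaks = breaks)            # bins (0.01,0.1] and (0.1,1]
  for (b in levels(bin)) {
    in_bin <- idx[!is.na(bin) & bin == b]
    if (length(in_bin) > cap) in_bin <- sample(in_bin, cap)
    keep <- c(keep, in_bin)
  }
  sort(keep)                                 # row indices to plot
}

kept <- thin_pvalues(pvalue)
length(kept)                                 # far fewer points to draw
all(which(pvalue <= 0.01) %in% kept)         # all small p-values survive
```

The trade-off is that the dense bottom of the plot is only a random sample, whereas the rasterized "fast" option draws every point.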

Questions and Bugs

Please report bugs by opening a GitHub issue here.

Owner

Name: Roman
Login: roman-tremmel
Kind: user

Website: www.romantremmel.de
Repositories: 3
Profile: https://github.com/roman-tremmel

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this R package, please cite it as below."
authors:
- family-names: "Tremmel"
  given-names: "Roman"
  orcid: "https://orcid.org/0000-0003-1564-0433"
title: "ggfastman"
version: 1.2.0
doi: 10.5281/zenodo.1234
date-released: 2021-06-18
url: "https://github.com/roman-tremmel/ggfastman"

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

DESCRIPTION cran

R >= 3.5.0 depends
GenomicRanges * imports
Homo.sapiens * imports
LDlinkR * imports
ggbio * imports
ggplot2 >= 2.2.1 imports
ggrepel * imports
scattermore * imports
