glmGamPoi

Fit Gamma-Poisson Generalized Linear Models Reliably

https://github.com/const-ae/glmgampoi

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: ncbi.nlm.nih.gov
  • Committers with academic emails
    1 of 8 committers (12.5%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.3%) to scientific vocabulary

Keywords

gamma-poisson glm negative-binomial-regression on-disk r-package regression

Keywords from Contributors

bioconductor transcriptomics scrna-seq gene immune-repertoire bioconductor-package bioinformatics clusters sequences proteomics
Last synced: 6 months ago · JSON representation

Repository

Fit Gamma-Poisson Generalized Linear Models Reliably

Basic Info
  • Host: GitHub
  • Owner: const-ae
  • Language: R
  • Default Branch: devel
  • Homepage:
  • Size: 2.48 MB
Statistics
  • Stars: 115
  • Watchers: 6
  • Forks: 17
  • Open Issues: 19
  • Releases: 0
Topics
gamma-poisson glm negative-binomial-regression on-disk r-package regression
Created about 6 years ago · Last pushed 11 months ago
Metadata Files
Readme Changelog

README.Rmd

---
output: github_document
---



```{r init_chunk, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-"
  # out.width = "100%"
)
set.seed(2)
```

# glmGamPoi 


[![codecov](https://codecov.io/gh/const-ae/glmGamPoi/branch/master/graph/badge.svg)](https://codecov.io/gh/const-ae/glmGamPoi)


> Fit Gamma-Poisson Generalized Linear Models Reliably.

Pronounciation: [`dʒi əl əm ɡam ˈpwɑ`](http://ipa-reader.xyz/?text=d%CA%92i%20%C9%99l%20%C9%99m%20%C9%A1am%20%CB%88pw%C9%91)

The core design aims of `glmGamPoi` are:

* Fit Gamma-Poisson models on arbitrarily large or small datasets
* Be faster than alternative methods, such as `DESeq2` or `edgeR`
* Calculate exact or approximate results based on user preference
* Support in-memory or on-disk data
* Follow established conventions around tools for RNA-seq analysis
* Present a simple user-interface
* Avoid unnecessary dependencies
* Make integration into other tools easy


# Installation

You can install the release version of *[glmGamPoi](https://bioconductor.org/packages/glmGamPoi)* from BioConductor:

``` r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("glmGamPoi")
```

For the latest developments, see the `r BiocStyle::Githubpkg("const-ae/glmGamPoi", "GitHub")` repo.

If you use this package in a scientific publication, please cite:

> glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data  
> Constantin Ahlmann-Eltze, Wolfgang Huber  
> Bioinformatics; 2020-12-09; doi: https://doi.org/10.1093/bioinformatics/btaa1009

# Example

Load the glmGamPoi package

```{r load_glmGamPoi}
library(glmGamPoi)
```


To fit a single Gamma-Poisson GLM do:

```{r single_gp_fit}
# overdispersion = 1/size
counts <- rnbinom(n = 10, mu = 5, size = 1/0.7)

# design = ~ 1 means that an intercept-only model is fit
fit <- glm_gp(counts, design = ~ 1)
fit

# Internally fit is just a list:
as.list(fit)[1:2]
```

The `glm_gp()` function returns a list with the results of the fit. Most importantly, it contains the estimates for the coefficients β and the overdispersion.

Fitting repeated Gamma-Poisson GLMs for each gene of a single cell dataset is just as easy:

I will first load an example dataset using the `TENxPBMCData` package. The dataset has 33,000 genes and 4340 cells. It takes roughly 1.5 minutes to fit the Gamma-Poisson model on the full dataset. For demonstration purposes, I will subset the dataset to 300 genes, but keep the 4340 cells:
```{r load_additional_packages, warning=FALSE, message = FALSE}
library(SummarizedExperiment)
library(DelayedMatrixStats)
```

```{r load_pbmc_data}
# The full dataset with 33,000 genes and 4340 cells
# The first time this is run, it will download the data
pbmcs <- TENxPBMCData::TENxPBMCData("pbmc4k")

# I want genes where at least some counts are non-zero
non_empty_rows <- which(rowSums2(assay(pbmcs)) > 0)
pbmcs_subset <- pbmcs[sample(non_empty_rows, 300), ]
pbmcs_subset
```

I call `glm_gp()` to fit one GLM model for each gene and force the calculation to happen in memory.

```{r simple_fit}
fit <- glm_gp(pbmcs_subset, on_disk = FALSE)
summary(fit)
```



# Benchmark

I compare my method (in-memory and on-disk) with `r BiocStyle::Biocpkg("DESeq2")` and `r BiocStyle::Biocpkg("edgeR")`. Both are classical methods for analyzing RNA-Seq datasets and have been around for almost 10 years. Note that both tools can do a lot more than just fitting the Gamma-Poisson model, so this benchmark only serves to give a general impression of the performance.



```{r run_bench_mark, warning=FALSE}
# Explicitly realize count matrix in memory so that it is a fair comparison
pbmcs_subset <- as.matrix(assay(pbmcs_subset))
model_matrix <- matrix(1, nrow = ncol(pbmcs_subset))


bench::mark(
  glmGamPoi_in_memory = {
    glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)
  }, glmGamPoi_on_disk = {
    glm_gp(pbmcs_subset, design = model_matrix, on_disk = TRUE)
  }, DESeq2 = suppressMessages({
    dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset,
                        colData = data.frame(name = seq_len(4340)),
                        design = ~ 1)
    dds <- DESeq2::estimateSizeFactors(dds, "poscounts")
    dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
    dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)
  }), edgeR = {
    edgeR_data <- edgeR::DGEList(pbmcs_subset)
    edgeR_data <- edgeR::calcNormFactors(edgeR_data)
    edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
    edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
  }, check = FALSE, min_iterations = 3
)
```

On this dataset, `glmGamPoi` is more than 5 times faster than `edgeR` and more than 18 times faster than `DESeq2`. `glmGamPoi` does **not** use approximations to achieve this performance increase. The performance comes from an optimized algorithm for inferring the overdispersion for each gene. It is tuned for datasets typically encountered in single RNA-seq with many samples and many small counts, by avoiding duplicate calculations.

To demonstrate that the method does not sacrifice accuracy, I compare the parameters that each method estimates. The means and β coefficients are identical, but that the overdispersion estimates from `glmGamPoi` are more reliable:

```{r compare_with_deseq2_and_edger, message=FALSE, warning=FALSE}
# Results with my method
fit <- glm_gp(pbmcs_subset, design = model_matrix, on_disk = FALSE)

# DESeq2
dds <- DESeq2::DESeqDataSetFromMatrix(pbmcs_subset, 
                        colData = data.frame(name = seq_len(4340)),
                        design = ~ 1)
sizeFactors(dds)  <- fit$size_factors
dds <- DESeq2::estimateDispersions(dds, quiet = TRUE)
dds <- DESeq2::nbinomWaldTest(dds, minmu = 1e-6)

#edgeR
edgeR_data <- edgeR::DGEList(pbmcs_subset, lib.size = fit$size_factors)
edgeR_data <- edgeR::estimateDisp(edgeR_data, model_matrix)
edgeR_fit <- edgeR::glmFit(edgeR_data, design = model_matrix)
```


```{r coefficientComparison, fig.height=5, fig.width=10, warning=FALSE, echo = FALSE}
par(mfrow = c(2, 4), cex.main = 2, cex.lab = 1.5)
plot(fit$Beta[,1], coef(dds)[,1] / log2(exp(1)), pch = 16, 
     main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$Beta[,1], edgeR_fit$unshrunk.coefficients[,1], pch = 16,
     main = "Beta Coefficients", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)

plot(fit$Mu[,1], assay(dds, "mu")[,1], pch = 16, log="xy",
     main = "Gene Mean", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$Mu[,1], edgeR_fit$fitted.values[,1], pch = 16, log="xy",
     main = "Gene Mean", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)

plot(fit$overdispersions, rowData(dds)$dispGeneEst, pch = 16, log="xy",
     main = "Overdispersion", xlab = "glmGamPoi", ylab = "DESeq2")
abline(0,1)
plot(fit$overdispersions, edgeR_fit$dispersion, pch = 16, log="xy",
     main = "Overdispersion", xlab = "glmGamPoi", ylab = "edgeR")
abline(0,1)

```

I am comparing the gene-wise estimates of the coefficients from all three methods. Points on the diagonal line are identical. The inferred Beta coefficients and gene means agree well between the methods, however the overdispersion differs quite a bit. `DESeq2` has problems estimating most of the overdispersions and sets them to `1e-8`. `edgeR` only approximates the overdispersions which explains the variation around the overdispersions calculated with `glmGamPoi`. 


## Scalability

The method scales linearly, with the number of rows and columns in the dataset. For example: fitting the full `pbmc4k` dataset with subsampling on a modern MacBook Pro in-memory takes ~1 minute and on-disk a little over 4 minutes. Fitting the `pbmc68k` (17x the size) takes ~73 minutes (17x the time) on-disk.


## Differential expression analysis

`glmGamPoi` provides an interface to do quasi-likelihood ratio testing to identify differentially expressed genes. To demonstrate this feature, we will use the data from [Kang _et al._ (2018)](https://www.ncbi.nlm.nih.gov/pubmed/29227470) provided by the `MuscData` package. This is a single cell dataset of 8 Lupus patients for which 10x droplet-based scRNA-seq was performed before and after treatment with interferon beta. The `SingleCellExperiment` object conveniently provides the patient id (`ind`), treatment status (`stim`) and cell type (`cell`):

```{r load_kang_data}
sce <- muscData::Kang18_8vs8()
colData(sce)
```

For demonstration purpose, I will work on a subset of the genes and cells:

```{r subset_kang_data}
set.seed(1)
# Take highly expressed genes and proper cells:
sce_subset <- sce[rowSums(counts(sce)) > 100, 
                  sample(which(sce$multiplets == "singlet" & 
                              ! is.na(sce$cell) &
                              sce$cell %in% c("CD4 T cells", "B cells", "NK cells")), 
                         1000)]
# Convert counts to dense matrix
counts(sce_subset) <- as.matrix(counts(sce_subset))
# Remove empty levels because glm_gp() will complain otherwise
sce_subset$cell <- droplevels(sce_subset$cell)
```

In the first step we will aggregate the counts of each patient, condition and cell type and form pseudobulk samples. This ensures that I get reliable p-value by treating each patient as a replicate and not each cell.

```{r}
sce_reduced <- pseudobulk(sce_subset, group_by = vars(ind, stim, cell))
```


We will identify which genes in CD4 positive T-cells are changed most by the treatment. We will fit a full model including the interaction term `stim:cell`. The interaction term will help us identify cell type specific responses to the treatment:

```{r kang_fit}
fit <- glm_gp(sce_reduced, design = ~ cell + stim +  stim:cell - 1,
              reference_level = "NK cells")
summary(fit)
```


To see how the coefficient of our model are called, we look at the `colnames(fit$Beta)`:

```{r show_beta_colnames}
colnames(fit$Beta)
```


In our example, we want to find the genes that change specifically in T cells. Finding cell type specific responses to a treatment is a big advantage of single cell data over bulk data. 

```{r do_pseudobulk}
# The contrast argument specifies what we want to compare
# We test the expression difference of stimulated and control T-cells
de_res <- test_de(fit, contrast = cond(cell = "CD4 T cells", stim = "ctrl") - cond(cell = "CD4 T cells", stim = "stim")) 

# Most different genes
head(de_res[order(de_res$pval), ])
```

The test is successful and we identify interesting genes that are differentially expressed in interferon-stimulated T cells:  _IFI6_, _IFIT3_ and _ISG15_ literally stand for _Interferon Induced/Stimulated Protein_.

To get a more complete overview of the results, we can make a volcano plot that compares the log2-fold change (LFC) vs the logarithmized p-values.

```{r make_volcano_plot}
library(ggplot2)
ggplot(de_res, aes(x = lfc, y = -log10(pval))) +
  geom_point(size = 0.6, aes(color = adj_pval < 0.1)) +
  ggtitle("Volcano Plot", "Genes that change most through interferon-beta treatment in T cells")
```


Another important task in single cell data analysis is the identification of marker genes for cell clusters. For this we can also use our Gamma-Poisson fit. 

Let's assume we want to find genes that differ between T cells and the B cells. We can directly compare the corresponding coefficients and find genes that differ in the control condition (this time not accounting for the pseudo-replication structure):

```{r find_marker_genes}
fit_full <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
                   reference_level = "NK cells")
marker_genes <- test_de(fit_full, `cellCD4 T cells` - `cellB cells`, sort_by = pval)
head(marker_genes)
```

If we want find genes that differ in the stimulated condition, we just include the additional coefficients in the contrast:

```{r do_signif_genes}
marker_genes2 <- test_de(fit_full, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
                               (`cellB cells` + `cellB cells:stimstim`), 
                        sort_by = pval)

head(marker_genes2)
```

We identify many genes related to the human leukocyte antigen (HLA) system that is important for antigen presenting cells like B-cells, but are not expressed by T helper cells. The plot below shows the expression differences.

A note of caution: applying `test_de()` to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.

```{r plot_marker_genes}
# Create a data.frame with the expression values, gene names, and cell types
tmp <- data.frame(gene = rep(marker_genes$name[1:6], times = ncol(sce_subset)),
                  expression = c(counts(sce_subset)[marker_genes$name[1:6], ]),
                  celltype = rep(sce_subset$cell, each = 6))

ggplot(tmp, aes(x = celltype, y = expression)) +
  geom_jitter(height = 0.1) +
  stat_summary(geom = "crossbar", fun = "mean", color = "red") +
  facet_wrap(~ gene, scales = "free_y") +
  ggtitle("Marker genes of B vs. T cells")
```



# Acknowlegments

This work was supported by the EMBL International PhD Programme and the European Research Council Synergy grant DECODE under grant agreement No. 810296.




# Session Info

```{r print_sessionInfo}
sessionInfo()
```



Owner

  • Name: Constantin
  • Login: const-ae
  • Kind: user
  • Location: Heidelberg, Germany
  • Company: EMBL

PhD Student, Biostats, R

GitHub Events

Total
  • Issues event: 7
  • Watch event: 10
  • Issue comment event: 22
  • Push event: 7
  • Pull request event: 2
  • Fork event: 4
Last Year
  • Issues event: 7
  • Watch event: 10
  • Issue comment event: 22
  • Push event: 7
  • Pull request event: 2
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 520
  • Total Committers: 8
  • Avg Commits per committer: 65.0
  • Development Distribution Score (DDS): 0.054
Past Year
  • Commits: 30
  • Committers: 5
  • Avg Commits per committer: 6.0
  • Development Distribution Score (DDS): 0.167
Top Committers
Name Email Commits
const-ae a****5@g****m 492
Nitesh Turaga n****a@g****m 10
J Wokaty j****y@s****u 10
Nathan Lubock n****e@o****o 4
LTLA i****s@g****m 1
Jeffrey Pullin j****n@g****m 1
Jack Kamm j****m@g****m 1
Hervé Pagès h****b@g****m 1
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 62
  • Total pull requests: 9
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 5 days
  • Total issue authors: 47
  • Total pull request authors: 6
  • Average comments per issue: 2.89
  • Average comments per pull request: 2.67
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 3
  • Average time to close issues: 14 days
  • Average time to close pull requests: 8 days
  • Issue authors: 4
  • Pull request authors: 2
  • Average comments per issue: 3.25
  • Average comments per pull request: 2.33
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Al-Murphy (4)
  • pchiang5 (4)
  • jan-glx (4)
  • const-ae (3)
  • malcook (2)
  • anglixue (2)
  • nicholaslimwy (2)
  • ceesu (2)
  • ATpoint (1)
  • akhst7 (1)
  • lichtobergo (1)
  • DarwinAwardWinner (1)
  • PedroMilanezAlmeida (1)
  • emdann (1)
  • dg520 (1)
Pull Request Authors
  • hpages (3)
  • jackkamm (2)
  • LTLA (2)
  • nlubock (2)
  • ycl6 (2)
  • jeffreypullin (1)
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • bioconductor 283,899 total
  • Total dependent packages: 5
  • Total dependent repositories: 0
  • Total versions: 8
  • Total maintainers: 1
bioconductor.org: glmGamPoi

Fit a Gamma-Poisson Generalized Linear Model

  • Versions: 8
  • Dependent Packages: 5
  • Dependent Repositories: 0
  • Downloads: 283,899 Total
Rankings
Dependent repos count: 0.0%
Dependent packages count: 0.0%
Average: 4.3%
Downloads: 12.8%
Maintainers (1)
Last synced: 6 months ago