QuantNorm

Mitigating the adverse impact of batch effects in sample pattern detection

https://github.com/tengfei-emory/quantnorm

Science Score: 23.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
    Found 7 DOI reference(s) in README
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.7%) to scientific vocabulary

Keywords

batch-effects
Last synced: 6 months ago · JSON representation

Repository

Mitigating the adverse impact of batch effects in sample pattern detection

Basic Info
Statistics
  • Stars: 9
  • Watchers: 0
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Topics
batch-effects
Created over 8 years ago · Last pushed over 6 years ago
Metadata Files
Readme License

README.md

QuantNorm

Mitigating the adverse impact of batch effects in sample pattern detection

CRAN Status Badge Downloads badge Total downloads

QuantNorm modifies the distance matrix obtained from data with batch effects, so as to improve the performance of sample pattern detection, such as clustering, dimension reduction, and construction of networks between subjects. The method has been published in Bioinformatics (Fei et al, 2018, https://doi.org/10.1093/bioinformatics/bty117).

QuantNorm: Vectorization approach

Vectorization

QuantNorm: Row/column iterative approach

Row/Column

Study design requirement and running speed

Our method can be applied on data sets with roughly balanced study design, which means the cell populations should be relatively evenly distributed among batches. For data sets with sample size larger than 500, we recommend setting a maximum iteration number (for example max = 10) to save time. We expect up to an hour for the algorithm to complete even if a maximum iteration number is set.

Installation Guide

{r} library(devtools) install_github('tengfei-emory/QuantNorm') library(QuantNorm) Currently QuantNorm supports R version >= 3.3.0.

Preprocessing

Before putting the data in QuantNorm, please feel free to conduct preprocessing, such as cell filtering, log-transformation, or standardization (zero mean and one standard deviation). Currently, QuantNorm only features simple built-in preprocessing task, such as log- transformation and standardization.

The choice of preprocessing will probably affect the result of running QuantNorm. Since the behavior is pretty much dataset-specific, it is good to try different preprocessing methods.

Example - ENCODE Data for Human and Mouse Tissues

The dataset used in this example is reproduced according to Gilad et al (2015). The data contains the log-normalized counts for 13 different tissues for human and their counterparts in mouse. Our algorithm obtains clusters by tissues. The dataset can be loaded by data(ENCODE).

```{r} library(pheatmap) #drawing heatmap data("ENCODE")

Before correction, the subjects are clustered by species

pheatmap(cor(ENCODE))

Assigning the batches based on species

batches <- c(rep(1,13),rep(2,13))

QuantNorm correction

corrected.distance.matrix <- QuantNorm(ENCODE,batches,method='row/column', cor_method='pearson', logdat=F,standardize = T,tol=1e-4) pheatmap(1-corrected.distance.matrix) ``` QuantNorm (left) vs ComBat (right):

Heatmaps

Incorporating with the SC3 method

As mentioned in our paper, our method can improve the performance of a current powerful clustering method, Single-Cell Consensus Clustering (SC3, Kiselev VY et al, 2017). The following code shows how we can plug in the corrected distance matrix to the SC3 algorithm in R. For more detailed tutorial about SC3, please refer to this page by Vladimir Kiselev.

The following example is based on the single-cell RNA-seq data of mouse neuron (Usoskin et al, 2015). The SingleCellExperiment object 'usoskin.rds' is obtained from Hemberg-lab's RNA-seq dataset website.

```{r} library(SC3) library(scater) library(SingleCellExperiment) library(QuantNorm) library(Biobase) library(mclust)

Import the SCE object data

scenet<-readRDS(url('https://scrnaseq-public-datasets.s3.amazonaws.com/scater-objects/usoskin.rds'))

This is the true cell type

cell.type <- scenet@colData$cell_type1

This is the expression matrix

dat.sc3 <- exprs(scenet)

This is batch information

batch <- scenet@colData$Library ```

For the expressoin matrix dat.sc3, we can obtain the 2 corrected distance matrix from QuantNorm, one is based on spearman correlation and one is based on pearson correlation: (In order to save time, I set the maximum iteration as 5. This can be adjusted)

```{r} pearson.cor <- QuantNorm(dat.sc3[rowSums(dat.sc3 != 0) > 622/4,],batch=as.numeric(as.factor(batch)),cor_method='pearson',logdat=F,max=5)

spearman.cor <- QuantNorm(dat.sc3[rowSums(dat.sc3 != 0) > 622/4,],batch=as.numeric(as.factor(batch)),cor_method='spearman',logdat=F,max=5) ``` Then we could use SC3 in the following way (SC3 codes are borrowed from SC3 bioconductor manual). Since SC3 is not a deterministic method, the resulting ARI index can vary, as shown in the supplementary fig. S5 in Fei et al, 2018 paper.

```{r}

Run the SC3 algorithm step by step

sce <- sc3prepare(scenet) sce <- sc3estimatek(sce) sce <- sc3calc_dists(sce)

Replacing the original pearson and spearman distance matrices by the corrected ones.

sce@metadata$sc3$distances$spearman <- spearman.cor sce@metadata$sc3$distances$pearson <- pearson.cor

Continue finishing the standard steps

sce <- sc3calctransfs(sce) sce <- sc3kmeans(sce, ks = 4) sce <- sc3calc_consens(sce)

Check clustering results and ARI

library(mclust) svmlabels <- colData(sce)$sc34clusters adjustedRandIndex(cell.type, svmlabels)

Visualization

sc3_interactive(sce) ```

References

Fei, Teng, et al. "Mitigating the adverse impact of batch effects in sample pattern detection", Bioinformatics, epub ahead of printing (2018).

Gilad, Yoav, and Orna Mizrahi-Man. "A reanalysis of mouse ENCODE comparative gene expression data." F1000Research 4 (2015).

Hemberg lab's RNA-seq dataset page https://hemberg-lab.github.io/scRNA.seq.datasets/.

Johnson, W. Evan, Cheng Li, and Ariel Rabinovic. "Adjusting batch effects in microarray expression data using empirical Bayes methods." Biostatistics 8.1 (2007): 118-127.

Kiselev, Vladimir Yu, et al. "SC3: consensus clustering of single-cell RNA-seq data." Nature methods 14.5 (2017): 483.

Usoskin, D. et al. Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing. Nat. Neurosci. 18, 145–153 (2015)

Owner

  • Name: Teng Fei
  • Login: tengfei-emory
  • Kind: user
  • Company: MSKCC

Biostatistician at MSKCC. Latent class analysis; microbiome data; competing risks; longitudinal data; batch effect adjustment

GitHub Events

Total
Last Year

Committers

Last synced: over 2 years ago

All Time
  • Total Commits: 124
  • Total Committers: 1
  • Avg Commits per committer: 124.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
tengfei-emory t****i@e****u 124
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 6 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 2.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lmanchon (1)
  • yuisato (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • cran 223 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
cran.r-project.org: QuantNorm

Mitigating the Adverse Impact of Batch Effects in Sample Pattern Detection

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 223 Last month
Rankings
Stargazers count: 17.0%
Forks count: 17.8%
Dependent packages count: 29.8%
Dependent repos count: 35.5%
Average: 35.5%
Downloads: 77.4%
Maintainers (1)
Last synced: 6 months ago

Dependencies

DESCRIPTION cran
  • R >= 3.3.0 depends
  • stats * imports
  • GGally * suggests
  • ggplot2 * suggests
  • network * suggests
  • pheatmap * suggests
  • rgl * suggests
  • sna * suggests