scdblfinder

Methods for detecting doublets in single-cell sequencing data

https://github.com/plger/scdblfinder

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

○
CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
✓
DOI references
Found 4 DOI reference(s) in README
✓
Academic publication links
Links to: arxiv.org
✓
Committers with academic emails
2 of 8 committers (25.0%) from academic institutions
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (15.0%) to scientific vocabulary

Keywords

doublets single-cell

Keywords from Contributors

bioconductor-package bioinformatics hdf5 rhdf5 bioconductor mass-spectrometry image-analysis bioc metabolomics gc-ms

Last synced: 6 months ago · JSON representation

Repository

Methods for detecting doublets in single-cell sequencing data

Basic Info

Host: GitHub
Owner: plger
License: gpl-3.0
Language: R
Default Branch: devel
Homepage: https://plger.github.io/scDblFinder/
Size: 17.3 MB

Statistics

Stars: 204
Watchers: 2
Forks: 18
Open Issues: 6
Releases: 0

Topics

doublets single-cell

Created almost 7 years ago · Last pushed 6 months ago

Metadata Files

Readme License

scDblFinder

The scDblFinder package gathers various methods for the detection and handling of doublets/multiplets in single-cell sequencing data (i.e. multiple cells captured within the same droplet or reaction volume), including the novel scDblFinder method. The methods included here are complementary to doublet detection via cell hashes and SNPs in multiplexed samples: while hashing/genotypes can identify doublets formed by cells of the same type (homotypic doublets) from two samples, which are often nearly indistinguishable from real cells transcriptionally (and hence generally unidentifiable through the present package), they cannot identify doublets made by cells of the same sample, even if they are heterotypic (formed by different cell types). Instead, the methods presented here are primarily geared towards the identification of heterotypic doublets, which for most purposes are also the most critical ones.

For a brief overview of the methods, see the introductory vignette (vignette("introduction", package="scDblFinder")). For the detailed study including comparison with alternative methods, see the paper. Here, we will showcase doublet detection using the fast and comprehensive scDblFinder method.

Important notes/updates

If you are using xgboost version 3 or higher, make sure that you are using scDblFinder version 1.23.2 or later (available from either GitHub or Bioconductor devel).
The scDblFinder version (1.20) initially shipped with Bioconductor 3.20 (current) had a wrong default doublet rate argument. This has been fixed in Bioconductor, but you should update your package.
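
The first note above can be turned into a quick compatibility check. The following is a minimal sketch in base R; the helper name `needs_scdblfinder_update` is hypothetical (it is not part of the package), and the version thresholds are simply those quoted in the note.

```r
# Hypothetical helper (not part of scDblFinder): returns TRUE when xgboost is
# version 3 or higher while scDblFinder is older than 1.23.2.
needs_scdblfinder_update <- function(xgboost_version, scdblfinder_version) {
  package_version(xgboost_version) >= "3.0.0" &&
    package_version(scdblfinder_version) < "1.23.2"
}

# Examples with explicit version strings
needs_scdblfinder_update("3.0.1", "1.20.0")   # recent xgboost, old scDblFinder
needs_scdblfinder_update("1.7.8", "1.20.0")   # older xgboost

# On a real installation, the versions could be read with packageVersion():
# needs_scdblfinder_update(as.character(packageVersion("xgboost")),
#                          as.character(packageVersion("scDblFinder")))
```

The first call returns TRUE (an update is needed) and the second returns FALSE.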

Getting started

Installation

You may install the package using:

    BiocManager::install("scDblFinder")

Or, to get the very latest version:

    BiocManager::install("plger/scDblFinder")

The latest version will not be compatible with older Bioconductor versions.

Note that, when not installing from git, Bioconductor does not install the latest version of packages, but (to ensure compatibility between packages) installs the version tied to your Bioconductor version. To ensure the best results, install the latest Bioconductor release. We recommend against using scDblFinder from Bioconductor versions prior to 3.14, which give suboptimal results. scATAC users will need scDblFinder version 1.13.2 or above.

Finally, the documentation here refers to the latest version. If you are using an earlier Bioconductor release, the most accurate documentation will be that of your version, available either from Bioconductor or from vignette("introduction", package="scDblFinder").

Basic usage

Given an object sce of class SingleCellExperiment (which does not contain any empty drops, but hasn't been further filtered), you can launch the doublet detection with:

    library(scDblFinder)
    sce <- scDblFinder(sce)

This will add a number of columns to the colData of sce, the most important of which are:

sce$scDblFinder.score : the final doublet score (the higher the more likely that the cell is a doublet)
sce$scDblFinder.class : the classification (doublet or singlet)

There are several additional columns containing further information (e.g. the most likely origin of the putative doublet), an overview of which is available in the vignette (vignette("scDblFinder")).
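
As a rough illustration of how these columns might be used, the sketch below runs the method on a small simulated dataset and then keeps only the cells called singlets. It assumes the `mockDoubletSCE()` simulation helper from the package is available in your installed version; the object name `sce_singlets` and the seed are arbitrary choices made for this example.

```r
library(scDblFinder)

set.seed(123)             # artificial doublets are generated randomly
sce <- mockDoubletSCE()   # small simulated dataset (assumed helper, see above)
sce <- scDblFinder(sce)

# Overview of the calls and of the score distribution
table(sce$scDblFinder.class)
summary(sce$scDblFinder.score)

# Keep only the cells classified as singlets for downstream analysis
sce_singlets <- sce[, sce$scDblFinder.class == "singlet"]
```

Whether to remove the called doublets outright or to keep the score as a covariate for later inspection is an analysis decision that depends on the dataset.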

Multiple samples

If you have multiple samples (understood as different cell captures; for multiplexed samples with cell hashes, use the batch instead), then it is preferable to provide scDblFinder with this information in order to take into consideration batch/sample-specific doublet rates. You can do this by simply providing a vector of the sample ids to the samples parameter of scDblFinder or, if these are stored in a column of colData, the name of the column. With default settings, this will result in samples being processed separately, which appears to be faster, more robust to batch effects, and as accurate as training a single model (see the multiSampleMode argument for other options). In such cases, you might also consider multithreading it using the BPPARAM parameter. For example:

    library(BiocParallel)
    sce <- scDblFinder(sce, samples="sample_id", BPPARAM=MulticoreParam(3))
    table(sce$scDblFinder.class)

Cluster-based detection

scDblFinder has two main modes for generating artificial doublets: a random one (clusters=FALSE, now default) and a cluster-based one (clusters=TRUE or providing your own clusters - the approach from previous versions). In practice, we observed that both approaches perform well (and better than alternatives). We suggest using the cluster-based approach when the datasets are segregated into clear clusters, and the random one for the rest (e.g. developmental trajectories).
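
To see how the two modes differ on a given dataset, one option is to run both and cross-tabulate the calls. The following is only a sketch on simulated data, assuming the `mockDoubletSCE()` helper from the package; the object names are arbitrary, and simulated data with well-separated populations says little about how the modes compare on real data.

```r
library(scDblFinder)

set.seed(123)
sce <- mockDoubletSCE()   # small simulated dataset (assumed helper)

# Random artificial doublets (the default described above)
sce_random <- scDblFinder(sce, clusters = FALSE)

# Cluster-based artificial doublets, with clusters computed internally
sce_cluster <- scDblFinder(sce, clusters = TRUE)

# Cross-tabulate the two sets of calls
table(random = sce_random$scDblFinder.class,
      cluster = sce_cluster$scDblFinder.class)
```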

Expected proportion of doublets

The expected proportion of doublets has little impact on the score, but a very strong impact on where the threshold will be placed (the thresholding procedure simultaneously minimizes classification error and departure from the expected doublet rate). It is specified through the dbr parameter and the dbr.sd parameter (the latter specifies the standard deviation of dbr, i.e. the uncertainty in the expected doublet rate). For 10x data, the more cells you capture, the higher the chance of creating a doublet. Chromium documentation indicates a doublet rate of roughly 1% per 1000 cells captured (so with 5000 cells, (0.01*5)*5000 = 250 doublets), and the default expected doublet rate will be set to this value (with a default standard deviation of 0.015). Note, however, that different protocols may create considerably more doublets, and that the expected rate should be updated accordingly. If you are unsure about the doublet rate, set dbr.sd=1 and the thresholding will be entirely based on the misclassification rates.
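
The arithmetic above can be written out in a few lines of base R. This is only an illustration of the rule of thumb quoted in the text, not the package's internal code; the helper name `expected_dbr` is hypothetical.

```r
# Hypothetical helper: expected doublet rate under the rule of thumb of
# roughly 1% doublets per 1000 cells captured.
expected_dbr <- function(n_cells, rate_per_1000 = 0.01) {
  rate_per_1000 * n_cells / 1000
}

n_cells <- 5000
dbr <- expected_dbr(n_cells)   # expected doublet rate for this capture
n_doublets <- dbr * n_cells    # expected number of doublets

# A custom rate could then be passed through the dbr parameter, or the prior
# effectively ignored with dbr.sd = 1 (not run here):
# sce <- scDblFinder(sce, dbr = dbr)
# sce <- scDblFinder(sce, dbr.sd = 1)
```

With 5000 cells this gives a rate of 0.05 and 250 expected doublets, matching the example in the text.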

Single-cell ATACseq

The scDblFinder method can be applied to single-cell ATACseq (on peak-level counts); when doing so, however, we recommend using the aggregateFeatures=TRUE parameter (see vignette).

In addition, the package includes a reimplementation of the Amulet method from Thibodeau et al. (2021). For more information, see the ATAC-related vignette.

Comparison with other tools

scDblFinder was independently evaluated by Nan Miles Xi and Jingyi Jessica Li in the addendum to their excellent benchmark, where they write that "scDblFinder achieves the highest mean AUPRC and AUROC values, and it is also the top method in terms of the precision, recall, and TNR under the 10% identification rate."

The figure below compares some of the methods implemented in this package (in bold) with alternative methods (including the top alternative, DoubletFinder):

[Figure: Benchmark of doublet detection methods]

Figure 1: Accuracy (area under the precision and recall curve) of doublet identification using alternative methods across 16 benchmark datasets from Xi and Li (2020). The colour of the dots indicates the relative ranking for the dataset, while the size and numbers indicate the actual area under the (PR) curve. For each dataset, the top method is circled in black. Methods with names in black are provided in the scDblFinder package. Running times are indicated on the left. On top, the number of cells in each dataset is shown, coloured by the proportion of variance explained by the first two components (relative to that explained by the first 100), as a rough guide to dataset simplicity.

Rather a Python person? You can have a look at vaeda, another doublet-finding method which appears to perform close to scDblFinder. Alternatively, run scDblFinder from the command line.

Owner

Name: Pierre-Luc
Login: plger
Kind: user
Location: Zürich, Switzerland

Repositories: 20
Profile: https://github.com/plger

like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom...

GitHub Events

Total

Issues event: 30
Watch event: 38
Delete event: 1
Issue comment event: 47
Push event: 21
Fork event: 3
Create event: 2

Last Year

Issues event: 30
Watch event: 38
Delete event: 1
Issue comment event: 47
Push event: 21
Fork event: 3
Create event: 2

Committers

Last synced: 9 months ago

All Time

Total Commits: 515
Total Committers: 8
Avg Commits per committer: 64.375
Development Distribution Score (DDS): 0.091

Past Year

Commits: 30
Committers: 3
Avg Commits per committer: 10.0
Development Distribution Score (DDS): 0.1

Top Committers

Name	Email	Commits
plger	p**n@g**m	468
LTLA	i**s@g**m	21
Nitesh Turaga	n**a@g**m	12
J Wokaty	j**y@s**u	10
xnnba1984	6****4	1
Hervé Pagès	h**s@f**g	1
Federico Marini	m**f@u**e	1
Charlotte Soneson	c**n@g**m	1

Committer Domains (Top 20 + Academic)

uni-mainz.de: 1 fredhutch.org: 1 sph.cuny.edu: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 93
Total pull requests: 29
Average time to close issues: 20 days
Average time to close pull requests: 3 days
Total issue authors: 81
Total pull request authors: 6
Average comments per issue: 3.59
Average comments per pull request: 0.03
Merged pull requests: 26
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 22
Pull requests: 0
Average time to close issues: 17 days
Average time to close pull requests: N/A
Issue authors: 20
Pull request authors: 0
Average comments per issue: 1.95
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

bjstewart1 (3)
ATpoint (3)
LTLA (3)
mariabernard (2)
maryellenlynall (2)
dimitrisokolowskei (2)
JTpath (2)
TdzBAS (2)
drhochbaum (2)
openpaul (1)
kokitsuyuzaki (1)
cwoehle (1)
aghr (1)
ysu2015 (1)
myushen (1)

Pull Request Authors

plger (18)
LTLA (9)
xnnba1984 (1)
grst (1)
shaln (1)
federicomarini (1)
csoneson (1)
kew24 (1)

Dependencies

Dockerfile docker

bioconductor/bioconductor_docker devel build

.github/workflows/check.yaml actions

actions/checkout v2 composite
actions/upload-artifact master composite

DESCRIPTION cran

R >= 4.0 depends
SingleCellExperiment * depends
BiocGenerics * imports
BiocNeighbors * imports
BiocParallel * imports
BiocSingular * imports
DelayedArray * imports
GenomeInfoDb * imports
GenomicRanges * imports
IRanges * imports
MASS * imports
Matrix * imports
Rsamtools * imports
S4Vectors * imports
SummarizedExperiment * imports
bluster * imports
igraph * imports
methods * imports
rtracklayer * imports
scater * imports
scran * imports
scuttle * imports
stats * imports
utils * imports
xgboost * imports
BiocStyle * suggests
ComplexHeatmap * suggests
circlize * suggests
dplyr * suggests
ggplot2 * suggests
knitr * suggests
mbkmeans * suggests
rmarkdown * suggests
scRNAseq * suggests
testthat * suggests
viridisLite * suggests
