https://github.com/arogozhnikov/demuxalot

Reliable, scalable, efficient demultiplexing for single-cell RNA sequencing

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

biotech demultiplexing scrnaseq single-cell-analysis
Last synced: 7 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: arogozhnikov
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 2.51 MB
Statistics
  • Stars: 24
  • Watchers: 5
  • Forks: 3
  • Open Issues: 7
  • Releases: 3
Topics
biotech demultiplexing scrnaseq single-cell-analysis
Created about 6 years ago · Last pushed 9 months ago
Metadata Files
Readme License

README.md

Demuxalot

Reliable and efficient identification of genotypes for individual cells in RNA sequencing. Demuxalot refines its knowledge about genotypes directly from the data.

Demuxalot is fast and optimized to work with lots of genotypes, enabling efficient reutilization of inferred information from the data.

Preprint is available at biorxiv.

[!NOTE] My friend recently released a python package to perform single-cell analysis efficiently and without requiring a ton of RAM. Sounds interesting? Then give it a try: https://github.com/slaf-project/slaf

Background

During single-cell RNA sequencing (scRNA-seq) we pool cells from different donors and process them together.

  • Pro: all cells come through the same pipeline, so preparation/biological variation effects are cancelled out from analysis automatically. Also experiments are much cheaper!
  • Con: we don't know cell origin, everything is mixed!

Demuxalot solves the con: it infers the genotype of each cell by matching the reads coming from that cell against the known genotypes. This is called demultiplexing.
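To make the idea of "matching reads against genotypes" concrete, here is a minimal toy sketch. It is not demuxalot's actual algorithm: the donor table, the allele probabilities, and the function names are all hypothetical, and it assumes a uniform prior over donors with independent reads.

```python
import math

# Hypothetical toy data: probability of observing the ALT allele at each of
# three SNPs for each donor (0/0 -> ~0.01, 0/1 -> 0.5, 1/1 -> ~0.99).
donor_alt_prob = {
    "Donor1": [0.01, 0.50, 0.99],
    "Donor2": [0.99, 0.01, 0.50],
    "Donor3": [0.50, 0.99, 0.01],
}

def log_likelihoods(cell_counts, donor_alt_prob):
    """cell_counts: list of (ref_reads, alt_reads), one pair per SNP."""
    result = {}
    for donor, probs in donor_alt_prob.items():
        total = 0.0
        for (n_ref, n_alt), p_alt in zip(cell_counts, probs):
            total += n_ref * math.log(1 - p_alt) + n_alt * math.log(p_alt)
        result[donor] = total
    return result

def posteriors(loglik):
    """Uniform prior over donors; subtract the max for numerical stability."""
    top = max(loglik.values())
    weights = {donor: math.exp(value - top) for donor, value in loglik.items()}
    norm = sum(weights.values())
    return {donor: w / norm for donor, w in weights.items()}

# Reads observed in one cell at the three SNPs: (ref, alt) counts.
cell_counts = [(3, 0), (1, 1), (0, 2)]
post = posteriors(log_likelihoods(cell_counts, donor_alt_prob))
best_donor = max(post, key=post.get)
```

In this toy case the reads agree best with the hypothetical `Donor1` profile, so that donor receives nearly all of the posterior mass.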

Comparisons

Demuxalot shows high reliability, data efficiency and speed. Below is a benchmark on PBMC data with 32 donors, taken from the preprint.

[Figure: benchmark on PBMC data with 32 donors, from the preprint]

[!NOTE] We have used demuxalot internally in a number of challenging scenarios with a large biobank and low-depth sequencing, and it shines in these scenarios too. Actually, that's why the algorithm was created in the first place.

Known genotypes and refined genotypes: the tale of two scenarios

Typical approaches to obtaining genotype-specific mutations are:

  • whole-genome sequencing (expensive, very good)
    • you have information about nearly all (ok, >90%) of the genotype, and it is unlikely that you need to refine it
    • so you just go straight to demultiplexing
    • demuxlet solves this case
  • Bead arrays (aka SNP arrays aka DNA microarrays) are super cheap and practically more relevant
    • you get information about ~650k most common SNPs, and that's only a small fraction, but you also pay very little
    • this case is covered by demuxalot (this package)
    • Illumina's video about this technology

Why is it worth refining genotypes?

A SNP array provides up to ~650k positions in the genome. Around 20-30% of them are specific to a given genotype (i.e. deviate from the majority).

Each genotype has around 10 times more SNVs (single-nucleotide variants) that are not captured by the array. Some of these missing SNPs are very valuable for demultiplexing.
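A quick back-of-the-envelope calculation for the numbers above. It assumes that "10 times more" is relative to the genotype-specific positions on the array; that reading is an interpretation of the text, not a figure from the preprint.

```python
array_positions = 650_000           # upper bound quoted for a SNP array
specific_percent = (20, 30)         # share of positions specific to a genotype
uncaptured_ratio = 10               # assumed: relative to the specific positions

# Integer arithmetic keeps the results exact.
specific_low, specific_high = (array_positions * p // 100 for p in specific_percent)
uncaptured_low = specific_low * uncaptured_ratio
uncaptured_high = specific_high * uncaptured_ratio

print(f"genotype-specific array positions: {specific_low:,} to {specific_high:,}")
print(f"variants missed by the array:      {uncaptured_low:,} to {uncaptured_high:,}")
```

Under that assumption, the array sees only about one in eleven of a donor's distinguishing variants, which is why learning the remaining ones from the data can pay off.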

What's the special power of demuxalot?

  • much better handling of multiple reads coming from the same UMI (i.e. same transcript)
    • demuxalot efficiently combines information from multiple reads with same UMI and cross-checks it
  • default settings are CellRanger-specific (that is - optimized for 10X pipeline). Cellranger's and STAR's flags in BAM break some common conventions, but we can still efficiently use them (by using filtering callbacks)
  • ability to refine genotypes without failing or diverging
    • Vireo is a tool that was created with similar purposes. But it either diverges or does not learn better genotypes
  • optimized variant calling. It's also faster than demuxlet due to multiprocessing
  • this is not a command-line tool, and not meant to be
    • write python code, this gives full control and flexibility of demultiplexing
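The first point in the list above (reads sharing a UMI come from the same transcript and should not be counted as independent evidence) can be illustrated with a small sketch. This is a toy under stated assumptions, not demuxalot's implementation: the `collapse_umis` helper, the tuple layout, and the strict "drop on disagreement" rule are all hypothetical.

```python
from collections import defaultdict

def collapse_umis(reads):
    """Collapse reads into one base call per (barcode, umi, position).

    reads: iterable of (barcode, umi, position, base) tuples.
    Groups whose reads disagree are dropped instead of being counted.
    """
    groups = defaultdict(list)
    for barcode, umi, position, base in reads:
        groups[(barcode, umi, position)].append(base)

    consensus = {}
    for key, bases in groups.items():
        if len(set(bases)) == 1:  # every read of this transcript agrees
            consensus[key] = bases[0]
    return consensus

reads = [
    ("AAAC", "umi1", 1050, "A"),
    ("AAAC", "umi1", 1050, "A"),  # duplicate read of the same transcript
    ("AAAC", "umi2", 1050, "G"),
    ("AAAC", "umi3", 1050, "A"),
    ("AAAC", "umi3", 1050, "G"),  # conflicting reads: group is discarded
]
calls = collapse_umis(reads)
```

Five reads collapse to two independent observations here; counting all five separately would overstate the evidence for allele `A`.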

Installation

Plain and simple: `pip install demuxalot` (requires Python >= 3.8).

Here are some common scenarios and how they are implemented in demuxalot. See also the examples/ folder.

Running (simple scenario)

Only using provided genotypes

```python
from demuxalot import Demultiplexer, BarcodeHandler, ProbabilisticGenotypes, count_snps

# Loading genotypes
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')

# Loading barcodes
barcode_handler = BarcodeHandler.from_file('path/to/barcodes.csv')

# Counting reads that cover known SNP positions, per barcode
snps = count_snps(
    bamfile_location='path/to/sorted_alignments.bam',
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=barcode_handler,
)

# returns two dataframes with likelihoods and posterior probabilities
likelihoods, posterior_probabilities = Demultiplexer.predict_posteriors(
    snps,
    genotypes=genotypes,
    barcode_handler=barcode_handler,
)
```
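Once you have the posterior table, a common next step is to pick one donor per barcode. The sketch below uses a hand-made table in place of real output. It assumes one row per barcode and one column per genotype, and the barcodes, values, and the `min_confidence` threshold are hypothetical.

```python
import pandas as pd

# Hypothetical posterior table standing in for the real output.
posterior_probabilities = pd.DataFrame(
    {
        "Donor1": [0.98, 0.04, 0.40],
        "Donor2": [0.01, 0.93, 0.35],
        "Donor3": [0.01, 0.03, 0.25],
    },
    index=["AAACCTGA-1", "AAACGGGT-1", "AAAGATGC-1"],
)

min_confidence = 0.9  # hypothetical threshold, tune it for your data

best_donor = posterior_probabilities.idxmax(axis=1)  # most probable genotype
best_prob = posterior_probabilities.max(axis=1)      # its posterior probability
# Keep the call only where the posterior is high enough.
assignment = best_donor.where(best_prob >= min_confidence, "unassigned")
print(assignment)
```

Barcodes without a confident call are labelled `unassigned` rather than forced onto the nearest donor, which keeps ambiguous cells out of downstream analysis.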

Running (complex scenario)

Refinement of known genotypes is shown in a notebook, see examples/

Saving/loading genotypes

```python
# You can always export learnt genotypes to be used later
refined_genotypes.save_betas('learnt_genotypes.parquet')

# The value of genotype_names is elided in the source; pass your own list of names
refined_genotypes = ProbabilisticGenotypes(genotype_names=...)
refined_genotypes.add_prior_betas('learnt_genotypes.parquet')
```

Re-saving VCF genotypes with betas (recommended)

Loading the internal parquet-based format is much faster than parsing and validating a VCF. It makes sense to export a VCF to the internal format in two cases:

  1. when you plan to load it many times.
  2. when you want to 'accumulate' inferred information about genotypes from multiple scRNA-seq runs

```python
# Convert the VCF to the internal parquet-based format once
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')
genotypes.save_betas('learnt_genotypes.parquet')

# later you can use it.
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_prior_betas('learnt_genotypes.parquet')
```
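One way to picture "accumulating" information across runs is as pseudo-counts that grow with each run. The sketch below is only an illustration of that idea for a single SNP of a single donor. It assumes a simple Beta-style update and makes no claim about what demuxalot stores internally; the helper names and the counts are hypothetical.

```python
def update_pseudocounts(prior, observed):
    """Add observed (ref, alt) read counts to prior (ref, alt) pseudo-counts."""
    return (prior[0] + observed[0], prior[1] + observed[1])

def alt_allele_mean(pseudocounts):
    """Mean ALT-allele frequency implied by the pseudo-counts."""
    ref, alt = pseudocounts
    return alt / (ref + alt)

prior = (1.0, 1.0)                                      # uninformative start
after_run1 = update_pseudocounts(prior, (0, 6))         # first run: 6 ALT reads
after_run2 = update_pseudocounts(after_run1, (1, 9))    # second run adds more

print(alt_allele_mean(prior), alt_allele_mean(after_run1), alt_allele_mean(after_run2))
```

Each run starts from the state saved by the previous one, so the estimate sharpens as evidence accumulates instead of being relearned from scratch.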

Owner

  • Name: Alex Rogozhnikov
  • Login: arogozhnikov
  • Kind: user
  • Location: San Francisco
  • Company: Aperture Science

ML + Science, einops, scientific tools

GitHub Events

Total
  • Create event: 4
  • Release event: 2
  • Issues event: 1
  • Watch event: 1
  • Delete event: 3
  • Issue comment event: 5
  • Push event: 4
  • Pull request event: 5
  • Fork event: 2
Last Year
  • Create event: 4
  • Release event: 2
  • Issues event: 1
  • Watch event: 1
  • Delete event: 3
  • Issue comment event: 5
  • Push event: 4
  • Pull request event: 5
  • Fork event: 2

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 181
  • Total Committers: 3
  • Avg Commits per committer: 60.333
  • Development Distribution Score (DDS): 0.033
Past Year
  • Commits: 6
  • Committers: 2
  • Avg Commits per committer: 3.0
  • Development Distribution Score (DDS): 0.333
Top Committers
Name Email Commits
Alex Rogozhnikov i****m@g****m 175
Marcel Schilling m****g@i****t 4
leqi0001 8****1 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 13
  • Total pull requests: 22
  • Average time to close issues: 12 months
  • Average time to close pull requests: about 17 hours
  • Total issue authors: 7
  • Total pull request authors: 4
  • Average comments per issue: 3.23
  • Average comments per pull request: 0.64
  • Merged pull requests: 20
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 2
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: about 2 hours
  • Issue authors: 2
  • Pull request authors: 2
  • Average comments per issue: 5.0
  • Average comments per pull request: 2.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • arogozhnikov (6)
  • mschilli87 (1)
  • himanshiarora7 (1)
  • leqi0001 (1)
  • Liane990 (1)
  • cartographerJ (1)
Pull Request Authors
  • arogozhnikov (21)
  • mschilli87 (7)
  • leqi0001 (2)
  • danielsarj (1)
Top Labels
Issue Labels
question (3)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 23 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: demuxalot

Scalable and reliable demultiplexing for single-cell RNA sequencing.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 23 Last month
Rankings
Dependent packages count: 10.0%
Dependent repos count: 21.7%
Average: 23.6%
Downloads: 38.9%
Maintainers (1)
Last synced: 7 months ago

Dependencies

.github/workflows/publish.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
.github/workflows/run_test.yml actions
  • actions/checkout v4 composite
  • actions/setup-python v5 composite
pyproject.toml pypi
  • joblib *
  • numpy *
  • pandas *
  • pyarrow *
  • pysam *
  • scipy *