https://github.com/arogozhnikov/demuxalot

Reliable, scalable, efficient demultiplexing for single-cell RNA sequencing

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
✓ Academic publication links: Links to biorxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary

Keywords

biotech demultiplexing scrnaseq single-cell-analysis

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: arogozhnikov
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 2.51 MB

Statistics

Stars: 24
Watchers: 5
Forks: 3
Open Issues: 7
Releases: 3

Created over 6 years ago · Last pushed 12 months ago

Metadata Files

Readme License

README.md

Demuxalot

Reliable and efficient identification of genotypes for individual cells in RNA sequencing. Demuxalot refines its knowledge about genotypes directly from the data.

Demuxalot is fast and optimized to work with lots of genotypes, and it lets you efficiently reuse the information inferred from the data.

A preprint is available on bioRxiv.

[!NOTE] My friend recently released a python package to perform single-cell analysis efficiently and without requiring a ton of RAM. Sounds interesting? Then give it a try: https://github.com/slaf-project/slaf

Background

During single-cell RNA sequencing (scRNA-seq) we pool cells from different donors and process them together.

- Pro: all cells come through the same pipeline, so preparation/biological variation effects cancel out of the analysis automatically. Experiments are also much cheaper!
- Con: we don't know the origin of each cell, because everything is mixed!

Demuxalot solves the con: it infers the genotype of each cell by matching the reads coming from that cell against the known genotypes. This is called demultiplexing.
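
To make "matching reads against genotypes" concrete, here is a toy sketch. It is not demuxalot's implementation: the function names, the data layout and the fixed error rate are assumptions made for illustration. Each donor is described by the probability of seeing the alternative allele at each SNP, and each cell by the alleles observed in its reads.

```python
import math


def donor_log_likelihoods(observations, donor_alt_probs, error_rate=0.01):
    """Score one cell against every donor.

    observations: list of (snp_id, is_alt) pairs observed in the cell's reads.
    donor_alt_probs: {donor: {snp_id: probability of the alternative allele}}.
    error_rate: assumed sequencing error rate (hypothetical value).
    """
    scores = {}
    for donor, alt_probs in donor_alt_probs.items():
        total = 0.0
        for snp_id, is_alt in observations:
            p_alt = alt_probs[snp_id]
            # Mix in sequencing error so the probability never reaches 0 or 1.
            p_alt = p_alt * (1 - error_rate) + (1 - p_alt) * error_rate
            total += math.log(p_alt if is_alt else 1 - p_alt)
        scores[donor] = total
    return scores


def posteriors(log_likelihoods):
    """Normalize log-likelihoods into probabilities (uniform prior over donors)."""
    top = max(log_likelihoods.values())
    weights = {d: math.exp(v - top) for d, v in log_likelihoods.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


# Toy genotypes: 1.0 = homozygous alt, 0.5 = heterozygous, 0.0 = homozygous ref.
donors = {
    "Donor1": {"snp_a": 1.0, "snp_b": 0.0, "snp_c": 0.5},
    "Donor2": {"snp_a": 0.0, "snp_b": 0.5, "snp_c": 1.0},
}
cell_reads = [("snp_a", True), ("snp_b", False), ("snp_c", True)]

cell_posteriors = posteriors(donor_log_likelihoods(cell_reads, donors))
best_donor = max(cell_posteriors, key=cell_posteriors.get)
```

With only three informative reads the toy cell is already assigned to `Donor1`; real data has far more positions but also far sparser coverage per cell.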

Comparisons

Demuxalot shows high reliability, data efficiency and speed. Below is a benchmark on PBMC data with 32 donors from the preprint.

[Figure: benchmark screenshot from the preprint]

[!NOTE] We have used demuxalot internally in a number of challenging scenarios with a large biobank and low-depth sequencing, and it shines in these scenarios too. In fact, that is why the algorithm was created in the first place.

Known genotypes and refined genotypes: the tale of two scenarios

Typical approaches to obtaining genotype-specific mutations are:

- whole-genome sequencing (expensive, very good)
  - you have information about all (ok, >90%) of the genotype, and it is unlikely that you need to refine it
  - so you just go straight to demultiplexing
  - demuxlet solves this case
- bead arrays (aka SNP arrays aka DNA microarrays), which are super cheap and practically more relevant
  - you get information about the ~650k most common SNPs; that is only a small fraction, but you also pay very little
  - this case is covered by demuxalot (this package)
  - Illumina's video about this technology

Why is it worth refining genotypes?

A SNP array provides up to ~650k positions in the genome. Around 20-30% of them are specific to a given genotype (i.e. deviate from the majority).

Each genotype has around 10 times more SNVs (single-nucleotide variants) that are not captured by the array. Some of these missing variants are very valuable for demultiplexing.
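
The idea behind refinement can be illustrated with a Beta prior per genotype and position, updated with allele counts from cells that are already assigned to that genotype. This is a conceptual sketch only: the function name, the pseudo-count values and the weighting scheme are assumptions, not the library's actual update rule.

```python
def refine_beta(prior_alt, prior_ref, cell_counts, assignment_probs):
    """Update a Beta(prior_alt, prior_ref) belief for one genotype at one position.

    cell_counts: {barcode: (n_alt, n_ref)} allele counts observed per cell.
    assignment_probs: {barcode: probability that the cell belongs to this genotype}.
    """
    alt, ref = prior_alt, prior_ref
    for barcode, (n_alt, n_ref) in cell_counts.items():
        # Cells contribute in proportion to how confidently they are assigned.
        weight = assignment_probs.get(barcode, 0.0)
        alt += weight * n_alt
        ref += weight * n_ref
    return alt, ref


# Uninformative prior, as for a position that the SNP array did not cover.
alt, ref = refine_beta(
    1.0,
    1.0,
    cell_counts={"cell1": (3, 0), "cell2": (2, 1), "cell3": (0, 4)},
    assignment_probs={"cell1": 1.0, "cell2": 1.0, "cell3": 0.0},
)
expected_alt_fraction = alt / (alt + ref)
```

Here `cell3` is assigned to another genotype, so its reads do not influence the estimate; the two confidently assigned cells move the expected alternative-allele fraction from 0.5 to 0.75.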

What's the special power of demuxalot?

- much better handling of multiple reads coming from the same UMI (i.e. the same transcript)
  - demuxalot efficiently combines information from multiple reads with the same UMI and cross-checks it
- default settings are CellRanger-specific (that is, optimized for the 10x pipeline). CellRanger's and STAR's flags in BAM break some common conventions, but we can still use them efficiently (via filtering callbacks)
- ability to refine genotypes without failing or diverging
  - Vireo is a tool that was created for a similar purpose, but it either diverges or does not learn better genotypes
- optimized variant calling. It is also faster than demuxlet thanks to multiprocessing
- this is not a command-line tool, and is not meant to be
  - write Python code; this gives full control and flexibility over demultiplexing
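
Reads that share a UMI come from the same original molecule, so they should not be counted as independent evidence. The sketch below shows one simple way to combine them: a majority vote per (barcode, UMI, position), dropping ambiguous groups. It is a stand-in for illustration; the README does not describe how demuxalot actually combines such reads, and the function name and tuple layout are hypothetical.

```python
from collections import Counter, defaultdict


def collapse_by_umi(reads):
    """Collapse reads into one consensus base per (barcode, UMI, position).

    reads: iterable of (barcode, umi, position, base) tuples.
    Groups whose two most frequent bases are tied are skipped as ambiguous.
    """
    groups = defaultdict(Counter)
    for barcode, umi, position, base in reads:
        groups[(barcode, umi, position)][base] += 1

    consensus = {}
    for key, counts in groups.items():
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie between the top two bases: no reliable consensus
        consensus[key] = ranked[0][0]
    return consensus


reads = [
    ("AAAC", "UMI1", 1050, "A"),
    ("AAAC", "UMI1", 1050, "A"),
    ("AAAC", "UMI1", 1050, "G"),  # outvoted, likely an amplification or sequencing error
    ("AAAC", "UMI2", 1050, "A"),
    ("AAAC", "UMI3", 2210, "C"),
    ("AAAC", "UMI3", 2210, "T"),  # tie, so this group is dropped
]
consensus = collapse_by_umi(reads)
```

Six reads reduce to two independent observations, which is what keeps a heavily amplified transcript from dominating a cell's likelihood.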

Installation

Plain and simple:

```bash
pip install demuxalot # Requires python >= 3.8
```

Here are some common scenarios and how they are implemented in demuxalot. Also see the examples/ folder.

Running (simple scenario)

Only using provided genotypes

```python
from demuxalot import Demultiplexer, BarcodeHandler, ProbabilisticGenotypes, count_snps

# Loading genotypes
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')

# Loading barcodes
barcode_handler = BarcodeHandler.from_file('path/to/barcodes.csv')

snps = count_snps(
    bamfile_location='path/to/sorted_alignments.bam',
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=barcode_handler,
)

# returns two dataframes with likelihoods and posterior probabilities
likelihoods, posterior_probabilities = Demultiplexer.predict_posteriors(
    snps,
    genotypes=genotypes,
    barcode_handler=barcode_handler,
)
```
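
A common next step is to turn posterior probabilities into one call per barcode. The helper below is a minimal sketch that works on plain dictionaries; its name and the 0.9 threshold are hypothetical, and it assumes the posterior table has one row per barcode and one column per genotype label.

```python
def call_donors(posterior_rows, min_probability=0.9):
    """Pick the most probable label per barcode, or None if it is not confident.

    posterior_rows: {barcode: {label: probability}}.
    min_probability: assumed confidence cut-off (illustrative value).
    """
    calls = {}
    for barcode, row in posterior_rows.items():
        label, probability = max(row.items(), key=lambda item: item[1])
        calls[barcode] = label if probability >= min_probability else None
    return calls


example = {
    "AAACCTGA-1": {"Donor1": 0.98, "Donor2": 0.01, "Donor3": 0.01},
    "AAACGGGT-1": {"Donor1": 0.55, "Donor2": 0.40, "Donor3": 0.05},
}
calls = call_donors(example)
```

If the posteriors are held in a pandas DataFrame with barcodes as the index, `DataFrame.to_dict(orient="index")` produces the nested-dictionary shape this helper expects.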

Running (complex scenario)

Refinement of known genotypes is shown in a notebook, see examples/

Saving/loading genotypes

```python
# You can always export learnt genotypes to be used later
refined_genotypes.save_betas('learnt_genotypes.parquet')
refined_genotypes = ProbabilisticGenotypes(genotype_names=...)  # supply your genotype names
refined_genotypes.add_prior_betas('learnt_genotypes.parquet')
```

Re-saving VCF genotypes with betas (recommended)

Loading the internal parquet-based format is much faster than parsing/validating VCF. It makes sense to export VCF to the internal format in two cases:

- when you plan to load it many times
- when you want to 'accumulate' inferred information about genotypes from multiple scRNA-seq runs

```python
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')
genotypes.save_betas('learnt_genotypes.parquet')

# later you can use it.
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_prior_betas('learnt_genotypes.parquet')
```
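
The "load it many times" case can be wrapped in a small caching helper. This is a sketch under stated assumptions: the helper name and its return values are hypothetical, and it relies only on the three genotype methods used in this section (`add_vcf`, `save_betas`, `add_prior_betas`), called on whatever genotypes object you pass in.

```python
from pathlib import Path


def load_genotypes_cached(genotypes, vcf_path, cache_path):
    """Fill `genotypes` from a parquet cache if present, else from the VCF.

    genotypes: an empty genotypes object exposing add_vcf, save_betas
        and add_prior_betas.
    Returns "cache" or "vcf" to indicate which source was used.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Fast path: reuse the previously exported internal format.
        genotypes.add_prior_betas(str(cache))
        return "cache"
    # Slow path: parse the VCF once, then export it for later runs.
    genotypes.add_vcf(str(vcf_path))
    genotypes.save_betas(str(cache))
    return "vcf"
```

A call might look like `load_genotypes_cached(ProbabilisticGenotypes(genotype_names=[...]), 'path/to/genotypes.vcf', 'learnt_genotypes.parquet')`, with the same genotype names on every run.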

Owner

Name: Alex Rogozhnikov
Login: arogozhnikov
Kind: user
Location: San Francisco
Company: Aperture Science

Website: https://arogozhnikov.github.io
Repositories: 9
Profile: https://github.com/arogozhnikov

ML + Science, einops, scientific tools

GitHub Events

Total

Create event: 4
Release event: 2
Issues event: 1
Watch event: 1
Delete event: 3
Issue comment event: 5
Push event: 4
Pull request event: 5
Fork event: 2

Last Year

Create event: 4
Release event: 2
Issues event: 1
Watch event: 1
Delete event: 3
Issue comment event: 5
Push event: 4
Pull request event: 5
Fork event: 2

Committers

Last synced: 12 months ago

All Time

Total Commits: 181
Total Committers: 3
Avg Commits per committer: 60.333
Development Distribution Score (DDS): 0.033

Past Year

Commits: 6
Committers: 2
Avg Commits per committer: 3.0
Development Distribution Score (DDS): 0.333

Top Committers

Name	Email	Commits
Alex Rogozhnikov	i**m@g**m	175
Marcel Schilling	m**g@i**t	4
leqi0001	8****1	2

Committer Domains (Top 20 + Academic)

idibell.cat: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 13
Total pull requests: 22
Average time to close issues: 12 months
Average time to close pull requests: about 17 hours
Total issue authors: 7
Total pull request authors: 4
Average comments per issue: 3.23
Average comments per pull request: 0.64
Merged pull requests: 20
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: about 2 hours
Issue authors: 2
Pull request authors: 2
Average comments per issue: 5.0
Average comments per pull request: 2.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

arogozhnikov (6)
mschilli87 (1)
himanshiarora7 (1)
leqi0001 (1)
Liane990 (1)
cartographerJ (1)

Pull Request Authors

arogozhnikov (21)
mschilli87 (7)
leqi0001 (2)
danielsarj (1)

Top Labels

Issue Labels

question (3)

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 23 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 3
Total maintainers: 1

pypi.org: demuxalot

Scalable and reliable demultiplexing for single-cell RNA sequencing.

Homepage: https://github.com/arogozhnikov/demuxalot
Documentation: https://demuxalot.readthedocs.io/
License: MIT
Latest release: 0.4.3
published about 1 year ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 23 Last month

Rankings

Dependent packages count: 10.0%

Dependent repos count: 21.7%

Average: 23.6%

Downloads: 38.9%

Maintainers (1)

alex.rogozhnikov

Last synced: 11 months ago

Dependencies

.github/workflows/publish.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

.github/workflows/run_test.yml actions

actions/checkout v4 composite
actions/setup-python v5 composite

pyproject.toml pypi

joblib *
numpy *
pandas *
pyarrow *
pysam *
scipy *