https://github.com/braingeneers/cellxgene-ingest

Download scRNASeq data from the CZ Cell x Gene Census and upload it to NRP S3 in a form optimized for ML

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.1%) to scientific vocabulary

Last synced: 10 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: braingeneers
License: mit
Language: Python
Default Branch: main
Size: 25.4 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed about 2 years ago

Metadata Files

Readme License

cellxgene-ingest

Ingest cellxgene data into s3 as chunked h5ad files

Run

Query cellxgene for a list of sample ids and save locally as a feather file:

    python index.py

You can then further filter this index based on associated metadata:

    In [3]: df = pd.read_feather("data/index.feather")

    In [4]: df.head()
    Out[4]:
       soma_joinid      assay cell_type tissue tissue_general suspension_type disease
    0      7560043  10x 3' v2    T cell  blood          blood            cell  normal
    1      7560044  10x 3' v2    T cell  blood          blood            cell  normal
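As an illustration of that filtering step, here is a minimal sketch that selects ids by metadata. It assumes the index has the columns shown in the df.head() output above; the in-memory rows are made-up stand-ins for data/index.feather, and the filter values are only examples.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_feather("data/index.feather").
# Column names follow the df.head() output above; the third row is made up.
df = pd.DataFrame(
    {
        "soma_joinid": [7560043, 7560044, 7560045],
        "assay": ["10x 3' v2", "10x 3' v2", "10x 3' v3"],
        "cell_type": ["T cell", "T cell", "neuron"],
        "tissue": ["blood", "blood", "cortex"],
        "tissue_general": ["blood", "blood", "brain"],
        "suspension_type": ["cell", "cell", "nucleus"],
        "disease": ["normal", "normal", "normal"],
    }
)

# Keep only normal blood samples, then collect the ids of the matching rows.
mask = (df["tissue_general"] == "blood") & (df["disease"] == "normal")
selected_ids = df.loc[mask, "soma_joinid"].tolist()
```

Whether ingest.py can consume a filtered index directly is not stated in this README, so treat the resulting id list as input to your own workflow.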

Download 2 genes from 1 observation from cellxgene and upload to braingeneers/personal/foo:

    python ingest.py -n 1 -c 1 -d 1 --gene-filter ENSG00000161798,ENSG00000139618 personal/foo
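Judging from the Performance output in this README (10,000 observations written as 100 files), -n appears to set the number of observations and -c the number of output files. How ingest.py actually divides the observations is not shown here, so the following is only a sketch of one plausible chunking scheme; split_into_chunks is a hypothetical helper, not a function from the repository.

```python
def split_into_chunks(ids, num_chunks):
    """Split ids into num_chunks contiguous lists of near-equal size."""
    base, extra = divmod(len(ids), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional id each.
        size = base + (1 if i < extra else 0)
        chunks.append(ids[start:start + size])
        start += size
    return chunks


# Mirrors the shape of the Performance run: 10,000 observations, 100 files.
ids = list(range(10_000))
chunks = split_into_chunks(ids, 100)
```

Each chunk would then correspond to one h5ad file, which keeps file sizes roughly uniform.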

Install

pip install -r requirements.txt

Performance

With 20 Ray actors:

    $ python ingest-pool.py -n 10000 -c 100 -d 20 personal/rcurrie/cellxgene
    Downloading 10,000 observations in 100 files to s3://braingeneers/personal/rcurrie/cellxgene/
    2024-06-10 06:40:08,670 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
    Creating pool of 20 ray actors...
    Ingesting...
    100%|█████████████████████████████████████████████████████████| 100/100 [02:02<00:00, 1.22s/it]
    Done.
    100 files ingested in 2.04 minutes.
    3.40 hours per 1M observations.
    11.19 MB average file size.

With 40 Ray actors:

    $ python ingest-pool.py -n 10000 -c 100 -d 40 personal/rcurrie/cellxgene
    Downloading 10,000 observations in 100 files to s3://braingeneers/personal/rcurrie/cellxgene/
    2024-06-10 06:44:03,905 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
    Creating pool of 40 ray actors...
    Ingesting...
    100%|█████████████████████████████████████████████████████████| 100/100 [01:24<00:00, 1.19it/s]
    Done.
    100 files ingested in 1.41 minutes.
    2.34 hours per 1M observations.
    11.19 MB average file size.
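The "hours per 1M observations" figures are linear extrapolations of the measured run time. A small sketch of that arithmetic, using only the numbers reported above (hours_per_million is a hypothetical helper name):

```python
def hours_per_million(minutes, observations):
    """Extrapolate a measured run time to hours per 1M observations."""
    return minutes * (1_000_000 / observations) / 60


run_20_actors = hours_per_million(2.04, 10_000)  # about 3.40 hours
run_40_actors = hours_per_million(1.41, 10_000)  # about 2.35 hours

# Ratio of the two reported run times.
speedup = 2.04 / 1.41
```

The second figure differs slightly from the reported 2.34 because the log presumably extrapolates from the unrounded run time. Doubling the actors from 20 to 40 cut the run time by a factor of roughly 1.45 rather than 2, so scaling in these two runs was sublinear.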

TileDB-SOMA On Older CPUs

The cellxgene-census package depends on TileDB-SOMA, which leverages AVX2 on modern CPUs. The TileDB-SOMA Python wheels assume AVX2 is available, so on CPUs without it they fail with "illegal hardware instruction (core dumped)". To check whether your CPU supports AVX2:

    cat /proc/cpuinfo | grep avx2 | head -1

To run on non-AVX2 CPUs, build from source and install into your existing Python environment or active virtualenv via:

    git clone https://github.com/single-cell-data/TileDB-SOMA.git
    pip install -v -e TileDB-SOMA/apis/python
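The same AVX2 check can be done from Python before importing cellxgene-census, which may be handy in a setup script. This is a minimal sketch for Linux hosts only; has_avx2 and host_has_avx2 are hypothetical helper names, and the parsing assumes the usual x86 /proc/cpuinfo layout with a "flags" line per core.

```python
from pathlib import Path


def has_avx2(cpuinfo_text):
    """Return True if any 'flags' line in /proc/cpuinfo text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


def host_has_avx2(path="/proc/cpuinfo"):
    """Check the running host; returns False if the file does not exist."""
    cpuinfo = Path(path)
    return cpuinfo.exists() and has_avx2(cpuinfo.read_text())
```

If host_has_avx2() returns False on an x86 Linux machine, the source build described above is the likely route.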

References

Cell x Gene

TileDB 101: Single Cell

anndata - Annotated data

Python and boto3 Performance Adventures: Synchronous vs Asynchronous AWS API Interaction

Ray

For local development, get a Ray cluster running

Interactive Ray Service development

For details on using Ray Actors vs. Data for processing see Model Batch Inference in Ray: Actors, ActorPool, and Datasets

Owner

Name: braingeneers
Login: braingeneers
Kind: organization

Repositories: 15
Profile: https://github.com/braingeneers

Dependencies

requirements.txt pypi

awscli *
boto3 *
cellxgene-census ==1.13.1
ipython *
jupyter *
jupyter-console *
ray ==2.23.0
tqdm *
