https://github.com/braingeneers/cellxgene-ingest
Download scRNASeq data from the CZ Cell x Gene Census and upload it to NRP S3, optimized for ML
Repository
Basic Info
- Host: GitHub
- Owner: braingeneers
- License: mit
- Language: Python
- Default Branch: main
- Size: 25.4 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
cellxgene-ingest
Ingest cellxgene data into S3 as chunked h5ad files.
Run
Query cellxgene for a list of sample ids and save locally as a feather file:
python index.py
You can then further filter this index based on associated metadata:
In [3]: df = pd.read_feather("data/index.feather")
In [4]: df.head()
Out[4]:
   soma_joinid      assay cell_type tissue tissue_general suspension_type disease
0      7560043  10x 3' v2    T cell  blood          blood            cell  normal
1      7560044  10x 3' v2    T cell  blood          blood            cell  normal
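A minimal sketch of this kind of filtering, assuming the index has the columns shown above. The sample rows and the `filter_index` helper are illustrative and not part of this repository; in practice `df` would come from `pd.read_feather("data/index.feather")`.

```python
import pandas as pd

# Illustrative stand-in for data/index.feather. The first two rows mirror the
# output above; the third row is made up so the filter has something to drop.
df = pd.DataFrame(
    {
        "soma_joinid": [7560043, 7560044, 7560045],
        "assay": ["10x 3' v2", "10x 3' v2", "10x 3' v3"],
        "cell_type": ["T cell", "T cell", "B cell"],
        "tissue": ["blood", "blood", "lung"],
        "tissue_general": ["blood", "blood", "lung"],
        "suspension_type": ["cell", "cell", "nucleus"],
        "disease": ["normal", "normal", "normal"],
    }
)


def filter_index(index: pd.DataFrame, **criteria: str) -> pd.DataFrame:
    """Keep rows whose columns equal every given value (hypothetical helper)."""
    mask = pd.Series(True, index=index.index)
    for column, value in criteria.items():
        mask &= (index[column] == value)
    return index[mask].reset_index(drop=True)


# Keep only T cells sampled from blood.
subset = filter_index(df, tissue_general="blood", cell_type="T cell")
print(subset["soma_joinid"].tolist())
```

Any combination of the metadata columns can be passed as keyword arguments; calling `filter_index(df)` with no criteria returns the index unchanged.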
Download 2 genes from 1 observation from cellxgene and upload the result to braingeneers/personal/foo:
python ingest.py -n 1 -c 1 -d 1 --gene-filter ENSG00000161798,ENSG00000139618 personal/foo
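The output is described as chunked h5ad files, and the Performance runs write 10,000 observations into 100 files. How `ingest.py` actually partitions observations is not shown in this README, so the following is only a sketch of one plausible scheme: splitting a list of `soma_joinid` values into contiguous, near-equal chunks. The `chunk_ids` helper is hypothetical.

```python
def chunk_ids(ids, num_chunks):
    """Split ids into num_chunks contiguous, near-equal lists (hypothetical helper)."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    # Every chunk gets `base` ids; the first `extra` chunks get one more.
    base, extra = divmod(len(ids), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(ids[start:start + size])
        start += size
    return chunks


# Ten made-up consecutive ids split into three chunks.
ids = list(range(7560043, 7560043 + 10))
chunks = chunk_ids(ids, 3)
print([len(c) for c in chunks])  # [4, 3, 3]
```

With 10,000 ids and 100 chunks, this scheme would yield 100 observations per file.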
Install
pip install -r requirements.txt
Performance
$ python ingest-pool.py -n 10000 -c 100 -d 20 personal/rcurrie/cellxgene
Downloading 10,000 observations in 100 files to s3://braingeneers/personal/rcurrie/cellxgene/
2024-06-10 06:40:08,670 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Creating pool of 20 ray actors...
Ingesting...
100%|█████████████████████████████████████████████████████████| 100/100 [02:02<00:00, 1.22s/it]
Done.
100 files ingested in 2.04 minutes.
3.40 hours per 1M observations.
11.19 MB average file size.
$ python ingest-pool.py -n 10000 -c 100 -d 40 personal/rcurrie/cellxgene
Downloading 10,000 observations in 100 files to s3://braingeneers/personal/rcurrie/cellxgene/
2024-06-10 06:44:03,905 INFO worker.py:1740 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Creating pool of 40 ray actors...
Ingesting...
100%|█████████████████████████████████████████████████████████| 100/100 [01:24<00:00, 1.19it/s]
Done.
100 files ingested in 1.41 minutes.
2.34 hours per 1M observations.
11.19 MB average file size.
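The "hours per 1M observations" figures follow from scaling the measured wall-clock time linearly to one million observations. The sketch below reproduces that arithmetic from the rounded minutes in the logs; `hours_per_million` is an illustrative helper, not a function from this repository. Because the logged minutes are already rounded, the 40-actor estimate comes out at about 2.35 rather than the logged 2.34.

```python
def hours_per_million(elapsed_minutes: float, n_observations: int) -> float:
    """Linearly extrapolate elapsed time to 1M observations, in hours."""
    return elapsed_minutes / 60 * (1_000_000 / n_observations)


# Actor count -> elapsed minutes, taken from the two runs above (10,000 observations each).
runs = {20: 2.04, 40: 1.41}

for actors, minutes in runs.items():
    estimate = hours_per_million(minutes, 10_000)
    print(f"{actors} actors: {estimate:.2f} hours per 1M observations")

# Doubling the pool from 20 to 40 actors gives less than a 2x speedup.
speedup = runs[20] / runs[40]
print(f"speedup from 20 to 40 actors: {speedup:.2f}x")
```

In these two runs, doubling the actor pool cut the elapsed time by roughly 30%, a speedup of about 1.45x, so scaling is sublinear at this pool size.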
TileDB-SOMA On Older CPUs
The cellxgene-census package depends on TileDB-SOMA, which uses AVX2 instructions on modern CPUs. The TileDB-SOMA Python wheels assume AVX2 is available, so on CPUs without it they crash with "illegal hardware instruction (core dumped)". To check whether your CPU supports AVX2, run cat /proc/cpuinfo | grep avx2 | head -1; no output means AVX2 is not available. To run on non-AVX2 CPUs, build TileDB-SOMA from source and install it into your existing Python environment or active virtualenv:
git clone https://github.com/single-cell-data/TileDB-SOMA.git
pip install -v -e TileDB-SOMA/apis/python
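The AVX2 check can also be done from Python before importing cellxgene-census. This is a minimal, Linux-only sketch that parses the `flags` lines of `/proc/cpuinfo`; the `has_avx2` helper is illustrative and not part of this repository or of TileDB-SOMA.

```python
from pathlib import Path


def has_avx2(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line in /proc/cpuinfo-style text lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the whole token so that e.g. 'avx' alone does not count.
        if key.strip() == "flags" and "avx2" in value.split():
            return True
    return False


cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    print("AVX2 supported:", has_avx2(cpuinfo.read_text()))
else:
    print("/proc/cpuinfo not found; this check only works on Linux")
```

If this reports that AVX2 is not supported, the prebuilt wheels are likely to crash and the source build above is the route to take.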
References
Python and boto3 Performance Adventures: Synchronous vs Asynchronous AWS API Interaction
Ray
For local development get a Ray Cluster Running
Interactive Ray Service development
For details on using Ray Actors vs. Data for processing see Model Batch Inference in Ray: Actors, ActorPool, and Datasets
Owner
- Name: braingeneers
- Login: braingeneers
- Kind: organization
- Repositories: 15
- Profile: https://github.com/braingeneers
Dependencies
- awscli *
- boto3 *
- cellxgene-census ==1.13.1
- ipython *
- jupyter *
- jupyter-console *
- ray ==2.23.0
- tqdm *