https://github.com/bowang-lab/scgpt

Science Score: 59.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org
  • Committers with academic emails
    1 of 6 committers (16.7%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.8%) to scientific vocabulary

Keywords

foundation-model gpt single-cell
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 1,309
  • Watchers: 35
  • Forks: 276
  • Open Issues: 160
  • Releases: 9
Topics
foundation-model gpt single-cell
Created almost 3 years ago · Last pushed 6 months ago
Metadata Files
Readme License

README.md

scGPT

This is the official codebase for scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI.

Preprint   Documentation   PyPI version   Downloads   Webapp   License

UPDATE: We have released several new pretrained scGPT checkpoints. Please see the Pretrained scGPT Model Zoo section for more details.

[2024.02.26] We have provided preliminary support for running the pretraining workflow with HuggingFace on the integrate-huggingface-model branch. We will conduct further testing and merge it into the main branch soon.

[2023.12.31] New tutorials about zero-shot applications are now available! Please find them in the tutorials/zero-shot directory. We also provide a new continual pretrained model checkpoint for cell embedding related tasks. Please see the notebook for more details.

[2023.11.07] As requested by many, we have now made flash-attention an optional dependency. The pretrained weights can be loaded on the PyTorch CPU, GPU, and flash-attn backends using the same load_pretrained function, load_pretrained(target_model, torch.load("path_to_ckpt.pt")). An example usage is also here.
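As a quick illustration, here is a minimal sketch of that call, assuming scgpt is installed and a checkpoint file has been downloaded; the import path and the model-construction step are placeholders for illustration, not the exact scGPT API.

```python
# Minimal sketch of the load_pretrained call quoted above, assuming
# scgpt is installed and a checkpoint file has been downloaded.
# The import path below is an assumption for illustration.
import torch
from scgpt.utils import load_pretrained  # import path is an assumption

target_model = ...  # placeholder: construct a scGPT model matching the checkpoint
# map_location="cpu" loads the weights on the PyTorch CPU backend;
# move the model to CUDA afterwards for the GPU or flash-attn backends.
state_dict = torch.load("path_to_ckpt.pt", map_location="cpu")
load_pretrained(target_model, state_dict)
```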

[2023.09.05] We have released a new feature for reference mapping samples to a custom reference dataset or to all the millions of cells collected from CellXGene! With the help of the faiss library, we achieved great time and memory efficiency. The index of over 33 million cells takes less than 1 GB of memory, and the similarity search takes less than 1 second for 10,000 query cells on GPU. Please see the Reference mapping tutorial for more details.
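For intuition about the efficiency claim, the snippet below shows generic faiss similarity search of the kind this feature relies on; the embedding dimension and data are made up, and this is not scGPT's actual reference-mapping code.

```python
# Generic faiss nearest-neighbour search, as a hedged sketch of the kind
# of index lookup used for reference mapping; not scGPT's actual code.
import faiss
import numpy as np

d = 512  # embedding dimension (placeholder)
rng = np.random.default_rng(0)
ref = rng.random((100_000, d), dtype=np.float32)   # reference cell embeddings
query = rng.random((10_000, d), dtype=np.float32)  # query cell embeddings

# Normalize so that inner product equals cosine similarity.
faiss.normalize_L2(ref)
faiss.normalize_L2(query)

index = faiss.IndexFlatIP(d)  # exact inner-product index
index.add(ref)                # index the reference cells
scores, neighbors = index.search(query, 10)  # 10 nearest references per query cell
```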

Online apps

scGPT is now available in the following online apps as well, so you can get started right in your browser!

Installation

scGPT works with Python >= 3.7.13 and R >= 3.6.1. Please make sure you have the correct versions of Python and R installed before proceeding.

scGPT is available on PyPI. To install scGPT, run the following command:

```bash
pip install scgpt "flash-attn<1.0.5"  # optional, recommended
```

As of 2023.09, pip install may not run with new versions of the google orbax package. If you encounter related issues, please use the following command instead:

```bash
pip install scgpt "flash-attn<1.0.5" "orbax<0.1.8"
```

[Optional] We recommend using wandb for logging and visualization.

```bash
pip install wandb
```

Note: The Poetry installation is currently out of sync. Please use pip install instead. ~~For development, we use the Poetry package manager. To install Poetry, follow the instructions here.~~

```bash
$ git clone this-repo-url
$ cd scGPT
$ poetry install
```

Note: The flash-attn dependency usually requires a specific GPU and CUDA version. If you encounter any issues, please refer to the flash-attn repository for installation instructions. As of May 2023, we recommend CUDA 11.7 and flash-attn<1.0.5 due to various issues reported with installing newer versions of flash-attn.

Pretrained scGPT Model Zoo

Here is the list of pretrained models; please find the links for downloading the checkpoint folders in the table below. We recommend using the whole-human model for most applications by default. If your fine-tuning dataset shares a similar cell type context with the training data of the organ-specific models, those models can usually deliver competitive performance as well. A paired vocabulary file mapping gene names to ids is provided in each checkpoint folder; a minimal sketch of reading it follows the table. If ENSEMBL ids are needed, please find the conversion in gene_info.csv.

| Model name                | Description                                              | Download |
| :------------------------ | :------------------------------------------------------- | :------- |
| whole-human (recommended) | Pretrained on 33 million normal human cells.              | link     |
| continual pretrained      | For zero-shot cell embedding related tasks.               | link     |
| brain                     | Pretrained on 13.2 million brain cells.                   | link     |
| blood                     | Pretrained on 10.3 million blood and bone marrow cells.   | link     |
| heart                     | Pretrained on 1.8 million heart cells.                    | link     |
| lung                      | Pretrained on 2.1 million lung cells.                     | link     |
| kidney                    | Pretrained on 814 thousand kidney cells.                  | link     |
| pan-cancer                | Pretrained on 5.7 million cells of various cancer types.  | link     |
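As mentioned above, each checkpoint folder pairs the weights with a vocabulary file mapping gene names to ids. Here is a minimal sketch of reading one, assuming a flat JSON layout and the file/folder names used below (both are assumptions for illustration):

```python
# Hedged sketch: load the per-checkpoint gene vocabulary, assuming it is
# a flat JSON mapping of gene name -> token id. File and folder names
# below are assumptions for illustration.
import json

with open("scGPT_human/vocab.json") as f:
    gene_to_id = json.load(f)

print(len(gene_to_id))          # vocabulary size
print(gene_to_id.get("GAPDH"))  # token id for a gene symbol, if present
```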

Fine-tune scGPT for scRNA-seq integration

Please see our example code in examples/finetune_integration.py. By default, the script assumes the scGPT checkpoint folder is stored in the examples/save directory.

To-do-list

  • [x] Upload the pretrained model checkpoint
  • [x] Publish to pypi
  • [ ] Provide the pretraining code with generative attention masking
  • [ ] Finetuning examples for multi-omics integration, cell type annotation, perturbation prediction, cell generation
  • [x] Example code for Gene Regulatory Network analysis
  • [x] Documentation website with readthedocs
  • [x] Bump up to pytorch 2.0
  • [x] New pretraining on larger datasets
  • [x] Reference mapping example
  • [ ] Publish to huggingface model hub

Contributing

We greatly welcome contributions to scGPT. Please submit a pull request if you have any ideas or bug fixes, and feel free to open an issue for any problems you encounter while using scGPT.

Acknowledgements

We sincerely thank the authors of the following open-source projects:

Citing scGPT

```bibtex
@article{cui2023scGPT,
  title={scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI},
  author={Cui, Haotian and Wang, Chloe and Maan, Hassaan and Pang, Kuan and Luo, Fengning and Wang, Bo},
  journal={bioRxiv},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
```

Owner

  • Name: WangLab @ U of T
  • Login: bowang-lab
  • Kind: organization
  • Location: 190 Elizabeth St, Toronto, ON M5G 2C4 Canada

BoWang's Lab at University of Toronto

GitHub Events

Total
  • Create event: 4
  • Release event: 2
  • Issues event: 62
  • Watch event: 263
  • Delete event: 1
  • Issue comment event: 91
  • Push event: 9
  • Pull request event: 5
  • Fork event: 87
Last Year
  • Create event: 4
  • Release event: 2
  • Issues event: 62
  • Watch event: 263
  • Delete event: 1
  • Issue comment event: 91
  • Push event: 9
  • Pull request event: 5
  • Fork event: 87

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 160
  • Total Committers: 6
  • Avg Commits per committer: 26.667
  • Development Distribution Score (DDS): 0.088 (see the consistency check after the committer table below)
Past Year
  • Commits: 10
  • Committers: 1
  • Avg Commits per committer: 10.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name       | Email         | Commits |
| :--------- | :------------ | ------: |
| haotian    | s****i@g****m |     146 |
| ChloeXWang | c****q@v****l |       5 |
| Kuan-Pang  | k****g@m****a |       4 |
| Moritz     | m****r@c****t |       2 |
| ChloeXWang | c****q@v****l |       2 |
| ChloeXWang | c****q@v****l |       1 |
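The DDS values above are consistent with DDS = 1 - (top committer's commits / total commits). This formula is inferred from the numbers on this page rather than taken from an official definition; a quick check in Python:

```python
# Hedged check: the DDS values shown above match
# 1 - top_committer_commits / total_commits (an inferred formula).
def dds(total_commits: int, top_committer_commits: int) -> float:
    return 1 - top_committer_commits / total_commits

print(round(dds(160, 146), 3))  # 0.088 -> matches the all-time DDS
print(dds(10, 10))              # 0.0   -> matches the past-year DDS
```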

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 277
  • Total pull requests: 30
  • Average time to close issues: 14 days
  • Average time to close pull requests: 20 days
  • Total issue authors: 197
  • Total pull request authors: 16
  • Average comments per issue: 2.27
  • Average comments per pull request: 0.6
  • Merged pull requests: 14
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 67
  • Pull requests: 7
  • Average time to close issues: 18 days
  • Average time to close pull requests: about 10 hours
  • Issue authors: 55
  • Pull request authors: 5
  • Average comments per issue: 0.63
  • Average comments per pull request: 0.14
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • HelloWorldLTY (6)
  • yueming-ding (6)
  • yuzhenmao (4)
  • taffy-miao (4)
  • SuperChanS (4)
  • rpeys (3)
  • xoelmb (3)
  • giuliaelgarcia (3)
  • subercui (3)
  • ellieujin (3)
  • nc1m (3)
  • krejciadam (3)
  • Ragagnin (3)
  • Xyihang (2)
  • igor-sadalski (2)
Pull Request Authors
  • subercui (10)
  • KristinTsui (4)
  • rpeys (2)
  • davidliwei (2)
  • Ding3LI (2)
  • ManuelSokolov (2)
  • LarsDu (2)
  • Liripo (2)
  • ceferisbarov (2)
  • Kuan-Pang (2)
  • avysogorets (2)
  • Mang30 (1)
  • xueerchen1990 (1)
  • nealgravindra (1)
  • alexaibio (1)
Top Labels
Issue Labels
enhancement (6) installation (2) gpu-hardware (2) zero-shot (1)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi: 2,839 last month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 1
    (may contain duplicates)
  • Total versions: 32
  • Total maintainers: 1
proxy.golang.org: github.com/bowang-lab/scgpt
  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
proxy.golang.org: github.com/bowang-lab/scGPT
  • Versions: 9
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 6 months ago
pypi.org: scgpt

Large-scale generative pretrain of single cell using transformer.

  • Versions: 14
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 2,839 Last month
Rankings
Stargazers count: 2.7%
Forks count: 4.8%
Downloads: 7.2%
Average: 9.3%
Dependent packages count: 10.1%
Dependent repos count: 21.5%
Maintainers (1)
Last synced: 6 months ago