https://github.com/lzh0525/scbert
Repository
Basic Info
- Host: GitHub
- Owner: lzh0525
- Default Branch: master
- Size: 50.8 KB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Fork of TencentAILabHealthcare/scBERT
Created over 3 years ago · Last pushed over 3 years ago
# scBERT
### scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data
Reliable cell type annotation is a prerequisite for downstream analysis of single-cell RNA sequencing data. Existing annotation algorithms typically suffer from improper handling of batch effects, a lack of curated marker gene lists, or difficulty in leveraging latent gene-gene interaction information. Inspired by large-scale pretrained language models, we present scBERT (single-cell Bidirectional Encoder Representations from Transformers), a pretrained deep neural network-based model that overcomes the above challenges. scBERT follows the state-of-the-art pre-train and fine-tune paradigm from the deep learning field. In the first phase, scBERT obtains a general understanding of gene-gene interactions by being pre-trained on large amounts of unlabeled scRNA-seq data. The pre-trained scBERT can then be applied to the cell annotation task on unseen, user-specific scRNA-seq data through supervised fine-tuning. For more information, please refer to [https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1](https://www.biorxiv.org/content/10.1101/2021.12.05.471261v1)
# Install
scBERT depends on the following packages:
- [scipy](https://github.com/scipy/scipy)
- [pytorch](https://github.com/pytorch/pytorch)
- [numpy](https://github.com/numpy/numpy)
- [pandas](https://github.com/pandas-dev/pandas)
- [scanpy](https://github.com/theislab/scanpy)
- [scikit-learn](https://github.com/scikit-learn/scikit-learn)
- [transformers](https://github.com/huggingface/transformers)
# Data
The data can be downloaded from these links. If you have any questions, please contact fionafyang@tencent.com.
https://drive.weixin.qq.com/s?k=AJEAIQdfAAozQt5B8k
https://drive.google.com/file/d/1fNZbKx6LPeoS0hbVYJFI8jlDlNctZxlU/view?usp=sharing
# Checkpoint
The pre-trained model checkpoint can be downloaded from this link. If you have any questions, please contact fionafyang@tencent.com.
https://drive.weixin.qq.com/s?k=AJEAIQdfAAoUxhXE7r
# Usage
The test single-cell transcriptomics data file should be pre-processed by first revising gene symbols according to the [NCBI Gene database](https://www.ncbi.nlm.nih.gov/gene) (updated Jan. 10, 2020); unmatched and duplicated genes are removed. The data should then be normalized with the `sc.pp.normalize_total` and `sc.pp.log1p` methods from the `scanpy` Python package, as detailed in `preprocess.py`.
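The two `scanpy` calls above perform library-size normalization followed by a log1p transform. As a rough illustration of what they compute, here is a minimal NumPy sketch (the function name and the fixed `target_sum=1e4` are assumptions for illustration; `sc.pp.normalize_total` defaults to scaling each cell to the median of per-cell counts when no target is given):

```python
import numpy as np

def normalize_total_log1p(counts, target_sum=1e4):
    """Mimic sc.pp.normalize_total followed by sc.pp.log1p.

    counts: (n_cells, n_genes) matrix of raw counts.
    Each cell is scaled so its counts sum to target_sum,
    then the matrix is log1p-transformed.
    """
    counts = np.asarray(counts, dtype=float)
    per_cell = counts.sum(axis=1, keepdims=True)
    per_cell[per_cell == 0] = 1.0  # avoid division by zero for empty cells
    scaled = counts / per_cell * target_sum
    return np.log1p(scaled)

# toy example: 2 cells x 3 genes
X = np.array([[1, 2, 7], [0, 5, 5]])
Y = normalize_total_log1p(X)
```

In practice you should use `preprocess.py` and the real `scanpy` functions, which also handle sparse matrices and in-place annotation of the `AnnData` object.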
You can download this repo and run the demo task on your computing machine within about 4 hours.
- Fine-tune using pre-trained models
```
python -m torch.distributed.launch finetune.py --data_path "fine-tune_data_path" --model_path "pretrained_model_path"
```
- Predict using fine-tuned models
```
python predict.py --data_path "test_data_path" --model_path "finetuned_model_path"
```
- Expected output
The expected output of model inference is the cell type of each individual cell.
- Guidance for hyperparameter selection
You can select the hyperparameters of the Performer encoder based on your data and task in:
```
model = PerformerLM(
    num_tokens = 7,
    dim = 200,
    depth = 6,
    heads = 10
)
```
Hyperparameter|Description | Default | Arbitrary range
--------------|---------------------------------------| ------- | ----------------
num_tokens |Number of bins in expression embedding | 7 | [5, 7, 9]
dim |Size of scBERT embedding vector | 200 | [100, 200]
heads |Number of attention heads of Performer | 10 | [8, 10, 20]
depth |Number of Performer encoder layers | 6 | [4, 6, 8]
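The `num_tokens` hyperparameter sets how many discrete bins the normalized expression values are mapped into before embedding. A minimal sketch of one plausible binning scheme follows (the function name and the exact scheme are assumptions for illustration, not scBERT's actual implementation):

```python
import numpy as np

def bin_expression(x, num_tokens=7):
    """Discretize log-normalized expression into num_tokens integer bins.

    Bin 0 is reserved for zero expression; nonzero values are mapped
    linearly onto bins 1 .. num_tokens - 1. Illustrative only.
    """
    x = np.asarray(x, dtype=float)
    tokens = np.zeros(x.shape, dtype=int)
    nz = x > 0
    if nz.any():
        scaled = x[nz] / x.max() * (num_tokens - 1)
        tokens[nz] = np.minimum(scaled.astype(int) + 1, num_tokens - 1)
    return tokens
```

Changing `num_tokens` trades off resolution of the expression embedding against vocabulary size; larger models (`dim`, `depth`, `heads`) cost more memory and compute.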
# Time cost
Typical install time on a "normal" desktop computer is about 30 minutes.
Expected run time for inferring 10,000 cells on a "normal" desktop computer is about 25 minutes.
# Disclaimer
This tool is for research purposes only and is not approved for clinical use.
This is not an official Tencent product.
# Copyright
This tool was developed by Tencent AI Lab.
The copyright holder for this project is Tencent AI Lab.
All rights reserved.
# Citation
Yang, F., Wang, W., Wang, F. et al. scBERT as a large-scale pretrained deep language model for cell type annotation of single-cell RNA-seq data. Nat Mach Intell (2022). https://doi.org/10.1038/s42256-022-00534-z