https://github.com/kuleshov-group/caduceus

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Last synced: 6 months ago · JSON representation

Repository

Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Basic Info

Host: GitHub
Owner: kuleshov-group
License: apache-2.0
Language: Python
Default Branch: main
Size: 884 KB

Statistics

Stars: 191
Watchers: 7
Forks: 27
Open Issues: 14
Releases: 1

Created about 2 years ago · Last pushed 10 months ago

Metadata Files

Readme License

Caduceus ☤: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

[Blog] | [arXiv] | [HuggingFace 🤗]

This repository contains code for reproducing the results in the paper "Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling," Schiff et al. (2024).

Using Caduceus with 🤗

We have uploaded a pre-trained Caduceus model to the Huggingface hub. The available models are: - Caduceus-Ph: kuleshov-group/caduceus-phseqlen-131kdmodel-256n_layer-16 - Trained on sequences of length 131k, with a model size of 256 and 16 layers. - Trained for 50k steps and batch size of 8. - Trained with reverse-complement (RC) data augmentation. - Caduceus-PS: kuleshov-group/caduceus-psseqlen-131kdmodel-256n_layer-16 - Trained on sequences of length 131k, with a model size of 256 and 16 layers. - Trained for 50k steps and batch size of 8. - Model is RC equivariant, hence no RC data augmentation is required.

You can either use the pre-trained model directly within your trainer scripts or modify the config that initializes the model.

To use the pre-trained model for masked language modeling, use the following snippet: ```python from transformers import AutoModelForMaskedLM, AutoTokenizer

See the `Caduceus` collection page on the hub for list of available models.

modelname = "kuleshov-group/caduceus-phseqlen-131kdmodel-256nlayer-16" tokenizer = AutoTokenizer.frompretrained(modelname) model = AutoModelForMaskedLM.frompretrained(modelname) ```

Alternatively, you can instantiate a model from scratch to train on your own data as follows: ```python from transformers import AutoConfig, AutoModelForMaskedLM

Add any config overrides here, see the `config.json` file on the hub for details.

config_overrides = {}

See the `Caduceus` collection page on the hub for list of available models.

config = AutoConfig.frompretrained( "kuleshov-group/caduceus-phseqlen-131kdmodel-256nlayer-16", **configoverrides, ) model = AutoModelForMaskedLM.fromconfig(config) ```

Getting started in this repository

To get started, create a conda environment containing the required dependencies.

bash conda env create -f caduceus_env.yml

Activate the environment.

bash conda activate caduceus_env

Create the following directories to store saved models and slurm logs: bash mkdir outputs mkdir watch_folder

Reproducing Experiments

Below, we describe the steps required for reproducing the experiments in the paper. Throughout, the main entry point for running experiments is the train.py script. We also provide sample slurm scripts for launching pre-training and downstream fine-tuning experiments in the slurm_scripts/ directory.

Pretraining on Human Reference Genome

(Data downloading instructions are copied from HyenaDNA repo)

First, download the Human Reference Genome data. It's comprised of 2 files, 1 with all the sequences (the .fasta file), and with the intervals we use (.bed file).

The file structure should look like

data |-- hg38/ |-- hg38.ml.fa |-- human-sequences.bed

Download fasta (.fa format) file (of the entire human genome) into ./data/hg38. ~24 chromosomes in the whole genome (merged into 1 file), each chromosome is a continuous sequence, basically. Then download the .bed file with sequence intervals (contains chromosome name, start, end, split, which then allow you to retrieve from the fasta file). bash mkdir -p data/hg38/ curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed

Launch pretraining run using the command line

bash python -m train \ experiment=hg38/hg38 \ callbacks.model_checkpoint_every_n_steps.every_n_train_steps=500 \ dataset.max_length=1024 \ dataset.batch_size=1024 \ dataset.mlm=true \ dataset.mlm_probability=0.15 \ dataset.rc_aug=false \ model=caduceus \ model.config.d_model=128 \ model.config.n_layer=4 \ model.config.bidirectional=true \ model.config.bidirectional_strategy=add \ model.config.bidirectional_weight_tie=true \ model.config.rcps=true \ optimizer.lr="8e-3" \ train.global_batch_size=1024 \ trainer.max_steps=10000 \ +trainer.val_check_interval=10000 \ wandb=null

or alternatively, if using a cluster that has slurm installed, adapt the scripts below: slurm_scripts |-- run_pretrain_caduceus.sh |-- run_pretrain_hyena.sh |-- run_pretrain_mamba.sh

and run the training as a batch job: bash cd slurm_scripts sbatch run_pretrain_caduceus.sh

GenomicBenchmarks

The GenomicBenchmarks presented in Grešová et al. (2023) is comprised of 8 classification tasks.

We can launch a downstream fine-tuning run on one of the tasks using the sample command below: bash python -m train \ experiment=hg38/genomic_benchmark \ callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \ dataset.dataset_name="dummy_mouse_enhancers_ensembl" \ dataset.train_val_split_seed=1 \ dataset.batch_size=256 \ dataset.rc_aug=false \ +dataset.conjoin_train=false \ +dataset.conjoin_test=false \ loader.num_workers=2 \ model=caduceus \ model._name_=dna_embedding_caduceus \ +model.config_path="<path to model_config.json>" \ +model.conjoin_test=false \ +decoder.conjoin_train=true \ +decoder.conjoin_test=false \ optimizer.lr="1e-3" \ trainer.max_epochs=10 \ train.pretrained_model_path="<path to .ckpt file>" \ wandb=null

This sample run will fine-tune a pre-trained Caduceus-PS model on the dummy_mouse_enhancers_ensembl task. Note some of the additional arguments present here, relative to the pre-training command from above: - model.config_path contains the path model config that was saved during pre-training. This will be saved to the run directory of the pre-training experiment. - train.pretrained_model_path contains the path to the pre-trained model checkpoint. - dataset.conjoin_train determines whether the dataset will return a single sequence (dataset.conjoin_train=false) or the concatenation of a sequence and its reverse complement along dim=-1, during downstream fine-tuning training. - dataset.conjoin_test is the same as above, but for inference (e.g., validation / test). - decoder.conjoin_train determines whether the prediction head (a mean pooling and linear projection in the case of the Genomics Benchmark) is expecting an input tensor of shape (batch_size, seq_len, d_model) or (batch_size, seq_len, d_model, 2) during downstream fine-tuning training. When set to true the decoder is run on input[..., 0] and input[..., 1] and the results are averaged to produce the final prediction. - decoder.conjoin_test is the same as above, but for inference (e.g., validation / test).

Note this benchmark only contains a training and test split for each task. Therefore, to have a more principled evaluation, we randomly split the training data into training and validation sets (90/10) using the dataset.train_val_split_seed argument. We perform early stopping on validation metric (accuracy) and repeat this for 5 random seeds.

As with pre-training, we can also launch the fine-tuning run as a batch job using the provided run_genomic_benchmark.sh script. We also provide a helper shell script wrapper_run_genomics.sh that can be used to launch multiple fine-tuning runs in parallel.

Finally, the run_genomics_benchmark_cnn.sh script can be used to train the CNN baseline for this experiment from scratch on the downstream tasks.

Nucleotide Transformer datasets

The Nucleotide Transformer suite of tasks was proposed in Dalla-Torre et al. (2023). The data is available on HuggingFace: InstaDeepAI/nucleotidetransformerdownstream_tasks.

We can launch a downstream fine-tuning run on one of the tasks using the sample command below: bash python -m train \ experiment=hg38/nucleotide_transformer \ callbacks.model_checkpoint_every_n_steps.every_n_train_steps=5000 \ dataset.dataset_name="${task}" \ dataset.train_val_split_seed=${seed} \ dataset.batch_size=${batch_size} \ dataset.rc_aug="${rc_aug}" \ +dataset.conjoin_test="${CONJOIN_TEST}" \ loader.num_workers=2 \ model._name_=dna_embedding_caduceus \ +model.config_path="<path to model_config.json>" \ +model.conjoin_test=false \ +decoder.conjoin_train=true \ +decoder.conjoin_test=false \ optimizer.lr="1e-3" \ trainer.max_epochs=10 \ train.pretrained_model_path="<path to .ckpt file>" \ trainer.max_epochs=20 \ wandb=null

We can also launch as batch jobs (see run_nucleotide_transformer.sh and wrapper_run_nucleotide_transformer.sh for details).

eQTL SNP Variant Effect Prediction

This task comes from the recently proposed Long Range Benchmark (LRB) in Kao et al., 2023. The data is available on HuggingFace: InstaDeepAI/genomics-long-range-benchmark. For this task we fit a model to the pre-trained and frozen embeddings of the DNA language models. Therefore, to perform the evaluation, we proceed in 2 steps: - Step 1: Extract the embeddings from the pre-trained model: Run the vep_embeddings.py script to extract the embeddings from the pre-trained model. See the example below: bash torchrun \ --standalone \ --nnodes=1 \ --nproc-per-node=8 \ vep_embeddings.py \ --num_workers=2 \ --seq_len=131072 \ --bp_per_token=1 \ --embed_dump_batch_size=1 \ --name="caduceus-ps_downstream-seqlen=131k" \ --model_name_or_path="kuleshov-group/caduceus-ps_seqlen-131k_d_model-256_n_layer-16" \ --rcps

The --rcps flag is used to indicate that the model is reverse-complement equivariant. When using other models, set this flag to false with --no-rcps. To speed this step up, this script utilizes torch distributed data parallelism.

Please refer to the slurm script provided in slurm_scripts/dump_vep_embeddings.sh to launch this step as a batch job.

Step 2: Fit an SVM model to the embeddings using this notebook: vep_svm.ipynb.

Citation

If you find our work useful, please cite our paper using the following: @article{schiff2024caduceus, title={Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling}, author={Schiff, Yair and Kao, Chia-Hsiang and Gokaslan, Aaron and Dao, Tri and Gu, Albert and Kuleshov, Volodymyr}, journal={arXiv preprint arXiv:2403.03234}, year={2024} }

Acknowledgements

This repository is adapted from the HyenaDNA repo and leverages much of the training, data loading, and logging infrastructure defined there. HyenaDNA was originally derived from the S4 and Safari repositories.

We would like to thank Evan Trop and the InstaDeep team for useful discussions about the Nucleotide Transformer leaderboard and the Long Range Benchmark task.

Finally, we would like to thank MosaicML for providing compute resources for some of the pre-training experiments.

Owner

Name: Kuleshov Group @ Cornell Tech
Login: kuleshov-group
Kind: organization

Website: https://www.cs.cornell.edu/~kuleshov/
Twitter: volokuleshov
Repositories: 7
Profile: https://github.com/kuleshov-group

Research group at Cornell focused on machine learning, generative models, AI for science

GitHub Events

Total

Issues event: 29
Watch event: 47
Member event: 1
Issue comment event: 63
Push event: 3
Pull request review event: 1
Pull request event: 6
Fork event: 9
Create event: 2

Last Year

Issues event: 27
Watch event: 47
Member event: 1
Issue comment event: 57
Push event: 3
Pull request review event: 1
Pull request event: 6
Fork event: 9
Create event: 2

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 10
Total pull requests: 3
Average time to close issues: 23 days
Average time to close pull requests: N/A
Total issue authors: 6
Total pull request authors: 2
Average comments per issue: 1.7
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 10
Pull requests: 3
Average time to close issues: 23 days
Average time to close pull requests: N/A
Issue authors: 6
Pull request authors: 2
Average comments per issue: 1.7
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

View more stats

Top Authors

Issue Authors

leannmlindsey (9)
zhan8855 (5)
yangzhao1230 (5)
eric-czech (4)
yongrenr (3)
GengGengJiuXi (2)
multydoffer (2)
HUST-Keep-Lin (2)
YHX021014 (2)
cbirchsy (2)
qiaoqiaoLF (2)
davidaknowles (1)
zyj1729 (1)
JiaLonghao1997 (1)
loieeyet (1)

Pull Request Authors

eric-czech (2)
eltociear (1)
s1787956 (1)
leannmlindsey (1)
AymenKallala (1)
jorenretel (1)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Open Source Science

https://github.com/kuleshov-group/caduceus

Science Score: 57.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Caduceus ☤: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Using Caduceus with 🤗

See the `Caduceus` collection page on the hub for list of available models.

Add any config overrides here, see the `config.json` file on the hub for details.

See the `Caduceus` collection page on the hub for list of available models.

Getting started in this repository

Reproducing Experiments

Pretraining on Human Reference Genome

GenomicBenchmarks

Nucleotide Transformer datasets

eQTL SNP Variant Effect Prediction

Citation

Acknowledgements

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

https://github.com/kuleshov-group/caduceus

Science Score: 57.0%

Repository

Basic Info

Statistics

Metadata Files

README.md

Caduceus ☤: Bi-Directional Equivariant Long-Range DNA Sequence Modeling

Using Caduceus with 🤗

See the Caduceus collection page on the hub for list of available models.

Add any config overrides here, see the config.json file on the hub for details.

See the Caduceus collection page on the hub for list of available models.

Getting started in this repository

Reproducing Experiments

Pretraining on Human Reference Genome

GenomicBenchmarks

Nucleotide Transformer datasets

eQTL SNP Variant Effect Prediction

Citation

Acknowledgements

Owner

GitHub Events

Total

Last Year

Issues and Pull Requests

All Time

Past Year

Top Authors

Issue Authors

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

See the `Caduceus` collection page on the hub for list of available models.

Add any config overrides here, see the `config.json` file on the hub for details.

See the `Caduceus` collection page on the hub for list of available models.