https://github.com/bowang-lab/orthrus

Orthrus is a mature-RNA model for RNA property prediction. It uses a Mamba encoder backbone, a variant of state-space models specifically designed for long-sequence data such as RNA.

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: biorxiv.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.8%) to scientific vocabulary
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bowang-lab
  • License: mit
  • Language: Python
  • Default Branch: main
  • Size: 12.9 MB
Statistics
  • Stars: 77
  • Watchers: 4
  • Forks: 15
  • Open Issues: 1
  • Releases: 0
Created over 1 year ago · Last pushed 8 months ago
Metadata Files
  • Readme
  • License

README.md

# Orthrus: Towards Evolutionary and Functional RNA Foundation Models [![Jekyll](https://img.shields.io/badge/Science_Explainer-C00?logo=jekyll&logoColor=fff)](https://philechka.com/science/orthrus) [![bioRxiv](https://img.shields.io/badge/bioRxiv-2024.10.10.617658-b31b1b.svg)](https://www.biorxiv.org/content/10.1101/2024.10.10.617658v1) [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-FFD21E?logo=huggingface&logoColor=000)](https://huggingface.co/antichronology/orthrus/blob/main/README.md) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.13910050.svg)](https://doi.org/10.5281/zenodo.13910050) [![license](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)](https://github.com/bowang-lab/Orthrus/blob/main/LICENSE.md) [![python](https://img.shields.io/badge/-Python_3.10-blue?logo=python&logoColor=white)](https://www.python.org/downloads/release/python-3100/) [![pytorch](https://img.shields.io/badge/PyTorch_2.2+-ee4c2c?logo=pytorch&logoColor=white)](https://pytorch.org/get-started/locally/) [![lightning](https://img.shields.io/badge/-Lightning_2.4+-792ee5?logo=pytorchlightning&logoColor=white)](https://pytorchlightning.ai/)

## Model Overview
Orthrus is a mature-RNA model for RNA property prediction. It uses a Mamba encoder backbone, a variant of state-space models specifically designed for long-sequence data such as RNA.

Two versions of Orthrus are available:

- **4-track base version:** Encodes the mRNA sequence with a simplified one-hot approach.
- **6-track large version:** Adds biological context by including splice site indicators and coding sequence markers, which is crucial for accurate prediction of mRNA properties such as RNA half-life, ribosome load, and exon junction detection.

**Why the Mamba Backbone?**

The Mamba architecture is an extension of the S4 (structured state-space) model family, which excels at handling long sequences like mRNAs that can reach over 12,000 nucleotides. This makes it an ideal fit for RNA property prediction models for several reasons:

- _Efficient Memory Usage:_ Unlike transformers, whose memory requirements scale quadratically with sequence length, the Mamba backbone scales linearly, making it computationally efficient for long sequences.
- _Variable Context Filtering:_ RNA sequences often contain functionally relevant motifs separated by variable spacing. The Mamba model can selectively focus on these important elements.
- _Selective Context Compression:_ Genomic sequences often have uneven information density, with critical regulatory elements scattered across regions of varying importance. The Mamba model selectively compresses less informative regions while preserving the context of key functional areas.
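For intuition, here is a sketch of the two input formats. The channel ordering and the use of binary splice/CDS indicator tracks are assumptions for illustration; the canonical encodings are produced by the repository's own helpers.

```python
import numpy as np

# 4-track input: one channel per nucleotide, shape (4, length)
seq = "AUGGCC".replace("U", "T")  # encode over DNA letters (assumed convention)
alphabet = "ACGT"
four_track = np.zeros((4, len(seq)), dtype=np.float32)
for i, base in enumerate(seq):
    four_track[alphabet.index(base), i] = 1.0

# 6-track input: the same one-hot plus two binary annotation tracks
# (splice-site and coding-sequence indicators; toy values here)
splice_track = np.zeros((1, len(seq)), dtype=np.float32)
cds_track = np.ones((1, len(seq)), dtype=np.float32)
six_track = np.concatenate([four_track, splice_track, cds_track], axis=0)

print(four_track.shape)  # (4, 6)
print(six_track.shape)   # (6, 6)
```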

## Using Orthrus

Orthrus was trained on full mature RNA sequences, which makes its usage different from models like DNABERT or Enformer that operate on arbitrary DNA segments. If you pass an incomplete piece of a spliced RNA, the input sample will be out of distribution.

To generate embeddings using Orthrus for spliced mature RNA sequences, follow the steps below:

### Create and Set Up the Environment

We recommend using Mamba (a faster alternative to Conda) for environment creation, but Conda works as well. Follow these steps to set up your environment:

  1. Install Mamba or Conda if you haven't already.

  2. Create and activate the environment:

     ```bash
     # Create the environment from the provided YAML file
     mamba env create -f env.yml

     # Activate the environment
     conda activate orthrus
     ```

  3. Install additional dependencies:

     ```bash
     # Install necessary dependencies for the Mamba model (not the environment manager)
     pip install causal_conv1d==1.2.0.post2
     pip install mamba-ssm==1.2.0.post1 --no-cache-dir
     ```

  4. Set up GenomeKit:

     ```bash
     # Download the GenomeKit setup script
     wget -O starter_build.sh https://raw.githubusercontent.com/deepgenomics/GenomeKit/main/starter/build.sh

     # Make the script executable
     chmod +x starter_build.sh

     # Run the script to download genomes and annotations
     ./starter_build.sh
     ```

  5. Install the orthrus package so it can be imported:

     ```bash
     pip install -e .
     ```

Now you're ready to use Orthrus for generating embeddings!

### Generating Embeddings

#### 4-Track Model

The 4-track model requires only a one-hot encoded sequence of your mRNA. This representation captures the basic nucleotide information of the sequence. We recommend using the larger and more capable 6-track model when possible.

Here is example code:

```python
import torch

# Sequence for short mRNA
seq = (
    'TCATCTGGATTATACATATTTCGCAATGAAAGAGAGGAAGAAAAGGAAGCAGCAAAATATGTGGAGGCCCA'
    'ACAAAAGAGACTAGAAGCCTTATTCACTAAAATTCAGGAGGAATTTGAAGAACATGAAGTTACTTCCTCC'
    'ACTGAAGTCTTGAACCCCCCAAAGTCATCCATGAGGGTTGGAATCAACTTCTGAAAACACAACAAAACCA'
    'TATTTACCATCACGTGCACTAACAAGACAGCAAGTTCGTGCTTTGCAAGATGGTGCAGAGCTTTATGAAG'
    'CAGTGAAGAATGCAGCAGACCCAGCTTACCTTGAGGGTTATTTCAGTGAAGAGCAGTTAAGAGCCTTGAA'
    'TAATCACAGGCAAATGTTGAATGATAAGAAACAAGCTCAGATCCAGTTGGAAATTAGGAAGGCCATGGAA'
    'TCTGCTGAACAAAAGGAACAAGGTTTATCAAGGGATGTCACAACCGTGTGGAAGTTGCGTATTGTAAGCTATTC'
)

# One-hot encode the sequence
one_hot = seq_to_oh(seq)
one_hot = one_hot.T
torch_one_hot = torch.tensor(one_hot, dtype=torch.float32)
torch_one_hot = torch_one_hot.unsqueeze(0)
print(torch_one_hot.shape)
torch_one_hot = torch_one_hot.to(device='cuda')
lengths = torch.tensor([torch_one_hot.shape[2]]).to(device='cuda')

# Load Orthrus
run_name = "orthrus_base_4_track"
checkpoint = "epoch=18-step=20000.ckpt"
model_repository = "./models"
model = load_model(f"{model_repository}{run_name}", checkpoint_name=checkpoint)
model = model.to(torch.device('cuda'))
print(model)

# Generate embedding
reps = model.representation(torch_one_hot, lengths)
print(reps.shape)
# torch.Size([1, 256])
```
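The 4-track example relies on a one-hot helper (`seq_to_oh`) that ships with the repository. As a rough stand-in, assuming an A/C/G/T column order (check the Orthrus source for the canonical ordering), a minimal equivalent might look like:

```python
import numpy as np

def seq_to_oh(seq: str) -> np.ndarray:
    """One-hot encode an RNA/DNA sequence into shape (len(seq), 4).

    Hypothetical stand-in for the repo's helper; the A, C, G, T column
    order is an assumption, and U is mapped to the T column.
    """
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3, "U": 3}
    oh = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:  # leave ambiguous bases (e.g. N) as all zeros
            oh[i, mapping[base]] = 1.0
    return oh

print(seq_to_oh("ACGT"))
```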

#### 6-Track Model (Recommended)

The 6-track model offers a more detailed representation by incorporating additional biological context, including splice site and coding sequence information. To generate embeddings for this model:

We're going to be using an awesome library called GenomeKit to extract DNA sequences and build 4/6-track representations of mRNA transcripts, which will be used as input for Orthrus. GenomeKit makes it easy to work with genomic data, such as sequences and annotations, by providing tools to access and manipulate reference genomes and variants efficiently. It's built by the awesome folks at Deep Genomics.

For more details, you can refer to the GenomeKit documentation.

To install it:

```bash
mamba install "genomekit>=6.0.0"

# We now want to download the genome annotations and the 2bit genome files
wget -O starter_build.sh https://raw.githubusercontent.com/deepgenomics/GenomeKit/main/starter/build.sh
chmod +x starter_build.sh
./starter_build.sh
```

We can now generate six-track encodings for any transcript!

```python
# Import Genome and Interval, instantiate a Genome
from genome_kit import Genome, Interval

genome = Genome("gencode.v29")
interval = Interval("chr7", "+", 117120016, 117120201, genome)
genome.dna(interval)
# CTCTTATGCTCGGGTGATCC

# Load Orthrus 6-track
run_name = "orthrus_large_6_track"
checkpoint = "epoch=22-step=20000.ckpt"
model_repository = "./models"
model = load_model(f"{model_repository}{run_name}", checkpoint_name=checkpoint)
model = model.to(torch.device('cuda'))
print(model)

# Generate embedding
transcripts = find_transcript_by_gene_name(genome, 'BCL2L1')
print(transcripts)
t = transcripts[0]
six_t = create_six_track_encoding(t)
six_t = torch.tensor(six_t, dtype=torch.float32)
six_t = six_t.unsqueeze(0)
six_t = six_t.to(device='cuda')
lengths = torch.tensor([six_t.shape[2]]).to(device='cuda')
embedding = model.representation(six_t, lengths)
print(embedding.shape)
# torch.Size([1, 512])
```

Alternatively, this information can be extracted from genePred files available for download from the UCSC Genome Browser.
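Both embedding examples pass a `lengths` tensor alongside the encoding because Orthrus handles variable-length transcripts. A hedged sketch of how two transcripts might be batched together, assuming right zero-padding (check the repo's data pipeline for the actual convention):

```python
import torch

# Two toy 6-track encodings of different lengths, shape (6, L)
a = torch.randn(6, 100)
b = torch.randn(6, 70)

# Zero-pad on the right to a common length (assumed convention)
max_len = max(a.shape[1], b.shape[1])
batch = torch.zeros(2, 6, max_len)
batch[0, :, : a.shape[1]] = a
batch[1, :, : b.shape[1]] = b

# lengths records the true, unpadded length of each sequence
lengths = torch.tensor([a.shape[1], b.shape[1]])
print(batch.shape)  # torch.Size([2, 6, 100])
print(lengths)
```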

## Fine-Tuning Orthrus

All the data for fine-tuning, linear probing, and homology splitting is available at this Zenodo link: https://zenodo.org/records/13910050

To fine-tune Orthrus, you can use the pre-specified configurations located in ./orthrus/rna_task_config for data, model, optimizer, projector, and training parameters. Here is an example command that will fine-tune Orthrus on an RNA half-life dataset:

```bash
cd orthrus

python rna_task_train.py \
    --model_config mamba_pre_base \
    --train_config bs_64_short_run \
    --projector_config default_256 \
    --data_config rna_hl \
    --optimizer_config no_wd_1e-3 \
    --seed_override 0
```

Before running the command please remember to update your data storage directory and your model weights directory in configs.

If you're interested in running data ablation experiments, simply use one of the configured data configurations in ./orthrus/rna_task_config or create a new one. Here is an example of fine-tuning GO classification with 10 percent of the data:

```bash
python rna_task_train.py \
    --model_config mamba_pre_base \
    --train_config bs_64_1000_steps \
    --projector_config default_256_go \
    --data_config go_mf_dataset_10pct \
    --optimizer_config no_wd_1e-3_100_warmup \
    --seed_override 0
```

## Linear Probing

Similarly, for linear probing:

```bash
python linear_probe_eval.py \
    --run_name orthrus_large_6_track \
    --model_name="epoch=22-step=20000.ckpt" \
    --model_repository="/scratch/hdd001/home/phil/msk_backup/runs/" \
    --npz_dir="/fs01/home/phil/Documents/01_projects/rna_rep/linear_probe_data2" \
    --verbose 1 \
    --n_seeds 1 \
    --n_tracks 6 \
    --load_state_dict=true \
    --full_eval \
    --homology_split=true
```
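Conceptually, a linear probe fits only a linear head on top of frozen Orthrus embeddings. A toy sketch with synthetic data and closed-form ridge regression; the 512-dimensional embeddings and the scalar target (standing in for a property like RNA half-life) are placeholders, not the repository's pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are frozen Orthrus embeddings (n_samples x 512)
# and a scalar property such as RNA half-life
X = rng.normal(size=(200, 512))
w_true = rng.normal(size=512)
y = X @ w_true + 0.1 * rng.normal(size=200)

# Ridge regression in closed form: w = (X^T X + lam * I)^-1 X^T y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(512), X.T @ y)

# Correlation between probe predictions and the target on the training set
preds = X @ w
r = np.corrcoef(preds, y)[0, 1]
print(round(r, 3))
```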

```bibtex
@article{orthrus_fradkin_shi_2024,
  title = {Orthrus: Towards Evolutionary and Functional RNA Foundation Models},
  url = {http://dx.doi.org/10.1101/2024.10.10.617658},
  DOI = {10.1101/2024.10.10.617658},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Fradkin, Philip and Shi, Ruian and Isaev, Keren and Frey, Brendan J and Morris, Quaid and Lee, Leo J and Wang, Bo},
  year = {2024},
  month = oct
}
```

Owner

  • Name: WangLab @ U of T
  • Login: bowang-lab
  • Kind: organization
  • Location: 190 Elizabeth St, Toronto, ON M5G 2C4 Canada

BoWang's Lab at University of Toronto

GitHub Events

Total
  • Issues event: 18
  • Watch event: 58
  • Delete event: 1
  • Issue comment event: 21
  • Push event: 7
  • Pull request review event: 1
  • Pull request event: 5
  • Fork event: 9
  • Create event: 2
Last Year
  • Issues event: 18
  • Watch event: 58
  • Delete event: 1
  • Issue comment event: 21
  • Push event: 7
  • Pull request review event: 1
  • Pull request event: 5
  • Fork event: 9
  • Create event: 2

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 9
  • Total Committers: 2
  • Avg Commits per committer: 4.5
  • Development Distribution Score (DDS): 0.444
Past Year
  • Commits: 9
  • Committers: 2
  • Avg Commits per committer: 4.5
  • Development Distribution Score (DDS): 0.444
Top Committers
Name Email Commits
phil p****l@v****l 5
phil-fradkin p****n@g****m 4

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 9
  • Total pull requests: 4
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 5 hours
  • Total issue authors: 9
  • Total pull request authors: 3
  • Average comments per issue: 1.22
  • Average comments per pull request: 0.25
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 9
  • Pull requests: 4
  • Average time to close issues: about 2 months
  • Average time to close pull requests: about 5 hours
  • Issue authors: 9
  • Pull request authors: 3
  • Average comments per issue: 1.22
  • Average comments per pull request: 0.25
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Luoyunfan-Jason (2)
  • KemoTherapy (1)
  • jpfry327 (1)
  • audreyeternal (1)
  • dancaron (1)
  • r-sayar (1)
  • HelloWorldLTY (1)
  • LittletreeZou (1)
  • yaqisu (1)
Pull Request Authors
  • HelloWorldLTY (2)
  • phil-fradkin (2)
  • IanShi1996 (1)
Top Labels
Issue Labels
Pull Request Labels