https://github.com/biomedsciai/biomed-multi-omic

Build foundation model on RNA or DNA data

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
✓ Academic publication links: Links to arxiv.org
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.5%) to scientific vocabulary

Keywords

foundation-models genomics transcriptomics transformers

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: BiomedSciAI
License: apache-2.0
Language: Jupyter Notebook
Default Branch: main
Homepage: https://arxiv.org/pdf/2506.14861
Size: 32.8 MB

Statistics

Stars: 24
Watchers: 6
Forks: 6
Open Issues: 3
Releases: 0

Topics

foundation-models genomics transcriptomics transformers

Created over 1 year ago · Last pushed 11 months ago

Metadata Files

Readme License

biomed-multi-omic

Biomedical foundation models for omics data. This package supports the development of foundation models for single-cell RNA (scRNA) or DNA data.

biomed-multi-omic enables development and testing of foundation models for DNA sequences and for RNA expression, with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface. biomed-multi-omic leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.

🧬 A single package for DNA and RNA foundation models: scRNA pretraining on h5ad files or TileDB (e.g., CellXGene), and DNA pretraining on the reference human genome (GRCh38/hg38) as well as on a variant-imputed genome based on common SNPs from the GWAS Catalog and ClinVar datasets.
🚀 Leverages the latest open-source tools: anndata, HuggingFace Transformers and PyTorch Lightning.
📈 Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory-region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.

Installation

We recommend using uv to create your environment because it is 10-100x faster than pip; uv can also be used for installation.

Install using cloned repo:

```sh
git clone git@github.com:BiomedSciAI/biomed-multi-omic.git
cd biomed-multi-omic
uv venv .venv -p3.12
source ./.venv/bin/activate
uv pip install -e .
```

NB - biomed-multi-omic depends on hic-straw, which requires curl. You may need to install curl or libcurl; for more information, please refer to curl's install instructions for your OS.
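As a quick pre-install check, the following minimal sketch reports whether a `curl` executable is on the PATH. It uses only the Python standard library; the helper name `has_executable` is hypothetical and not part of biomed-multi-omic. It detects only the command-line tool, not the libcurl development files that a build may also need.

```python
import shutil


def has_executable(name: str) -> bool:
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


# Hypothetical helper: report whether the curl command-line tool is available.
status = "found" if has_executable("curl") else "not found"
print(f"curl: {status}")
```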

Optional dependencies

In addition to the base package, there are optional dependencies that extend biomed-multi-omic's capabilities. These include:

bulk_rna: Extends modules for extracting and preprocessing bulk RNA-seq data
benchmarking: Installs additional packages used to benchmark biomed-multi-omic, including scib, scib-metrics, pyliger, scanorama and harmony-pytorch.
test: Unit test suite, recommended for development use

To install optional dependencies from this GitHub repository, run the following from the package root:

```sh
uv pip install ".[bulk_rna,benchmarking,test,notebook]"
```

`bmfm-rna` checkpoints

The model weights can be acquired from IBM's HuggingFace collection. The following scRNA models are available:

MLM+RDA: ibm-research/biomed.rna.bert.110m.mlm.rda.v1
MLM+Multitask: ibm-research/biomed.rna.bert.110m.mlm.multitask.v1
WCED+Multitask: ibm-research/biomed.rna.bert.110m.wced.multitask.v1
WCED 10 pct: ibm-research/biomed.rna.bert.110m.wced.v1

For details on how the models were trained, please refer to the BMFM-RNA preprint.

To get embeddings and predictions for scRNA data run:

```bash
export MY_DATA_FILE=... # path to h5ad file with raw counts and gene symbols
bmfm-targets-run -cn predict input_file=$MY_DATA_FILE working_dir=/tmp checkpoint=ibm-research/biomed.rna.bert.110m.wced.multitask.v1
```
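When the same prediction has to be launched from a script, the command line above can be assembled as an argument list. The sketch below uses only the executable, config name and overrides shown in the command; the helper name `build_predict_command` and the placeholder file name `my_data.h5ad` are hypothetical.

```python
import shlex


def build_predict_command(
    input_file: str,
    working_dir: str,
    checkpoint: str,
    config_name: str = "predict",
) -> list[str]:
    """Assemble the argument list for the bmfm-targets-run call shown above."""
    return [
        "bmfm-targets-run",
        "-cn", config_name,
        f"input_file={input_file}",
        f"working_dir={working_dir}",
        f"checkpoint={checkpoint}",
    ]


cmd = build_predict_command(
    input_file="my_data.h5ad",  # placeholder path, replace with your own file
    working_dir="/tmp",
    checkpoint="ibm-research/biomed.rna.bert.110m.wced.multitask.v1",
)
print(shlex.join(cmd))
# To actually run it (requires the package to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list instead of a single shell string avoids quoting problems when paths contain spaces.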

For a more detailed tutorial, see the RNA tutorials. Note that to use the notebooks you will need to install the notebook optional dependencies (see Installation):

To run inference programmatically, you can see a zero-shot example in this scRNA zero-shot notebook.
To inspect the resulting embeddings and cell-type predictions use this scRNA inspect embeddings notebook.

`bmfm-dna` checkpoints

The model weights can be acquired from IBM's HuggingFace collection. The following DNA models are available:

MLM+REF_GENOME: ibm-research/biomed.dna.ref.modernbert.113m
MLM+REFSNP_GENOME: ibm-research/biomed.dna.snp.modernbert.113m

DNA Inference

For details on how the models were trained, please refer to the BMFM-DNA preprint.

To get embeddings for DNA sequences run:

```bash
export INPUT_DIRECTORY=... # path to your DNA sequences files
bmfm-targets-run -cn dna_predict input_directory=$INPUT_DIRECTORY working_dir=/tmp checkpoint=ibm-research/biomed.dna.snp.modernbert.113m.v1
```

For a more detailed tutorial, see the DNA tutorials.

Package Architecture

RNA Modules

The bmfm-rna framework diagram shows the modules available for building a Transcriptomic Foundation Model (TFM). A novel contribution of our work is the Whole Cell Expression Decoder (WCED), a pretraining method aimed at improving transcriptomic foundation models. In WCED, the model's objective is to reconstruct the full cell expression profile from the [CLS] token representation that the transformer encoder generates for a partial input. By training models to autocomplete expression profiles, WCED improves the model's understanding of underlying biological processes, resulting in better generalization and more accurate predictions for downstream tasks.
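The shape of the WCED objective can be illustrated with a toy sketch in plain Python. This is not the package's implementation: the mean-pooling `encode_cls` merely stands in for the transformer encoder, the linear decoder and mean-squared-error loss are assumptions chosen for brevity, and all names and sizes are hypothetical. What it does show is that only a subset of genes is given as input, while the loss is computed against the whole expression profile.

```python
import random


def encode_cls(observed: dict[int, float], gene_emb: list[list[float]]) -> list[float]:
    """Placeholder encoder: expression-weighted mean of gene embeddings.

    The real model uses a transformer encoder; this only produces a
    fixed-size vector playing the role of the [CLS] representation.
    """
    dim = len(gene_emb[0])
    total = sum(observed.values()) or 1.0
    return [
        sum(value * gene_emb[gene][d] for gene, value in observed.items()) / total
        for d in range(dim)
    ]


def decode_whole_cell(cls_vec: list[float], decoder_w: list[list[float]]) -> list[float]:
    """Linear decoder: one predicted value per gene, from the [CLS] vector alone."""
    return [sum(w * c for w, c in zip(row, cls_vec)) for row in decoder_w]


def mse(pred: list[float], target: list[float]) -> float:
    """Mean squared error over all genes (assumed loss for this sketch)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
n_genes, dim = 8, 4
gene_emb = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_genes)]
decoder_w = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_genes)]

full_profile = [rng.random() for _ in range(n_genes)]  # toy expression values
observed = {g: full_profile[g] for g in (0, 2, 5)}     # partial input: 3 of 8 genes

cls_vec = encode_cls(observed, gene_emb)
prediction = decode_whole_cell(cls_vec, decoder_w)
loss = mse(prediction, full_profile)                   # target is the full profile
print(f"reconstruction loss over all {n_genes} genes: {loss:.4f}")
```

In training, the encoder and decoder weights would be updated to reduce this loss; the sketch evaluates it once with random weights.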

bmfm_omics_workflow

DNA Modules

The bmfm-dna framework addresses key limitations of existing DNA language models by incorporating natural genomic variations into the pre-training process, rather than relying solely on the reference genome. This allows the model to better capture critical biological properties, especially in regulatory regions where many disease-associated variants reside. As a result, bmfm-dna offers a more comprehensive and biologically meaningful representation, advancing the field beyond traditional DNALM strategies.
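To make the idea of a variant-aware input concrete, here is a minimal sketch of one simple encoding: substituting the alternate allele of each SNP into a reference sequence. The package supports multiple strategies for encoding variation and this sketch does not claim to reproduce any of them; the function name `apply_snps`, the 0-based coordinates and the toy sequence are assumptions made for illustration.

```python
def apply_snps(reference: str, snps: dict[int, tuple[str, str]]) -> str:
    """Return a copy of `reference` with each SNP's alternate allele substituted.

    `snps` maps a 0-based position to a (ref_allele, alt_allele) pair.
    A mismatch between the stated reference allele and the sequence raises
    ValueError, which guards against coordinate or genome-build mix-ups.
    """
    seq = list(reference)
    for pos, (ref, alt) in snps.items():
        if seq[pos] != ref:
            raise ValueError(
                f"reference mismatch at {pos}: expected {ref}, found {seq[pos]}"
            )
        seq[pos] = alt
    return "".join(seq)


reference = "ACGTACGT"
snps = {2: ("G", "A"), 7: ("T", "C")}
variant = apply_snps(reference, snps)
print(variant)  # ACATACGC
```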

The bmfm-dna framework diagram shows the modules available for: multiple strategies for encoding natural genomic variations; multiple architectures (BERT, Performer, ModernBERT) for building genomic foundation models; and fine-tuning and benchmarking of the foundation models on well-established, biologically meaningful tasks. In particular, the package incorporates most of the benchmarking datasets from the Genome Understanding Evaluation (GUE) benchmark released with DNABERT-2. In addition, the package supports promoter activity prediction on datasets created using massively parallel reporter assays (MPRAs), and SNP-disease association prediction.

bmfm_dna

For more details, check out the BMFM-DNA preprint.

Citation

To cite the tool for both RNA and DNA, please cite both of the following articles:

```bibtex
@misc{dandala2025bmfmrnaopenframeworkbuilding,
  title={BMFM-RNA: An Open Framework for Building and Evaluating Transcriptomic Foundation Models},
  author={Bharath Dandala and Michael M. Danziger and Ella Barkan and Tanwi Biswas and Viatcheslav Gurev and Jianying Hu and Matthew Madgwick and Akira Koseki and Tal Kozlovski and Michal Rosen-Zvi and Yishai Shimoni and Ching-Huei Tsou},
  year={2025},
  eprint={2506.14861},
  archivePrefix={arXiv},
  primaryClass={q-bio.GN},
  url={https://arxiv.org/abs/2506.14861},
}

@misc{li2025bmfmdnasnpawarednafoundation,
  title={BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects},
  author={Hongyang Li and Sanjoy Dey and Bum Chul Kwon and Michael Danziger and Michal Rosen-Tzvi and Jianying Hu and James Kozloski and Ching-Huei Tsou and Bharath Dandala and Pablo Meyer},
  year={2025},
  eprint={2507.05265},
  archivePrefix={arXiv},
  primaryClass={q-bio.GN},
  url={https://arxiv.org/abs/2507.05265},
}
```

Owner

Name: BiomedSciAI
Login: BiomedSciAI
Kind: organization

Repositories: 6
Profile: https://github.com/BiomedSciAI

GitHub Events

Total

Create event: 9
Issues event: 2
Watch event: 13
Delete event: 9
Issue comment event: 6
Member event: 3
Push event: 67
Pull request review comment event: 24
Pull request review event: 25
Pull request event: 27
Fork event: 3

Last Year

Create event: 9
Issues event: 2
Watch event: 13
Delete event: 9
Issue comment event: 6
Member event: 3
Push event: 67
Pull request review comment event: 24
Pull request review event: 25
Pull request event: 27
Fork event: 3

Committers

Last synced: about 1 year ago

All Time

Total Commits: 60
Total Committers: 3
Avg Commits per committer: 20.0
Development Distribution Score (DDS): 0.233

Past Year

Commits: 60
Committers: 3
Avg Commits per committer: 20.0
Development Distribution Score (DDS): 0.233

Top Committers

Name	Email	Commits
Liran	L**k@i**m	46
Michael-Danziger	m**r@i**m	13
Ching-Huei Tsou	c**i@g**m	1
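The Development Distribution Score reported above is consistent with the common definition of one minus the share of commits made by the most active committer. That definition is an assumption here, since this page does not state its formula. A short sketch using the commit counts from the table (46, 13 and 1), with a hypothetical function name:

```python
def development_distribution_score(commits_per_author: list[int]) -> float:
    """Return 1 minus the share of commits made by the most active committer."""
    total = sum(commits_per_author)
    if total == 0:
        raise ValueError("no commits")
    return 1 - max(commits_per_author) / total


commits = [46, 13, 1]
dds = development_distribution_score(commits)
print(round(dds, 3))                 # 0.233
print(sum(commits) / len(commits))   # 20.0 average commits per committer
```

A score near 0 means one person authored almost all commits; a score approaching 1 means the work is spread across many committers.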

Committer Domains (Top 20 + Academic)

ibm.com: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 2
Total pull requests: 22
Average time to close issues: N/A
Average time to close pull requests: 16 days
Total issue authors: 2
Total pull request authors: 5
Average comments per issue: 0.0
Average comments per pull request: 0.18
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 2
Pull requests: 22
Average time to close issues: N/A
Average time to close pull requests: 16 days
Issue authors: 2
Pull request authors: 5
Average comments per issue: 0.0
Average comments per pull request: 0.18
Merged pull requests: 11
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

NielsRogge (1)
yangzhao1230 (1)

Pull Request Authors

mmdanziger (11)
liranszlak (7)
deysanjoy33 (2)
mattmadgwick (1)
hayabenhayon (1)

Top Labels

Issue Labels

Pull Request Labels

Dependencies

pyproject.toml pypi

anndata >=0.10.0
anndata >=0.10.9
captum *
cellxgene_census [experimental]>1.13.0
clearml >1.13,<2
einops *
focal-loss-torch @ git+https://github.com/mathiaszinnen/focal_loss_torch.git
hic-straw *
hydra-core *
litdata @ git+https://github.com/aksgibm/litdata.git@residual_sampling_30
numpy <2
omegaconf *
pandas >=2,<3
pydantic *
pysam *
pytorch-lightning >=2.0.0
rdata *
rich *
rnanorm *
scanpy [louvain]
scipy <1.15.0
scipy *
tensorboardX *
tiledbsoma *
torch *
torchmetrics ==1.1.0
transformers >=4.40.0

.github/workflows/pre-commit-check.yml actions

actions/checkout v4 composite
actions/setup-python v3 composite

.github/workflows/python-package.yml actions

actions/checkout v4 composite
astral-sh/setup-uv v5 composite
