dnadiffusion
🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 🎨
Science Score: 36.0%
This score indicates how likely this project is to be science-related, based on these indicators:
- CITATION.cff file
- codemeta.json file: found
- .zenodo.json file: found
- DOI references
- Academic publication links
- Committers with academic emails: 2 of 22 committers (9.1%) from academic institutions
- Institutional organization owner
- JOSS paper metadata
- Scientific vocabulary similarity: low similarity (15.4%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 🎨
Basic Info
- Host: GitHub
- Owner: pinellolab
- License: other
- Language: Python
- Default Branch: main
- Homepage: https://pinellolab.github.io/DNA-Diffusion/
- Size: 115 MB
Statistics
- Stars: 411
- Watchers: 11
- Forks: 53
- Open Issues: 3
- Releases: 4
Topics
Metadata Files
README.md
DNA Diffusion
Generative modeling of regulatory DNA sequences with diffusion probabilistic models.
Documentation: https://pinellolab.github.io/DNA-Diffusion
Source Code: https://github.com/pinellolab/DNA-Diffusion
Contents
- Introduction
- Installation
- Recreating data curation, training and sequence generation processes
- Examples
- Using your own data
Introduction
DNA-Diffusion is a diffusion-based model for generating 200bp, cell type-specific synthetic regulatory elements.
Installation
Our preferred package and project manager is uv; please follow their recommended instructions for installation. For convenience, their recommended Linux installation command is:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```
To clone the repository and install the necessary packages, run:
```bash
git clone https://github.com/pinellolab/DNA-Diffusion.git
cd DNA-Diffusion
uv sync
```
This will create a virtual environment in .venv and install all dependencies listed in the uv.lock file. This is compatible with both CPU and GPU, but preferred operating system is Linux with a recent GPU (e.g. A100 GPU). For detailed versions of the dependencies, please refer to the uv.lock file.
Recreating data curation, training and sequence generation processes
Data
We provide a small subset of the DHS Index dataset that was used for training at data/K562_hESCT0_HepG2_GM12878_12k_sequences_per_group.txt.
If you would like to recreate the dataset, you can call:
```bash
uv run data/master_dataset_and_filter.py
```
which will download all the necessary data and create a file data/master_dataset.ftr containing the full dataset of ~3.59 million sequences, and a file data/filtered_dataset.txt containing the same subset of sequences as above. A rendered version of this code is provided at notebooks/marimo_master_dataset_and_filter.ipynb.
Training
To train the DNA-Diffusion model, we provide a basic config file for training the diffusion model on the same subset of chromatin accessible regions described in the data section above.
To train the model call:
```bash
uv run train.py
```
This runs the model with our predefined config file configs/train/default.yaml, which is set to train the model for a minimum of 2000 epochs. The training script will save checkpoints for the two lowest validation loss values in the checkpoints/ directory. The path to this checkpoint will need to be updated in the sampling config file for sequence generation, as described in the Model Checkpoint section below.
We also provide a base config for debugging that will use a single sequence for training. You can override the default training script to use this debugging config by calling:
```bash
uv run train.py -cn train_debug
```
Model Checkpoint
We have uploaded the model checkpoint to Hugging Face. Below we provide an example script that handles downloading the model checkpoint and loading it for sequence generation.
If you would like to use a model checkpoint generated from the training script above, ensure you update the checkpoint_path within the config file configs/sampling/default.yaml to point to the location of the model checkpoint. By default, this is set to checkpoints/model.safetensors, so you will need to ensure that the model checkpoint is saved in this location. Both pt and safetensors formats are supported, so you can use either format for the model checkpoint. An example of overriding the checkpoint path from the command line is described in the sequence generation section below.
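Since both formats are supported, checkpoint loading can dispatch on the file extension. The following is a minimal sketch of that idea; the function name `pick_checkpoint_loader` is hypothetical, not part of the DNA-Diffusion API:

```python
from pathlib import Path


def pick_checkpoint_loader(path: str) -> str:
    """Choose a loading backend from the checkpoint file extension.

    Returns the name of the backend one would use:
    - "safetensors" -> safetensors.torch.load_file(path)
    - "torch"       -> torch.load(path)
    """
    suffix = Path(path).suffix
    if suffix == ".safetensors":
        return "safetensors"
    if suffix in {".pt", ".pth"}:
        return "torch"
    raise ValueError(f"Unsupported checkpoint format: {suffix}")


print(pick_checkpoint_loader("checkpoints/model.safetensors"))  # safetensors
print(pick_checkpoint_loader("checkpoints/model.pt"))           # torch
```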
Sequence Generation
Generate using Hugging Face Checkpoint
We provide a basic config file for generating sequences with the diffusion model, producing 1000 sequences per cell type. To generate sequences using the trained model, run:
```bash
uv run sample_hf.py
```
The default sampling setup generates 1000 sequences per cell type. To generate one sequence per cell type instead, override the defaults with the following CLI flags:
```bash
uv run sample_hf.py sampling.number_of_samples=1 sampling.sample_batch_size=1
```
By default, generation uses a guidance scale of 1.0; this can be tuned via the guidance_scale parameter in sample.py. It can also be overridden on the command line, for example to generate with a guidance scale of 7.0:

```bash
uv run sample_hf.py sampling.guidance_scale=7.0 sampling.number_of_samples=1 sampling.sample_batch_size=1
```
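The guidance scale controls how strongly the cell-type condition steers generation. As a numeric sketch, the standard classifier-free guidance combination looks like the following (a generic illustration, not code taken from DNA-Diffusion's sampler):

```python
def guided_prediction(uncond, cond, guidance_scale):
    """Classifier-free guidance: move the unconditional prediction
    toward the conditional one by guidance_scale times their difference."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 0.5]  # model output without the cell-type condition
cond = [1.0, 0.5]    # model output conditioned on a cell type

print(guided_prediction(uncond, cond, 1.0))  # [1.0, 0.5] -- scale 1.0 returns the conditional prediction
print(guided_prediction(uncond, cond, 7.0))  # [7.0, 0.5] -- larger scales amplify the conditional signal
```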
Both examples above generate sequences for all cell types in the dataset. To generate sequences for a specific cell type, specify the data.cell_types parameter on the command line. For example, to generate a sequence for the K562 cell type, run:
```bash
uv run sample_hf.py data.cell_types=K562 sampling.number_of_samples=1 sampling.sample_batch_size=1
```
or for both K562 and GM12878 cell types, you can run:
```bash
uv run sample_hf.py 'data.cell_types="K562,GM12878"' sampling.number_of_samples=1 sampling.sample_batch_size=1
```
Cell types can be specified as a comma-separated string or as a list.
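Because the override accepts either form, the two inputs can be normalized to the same list. The helper below is hypothetical, purely to illustrate how a comma-separated string and a list are treated uniformly:

```python
def normalize_cell_types(value):
    """Accept cell types as a comma-separated string or a list
    and return a clean list of cell-type names."""
    if isinstance(value, str):
        return [name.strip() for name in value.split(",") if name.strip()]
    return [str(name) for name in value]


print(normalize_cell_types("K562,GM12878"))       # ['K562', 'GM12878']
print(normalize_cell_types(["K562", "GM12878"]))  # ['K562', 'GM12878']
```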
Generate using Local Checkpoint
If you would prefer to download the model checkpoint from Hugging Face and use it directly, you can run the following command to download the model and save it in the checkpoint directory:
```bash
wget https://huggingface.co/ssenan/DNA-Diffusion/resolve/main/model.safetensors -O checkpoints/model.safetensors
```
Then you can run the sampling script with the following command:
```bash
uv run sample.py
```
If you would like to override the checkpoint path from the command line, you can do so with the following command (replacing checkpoints/model.pt with the path to your model checkpoint):
```bash
uv run sample.py sampling.checkpoint_path=checkpoints/model.pt
```
Examples
Training Notebook
We provide an example Colab notebook for training and sampling with the diffusion model; it runs the training and sampling commands shown above. A copy of the notebook is available at notebooks/training_and_sequence_generation.ipynb.
Sequence Generation Notebook
We also provide a Colab notebook for generating sequences with the diffusion model using the trained model hosted on Hugging Face; it runs the sampling commands above and shows some example outputs. A copy of the notebook is available at notebooks/sequence_generation.ipynb.
Both examples were run on Google Colab using a T4 GPU.
Using your own data
DNA-Diffusion is designed to be flexible and can be adapted to your own data. To use your own data, you will need to follow these steps:
- Prepare your data in the same format as our DHS Index dataset. The data should be a tab-separated text file containing at least the following columns:
  - chr: the chromosome of the regulatory element (e.g. chr1, chr2, etc.)
  - sequence: the DNA sequence of the regulatory element
  - TAG: the cell type of the regulatory element (e.g. K562, hESCT0, HepG2, GM12878, etc.)

  Additional metadata columns such as start, end, and continuous accessibility are allowed but not required.
- Sequences are expected to be 200bp long, but the model can be adapted to other lengths via the data-loading code at src/dnadiffusion/data/dataloader.py: change the sequence_length parameter in the load_data function to the desired length. Keep in mind that the original model was trained on 200bp sequences, so results may not be as good at other lengths.
- The model is designed to work with discrete class labels for the cell types, stored in the TAG column, so ensure your data uses the same format. If you have continuous labels, binarize them into discrete classes using a threshold or another method.
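As a concrete sketch of the expected input format, the following writes and validates a tiny tab-separated dataset with the required chr, sequence and TAG columns. The random 200bp sequences and the CELL_A tag are purely illustrative:

```python
import csv
import os
import random
import tempfile

random.seed(0)


def random_sequence(length=200):
    """Generate a random DNA sequence of the given length (illustrative only)."""
    return "".join(random.choice("ACGT") for _ in range(length))


rows = [
    {"chr": "chr1", "sequence": random_sequence(), "TAG": "CELL_A"},
    {"chr": "chr2", "sequence": random_sequence(), "TAG": "CELL_A"},
]

# Write a tab-separated file in the column layout described above.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["chr", "sequence", "TAG"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Sanity-check: every sequence is 200bp and carries a cell-type tag.
with open(path) as f:
    for row in csv.DictReader(f, delimiter="\t"):
        assert len(row["sequence"]) == 200
        assert row["TAG"]

os.remove(path)
```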
The data loading config can be found at configs/data/default.yaml, and you can override the default data loading config by passing the data parameter to the command line. For example, to use a custom data file, you can run:
```bash
uv run train.py data.data_path=path/to/your/data.txt data.load_saved_data=False
```
Setting data.load_saved_data=False is important: it ensures cached data is not used and the dataset is regenerated from the provided data file, so the model trains on your own data. Note that this overwrites the default .pkl file; to keep the original data, set data.saved_data_path to a different path. For example:
```bash
uv run train.py data.data_path=path/to/your/data.txt data.load_saved_data=False data.saved_data_path=path/to/your/saved_data.pkl
```
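The interaction between data.load_saved_data and data.saved_data_path amounts to a simple caching pattern: reuse the pickle if allowed and present, otherwise rebuild and overwrite it. A minimal sketch of that pattern (function and file names here are hypothetical, not DNA-Diffusion internals):

```python
import os
import pickle


def load_or_build(saved_data_path, load_saved_data, build_fn):
    """Return cached data if permitted and present, else rebuild and cache it."""
    if load_saved_data and os.path.exists(saved_data_path):
        with open(saved_data_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()  # e.g. parse data.data_path from scratch
    with open(saved_data_path, "wb") as f:
        pickle.dump(data, f)
    return data


cache = "demo_saved_data.pkl"
# First call: load_saved_data=False forces a rebuild and writes the cache.
first = load_or_build(cache, load_saved_data=False, build_fn=lambda: {"n": 1})
# Second call: the cache exists and is reused, so build_fn is never called.
second = load_or_build(cache, load_saved_data=True, build_fn=lambda: {"n": 2})
print(first, second)  # {'n': 1} {'n': 1}
os.remove(cache)
```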
A Colab notebook demonstrating training on your own data is provided; it uses a dummy dataset of three 200bp sequences with a single cell type, "CELL_A". A copy of the notebook is available at notebooks/new_data_training_and_sequence_generation.ipynb. This example was run on Google Colab using a T4 GPU.
Contributors ✨
Thanks goes to these wonderful people (emoji key):
Lucas Ferreira da Silva 🤔 💻 |
Luca Pinello 🤔 |
Simon 🤔 💻 |
This project follows the all-contributors specification. Contributions of any kind welcome!
Owner
- Name: Pinello Lab
- Login: pinellolab
- Kind: organization
- Email: lpinello@mgh.harvard.edu
- Location: Boston
- Website: pinellolab.org
- Repositories: 22
- Profile: https://github.com/pinellolab
Massachusetts General Hospital/ Harvard Medical School
GitHub Events
Total
- Create event: 28
- Release event: 3
- Issues event: 47
- Watch event: 40
- Delete event: 27
- Issue comment event: 25
- Push event: 74
- Pull request event: 89
- Fork event: 3
Last Year
- Create event: 28
- Release event: 3
- Issues event: 47
- Watch event: 40
- Delete event: 27
- Issue comment event: 25
- Push event: 74
- Pull request event: 89
- Fork event: 3
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Simon | s****n@g****m | 54 |
| renovate[bot] | 2****] | 52 |
| Lucas Ferreira da Silva | L****a | 46 |
| Cameron Smith | c****h@g****m | 22 |
| Luca Pinello | l****o@g****m | 14 |
| Wouter Meuleman | m****n@g****m | 7 |
| IhabBendidi | i****i@g****m | 6 |
| Noah Weber | n****r@c****m | 4 |
| Saurav Maheshkar | s****r@g****m | 3 |
| Tin M. Tunjic | t****c@y****m | 3 |
| allcontributors[bot] | 4****] | 3 |
| Martino | m****m@t****e | 2 |
| Mihir Neal | 4****l | 2 |
| NiccolΓ² Zanichelli | 6****9 | 2 |
| 1edv | e****v@m****u | 1 |
| Ryams | s****r@g****m | 1 |
| Zach Nussbaum | z****m@g****m | 1 |
| aaronwtr | 6****r | 1 |
| hssn-20 | 6****0 | 1 |
| jxilt | 8****t | 1 |
| Ihab BENDIDI | b****i@b****r | 1 |
| noahweber1 | 3****1 | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 125
- Total pull requests: 240
- Average time to close issues: about 1 month
- Average time to close pull requests: about 1 month
- Total issue authors: 19
- Total pull request authors: 15
- Average comments per issue: 1.61
- Average comments per pull request: 0.71
- Merged pull requests: 162
- Bot issues: 2
- Bot pull requests: 124
Past Year
- Issues: 20
- Pull requests: 83
- Average time to close issues: 13 days
- Average time to close pull requests: 21 days
- Issue authors: 5
- Pull request authors: 4
- Average comments per issue: 0.4
- Average comments per pull request: 0.18
- Merged pull requests: 43
- Bot issues: 1
- Bot pull requests: 38
Top Authors
Issue Authors
- ssenan (58)
- cameronraysmith (33)
- lucapinello (7)
- mateibejan1 (5)
- noahweber1 (5)
- IhabBendidi (2)
- NZ99 (2)
- Tylersuard (1)
- buttercutter (1)
- LucasSilvaFerreira (1)
- CamelCaseCam (1)
- mjleone (1)
- HelloWorldLTY (1)
- XuanrZhang (1)
- GitHUB-ZYD (1)
Pull Request Authors
- renovate[bot] (115)
- ssenan (71)
- cameronraysmith (27)
- allcontributors[bot] (4)
- mergify[bot] (3)
- noahweber1 (3)
- hssn-20 (2)
- mansoldm (2)
- pre-commit-ci[bot] (2)
- lucapinello (2)
- jamesthesnake (1)
- mihirneal (1)
- ttunja (1)
- Johnny1188 (1)
- mateibejan1 (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: pypi, 14 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 6
- Total maintainers: 1
pypi.org: dnadiffusion
Library for generation of synthetic regulatory elements using diffusion models
- Documentation: https://pinellolab.github.io/DNA-Diffusion
- License: other
- Latest release: 0.1.1 (published 8 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v4 composite
- actions/setup-python v4 composite
- codecov/codecov-action v3 composite
- actions/checkout v4 composite
- docker/build-push-action v5 composite
- docker/login-action v3 composite
- docker/metadata-action v5 composite
- docker/setup-buildx-action v3 composite
- docker/setup-qemu-action v3 composite
- actions/checkout v4 composite
- actions/deploy-pages v2 composite
- actions/setup-python v4 composite
- actions/upload-pages-artifact v2 composite
- actions/stale v8 composite
- actions/checkout v4 composite
- crazy-max/ghaction-github-labeler v5.0.0 composite
- actions/checkout v4 composite
- actions/setup-python v4 composite
- pypa/gh-action-pypi-publish v1.8.10 composite
- release-drafter/release-drafter v5.24.0 composite
- actions/checkout v4 composite
- actions/setup-python v4 composite
- pypa/gh-action-pypi-publish v1.8.10 composite
- release-drafter/release-drafter v5.24.0 composite
- salsify/action-detect-and-tag-new-version v2.0.3 composite
- condaforge/mambaforge 23.1.0-4 build
- memory-efficient-attention-pytorch *
- tensorflow *