https://github.com/aim-uofa/convnova

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in README
✓ Academic publication links: Links to arxiv.org, biorxiv.org
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (10.1%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: aim-uofa
Language: Python
Default Branch: main
Size: 1.65 MB

Statistics

Stars: 8
Watchers: 4
Forks: 0
Open Issues: 4
Releases: 0

Created over 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

[ICLR2025] ConvNova 🧬 Revisiting Convolution Architecture in the Realm of DNA Foundation Models

OpenReview | arXiv | GitHub | HuggingFace 🤗(coming soon)

ConvNova demonstrates that, if carefully designed, a pure CNN can serve as a DNA foundation model that surpasses Transformer and SSM-inspired architectures, while retaining the classic convolutional advantages of stronger locality bias, lower memory footprint, and markedly faster training and inference.

🚩 Plan

[x] Scripts for Pretraining, NT & Genomic Benchmarks.
[x] Paper Released.
[ ] Pretrained Weights of ConvNova.
[ ] Source Code and Pretrained Weights on transformers.
[ ] Scripts for DeepSEA & Bend-gene-finding.

1 Quick start

Clone the repo.

  git clone git@github.com:aim-uofa/ConvNova.git
  cd ConvNova/convnova

Prepare conda env.

  conda create -n convnova python==3.10
  conda activate convnova
  pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 
  pip install -r requirements.txt --no-deps
  pip install pytorch-lightning==1.8.6 --no-deps
  pip install packaging --no-deps
  # pip install flash-attn --no-build-isolation --no-deps
  pip install lightning-utilities --no-deps
  pip install torchmetrics
  pip install tensorboardX
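
Because several packages are installed with --no-deps, pip will not flag a mismatched version. The following is a minimal sketch, not part of the repository, that compares installed versions against the pins in the commands above; the helper name `find_version_mismatches` is hypothetical.

```python
from importlib import metadata

# Pins copied from the install commands above.
EXPECTED_VERSIONS = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "pytorch-lightning": "1.8.6",
}


def find_version_mismatches(expected):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu118" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_version_mismatches(EXPECTED_VERSIONS).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed pin matches; a `None` entry means the package is not installed in the active environment.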

Download the data (pretraining).

  mkdir -p data/hg38/
  curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
  gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
  curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
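
On machines without curl or gunzip, the same two steps can be done from Python's standard library. This is a sketch only; the function names `download` and `gunzip` are hypothetical and the repository does not ship them.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream url to dest, creating parent directories as needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip(src, remove_src=True):
    """Decompress src (*.gz) next to itself, mirroring plain `gunzip`."""
    src = Path(src)
    dest = src.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    if remove_src:
        src.unlink()
    return dest
```

A call such as `gunzip(download(url, "data/hg38/hg38.ml.fa.gz"))`, with `url` set to the FASTA address from the curl command above, would leave `data/hg38/hg38.ml.fa` on disk.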

See the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT benchmark and Genomic Benchmarks datasets.

The final file structure (data directory) should look like this:

  |____bert_hg38
  | |____hg38.ml.fa
  | |____hg38.ml.fa.fai
  | |____human-sequences.bed
  |____nucleotide_transformer
  | |____H3K36me3
  | |____......
  |____genomic_benchmark
  | |____dummy_mouse_enhancers_ensembl
  | |____....
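
Before launching a run, it can help to confirm that the data directory matches the tree above. The sketch below checks only the entries the tree names explicitly and ignores the elided ones; `missing_paths` is a hypothetical helper, not part of the repository. One assumption to keep in mind: the `.fai` index is usually written by pyfaidx the first time the FASTA file is opened, so it may legitimately be absent before the first run.

```python
from pathlib import Path

# Relative paths taken from the tree above; the elided entries are not checked.
EXPECTED_PATHS = [
    "bert_hg38/hg38.ml.fa",
    "bert_hg38/hg38.ml.fa.fai",
    "bert_hg38/human-sequences.bed",
    "nucleotide_transformer/H3K36me3",
    "genomic_benchmark/dummy_mouse_enhancers_ensembl",
]


def missing_paths(data_root, expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("data"):
        print(f"missing: data/{rel}")
```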

2 Using ConvNova with 🤗 Transformers

Coming Soon

3 Reproducing the paper

3.1 Pre-training on the Human Reference Genome

  python train.py experiment='hg38-pretrain/convnova'

You can adjust hyperparameters on the command line, as in the following command. The detailed hyperparameter settings are in configs/experiment/xxx/xxx.yaml.

  python train.py experiment='hg38-pretrain/convnova' wandb=null trainer.devices=4
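
When sweeping several settings, it may be convenient to assemble these commands programmatically. The sketch below assumes the key=value arguments are Hydra-style overrides (hydra-core is pinned in requirements.txt); `build_train_command` is a hypothetical helper and only formats the argument list, it does not validate keys against the config files.

```python
import shlex


def build_train_command(experiment, overrides=None):
    """Return the argv list for train.py with key=value overrides appended."""
    argv = ["python", "train.py", f"experiment={experiment}"]
    for key, value in (overrides or {}).items():
        # A Python None is rendered as the literal "null" used in the command above.
        rendered = "null" if value is None else str(value)
        argv.append(f"{key}={rendered}")
    return argv


cmd = build_train_command(
    "hg38-pretrain/convnova",
    {"wandb": None, "trainer.devices": 4},
)
print(shlex.join(cmd))
# prints: python train.py experiment=hg38-pretrain/convnova wandb=null trainer.devices=4
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` from the directory that contains train.py.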

3.2 Genomic Benchmarks (short-range)

GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='genomic-benchmark/convnova' with-some-arguments

3.3 Nucleotide Transformer Benchmark

Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='nt-benchmark/convnova' with-some-arguments

4 Citation

@inproceedings{bo2025convnova,
  title     = {Revisiting Convolution Architecture in the Realm of DNA Foundation Models},
  author    = {Yu Bo and Weian Mao and Yanjun Shao and Weiqiang Bai and Peng Ye
               and Xinzhu Ma and Junbo Zhao and Hao Chen and Chunhua Shen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}

5 Acknowledgements

ConvNova builds on the training, logging and data-loading scaffolds of HyenaDNA and Caduceus, and evaluates on Genomic Benchmarks, Nucleotide Transformer tasks, and the Long-Range Benchmark. We thank the maintainers of these open resources for making rigorous comparison possible.

Owner

Name: Advanced Intelligent Machines (AIM)
Login: aim-uofa
Kind: organization
Location: China

Repositories: 23
Profile: https://github.com/aim-uofa

A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...

GitHub Events

Total

Issues event: 2
Watch event: 8
Push event: 11
Public event: 1
Pull request event: 1
Create event: 2

Last Year

Issues event: 2
Watch event: 8
Push event: 11
Public event: 1
Pull request event: 1
Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 4
Total pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 2
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 4
Pull requests: 2
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 2
Pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

ychuest (3)
yangzhao1230 (1)

Pull Request Authors

multydoffer (2)

Dependencies

convnova/requirements.txt pypi

GitPython ==3.1.43
Jinja2 ==3.1.3
Markdown ==3.6
MarkupSafe ==2.1.5
PySocks ==1.7.1
PyYAML ==6.0.1
Werkzeug ==3.0.3
absl-py ==2.1.0
accelerate ==0.32.1
aiohttp ==3.9.5
aiosignal ==1.3.1
annotated-types ==0.7.0
antlr4-python3-runtime ==4.9.3
async-timeout ==4.0.3
attrs ==23.2.0
beautifulsoup4 ==4.12.3
biopython ==1.83
bleach ==6.1.0
cachetools ==5.3.3
charset-normalizer ==3.3.2
click ==8.1.7
cmake ==3.29.6
datasets ==2.16.0
deepspeed ==0.14.4
defusedxml ==0.7.1
dill ==0.3.7
docker-pycreds ==0.4.0
docopt ==0.6.2
einops ==0.8.0
fastjsonschema ==2.20.0
filelock ==3.13.1
fire ==0.6.0
frozenlist ==1.4.1
fsspec ==2023.10.0
gdown ==5.2.0
genomic_benchmarks ==0.0.9
gitdb ==4.0.11
google-auth ==2.30.0
google-auth-oauthlib ==1.0.0
grpcio ==1.64.1
hjson ==3.1.0
huggingface-hub ==0.23.4
hydra-core ==1.3.2
idna ==3.7
importlib_resources ==6.4.0
joblib ==1.4.2
jsonschema ==4.23.0
jsonschema-specifications ==2023.12.1
jupyterlab_pygments ==0.3.0
liftover ==1.1.18
lit ==18.1.8
markdown-it-py ==3.0.0
mdurl ==0.1.2
mistune ==3.0.2
mpmath ==1.3.0
multidict ==6.0.5
multiprocess ==0.70.15
nbclient ==0.10.0
nbformat ==5.10.4
networkx ==3.0
ninja ==1.11.1.1
numpy ==1.24.1
oauthlib ==3.2.2
omegaconf ==2.3.0
opt-einsum ==3.3.0
pandas ==2.0.3
pandocfilters ==1.5.1
peft ==0.11.1
pillow ==10.2.0
pipreqs ==0.5.0
pkgutil_resolve_name ==1.3.10
polars ==0.20.13
protobuf ==5.27.2
psutil ==6.0.0
py-cpuinfo ==9.0.0
pyarrow ==16.1.0
pyarrow-hotfix ==0.6
pyasn1 ==0.6.0
pyasn1_modules ==0.4.0
pydantic ==2.8.2
pydantic_core ==2.20.1
pyfaidx ==0.8.1.1
pygments ==2.17.1
pynvml ==11.5.0
python-dateutil ==2.8.2
pytz ==2024.1
referencing ==0.35.1
regex ==2024.5.15
requests ==2.32.3
requests-oauthlib ==2.0.0
rich ==13.7.1
rpds-py ==0.20.0
rsa ==4.9
safetensors ==0.4.3
scikit-learn ==1.3.2
scipy ==1.10.1
sentry-sdk ==2.7.1
setproctitle ==1.3.3
six ==1.16.0
smmap ==5.0.1
soupsieve ==2.5
sympy ==1.12
tensorboard ==2.14.0
tensorboard-data-server ==0.7.2
termcolor ==2.4.0
threadpoolctl ==3.5.0
timm ==0.9.16
tinycss2 ==1.3.0
tokenizers ==0.13.3
tqdm ==4.66.4
transformers ==4.28.0
triton ==2.0.0
tzdata ==2024.1
urllib3 ==2.2.2
wandb ==0.17.3
webencodings ==0.5.1
xxhash ==3.4.1
yarg ==0.1.9
yarl ==1.9.4
