https://github.com/aim-uofa/convnova

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, biorxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: aim-uofa
  • Language: Python
  • Default Branch: main
  • Size: 1.65 MB
Statistics
  • Stars: 8
  • Watchers: 4
  • Forks: 0
  • Open Issues: 4
  • Releases: 0
Created over 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme

README.md

ConvNova

[ICLR2025] ConvNova 🧬 Revisiting Convolution Architecture in the Realm of DNA Foundation Models

OpenReview |  arXiv |  GitHub |  HuggingFace 🤗(coming soon)

ConvNova demonstrates that, if carefully designed, a pure CNN can serve as a DNA foundation model that surpasses Transformer and SSM-inspired architectures, while retaining the classic convolutional advantages of stronger locality bias, lower memory footprint, and markedly faster training and inference.


🚩 Plan

  • [x] Scripts for Pretraining, NT & Genomic Benchmarks.
  • [x] Paper Released.
  • [ ] Pretrained Weights of ConvNova.
  • [ ] Source Code and Pretrained Weights on transformers.
  • [ ] Scripts for DeepSEA & Bend-gene-finding.

1 Quick start

Clone the repo.

  git clone git@github.com:aim-uofa/ConvNova.git
  cd ConvNova/convnova

Prepare conda env.

  conda create -n convnova python=3.10
  conda activate convnova
  pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 
  pip install -r requirements.txt --no-deps
  pip install pytorch-lightning==1.8.6 --no-deps
  pip install packaging --no-deps
  # pip install flash-attn --no-build-isolation --no-deps  (commented out in the original README)
  pip install lightning-utilities --no-deps
  pip install torchmetrics
  pip install tensorboardX
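
Because most packages above are installed with --no-deps, pip will not resolve version conflicts for you. The following is a minimal sketch for confirming that the pinned versions ended up in the environment. It uses only the standard library; the `EXPECTED` table and the `check_versions` helper are illustrative names, not part of the ConvNova repository.

```python
from importlib import metadata

# Pins copied from the install commands above; extend the table as needed.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "pytorch-lightning": "1.8.6",
}


def check_versions(expected):
    """Return {package: (expected, found)} for each mismatch; found is None if missing."""
    problems = {}
    for name, want in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # A local build tag such as "2.1.0+cu121" still counts as a match.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result means every listed package is installed at the expected version.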

Download the pretraining data.

  mkdir data
  mkdir -p data/hg38/
  curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
  gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
  curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
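
To sanity-check the downloaded interval file, it can help to count how many intervals fall into each split. The sketch below assumes that human-sequences.bed is tab-separated with four columns (chromosome, start, end, split name). That layout is an assumption and is not stated in this README. The `Interval` type and `read_intervals` helper are hypothetical names, and the two example rows are made up.

```python
import csv
import io
from collections import Counter
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    split: str


def read_intervals(handle):
    """Parse tab-separated BED rows into Interval records, skipping blank lines."""
    intervals = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        # Assumed column order: chromosome, start, end, split name.
        chrom, start, end, split = row[:4]
        intervals.append(Interval(chrom, int(start), int(end), split))
    return intervals


# Made-up rows for illustration only; they are not taken from the real file.
example = io.StringIO("chr1\t0\t1000\ttrain\nchr2\t500\t1500\tvalid\n")
intervals = read_intervals(example)
print(Counter(iv.split for iv in intervals))
```

For the real file, pass an open handle instead, for example `with open("data/hg38/human-sequences.bed") as fh: read_intervals(fh)`.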

See the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT benchmark and Genomic Benchmarks datasets.

The final file structure (data directory) should look like

  |____bert_hg38
  | |____hg38.ml.fa
  | |____hg38.ml.fa.fai
  | |____human-sequences.bed
  |____nucleotide_transformer
  | |____H3K36me3
  | |____......
  |____genomic_benchmark
  | |____dummy_mouse_enhancers_ensembl
  | |____....
2 Using ConvNova with 🤗 Transformers

Coming Soon


3 Reproducing the paper

3.1 Pre-training on the Human Reference Genome

  python train.py experiment='hg38-pretrain/convnova'

You can adjust hyperparameters from the command line, as in the following command. The full hyperparameter settings for each experiment are in configs/experiment/xxx/xxx.yaml.

  python train.py experiment='hg38-pretrain/convnova' wandb=null trainer.devices=4
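
The `key=value` arguments above use a dotted-path override style (hydra-core is listed in requirements.txt). As a rough illustration of how a dotted key such as `trainer.devices` addresses a nested setting, here is a standard-library sketch. It is not Hydra's implementation, and the `base` config, `parse_value`, and `apply_overrides` are made up for this example.

```python
import copy


def parse_value(raw):
    """Convert an override string to None or int where possible, else keep the string."""
    if raw == "null":
        return None
    try:
        return int(raw)
    except ValueError:
        return raw


def apply_overrides(config, overrides):
    """Return a copy of config with dotted key=value overrides applied."""
    result = copy.deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = result
        for part in parents:
            # Walk down the nested dicts, creating levels that do not exist yet.
            node = node.setdefault(part, {})
        node[leaf] = parse_value(raw)
    return result


# Hypothetical base config; the real values live in the experiment YAML files.
base = {"trainer": {"devices": 1}, "wandb": {"project": "demo"}}
merged = apply_overrides(base, ["wandb=null", "trainer.devices=4"])
print(merged)
```

The base config is left untouched, so the same defaults can be reused with different override lists.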

3.2 Genomic Benchmarks (short-range)

GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='genomic-benchmark/convnova' with-some-arguments

3.3 Nucleotide Transformer Benchmark

Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.

Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).

  python train.py experiment='nt-benchmark/convnova' with-some-arguments


4 Citation

@inproceedings{bo2025convnova,
  title     = {Revisiting Convolution Architecture in the Realm of DNA Foundation Models},
  author    = {Yu Bo and Weian Mao and Yanjun Shao and Weiqiang Bai and Peng Ye
               and Xinzhu Ma and Junbo Zhao and Hao Chen and Chunhua Shen},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}

5 Acknowledgements

ConvNova builds on the training, logging and data-loading scaffolds of HyenaDNA and Caduceus, and evaluates on Genomic Benchmarks, Nucleotide Transformer tasks, and the Long-Range Benchmark. We thank the maintainers of these open resources for making rigorous comparison possible.

Owner

  • Name: Advanced Intelligent Machines (AIM)
  • Login: aim-uofa
  • Kind: organization
  • Location: China

A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...

GitHub Events

Total
  • Issues event: 2
  • Watch event: 8
  • Push event: 11
  • Public event: 1
  • Pull request event: 1
  • Create event: 2
Last Year
  • Issues event: 2
  • Watch event: 8
  • Push event: 11
  • Public event: 1
  • Pull request event: 1
  • Create event: 2

Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 4
  • Total pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 4
  • Pull requests: 2
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • ychuest (3)
  • yangzhao1230 (1)
Pull Request Authors
  • multydoffer (2)

Dependencies

convnova/requirements.txt pypi
  • GitPython ==3.1.43
  • Jinja2 ==3.1.3
  • Markdown ==3.6
  • MarkupSafe ==2.1.5
  • PySocks ==1.7.1
  • PyYAML ==6.0.1
  • Werkzeug ==3.0.3
  • absl-py ==2.1.0
  • accelerate ==0.32.1
  • aiohttp ==3.9.5
  • aiosignal ==1.3.1
  • annotated-types ==0.7.0
  • antlr4-python3-runtime ==4.9.3
  • async-timeout ==4.0.3
  • attrs ==23.2.0
  • beautifulsoup4 ==4.12.3
  • biopython ==1.83
  • bleach ==6.1.0
  • cachetools ==5.3.3
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • cmake ==3.29.6
  • datasets ==2.16.0
  • deepspeed ==0.14.4
  • defusedxml ==0.7.1
  • dill ==0.3.7
  • docker-pycreds ==0.4.0
  • docopt ==0.6.2
  • einops ==0.8.0
  • fastjsonschema ==2.20.0
  • filelock ==3.13.1
  • fire ==0.6.0
  • frozenlist ==1.4.1
  • fsspec ==2023.10.0
  • gdown ==5.2.0
  • genomic_benchmarks ==0.0.9
  • gitdb ==4.0.11
  • google-auth ==2.30.0
  • google-auth-oauthlib ==1.0.0
  • grpcio ==1.64.1
  • hjson ==3.1.0
  • huggingface-hub ==0.23.4
  • hydra-core ==1.3.2
  • idna ==3.7
  • importlib_resources ==6.4.0
  • joblib ==1.4.2
  • jsonschema ==4.23.0
  • jsonschema-specifications ==2023.12.1
  • jupyterlab_pygments ==0.3.0
  • liftover ==1.1.18
  • lit ==18.1.8
  • markdown-it-py ==3.0.0
  • mdurl ==0.1.2
  • mistune ==3.0.2
  • mpmath ==1.3.0
  • multidict ==6.0.5
  • multiprocess ==0.70.15
  • nbclient ==0.10.0
  • nbformat ==5.10.4
  • networkx ==3.0
  • ninja ==1.11.1.1
  • numpy ==1.24.1
  • oauthlib ==3.2.2
  • omegaconf ==2.3.0
  • opt-einsum ==3.3.0
  • pandas ==2.0.3
  • pandocfilters ==1.5.1
  • peft ==0.11.1
  • pillow ==10.2.0
  • pipreqs ==0.5.0
  • pkgutil_resolve_name ==1.3.10
  • polars ==0.20.13
  • protobuf ==5.27.2
  • psutil ==6.0.0
  • py-cpuinfo ==9.0.0
  • pyarrow ==16.1.0
  • pyarrow-hotfix ==0.6
  • pyasn1 ==0.6.0
  • pyasn1_modules ==0.4.0
  • pydantic ==2.8.2
  • pydantic_core ==2.20.1
  • pyfaidx ==0.8.1.1
  • pygments ==2.17.1
  • pynvml ==11.5.0
  • python-dateutil ==2.8.2
  • pytz ==2024.1
  • referencing ==0.35.1
  • regex ==2024.5.15
  • requests ==2.32.3
  • requests-oauthlib ==2.0.0
  • rich ==13.7.1
  • rpds-py ==0.20.0
  • rsa ==4.9
  • safetensors ==0.4.3
  • scikit-learn ==1.3.2
  • scipy ==1.10.1
  • sentry-sdk ==2.7.1
  • setproctitle ==1.3.3
  • six ==1.16.0
  • smmap ==5.0.1
  • soupsieve ==2.5
  • sympy ==1.12
  • tensorboard ==2.14.0
  • tensorboard-data-server ==0.7.2
  • termcolor ==2.4.0
  • threadpoolctl ==3.5.0
  • timm ==0.9.16
  • tinycss2 ==1.3.0
  • tokenizers ==0.13.3
  • tqdm ==4.66.4
  • transformers ==4.28.0
  • triton ==2.0.0
  • tzdata ==2024.1
  • urllib3 ==2.2.2
  • wandb ==0.17.3
  • webencodings ==0.5.1
  • xxhash ==3.4.1
  • yarg ==0.1.9
  • yarl ==1.9.4