https://github.com/aim-uofa/convnova
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org, biorxiv.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.1%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: aim-uofa
- Language: Python
- Default Branch: main
- Size: 1.65 MB
Statistics
- Stars: 8
- Watchers: 4
- Forks: 0
- Open Issues: 4
- Releases: 0
Metadata Files
README.md
[ICLR2025] ConvNova 🧬 Revisiting Convolution Architecture in the Realm of DNA Foundation Models
OpenReview | arXiv | GitHub | HuggingFace 🤗(coming soon)
ConvNova demonstrates that, if carefully designed, a pure CNN can serve as a DNA foundation model that surpasses Transformer and SSM-inspired architectures, while retaining the classic convolutional advantages of stronger locality bias, lower memory footprint, and markedly faster training and inference.
🚩 Plan
- [x] Scripts for Pretraining, NT & Genomic Benchmarks.
- [x] Paper Released.
- [ ] Pretrained Weights of ConvNova.
- [ ] Source Code and Pretrained Weights on transformers.
- [ ] Scripts for DeepSEA & Bend-gene-finding.
1 Quick start
Clone the repo.
git clone git@github.com:aim-uofa/ConvNova.git
cd ConvNova/convnova
Prepare conda env.
conda create -n convnova python==3.10
conda activate convnova
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
pip install -r requirements.txt --no-deps
pip install pytorch-lightning==1.8.6 --no-deps
pip install packaging --no-deps
# pip install flash_attn --no-build-isolation --no-deps
pip install lightning_utilities --no-deps
pip install torchmetrics
pip install tensorboardX
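Because several packages are installed with `--no-deps`, pip will not flag version conflicts on its own. The following is a minimal sketch for confirming that the pinned versions from the commands above are the ones actually installed. The names `EXPECTED` and `check_versions` are illustrative and not part of the repository.

```python
from importlib import metadata

# Pins copied from the install commands in this README.
EXPECTED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "torchaudio": "2.1.0",
    "pytorch-lightning": "1.8.6",
}


def check_versions(expected):
    """Return {package: (expected, found)} for each mismatch.

    `found` is None when the package is not installed.
    """
    problems = {}
    for name, want in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        # Wheels may carry a local suffix such as "+cu121"; compare the public part only.
        if found is None or found.split("+")[0] != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found or 'not installed'}")
```

An empty result means every pinned package matches; anything printed points at a package to reinstall.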
Download the data (pretraining).
mkdir data
mkdir -p data/hg38/
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
gunzip data/hg38/hg38.ml.fa.gz # unzip the fasta file
curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
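After the download, a quick sanity check on the BED file can catch a truncated transfer. The sketch below counts intervals per split label. It assumes the file is tab-separated with the split name (for example `train`, `valid`, `test`) in the fourth column; that layout is an assumption to verify against the downloaded file, and `count_splits` is a hypothetical helper, not a function from the repository.

```python
import csv
from collections import Counter


def count_splits(lines):
    """Count intervals per split label.

    `lines` is any iterable of text lines (an open file works).
    Assumes tab-separated columns: chrom, start, end, split.
    """
    counts = Counter()
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 4:  # skip blank or malformed lines
            continue
        counts[row[3]] += 1
    return counts
```

To use it on the downloaded file, open `data/hg38/human-sequences.bed` and pass the file handle: `with open(path) as fh: print(count_splits(fh))`.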
See the Nucleotide Transformer and Genomic Benchmarks papers for how to download and process the NT benchmark and Genomic Benchmarks datasets.
The final file structure (data directory) should look like this:
|____bert_hg38
| |____hg38.ml.fa
| |____hg38.ml.fa.fai
| |____human-sequences.bed
|____nucleotide_transformer
| |____H3K36me3
| |____......
|____genomic_benchmark
| |____dummy_mouse_enhancers_ensembl
| |____....
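Before launching a run, it can help to check the data directory against this tree. The sketch below lists which of the explicitly named entries are missing. It follows the tree as written (`bert_hg38`), whereas the download commands write to `data/hg38`, so one of the two may need renaming or a symlink. `EXPECTED` and `missing_entries` are illustrative names, not part of the repository, and the elided entries (`......`) are deliberately not checked.

```python
from pathlib import Path

# Only the entries spelled out in the tree; elided ones are not guessed.
EXPECTED = [
    "bert_hg38/hg38.ml.fa",
    "bert_hg38/hg38.ml.fa.fai",
    "bert_hg38/human-sequences.bed",
    "nucleotide_transformer/H3K36me3",
    "genomic_benchmark/dummy_mouse_enhancers_ensembl",
]


def missing_entries(data_root, expected=EXPECTED):
    """Return the relative paths from `expected` that do not exist under `data_root`."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Data directory matches the expected layout.")
```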
2 Using ConvNova with 🤗 Transformers
Coming Soon
3 Reproducing the paper
3.1 Pre-training on the Human Reference Genome
python train.py experiment='hg38-pretrain/convnova'
You can adjust hyperparameters from the command line, as in the following command. The full hyperparameter settings are in configs/experiment/xxx/xxx.yaml.
python train.py experiment='hg38-pretrain/convnova' wandb=null trainer.devices=4
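The `key=value` arguments follow the override style used by Hydra, which appears in the dependency list as `hydra-core`. When launching several runs from a script, the command can be assembled programmatically. The sketch below uses only the overrides shown in this README; `build_train_command` is a hypothetical helper, not part of the repository.

```python
import shlex


def build_train_command(experiment, overrides=None):
    """Assemble an argv list for train.py from key=value overrides."""
    argv = ["python", "train.py", f"experiment={experiment}"]
    for key, value in (overrides or {}).items():
        # None is rendered as the literal "null", as in wandb=null.
        rendered = "null" if value is None else str(value)
        argv.append(f"{key}={rendered}")
    return argv


cmd = build_train_command(
    "hg38-pretrain/convnova",
    {"wandb": None, "trainer.devices": 4},
)
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` from the directory containing `train.py`, which avoids shell quoting issues.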
3.2 Genomic Benchmarks (short-range)
GenomicBenchmarks provides 8 binary- and multi-class tasks packaged as a Python library.
Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).
python train.py experiment='genomic-benchmark/convnova' with-some-arguments
3.3 Nucleotide Transformer Benchmark
Datasets are hosted on the Hub as InstaDeepAI/nucleotide_transformer_downstream_tasks.
Remember to adjust the settings for each dataset, such as the maximum sequence length and the pretrained checkpoint (coming soon).
python train.py experiment='nt-benchmark/convnova' with-some-arguments
4 Citation
@inproceedings{bo2025convnova,
title = {Revisiting Convolution Architecture in the Realm of DNA Foundation Models},
author = {Yu Bo and Weian Mao and Yanjun Shao and Weiqiang Bai and Peng Ye
and Xinzhu Ma and Junbo Zhao and Hao Chen and Chunhua Shen},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2025}
}
5 Acknowledgements
ConvNova builds on the training, logging and data-loading scaffolds of HyenaDNA and Caduceus, and evaluates on Genomic Benchmarks, Nucleotide Transformer tasks, and the Long-Range Benchmark. We thank the maintainers of these open resources for making rigorous comparison possible.
Owner
- Name: Advanced Intelligent Machines (AIM)
- Login: aim-uofa
- Kind: organization
- Location: China
- Repositories: 23
- Profile: https://github.com/aim-uofa
A research team at Zhejiang University, focusing on Computer Vision and broad AI research ...
GitHub Events
Total
- Issues event: 2
- Watch event: 8
- Push event: 11
- Public event: 1
- Pull request event: 1
- Create event: 2
Last Year
- Issues event: 2
- Watch event: 8
- Push event: 11
- Public event: 1
- Pull request event: 1
- Create event: 2
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 4
- Total pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 2
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 2
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ychuest (3)
- yangzhao1230 (1)
Pull Request Authors
- multydoffer (2)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- GitPython ==3.1.43
- Jinja2 ==3.1.3
- Markdown ==3.6
- MarkupSafe ==2.1.5
- PySocks ==1.7.1
- PyYAML ==6.0.1
- Werkzeug ==3.0.3
- absl-py ==2.1.0
- accelerate ==0.32.1
- aiohttp ==3.9.5
- aiosignal ==1.3.1
- annotated-types ==0.7.0
- antlr4-python3-runtime ==4.9.3
- async-timeout ==4.0.3
- attrs ==23.2.0
- beautifulsoup4 ==4.12.3
- biopython ==1.83
- bleach ==6.1.0
- cachetools ==5.3.3
- charset-normalizer ==3.3.2
- click ==8.1.7
- cmake ==3.29.6
- datasets ==2.16.0
- deepspeed ==0.14.4
- defusedxml ==0.7.1
- dill ==0.3.7
- docker-pycreds ==0.4.0
- docopt ==0.6.2
- einops ==0.8.0
- fastjsonschema ==2.20.0
- filelock ==3.13.1
- fire ==0.6.0
- frozenlist ==1.4.1
- fsspec ==2023.10.0
- gdown ==5.2.0
- genomic_benchmarks ==0.0.9
- gitdb ==4.0.11
- google-auth ==2.30.0
- google-auth-oauthlib ==1.0.0
- grpcio ==1.64.1
- hjson ==3.1.0
- huggingface-hub ==0.23.4
- hydra-core ==1.3.2
- idna ==3.7
- importlib_resources ==6.4.0
- joblib ==1.4.2
- jsonschema ==4.23.0
- jsonschema-specifications ==2023.12.1
- jupyterlab_pygments ==0.3.0
- liftover ==1.1.18
- lit ==18.1.8
- markdown-it-py ==3.0.0
- mdurl ==0.1.2
- mistune ==3.0.2
- mpmath ==1.3.0
- multidict ==6.0.5
- multiprocess ==0.70.15
- nbclient ==0.10.0
- nbformat ==5.10.4
- networkx ==3.0
- ninja ==1.11.1.1
- numpy ==1.24.1
- oauthlib ==3.2.2
- omegaconf ==2.3.0
- opt-einsum ==3.3.0
- pandas ==2.0.3
- pandocfilters ==1.5.1
- peft ==0.11.1
- pillow ==10.2.0
- pipreqs ==0.5.0
- pkgutil_resolve_name ==1.3.10
- polars ==0.20.13
- protobuf ==5.27.2
- psutil ==6.0.0
- py-cpuinfo ==9.0.0
- pyarrow ==16.1.0
- pyarrow-hotfix ==0.6
- pyasn1 ==0.6.0
- pyasn1_modules ==0.4.0
- pydantic ==2.8.2
- pydantic_core ==2.20.1
- pyfaidx ==0.8.1.1
- pygments ==2.17.1
- pynvml ==11.5.0
- python-dateutil ==2.8.2
- pytz ==2024.1
- referencing ==0.35.1
- regex ==2024.5.15
- requests ==2.32.3
- requests-oauthlib ==2.0.0
- rich ==13.7.1
- rpds-py ==0.20.0
- rsa ==4.9
- safetensors ==0.4.3
- scikit-learn ==1.3.2
- scipy ==1.10.1
- sentry-sdk ==2.7.1
- setproctitle ==1.3.3
- six ==1.16.0
- smmap ==5.0.1
- soupsieve ==2.5
- sympy ==1.12
- tensorboard ==2.14.0
- tensorboard-data-server ==0.7.2
- termcolor ==2.4.0
- threadpoolctl ==3.5.0
- timm ==0.9.16
- tinycss2 ==1.3.0
- tokenizers ==0.13.3
- tqdm ==4.66.4
- transformers ==4.28.0
- triton ==2.0.0
- tzdata ==2024.1
- urllib3 ==2.2.2
- wandb ==0.17.3
- webencodings ==0.5.1
- xxhash ==3.4.1
- yarg ==0.1.9
- yarl ==1.9.4