https://github.com/bioscan-ml/barcodemae
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ✓ DOI references (found 1 DOI reference in README)
- ✓ Academic publication links (links to: arxiv.org)
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 10.7%, to scientific vocabulary)
Repository
Basic Info
- Host: GitHub
- Owner: bioscan-ml
- License: MIT
- Language: Python
- Default Branch: main
- Size: 2.49 MB
Statistics
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
BarcodeMAE
A PyTorch implementation of BarcodeMAE, a model for enhancing DNA foundation models to address masking inefficiencies.
Check out our paper
The model checkpoint is available here: BarcodeMAE
Quick start
Use this Jupyter notebook to get started quickly: Quick start
Setup
- Clone this repository
- Install the required libraries
```shell
pip install -r requirements.txt
pip install -e .
```
Preparing the data
- Download the metadata file and copy it into the data folder
- Split the metadata file into smaller files according to the different partitions as presented in the BIOSCAN-5M paper
```shell
cd data/
python data_split.py BIOSCAN-5M_Dataset_metadata.tsv
```
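The repository's `data_split.py` performs the partitioning; as an illustration of what such a split involves, the sketch below divides a metadata TSV into one file per partition using only the standard library. The column name `split` is an assumption for illustration; the actual BIOSCAN-5M metadata may name its partition column differently.

```python
import csv
from collections import defaultdict

def split_by_partition(metadata_path, partition_column="split"):
    """Write one TSV per value found in `partition_column`.

    Illustrative sketch only: the column name "split" is assumed, and
    the repo's own data_split.py may do more (e.g. filtering columns).
    """
    rows_by_partition = defaultdict(list)
    with open(metadata_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_partition[row[partition_column]].append(row)
    # One output file per partition value, e.g. train.tsv, test.tsv
    for partition, rows in rows_by_partition.items():
        with open(f"{partition}.tsv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames, delimiter="\t")
            writer.writeheader()
            writer.writerows(rows)
    return sorted(rows_by_partition)
```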
Reproducing the results
- Download the checkpoint and copy it into the model_checkpoints directory
- Run KNN evaluation
```shell
python barcodebert/knn_probing.py \
    --run-name knn_evaluation \
    --data-dir ./data/ \
    --pretrained-checkpoint "./model_checkpoints/best_pretraining.pt" \
    --log-wandb \
    --dataset BIOSCAN-5M
```
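KNN probing evaluates frozen embeddings by classifying each query sample from the labels of its nearest training samples in embedding space. The sketch below shows the core mechanic with plain NumPy arrays standing in for encoder outputs; in the actual pipeline the embeddings would come from the pretrained BarcodeMAE encoder, and the function name here is illustrative, not the repository's API.

```python
import numpy as np

def knn_predict(train_emb, train_labels, query_emb, k=1):
    """k-NN classification by Euclidean distance in embedding space.

    Illustrative sketch: embeddings are plain arrays here, but would be
    frozen encoder outputs in a real KNN-probing evaluation.
    """
    # Pairwise squared distances, shape (n_query, n_train)
    d2 = ((query_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    preds = []
    for idx in nearest:
        # Majority vote among the k nearest training labels
        labels, counts = np.unique(train_labels[idx], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```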
Pretraining from scratch
- Run pretraining
```shell
python barcodebert/pretraining.py \
    --dataset=BIOSCAN-5M \
    --k_mer=6 \
    --n_layers=6 \
    --n_heads=6 \
    --decoder-n-layers=6 \
    --decoder-n-heads=6 \
    --data_dir=data/ \
    --checkpoint=model_checkpoints/BIOSCAN-5M/6-6-6/model_checkpoint.pt
```
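The separate `--decoder-n-layers`/`--decoder-n-heads` flags reflect the masked-autoencoder recipe the model name suggests: the encoder sees only the visible tokens, while a lightweight decoder reconstructs the masked ones. The sketch below illustrates that masking split on a k-mer token sequence; it is an assumption-based illustration of the general MAE idea, not the repository's implementation.

```python
import numpy as np

def mask_kmers(token_ids, mask_ratio=0.5, seed=0):
    """MAE-style masking: split a token sequence into visible tokens
    (encoder input) and masked positions (reconstruction targets).

    Illustrative sketch of the masked-autoencoder recipe; the actual
    BarcodeMAE masking scheme may differ.
    """
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    n_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    # The encoder receives only the visible tokens (no [MASK]
    # placeholders), which is where MAE-style pretraining saves
    # compute relative to BERT-style masking.
    return token_ids[visible_idx], visible_idx, token_ids[masked_idx], masked_idx
```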
Citation
If you find BarcodeMAE useful in your research, please consider citing:
```bibtex
@article{safari2025barcodemae,
  title={Enhancing DNA Foundation Models to Address Masking Inefficiencies},
  author={Monireh Safari and Pablo Millan Arias and Scott C. Lowe
          and Lila Kari and Angel X. Chang and Graham W. Taylor},
  journal={arXiv preprint arXiv:2502.18405},
  year={2025},
  eprint={2502.18405},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arXiv.2502.18405},
}
```
Owner
- Name: BIOSCAN
- Login: bioscan-ml
- Kind: organization
- Email: contact@bioscancanada.org
- Website: https://biodiversitygenomics.net/research/bioscan/
- Repositories: 1
- Profile: https://github.com/bioscan-ml
Illuminating biodiversity with DNA-based identification systems
GitHub Events
Total
- Issues event: 2
- Watch event: 4
- Issue comment event: 1
- Push event: 19
- Public event: 1
- Create event: 1
Last Year
- Issues event: 2
- Watch event: 4
- Issue comment event: 1
- Push event: 19
- Public event: 1
- Create event: 1
Dependencies
- actions/checkout v3 composite
- actions/setup-python v4 composite
- pre-commit/action v3.0.0 composite
- accelerate ==0.25.0
- black ==22.8.0
- boto3 ==1.28.57
- botocore ==1.31.57
- cmake *
- einops ==0.6
- matplotlib *
- numpy ==1.25.2
- omegaconf ==2.3
- opt-einsum ==3.3
- pandas ==2.1
- peft ==0.5
- scikit-learn ==1.3
- scipy ==1.12
- seaborn *
- torch ==2.1.1
- torchtext ==0.16.1
- transformers ==4.29.2
- umap-learn ==0.5.6
- wandb *
- accelerate ==0.23.0
- boto3 ==1.28.51
- botocore ==1.31.51
- cmake ==3.27.5
- einops ==0.6.1
- evaluate ==0.4.0
- matplotlib *
- numpy ==1.26.0
- omegaconf ==2.3.0
- opt-einsum ==3.3.0
- pandas ==2.1.1
- peft ==0.5.0
- scikit-learn ==1.3.0
- scipy ==1.11.2
- seaborn *
- torch ==2.0.1
- torchtext ==0.15.2
- torchvision ==0.15.2
- transformers ==4.29.2
- umap-learn ==0.5.3
- accelerate >=0.23.0
- boto3 >=1.28.51
- botocore >=1.31.51
- cmake >=3.27.5
- einops >=0.6.1
- matplotlib *
- numpy >=1.25.0
- omegaconf >=2.3.0
- opt-einsum >=3.3.0
- pandas >=2.1.1
- peft >=0.5.0
- scikit-learn >=1.3.0
- scipy >=1.11.2
- seaborn *
- torch >=2.0.1
- torchtext >=0.15.2
- torchvision >=0.15.2
- transformers >=4.29.2
- umap-learn >=0.5.3
- wandb *
- accelerate ==0.25.0
- black ==22.8.0
- boto3 ==1.28.57
- botocore ==1.31.57
- cmake *
- einops ==0.6
- matplotlib *
- numpy ==1.25.2
- omegaconf ==2.3
- opt-einsum ==3.3
- pandas ==2.1
- peft ==0.5
- scikit-learn ==1.3
- scipy ==1.12
- seaborn *
- torch ==2.1.1
- torchdata ==0.7.1
- torchtext ==0.16.1
- transformers ==4.29.2
- umap-learn ==0.5.6
- wandb *