https://github.com/bioscan-ml/barcodemae

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bioscan-ml
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 2.49 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created 11 months ago · Last pushed 10 months ago
Metadata Files
README · License

README.md

BarcodeMAE

A PyTorch implementation of BarcodeMAE, a model for enhancing DNA foundation models to address masking inefficiencies.


Check out our paper: Enhancing DNA Foundation Models to Address Masking Inefficiencies (arXiv:2502.18405)

Model checkpoint is available here: BarcodeMAE

Quick start

Use this Jupyter notebook to get started: Quick start

Setup

  1. Clone this repository
  2. Install the required libraries

```shell
pip install -r requirements.txt
pip install -e .
```

Preparing the data

  1. Download the metadata file and copy it into the data folder
  2. Split the metadata file into smaller files according to the different partitions as presented in the BIOSCAN-5M paper

```shell
cd data/
python data_split.py BIOSCAN-5M_Dataset_metadata.tsv
```
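The split logic lives in `data_split.py`; as a minimal sketch of what partitioning a metadata table looks like, the snippet below groups rows by a partition label. The column name `split` and the partition values are assumptions for illustration, not the script's actual schema (in practice the rows would come from `csv.DictReader` over the TSV).

```python
from collections import defaultdict

def split_metadata(rows, split_field="split"):
    """Group metadata rows by their partition label.

    The field name `split` is an illustrative assumption; the real
    metadata file defines its own partition column.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[split_field]].append(row)
    return partitions

# Toy rows standing in for lines of the metadata TSV.
rows = [
    {"processid": "A1", "split": "pretrain"},
    {"processid": "B2", "split": "seen"},
    {"processid": "C3", "split": "pretrain"},
]
parts = split_metadata(rows)
```

Each partition can then be written back out as its own smaller file.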

Reproducing the results

  1. Download the checkpoint and copy it to the model_checkpoints directory
  2. Run KNN evaluation

```shell
python barcodebert/knn_probing.py \
    --run-name knn_evaluation \
    --data-dir ./data/ \
    --pretrained-checkpoint "./model_checkpoints/best_pretraining.pt" \
    --log-wandb \
    --dataset BIOSCAN-5M
```
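KNN probing evaluates frozen embeddings by classifying each query with the labels of its nearest reference embeddings. The real script handles data loading, checkpoint restoration, and logging; the sketch below shows only the core idea in pure Python, with toy embeddings and label names that are purely illustrative.

```python
from collections import Counter

def knn_predict(train_embs, train_labels, query, k=1):
    """Predict a label for `query` by majority vote over its k nearest
    reference embeddings (squared Euclidean distance)."""
    dists = [
        (sum((a - b) ** 2 for a, b in zip(emb, query)), label)
        for emb, label in zip(train_embs, train_labels)
    ]
    dists.sort(key=lambda t: t[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" with made-up genus labels.
train_embs = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["genus_a", "genus_b", "genus_b"]
pred = knn_predict(train_embs, train_labels, (1.0, 0.9), k=3)
```

In the actual evaluation the embeddings come from the pretrained encoder checkpoint rather than being hand-written.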

Pretraining from scratch

  1. Run pretraining

```shell
python barcodebert/pretraining.py \
    --dataset=BIOSCAN-5M \
    --k_mer=6 \
    --n_layers=6 \
    --n_heads=6 \
    --decoder-n-layers=6 \
    --decoder-n-heads=6 \
    --data_dir=data/ \
    --checkpoint=model_checkpoints/BIOSCAN-5M/6-6-6/model_checkpoint.pt
```
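Pretraining masks a fraction of the k-mer tokens in each barcode sequence and trains the model to reconstruct them. As a schematic of that masking step only, the sketch below replaces random tokens with a mask symbol; the mask token name, ratio, and 6-mer tokens are illustrative assumptions, not BarcodeMAE's exact configuration.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_ratio=0.5, seed=0):
    """Randomly replace a fraction of k-mer tokens with `mask_token`,
    returning the masked sequence and the indices to be predicted."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    idx = sorted(rng.sample(range(len(tokens)), n_mask))
    hidden = set(idx)
    masked = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
    return masked, idx

# A toy DNA barcode tokenized into 6-mers (k_mer=6, as in the command above).
tokens = ["ACGTAC", "GTACGT", "TACGTA", "CGTACG"]
masked, idx = mask_tokens(tokens, mask_ratio=0.5)
```

During training, the loss is computed only at the masked positions, which is the reconstruction target the encoder-decoder pair learns from.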

Citation

If you find BarcodeMAE useful in your research, please consider citing:

```bibtex
@article{safari2025barcodemae,
  title={Enhancing DNA Foundation Models to Address Masking Inefficiencies},
  author={Monireh Safari and Pablo Millan Arias and Scott C. Lowe and Lila Kari and Angel X. Chang and Graham W. Taylor},
  journal={arXiv preprint arXiv:2502.18405},
  year={2025},
  eprint={2502.18405},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arXiv.2502.18405},
}
```

Owner

  • Name: BIOSCAN
  • Login: bioscan-ml
  • Kind: organization
  • Email: contact@bioscancanada.org

Illuminating biodiversity with DNA-based identification systems

GitHub Events

Total
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 19
  • Public event: 1
  • Create event: 1
Last Year
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 19
  • Public event: 1
  • Create event: 1

Dependencies

.github/workflows/pre-commit.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pre-commit/action v3.0.0 composite
CC_requirements.txt pypi
  • accelerate ==0.25.0
  • black ==22.8.0
  • boto3 ==1.28.57
  • botocore ==1.31.57
  • cmake *
  • einops ==0.6
  • matplotlib *
  • numpy ==1.25.2
  • omegaconf ==2.3
  • opt-einsum ==3.3
  • pandas ==2.1
  • peft ==0.5
  • scikit-learn ==1.3
  • scipy ==1.12
  • seaborn *
  • torch ==2.1.1
  • torchtext ==0.16.1
  • transformers ==4.29.2
  • umap-learn ==0.5.6
  • wandb *
frozen_requirements.txt pypi
  • accelerate ==0.23.0
  • boto3 ==1.28.51
  • botocore ==1.31.51
  • cmake ==3.27.5
  • einops ==0.6.1
  • evaluate ==0.4.0
  • matplotlib *
  • numpy ==1.26.0
  • omegaconf ==2.3.0
  • opt-einsum ==3.3.0
  • pandas ==2.1.1
  • peft ==0.5.0
  • scikit-learn ==1.3.0
  • scipy ==1.11.2
  • seaborn *
  • torch ==2.0.1
  • torchtext ==0.15.2
  • torchvision ==0.15.2
  • transformers ==4.29.2
  • umap-learn ==0.5.3
pyproject.toml pypi
  • accelerate >=0.23.0
  • boto3 >=1.28.51
  • botocore >=1.31.51
  • cmake >=3.27.5
  • einops >=0.6.1
  • matplotlib *
  • numpy >=1.25.0
  • omegaconf >=2.3.0
  • opt-einsum >=3.3.0
  • pandas >=2.1.1
  • peft >=0.5.0
  • scikit-learn >=1.3.0
  • scipy >=1.11.2
  • seaborn *
  • torch >=2.0.1
  • torchtext >=0.15.2
  • torchvision >=0.15.2
  • transformers >=4.29.2
  • umap-learn >=0.5.3
  • wandb *
requirements.txt pypi
  • accelerate ==0.25.0
  • black ==22.8.0
  • boto3 ==1.28.57
  • botocore ==1.31.57
  • cmake *
  • einops ==0.6
  • matplotlib *
  • numpy ==1.25.2
  • omegaconf ==2.3
  • opt-einsum ==3.3
  • pandas ==2.1
  • peft ==0.5
  • scikit-learn ==1.3
  • scipy ==1.12
  • seaborn *
  • torch ==2.1.1
  • torchdata ==0.7.1
  • torchtext ==0.16.1
  • transformers ==4.29.2
  • umap-learn ==0.5.6
  • wandb *