https://github.com/bioscan-ml/barcodemae

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bioscan-ml
  • License: MIT
  • Language: Python
  • Default Branch: main
  • Size: 2.49 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 1
  • Releases: 0
Created 11 months ago · Last pushed 10 months ago
Metadata Files
README · License

README.md

BarcodeMAE

A PyTorch implementation of BarcodeMAE, a model for enhancing DNA foundation models to address masking inefficiencies.


Check out our paper: Enhancing DNA Foundation Models to Address Masking Inefficiencies (arXiv:2502.18405)

Model checkpoint is available here: BarcodeMAE

Quick start

Use this Jupyter notebook to get started: Quick start

Setup

  1. Clone this repository
  2. Install the required libraries

```shell
pip install -r requirements.txt
pip install -e .
```

Preparing the data

  1. Download the metadata file and copy it into the data folder
  2. Split the metadata file into smaller files according to the different partitions as presented in the BIOSCAN-5M paper

```shell
cd data/
python data_split.py BIOSCAN-5M_Dataset_metadata.tsv
```
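The split logic lives in `data_split.py`; as a minimal sketch of what partitioning a metadata table looks like, the snippet below groups rows by a partition label. The column name `split` and the partition values are assumptions for illustration, not the script's actual schema (in practice the rows would come from `csv.DictReader` over the TSV).

```python
from collections import defaultdict

def split_metadata(rows, split_field="split"):
    """Group metadata rows by their partition label.

    The field name `split` is an illustrative assumption; the real
    metadata file defines its own partition column.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[split_field]].append(row)
    return partitions

# Toy rows standing in for lines of the metadata TSV.
rows = [
    {"processid": "A1", "split": "pretrain"},
    {"processid": "B2", "split": "seen"},
    {"processid": "C3", "split": "pretrain"},
]
parts = split_metadata(rows)
```

Each partition can then be written back out as its own smaller file.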

Reproducing the results

  1. Download the checkpoint and copy it to the model_checkpoints directory
  2. Run KNN evaluation

```shell
python barcodebert/knn_probing.py \
    --run-name knn_evaluation \
    --data-dir ./data/ \
    --pretrained-checkpoint "./model_checkpoints/best_pretraining.pt" \
    --log-wandb \
    --dataset BIOSCAN-5M
```
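KNN probing evaluates frozen embeddings by classifying each query with the labels of its nearest reference embeddings. The real script handles data loading, checkpoint restoration, and logging; the sketch below shows only the core idea in pure Python, with toy embeddings and label names that are purely illustrative.

```python
from collections import Counter

def knn_predict(train_embs, train_labels, query, k=1):
    """Predict a label for `query` by majority vote over its k nearest
    reference embeddings (squared Euclidean distance)."""
    dists = [
        (sum((a - b) ** 2 for a, b in zip(emb, query)), label)
        for emb, label in zip(train_embs, train_labels)
    ]
    dists.sort(key=lambda t: t[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D "embeddings" with made-up genus labels.
train_embs = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.1)]
train_labels = ["genus_a", "genus_b", "genus_b"]
pred = knn_predict(train_embs, train_labels, (1.0, 0.9), k=3)
```

In the actual evaluation the embeddings come from the pretrained encoder checkpoint rather than being hand-written.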

Pretraining from scratch

  1. Run pretraining

```shell
python barcodebert/pretraining.py \
    --dataset=BIOSCAN-5M \
    --k_mer=6 \
    --n_layers=6 \
    --n_heads=6 \
    --decoder-n-layers=6 \
    --decoder-n-heads=6 \
    --data_dir=data/ \
    --checkpoint=model_checkpoints/BIOSCAN-5M/6-6-6/model_checkpoint.pt
```
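Pretraining masks a fraction of the k-mer tokens in each barcode sequence and trains the model to reconstruct them. As a schematic of that masking step only, the sketch below replaces random tokens with a mask symbol; the mask token name, ratio, and 6-mer tokens are illustrative assumptions, not BarcodeMAE's exact configuration.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_ratio=0.5, seed=0):
    """Randomly replace a fraction of k-mer tokens with `mask_token`,
    returning the masked sequence and the indices to be predicted."""
    rng = random.Random(seed)
    n_mask = int(len(tokens) * mask_ratio)
    idx = sorted(rng.sample(range(len(tokens)), n_mask))
    hidden = set(idx)
    masked = [mask_token if i in hidden else t for i, t in enumerate(tokens)]
    return masked, idx

# A toy DNA barcode tokenized into 6-mers (k_mer=6, as in the command above).
tokens = ["ACGTAC", "GTACGT", "TACGTA", "CGTACG"]
masked, idx = mask_tokens(tokens, mask_ratio=0.5)
```

During training, the loss is computed only at the masked positions, which is the reconstruction target the encoder-decoder pair learns from.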

Citation

If you find BarcodeMAE useful in your research, please consider citing:

```bibtex
@article{safari2025barcodemae,
  title={Enhancing DNA Foundation Models to Address Masking Inefficiencies},
  author={Monireh Safari and Pablo Millan Arias and Scott C. Lowe and Lila Kari and Angel X. Chang and Graham W. Taylor},
  journal={arXiv preprint arXiv:2502.18405},
  year={2025},
  eprint={2502.18405},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arXiv.2502.18405},
}
```

Owner

  • Name: BIOSCAN
  • Login: bioscan-ml
  • Kind: organization
  • Email: contact@bioscancanada.org

Illuminating biodiversity with DNA-based identification systems

GitHub Events

Total
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 19
  • Public event: 1
  • Create event: 1
Last Year
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 1
  • Push event: 19
  • Public event: 1
  • Create event: 1

Dependencies

.github/workflows/pre-commit.yaml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pre-commit/action v3.0.0 composite
CC_requirements.txt pypi
  • accelerate ==0.25.0
  • black ==22.8.0
  • boto3 ==1.28.57
  • botocore ==1.31.57
  • cmake *
  • einops ==0.6
  • matplotlib *
  • numpy ==1.25.2
  • omegaconf ==2.3
  • opt-einsum ==3.3
  • pandas ==2.1
  • peft ==0.5
  • scikit-learn ==1.3
  • scipy ==1.12
  • seaborn *
  • torch ==2.1.1
  • torchtext ==0.16.1
  • transformers ==4.29.2
  • umap-learn ==0.5.6
  • wandb *
frozen_requirements.txt pypi
  • accelerate ==0.23.0
  • boto3 ==1.28.51
  • botocore ==1.31.51
  • cmake ==3.27.5
  • einops ==0.6.1
  • evaluate ==0.4.0
  • matplotlib *
  • numpy ==1.26.0
  • omegaconf ==2.3.0
  • opt-einsum ==3.3.0
  • pandas ==2.1.1
  • peft ==0.5.0
  • scikit-learn ==1.3.0
  • scipy ==1.11.2
  • seaborn *
  • torch ==2.0.1
  • torchtext ==0.15.2
  • torchvision ==0.15.2
  • transformers ==4.29.2
  • umap-learn ==0.5.3
pyproject.toml pypi
  • accelerate >=0.23.0
  • boto3 >=1.28.51
  • botocore >=1.31.51
  • cmake >=3.27.5
  • einops >=0.6.1
  • matplotlib *
  • numpy >=1.25.0
  • omegaconf >=2.3.0
  • opt-einsum >=3.3.0
  • pandas >=2.1.1
  • peft >=0.5.0
  • scikit-learn >=1.3.0
  • scipy >=1.11.2
  • seaborn *
  • torch >=2.0.1
  • torchtext >=0.15.2
  • torchvision >=0.15.2
  • transformers >=4.29.2
  • umap-learn >=0.5.3
  • wandb *
requirements.txt pypi
  • accelerate ==0.25.0
  • black ==22.8.0
  • boto3 ==1.28.57
  • botocore ==1.31.57
  • cmake *
  • einops ==0.6
  • matplotlib *
  • numpy ==1.25.2
  • omegaconf ==2.3
  • opt-einsum ==3.3
  • pandas ==2.1
  • peft ==0.5
  • scikit-learn ==1.3
  • scipy ==1.12
  • seaborn *
  • torch ==2.1.1
  • torchdata ==0.7.1
  • torchtext ==0.16.1
  • transformers ==4.29.2
  • umap-learn ==0.5.6
  • wandb *