https://github.com/bioscan-ml/barcodebert
A pre-trained representation from a transformers model for inference on insect DNA barcoding data.
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to: arxiv.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (10.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bioscan-ml
- License: mit
- Language: HTML
- Default Branch: main
- Size: 25.4 MB
Statistics
- Stars: 13
- Watchers: 4
- Forks: 5
- Open Issues: 3
- Releases: 1
Metadata Files
README.md
BarcodeBERT
A pre-trained transformer model for inference on insect DNA barcoding data.
Check out our paper
Using the model
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "bioscan-ml/BarcodeBERT", trust_remote_code=True
)

# Load the model
model = AutoModel.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)

# Sample sequence
dna_seq = "ACGCGCTGACGCATCAGCATACGA"

# Tokenize
input_seq = tokenizer(dna_seq, return_tensors="pt")["input_ids"]

# Pass through the model
output = model(input_seq.unsqueeze(0))["hidden_states"][-1]

# Compute Global Average Pooling
features = output.mean(1)
```
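The final step above averages the hidden states over every token position. When several sequences of different lengths are padded into one batch, it can be preferable to average only over real tokens. The following is a dependency-free sketch of that idea on plain Python lists; the function name `masked_mean_pool` and the 0/1 mask are illustrative assumptions, and whether the BarcodeBERT tokenizer pads its output or returns an attention mask is not stated here.

```python
def masked_mean_pool(hidden, mask):
    """Average token vectors over the positions where mask is 1.

    hidden: list of token vectors (each a list of floats) for one sequence.
    mask:   list of 0/1 flags, 1 for real tokens and 0 for padding.
    Both names are hypothetical and not part of the BarcodeBERT API.
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    kept = [vec for vec, flag in zip(hidden, mask) if flag]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy input: 3 token positions, 2 hidden dimensions, last position is padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean_pool(hidden, mask)  # the padded position is ignored
```

With a mask of all ones this reduces to the plain mean over positions, which is what `output.mean(1)` computes for a single unpadded sequence.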
Reproducing the results from the paper
Clone this repository and install the required libraries by running
```shell
pip install -e .
```
Download the data from our Hugging Face Dataset repository
```shell
cd data/
python download_HF_CanInv.py
```
Optional: You can also download the first version of the data
```shell
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
```
DNA foundation model baselines: The desired backbone can be selected using one of the following keywords:
`BarcodeBERT, NT, Hyena_DNA, DNABERT, DNABERT-2, DNABERT-S`
```bash
python baselines/knn_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/linear_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/finetuning.py --backbone=<DESIRED-BACKBONE> --data-dir=data/ --batch_size=32
python baselines/zsc.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
```
**Note**: The DNABERT model has to be downloaded manually following the instructions in the paper's repo and placed in the `pretrained-models` folder.

Supervised CNN
```bash
python baselines/cnn/1D_CNN_supervised.py
python baselines/cnn/1D_CNN_KNN.py
python baselines/cnn/1D_CNN_Linear_probing.py
python baselines/cnn/1D_CNN_ZSC.py
```
**Note**: Train the CNN backbone with `1D_CNN_supervised.py` before evaluating it on any downtream task.
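As a rough illustration of what kNN probing generally means, the sketch below labels a query embedding with the label of its most similar training embedding under cosine similarity. It is not the repository's script: the choice of k = 1, the cosine metric, the function names, and the toy embeddings and labels are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbour_label(query, train_embeddings, train_labels):
    """Return the label of the training embedding most similar to the query (1-NN)."""
    best = max(
        range(len(train_embeddings)),
        key=lambda i: cosine_similarity(query, train_embeddings[i]),
    )
    return train_labels[best]


# Toy 2-dimensional embeddings with made-up labels.
train_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
train_labels = ["genus_A", "genus_B", "genus_A"]
predicted = nearest_neighbour_label([0.2, 0.8], train_embeddings, train_labels)
```

The backbone stays frozen in this kind of evaluation, so the result reflects the quality of the embeddings rather than of any classifier trained on top of them.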
- BLAST
```shell
cd data/
python to_fasta.py --input_file=supervised_train.csv &&
python to_fasta.py --input_file=supervised_test.csv &&
python to_fasta.py --input_file=unseen.csv

makeblastdb -in supervised_train.fas -title train -dbtype nucl -out train.fas
blastn -query supervised_test.fas -db train.fas -out results_supervised_test.tsv -outfmt 6 -num_threads 16
blastn -query unseen.fas -db train.fas -out results_unseen.tsv -outfmt 6 -num_threads 16
```
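The `blastn` commands above write tabular output (`-outfmt 6`), which by default has twelve tab-separated columns per hit. One possible way to reduce such a file to a single best hit per query is sketched below. The function name `best_hits`, the choice of bitscore as the ranking criterion, and the example rows are assumptions made for illustration; they are not taken from the repository or from real results.

```python
import csv
import io

# The 12 default columns of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Map each query ID to (subject ID, percent identity, bitscore) of its top-scoring hit."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        query = row["qseqid"]
        # Keep the hit only if it beats the best bitscore seen so far for this query.
        if query not in best or score > best[query][2]:
            best[query] = (row["sseqid"], float(row["pident"]), score)
    return best


# Made-up rows standing in for a real results file.
example = io.StringIO(
    "q1\ts1\t98.5\t650\t10\t0\t1\t650\t1\t650\t0.0\t1150\n"
    "q1\ts2\t91.0\t650\t58\t0\t1\t650\t1\t650\t0.0\t880\n"
    "q2\ts3\t99.2\t640\t5\t0\t1\t640\t1\t640\t0.0\t1160\n"
)
hits = best_hits(example)
```

To run it on real output, pass an open file handle for whichever file was given to `-out` instead of the `io.StringIO` object.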
Pretrain BarcodeBERT
To pretrain the model you can run the following command:
```bash
python barcodebert/pretraining.py \
    --dataset=CANADA-1.5M \
    --k_mer=4 \
    --n_layers=4 \
    --n_heads=4 \
    --data_dir=data/ \
    --checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
```
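To give a sense of what a k-mer length of 4 implies for the input, here is a minimal sketch of k-mer tokenization. It assumes that `--k_mer` sets the length of the k-mers and that the sequence is split into consecutive, non-overlapping k-mers. The function `kmer_tokenize` is hypothetical, and the model's real tokenizer may treat trailing bases, ambiguous nucleotides, and special tokens differently.

```python
def kmer_tokenize(sequence, k=4):
    """Split a DNA sequence into consecutive, non-overlapping k-mers.

    Trailing bases that do not fill a complete k-mer are dropped in this sketch.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, k)]


# The 24-base sample sequence from the usage example yields six 4-mers.
tokens = kmer_tokenize("ACGCGCTGACGCATCAGCATACGA", k=4)
```

Under these assumptions a sequence of length n becomes n // k tokens, so a larger k shortens the input the transformer has to process while enlarging the vocabulary of possible tokens.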
Citation
If you find BarcodeBERT useful in your research please consider citing:
```bibtex
@article{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias
    and Niousha Sadjadi
    and Monireh Safari
    and ZeMing Gong
    and Austin T. Wang
    and Joakim Bruslund Haurum
    and Iuliia Zarubiieva
    and Dirk Steinke
    and Lila Kari
    and Angel X. Chang
    and Scott C. Lowe
    and Graham W. Taylor
  },
  journal={arXiv preprint arXiv:2311.02401},
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}
```
Owner
- Name: BIOSCAN
- Login: bioscan-ml
- Kind: organization
- Email: contact@bioscancanada.org
- Website: https://biodiversitygenomics.net/research/bioscan/
- Repositories: 1
- Profile: https://github.com/bioscan-ml
Illuminating biodiversity with DNA-based identification systems
GitHub Events
Total
- Create event: 7
- Release event: 1
- Issues event: 1
- Watch event: 4
- Delete event: 2
- Issue comment event: 6
- Member event: 1
- Push event: 27
- Pull request review comment event: 5
- Pull request review event: 9
- Pull request event: 14
- Fork event: 2
Last Year
- Create event: 7
- Release event: 1
- Issues event: 1
- Watch event: 4
- Delete event: 2
- Issue comment event: 6
- Member event: 1
- Push event: 27
- Pull request review comment event: 5
- Pull request review event: 9
- Pull request event: 14
- Fork event: 2
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 8
- Average time to close issues: N/A
- Average time to close pull requests: 2 months
- Total issue authors: 1
- Total pull request authors: 4
- Average comments per issue: 0.0
- Average comments per pull request: 0.63
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 3
Past Year
- Issues: 1
- Pull requests: 7
- Average time to close issues: N/A
- Average time to close pull requests: 7 days
- Issue authors: 1
- Pull request authors: 4
- Average comments per issue: 0.0
- Average comments per pull request: 0.71
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 3
Top Authors
Issue Authors
- ychuest (1)
Pull Request Authors
- pre-commit-ci[bot] (3)
- scottclowe (2)
- NotMyLyfe (1)