https://github.com/bioscan-ml/barcodebert

A pre-trained representation from a transformers model for inference on insect DNA barcoding data.

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.3%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: bioscan-ml
  • License: MIT
  • Language: HTML
  • Default Branch: main
  • Size: 25.4 MB
Statistics
  • Stars: 13
  • Watchers: 4
  • Forks: 5
  • Open Issues: 3
  • Releases: 1
Created about 3 years ago · Last pushed 7 months ago
Metadata Files
Readme License

README.md

BarcodeBERT

A pre-trained transformer model for inference on insect DNA barcoding data.

Check out our paper (arXiv:2311.02401).

Using the model

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "bioscan-ml/BarcodeBERT", trust_remote_code=True
)

# Load the model
model = AutoModel.from_pretrained("bioscan-ml/BarcodeBERT", trust_remote_code=True)

# Sample sequence
dna_seq = "ACGCGCTGACGCATCAGCATACGA"

# Tokenize
input_seq = tokenizer(dna_seq, return_tensors="pt")["input_ids"]

# Pass through the model
output = model(input_seq.unsqueeze(0))["hidden_states"][-1]

# Compute Global Average Pooling
features = output.mean(1)
```
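The final pooling step averages the last-layer hidden states over the sequence dimension, so that one vector per token becomes a single fixed-size vector per sequence. The following is a minimal sketch of that operation in plain Python. The tiny input (one sequence, three tokens, two features) is made up for illustration, and `global_average_pool` is a hypothetical helper, not part of the BarcodeBERT API.

```python
def global_average_pool(hidden_states):
    """Average token vectors over the sequence dimension.

    hidden_states is a nested list shaped (batch, tokens, features),
    mirroring what `output.mean(1)` does on a tensor of that shape.
    """
    pooled = []
    for sequence in hidden_states:
        n_tokens = len(sequence)
        n_features = len(sequence[0])
        pooled.append(
            [sum(tok[j] for tok in sequence) / n_tokens for j in range(n_features)]
        )
    return pooled


# Made-up example: one sequence, three tokens, two features per token.
toy_output = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]]
features = global_average_pool(toy_output)
print(features)  # [[3.0, 4.0]]
```

A plain mean treats every position equally, so any special or padding tokens present in the input contribute to the average as well.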

Reproducing the results from the paper

  1. Clone this repository and install the required libraries by running:

```shell
pip install -e .
```

  2. Download the data from our Hugging Face Dataset repository:

```shell
cd data/
python download_HF_CanInv.py
```

Optional: You can also download the first version of the data:

```shell
wget https://vault.cs.uwaterloo.ca/s/x7gXQKnmRX3GAZm/download -O data.zip
unzip data.zip
mv new_data/* data/
rm -r new_data
rm data.zip
```

  3. DNA foundation model baselines: The desired backbone can be selected using one of the following keywords:
    BarcodeBERT, NT, Hyena_DNA, DNABERT, DNABERT-2, DNABERT-S

```bash
python baselines/knn_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/linear_probing.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
python baselines/finetuning.py --backbone=<DESIRED-BACKBONE> --data-dir=data/ --batch_size=32
python baselines/zsc.py --backbone=<DESIRED-BACKBONE> --data-dir=data/
```

Note: The DNABERT model has to be downloaded manually following the instructions in the paper's repo and placed in the pretrained-models folder.

  4. Supervised CNN

```bash
python baselines/cnn/1D_CNN_supervised.py
python baselines/cnn/1D_CNN_KNN.py
python baselines/cnn/1D_CNN_Linear_probing.py
python baselines/cnn/1D_CNN_ZSC.py
```

**Note**: Train the CNN backbone with `1D_CNN_supervised.py` before evaluating it on any downstream task.

  5. BLAST

```shell
cd data/
python to_fasta.py --input_file=supervised_train.csv &&
python to_fasta.py --input_file=supervised_test.csv &&
python to_fasta.py --input_file=unseen.csv

makeblastdb -in supervised_train.fas -title train -dbtype nucl -out train.fas
blastn -query supervised_test.fas -db train.fas -out results_supervised_test.tsv -outfmt 6 -num_threads 16
blastn -query unseen.fas -db train.fas -out results_unseen.tsv -outfmt 6 -num_threads 16
```
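With `-outfmt 6`, `blastn` writes one tab-separated line per alignment using twelve default columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below shows one way the best hit per query could be pulled out of such a file. The `best_hits` helper and the sample rows are hypothetical, and the source does not show how the repository turns BLAST hits into predictions, so treat this as an illustration only.

```python
import csv
import io

# Default column order for BLAST tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Return {query_id: (subject_id, bitscore)}, keeping the highest bitscore."""
    best = {}
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank or comment lines
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or score > current[1]:
            best[hit["qseqid"]] = (hit["sseqid"], score)
    return best


# Made-up rows standing in for a results file.
sample = io.StringIO(
    "q1\ts1\t99.0\t658\t6\t0\t1\t658\t1\t658\t0.0\t1180\n"
    "q1\ts2\t95.0\t658\t33\t0\t1\t658\t1\t658\t0.0\t1030\n"
    "q2\ts3\t90.0\t650\t65\t0\t1\t650\t1\t650\t0.0\t850\n"
)
hits = best_hits(sample)
print(hits)
```

To run it on real output, pass an open file handle for a `.tsv` file produced by `blastn` instead of the in-memory sample.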

Pretrain BarcodeBERT

To pretrain the model you can run the following command:

```bash
python barcodebert/pretraining.py --dataset=CANADA-1.5M \
    --k_mer=4 \
    --n_layers=4 \
    --n_heads=4 \
    --data_dir=data/ \
    --checkpoint=model_checkpoints/CANADA-1.5M/4_4_4/checkpoint_pretraining.pt
```
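The `--k_mer=4` option sets the length of the k-mers used as tokens. As an illustration of what a k-mer split looks like, here is a minimal sketch that assumes non-overlapping k-mers and drops any incomplete trailing chunk. Both choices are assumptions made for this example, and `kmer_split` is a hypothetical helper rather than the repository's tokenizer.

```python
def kmer_split(sequence, k=4):
    """Split a DNA string into consecutive, non-overlapping k-mers.

    Assumption for this sketch: a trailing chunk shorter than k is dropped.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    usable = len(sequence) - len(sequence) % k
    return [sequence[i:i + k] for i in range(0, usable, k)]


tokens = kmer_split("ACGCGCTGACGCATCAGCATACGA", k=4)
print(tokens)
```

For the 24-base sample sequence this yields six 4-mers; a real tokenizer would additionally map each k-mer to a vocabulary index and handle unknown or ambiguous bases.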

Citation

If you find BarcodeBERT useful in your research please consider citing:

```bibtex
@article{arias2023barcodebert,
  title={{BarcodeBERT}: Transformers for Biodiversity Analysis},
  author={Pablo Millan Arias and Niousha Sadjadi and Monireh Safari and ZeMing Gong and Austin T. Wang and Joakim Bruslund Haurum and Iuliia Zarubiieva and Dirk Steinke and Lila Kari and Angel X. Chang and Scott C. Lowe and Graham W. Taylor},
  journal={arXiv preprint arXiv:2311.02401},
  year={2023},
  eprint={2311.02401},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  doi={10.48550/arxiv.2311.02401},
}
```

Owner

  • Name: BIOSCAN
  • Login: bioscan-ml
  • Kind: organization
  • Email: contact@bioscancanada.org

Illuminating biodiversity with DNA-based identification systems

GitHub Events

Total
  • Create event: 7
  • Release event: 1
  • Issues event: 1
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 6
  • Member event: 1
  • Push event: 27
  • Pull request review comment event: 5
  • Pull request review event: 9
  • Pull request event: 14
  • Fork event: 2
Last Year
  • Create event: 7
  • Release event: 1
  • Issues event: 1
  • Watch event: 4
  • Delete event: 2
  • Issue comment event: 6
  • Member event: 1
  • Push event: 27
  • Pull request review comment event: 5
  • Pull request review event: 9
  • Pull request event: 14
  • Fork event: 2

Issues and Pull Requests

All Time
  • Total issues: 1
  • Total pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 2 months
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.63
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 1
  • Pull requests: 7
  • Average time to close issues: N/A
  • Average time to close pull requests: 7 days
  • Issue authors: 1
  • Pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.71
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 3
Top Authors
Issue Authors
  • ychuest (1)
Pull Request Authors
  • pre-commit-ci[bot] (3)
  • scottclowe (2)
  • NotMyLyfe (1)
Top Labels
Issue Labels
Pull Request Labels
  • documentation (1)