biologicaltokenizers

Effect of tokenization on transformers for biological sequences

https://github.com/technion-cs-nlp/biologicaltokenizers

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.3%) to scientific vocabulary

Keywords

bioinformatics dna-sequences nlp protein-sequences tokenizer transformers
Last synced: 6 months ago

Repository

Effect of tokenization on transformers for biological sequences

Basic Info
Statistics
  • Stars: 16
  • Watchers: 0
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
bioinformatics dna-sequences nlp protein-sequences tokenizer transformers
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme Citation

README.md

Effect of Tokenization on Transformers for Biological Sequences

Abstract:

Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

Different tokenization algorithms can be applied to biological sequences, as exemplified for the sequence “AAGTCAAGGATC”. (a) The baseline “words” tokenizer assumes a dictionary consisting of the nucleotides: “A”, “C”, “G” and “T”. The length of the encoded sequence is 12, i.e., the number of nucleotides; (b) The “pairs” tokenizer assumes a dictionary consisting of all possible nucleotide pairs. The length of the encoded sequences is typically halved; (c) A sophisticated dictionary consisting of only three tokens: “AAG”, “TC” and “GA”. The encoded sequence for this dictionary contains only five tokens.
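
As a toy illustration of the dictionaries in the figure, the sketch below segments the example sequence with a simple greedy longest-match lookup and prints the resulting token counts. This only illustrates how a fixed dictionary changes the encoded length; it is not the BPE, WordPiece, or Unigram algorithms studied in the paper.

```python
# Toy illustration: greedy longest-match segmentation of a DNA sequence
# against the three example dictionaries from the figure above.

def greedy_segment(sequence, vocab):
    """Repeatedly take the longest vocabulary entry that prefixes the remaining sequence."""
    tokens, i = [], 0
    while i < len(sequence):
        match = next(
            (sequence[i:i + n] for n in range(len(sequence) - i, 0, -1)
             if sequence[i:i + n] in vocab),
            None,
        )
        if match is None:
            raise ValueError(f"cannot tokenize position {i} of {sequence!r}")
        tokens.append(match)
        i += len(match)
    return tokens

seq = "AAGTCAAGGATC"
dictionaries = {
    "words":   {"A", "C", "G", "T"},                     # (a) single nucleotides -> 12 tokens
    "pairs":   {a + b for a in "ACGT" for b in "ACGT"},  # (b) all nucleotide pairs -> 6 tokens
    "learned": {"AAG", "TC", "GA"},                      # (c) learned dictionary -> 5 tokens
}
for name, vocab in dictionaries.items():
    tokens = greedy_segment(seq, vocab)
    print(f"{name:8s} {len(tokens):2d} tokens: {tokens}")
```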

Data:

The "data" folder contains the train, validation, and test splits for seven of the eight datasets used in the paper.

BFD Tokenizers:

We trained BPE, WordPiece, and Unigram tokenizers on samples of proteins from the 2.2 billion protein sequences of the BFD dataset (Steinegger and Söding 2018). We evaluated the average sequence length, in tokens, as a function of the vocabulary size and the number of sequences in the training data.

(Tables: BFD_BPE_table, BFD_WPC_table, BFD_UNI_table)

Effect of vocabulary size and number of training samples on the three tokenizers: BPE, WordPiece and Unigram. The darker the color, the higher the average number of tokens per protein. Increasing the vocabulary size and the training size reduces the number of tokens per protein for all of the tested tokenizers.
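
For readers who want to reproduce this kind of measurement on their own data, the sketch below trains a BPE tokenizer with the Hugging Face tokenizers library on a plain-text file of protein sequences (one sequence per line; the file name proteins.txt is a placeholder) and reports the average number of tokens per protein. The paper's own training pipeline and exact settings may differ.

```python
# Minimal sketch, assuming protein sequences stored one per line in a text file
# (proteins.txt is a placeholder name); the paper's exact settings may differ.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(corpus_file, vocab_size):
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]", "[PAD]"])
    tokenizer.train([corpus_file], trainer)
    return tokenizer

def average_tokens_per_sequence(tokenizer, corpus_file):
    lengths = []
    with open(corpus_file) as handle:
        for line in handle:
            sequence = line.strip()
            if sequence:
                lengths.append(len(tokenizer.encode(sequence).ids))
    return sum(lengths) / len(lengths)

if __name__ == "__main__":
    tok = train_bpe("proteins.txt", vocab_size=3000)
    print("average tokens per protein:", average_tokens_per_sequence(tok, "proteins.txt"))
```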

The "BFDTokenizers" folder contains the tokenizers trained on the BFD dataset. The path to a tokenizer follows the pattern: "/BFDTokenizers/<NUMBER OF TRAINING SAMPLES>/<TOKENIZER TYPE>/<VOCABULARY SIZE>"
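
Assuming each of these directories stores a tokenizer in the Hugging Face tokenizers JSON format (an assumption; check the actual files in the repository), a trained BFD tokenizer could be loaded and applied roughly as follows. The concrete path components are placeholders following the pattern above.

```python
# Hedged sketch: load one of the pre-trained BFD tokenizers, assuming each
# directory contains a Hugging Face tokenizers JSON file (e.g. tokenizer.json).
# The path components below are placeholders for
# <NUMBER OF TRAINING SAMPLES>/<TOKENIZER TYPE>/<VOCABULARY SIZE>.
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("BFDTokenizers/1000000/BPE/3000/tokenizer.json")
encoding = tokenizer.encode("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(encoding.tokens)    # multi-residue tokens instead of single amino acids
print(len(encoding.ids))  # typically far fewer tokens than amino acids
```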

Training Script:

You can use the provided script train_tokenizer_bert.py to perform the training and evaluation.

Usage Flags

The script supports various flags for customization:

  • --tokenizer-type (-t): Choose the type of tokenizer to use. Options include:

    • "BPE" (Byte Pair Encoding; Sennrich, Haddow, and Birch 2016)
    • "WPC" (WordPiece; Schuster and Nakajima 2012)
    • "UNI" (Unigram; Kudo 2018)
    • "WORDS" (each token is a single character)
    • "PAIRS" (each token is a pair of two characters)
  • --vocab-size (-s): Set the vocabulary size for the tokenizer. (Used only when tokenizer type is "BPE", "WPC", or "UNI").

  • --results-path (-r): Specify the path to save the tokenizer, transformer, and results.

  • --layers-num (-l): Define the number of BERT layers.

  • --attention-heads-num (-a): Set the number of BERT attention heads.

  • --hidden-size (-z): Specify the hidden size of the BERT layers.

  • --data-path (-d): Provide the path to the folder containing three files: train.csv, valid.csv, and test.csv. For the datasets used in our paper, you may download them from the "data" folder.

  • --epochs (-e): Define the number of training epochs.

  • --print-training-loss (-p): Specify how often (in steps) to print the training loss.

  • --task-type (-y): Choose the task type (see the sketch after this list):

    • "REGRESSION" (for regression datasets, i.e., predicting a score)
    • "CLASSIFICATION" (for classification datasets, i.e., predicting a class).
  • --max-length (-m): Set the maximum number of tokens per sequence.

  • --learning-rate (-lr): Set the learning rate for the model training.
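
As referenced in the --task-type item above, here is a rough sketch of how the two task types typically map onto a BERT head when using the Hugging Face transformers library. The build_model helper and the example values below are hypothetical illustrations, not the repository's code: regression uses a single output trained with an MSE loss, while classification uses one logit per class trained with cross-entropy.

```python
# Hypothetical illustration (not the repository's code) of how --task-type
# might select the model head with Hugging Face transformers.
from transformers import BertConfig, BertForSequenceClassification

def build_model(task_type, num_classes, layers_num, attention_heads_num, hidden_size, vocab_size):
    config = BertConfig(
        vocab_size=vocab_size,
        num_hidden_layers=layers_num,
        num_attention_heads=attention_heads_num,
        hidden_size=hidden_size,
        intermediate_size=4 * hidden_size,
        # REGRESSION: a single output trained with MSE ("regression" problem type);
        # CLASSIFICATION: one logit per class trained with cross-entropy.
        num_labels=1 if task_type == "REGRESSION" else num_classes,
        problem_type="regression" if task_type == "REGRESSION" else "single_label_classification",
    )
    return BertForSequenceClassification(config)

# e.g. a 6-layer, 8-head, 256-dimensional model over a 3,000-token vocabulary
model = build_model("CLASSIFICATION", num_classes=10, layers_num=6,
                    attention_heads_num=8, hidden_size=256, vocab_size=3000)
```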

Example Usage:

```
# running the SuperFamily classification training with a "BPE" tokenizer of 3,000 tokens
python train_tokenizer_bert.py --tokenizer-type BPE --vocab-size 3000 --results-path ./results_SuperFamily --layers-num 6 --attention-heads-num 8 --hidden-size 256 --data-path ./data/SuperFamily/ --epochs 10 --print-training-loss 1000 --task-type CLASSIFICATION --max-length 128

# running the fluorescence prediction training with a "PAIRS" tokenizer
python train_tokenizer_bert.py --tokenizer-type PAIRS --results-path ./results_fluorescence --layers-num 2 --attention-heads-num 4 --hidden-size 128 --data-path ./data/fluorescence/ --epochs 30 --print-training-loss 100 --task-type REGRESSION --max-length 256 --learning-rate 0.001

# running the stability prediction training with a "WPC" tokenizer of 200 tokens
python train_tokenizer_bert.py --tokenizer-type WPC --vocab-size 200 --results-path ./results_stability --layers-num 6 --attention-heads-num 4 --hidden-size 128 --data-path ./data/stability/ --epochs 15 --print-training-loss 1000 --task-type REGRESSION --max-length 512 --learning-rate 0.000001
```

APA

Edo Dotan, Gal Jaschek, Tal Pupko, Yonatan Belinkov, Effect of tokenization on transformers for biological sequences, Bioinformatics, 2024, btae196, https://doi.org/10.1093/bioinformatics/btae196

BibTeX

@article{Dotan_Effect_of_Tokenization_2024,
  author  = {Dotan, Edo and Jaschek, Gal and Pupko, Tal and Belinkov, Yonatan},
  title   = "{Effect of tokenization on transformers for biological sequences}",
  journal = {Bioinformatics},
  pages   = {btae196},
  year    = {2024},
  month   = {04},
  abstract = "{Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. Code, data and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.}",
  issn    = {1367-4811},
  doi     = {10.1093/bioinformatics/btae196},
  url     = {https://doi.org/10.1093/bioinformatics/btae196},
  eprint  = {https://academic.oup.com/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btae196/57226869/btae196.pdf},
}

Owner

  • Name: technion-cs-nlp
  • Login: technion-cs-nlp
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this repo, please cite it as below."
authors:
- family-names: "Dotan"
  given-names: "Edo"
- family-names: "Jaschek"
  given-names: "Gal"
  orcid: "https://orcid.org/0000-0001-7555-4575"
- family-names: "Pupko"
  given-names: "Tal"
  orcid: "https://orcid.org/0000-0001-9463-2575"
- family-names: "Belinkov"
  given-names: "Yonatan"
title: "Effect of Tokenization on Transformers for Biological Sequences"
version: 1.0.0
doi: 10.1093/bioinformatics/btae196
date-released: 2023-04-12
url: "https://github.com/technion-cs-nlp/BiologicalTokenizers"
preferred-citation: 
  type: article
  authors:
  - family-names: "Dotan"
    given-names: "Edo"
  - family-names: "Jaschek"
    given-names: "Gal"
    orcid: "https://orcid.org/0000-0001-7555-4575"
  - family-names: "Pupko"
    given-names: "Tal"
    orcid: "https://orcid.org/0000-0001-9463-2575"
  - family-names: "Belinkov"
    given-names: "Yonatan"
  doi: "10.1093/bioinformatics/btae196"
  journal: "Bioinformatics"
  month: 04
  title: "Effect of Tokenization on Transformers for Biological Sequences"
  year: 2024

GitHub Events

Total
  • Watch event: 7
Last Year
  • Watch event: 7