bpeasy

Fast bare-bones BPE for modern tokenizer training

https://github.com/gautierdag/bpeasy

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Keywords

bpe tokenization tokenizer
Last synced: 6 months ago

Repository

Fast bare-bones BPE for modern tokenizer training

Basic Info
  • Host: GitHub
  • Owner: gautierdag
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.41 MB
Statistics
  • Stars: 164
  • Watchers: 2
  • Forks: 5
  • Open Issues: 0
  • Releases: 7
Topics
bpe tokenization tokenizer
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

bpeasy


Overview

bpeasy is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in roughly 400 lines of Rust. The implementation largely follows the Huggingface tokenizers library, but makes opinionated decisions to simplify tokenizer training, specifically to:

  1. Treat text data at the byte level first: all text is converted to bytes before training, rather than using a character-level approach (as in Huggingface).
  2. Always use a regex-based split pre-tokenizer. This is a customisable regex applied to the text before training; it decides where the text may be split and therefore limits what kinds of tokens are possible (a short illustration follows this list). This is technically possible in Huggingface but is not well documented. We also use the fancy-regex crate, which supports a richer set of regex features than the regex crate used in Huggingface.
  3. Use int64 types for counting, allowing training on much larger datasets without the risk of overflow.
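
To make point 2 concrete, here is a minimal sketch (not bpeasy code; it uses the third-party Python `regex` package, which supports `\p{L}`/`\p{N}` classes) of what a regex split pre-tokenizer does before any merges are learned:

```python
import regex  # third-party `regex` package; the stdlib `re` lacks \p{...} classes

# GPT-4-style split pattern (same as in the training example below)
pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

chunks = regex.findall(pattern, "Hello world, it's 2024!")
# e.g. ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']

# bpeasy treats text at the byte level, so merges only happen over the UTF-8
# bytes inside each chunk, never across chunk boundaries
byte_chunks = [chunk.encode("utf-8") for chunk in chunks]
```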

You can think of bpeasy as the tiktoken training code that never was.

See the benchmarks section for a comparison with the Huggingface library.

Installation

Simply install the package using pip:

```bash
pip install bpeasy
```

Training

The training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows maximum flexibility in how you use the tokenizer; for example, you can then port the vocab to tiktoken or Huggingface tokenizers (see below).

```python
import bpeasy

# should be an iterator over str
iterator = jsonl_content_iterator(args)

# example regex from GPT-4
regex_pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

# returns the vocab (dict[bytes, int])
vocab = bpeasy.train_bpe(
    iterator,
    regex_pattern,
    args.max_sentencepiece_length,  # max length of tokens
    args.vocab_size,                # max size of vocab
    args.batch_size,                # number of items to pretokenize/count at once (default: 1000)
)
```

Alternatively, you can also train using the basic tokenizer class provided:

```python
from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer.train(
    iterator,  # iterator over str
    vocab_size=vocab_size,
    max_token_length=max_token_length,
    regex_pattern=regex_pattern,
    special_tokens=["<s>", "<pad>", "</s>"],  # example special tokens
    fill_to_nearest_multiple_of_eight=True,
    name="bpeasy",
)
```

Note on batch_size: the batch_size parameter controls how many items are pretokenized and counted at once. A larger batch size is faster but uses more memory; the default is 1000. If all of your texts fit in memory, you can set it to the size of your dataset to speed up training. If you are working with a very large dataset, set it to a smaller value to keep memory usage bounded.
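
For instance, a minimal sketch (assuming the positional signature shown in the training example above; the corpus and sizes here are purely illustrative):

```python
import bpeasy

# Illustrative only: a tiny corpus that fits in memory, counted in a single batch.
texts = ["first document", "second document", "third document"]

vocab = bpeasy.train_bpe(
    iter(texts),
    regex_pattern,  # same split regex as in the training example above
    32,             # max token length
    1024,           # vocab size
    len(texts),     # batch_size: pretokenize/count the whole corpus at once
)
```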

Encoding/Decoding

To test your tokenizer you can use the BPEasyTokenizer class, a wrapper around tiktoken.Encoding that simplifies the handling of vocabularies, special tokens, and regex patterns for tokenization.

```python
from bpeasy.tokenizer import BPEasyTokenizer

your_special_tokens = ["<s>", "<pad>", "</s>"]  # example special tokens

tokenizer = BPEasyTokenizer(
    vocab=vocab,
    regex_pattern=regex_pattern,
    special_tokens=your_special_tokens,
    fill_to_nearest_multiple_of_eight=True,  # pad vocab to multiple of 8
    name="bpeasy",    # optional name for the tokenizer
    batch_size=1000,  # optional number of items to pretokenize/count at once
)

test = "hello_world"

# encode and decode use the tiktoken functions
encoded = tokenizer.encode(test)
decoded = tokenizer.decode(encoded)
# decoded == "hello_world"
```

You can also use tiktoken directly, but you would need to handle the special tokens and regex pattern yourself:

```python
import tiktoken

import bpeasy

vocab = bpeasy.train_bpe(...)
special_tokens = ["<s>", "<pad>", "</s>"]  # example special tokens

# Sort the vocab by rank
sorted_vocab = sorted(list(vocab.items()), key=lambda x: x[1])

# add special tokens
special_token_ranks = {}
for special_token in special_tokens:
    special_token_ranks[special_token] = len(sorted_vocab)
    sorted_vocab.append((special_token.encode("utf-8"), len(sorted_vocab)))

full_vocab = dict(sorted_vocab)

encoder = tiktoken.Encoding(
    name="bpeasy",  # or any name you like
    pat_str=regex_pattern,
    mergeable_ranks=full_vocab,
    special_tokens=special_token_ranks,
)
```
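
As a quick sanity check, a sketch of using the resulting encoder (the text and special token are illustrative, and this assumes the trained byte-level vocab covers all byte values, as byte-level BPE vocabs do; allowed_special="all" is standard tiktoken behaviour for permitting special tokens in input text):

```python
# Round-trip a string through the tiktoken encoder built above;
# allowed_special="all" lets special tokens appear in the input text.
ids = encoder.encode("hello <s> world", allowed_special="all")
print(encoder.decode(ids))  # expected: "hello <s> world"
```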

Save/Load tokenizer from file

We provide basic utility functions to save and load the tokenizer to and from a JSON file.

```python
tokenizer.save("path_to_file.json")

tokenizer = BPEasyTokenizer.from_file("path_to_file.json")
```

Export to HuggingFace format

We also support exporting the tokenizer to the HuggingFace format, which can then be used directly with the HuggingFace transformers library.

```python
from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer(
    ...
)

tokenizer.export_to_huggingface_format("hf_tokenizer.json")

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hf_tokenizer.json")
```

Export vocab to tiktoken txt format

```python
import bpeasy
from bpeasy import save_vocab_to_tiktoken

vocab = bpeasy.train_bpe(...)

# saves the vocab to a tiktoken txt file format
save_vocab_to_tiktoken(vocab, "vocab.txt", special_tokens=["<s>", "<pad>", "</s>"])
```

If you want to use the tiktoken txt format, you will still need to handle the regex and special tokens yourself, as shown above.
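
If it helps, here is a minimal sketch (not part of bpeasy) of wiring a saved vocab.txt back into a tiktoken.Encoding, assuming the standard tiktoken text layout of one base64-encoded token and its rank per line; the regex and special tokens must match whatever you trained with:

```python
import base64

import tiktoken

# Load "<base64 token> <rank>" pairs from the exported file.
with open("vocab.txt", "rb") as f:
    mergeable_ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in f.read().splitlines() if line)
    }

encoder = tiktoken.Encoding(
    name="bpeasy",
    pat_str=regex_pattern,  # the same split regex used during training
    mergeable_ranks=mergeable_ranks,
    special_tokens={},      # add special token ranks here, as in the tiktoken example above
)
```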

Contributing

Contributions are welcome! Please open an issue if you have any suggestions or improvements.

License

This project is licensed under the MIT License.

Citation

If you use bpeasy in your research, please cite it as follows:

```bibtex
@software{bpeasy,
  author = {Gautier Dagan},
  title = {bpeasy},
  year = {2024},
  url = {https://github.com/gautierdag/bpeasy},
  repository = {https://github.com/gautierdag/bpeasy},
  author-email = {gautier.dagan@ed.ac.uk},
  affiliation = {University of Edinburgh},
  orcid = {https://orcid.org/0000-0002-1867-4201}
}
```

Citation (CITATION.cff)

cff-version: 1.2.0
title: bpeasy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Gautier
    family-names: Dagan
    email: gautier.dagan@ed.ac.uk
    affiliation: University of Edinburgh
    orcid: 'https://orcid.org/0000-0002-1867-4201'
repository-code: 'https://github.com/gautierdag/bpeasy'
url: 'https://github.com/gautierdag/bpeasy'
license: MIT

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 3
  • Watch event: 24
  • Issue comment event: 1
  • Push event: 3
  • Fork event: 4
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 3
  • Watch event: 24
  • Issue comment event: 1
  • Push event: 3
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 53
  • Total Committers: 1
  • Avg Commits per committer: 53.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 11
  • Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Gautier Dagan (s****1@e****k): 53 commits
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Total issue authors: 5
  • Total pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Issue authors: 5
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JonasGeiping (1)
  • filemon11 (1)
  • umarbutler (1)
  • bauwenst (1)
  • depshad (1)
Pull Request Authors
Top Labels
Issue Labels
enhancement (1), question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 1,536 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 7
  • Total maintainers: 1
pypi.org: bpeasy

Fast bare-bones BPE for modern tokenizer training

  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,536 Last month
Rankings
  • Dependent packages count: 10.2%
  • Average: 38.6%
  • Dependent repos count: 67.0%
Maintainers (1)
Last synced: 6 months ago