bpeasy

Fast bare-bones BPE for modern tokenizer training

https://github.com/gautierdag/bpeasy

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 1 committers (100.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.0%) to scientific vocabulary

Keywords

bpe tokenization tokenizer
Last synced: 6 months ago

Repository

Fast bare-bones BPE for modern tokenizer training

Basic Info
  • Host: GitHub
  • Owner: gautierdag
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 1.41 MB
Statistics
  • Stars: 164
  • Watchers: 2
  • Forks: 5
  • Open Issues: 0
  • Releases: 7
Topics
bpe tokenization tokenizer
Created over 2 years ago · Last pushed 8 months ago
Metadata Files
Readme License Citation

README.md

bpeasy


Overview

bpeasy is a Python package that provides a tokenizer trainer, implementing an efficient version of Byte Pair Encoding (BPE) in roughly 400 lines of Rust. The implementation largely follows the Huggingface tokenizers library, but makes opinionated decisions to simplify tokenizer training, specifically to:

  1. Treat text data at the byte level first: all text is converted to bytes before training, rather than using a character-level approach (as in Huggingface).
  2. Always use a regex-based split pre-tokenizer. This is a customisable regex applied to the text before training; it decides where the text may be split and therefore limits what kinds of tokens are possible (a short illustration follows this list). This is technically possible in Huggingface but is not well documented. We also use the fancy-regex crate, which supports a richer set of regex features than the regex crate used in Huggingface.
  3. Use int64 types for counting, allowing training on much larger datasets without the risk of overflow.
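
To make point 2 concrete, here is a minimal sketch (not bpeasy code; it uses the third-party Python `regex` package, which supports `\p{L}`/`\p{N}` classes) of what a regex split pre-tokenizer does before any merges are learned:

```python
import regex  # third-party `regex` package; the stdlib `re` lacks \p{...} classes

# GPT-4-style split pattern (same as in the training example below)
pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

chunks = regex.findall(pattern, "Hello world, it's 2024!")
# e.g. ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']

# bpeasy treats text at the byte level, so merges only happen over the UTF-8
# bytes inside each chunk, never across chunk boundaries
byte_chunks = [chunk.encode("utf-8") for chunk in chunks]
```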

You can think of bpeasy as the tiktoken training code that never was.

See the benchmarks section for a comparison with the Huggingface library.

Installation

Simply install the package using pip:

```bash
pip install bpeasy
```

Training

The training function is designed to be bare-bones and returns the trained tokenizer vocab as a dictionary of bytes to integers. This allows maximum flexibility in how you use the tokenizer; for example, you can then port the vocab to tiktoken or Huggingface tokenizers (see below).

```python
import bpeasy

# should be an iterator over str
iterator = jsonl_content_iterator(args)

# example regex from GPT-4
regex_pattern = r"""(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"""

# returns the vocab (dict[bytes, int])
vocab = bpeasy.train_bpe(
    iterator,
    regex_pattern,
    args.max_sentencepiece_length,  # max length of tokens
    args.vocab_size,                # max size of vocab
    args.batch_size,                # number of items to pretokenize/count at once (default: 1000)
)
```

Alternatively, you can also train using the basic tokenizer class provided:

```python
from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer.train(
    iterator,  # iterator over str
    vocab_size=vocab_size,
    max_token_length=max_token_length,
    regex_pattern=regex_pattern,
    special_tokens=["<s>", "<pad>", "</s>"],  # example special tokens
    fill_to_nearest_multiple_of_eight=True,
    name="bpeasy",
)
```

Note on batch_size: the batch_size parameter controls how many items are pretokenized and counted at once. A larger batch size is faster but uses more memory; the default is 1000. If all of your texts fit in memory, you can set it to the size of your dataset to speed up training. If you are working with a very large dataset, set it to a smaller value to keep memory usage bounded.
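
For instance, a minimal sketch (assuming the positional signature shown in the training example above; the corpus and sizes here are purely illustrative):

```python
import bpeasy

# Illustrative only: a tiny corpus that fits in memory, counted in a single batch.
texts = ["first document", "second document", "third document"]

vocab = bpeasy.train_bpe(
    iter(texts),
    regex_pattern,  # same split regex as in the training example above
    32,             # max token length
    1024,           # vocab size
    len(texts),     # batch_size: pretokenize/count the whole corpus at once
)
```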

Encoding/Decoding

To test your tokenizer you can use the BPEasyTokenizer class, a wrapper around tiktoken.Encoding that simplifies the handling of vocabularies, special tokens, and regex patterns for tokenization.

```python
from bpeasy.tokenizer import BPEasyTokenizer

your_special_tokens = ["<s>", "<pad>", "</s>"]  # example special tokens

tokenizer = BPEasyTokenizer(
    vocab=vocab,
    regex_pattern=regex_pattern,
    special_tokens=your_special_tokens,
    fill_to_nearest_multiple_of_eight=True,  # pad vocab to multiple of 8
    name="bpeasy",    # optional name for the tokenizer
    batch_size=1000,  # optional number of items to pretokenize/count at once
)

test = "hello_world"

# encode and decode use the tiktoken functions
encoded = tokenizer.encode(test)
decoded = tokenizer.decode(encoded)
# decoded == "hello_world"
```

You can also use tiktoken directly, but you would need to handle the special tokens and regex pattern yourself:

```python
import tiktoken

import bpeasy

vocab = bpeasy.train_bpe(...)
special_tokens = ["<s>", "<pad>", "</s>"]  # example special tokens

# Sort the vocab by rank
sorted_vocab = sorted(list(vocab.items()), key=lambda x: x[1])

# add special tokens
special_token_ranks = {}
for special_token in special_tokens:
    special_token_ranks[special_token] = len(sorted_vocab)
    sorted_vocab.append((special_token.encode("utf-8"), len(sorted_vocab)))

full_vocab = dict(sorted_vocab)

encoder = tiktoken.Encoding(
    name="bpeasy",  # or any name you like
    pat_str=regex_pattern,
    mergeable_ranks=full_vocab,
    special_tokens=special_token_ranks,
)
```
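
As a quick sanity check, a sketch of using the resulting encoder (the text and special token are illustrative, and this assumes the trained byte-level vocab covers all byte values, as byte-level BPE vocabs do; allowed_special="all" is standard tiktoken behaviour for permitting special tokens in input text):

```python
# Round-trip a string through the tiktoken encoder built above;
# allowed_special="all" lets special tokens appear in the input text.
ids = encoder.encode("hello <s> world", allowed_special="all")
print(encoder.decode(ids))  # expected: "hello <s> world"
```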

Save/Load tokenizer from file

We provide basic utility functions to save and load the tokenizer to and from a JSON file.

```python
tokenizer.save("path_to_file.json")

tokenizer = BPEasyTokenizer.from_file("path_to_file.json")
```

Export to HuggingFace format

We also support exporting the tokenizer to the HuggingFace format, which can then be used directly with the HuggingFace transformers library.

```python
from bpeasy.tokenizer import BPEasyTokenizer

tokenizer = BPEasyTokenizer(
    ...
)

tokenizer.export_to_huggingface_format("hf_tokenizer.json")

from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast(tokenizer_file="hf_tokenizer.json")
```

Export vocab to tiktoken txt format

```python
import bpeasy
from bpeasy import save_vocab_to_tiktoken

vocab = bpeasy.train_bpe(...)

# saves the vocab to a tiktoken txt file format
save_vocab_to_tiktoken(vocab, "vocab.txt", special_tokens=["<s>", "<pad>", "</s>"])
```

If you want to use the tiktoken txt format, you will still need to handle the regex and special tokens yourself, as shown above.
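
If it helps, here is a minimal sketch (not part of bpeasy) of wiring a saved vocab.txt back into a tiktoken.Encoding, assuming the standard tiktoken text layout of one base64-encoded token and its rank per line; the regex and special tokens must match whatever you trained with:

```python
import base64

import tiktoken

# Load "<base64 token> <rank>" pairs from the exported file.
with open("vocab.txt", "rb") as f:
    mergeable_ranks = {
        base64.b64decode(token): int(rank)
        for token, rank in (line.split() for line in f.read().splitlines() if line)
    }

encoder = tiktoken.Encoding(
    name="bpeasy",
    pat_str=regex_pattern,  # the same split regex used during training
    mergeable_ranks=mergeable_ranks,
    special_tokens={},      # add special token ranks here, as in the tiktoken example above
)
```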

Contributing

Contributions are welcome! Please open an issue if you have any suggestions or improvements.

License

This project is licensed under the MIT License.

Citation

If you use bpeasy in your research, please cite it as follows:

```bibtex
@software{bpeasy,
  author = {Gautier Dagan},
  title = {bpeasy},
  year = {2024},
  url = {https://github.com/gautierdag/bpeasy},
  repository = {https://github.com/gautierdag/bpeasy},
  author-email = {gautier.dagan@ed.ac.uk},
  affiliation = {University of Edinburgh},
  orcid = {https://orcid.org/0000-0002-1867-4201}
}
```

Citation (CITATION.cff)

cff-version: 1.2.0
title: bpeasy
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Gautier
    family-names: Dagan
    email: gautier.dagan@ed.ac.uk
    affiliation: University of Edinburgh
    orcid: 'https://orcid.org/0000-0002-1867-4201'
repository-code: 'https://github.com/gautierdag/bpeasy'
url: 'https://github.com/gautierdag/bpeasy'
license: MIT

GitHub Events

Total
  • Create event: 1
  • Release event: 1
  • Issues event: 3
  • Watch event: 24
  • Issue comment event: 1
  • Push event: 3
  • Fork event: 4
Last Year
  • Create event: 1
  • Release event: 1
  • Issues event: 3
  • Watch event: 24
  • Issue comment event: 1
  • Push event: 3
  • Fork event: 4

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 53
  • Total Committers: 1
  • Avg Commits per committer: 53.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 11
  • Committers: 1
  • Avg Commits per committer: 11.0
  • Development Distribution Score (DDS): 0.0
Top Committers
  • Gautier Dagan (s****1@e****k): 53 commits
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Total issue authors: 5
  • Total pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 5
  • Pull requests: 0
  • Average time to close issues: 9 days
  • Average time to close pull requests: N/A
  • Issue authors: 5
  • Pull request authors: 0
  • Average comments per issue: 3.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • JonasGeiping (1)
  • filemon11 (1)
  • umarbutler (1)
  • bauwenst (1)
  • depshad (1)
Pull Request Authors
Top Labels
Issue Labels
enhancement (1), question (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 1,536 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 7
  • Total maintainers: 1
pypi.org: bpeasy

Fast bare-bones BPE for modern tokenizer training

  • Versions: 7
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,536 Last month
Rankings
  • Dependent packages count: 10.2%
  • Average: 38.6%
  • Dependent repos count: 67.0%
Maintainers (1)
Last synced: 6 months ago