tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

https://github.com/huggingface/tokenizers

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    8 of 115 committers (7.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.7%) to scientific vocabulary

Keywords

bert gpt language-model natural-language-processing natural-language-understanding nlp transformers

Keywords from Contributors

jax cryptocurrency transformer cryptography argument-parser evaluation-framework agents keras spacy graph-computing
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 10,025
  • Watchers: 125
  • Forks: 958
  • Open Issues: 117
  • Releases: 98
Topics
bert gpt language-model natural-language-processing natural-language-understanding nlp transformers
Created over 6 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md




Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
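The alignment-tracking feature can be pictured with a small standard-library sketch. This is not the library's Rust implementation: `normalize_with_alignment` and `original_span` are hypothetical names, and the normalization chosen here (lowercasing, NFKD decomposition, dropping combining marks) is only an assumption made for illustration. The idea is to record, for every normalized character, the index of the original character it came from, so that a span in the normalized text can be mapped back.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording the source index of every output character."""
    normalized, alignment = [], []
    for index, char in enumerate(text):
        # NFKD may expand one character into several (a ligature, or a letter plus accent).
        for piece in unicodedata.normalize("NFKD", char.lower()):
            if unicodedata.combining(piece):
                continue  # drop the combining accent
            normalized.append(piece)
            alignment.append(index)
    return "".join(normalized), alignment


def original_span(alignment, start, end):
    """Map a [start, end) span of the normalized text back to the original text."""
    return alignment[start], alignment[end - 1] + 1


original = "Caf\u00e9 \ufb01lter"  # "Café" followed by "filter" written with the "fi" ligature
normalized, alignment = normalize_with_alignment(original)
start, end = original_span(alignment, 5, 11)  # the span of "filter" in the normalized text
print(normalized)           # cafe filter
print(original[start:end])  # the original five characters, ligature included
```

Keeping one source index per normalized character is enough to handle both expansions (one character becoming two) and deletions (accents removed).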

Performance

Performance can vary depending on hardware, but running `~/bindings/python/benches/test_tiktoken.py` should give the following on a g6 AWS instance:

[Image: benchmark results]

Bindings

We provide bindings to the following languages (more to come!):

  • Rust (Original implementation)
  • Python
  • Node.js
  • Ruby (Contributed by @ankane, external repo)

Installation

You can install from source using:

```bash
pip install git+https://github.com/huggingface/tokenizers.git#subdirectory=bindings/python
```

or install the released versions with

```bash
pip install tokenizers
```

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# unk_token lets the model emit "[UNK]" for symbols missing from the vocabulary
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```
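The three models differ in how they split a word into sub-word units. As a rough illustration of one of them, WordPiece is commonly described as a greedy longest-match-first lookup against a fixed vocabulary. The sketch below is a plain-Python approximation under that assumption, not the library's code; `wordpiece_encode`, the toy vocabulary, and the `##` continuation prefix are chosen here for illustration.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Split `word` greedily into the longest vocabulary entries, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"token", "##izer", "##s", "un", "##able"}
print(wordpiece_encode("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
print(wordpiece_encode("xyz", toy_vocab))         # ['[UNK]']
```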

You can customize how pre-tokenization (e.g., splitting into words) is done:

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
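To get a feel for what this kind of pre-tokenization produces, here is a standard-library approximation. It assumes that splitting on the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation) is a fair description of whitespace-style pre-tokenization; `whitespace_pre_tokenize` is a hypothetical helper, not part of the library.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, keeping offsets into the input text."""
    return [(m.group(), (m.start(), m.end())) for m in WORD_OR_PUNCT.finditer(text)]


pieces = whitespace_pre_tokenize("Hello, y'all!")
for piece, span in pieces:
    print(piece, span)
```

Each piece keeps its character offsets, which is what later allows a token to be traced back to the original sentence.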

Then training your tokenizer on a set of files takes just two lines of code:

```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
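For readers curious about what BPE training does conceptually, the following is a toy, pure-Python sketch of learning merge rules. It is not the library's trainer, which also handles special tokens, vocabulary size limits, and much larger inputs. `learn_bpe_merges`, `merge_pair`, `apply_merges`, and the tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of pre-tokenized words."""
    vocab = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_merges(word, merges):
    """Encode a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "lowest", "newer", "newer"]
merges = learn_bpe_merges(corpus, num_merges=3)
print(merges)                         # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
print(apply_merges("lower", merges))  # ['low', 'er']
```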

Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
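The pre-processing listed among the main features (truncation, padding, special tokens) is what turns encoded tokens into fixed-shape model input. The sketch below shows the idea in plain Python on lists of tokens. `prepare_batch` is a hypothetical helper and the `[CLS]`/`[SEP]` layout is an assumed BERT-style template; in the library itself these steps are configured on the tokenizer (for example through `enable_truncation` and `enable_padding`) rather than written by hand.

```python
def prepare_batch(sequences, max_length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Truncate, add special tokens, then pad every sequence to a common length."""
    prepared = []
    for tokens in sequences:
        # Reserve two positions for the special tokens before truncating.
        body = tokens[: max_length - 2]
        prepared.append([cls, *body, sep])

    width = max(len(tokens) for tokens in prepared)
    batch = []
    for tokens in prepared:
        padding = width - len(tokens)
        # The attention mask marks real tokens with 1 and padding with 0.
        batch.append((tokens + [pad] * padding, [1] * len(tokens) + [0] * padding))
    return batch


batch = prepare_batch(
    [["Hello", ",", "y", "'", "all", "!"], ["How", "are", "you"]],
    max_length=6,
)
for tokens, mask in batch:
    print(tokens, mask)
```

The first sequence is cut to four tokens so that it fits in six positions with its special tokens; the second receives one `[PAD]` and a trailing `0` in its mask.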

Check the documentation or the quicktour to learn more!

Owner

  • Name: Hugging Face
  • Login: huggingface
  • Kind: organization
  • Location: NYC + Paris

The AI community building the future.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 1,767
  • Total Committers: 115
  • Avg Commits per committer: 15.365
  • Development Distribution Score (DDS): 0.471
Past Year
  • Commits: 101
  • Committers: 27
  • Avg Commits per committer: 3.741
  • Development Distribution Score (DDS): 0.644
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anthony MOI | m****i@g****m | 934 |
| Nicolas Patry | p****s@p****m | 269 |
| Pierric Cistac | p****c@h****o | 138 |
| Arthur Zucker | a****r@g****m | 79 |
| epwalsh | e****0@g****m | 48 |
| dependabot[bot] | 4****] | 36 |
| Morgan Funtowicz | m****n@h****o | 35 |
| Sebastian Pütz | s****z@u****e | 28 |
| Mishig Davaadorj | d****g@g****m | 22 |
| Bjarte Johansen | b****n@g****m | 11 |
| thomwolf | t****f@g****m | 8 |
| Sylvain Gugger | s****r@g****m | 7 |
| Chris Ha | h****9@g****m | 6 |
| sftse | c@f****t | 5 |
| Roy Hvaara | h****a@g****m | 5 |
| Luc Georges | M****e | 4 |
| Connor Boyle | c****o@g****m | 4 |
| Clement | c****e@g****m | 4 |
| Lysandre | l****t@r****r | 4 |
| Julien Chaumond | c****d@g****m | 4 |
| François Garillot | f****s@g****t | 3 |
| dctelus | 9****s | 3 |
| tinyboxvk | t****k | 3 |
| Michael Lui | m****i | 2 |
| SeongBeomLEE | 2****r@n****m | 2 |
| Thomas Wang | 2****1 | 2 |
| mert-kurttutan | k****t@g****m | 2 |
| Mario Šaško | m****7@g****m | 2 |
| MarcusGrass | 3****s | 2 |
| Lucain | l****p@g****m | 2 |
and 85 more...
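The all-time figures reported above are consistent with a common definition of the Development Distribution Score: one minus the top committer's share of all commits. That definition is an assumption here, since the page does not state its formula, and `development_distribution_score` is a hypothetical helper name. The inputs are the all-time totals and the top committer's commit count listed above.

```python
def development_distribution_score(top_committer_commits, total_commits):
    """Assumed definition: 1 minus the share of commits made by the top committer."""
    return 1 - top_committer_commits / total_commits


total_commits = 1767        # "Total Commits" (all time)
total_committers = 115      # "Total Committers" (all time)
top_committer_commits = 934 # commits by the first entry under "Top Committers"

dds = development_distribution_score(top_committer_commits, total_commits)
average = total_commits / total_committers

print(round(dds, 3))      # 0.471
print(round(average, 3))  # 15.365
```

A score near 0 would mean a single person wrote almost every commit, while values closer to 1 indicate work spread across many committers.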

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 609
  • Total pull requests: 361
  • Average time to close issues: about 1 year
  • Average time to close pull requests: 2 months
  • Total issue authors: 515
  • Total pull request authors: 113
  • Average comments per issue: 4.32
  • Average comments per pull request: 1.99
  • Merged pull requests: 209
  • Bot issues: 0
  • Bot pull requests: 25
Past Year
  • Issues: 118
  • Pull requests: 156
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 24 days
  • Issue authors: 109
  • Pull request authors: 51
  • Average comments per issue: 1.93
  • Average comments per pull request: 1.31
  • Merged pull requests: 81
  • Bot issues: 0
  • Bot pull requests: 13
Top Authors
Issue Authors
  • david-waterworth (15)
  • n1t0 (10)
  • Narsil (5)
  • SaulLu (4)
  • pietrolesci (4)
  • chris-ha458 (4)
  • davidgilbertson (4)
  • DOGEwbx (3)
  • EricLBuehler (3)
  • jafioti (3)
  • xenova (3)
  • 8ria (3)
  • talolard (3)
  • shivanraptor (3)
  • DamonsJ (3)
Pull Request Authors
  • Narsil (110)
  • ArthurZucker (92)
  • dependabot[bot] (39)
  • chris-ha458 (10)
  • sftse (10)
  • hvaara (6)
  • eaplatanios (5)
  • tinyboxvk (5)
  • 414owen (4)
  • boyleconnor (4)
  • sondalex (4)
  • MeetThePatel (4)
  • mjbommar (4)
  • hamirmahal (4)
  • bryantbiggs (4)
Top Labels
Issue Labels
  • Stale (316)
  • bug (11)
  • Feature Request (11)
  • enhancement (10)
  • planned (4)
  • good first issue (2)
  • good second issue (1)
  • bytefallback (1)
  • documentation (1)
  • python (1)
  • training (1)
Pull Request Labels
  • Stale (60)
  • dependencies (39)
  • javascript (37)
  • github_actions (2)

Dependencies

.github/workflows/docs-check.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
.github/workflows/node-release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
  • actions/setup-node v1 composite
  • actions/setup-python v1 composite
.github/workflows/node.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
  • actions/setup-node v1 composite
.github/workflows/python-release-conda.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
.github/workflows/python-release-extra.yml actions
  • actions/checkout v1 composite
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/python-release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v2 composite
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • actions/setup-python v4 composite
.github/workflows/python.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
  • actions/setup-python v2 composite
.github/workflows/rust-release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
.github/workflows/rust.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v1 composite
bindings/python/Cargo.toml cargo
  • pyo3 0.17.2 development
  • tempfile 3.1 development
  • env_logger 0.7.1
  • itertools 0.9
  • libc 0.2
  • ndarray 0.13
  • numpy 0.17.2
  • onig 6.0
  • pyo3 0.17.2
  • rayon 1.3
  • serde 1.0
  • serde_json 1.0
  • tokenizers *
tokenizers/Cargo.toml cargo
  • assert_approx_eq 1.1 development
  • criterion 0.4 development
  • tempfile 3.1 development
  • aho-corasick 0.7
  • cached-path 0.6
  • clap 4.0
  • derive_builder 0.12
  • dirs 3.0
  • esaxx-rs 0.1
  • fancy-regex 0.10
  • getrandom 0.2.6
  • indicatif 0.15
  • itertools 0.9
  • lazy_static 1.4
  • log 0.4
  • macro_rules_attribute 0.1.2
  • onig 6.0
  • paste 1.0.6
  • rand 0.8
  • rayon 1.3
  • rayon-cond 0.1
  • regex 1.3
  • regex-syntax 0.6
  • reqwest 0.11
  • serde 1.0
  • serde_json 1.0
  • spm_precompiled 0.1
  • thiserror 1.0.30
  • unicode-normalization-alignments 0.1
  • unicode-segmentation 1.6
  • unicode_categories 0.1
tokenizers/examples/unstable_wasm/Cargo.toml cargo
  • wasm-bindgen-test 0.3.13 development
  • console_error_panic_hook 0.1.6
  • wasm-bindgen 0.2.63
  • wee_alloc 0.4.5
bindings/node/package-lock.json npm
  • 627 dependencies
bindings/node/package.json npm
  • @types/jest ^26.0.24 development
  • @typescript-eslint/eslint-plugin ^3.10.1 development
  • @typescript-eslint/parser ^3.10.1 development
  • eslint ^7.32.0 development
  • eslint-config-prettier ^6.15.0 development
  • eslint-plugin-jest ^23.20.0 development
  • eslint-plugin-jsdoc ^30.7.13 development
  • eslint-plugin-prettier ^3.4.1 development
  • eslint-plugin-simple-import-sort ^5.0.3 development
  • jest ^26.6.3 development
  • neon-cli ^0.9.1 development
  • prettier ^2.5.1 development
  • shelljs ^0.8.3 development
  • ts-jest ^26.5.6 development
  • typescript ^3.9.10 development
  • @types/node ^13.13.52
  • node-pre-gyp ^0.14.0
tokenizers/examples/unstable_wasm/www/package-lock.json npm
  • 312 dependencies
tokenizers/examples/unstable_wasm/www/package.json npm
  • copy-webpack-plugin ^11.0.0 development
  • webpack ^5.75.0 development
  • webpack-cli ^5.0.1 development
  • webpack-dev-server ^4.10.0 development
  • unstable_wasm file:../pkg