tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

https://github.com/huggingface/tokenizers

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 8 of 115 committers (7.0%) from academic institutions
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (14.7%) to scientific vocabulary

Keywords

bert gpt language-model natural-language-processing natural-language-understanding nlp transformers

Keywords from Contributors

jax cryptocurrency transformer cryptography argument-parser evaluation-framework agents keras spacy graph-computing

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: huggingface
License: apache-2.0
Language: Rust
Default Branch: main
Homepage: https://huggingface.co/docs/tokenizers
Size: 13.5 MB

Statistics

Stars: 10,025
Watchers: 125
Forks: 958
Open Issues: 117
Releases: 98

Topics

bert gpt language-model natural-language-processing natural-language-understanding nlp transformers

Created over 6 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

README.md

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs.

Performance

Performance can vary depending on hardware, but running ~/bindings/python/benches/test_tiktoken.py on a g6 AWS instance should give the following:

[Benchmark figure not reproduced here.]

Bindings

We provide bindings to the following languages (more to come!):

- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)

Installation

You can install from source using:

`pip install git+https://github.com/huggingface/tokenizers.git#subdirectory=bindings/python`

or install the released versions with:

`pip install tokenizers`

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# unk_token lets the model emit "[UNK]" for symbols it has never seen
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```

You can customize how pre-tokenization (e.g., splitting into words) is done:

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```

Then training your tokenizer on a set of files takes just two lines of code:

```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
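For intuition about what BPE training does, here is a toy pure-Python sketch of the classic merge loop: count adjacent symbol pairs, merge the most frequent one, and repeat. It is only an illustration, not the library's Rust implementation. The function name `learn_bpe_merges`, the tie-breaking rule, and the sample words are all assumptions made for this example.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word is a tuple of symbols (initially characters) with a count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Assumed tie-break: highest count first, then smallest pair lexicographically.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_vocab = Counter()
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += count
        vocab = new_vocab
    return merges


sample_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = learn_bpe_merges(sample_words, num_merges=3)
print(merges)
```

The learned merges, applied in order, are what later split unseen text into subword tokens.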

Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
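The alignment tracking and pre-processing features listed earlier can be sketched in the same style. This is a minimal example, not taken from the README. It assumes a small in-memory corpus (hypothetical, so it runs without the `wiki.*.raw` files) passed to `train_from_iterator`, and it uses no normalizer, so the offsets map directly onto the input string.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Hypothetical in-memory corpus, used only so the sketch is self-contained.
corpus = [
    "Hello, how are you doing today?",
    "Hi there, how are you?",
    "Hello again, are you there?",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]", "[PAD]"], show_progress=False)
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Alignment tracking: each token carries (start, end) character offsets
# that point back into the original input.
text = "Hello, how are you?"
encoding = tokenizer.encode(text)
spans = [text[start:end] for start, end in encoding.offsets]
print(list(zip(encoding.tokens, spans)))

# Pre-processing: truncate to at most 4 tokens, and pad shorter sequences
# in a batch up to the longest one.
tokenizer.enable_truncation(max_length=4)
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")
batch = tokenizer.encode_batch([text, "Hi"])
for enc in batch:
    print(enc.tokens, enc.attention_mask)
```

After truncation and padding, both encodings in the batch have the same length, and the attention mask marks padded positions with 0.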

Check the documentation or the quicktour to learn more!

Owner

Name: Hugging Face
Login: huggingface
Kind: organization
Location: NYC + Paris

Website: https://huggingface.co/
Twitter: huggingface
Repositories: 344
Profile: https://github.com/huggingface

The AI community building the future.

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'

Committers

Last synced: about 1 year ago

All Time

Total Commits: 1,767
Total Committers: 115
Avg Commits per committer: 15.365
Development Distribution Score (DDS): 0.471
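
The two derived figures above can be reproduced from the raw counts. This is a short sketch which assumes that DDS is defined as one minus the top committer's share of all commits. The page does not state the definition, but that formula matches the reported value when the top committer's count (934, from the Top Committers table) is used.

```python
# Figures taken from the "All Time" statistics and the Top Committers table.
total_commits = 1767
total_committers = 115
top_committer_commits = 934  # Anthony MOI

# Average commits per committer.
avg_commits = total_commits / total_committers

# Assumed definition: DDS = 1 - (top committer's commits / total commits).
dds = 1 - top_committer_commits / total_commits

print(f"Avg commits per committer: {avg_commits:.3f}")  # 15.365
print(f"DDS: {dds:.3f}")  # 0.471
```

Under this definition, a DDS near 0 means one person wrote almost every commit, while a value near 1 means the work is spread across many committers.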

Past Year

Commits: 101
Committers: 27
Avg Commits per committer: 3.741
Development Distribution Score (DDS): 0.644

Top Committers

Name	Email	Commits
Anthony MOI	m**i@g**m	934
Nicolas Patry	p**s@p**m	269
Pierric Cistac	p**c@h**o	138
Arthur Zucker	a**r@g**m	79
epwalsh	e**0@g**m	48
dependabot[bot]	4****]	36
Morgan Funtowicz	m**n@h**o	35
Sebastian Pütz	s**z@u**e	28
Mishig Davaadorj	d**g@g**m	22
Bjarte Johansen	b**n@g**m	11
thomwolf	t**f@g**m	8
Sylvain Gugger	s**r@g**m	7
Chris Ha	h**9@g**m	6
sftse	c@f****t	5
Roy Hvaara	h**a@g**m	5
Luc Georges	M****e	4
Connor Boyle	c**o@g**m	4
Clement	c**e@g**m	4
Lysandre	l**t@r**r	4
Julien Chaumond	c**d@g**m	4
François Garillot	f**s@g**t	3
dctelus	9****s	3
tinyboxvk	t****k	3
Michael Lui	m****i	2
SeongBeomLEE	2**r@n**m	2
Thomas Wang	2****1	2
mert-kurttutan	k**t@g**m	2
Mario Šaško	m**7@g**m	2
MarcusGrass	3****s	2
Lucain	l**p@g**m	2
and 85 more...

Committer Domains (Top 20 + Academic)

huggingface.co: 3
qq.com: 1
nii.ac.jp: 1
coloradocollege.edu: 1
163.com: 1
aleph-alpha.de: 1
echevarria.io: 1
atomiclogic.com: 1
northsouth.edu: 1
ankane.org: 1
lifen.fr: 1
naver.com: 1
garillot.net: 1
reseau.eseo.fr: 1
google.com: 1
farsight.net: 1
uni-tuebingen.de: 1
qkou.info: 1
modelcloud.ai: 1
pm.me: 1
kanji.zinbun.kyoto-u.ac.jp: 1
mail.utoronto.ca: 1
sms.ed.ac.uk: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 609
Total pull requests: 361
Average time to close issues: about 1 year
Average time to close pull requests: 2 months
Total issue authors: 515
Total pull request authors: 113
Average comments per issue: 4.32
Average comments per pull request: 1.99
Merged pull requests: 209
Bot issues: 0
Bot pull requests: 25

Past Year

Issues: 118
Pull requests: 156
Average time to close issues: about 1 month
Average time to close pull requests: 24 days
Issue authors: 109
Pull request authors: 51
Average comments per issue: 1.93
Average comments per pull request: 1.31
Merged pull requests: 81
Bot issues: 0
Bot pull requests: 13

Top Authors

Issue Authors

david-waterworth (15)
n1t0 (10)
Narsil (5)
SaulLu (4)
pietrolesci (4)
chris-ha458 (4)
davidgilbertson (4)
DOGEwbx (3)
EricLBuehler (3)
jafioti (3)
xenova (3)
8ria (3)
talolard (3)
shivanraptor (3)
DamonsJ (3)

Pull Request Authors

Narsil (110)
ArthurZucker (92)
dependabot[bot] (39)
chris-ha458 (10)
sftse (10)
hvaara (6)
eaplatanios (5)
tinyboxvk (5)
414owen (4)
boyleconnor (4)
sondalex (4)
MeetThePatel (4)
mjbommar (4)
hamirmahal (4)
bryantbiggs (4)

Top Labels

Issue Labels

Stale (316)
bug (11)
Feature Request (11)
enhancement (10)
planned (4)
good first issue (2)
good second issue (1)
bytefallback (1)
documentation (1)
python (1)
training (1)

Pull Request Labels

Stale (60)
dependencies (39)
javascript (37)
github_actions (2)

Dependencies

.github/workflows/docs-check.yml actions

actions-rs/toolchain v1 composite
actions/checkout v1 composite
actions/setup-python v1 composite
actions/upload-artifact v2 composite

.github/workflows/node-release.yml actions

actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite
actions/setup-node v1 composite
actions/setup-python v1 composite

.github/workflows/node.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite
actions/setup-node v1 composite

.github/workflows/python-release-conda.yml actions

actions-rs/toolchain v1 composite
actions/checkout v2 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/python-release-extra.yml actions

actions/checkout v1 composite
actions/checkout v2 composite
actions/setup-python v1 composite

.github/workflows/python-release.yml actions

actions-rs/toolchain v1 composite
actions/checkout v2 composite
actions/checkout v1 composite
actions/setup-python v1 composite
actions/setup-python v4 composite

.github/workflows/python.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite
actions/setup-python v2 composite

.github/workflows/rust-release.yml actions

actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite

.github/workflows/rust.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/checkout v1 composite

bindings/python/Cargo.toml cargo

pyo3 0.17.2 development
tempfile 3.1 development
env_logger 0.7.1
itertools 0.9
libc 0.2
ndarray 0.13
numpy 0.17.2
onig 6.0
pyo3 0.17.2
rayon 1.3
serde 1.0
serde_json 1.0
tokenizers *

tokenizers/Cargo.toml cargo

assert_approx_eq 1.1 development
criterion 0.4 development
tempfile 3.1 development
aho-corasick 0.7
cached-path 0.6
clap 4.0
derive_builder 0.12
dirs 3.0
esaxx-rs 0.1
fancy-regex 0.10
getrandom 0.2.6
indicatif 0.15
itertools 0.9
lazy_static 1.4
log 0.4
macro_rules_attribute 0.1.2
onig 6.0
paste 1.0.6
rand 0.8
rayon 1.3
rayon-cond 0.1
regex 1.3
regex-syntax 0.6
reqwest 0.11
serde 1.0
serde_json 1.0
spm_precompiled 0.1
thiserror 1.0.30
unicode-normalization-alignments 0.1
unicode-segmentation 1.6
unicode_categories 0.1

tokenizers/examples/unstable_wasm/Cargo.toml cargo

wasm-bindgen-test 0.3.13 development
console_error_panic_hook 0.1.6
wasm-bindgen 0.2.63
wee_alloc 0.4.5

bindings/node/package-lock.json npm

627 dependencies

bindings/node/package.json npm

@types/jest ^26.0.24 development
@typescript-eslint/eslint-plugin ^3.10.1 development
@typescript-eslint/parser ^3.10.1 development
eslint ^7.32.0 development
eslint-config-prettier ^6.15.0 development
eslint-plugin-jest ^23.20.0 development
eslint-plugin-jsdoc ^30.7.13 development
eslint-plugin-prettier ^3.4.1 development
eslint-plugin-simple-import-sort ^5.0.3 development
jest ^26.6.3 development
neon-cli ^0.9.1 development
prettier ^2.5.1 development
shelljs ^0.8.3 development
ts-jest ^26.5.6 development
typescript ^3.9.10 development
@types/node ^13.13.52
node-pre-gyp ^0.14.0

tokenizers/examples/unstable_wasm/www/package-lock.json npm

312 dependencies

tokenizers/examples/unstable_wasm/www/package.json npm

copy-webpack-plugin ^11.0.0 development
webpack ^5.75.0 development
webpack-cli ^5.0.1 development
webpack-dev-server ^4.10.0 development
unstable_wasm file:../pkg
