tokenizers
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 8 of 115 committers (7.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.7%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Basic Info
- Host: GitHub
- Owner: huggingface
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Homepage: https://huggingface.co/docs/tokenizers
- Size: 13.5 MB
Statistics
- Stars: 10,025
- Watchers: 125
- Forks: 958
- Open Issues: 117
- Releases: 98
Topics
Metadata Files
README.md
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch after this list).
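As a rough illustration of the last two points, the `Encoding` returned by the tokenizer exposes per-token character offsets, and truncation/padding are configured once on the tokenizer itself (a minimal sketch; the "bert-base-uncased" checkpoint and the length limit are only example values):

```python
from tokenizers import Tokenizer

# Example checkpoint; any trained tokenizer (or a local tokenizer.json) works.
tokenizer = Tokenizer.from_pretrained("bert-base-uncased")

# Alignment tracking: each token carries the (start, end) character span
# of the original text it was produced from.
text = "Hello, y'all!"
output = tokenizer.encode(text)
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", text[start:end])

# Built-in pre-processing: configure truncation and padding once,
# and they are applied on every subsequent encode call.
tokenizer.enable_truncation(max_length=8)
tokenizer.enable_padding(pad_token="[PAD]")
```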
Performance
Performance can vary depending on hardware, but running ~/bindings/python/benches/test_tiktoken.py on a g6 AWS instance should give representative results.
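For a quick, informal throughput check on your own machine (a minimal sketch, not the repository's benchmark script; the "gpt2" checkpoint and the data.txt path are placeholders):

```python
import time
from tokenizers import Tokenizer

# Placeholder checkpoint; any tokenizer.json or hub tokenizer works.
tokenizer = Tokenizer.from_pretrained("gpt2")

# Placeholder corpus: one text per line.
with open("data.txt", "r", encoding="utf-8") as f:
    lines = f.read().splitlines()

start = time.perf_counter()
encodings = tokenizer.encode_batch(lines)  # batched, parallelized in the Rust core
elapsed = time.perf_counter() - start

n_bytes = sum(len(line.encode("utf-8")) for line in lines)
n_tokens = sum(len(e.ids) for e in encodings)
print(f"{n_bytes / 1e6 / elapsed:.1f} MB/s, {n_tokens} tokens in {elapsed:.2f}s")
```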
Bindings
We provide bindings to the following languages (more to come!):
- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)
Installation
You can install from source using:

```bash
pip install git+https://github.com/huggingface/tokenizers.git#subdirectory=bindings/python
```

or install the released versions with:

```bash
pip install tokenizers
```
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())
```
You can customize how pre-tokenization (e.g., splitting into words) is done:
```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
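To see what the pre-tokenizer alone produces, it can be called directly on a string (a small sketch; the exact output shown is an assumption for this particular input):

```python
from tokenizers.pre_tokenizers import Whitespace

# pre_tokenize_str shows the word-level splits (with character offsets)
# produced before the model (BPE / WordPiece / Unigram) is applied.
print(Whitespace().pre_tokenize_str("Hello, y'all!"))
# e.g. [('Hello', (0, 5)), (',', (5, 6)), ('y', (7, 8)), ("'", (8, 9)), ('all', (9, 12)), ('!', (12, 13))]
```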
Then training your tokenizer on a set of files takes just two lines of code:
```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
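The trained pipeline (normalizer, pre-tokenizer, model, post-processor) can be serialized to a single JSON file and reloaded later (a minimal sketch; the file name is arbitrary):

```python
from tokenizers import Tokenizer

# Save the whole tokenization pipeline to one JSON file...
tokenizer.save("tokenizer.json")

# ...and reload it elsewhere with no retraining.
reloaded = Tokenizer.from_file("tokenizer.json")
print(reloaded.encode("Hello, y'all!").tokens)
```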
Check the documentation or the quicktour to learn more!
Owner
- Name: Hugging Face
- Login: huggingface
- Kind: organization
- Location: NYC + Paris
- Website: https://huggingface.co/
- Twitter: huggingface
- Repositories: 344
- Profile: https://github.com/huggingface
The AI community building the future.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
type: software
authors:
- given-names: Anthony
family-names: Moi
email: m.anthony.moi@gmail.com
affiliation: HuggingFace
- given-names: Nicolas
family-names: Patry
affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
keywords:
- Rust
- Tokenizer
- NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Anthony MOI | m****i@g****m | 934 |
| Nicolas Patry | p****s@p****m | 269 |
| Pierric Cistac | p****c@h****o | 138 |
| Arthur Zucker | a****r@g****m | 79 |
| epwalsh | e****0@g****m | 48 |
| dependabot[bot] | 4****] | 36 |
| Morgan Funtowicz | m****n@h****o | 35 |
| Sebastian Pütz | s****z@u****e | 28 |
| Mishig Davaadorj | d****g@g****m | 22 |
| Bjarte Johansen | b****n@g****m | 11 |
| thomwolf | t****f@g****m | 8 |
| Sylvain Gugger | s****r@g****m | 7 |
| Chris Ha | h****9@g****m | 6 |
| sftse | c@f****t | 5 |
| Roy Hvaara | h****a@g****m | 5 |
| Luc Georges | M****e | 4 |
| Connor Boyle | c****o@g****m | 4 |
| Clement | c****e@g****m | 4 |
| Lysandre | l****t@r****r | 4 |
| Julien Chaumond | c****d@g****m | 4 |
| François Garillot | f****s@g****t | 3 |
| dctelus | 9****s | 3 |
| tinyboxvk | t****k | 3 |
| Michael Lui | m****i | 2 |
| SeongBeomLEE | 2****r@n****m | 2 |
| Thomas Wang | 2****1 | 2 |
| mert-kurttutan | k****t@g****m | 2 |
| Mario Šaško | m****7@g****m | 2 |
| MarcusGrass | 3****s | 2 |
| Lucain | l****p@g****m | 2 |
| and 85 more... | | |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 609
- Total pull requests: 361
- Average time to close issues: about 1 year
- Average time to close pull requests: 2 months
- Total issue authors: 515
- Total pull request authors: 113
- Average comments per issue: 4.32
- Average comments per pull request: 1.99
- Merged pull requests: 209
- Bot issues: 0
- Bot pull requests: 25
Past Year
- Issues: 118
- Pull requests: 156
- Average time to close issues: about 1 month
- Average time to close pull requests: 24 days
- Issue authors: 109
- Pull request authors: 51
- Average comments per issue: 1.93
- Average comments per pull request: 1.31
- Merged pull requests: 81
- Bot issues: 0
- Bot pull requests: 13
Top Authors
Issue Authors
- david-waterworth (15)
- n1t0 (10)
- Narsil (5)
- SaulLu (4)
- pietrolesci (4)
- chris-ha458 (4)
- davidgilbertson (4)
- DOGEwbx (3)
- EricLBuehler (3)
- jafioti (3)
- xenova (3)
- 8ria (3)
- talolard (3)
- shivanraptor (3)
- DamonsJ (3)
Pull Request Authors
- Narsil (110)
- ArthurZucker (92)
- dependabot[bot] (39)
- chris-ha458 (10)
- sftse (10)
- hvaara (6)
- eaplatanios (5)
- tinyboxvk (5)
- 414owen (4)
- boyleconnor (4)
- sondalex (4)
- MeetThePatel (4)
- mjbommar (4)
- hamirmahal (4)
- bryantbiggs (4)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- actions-rs/toolchain v1 composite
- actions/checkout v1 composite
- actions/setup-python v1 composite
- actions/upload-artifact v2 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions/setup-node v1 composite
- actions/setup-python v1 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions/setup-node v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v2 composite
- conda-incubator/setup-miniconda v2 composite
- actions/checkout v1 composite
- actions/checkout v2 composite
- actions/setup-python v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v2 composite
- actions/checkout v1 composite
- actions/setup-python v1 composite
- actions/setup-python v4 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions/setup-python v2 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v1 composite
- pyo3 0.17.2 development
- tempfile 3.1 development
- env_logger 0.7.1
- itertools 0.9
- libc 0.2
- ndarray 0.13
- numpy 0.17.2
- onig 6.0
- pyo3 0.17.2
- rayon 1.3
- serde 1.0
- serde_json 1.0
- tokenizers *
- assert_approx_eq 1.1 development
- criterion 0.4 development
- tempfile 3.1 development
- aho-corasick 0.7
- cached-path 0.6
- clap 4.0
- derive_builder 0.12
- dirs 3.0
- esaxx-rs 0.1
- fancy-regex 0.10
- getrandom 0.2.6
- indicatif 0.15
- itertools 0.9
- lazy_static 1.4
- log 0.4
- macro_rules_attribute 0.1.2
- onig 6.0
- paste 1.0.6
- rand 0.8
- rayon 1.3
- rayon-cond 0.1
- regex 1.3
- regex-syntax 0.6
- reqwest 0.11
- serde 1.0
- serde_json 1.0
- spm_precompiled 0.1
- thiserror 1.0.30
- unicode-normalization-alignments 0.1
- unicode-segmentation 1.6
- unicode_categories 0.1
- wasm-bindgen-test 0.3.13 development
- console_error_panic_hook 0.1.6
- wasm-bindgen 0.2.63
- wee_alloc 0.4.5
- 627 dependencies
- @types/jest ^26.0.24 development
- @typescript-eslint/eslint-plugin ^3.10.1 development
- @typescript-eslint/parser ^3.10.1 development
- eslint ^7.32.0 development
- eslint-config-prettier ^6.15.0 development
- eslint-plugin-jest ^23.20.0 development
- eslint-plugin-jsdoc ^30.7.13 development
- eslint-plugin-prettier ^3.4.1 development
- eslint-plugin-simple-import-sort ^5.0.3 development
- jest ^26.6.3 development
- neon-cli ^0.9.1 development
- prettier ^2.5.1 development
- shelljs ^0.8.3 development
- ts-jest ^26.5.6 development
- typescript ^3.9.10 development
- @types/node ^13.13.52
- node-pre-gyp ^0.14.0
- 312 dependencies
- copy-webpack-plugin ^11.0.0 development
- webpack ^5.75.0 development
- webpack-cli ^5.0.1 development
- webpack-dev-server ^4.10.0 development
- unstable_wasm file:../pkg