https://github.com/bminixhofer/tokenizers
Science Score: 18.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
○codemeta.json file
-
○.zenodo.json file
-
○DOI references
-
○Academic publication links
-
○Academic email domains
-
○Institutional organization owner
-
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (13.6%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: bminixhofer
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Homepage: https://huggingface.co/docs/tokenizers
- Size: 5.83 MB
Statistics
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
Bindings
We provide bindings to the following languages (more to come!): - Rust (Original implementation) - Python - Node.js - Ruby (Contributed by @ankane, external repo)
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
```python from tokenizers import Tokenizer from tokenizers.models import BPE
tokenizer = Tokenizer(BPE()) ```
You can customize how pre-tokenization (e.g., splitting into words) is done:
```python from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace() ```
Then training your tokenizer on a set of files just takes two lines of codes:
```python from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]) tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer) ```
Once your tokenizer is trained, encode any text with just one line: ```python output = tokenizer.encode("Hello, y'all! How are you ?") print(output.tokens)
["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
Check the python documentation or the
python quicktour to learn more!
Owner
- Name: Benjamin Minixhofer
- Login: bminixhofer
- Kind: user
- Location: Linz, Austria
- Website: bmin.ai
- Twitter: bminixhofer
- Repositories: 31
- Profile: https://github.com/bminixhofer
PhD Student @cambridgeltl
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
type: software
authors:
- given-names: Anthony
family-names: Moi
email: m.anthony.moi@gmail.com
affiliation: HuggingFace
- given-names: Nicolas
family-names: Patry
affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
Fast State-of-the-Art Tokenizers optimized for Research
and Production.
keywords:
- Rust
- Tokenizer
- NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'