Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: shreyakumaran
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Size: 6.17 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md




Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

  • Train new vocabularies and tokenize, using today's most used tokenizers.
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
  • Handles all the pre-processing: truncation, padding, and adding the special tokens your model needs.
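
To make the alignment-tracking feature more concrete, here is a minimal pure-Python sketch of the idea. It is not the library's Rust implementation: the function names `normalize_with_alignment` and `original_span` are hypothetical, and the normalization chosen here (accent stripping plus lowercasing) is just one assumed example.

```python
import unicodedata


def normalize_with_alignment(text):
    """Strip accents and lowercase, recording for every output character
    the index of the original character it came from."""
    out_chars = []
    alignment = []  # alignment[i] = index in `text` that produced out_chars[i]
    for index, char in enumerate(text):
        # NFD splits an accented letter into a base letter plus combining marks.
        for piece in unicodedata.normalize("NFD", char):
            if unicodedata.combining(piece):
                continue  # drop the accent itself
            for lowered in piece.lower():
                out_chars.append(lowered)
                alignment.append(index)
    return "".join(out_chars), alignment


def original_span(text, alignment, start, end):
    """Map the [start, end) span of the normalized string back to the original."""
    return text[alignment[start] : alignment[end - 1] + 1]


original = "H\u00e9llo W\u00f6rld"  # "Héllo Wörld", written with escapes
normalized, alignment = normalize_with_alignment(original)
print(normalized)                                 # hello world
print(original_span(original, alignment, 6, 11))  # Wörld
```

Because every normalized character remembers where it came from, a token found in the normalized text can always be traced back to the matching slice of the original sentence.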

Bindings

We provide bindings to the following languages (more to come!):

  • Rust (Original implementation)
  • Python
  • Node.js
  • Ruby (Contributed by @ankane, external repo)

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# unk_token lets the model emit "[UNK]" for symbols missing from the vocabulary
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```

You can customize how pre-tokenization (e.g., splitting into words) is done:

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
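
As a rough illustration of what this pre-tokenization step produces, the sketch below splits text with the regular expression `\w+|[^\w\s]+` and keeps character offsets. It uses only Python's standard `re` module; the helper name `whitespace_pre_tokenize` is hypothetical, and the sketch is an approximation of the behaviour, not the library's code.

```python
import re

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_pre_tokenize(text):
    """Return (piece, (start, end)) pairs, with offsets into the original text."""
    return [(m.group(), (m.start(), m.end())) for m in WORD_OR_PUNCT.finditer(text)]


pieces = whitespace_pre_tokenize("Hello, y'all!")
for piece, offsets in pieces:
    print(piece, offsets)
```

With the library itself, calling `pre_tokenize_str` on a pre-tokenizer should give a comparable list of pieces and offsets to inspect.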

Then training your tokenizer on a set of files takes just two lines of code:

```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
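
For readers curious about what training a Byte-Pair Encoding vocabulary involves, here is a toy pure-Python sketch of the merge-learning loop. It is an illustration only, under simplifying assumptions (character-level start, words already split, no special tokens), and is not how the library's trainer is implemented. The names `merge_pair` and `train_toy_bpe` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        merged_corpus = Counter()
        for symbols, count in corpus.items():
            merged_corpus[merge_pair(symbols, best)] += count
        corpus = merged_corpus
    return merges


words = ["hug"] * 10 + ["pug"] * 5 + ["pun"] * 12 + ["bun"] * 4 + ["hugs"] * 5
merges = train_toy_bpe(words, num_merges=3)
print(merges)  # [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

Each learned merge adds one new symbol to the vocabulary, so the number of merges controls the vocabulary size.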

Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
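
The truncation, padding, and special-token handling listed among the features can be pictured on a plain list of tokens like the one printed above. The sketch below is a pure-Python illustration with a hypothetical helper `post_process`; the BERT-style `[CLS] ... [SEP]` layout and the fixed-length padding are assumptions made for the example. In the library this behaviour is configured on the tokenizer itself (for instance through `enable_truncation` and `enable_padding`) instead of being written by hand.

```python
def post_process(tokens, max_length, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Truncate, wrap in special tokens, then pad to exactly `max_length`."""
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    body = tokens[: max_length - 2]  # truncate, reserving two slots
    framed = [cls, *body, sep]       # add the special tokens
    padding = max_length - len(framed)
    attention_mask = [1] * len(framed) + [0] * padding  # 1 = real token, 0 = padding
    return framed + [pad] * padding, attention_mask


tokens = ["Hello", ",", "y", "'", "all", "!"]

short, short_mask = post_process(tokens, max_length=6)
print(short)       # ['[CLS]', 'Hello', ',', 'y', "'", '[SEP]']

padded, padded_mask = post_process(tokens, max_length=10)
print(padded)      # ends with two '[PAD]' entries
print(padded_mask) # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```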

Check the documentation or the quicktour to learn more!

Owner

  • Login: shreyakumaran
  • Kind: user

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'


Dependencies

.github/workflows/build_documentation.yml actions
.github/workflows/build_pr_documentation.yml actions
.github/workflows/delete_doc_comment.yml actions
.github/workflows/delete_doc_comment_trigger.yml actions
.github/workflows/docs-check.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v2 composite
.github/workflows/node-release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-node v3 composite
  • actions/setup-python v1 composite
  • actions/upload-artifact v3 composite
.github/workflows/node.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v1 composite
  • actions/setup-node v3 composite
.github/workflows/python-release-conda.yml actions
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
.github/workflows/python-release.yml actions
  • PyO3/maturin-action v1 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/python.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v3 composite
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • actions/setup-python v4 composite
.github/workflows/rust-release.yml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v1 composite
  • actions/checkout v3 composite
.github/workflows/rust.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout v3 composite
.github/workflows/stale.yml actions
  • actions/stale v8 composite
.github/workflows/trufflehog.yml actions
  • actions/checkout v4 composite
  • trufflesecurity/trufflehog main composite
.github/workflows/upload_pr_documentation.yml actions
bindings/node/Cargo.toml cargo
bindings/python/Cargo.toml cargo
  • pyo3 0.21 development
  • tempfile 3.10 development
  • env_logger 0.11
  • itertools 0.12
  • libc 0.2
  • ndarray 0.15
  • numpy 0.21
  • onig 6.4
  • pyo3 0.21
  • rayon 1.10
  • serde 1.0
  • serde_json 1.0
tokenizers/Cargo.toml cargo
  • assert_approx_eq 1.1 development
  • criterion 0.5 development
  • tempfile 3.10 development
  • tracing 0.1 development
  • tracing-subscriber 0.3.18 development
  • aho-corasick 1.1
  • derive_builder 0.20
  • esaxx-rs 0.1.10
  • fancy-regex 0.13
  • getrandom 0.2.10
  • hf-hub 0.3.2
  • indicatif 0.17
  • itertools 0.12
  • lazy_static 1.4
  • log 0.4
  • macro_rules_attribute 0.2.0
  • monostate 0.1.12
  • onig 6.4
  • paste 1.0.14
  • rand 0.8
  • rayon 1.10
  • rayon-cond 0.3
  • regex 1.10
  • regex-syntax 0.8
  • serde 1.0
  • serde_json 1.0
  • spm_precompiled 0.1
  • thiserror 1.0.49
  • unicode-normalization-alignments 0.1
  • unicode-segmentation 1.11
  • unicode_categories 0.1
tokenizers/examples/unstable_wasm/Cargo.toml cargo
  • wasm-bindgen-test 0.3.13 development
  • console_error_panic_hook 0.1.6
  • wasm-bindgen 0.2.63
  • wee_alloc 0.4.5
bindings/node/npm/android-arm-eabi/package.json npm
bindings/node/npm/android-arm64/package.json npm
bindings/node/npm/darwin-arm64/package.json npm
bindings/node/npm/darwin-x64/package.json npm
bindings/node/npm/freebsd-x64/package.json npm
bindings/node/npm/linux-arm-gnueabihf/package.json npm
bindings/node/npm/linux-arm64-gnu/package.json npm
bindings/node/npm/linux-arm64-musl/package.json npm
bindings/node/npm/linux-x64-gnu/package.json npm
bindings/node/npm/linux-x64-musl/package.json npm
bindings/node/npm/win32-arm64-msvc/package.json npm
bindings/node/npm/win32-ia32-msvc/package.json npm
bindings/node/npm/win32-x64-msvc/package.json npm
bindings/node/package.json npm
  • @napi-rs/cli ^2.14.6 development
  • @swc-node/register ^1.5.5 development
  • @swc/core ^1.3.32 development
  • @taplo/cli ^0.5.2 development
  • @types/jest ^29.5.1 development
  • @typescript-eslint/eslint-plugin ^5.50.0 development
  • @typescript-eslint/parser ^5.50.0 development
  • ava ^5.1.1 development
  • benny ^3.7.1 development
  • chalk ^5.2.0 development
  • eslint ^8.33.0 development
  • eslint-config-prettier ^8.6.0 development
  • eslint-plugin-import ^2.27.5 development
  • eslint-plugin-prettier ^4.2.1 development
  • husky ^8.0.3 development
  • jest ^29.5.0 development
  • lint-staged ^13.1.0 development
  • npm-run-all ^4.1.5 development
  • prettier ^2.8.3 development
  • ts-jest ^29.1.0 development
  • typescript ^5.0.0 development
bindings/node/yarn.lock npm
  • 686 dependencies
tokenizers/examples/unstable_wasm/www/package-lock.json npm
  • 311 dependencies
tokenizers/examples/unstable_wasm/www/package.json npm
  • copy-webpack-plugin ^11.0.0 development
  • webpack ^5.75.0 development
  • webpack-cli ^5.0.1 development
  • webpack-dev-server ^4.10.0 development
  • unstable_wasm file:../pkg
bindings/python/pyproject.toml pypi
  • huggingface_hub >=0.16.4,<1.0
requirements.txt pypi
  • flask *
  • flask-cors *
  • openpyxl *
  • pandas *
  • spacy *
  • torch *
  • transformers ==4.9.2