maintenance_chatbot
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: shreyakumaran
- License: apache-2.0
- Language: Rust
- Default Branch: main
- Size: 6.17 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
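To make the last feature concrete, here is a minimal pure-Python sketch of what "truncate, pad, add the special tokens" means for a single sequence. It is not the library's implementation: the `prepare` helper, its parameters, and the BERT-style `[CLS]`/`[SEP]`/`[PAD]` layout are assumptions chosen for illustration.

```python
from typing import List


def prepare(
    tokens: List[str],
    max_length: int,
    cls: str = "[CLS]",
    sep: str = "[SEP]",
    pad: str = "[PAD]",
) -> List[str]:
    """Truncate, frame with special tokens, then right-pad to max_length.

    Hypothetical helper: the name and the BERT-style layout are assumptions.
    """
    if max_length < 2:
        raise ValueError("max_length must leave room for the two special tokens")
    # Truncate first so the two special tokens always fit.
    body = tokens[: max_length - 2]
    framed = [cls] + body + [sep]
    # Pad on the right up to the requested length.
    return framed + [pad] * (max_length - len(framed))


short = prepare(["Hello", ",", "y"], max_length=6)
long = prepare(["a", "b", "c", "d", "e", "f"], max_length=6)
print(short)
print(long)
```

Both results have exactly `max_length` entries: the short input gains padding, and the long input loses its trailing tokens before the closing special token is added.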
Bindings
We provide bindings to the following languages (more to come!):
- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)
Quick example using Python:
Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# unk_token tells the model which token to emit for input it cannot encode
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```
You can customize how pre-tokenization (e.g., splitting into words) is done:
```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
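As a rough illustration of what word-level pre-tokenization with alignment tracking produces, the sketch below splits text with the pattern `\w+|[^\w\s]+` (the one the library documentation gives for `Whitespace`) and keeps each piece's character offsets. It uses Python's `re` module, not the Rust implementation, so behaviour on unusual Unicode input may differ; the `pre_tokenize` function is a hypothetical stand-in.

```python
import re
from typing import List, Tuple

# Runs of word characters, or runs of characters that are neither word nor space.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text: str) -> List[Tuple[str, Tuple[int, int]]]:
    """Return (piece, (start, end)) pairs; hypothetical stand-in for a pre-tokenizer."""
    return [(m.group(), m.span()) for m in WORD_OR_PUNCT.finditer(text)]


text = "Hello, y'all!"
pieces = pre_tokenize(text)
for piece, (start, end) in pieces:
    # Each offset pair slices the original text back to the piece.
    assert text[start:end] == piece
print(pieces)
```

Keeping the offsets next to each piece is what makes it possible to map a token back to the span of the original sentence it came from.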
Then training your tokenizer on a set of files takes just two lines of code:
```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
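For readers curious about what BPE training does conceptually, here is a toy pure-Python sketch of the classic merge-learning loop: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a teaching sketch under stated assumptions, not the library's optimized Rust trainer; the `learn_bpe` function, the tiny word list, and the lexicographic tie-breaking rule are all choices made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def learn_bpe(words: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    vocab: Dict[Tuple[str, ...], int] = Counter(tuple(w) for w in words)
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption
        # made here so the result is deterministic).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_vocab: Dict[Tuple[str, ...], int] = Counter()
        for symbols, freq in vocab.items():
            merged: List[str] = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_bpe(corpus, num_merges=3)
print(merges)
```

The learned merges are applied in order at encoding time, so frequent character sequences end up as single tokens while rare words fall back to smaller pieces.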
Once your tokenizer is trained, encode any text with just one line:
```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
Check the documentation or the quicktour to learn more!
Owner
- Login: shreyakumaran
- Kind: user
- Repositories: 1
- Profile: https://github.com/shreyakumaran
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'
Dependencies
- actions-rs/toolchain v1 composite
- actions/checkout v3 composite
- actions/setup-python v1 composite
- actions/upload-artifact v2 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-node v3 composite
- actions/setup-python v1 composite
- actions/upload-artifact v3 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v1 composite
- actions/setup-node v3 composite
- actions-rs/toolchain v1 composite
- actions/checkout v3 composite
- actions/checkout v2 composite
- conda-incubator/setup-miniconda v2 composite
- PyO3/maturin-action v1 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v3 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/setup-python v4 composite
- actions-rs/toolchain v1 composite
- actions/cache v1 composite
- actions/checkout v3 composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout v3 composite
- actions/stale v8 composite
- actions/checkout v4 composite
- trufflesecurity/trufflehog main composite
- pyo3 0.21 development
- tempfile 3.10 development
- env_logger 0.11
- itertools 0.12
- libc 0.2
- ndarray 0.15
- numpy 0.21
- onig 6.4
- pyo3 0.21
- rayon 1.10
- serde 1.0
- serde_json 1.0
- assert_approx_eq 1.1 development
- criterion 0.5 development
- tempfile 3.10 development
- tracing 0.1 development
- tracing-subscriber 0.3.18 development
- aho-corasick 1.1
- derive_builder 0.20
- esaxx-rs 0.1.10
- fancy-regex 0.13
- getrandom 0.2.10
- hf-hub 0.3.2
- indicatif 0.17
- itertools 0.12
- lazy_static 1.4
- log 0.4
- macro_rules_attribute 0.2.0
- monostate 0.1.12
- onig 6.4
- paste 1.0.14
- rand 0.8
- rayon 1.10
- rayon-cond 0.3
- regex 1.10
- regex-syntax 0.8
- serde 1.0
- serde_json 1.0
- spm_precompiled 0.1
- thiserror 1.0.49
- unicode-normalization-alignments 0.1
- unicode-segmentation 1.11
- unicode_categories 0.1
- wasm-bindgen-test 0.3.13 development
- console_error_panic_hook 0.1.6
- wasm-bindgen 0.2.63
- wee_alloc 0.4.5
- @napi-rs/cli ^2.14.6 development
- @swc-node/register ^1.5.5 development
- @swc/core ^1.3.32 development
- @taplo/cli ^0.5.2 development
- @types/jest ^29.5.1 development
- @typescript-eslint/eslint-plugin ^5.50.0 development
- @typescript-eslint/parser ^5.50.0 development
- ava ^5.1.1 development
- benny ^3.7.1 development
- chalk ^5.2.0 development
- eslint ^8.33.0 development
- eslint-config-prettier ^8.6.0 development
- eslint-plugin-import ^2.27.5 development
- eslint-plugin-prettier ^4.2.1 development
- husky ^8.0.3 development
- jest ^29.5.0 development
- lint-staged ^13.1.0 development
- npm-run-all ^4.1.5 development
- prettier ^2.8.3 development
- ts-jest ^29.1.0 development
- typescript ^5.0.0 development
- 686 dependencies
- 311 dependencies
- copy-webpack-plugin ^11.0.0 development
- webpack ^5.75.0 development
- webpack-cli ^5.0.1 development
- webpack-dev-server ^4.10.0 development
- unstable_wasm file:../pkg
- huggingface_hub >=0.16.4,<1.0
- flask *
- flask-cors *
- openpyxl *
- pandas *
- spacy *
- torch *
- transformers ==4.9.2