maintenance_chatbot

https://github.com/shreyakumaran/maintenance_chatbot

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (13.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: shreyakumaran
License: apache-2.0
Language: Rust
Default Branch: main
Size: 6.17 MB

Statistics

Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created almost 2 years ago · Last pushed almost 2 years ago

Metadata Files

Readme License Citation

README.md

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Main features:

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignments tracking. It's always possible to get the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: Truncate, Pad, add the special tokens your model needs.
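
To make the alignment-tracking feature concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function names `normalize_with_alignment` and `original_span` are hypothetical, and the choice of NFKD decomposition, accent stripping, and lowercasing is an assumption made for illustration. The idea is simply to remember, for every character of the normalized string, which original character produced it.

```python
import unicodedata


def normalize_with_alignment(text):
    """Lowercase and strip accents, recording for each output character
    the index of the original character it came from (hypothetical helper)."""
    out_chars = []
    alignment = []
    for i, ch in enumerate(text):
        # NFKD splits accented letters into base + combining mark,
        # and expands compatibility characters such as ligatures.
        for piece in unicodedata.normalize("NFKD", ch):
            if unicodedata.combining(piece):
                continue  # drop the accent mark
            for low in piece.lower():
                out_chars.append(low)
                alignment.append(i)
    return "".join(out_chars), alignment


def original_span(text, alignment, start, end):
    """Map the span [start, end) of the normalized string back to the original text."""
    return text[alignment[start]: alignment[end - 1] + 1]


# "\ufb01" is the single-character "fi" ligature, "\u00e9" is "e" with an acute accent.
text = "\ufb01ne caf\u00e9"
normalized, alignment = normalize_with_alignment(text)
print(normalized)                               # fine cafe
print(original_span(text, alignment, 5, 9))     # café
```

The normalized string is one character longer than the input because the ligature expands to two letters, yet the span for `cafe` still maps back to the accented original.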

Bindings

We provide bindings to the following languages (more to come!):

- Rust (Original implementation)
- Python
- Node.js
- Ruby (Contributed by @ankane, external repo)

Quick example using Python:

Choose your model between Byte-Pair Encoding, WordPiece or Unigram and instantiate a tokenizer:

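```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# unk_token tells the model which token to emit for symbols it has never seen;
# the "[UNK]" in the sample output further down relies on it.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
```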

You can customize how pre-tokenization (e.g., splitting into words) is done:

```python
from tokenizers.pre_tokenizers import Whitespace

tokenizer.pre_tokenizer = Whitespace()
```
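
As a rough illustration of what such a pre-tokenization step produces, the sketch below splits a string into word and punctuation pieces and keeps each piece's character span. It uses only the standard `re` module. The pattern `\w+|[^\w\s]+` is an assumption about how a whitespace-and-punctuation splitter behaves, and `pre_tokenize` is a hypothetical helper, not a library function.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation.
PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs; offsets index into the original text."""
    return [(m.group(), m.span()) for m in PATTERN.finditer(text)]


pieces = pre_tokenize("Hello, y'all!")
for piece, span in pieces:
    print(piece, span)
# Hello (0, 5)
# , (5, 6)
# y (7, 8)
# ' (8, 9)
# all (9, 12)
# ! (12, 13)
```

Keeping the spans next to the pieces is what later makes it possible to point from a token back to the exact characters it came from.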

Then training your tokenizer on a set of files takes just two lines of code:

```python
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)
```
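
For readers unfamiliar with Byte-Pair Encoding, the following toy sketch shows the core idea behind training: repeatedly find the most frequent adjacent pair of symbols and merge it into a new symbol. It is plain Python written for illustration and makes no claim about how the Rust implementation works. The helper names, the tiny word-frequency corpus, and the lexicographic tie-breaking rule are all assumptions.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a {word: frequency} mapping."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Ties are broken lexicographically so the result is deterministic
        # (an arbitrary choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2)
print(merges)  # [('s', 't'), ('e', 'st')]
```

After two merges the frequent suffix `est` has become a single symbol, which is how common subwords end up in the vocabulary.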

Once your tokenizer is trained, encode any text with just one line:

```python
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]
```
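
The `[UNK]` in that output, together with the truncation, padding and special-token handling listed among the features, can be pictured with a small plain-Python sketch. It is a simplified model, not the library's code: `build_vocab` and `encode` are hypothetical helpers, and the `[CLS] ... [SEP]` layout and the fixed `max_length` are assumptions chosen for the example.

```python
SPECIAL_TOKENS = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]


def build_vocab(tokens):
    """Assign IDs to the special tokens first, then to each new token in order."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_length):
    """Map tokens to IDs with an [UNK] fallback, then truncate and pad to max_length."""
    body = [tok if tok in vocab else "[UNK]" for tok in tokens]
    body = body[: max_length - 2]            # leave room for [CLS] and [SEP]
    out = ["[CLS]"] + body + ["[SEP]"]
    out += ["[PAD]"] * (max_length - len(out))
    ids = [vocab[tok] for tok in out]
    attention_mask = [0 if tok == "[PAD]" else 1 for tok in out]
    return out, ids, attention_mask


vocab = build_vocab(["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "?"])
# "\U0001F601" is the grinning-face emoji, which is absent from the vocabulary.
tokens, ids, mask = encode(["Hello", ",", "you", "\U0001F601", "?"], vocab, max_length=8)
print(tokens)  # ['[CLS]', 'Hello', ',', 'you', '[UNK]', '?', '[SEP]', '[PAD]']
print(ids)     # [1, 5, 6, 13, 0, 14, 2, 3]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 0]
```

The attention mask marks padding positions with 0 so that a downstream model can ignore them.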

Check the documentation or the quicktour to learn more!

Owner

Login: shreyakumaran
Kind: user

Repositories: 1
Profile: https://github.com/shreyakumaran

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: HuggingFace's Tokenizers
message: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
type: software
authors:
  - given-names: Anthony
    family-names: Moi
    email: m.anthony.moi@gmail.com
    affiliation: HuggingFace
  - given-names: Nicolas
    family-names: Patry
    affiliation: HuggingFace
repository-code: 'https://github.com/huggingface/tokenizers'
url: 'https://github.com/huggingface/tokenizers'
repository: 'https://huggingface.co'
abstract: >-
  Fast State-of-the-Art Tokenizers optimized for Research
  and Production.
keywords:
  - Rust
  - Tokenizer
  - NLP
license: Apache-2.0
commit: 37372b6
version: 0.13.4
date-released: '2023-04-05'

Dependencies

.github/workflows/build_documentation.yml actions

.github/workflows/build_pr_documentation.yml actions

.github/workflows/delete_doc_comment.yml actions

.github/workflows/delete_doc_comment_trigger.yml actions

.github/workflows/docs-check.yml actions

actions-rs/toolchain v1 composite
actions/checkout v3 composite
actions/setup-python v1 composite
actions/upload-artifact v2 composite

.github/workflows/node-release.yml actions

actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite
actions/checkout v3 composite
actions/download-artifact v3 composite
actions/setup-node v3 composite
actions/setup-python v1 composite
actions/upload-artifact v3 composite

.github/workflows/node.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v1 composite
actions/setup-node v3 composite

.github/workflows/python-release-conda.yml actions

actions-rs/toolchain v1 composite
actions/checkout v3 composite
actions/checkout v2 composite
conda-incubator/setup-miniconda v2 composite

.github/workflows/python-release.yml actions

PyO3/maturin-action v1 composite
actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/python.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v3 composite
actions/checkout v2 composite
actions/setup-python v2 composite
actions/setup-python v4 composite

.github/workflows/rust-release.yml actions

actions-rs/toolchain v1 composite
actions/cache v1 composite
actions/checkout v3 composite

.github/workflows/rust.yml actions

actions-rs/cargo v1 composite
actions-rs/toolchain v1 composite
actions/checkout v3 composite

.github/workflows/stale.yml actions

actions/stale v8 composite

.github/workflows/trufflehog.yml actions

actions/checkout v4 composite
trufflesecurity/trufflehog main composite

.github/workflows/upload_pr_documentation.yml actions

bindings/node/Cargo.toml cargo

bindings/python/Cargo.toml cargo

pyo3 0.21 development
tempfile 3.10 development
env_logger 0.11
itertools 0.12
libc 0.2
ndarray 0.15
numpy 0.21
onig 6.4
pyo3 0.21
rayon 1.10
serde 1.0
serde_json 1.0

tokenizers/Cargo.toml cargo

assert_approx_eq 1.1 development
criterion 0.5 development
tempfile 3.10 development
tracing 0.1 development
tracing-subscriber 0.3.18 development
aho-corasick 1.1
derive_builder 0.20
esaxx-rs 0.1.10
fancy-regex 0.13
getrandom 0.2.10
hf-hub 0.3.2
indicatif 0.17
itertools 0.12
lazy_static 1.4
log 0.4
macro_rules_attribute 0.2.0
monostate 0.1.12
onig 6.4
paste 1.0.14
rand 0.8
rayon 1.10
rayon-cond 0.3
regex 1.10
regex-syntax 0.8
serde 1.0
serde_json 1.0
spm_precompiled 0.1
thiserror 1.0.49
unicode-normalization-alignments 0.1
unicode-segmentation 1.11
unicode_categories 0.1

tokenizers/examples/unstable_wasm/Cargo.toml cargo

wasm-bindgen-test 0.3.13 development
console_error_panic_hook 0.1.6
wasm-bindgen 0.2.63
wee_alloc 0.4.5

bindings/node/npm/android-arm-eabi/package.json npm

bindings/node/npm/android-arm64/package.json npm

bindings/node/npm/darwin-arm64/package.json npm

bindings/node/npm/darwin-x64/package.json npm

bindings/node/npm/freebsd-x64/package.json npm

bindings/node/npm/linux-arm-gnueabihf/package.json npm

bindings/node/npm/linux-arm64-gnu/package.json npm

bindings/node/npm/linux-arm64-musl/package.json npm

bindings/node/npm/linux-x64-gnu/package.json npm

bindings/node/npm/linux-x64-musl/package.json npm

bindings/node/npm/win32-arm64-msvc/package.json npm

bindings/node/npm/win32-ia32-msvc/package.json npm

bindings/node/npm/win32-x64-msvc/package.json npm

bindings/node/package.json npm

@napi-rs/cli ^2.14.6 development
@swc-node/register ^1.5.5 development
@swc/core ^1.3.32 development
@taplo/cli ^0.5.2 development
@types/jest ^29.5.1 development
@typescript-eslint/eslint-plugin ^5.50.0 development
@typescript-eslint/parser ^5.50.0 development
ava ^5.1.1 development
benny ^3.7.1 development
chalk ^5.2.0 development
eslint ^8.33.0 development
eslint-config-prettier ^8.6.0 development
eslint-plugin-import ^2.27.5 development
eslint-plugin-prettier ^4.2.1 development
husky ^8.0.3 development
jest ^29.5.0 development
lint-staged ^13.1.0 development
npm-run-all ^4.1.5 development
prettier ^2.8.3 development
ts-jest ^29.1.0 development
typescript ^5.0.0 development

bindings/node/yarn.lock npm

686 dependencies

tokenizers/examples/unstable_wasm/www/package-lock.json npm

311 dependencies

tokenizers/examples/unstable_wasm/www/package.json npm

copy-webpack-plugin ^11.0.0 development
webpack ^5.75.0 development
webpack-cli ^5.0.1 development
webpack-dev-server ^4.10.0 development
unstable_wasm file:../pkg

bindings/python/pyproject.toml pypi

huggingface_hub >=0.16.4,<1.0

requirements.txt pypi

flask *
flask-cors *
openpyxl *
pandas *
spacy *
torch *
transformers ==4.9.2
