pyonmttok

Fast and customizable text tokenization library with BPE and SentencePiece support

https://github.com/opennmt/tokenizer

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.4%) to scientific vocabulary

Keywords

bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode

Keywords from Contributors

language-model llms neural-machine-translation
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: OpenNMT
  • License: MIT
  • Language: C++
  • Default Branch: master
  • Homepage: https://opennmt.net/
  • Size: 1.69 MB
Statistics
  • Stars: 314
  • Watchers: 19
  • Forks: 74
  • Open Issues: 10
  • Releases: 30
Topics
bpe cpp icu machine-translation natural-language-processing python sentencepiece tokenization tokenizer unicode
Created about 9 years ago · Last pushed 10 months ago
Metadata Files
Readme Changelog License

README.md

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters ⦅ and ⦆.

See the available options for an overview of supported features.
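To make the protected sequences idea from the list above concrete, here is a minimal pure-Python sketch. It is not the library's implementation: it assumes a much simpler splitting rule than the real Tokenizer (runs of word characters versus single punctuation characters), and the names `TOKEN_PATTERN` and `tokenize_with_protection` are hypothetical.

```python
import re

# Order matters: the protected-span alternative is tried first, so text
# between ⦅ and ⦆ is never split by the word/punctuation rules.
TOKEN_PATTERN = re.compile(r"⦅.*?⦆|\w+|[^\w\s]")


def tokenize_with_protection(text):
    """Split text into words and punctuation, keeping ⦅...⦆ spans whole."""
    return TOKEN_PATTERN.findall(text)


# The URL stays in one piece because it is wrapped in ⦅ and ⦆.
print(tokenize_with_protection("Visit ⦅http://example.com⦆ now!"))
```

Without the markers, the same toy rule would break the URL at every `:`, `/`, and `.`, which is the situation protected sequences are meant to avoid.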

Using

The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.

Python API

```bash
pip install pyonmttok
```

```python
>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '■!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
```

See the Python API description for more details.
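The round trip in the example works because the joiner marker records where a token was attached to its neighbor. The following is a simplified pure-Python sketch of that idea, not the library's implementation: `JOINER` is copied from the example output as it appears in this document, and `detokenize_sketch` is a hypothetical name.

```python
JOINER = "■"  # marker as it appears in this document's example output


def detokenize_sketch(tokens, joiner=JOINER):
    """Rebuild text from joiner-annotated tokens (simplified illustration)."""
    text = ""
    glue_next = False  # True when the previous token ended with a joiner
    for token in tokens:
        glue_left = token.startswith(joiner)
        # A token that is only the joiner itself is treated as left-glued.
        glue_right = token.endswith(joiner) and len(token) > len(joiner)
        core = token
        if glue_left:
            core = core[len(joiner):]
        if glue_right:
            core = core[:-len(joiner)]
        # Insert a space only when neither side asked to be joined.
        if text and not (glue_left or glue_next):
            text += " "
        text += core
        glue_next = glue_right
    return text


print(detokenize_sketch(["Hello", "World", "■!"]))
```

A marker at the start of a token glues it to the previous token, and a marker at the end glues it to the next one, so no information about the original spacing is lost.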

C++ API

```cpp
#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);
}
```

See the Tokenizer class for more details.

Command line clients

```bash
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ■!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!
```

Use the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

```bash
git submodule update --init
mkdir build
cd build
cmake ..
make
```

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

```bash
mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data
```

Owner

  • Name: OpenNMT
  • Login: OpenNMT
  • Kind: organization

Open source ecosystem for neural machine translation and neural sequence learning

GitHub Events

Total
  • Watch event: 34
  • Push event: 2
  • Pull request event: 4
  • Fork event: 4
  • Create event: 1
Last Year
  • Watch event: 34
  • Push event: 2
  • Pull request event: 4
  • Fork event: 4
  • Create event: 1

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 590
  • Total Committers: 14
  • Avg Commits per committer: 42.143
  • Development Distribution Score (DDS): 0.085
Past Year
  • Commits: 2
  • Committers: 1
  • Avg Commits per committer: 2.0
  • Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Guillaume Klein | g****n@s****m | 540 |
| Jean A. Senellart | j****t@s****m | 16 |
| jhnwnd | 4****d | 8 |
| Jean Senellart | j****n@s****m | 7 |
| Dakun ZHANG | z****n@g****m | 5 |
| Panos Kanavos | p****s@g****m | 4 |
| inull | 1****L | 2 |
| Minh-Thuc | 4****2 | 2 |
| odidev | o****v@p****m | 1 |
| kovalevfm | k****m@g****m | 1 |
| RnRoger | r****n@u****l | 1 |
| NM | 3****0 | 1 |
| Keichi Takahashi | k****t@m****m | 1 |
| DYCSystran | y****g@s****m | 1 |
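The all-time figures can be cross-checked against the Top Committers table. The sketch below assumes that the Development Distribution Score is one minus the most active committer's share of all commits. This page does not state that definition, but it reproduces the reported 0.085, and it also gives 0.0 for the past year, when a single committer made every commit.

```python
# Commit counts copied from the Top Committers table.
commits = [540, 16, 8, 7, 5, 4, 2, 2, 1, 1, 1, 1, 1, 1]

total = sum(commits)
average = total / len(commits)
# Assumption: DDS = 1 - (commits by the most active committer / total commits).
dds = 1 - max(commits) / total

print(total, len(commits), round(average, 3), round(dds, 3))
```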

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 31
  • Total pull requests: 82
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 1 day
  • Total issue authors: 21
  • Total pull request authors: 7
  • Average comments per issue: 5.16
  • Average comments per pull request: 0.22
  • Merged pull requests: 79
  • Bot issues: 0
  • Bot pull requests: 1
Past Year
  • Issues: 1
  • Pull requests: 5
  • Average time to close issues: N/A
  • Average time to close pull requests: 1 minute
  • Issue authors: 1
  • Pull request authors: 4
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 2
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • vince62s (6)
  • Zenglinxiao (2)
  • rudyyin (2)
  • panosk (2)
  • NM-20 (2)
  • anderleich (2)
  • A2va (1)
  • Zapotecatl (1)
  • mediabuff (1)
  • BrightXiaoHan (1)
  • l-k-11235 (1)
  • guillaumekln (1)
  • filips123 (1)
  • emabiz (1)
  • areaChun (1)
Pull Request Authors
  • guillaumekln (72)
  • minhthuc2502 (4)
  • panosk (4)
  • hatboyzero (2)
  • dependabot[bot] (2)
  • NM-20 (1)
  • odidev (1)
Top Labels
Issue Labels
  • enhancement (5)
  • help wanted (2)
  • question (1)
Pull Request Labels
  • dependencies (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 22,812 last-month
  • Total docker downloads: 29
  • Total dependent packages: 3
  • Total dependent repositories: 103
  • Total versions: 66
  • Total maintainers: 4
pypi.org: pyonmttok

  • Versions: 66
  • Dependent Packages: 3
  • Dependent Repositories: 103
  • Downloads: 22,812 Last month
  • Docker Downloads: 29
Rankings
Dependent repos count: 1.5%
Downloads: 3.1%
Dependent packages count: 3.2%
Average: 3.6%
Docker downloads count: 4.0%
Stargazers count: 4.3%
Forks count: 5.4%
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/checkout v3 composite
  • actions/download-artifact v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • docker/setup-qemu-action v2 composite
  • pypa/cibuildwheel v2.11.2 composite
  • pypa/gh-action-pypi-publish release/v1 composite
bindings/python/setup.py pypi