pyonmttok
Fast and customizable text tokenization library with BPE and SentencePiece support
Repository
Basic Info
- Host: GitHub
- Owner: OpenNMT
- License: MIT
- Language: C++
- Default Branch: master
- Homepage: https://opennmt.net/
- Size: 1.69 MB
Statistics
- Stars: 314
- Watchers: 19
- Forks: 74
- Open Issues: 10
- Releases: 30
README.md
Tokenizer
Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.
Overview
By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:
- Reversible tokenization
  Marking joints or spaces by annotating tokens or injecting modifier characters.
- Subword tokenization
  Support for training and using BPE and SentencePiece models.
- Advanced text segmentation
  Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
- Case management
  Lowercase text and return case information as a separate feature or inject case modifier tokens.
- Protected sequences
  Sequences can be protected against tokenization with the special characters ｟ and ｠.
See the available options for an overview of supported features.
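To make the ideas of reversible tokenization and protected sequences more concrete, here is a minimal pure-Python sketch. It is not the library's implementation and does not use `pyonmttok`: the names `tokenize`, `detokenize`, `JOINER`, `PH_OPEN`, and `PH_CLOSE` are hypothetical, the marker characters are assumed to match the library defaults, and the rule for where the joiner is attached is deliberately simplified (the mark always goes on the left of the second token).

```python
import re

# Marker characters, assumed here to match the library defaults.
JOINER = "\uffed"    # "￭": marks a token that was attached to the previous one
PH_OPEN = "\uff5f"   # "｟": opens a protected sequence
PH_CLOSE = "\uff60"  # "｠": closes a protected sequence

# A token is a protected span, a run of word characters,
# or a single non-space, non-word character (punctuation, symbols).
_TOKEN_RE = re.compile(
    re.escape(PH_OPEN) + r".*?" + re.escape(PH_CLOSE) + r"|\w+|[^\w\s]"
)


def tokenize(text):
    """Split text and prefix tokens that had no space before them with JOINER."""
    tokens = []
    prev_end = None
    for match in _TOKEN_RE.finditer(text):
        token = match.group()
        # No whitespace between this token and the previous one: record the joint.
        if prev_end is not None and match.start() == prev_end:
            token = JOINER + token
        tokens.append(token)
        prev_end = match.end()
    return tokens


def detokenize(tokens):
    """Rebuild the text: joiner-marked tokens attach directly, others get a space."""
    parts = []
    for token in tokens:
        if token.startswith(JOINER):
            parts.append(token[len(JOINER):])
        else:
            if parts:
                parts.append(" ")
            parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    tokens = tokenize("Hello World!")
    print(tokens)
    print(detokenize(tokens))
    # The protected span is kept as a single token.
    print(tokenize("Call " + PH_OPEN + "A-1" + PH_CLOSE + " now."))
```

With this sketch, `"Hello World!"` yields three tokens, the last one carrying the joiner mark, and `detokenize` restores the original string. The round trip is exact only for single-space-separated input that does not itself contain the joiner character; the real library handles many more cases.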
Using
The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.
Python API
```bash
pip install pyonmttok
```

```python
>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens = tokenizer("Hello World!")
>>> tokens
['Hello', 'World', '￭!']
>>> tokenizer.detokenize(tokens)
'Hello World!'
```
See the Python API description for more details.
C++ API
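```cpp
#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode with joiner annotation, as in the Python example.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);
}
```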
See the Tokenizer class for more details.
Command line clients
```bash
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ￭!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!
```
See the -h flag to list the available options.
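When the command line client needs to be driven from a script, a thin wrapper around `subprocess` is one option. The sketch below is only an illustration: the helper names `build_tokenize_command` and `run_tokenize` are hypothetical, the binary is assumed to be available at `cli/tokenize` as in the example above, and only the `--mode` and `--joiner_annotate` flags shown there are used.

```python
import subprocess


def build_tokenize_command(binary="cli/tokenize", mode="conservative",
                           joiner_annotate=True):
    """Assemble the argument list for the tokenize client (hypothetical helper)."""
    command = [binary, "--mode", mode]
    if joiner_annotate:
        command.append("--joiner_annotate")
    return command


def run_tokenize(text, **options):
    """Pipe one line of text through the client and return the tokens.

    Assumes the client reads lines from stdin and writes space-separated
    tokens to stdout, as in the shell example above.
    """
    result = subprocess.run(
        build_tokenize_command(**options),
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.rstrip("\n").split(" ")


if __name__ == "__main__":
    # Only prints the command; call run_tokenize() once the client is built.
    print(build_tokenize_command())
```

Keeping command construction separate from execution makes the argument list easy to inspect or log before anything is run.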
Development
Dependencies
Compiling
CMake and a compiler that supports the C++11 standard are required to compile the project.
```bash
git submodule update --init
mkdir build
cd build
cmake ..
make
```
It will produce the dynamic library libOpenNMTTokenizer and tokenization clients in cli/.
- To compile only the library, use the `-DLIB_ONLY=ON` flag.
Testing
The tests use Google Test, which is included as a Git submodule. Run the tests with:
```bash
mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make
test/onmt_tokenizer_test ../test/data
```
Owner
- Name: OpenNMT
- Login: OpenNMT
- Kind: organization
- Website: https://opennmt.net/
- Repositories: 13
- Profile: https://github.com/OpenNMT
Open source ecosystem for neural machine translation and neural sequence learning
GitHub Events
Total
- Watch event: 34
- Push event: 2
- Pull request event: 4
- Fork event: 4
- Create event: 1
Last Year
- Watch event: 34
- Push event: 2
- Pull request event: 4
- Fork event: 4
- Create event: 1
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Guillaume Klein | g****n@s****m | 540 |
| Jean A. Senellart | j****t@s****m | 16 |
| jhnwnd | 4****d | 8 |
| Jean Senellart | j****n@s****m | 7 |
| Dakun ZHANG | z****n@g****m | 5 |
| Panos Kanavos | p****s@g****m | 4 |
| inull | 1****L | 2 |
| Minh-Thuc | 4****2 | 2 |
| odidev | o****v@p****m | 1 |
| kovalevfm | k****m@g****m | 1 |
| RnRoger | r****n@u****l | 1 |
| NM | 3****0 | 1 |
| Keichi Takahashi | k****t@m****m | 1 |
| DYCSystran | y****g@s****m | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 31
- Total pull requests: 82
- Average time to close issues: about 1 month
- Average time to close pull requests: 1 day
- Total issue authors: 21
- Total pull request authors: 7
- Average comments per issue: 5.16
- Average comments per pull request: 0.22
- Merged pull requests: 79
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 1
- Pull requests: 5
- Average time to close issues: N/A
- Average time to close pull requests: 1 minute
- Issue authors: 1
- Pull request authors: 4
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- vince62s (6)
- Zenglinxiao (2)
- rudyyin (2)
- panosk (2)
- NM-20 (2)
- anderleich (2)
- A2va (1)
- Zapotecatl (1)
- mediabuff (1)
- BrightXiaoHan (1)
- l-k-11235 (1)
- guillaumekln (1)
- filips123 (1)
- emabiz (1)
- areaChun (1)
Pull Request Authors
- guillaumekln (72)
- minhthuc2502 (4)
- panosk (4)
- hatboyzero (2)
- dependabot[bot] (2)
- NM-20 (1)
- odidev (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 22,812 last month
- Total docker downloads: 29
- Total dependent packages: 3
- Total dependent repositories: 103
- Total versions: 66
- Total maintainers: 4
pypi.org: pyonmttok
- Homepage: https://opennmt.net
- Documentation: https://pyonmttok.readthedocs.io/
- License: MIT
- Latest release: 1.37.1 (published almost 3 years ago)
Dependencies
- actions/checkout v3 composite
- actions/download-artifact v3 composite
- actions/setup-python v4 composite
- actions/upload-artifact v3 composite
- docker/setup-qemu-action v2 composite
- pypa/cibuildwheel v2.11.2 composite
- pypa/gh-action-pypi-publish release/v1 composite