nlpo3
Thai natural language processing library in Rust, with Python and Node bindings.
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 4 DOI reference(s) in README
- ✓ Academic publication links: links to acm.org, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.6%) to scientific vocabulary
Repository
Statistics
- Stars: 35
- Watchers: 4
- Forks: 9
- Open Issues: 7
- Releases: 14
Metadata Files
README.md
SPDX-FileCopyrightText: 2024 PyThaiNLP Project
SPDX-License-Identifier: Apache-2.0
nlpO3
Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.
To use as a library in a Rust project:
```shell
cargo add nlpo3
```
To use as a library in a Python project:
```shell
pip install nlpo3
```
Features
- Thai word tokenizer
  - Uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster boundaries
  - 2.5x faster than a similar pure Python implementation (PyThaiNLP's newmm)
  - Loads a dictionary from a plain text file (one word per line) or from a `Vec<String>`
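To illustrate the idea behind maximal matching, here is a minimal pure Python sketch. It is not nlpO3's implementation: the function name `maximal_match` is hypothetical, Thai Character Cluster boundaries are ignored, and "maximal matching" is taken to mean choosing the segmentation with the fewest tokens, with characters not covered by the dictionary falling back to single-character tokens.

```python
def maximal_match(text, words):
    """Split text into the fewest tokens, preferring dictionary words.

    Hypothetical helper for illustration only; it ignores Thai Character
    Cluster boundaries, which the real tokenizer honors.
    """
    vocab = set(words)
    max_len = max((len(w) for w in vocab), default=1)
    n = len(text)
    # best[i] = (token count, unknown-character count) for the best split of text[i:]
    best = [(0, 0)] * (n + 1)
    # next_cut[i] = end index of the token that starts at i in the best split
    next_cut = [n] * (n + 1)
    for i in range(n - 1, -1, -1):
        # Fallback: emit text[i] as an unknown single-character token.
        tokens, unknown = best[i + 1]
        best[i] = (tokens + 1, unknown + 1)
        next_cut[i] = i + 1
        # Try every dictionary word that starts at position i.
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in vocab:
                tokens, unknown = best[j]
                candidate = (tokens + 1, unknown)
                if candidate < best[i]:
                    best[i] = candidate
                    next_cut[i] = j
    # Walk the recorded cuts to rebuild the token list.
    result = []
    i = 0
    while i < n:
        result.append(text[i:next_cut[i]])
        i = next_cut[i]
    return result


print(maximal_match("abcd", {"ab", "abc", "cd"}))  # ['ab', 'cd']
print(maximal_match("ฉันกินข้าว", {"ฉัน", "กิน", "ข้าว"}))  # ['ฉัน', 'กิน', 'ข้าว']
```

The first call shows why this differs from greedy longest-match: taking "abc" first would leave "d" uncovered, whereas looking at the whole string finds a split made entirely of dictionary words.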
Use
Node.js binding
See nlpo3-nodejs.
Python binding
Example:
```python
from nlpo3 import load_dict, segment

load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```
See more at nlpo3-python.
Rust library
Add to dependency
To use as a library in a Rust project:
```shell
cargo add nlpo3
```
It will add "nlpo3" to Cargo.toml:
```toml
[dependencies]
...
nlpo3 = "1.4.0"
```
Example
Create a tokenizer using a dictionary from file, then use it to tokenize a string (safe mode = true, and parallel mode = false):
```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
```
Create a tokenizer using a dictionary from a vector of Strings:
```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```
Add words to an existing tokenizer:
```rust
tokenizer.add_word(&["มิวเซียม"]);
```
Remove words from an existing tokenizer:
```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```
Command-line interface
Example:
```bash
echo "ฉันกินข้าว" | nlpo3 segment
```
See more at nlpo3-cli.
Dictionary
- To keep the library small, nlpO3 does not bundle a dictionary and makes no assumption about which dictionary the user would like to use.
- A dictionary is needed for the dictionary-based word tokenizer.
- For a tokenization dictionary, try:
  - words_th.txt from PyThaiNLP
    - ~62,000 words
    - CC0-1.0
  - word break dictionary from libthai
    - consists of dictionaries in different categories, with a make script
    - LGPL-2.1
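Since the tokenizer expects a plain text dictionary with one word per line, a word list gathered from several sources may need cleaning first. The following is a minimal sketch under that assumption; `read_word_list` and `write_word_list` are hypothetical helper names and are not part of nlpO3.

```python
from pathlib import Path


def read_word_list(path):
    """Return the unique, non-empty words from a one-word-per-line file, in file order."""
    seen = set()
    words = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip()
        # Skip blank lines and words that were already seen.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


def write_word_list(words, path):
    """Write words to path, one word per line, as UTF-8."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


# The cleaned file can then be passed to the binding shown earlier, e.g.:
# load_dict("path/to/dict.file", "dict_name")
```

Keeping the words in file order makes the cleaned dictionary easy to diff against its source.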
Build
Requirements
Steps
Generic test:
```bash
cargo test
```
Build API document and open it to check:
```bash
cargo doc --open
```
Build (remove --release to keep debug information):
```bash
cargo build --release
```
Check target/ for build artifacts.
Develop
Development document
Issues
- Please report issues at https://github.com/PyThaiNLP/nlpo3/issues
License
nlpO3 is copyrighted by its authors and licensed under the terms of the Apache Software License 2.0 (Apache-2.0). See the LICENSE file for details.
Owner
- Name: PyThaiNLP
- Login: PyThaiNLP
- Kind: organization
- Location: Thailand
- Website: https://pythainlp.github.io
- Repositories: 50
- Profile: https://github.com/PyThaiNLP
We build Thai NLP.
Citation (CITATION.cff)
cff-version: "1.2.0"
title: "nlpO3"
message: >-
  If you use this software, please cite it using these
  metadata.
type: software
authors:
  - family-names: Suntorntip
    given-names: Thanathip
  - family-names: Suriyawongkul
    given-names: Arthit
    orcid: "https://orcid.org/0000-0002-9698-1899"
  - family-names: Phatthiyaphaibun
    given-names: Wannaphong
    orcid: "https://orcid.org/0000-0002-4153-4354"
repository-code: "https://github.com/PyThaiNLP/nlpo3/"
repository: "https://github.com/PyThaiNLP/nlpo3/"
url: "https://github.com/PyThaiNLP/nlpo3/"
abstract: "Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp."
keywords:
  - "tokenizer"
  - "tokenization"
  - "Thai"
  - "natural language processing"
  - "NLP"
  - "Rust"
  - "Node.js"
  - "Node"
  - "Python"
  - "text processing"
  - "word segmentation"
  - "Thai language"
  - "Thai NLP"
license: Apache-2.0
version: v1.4.0
date-released: "2024-11-09"
GitHub Events
Total
- Create event: 2
- Release event: 2
- Issues event: 6
- Watch event: 5
- Issue comment event: 5
- Push event: 28
- Pull request event: 45
- Fork event: 3
Last Year
- Create event: 2
- Release event: 2
- Issues event: 6
- Watch event: 5
- Issue comment event: 5
- Push event: 28
- Pull request event: 45
- Fork event: 3
Committers
Last synced: 7 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Arthit Suriyawongkul | a****t@g****m | 251 |
| Thanathip Gorlph | g****h@h****m | 82 |
| Wannaphong Phatthiyaphaibun | w****g@y****m | 74 |
| Thanabodee Charoenpiriyakij | w****s@g****m | 7 |
| Vee Satayamas | 5****v@r****m | 5 |
| cstorm125 | c****l@g****m | 2 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 26
- Total pull requests: 119
- Average time to close issues: 26 days
- Average time to close pull requests: 5 days
- Total issue authors: 5
- Total pull request authors: 6
- Average comments per issue: 1.96
- Average comments per pull request: 1.24
- Merged pull requests: 105
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 3
- Pull requests: 24
- Average time to close issues: 4 days
- Average time to close pull requests: about 2 hours
- Issue authors: 2
- Pull request authors: 1
- Average comments per issue: 0.67
- Average comments per pull request: 0.13
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- bact (6)
- wannaphong (4)
- wingyplus (3)
- Gorlph (1)
- thipokKub (1)
Pull Request Authors
- bact (66)
- Gorlph (15)
- wingyplus (7)
- wannaphong (5)
- veer66 (3)
- cstorm125 (1)
Packages
- Total packages: 5
- Total downloads:
  - pypi 2,962 last-month
  - npm 2 last-month
  - cargo 23,411 total
- Total dependent packages: 3 (may contain duplicates)
- Total dependent repositories: 5 (may contain duplicates)
- Total versions: 25
- Total maintainers: 6
pypi.org: nlpo3
Python binding for nlpO3 Thai language processing library in Rust
- Homepage: https://github.com/PyThaiNLP/nlpo3/tree/main/nlpo3-python
- Documentation: https://nlpo3.readthedocs.io/
- License: Apache-2.0
- Latest release: 1.3.1 (published about 1 year ago)
Rankings
Maintainers (1)
pypi.org: pythainlp-rust-modules
pythainlp-rust-modules is now nlpo3
- Homepage: https://github.com/PyThaiNLP/nlpo3/
- Documentation: https://pythainlp-rust-modules.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.2.2 (published over 4 years ago)
Rankings
Maintainers (2)
crates.io: nlpo3
Thai natural language processing library, with Python and Node bindings
- Homepage: https://github.com/PyThaiNLP/nlpo3/
- Documentation: https://docs.rs/nlpo3/
- License: Apache-2.0
- Latest release: 1.4.0 (published about 1 year ago)
Rankings
Maintainers (2)
npmjs.org: nlpo3
Node.js binding for nlpO3 Thai language processing library
- Homepage: https://github.com/PyThaiNLP/nlpo3/
- License: Apache-2.0
- Latest release: 0.2.1 (published over 4 years ago)
Rankings
crates.io: nlpo3-cli
Command line interface for nlpO3, a Thai natural language processing library
- Homepage: https://github.com/PyThaiNLP/nlpo3/tree/main/nlpo3-cli/
- Documentation: https://docs.rs/nlpo3-cli/
- License: Apache-2.0
- Latest release: 0.2.0 (published over 4 years ago)
Rankings
Dependencies
- ahash 0.7.4
- aho-corasick 0.7.18
- atty 0.2.14
- autocfg 1.0.1
- binary-heap-plus 0.4.1
- bitflags 1.2.1
- bytecount 0.6.2
- cfg-if 1.0.0
- clap 3.0.0-beta.2
- clap_derive 3.0.0-beta.2
- compare 0.1.0
- crossbeam-channel 0.5.1
- crossbeam-deque 0.8.1
- crossbeam-epoch 0.9.5
- crossbeam-utils 0.8.5
- either 1.6.1
- getrandom 0.2.3
- hashbrown 0.11.2
- heck 0.3.3
- hermit-abi 0.1.19
- indexmap 1.7.0
- lazy_static 1.4.0
- libc 0.2.98
- memchr 2.4.0
- memoffset 0.6.4
- nlpo3 1.2.0
- num_cpus 1.13.0
- once_cell 1.8.0
- os_str_bytes 2.4.0
- proc-macro-error 1.0.4
- proc-macro-error-attr 1.0.4
- proc-macro2 1.0.28
- quote 1.0.9
- rayon 1.5.1
- rayon-core 1.9.1
- regex 1.5.4
- regex-syntax 0.6.25
- scopeguard 1.1.0
- smol_str 0.1.18
- strsim 0.10.0
- syn 1.0.74
- termcolor 1.1.2
- textwrap 0.12.1
- unicode-segmentation 1.8.0
- unicode-width 0.1.8
- unicode-xid 0.2.2
- vec_map 0.8.2
- version_check 0.9.3
- wasi 0.10.2+wasi-snapshot-preview1
- winapi 0.3.9
- winapi-i686-pc-windows-gnu 0.4.0
- winapi-util 0.1.5
- winapi-x86_64-pc-windows-gnu 0.4.0
- ahash 0.7.6
- aho-corasick 0.7.18
- anyhow 1.0.45
- autocfg 1.0.1
- binary-heap-plus 0.4.1
- bytecount 0.6.2
- cfg-if 1.0.0
- compare 0.1.0
- crossbeam-channel 0.5.1
- crossbeam-deque 0.8.1
- crossbeam-epoch 0.9.5
- crossbeam-utils 0.8.5
- cslice 0.2.0
- either 1.6.1
- getrandom 0.2.3
- hermit-abi 0.1.19
- lazy_static 1.4.0
- libc 0.2.107
- libloading 0.6.7
- memchr 2.4.1
- memoffset 0.6.4
- neon 0.8.3
- neon-build 0.8.3
- neon-macros 0.8.3
- neon-runtime 0.8.3
- nlpo3 1.3.1
- num_cpus 1.13.0
- once_cell 1.8.0
- proc-macro2 1.0.32
- quote 1.0.10
- rayon 1.5.1
- rayon-core 1.9.1
- regex 1.5.4
- regex-syntax 0.6.25
- rustc-hash 1.1.0
- scopeguard 1.1.0
- semver 0.9.0
- semver-parser 0.7.0
- smallvec 1.7.0
- syn 1.0.81
- unicode-xid 0.2.2
- version_check 0.9.3
- wasi 0.10.2+wasi-snapshot-preview1
- winapi 0.3.9
- winapi-i686-pc-windows-gnu 0.4.0
- winapi-x86_64-pc-windows-gnu 0.4.0
- ahash 0.7.6
- aho-corasick 0.7.18
- anyhow 1.0.45
- autocfg 1.0.1
- binary-heap-plus 0.4.1
- bitflags 1.3.2
- bytecount 0.6.2
- cfg-if 1.0.0
- compare 0.1.0
- crossbeam-channel 0.5.1
- crossbeam-deque 0.8.1
- crossbeam-epoch 0.9.5
- crossbeam-utils 0.8.5
- either 1.6.1
- getrandom 0.2.3
- hermit-abi 0.1.19
- indoc 0.3.6
- indoc-impl 0.3.6
- instant 0.1.12
- lazy_static 1.4.0
- libc 0.2.107
- lock_api 0.4.5
- memchr 2.4.1
- memoffset 0.6.4
- nlpo3 1.3.2
- num_cpus 1.13.0
- once_cell 1.8.0
- parking_lot 0.11.2
- parking_lot_core 0.8.5
- paste 0.1.18
- paste-impl 0.1.18
- proc-macro-hack 0.5.19
- proc-macro2 1.0.32
- pyo3 0.15.0
- pyo3-build-config 0.15.0
- pyo3-macros 0.15.0
- pyo3-macros-backend 0.15.0
- quote 1.0.10
- rayon 1.5.1
- rayon-core 1.9.1
- redox_syscall 0.2.10
- regex 1.5.4
- regex-syntax 0.6.25
- rustc-hash 1.1.0
- scopeguard 1.1.0
- smallvec 1.7.0
- syn 1.0.81
- unicode-xid 0.2.2
- unindent 0.1.7
- version_check 0.9.3
- wasi 0.10.2+wasi-snapshot-preview1
- winapi 0.3.9
- winapi-i686-pc-windows-gnu 0.4.0
- winapi-x86_64-pc-windows-gnu 0.4.0
- cargo-cp-artifact 0.1.4 development
- typescript 4.3.5 development
- cargo-cp-artifact ^0.1 development
- typescript ^4.3.5 development
- pytest * develop
- pytest-runner * develop
- wheel * develop
- python ^3.6
- actions-rs/toolchain v1 composite
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/checkout v3 composite
- actions/download-artifact v2 composite
- actions/setup-python v3 composite
- actions/setup-python v2 composite
- actions/upload-artifact v2 composite
- pypa/cibuildwheel v2.11.2 composite
- pypa/gh-action-pypi-publish release/v1 composite
- actions/checkout v2 composite
- github/codeql-action/analyze v1 composite
- github/codeql-action/autobuild v1 composite
- github/codeql-action/init v1 composite
- ATiltedTree/setup-rust v1 composite
- actions-rs/toolchain v1 composite
- actions/cache v2 composite
- actions/checkout master composite
- actions-rs/cargo v1 composite
- actions-rs/toolchain v1 composite
- actions/checkout master composite