nlpo3

Thai natural language processing library in Rust, with Python and Node bindings.

https://github.com/pythainlp/nlpo3

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 4 DOI reference(s) in README
  • Academic publication links
    Links to: acm.org, zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary

Keywords

hacktoberfest natural-language-processing nodejs python rust text-processing thai-language tokenizer
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: PyThaiNLP
  • License: apache-2.0
  • Language: Rust
  • Default Branch: main
  • Homepage:
  • Size: 1.09 MB
Statistics
  • Stars: 35
  • Watchers: 4
  • Forks: 9
  • Open Issues: 7
  • Releases: 14
Created over 4 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md


SPDX-FileCopyrightText: 2024 PyThaiNLP Project

SPDX-License-Identifier: Apache-2.0

nlpO3

Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp.

To use as a library in a Rust project:

```shell
cargo add nlpo3
```

To use as a library in a Python project:

```shell
pip install nlpo3
```

Features

  • Thai word tokenizer
    • Uses a maximal-matching, dictionary-based tokenization algorithm and honors Thai Character Cluster (TCC) boundaries
    • 2.5x faster than a similar pure-Python implementation (PyThaiNLP's newmm)
    • Loads a dictionary from a plain text file (one word per line) or from a Vec<String>
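
To make the idea of maximal matching concrete, here is a minimal sketch in plain Rust. It is not nlpO3's implementation: it ignores Thai Character Cluster boundaries, and the function name `maximal_matching` is hypothetical. It only illustrates the core idea of choosing, among all dictionary-based segmentations, one with the fewest tokens, with any out-of-dictionary character falling back to a single-character token.

```rust
use std::collections::HashSet;

/// Illustrative only (not nlpO3's code): segment `text` into the fewest
/// tokens, where a token is a dictionary word or a single fallback character.
fn maximal_matching(text: &str, dict: &HashSet<&str>) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let n = chars.len();
    let max_len = dict.iter().map(|w| w.chars().count()).max().unwrap_or(0);

    // cost[i]: fewest tokens needed to cover chars[i..].
    // next[i]: end index of the first token in that best segmentation.
    let mut cost = vec![0usize; n + 1];
    let mut next = vec![n; n + 1];
    for i in (0..n).rev() {
        // Fallback: treat the character at i as its own token.
        cost[i] = 1 + cost[i + 1];
        next[i] = i + 1;
        for len in 1..=max_len.min(n - i) {
            let candidate: String = chars[i..i + len].iter().collect();
            // `<=` prefers the longer word when two options tie on token count.
            if dict.contains(candidate.as_str()) && 1 + cost[i + len] <= cost[i] {
                cost[i] = 1 + cost[i + len];
                next[i] = i + len;
            }
        }
    }

    // Walk the `next` pointers to rebuild the chosen segmentation.
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < n {
        tokens.push(chars[i..next[i]].iter().collect::<String>());
        i = next[i];
    }
    tokens
}

fn main() {
    let dict: HashSet<&str> = ["ห้อง", "สมุด", "ห้องสมุด", "ประชา", "ชน", "ประชาชน"]
        .iter()
        .copied()
        .collect();
    let tokens = maximal_matching("ห้องสมุดประชาชน", &dict);
    println!("{}", tokens.join("|")); // prints "ห้องสมุด|ประชาชน"
}
```

The sketch works on Unicode code points, so a combining vowel or tone mark could in principle be split from its base consonant. Respecting Thai Character Cluster boundaries, as nlpO3 does, avoids that.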

Use

Node.js binding

See nlpo3-nodejs.

Python binding

Example:

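```python
from nlpo3 import load_dict, segment

# Load a dictionary file and register it under a name,
# then tokenize using that named dictionary.
load_dict("path/to/dict.file", "dict_name")
segment("สวัสดีครับ", "dict_name")
```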

See more at nlpo3-python.

Rust library

Add as a dependency

To use as a library in a Rust project:

```shell
cargo add nlpo3
```

This adds "nlpo3" to Cargo.toml:

```toml
[dependencies]
# ...
nlpo3 = "1.4.0"
```

Example

Create a tokenizer using a dictionary from a file, then use it to tokenize a string (safe mode = true, and parallel mode = false):

```rust
use nlpo3::tokenizer::newmm::NewmmTokenizer;
use nlpo3::tokenizer::tokenizer_trait::Tokenizer;

let tokenizer = NewmmTokenizer::new("path/to/dict.file");
// Arguments after the text: safe mode = true, parallel mode = false.
let tokens = tokenizer.segment("ห้องสมุดประชาชน", true, false).unwrap();
```

Create a tokenizer using a dictionary from a vector of Strings:

```rust
let words = vec!["ปาลิเมนต์".to_string(), "คอนสติติวชั่น".to_string()];
let tokenizer = NewmmTokenizer::from_word_list(words);
```
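
If the words come from a plain text file with one word per line, the list can be prepared with the standard library before it is handed to the tokenizer. The sketch below is only an illustration: `parse_word_list` is a hypothetical helper, not part of nlpO3, and trimming whitespace and skipping blank lines are assumptions about how such a file should be cleaned.

```rust
/// Hypothetical helper (not part of nlpO3): turn the contents of a
/// one-word-per-line dictionary file into a list of words.
fn parse_word_list(contents: &str) -> Vec<String> {
    contents
        .lines() // splits on "\n" and also strips a trailing "\r"
        .map(str::trim) // assumption: surrounding whitespace is not significant
        .filter(|line| !line.is_empty()) // assumption: blank lines are ignored
        .map(str::to_string)
        .collect()
}

fn main() {
    // In-memory stand-in for the contents of a dictionary file.
    let contents = "ปาลิเมนต์\n\n  คอนสติติวชั่น \r\n";
    let words = parse_word_list(contents);
    println!("{} words loaded", words.len()); // prints "2 words loaded"
}
```

In a real project the contents would typically come from `std::fs::read_to_string`, and the resulting `Vec<String>` is the kind of value `NewmmTokenizer::from_word_list` accepts in the example above.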

Add words to an existing tokenizer:

```rust
tokenizer.add_word(&["มิวเซียม"]);
```

Remove words from an existing tokenizer:

```rust
tokenizer.remove_word(&["กระเพรา", "ชานชลา"]);
```

Command-line interface

Example:

```bash
echo "ฉันกินข้าว" | nlpo3 segment
```

See more at nlpo3-cli.

Dictionary

  • To keep the library small, nlpO3 does not assume which dictionary the user wants to use, and it does not ship with one.
  • A dictionary is needed for the dictionary-based word tokenizer.
  • For tokenization dictionary, try

Build

Requirements

Steps

Generic test:

```bash
cargo test
```

Build API document and open it to check:

```bash
cargo doc --open
```

Build (remove --release to keep debug information):

```bash
cargo build --release
```

Check target/ for build artifacts.

Develop

Development document

Issues

License

nlpO3 is copyrighted by its authors and licensed under the terms of the Apache License 2.0 (Apache-2.0). See the LICENSE file for details.

Owner

  • Name: PyThaiNLP
  • Login: PyThaiNLP
  • Kind: organization
  • Location: Thailand

We build Thai NLP.

Citation (CITATION.cff)

cff-version: "1.2.0"
title: "nlpO3"
message: >-
  If you use this software, please cite it using these
  metadata.
type: software
authors:
  - family-names: Suntorntip
    given-names: Thanathip
  - family-names: Suriyawongkul
    given-names: Arthit
    orcid: "https://orcid.org/0000-0002-9698-1899"
  - family-names: Phatthiyaphaibun
    given-names: Wannaphong
    orcid: "https://orcid.org/0000-0002-4153-4354"
repository-code: "https://github.com/PyThaiNLP/nlpo3/"
repository: "https://github.com/PyThaiNLP/nlpo3/"
url: "https://github.com/PyThaiNLP/nlpo3/"
abstract: "Thai natural language processing library in Rust, with Python and Node bindings. Formerly oxidized-thainlp."
keywords:
  - "tokenizer"
  - "tokenization"
  - "Thai"
  - "natural language processing"
  - "NLP"
  - "Rust"
  - "Node.js"
  - "Node"
  - "Python"
  - "text processing"
  - "word segmentation"
  - "Thai language"
  - "Thai NLP"
license: Apache-2.0
version: v1.4.0
date-released: "2024-11-09"

GitHub Events

Total
  • Create event: 2
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Issue comment event: 5
  • Push event: 28
  • Pull request event: 45
  • Fork event: 3
Last Year
  • Create event: 2
  • Release event: 2
  • Issues event: 6
  • Watch event: 5
  • Issue comment event: 5
  • Push event: 28
  • Pull request event: 45
  • Fork event: 3

Committers

Last synced: 7 months ago

All Time
  • Total Commits: 421
  • Total Committers: 6
  • Avg Commits per committer: 70.167
  • Development Distribution Score (DDS): 0.404
Past Year
  • Commits: 78
  • Committers: 2
  • Avg Commits per committer: 39.0
  • Development Distribution Score (DDS): 0.013
Top Committers
  • Arthit Suriyawongkul (a****t@g****m): 251 commits
  • Thanathip Gorlph (g****h@h****m): 82 commits
  • Wannaphong Phatthiyaphaibun (w****g@y****m): 74 commits
  • Thanabodee Charoenpiriyakij (w****s@g****m): 7 commits
  • Vee Satayamas (5****v@r****m): 5 commits
  • cstorm125 (c****l@g****m): 2 commits
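
The page does not define the Development Distribution Score. A common definition, assumed here, is one minus the top committer's share of all commits. The sketch below (the function name `dds` is hypothetical) applies that assumed formula to the all-time commit counts listed above and reproduces the reported all-time figure of 0.404.

```rust
/// Assumed definition: DDS = 1 - (top committer's commits / total commits).
/// Returns None when there are no commits to measure.
fn dds(commits: &[u32]) -> Option<f64> {
    let total: u32 = commits.iter().sum();
    let top = *commits.iter().max()?;
    if total == 0 {
        return None;
    }
    Some(1.0 - f64::from(top) / f64::from(total))
}

fn main() {
    // All-time commit counts per committer, as listed above.
    let all_time = [251, 82, 74, 7, 5, 2];
    let total: u32 = all_time.iter().sum();
    println!("total commits: {}", total); // prints "total commits: 421"
    println!("DDS: {:.3}", dds(&all_time).unwrap()); // prints "DDS: 0.404"
}
```

A score near 0 means one person authored almost all commits; a higher score means the work is spread across more committers.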

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 26
  • Total pull requests: 119
  • Average time to close issues: 26 days
  • Average time to close pull requests: 5 days
  • Total issue authors: 5
  • Total pull request authors: 6
  • Average comments per issue: 1.96
  • Average comments per pull request: 1.24
  • Merged pull requests: 105
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 24
  • Average time to close issues: 4 days
  • Average time to close pull requests: about 2 hours
  • Issue authors: 2
  • Pull request authors: 1
  • Average comments per issue: 0.67
  • Average comments per pull request: 0.13
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • bact (6)
  • wannaphong (4)
  • wingyplus (3)
  • Gorlph (1)
  • thipokKub (1)
Pull Request Authors
  • bact (66)
  • Gorlph (15)
  • wingyplus (7)
  • wannaphong (5)
  • veer66 (3)
  • cstorm125 (1)
Top Labels
Issue Labels
bug (4), Hacktoberfest (3), enhancement (2), documentation (1), question (1), infrastructure (1), help wanted (1)
Pull Request Labels
documentation (22), enhancement (16), infrastructure (7), bug (5), refactoring (5), hacktoberfest-accepted (3), Hacktoberfest (1), help wanted (1)

Packages

  • Total packages: 5
  • Total downloads:
    • pypi 2,962 last-month
    • npm 2 last-month
    • cargo 23,411 total
  • Total dependent packages: 3
    (may contain duplicates)
  • Total dependent repositories: 5
    (may contain duplicates)
  • Total versions: 25
  • Total maintainers: 6
pypi.org: nlpo3

Python binding for nlpO3 Thai language processing library in Rust

  • Versions: 10
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 2,898 Last month
Rankings
Dependent packages count: 4.8%
Downloads: 7.8%
Dependent repos count: 8.9%
Average: 9.3%
Stargazers count: 11.6%
Forks count: 13.3%
Maintainers (1)
Last synced: 4 months ago
pypi.org: pythainlp-rust-modules

pythainlp-rust-modules is now nlpo3

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 64 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 11.6%
Forks count: 13.3%
Average: 18.0%
Dependent repos count: 21.6%
Downloads: 33.3%
Maintainers (2)
Last synced: 4 months ago
crates.io: nlpo3

Thai natural language processing library, with Python and Node bindings

  • Versions: 8
  • Dependent Packages: 1
  • Dependent Repositories: 1
  • Downloads: 19,664 Total
Rankings
Dependent repos count: 16.5%
Dependent packages count: 18.2%
Forks count: 18.6%
Average: 19.6%
Stargazers count: 20.1%
Downloads: 24.4%
Maintainers (2)
Last synced: 4 months ago
npmjs.org: nlpo3

Node.js binding for nlpO3 Thai language processing library

  • Versions: 1
  • Dependent Packages: 1
  • Dependent Repositories: 0
  • Downloads: 2 Last month
Rankings
Forks count: 9.7%
Stargazers count: 10.1%
Dependent packages count: 16.2%
Average: 21.9%
Dependent repos count: 25.3%
Downloads: 48.3%
Maintainers (2)
Last synced: 4 months ago
crates.io: nlpo3-cli

Command line interface for nlpO3, a Thai natural language processing library

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 3,747 Total
Rankings
Forks count: 17.7%
Stargazers count: 21.1%
Dependent repos count: 29.3%
Average: 32.7%
Dependent packages count: 33.8%
Downloads: 61.4%
Maintainers (2)
Last synced: 4 months ago

Dependencies

nlpo3-cli/Cargo.lock cargo
  • ahash 0.7.4
  • aho-corasick 0.7.18
  • atty 0.2.14
  • autocfg 1.0.1
  • binary-heap-plus 0.4.1
  • bitflags 1.2.1
  • bytecount 0.6.2
  • cfg-if 1.0.0
  • clap 3.0.0-beta.2
  • clap_derive 3.0.0-beta.2
  • compare 0.1.0
  • crossbeam-channel 0.5.1
  • crossbeam-deque 0.8.1
  • crossbeam-epoch 0.9.5
  • crossbeam-utils 0.8.5
  • either 1.6.1
  • getrandom 0.2.3
  • hashbrown 0.11.2
  • heck 0.3.3
  • hermit-abi 0.1.19
  • indexmap 1.7.0
  • lazy_static 1.4.0
  • libc 0.2.98
  • memchr 2.4.0
  • memoffset 0.6.4
  • nlpo3 1.2.0
  • num_cpus 1.13.0
  • once_cell 1.8.0
  • os_str_bytes 2.4.0
  • proc-macro-error 1.0.4
  • proc-macro-error-attr 1.0.4
  • proc-macro2 1.0.28
  • quote 1.0.9
  • rayon 1.5.1
  • rayon-core 1.9.1
  • regex 1.5.4
  • regex-syntax 0.6.25
  • scopeguard 1.1.0
  • smol_str 0.1.18
  • strsim 0.10.0
  • syn 1.0.74
  • termcolor 1.1.2
  • textwrap 0.12.1
  • unicode-segmentation 1.8.0
  • unicode-width 0.1.8
  • unicode-xid 0.2.2
  • vec_map 0.8.2
  • version_check 0.9.3
  • wasi 0.10.2+wasi-snapshot-preview1
  • winapi 0.3.9
  • winapi-i686-pc-windows-gnu 0.4.0
  • winapi-util 0.1.5
  • winapi-x86_64-pc-windows-gnu 0.4.0
nlpo3-nodejs/Cargo.lock cargo
  • ahash 0.7.6
  • aho-corasick 0.7.18
  • anyhow 1.0.45
  • autocfg 1.0.1
  • binary-heap-plus 0.4.1
  • bytecount 0.6.2
  • cfg-if 1.0.0
  • compare 0.1.0
  • crossbeam-channel 0.5.1
  • crossbeam-deque 0.8.1
  • crossbeam-epoch 0.9.5
  • crossbeam-utils 0.8.5
  • cslice 0.2.0
  • either 1.6.1
  • getrandom 0.2.3
  • hermit-abi 0.1.19
  • lazy_static 1.4.0
  • libc 0.2.107
  • libloading 0.6.7
  • memchr 2.4.1
  • memoffset 0.6.4
  • neon 0.8.3
  • neon-build 0.8.3
  • neon-macros 0.8.3
  • neon-runtime 0.8.3
  • nlpo3 1.3.1
  • num_cpus 1.13.0
  • once_cell 1.8.0
  • proc-macro2 1.0.32
  • quote 1.0.10
  • rayon 1.5.1
  • rayon-core 1.9.1
  • regex 1.5.4
  • regex-syntax 0.6.25
  • rustc-hash 1.1.0
  • scopeguard 1.1.0
  • semver 0.9.0
  • semver-parser 0.7.0
  • smallvec 1.7.0
  • syn 1.0.81
  • unicode-xid 0.2.2
  • version_check 0.9.3
  • wasi 0.10.2+wasi-snapshot-preview1
  • winapi 0.3.9
  • winapi-i686-pc-windows-gnu 0.4.0
  • winapi-x86_64-pc-windows-gnu 0.4.0
nlpo3-python/Cargo.lock cargo
  • ahash 0.7.6
  • aho-corasick 0.7.18
  • anyhow 1.0.45
  • autocfg 1.0.1
  • binary-heap-plus 0.4.1
  • bitflags 1.3.2
  • bytecount 0.6.2
  • cfg-if 1.0.0
  • compare 0.1.0
  • crossbeam-channel 0.5.1
  • crossbeam-deque 0.8.1
  • crossbeam-epoch 0.9.5
  • crossbeam-utils 0.8.5
  • either 1.6.1
  • getrandom 0.2.3
  • hermit-abi 0.1.19
  • indoc 0.3.6
  • indoc-impl 0.3.6
  • instant 0.1.12
  • lazy_static 1.4.0
  • libc 0.2.107
  • lock_api 0.4.5
  • memchr 2.4.1
  • memoffset 0.6.4
  • nlpo3 1.3.2
  • num_cpus 1.13.0
  • once_cell 1.8.0
  • parking_lot 0.11.2
  • parking_lot_core 0.8.5
  • paste 0.1.18
  • paste-impl 0.1.18
  • proc-macro-hack 0.5.19
  • proc-macro2 1.0.32
  • pyo3 0.15.0
  • pyo3-build-config 0.15.0
  • pyo3-macros 0.15.0
  • pyo3-macros-backend 0.15.0
  • quote 1.0.10
  • rayon 1.5.1
  • rayon-core 1.9.1
  • redox_syscall 0.2.10
  • regex 1.5.4
  • regex-syntax 0.6.25
  • rustc-hash 1.1.0
  • scopeguard 1.1.0
  • smallvec 1.7.0
  • syn 1.0.81
  • unicode-xid 0.2.2
  • unindent 0.1.7
  • version_check 0.9.3
  • wasi 0.10.2+wasi-snapshot-preview1
  • winapi 0.3.9
  • winapi-i686-pc-windows-gnu 0.4.0
  • winapi-x86_64-pc-windows-gnu 0.4.0
nlpo3-nodejs/package-lock.json npm
  • cargo-cp-artifact 0.1.4 development
  • typescript 4.3.5 development
nlpo3-nodejs/package.json npm
  • cargo-cp-artifact ^0.1 development
  • typescript ^4.3.5 development
nlpo3-python/pyproject.toml pypi
  • pytest * develop
  • pytest-runner * develop
  • wheel * develop
  • python ^3.6
.github/workflows/build-python-wheels.yml actions
  • actions-rs/toolchain v1 composite
  • actions/cache v2 composite
  • actions/checkout v2 composite
  • actions/checkout v3 composite
  • actions/download-artifact v2 composite
  • actions/setup-python v3 composite
  • actions/setup-python v2 composite
  • actions/upload-artifact v2 composite
  • pypa/cibuildwheel v2.11.2 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
.github/workflows/test-main-lib.yml actions
  • ATiltedTree/setup-rust v1 composite
  • actions-rs/toolchain v1 composite
  • actions/cache v2 composite
  • actions/checkout master composite
.github/workflows/test-nlpo3-cli.yml actions
  • actions-rs/cargo v1 composite
  • actions-rs/toolchain v1 composite
  • actions/checkout master composite
Cargo.toml cargo
nlpo3-cli/Cargo.toml cargo
nlpo3-nodejs/Cargo.toml cargo
nlpo3-python/Cargo.toml cargo
nlpo3-python/setup.py pypi