pythainlp

Thai natural language processing in Python

https://github.com/pythainlp/pythainlp

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.3%) to scientific vocabulary

Keywords

computational-linguistics hacktoberfest natural-language-processing nlp-library python soundex text-processing thai thai-language thai-nlp thai-nlp-library thai-soundex word-segmentation

Keywords from Contributors

cryptocurrencies fake-data fake faker faker-generator test-data test-data-generator vocabulary examples graph-generation
Last synced: 4 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: PyThaiNLP
  • License: apache-2.0
  • Language: Python
  • Default Branch: dev
  • Homepage: https://pythainlp.org/
  • Size: 67.7 MB
Statistics
  • Stars: 1,065
  • Watchers: 47
  • Forks: 282
  • Open Issues: 39
  • Releases: 123
Created over 9 years ago · Last pushed 4 months ago
Metadata Files
Readme Changelog Contributing License Code of conduct Citation Security Codemeta

README.md

PyThaiNLP: Thai Natural Language Processing in Python

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK but focused on the Thai language.

For details in Thai, see README_TH.MD.

Quick install

    pip install pythainlp

| Version | Description | Status |
|:------:|:--:|:------:|
| 5.1.2 | Stable | Change Log |
| dev | Release Candidate for 5.2 | Change Log |

Getting Started

Capabilities

PyThaiNLP provides standard linguistic analysis for the Thai language, along with standard Thai locale utility functions. Some of these functions are also available through the command-line interface (run thainlp in your shell).

Partial list of features:

  • Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords), comparable to constants like string.ascii_letters, string.digits, and string.punctuation
  • Linguistic unit segmentation at different levels: sentence (sent_tokenize), word (word_tokenize), and subword (subword_tokenize)
  • Part-of-speech tagging (pos_tag)
  • Spelling suggestion and correction (spell and correct)
  • Phonetic algorithm and transliteration (soundex and transliterate)
  • Collation (sorted by dictionary order) (collate)
  • Number read out (num_to_thaiword and bahttext)
  • Datetime formatting (thai_strftime)
  • Correction of text typed with the wrong Thai/English keyboard layout (eng_to_thai, thai_to_eng)
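
As a rough illustration of the feature list above, the sketch below calls a handful of the named functions. The `demo` wrapper, the sample text, and the sample numbers are made up for this example, and the import paths (`pythainlp.tokenize`, `pythainlp.util`) are assumed from the current package layout; check the API documentation of your installed version before relying on them.

```python
def demo(text: str = "ฉันรักภาษาไทย") -> dict:
    """Run a few PyThaiNLP functions from the feature list on `text`.

    `demo` is a hypothetical helper for illustration, not part of PyThaiNLP.
    """
    # Imported lazily, so PyThaiNLP is only needed when demo() is called.
    from pythainlp import thai_consonants, thai_digits
    from pythainlp.tokenize import word_tokenize
    from pythainlp.util import bahttext, collate, num_to_thaiword

    return {
        # Word segmentation with the default tokenizer engine.
        "words": word_tokenize(text),
        # thai_consonants is a plain string, so membership tests work per character.
        "consonant_count": sum(ch in thai_consonants for ch in text),
        "digit_characters": thai_digits,
        # Number read-out, as plain words and as a currency amount.
        "number_in_words": num_to_thaiword(2024),
        "amount_in_baht": bahttext(1250.50),
        # Sort words in Thai dictionary order.
        "sorted_words": collate(["ไก่", "กา", "ขวด"]),
    }
```

Calling `demo()` returns a dictionary of results; exact tokenization output may vary with the library version and the dictionary in use.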

Installation

    pip install --upgrade pythainlp

This will install the latest stable release of PyThaiNLP.

Install different releases:

  • Stable release: pip install --upgrade pythainlp
  • Pre-release (nearly ready): pip install --upgrade --pre pythainlp
  • Development (likely to break things): pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installation Options

Some functionalities, like Thai WordNet, may require extra packages. To install those requirements, specify a set of [name] immediately after pythainlp:

    pip install "pythainlp[extra1,extra2,...]"

Possible extras:

  • full (install everything)
  • compact (install a stable and small subset of dependencies)
  • attacut (to support attacut, a fast and accurate tokenizer)
  • benchmarks (for word tokenization benchmarking)
  • icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
  • ipa (for IPA, International Phonetic Alphabet, support in transliteration)
  • ml (to support ULMFiT models for classification)
  • thai2fit (for Thai word vector)
  • thai2rom (for machine-learnt romanization)
  • wordnet (for Thai WordNet API)

For dependency details, look at the extras variable in setup.py.
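
If you generate install commands from a script, the extras syntax can be assembled programmatically. This is a minimal sketch; `requirement_with_extras` is a hypothetical helper, not something PyThaiNLP or pip provides, and the name check is a simplified assumption about what a valid extra name looks like.

```python
import re

# Simplified pattern for an extra name (an assumption, not pip's full grammar).
_EXTRA_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def requirement_with_extras(package: str, extras: list[str]) -> str:
    """Return a pip requirement string such as 'pythainlp[attacut,icu]'."""
    for extra in extras:
        if not _EXTRA_NAME.fullmatch(extra):
            raise ValueError(f"invalid extra name: {extra!r}")
    if not extras:
        return package
    return f"{package}[{','.join(extras)}]"


print(requirement_with_extras("pythainlp", ["attacut", "icu"]))
```

The resulting string can be passed to pip as a single quoted argument, which avoids shells that treat square brackets as glob characters.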

Data Directory

  • Some additional data, like word lists and language models, may be automatically downloaded during runtime.
  • PyThaiNLP caches these data under the directory ~/pythainlp-data by default.
  • The data directory can be changed by specifying the environment variable PYTHAINLP_DATA_DIR.
  • See the data catalog (db.json) at https://github.com/PyThaiNLP/pythainlp-corpus
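
The lookup order described above (environment variable first, then the default under the home directory) can be mirrored in a few lines. This is only a sketch of the documented behaviour, not the library's own implementation; `resolve_data_dir` is a hypothetical name.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where PyThaiNLP data is expected to be cached.

    Hypothetical helper: it follows the rule stated in this README and does
    not call into PyThaiNLP itself.
    """
    env = os.environ if env is None else env
    custom = env.get("PYTHAINLP_DATA_DIR")
    if custom:
        # An explicit setting wins over the default location.
        return Path(custom).expanduser()
    return Path.home() / "pythainlp-data"


print(resolve_data_dir())
```

Passing a mapping instead of reading `os.environ` directly makes the function easy to exercise in tests without touching the real environment.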

Command-Line Interface

Some PyThaiNLP functionality is available from the command line through the thainlp command.

For example, to display a catalog of datasets:

    thainlp data catalog

To show usage help:

    thainlp help

Testing and test suites

We test core functionalities on all officially supported Python versions.

Some functionality requiring extra dependencies may be tested less frequently due to potential version conflicts or incompatibilities between packages.

Test cases are categorized into three groups: core, compact, and extra. You can find these tests in the tests/ directory.

For more detailed information on testing, please refer to the tests README: tests/README.md

Licenses

| | License |
|:---|:----|
| PyThaiNLP source code and notebooks | Apache Software License 2.0 |
| Corpora, datasets, and documentation created by PyThaiNLP | Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0) |
| Language models created by PyThaiNLP | Creative Commons Attribution 4.0 International Public License (CC-by) |
| Other corpora and models that may be included in PyThaiNLP | See Corpus License |

Contribute to PyThaiNLP

  • Please fork and create a pull request :)
  • For style guides and other information, including references to algorithms we use, please refer to our contributing page.

Who uses PyThaiNLP?

See INTHEWILD.md.

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. “Pythainlp: Thai Natural Language Processing in Python”. Zenodo, 2 June 2024. http://doi.org/10.5281/zenodo.3519354.

or by BibTeX entry:

    @software{pythainlp,
        title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
        author = "Phatthiyaphaibun, Wannaphong and Chaovavanich, Korakot and Polpanumas, Charin and Suriyawongkul, Arthit and Lowphansirikul, Lalita and Chormai, Pattarawat",
        doi = {10.5281/zenodo.3519354},
        license = {Apache-2.0},
        month = jun,
        url = {https://github.com/PyThaiNLP/pythainlp/},
        version = {v5.0.4},
        year = {2024},
    }

Our NLP-OSS 2023 paper:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.

and its BibTeX entry:

    @inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
        title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
        author = "Phatthiyaphaibun, Wannaphong and Chaovavanich, Korakot and Polpanumas, Charin and Suriyawongkul, Arthit and Lowphansirikul, Lalita and Chormai, Pattarawat and Limkonchotiwat, Peerat and Suntorntip, Thanathip and Udomcharoenchaikit, Can",
        editor = "Tan, Liling and Milajevs, Dmitrijs and Chauhan, Geeticka and Gwinnup, Jeremy and Rippeth, Elijah",
        booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
        month = dec,
        year = "2023",
        address = "Singapore, Singapore",
        publisher = "Empirical Methods in Natural Language Processing",
        url = "https://aclanthology.org/2023.nlposs-1.4",
        pages = "25--36",
        abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
    }

Sponsors

| Logo | Description |
| --- | ----------- |
| VISTEC-depa Thailand Artificial Intelligence Research Institute | Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have been supported by VISTEC-depa Thailand Artificial Intelligence Research Institute. |
| MacStadium | MacStadium provides us with a free Mac Mini M1 for running CI builds. |


Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭

We have only one official repository, at https://github.com/PyThaiNLP/pythainlp, and one mirror, at https://gitlab.com/pythainlp/pythainlp.
Beware of malware if you use code from any source other than these two official locations on GitHub and GitLab.

Owner

  • Name: PyThaiNLP
  • Login: PyThaiNLP
  • Kind: organization
  • Location: Thailand

We build Thai NLP.

Citation (CITATION.cff)

cff-version: "1.2.0"
title: "PyThaiNLP: Thai Natural Language Processing in Python"
message: >-
  If you use this software, please cite it using these
  metadata.
type: software
authors:
  - family-names: Phatthiyaphaibun
    given-names: Wannaphong
    orcid: "https://orcid.org/0000-0002-4153-4354"
  - family-names: Chaovavanich
    given-names: Korakot
    orcid: "https://orcid.org/0009-0002-7350-9855"
  - family-names: Polpanumas
    given-names: Charin
    orcid: "https://orcid.org/0000-0001-7822-4600"
  - family-names: Suriyawongkul
    given-names: Arthit
    orcid: "https://orcid.org/0000-0002-9698-1899"
  - family-names: Lowphansirikul
    given-names: Lalita
    orcid: "https://orcid.org/0000-0002-5305-2088"
  - family-names: Chormai
    given-names: Pattarawat
    orcid: "https://orcid.org/0000-0002-7582-4667"
identifiers:
  - type: doi
    value: 10.5281/zenodo.3519354
    description: >-
      This is the collection of archived snapshots of all
      versions of PyThaiNLP.
repository-code: "https://github.com/PyThaiNLP/pythainlp/"
repository: "https://github.com/PyThaiNLP/pythainlp/"
url: "https://pythainlp.org/"
abstract: "Thai natural language processing in Python"
keywords:
  - "natural language processing"
  - "Thai"
  - "Python"
  - "text processing"
  - "computational linguistics"
  - "tokenization"
  - "word segmentation"
  - "NLP"
  - "Thai language"
  - "Thai NLP"
license: Apache-2.0
version: 5.1.0
date-released: "2025-02-25"

CodeMeta (codemeta.json)

{
  "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
  "@type": "SoftwareSourceCode",
  "name": "PyThaiNLP",
  "description": "Thai Natural Language Processing in Python",
  "version": "5.1.0",
  "author": [
    {
      "@type": "Person",
      "givenName": "Wannaphong",
      "familyName": "Phatthiyaphaibun",
      "@id": "https://orcid.org/0000-0002-4153-4354"
    },
    {
      "@type": "Person",
      "givenName": "Korakot",
      "familyName": "Chaovavanich",
      "@id": "https://orcid.org/0009-0002-7350-9855"
    },
    {
      "@type": "Person",
      "givenName": "Charin",
      "familyName": "Polpanumas",
      "@id": "https://orcid.org/0000-0001-7822-4600"
    },
    {
      "@type": "Person",
      "givenName": "Arthit",
      "familyName": "Suriyawongkul",
      "@id": "https://orcid.org/0000-0002-9698-1899"
    },
    {
      "@type": "Person",
      "givenName": "Lalita",
      "familyName": "Lowphansirikul",
      "@id": "https://orcid.org/0000-0002-5305-2088"
    },
    {
      "@type": "Person",
      "givenName": "Pattarawat",
      "familyName": "Chormai",
      "@id": "https://orcid.org/0000-0002-7582-4667"
    }
  ],
  "maintainer": [
    {
      "@type": "Person",
      "givenName": "Wannaphong",
      "familyName": "Phatthiyaphaibun",
      "@id": "https://orcid.org/0000-0002-4153-4354"
    },
    {
      "@type": "Person",
      "givenName": "Arthit",
      "familyName": "Suriyawongkul",
      "@id": "https://orcid.org/0000-0002-9698-1899"
    }
  ],
  "license": "https://www.apache.org/licenses/LICENSE-2.0",
  "codeRepository": "https://github.com/PyThaiNLP/pythainlp",
  "issueTracker": "https://github.com/PyThaiNLP/pythainlp/issues",
  "url": "https://pythainlp.org/",
  "keywords": [
    "natural language processing",
    "Thai",
    "Python",
    "text processing",
    "computational linguistics",
    "tokenization",
    "word segmentation",
    "NLP",
    "Thai language",
    "Thai NLP"
  ]
}

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 4,205
  • Total Committers: 69
  • Avg Commits per committer: 60.942
  • Development Distribution Score (DDS): 0.488
Past Year
  • Commits: 405
  • Committers: 10
  • Avg Commits per committer: 40.5
  • Development Distribution Score (DDS): 0.407
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Wannaphong Phatthiyaphaibun | w****g@y****m | 2,153 |
| Arthit Suriyawongkul | a****t@g****m | 1,228 |
| Chakri Lowphansirikul | a****a@C****l | 232 |
| dependabot[bot] | 4****] | 61 |
| konbraphat51 | k****t@g****m | 58 |
| heytitle | p****i@g****m | 47 |
| Pavarissy | p****g@p****m | 28 |
| Saiko Yoneyabashi | a****c@g****m | 28 |
| Charin | c****b@g****m | 27 |
| noppayut | n****t@h****m | 24 |
| Saharsh Jain | 1****8 | 22 |
| orapat Buppodom | n****3@g****m | 21 |
| Pakin Pirch | p****h@g****m | 20 |
| smeeklai | w****s@h****m | 17 |
| c4n | u****n@g****m | 16 |
| Ubuntu | u****u@i****l | 15 |
| petetanru | p****u@g****m | 13 |
| HRNPH | h****h@p****m | 13 |
| Peradon | p****r@h****m | 12 |
| charin | c****n@c****h | 11 |
| seth | s****h@g****m | 10 |
| Abhabongse Janthong | 6****e | 9 |
| root | r****t@D****n | 9 |
| cstorm125 | c****s@d****g | 8 |
| BLKSerene | b****e@g****m | 8 |
| TripleKdev | 8****v | 7 |
| Ubuntu | u****u@i****l | 7 |
| zkan | k****n@p****m | 7 |
| ayaan-qadri | d****n@g****m | 6 |
| vikimark | m****t@g****m | 6 |
and 39 more...
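
The all-time figures above can be reproduced from the listed counts. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits. That definition is an assumption here (the page does not state it), but it is consistent with the reported 0.488.

```python
def average_commits(total_commits: int, committers: int) -> float:
    """Mean number of commits per committer."""
    return total_commits / committers


def development_distribution_score(top_committer_commits: int, total_commits: int) -> float:
    """Assumed definition: 1 - (top committer's commits / total commits)."""
    return 1 - top_committer_commits / total_commits


# All-time numbers taken from the Committers section above.
avg = average_commits(4205, 69)
dds = development_distribution_score(2153, 4205)

print(f"{avg:.3f}")  # 60.942
print(f"{dds:.3f}")  # 0.488
```

A score near 0 would mean a single committer wrote almost everything; values closer to 1 indicate that work is spread across many committers.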

Issues and Pull Requests

Last synced: 4 months ago

All Time
  • Total issues: 155
  • Total pull requests: 427
  • Average time to close issues: 7 months
  • Average time to close pull requests: 4 days
  • Total issue authors: 42
  • Total pull request authors: 23
  • Average comments per issue: 2.5
  • Average comments per pull request: 2.35
  • Merged pull requests: 333
  • Bot issues: 4
  • Bot pull requests: 171
Past Year
  • Issues: 45
  • Pull requests: 303
  • Average time to close issues: 12 days
  • Average time to close pull requests: 1 day
  • Issue authors: 13
  • Pull request authors: 10
  • Average comments per issue: 1.02
  • Average comments per pull request: 1.93
  • Merged pull requests: 218
  • Bot issues: 4
  • Bot pull requests: 162
Top Authors
Issue Authors
  • wannaphong (65)
  • bact (27)
  • pavaris-pm (6)
  • konbraphat51 (5)
  • leky40 (4)
  • dependabot[bot] (4)
  • S2P2 (4)
  • new5558 (3)
  • p16i (2)
  • tonezzz (2)
  • ghost (2)
  • PhakphumV (1)
  • free-bug (1)
  • joshbk1 (1)
  • kaiwa (1)
Pull Request Authors
  • dependabot[bot] (171)
  • wannaphong (123)
  • bact (81)
  • pavaris-pm (7)
  • kangkengkhadev (6)
  • BLKSerene (5)
  • HRNPH (5)
  • konbraphat51 (5)
  • noppayut (4)
  • new5558 (4)
  • WTFPUn (2)
  • varunkatiyar819 (2)
  • allrob23 (2)
  • LXZE (1)
  • c4n (1)
Top Labels
Issue Labels
Hacktoberfest (28) enhancement (24) bug (24) help wanted (14) documentation (13) corpus (11) question (11) refactoring (7) dependencies (7) infrastructure (5) python (4) github_actions (1) news (1)
Pull Request Labels
dependencies (175) python (134) enhancement (41) github_actions (25) bug (23) hacktoberfest-accepted (20) tests (18) infrastructure (17) documentation (15) refactoring (13) corpus (9) stale (7) Hacktoberfest (3)

Packages

  • Total packages: 4
  • Total downloads:
    • pypi 621,206 last-month
  • Total docker downloads: 559
  • Total dependent packages: 37
    (may contain duplicates)
  • Total dependent repositories: 184
    (may contain duplicates)
  • Total versions: 261
  • Total maintainers: 2
pypi.org: pythainlp

Thai Natural Language Processing library

  • Versions: 114
  • Dependent Packages: 37
  • Dependent Repositories: 183
  • Downloads: 621,114 Last month
  • Docker Downloads: 559
Rankings
Dependent packages count: 0.5%
Downloads: 0.6%
Dependent repos count: 1.1%
Average: 1.9%
Stargazers count: 2.1%
Forks count: 3.2%
Docker downloads count: 3.7%
Maintainers (2)
Last synced: 4 months ago
proxy.golang.org: github.com/pythainlp/pythainlp
  • Versions: 72
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 4 months ago
proxy.golang.org: github.com/PyThaiNLP/pythainlp
  • Versions: 72
  • Dependent Packages: 0
  • Dependent Repositories: 0
Rankings
Dependent packages count: 5.4%
Average: 5.6%
Dependent repos count: 5.8%
Last synced: 4 months ago
pypi.org: thainlp

Thai NLP library

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 92 Last month
Rankings
Dependent packages count: 10.0%
Dependent repos count: 21.7%
Average: 22.2%
Downloads: 35.0%
Maintainers (1)
Last synced: 4 months ago

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
.github/workflows/deploy_docs.yml actions
  • actions/checkout v1 composite
  • actions/setup-python v1 composite
  • peaceiris/actions-gh-pages v3 composite
.github/workflows/greetings.yml actions
  • actions/first-interaction v1 composite
.github/workflows/lint.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v4 composite
  • psf/black stable composite
.github/workflows/macos-test.yml actions
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
.github/workflows/pypi-publish.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/pypi-test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/stale.yml actions
  • actions/stale v6 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v1 composite
.github/workflows/windows-test.yml actions
  • actions/checkout v2 composite
  • conda-incubator/setup-miniconda v2 composite
Dockerfile docker
  • python 3.8-slim-buster build
docker_requirements.txt pypi
  • OSKut ==1.3
  • PyYAML ==5.4
  • attacut ==1.0.6
  • bpemb ==0.3.2
  • deepcut ==0.7.0.0
  • emoji ==0.5.2
  • epitran ==1.9
  • esupar ==1.3.8
  • fairseq ==0.10.2
  • fastai ==1.0.61
  • gensim ==4.0.
  • h5py ==3.1.0
  • khanaa ==0.0.6
  • nlpo3 ==1.2.6
  • nltk ==3.6.6
  • numpy ==1.22.
  • pandas ==1.4.
  • phunspell ==0.1.6
  • pyicu ==2.8
  • python-crfsuite ==0.9.7
  • requests ==2.25.
  • sacremoses ==0.0.41
  • sefr_cut ==1.1
  • sentencepiece ==0.1.91
  • spacy ==2.3.
  • spacy_thai ==0.7.1
  • spylls ==0.1.5
  • ssg ==0.0.8
  • symspellpy ==6.7.6
  • tensorflow ==2.9.3
  • thai-nner ==0.3
  • tltk ==1.3.8
  • torch ==1.8.1
  • transformers ==4.22.1
  • ufal.chu-liu-edmonds ==1.0.2
  • wunsen ==0.0.3
requirements.txt pypi
  • PyYAML ==5.4
  • numpy ==1.22.
  • python-crfsuite ==0.9.
  • requests ==2.25.