pinyintokenizer

pinyintokenizer, a pinyin tokenizer that splits continuous pinyin into a list of single-syllable pinyin.

https://github.com/shibing624/pinyin-tokenizer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.7%) to scientific vocabulary

Keywords

nlp pinyin pinyin-analysis pinyin4j tokenizer trie-tree
Last synced: 6 months ago

Repository

pinyintokenizer, a pinyin tokenizer that splits continuous pinyin into a list of single-syllable pinyin.

Basic Info
  • Host: GitHub
  • Owner: shibing624
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 60.5 KB
Statistics
  • Stars: 31
  • Watchers: 2
  • Forks: 2
  • Open Issues: 0
  • Releases: 1
Topics
nlp pinyin pinyin-analysis pinyin4j tokenizer trie-tree
Created about 3 years ago · Last pushed about 1 year ago
Metadata Files
Readme Contributing License Citation

README.md

Pinyin Tokenizer

Pinyin tokenizer (拼音分词器): splits continuous pinyin into a list of single-syllable pinyin; works out of the box. Built with Python 3.

Feature

  • Based on a prefix tree (PyTrie), efficiently splits continuous pinyin into a list of single-syllable pinyin, which simplifies downstream processing such as pinyin-to-Chinese-character conversion.
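
The prefix-tree idea can be illustrated with a toy greedy longest-match segmenter. This is only a sketch, not the library's implementation: the real tokenizer builds a PyTrie over the full pinyin syllable inventory, while the hand-picked `SYLLABLES` set below is an illustrative assumption.

```python
# Toy sketch of longest-match pinyin segmentation.
# SYLLABLES is a small hand-picked subset, for illustration only;
# the real library uses a prefix tree over all valid pinyin syllables.
SYLLABLES = {"wo", "ni", "hao", "liu", "de", "hua", "lv", "you", "xi", "an"}
MAX_LEN = max(len(s) for s in SYLLABLES)


def tokenize(text):
    """Split text into (pinyin_list, invalid_list) by greedy longest match."""
    pinyin, invalid = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + MAX_LEN), i, -1):
            if text[i:j] in SYLLABLES:
                pinyin.append(text[i:j])
                i = j
                break
        else:
            # No syllable matched here; record the character as invalid.
            invalid.append(text[i])
            i += 1
    return pinyin, invalid


print(tokenize("nihao"))     # (['ni', 'hao'], [])
print(tokenize("liudehua"))  # (['liu', 'de', 'hua'], [])
```

A trie makes the candidate lookup a single prefix walk instead of repeated set probes, and the full library also has to handle ambiguous boundaries (e.g. `xian` as `xi`+`an`), which this greedy sketch does not attempt.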

Install

  • Requirements and Installation

pip install pinyintokenizer

or

git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install

Usage

Pinyin Tokenization (拼音切分)

example: examples/pinyin_tokenize_demo.py:

```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # English is not supported
    print(f"{m.tokenize('good luck')}")
```

output:

```shell
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
```

The `tokenize` method returns two values: the first is the list of pinyin syllables, the second is the list of invalid (non-pinyin) characters.

Continuous Pinyin to Chinese Characters (Pinyin2Hanzi)

First use this library (pinyintokenizer) to split the continuous pinyin, then use the Pinyin2Hanzi library to convert the pinyin into Chinese characters.

example: examples/pinyin2hanzi_demo.py:

```python
import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyin_sentence):
    pinyin_list, _ = PinyinTokenizer().tokenize(pinyin_sentence)
    result = dag(dagparams, pinyin_list, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```

output:

```shell
我
今天天气不错
刘德华
```

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: xuming624@qq.com
  • WeChat me: add my WeChat ID xuming624 to join the Python-NLP discussion group; include your name, company, and "NLP" in the request note.

Citation

If you use pinyin-tokenizer in your research, please cite it as follows:

APA:

```latex
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
```

BibTeX:

```latex
@misc{pinyin-tokenizer,
  title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
```

License

Licensed under The Apache License 2.0; free for commercial use. Please include a link to pinyin-tokenizer and the license text in your product documentation.

Contribute

The project code is still rough. If you improve the code, contributions back to this project are welcome. Before submitting, please note these two points:

  • Add corresponding unit tests in tests
  • Run all unit tests with python -m pytest and make sure they all pass

Then you can submit a PR.

Related Projects

Owner

  • Name: xuming
  • Login: shibing624
  • Kind: user
  • Location: Beijing, China
  • Company: @tencent

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
title: "pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP"
url: "https://github.com/shibing624/pinyin-tokenizer"
date-released: 2022-12-26
version: 0.0.1

GitHub Events

Total
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 6
  • Push event: 2
Last Year
  • Issues event: 2
  • Watch event: 4
  • Issue comment event: 6
  • Push event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 7
  • Total Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
shibing624 s****4@1****m 7
Committer Domains (Top 20 + Academic)
126.com: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Total issue authors: 2
  • Total pull request authors: 0
  • Average comments per issue: 4.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 5 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 5.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • Vange95 (1)
  • demoliisher (1)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,035 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 3
  • Total maintainers: 1
pypi.org: pinyintokenizer

Pinyin Tokenizer, Chinese pinyin tokenizer

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 1,035 Last month
Rankings
Dependent packages count: 6.6%
Stargazers count: 21.8%
Forks count: 23.2%
Average: 26.4%
Dependent repos count: 30.6%
Downloads: 49.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • six *
setup.py pypi
  • six *