pinyintokenizer

pinyintokenizer, a Pinyin tokenizer that splits a continuous Pinyin string into a list of single-character Pinyin syllables.

https://github.com/shibing624/pinyin-tokenizer

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓
CITATION.cff file
Found CITATION.cff file
✓
codemeta.json file
Found codemeta.json file
✓
.zenodo.json file
Found .zenodo.json file
○
DOI references
○
Academic publication links
○
Committers with academic emails
○
Institutional organization owner
○
JOSS paper metadata
○
Scientific vocabulary similarity
Low similarity (7.7%) to scientific vocabulary

Keywords

nlp pinyin pinyin-analysis pinyin4j tokenizer trie-tree

Last synced: 10 months ago

Repository

pinyintokenizer, a Pinyin tokenizer that splits a continuous Pinyin string into a list of single-character Pinyin syllables.

Basic Info

Host: GitHub
Owner: shibing624
License: apache-2.0
Language: Python
Default Branch: main
Homepage:
Size: 60.5 KB

Statistics

Stars: 31
Watchers: 2
Forks: 2
Open Issues: 0
Releases: 1

Topics

nlp pinyin pinyin-analysis pinyin4j tokenizer trie-tree

Created over 3 years ago · Last pushed over 1 year ago

Metadata Files

Readme Contributing License Citation

Pinyin Tokenizer

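Pinyin Tokenizer splits a continuous Pinyin string into a list of single-character Pinyin syllables. It works out of the box and is developed for Python 3.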

Guide

Feature
Install
Usage
Contact
Citation
Related-Projects

Feature

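Built on a prefix tree (PyTrie), it quickly and efficiently splits continuous Pinyin into a list of single-character Pinyin syllables, which simplifies downstream steps such as converting Pinyin to Chinese characters.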
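To make the prefix-tree idea concrete, here is a minimal sketch of trie-based segmentation. It is not the library's implementation: the greedy longest-match strategy, the names `build_trie`, `longest_match`, and `segment`, and the tiny `SYLLABLES` list are all assumptions chosen for illustration.

```python
# Illustrative sketch only, not the pinyintokenizer source.
# SYLLABLES is a small hypothetical subset of valid Pinyin syllables.
SYLLABLES = ["wo", "ni", "hao", "liu", "de", "hua", "xi", "an", "xian", "men"]
END = "<end>"  # multi-character marker, so it can never collide with a single input character


def build_trie(words):
    """Build a nested-dict trie; END marks the end of a complete syllable."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_match(trie, text, start):
    """Return the length of the longest syllable starting at text[start], or 0."""
    node, best = trie, 0
    for offset, ch in enumerate(text[start:], start=1):
        if ch not in node:
            break
        node = node[ch]
        if END in node:
            best = offset
    return best


def segment(text, trie):
    """Split text into (syllables, leftovers) using greedy longest match."""
    syllables, leftovers = [], []
    i = 0
    while i < len(text):
        length = longest_match(trie, text, i)
        if length:
            syllables.append(text[i:i + length])
            i += length
        else:
            # No syllable starts here: record the character and move on.
            leftovers.append(text[i])
            i += 1
    return syllables, leftovers


TRIE = build_trie(SYLLABLES)
print(segment("nihao", TRIE))  # (['ni', 'hao'], [])
print(segment("wo3", TRIE))    # (['wo'], ['3'])
```

Greedy longest match is only one possible policy. With this sketch, `xian` comes back as the single syllable `xian` rather than `xi` + `an`, so ambiguous input may need a separator or a different disambiguation rule.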

Install

Requirements and Installation

pip install pinyintokenizer

or

git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install

Usage

Pinyin Segmentation (Pinyin Tokenizer)

Example: examples/pinyintokenizedemo.py:

```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游 (travel)
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华 (Andy Lau)
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略 (let's make a travel guide)
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学 (Xi'an Jiaotong University)

    # English is not supported
    print(f"{m.tokenize('good luck')}")
```

output:

(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])

- The tokenize method returns two results: the first is the list of Pinyin syllables, and the second is the list of invalid (non-Pinyin) characters.
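Because the second list mixes tone digits, spaces, and genuinely unrecognized letters, it can help to sort it before deciding whether the input was really Pinyin. The sketch below works on result pairs copied from the output above rather than calling the library. The helper names `classify_leftovers` and `looks_like_pinyin` are hypothetical, and treating digits and whitespace as harmless is an assumption.

```python
# Sample (pinyin_list, invalid_list) pairs copied from the documented output.
SAMPLES = {
    "lv3you2": (["lv", "you"], ["3", "2"]),
    "liu de hua": (["liu", "de", "hua"], [" ", " "]),
    "good luck": (["o", "o", "lu"], ["g", "d", " ", "c", "k"]),
}


def classify_leftovers(invalid):
    """Group rejected characters into tone digits, whitespace, and everything else."""
    groups = {"tones": [], "spaces": [], "other": []}
    for ch in invalid:
        if ch.isdigit():
            groups["tones"].append(ch)
        elif ch.isspace():
            groups["spaces"].append(ch)
        else:
            groups["other"].append(ch)
    return groups


def looks_like_pinyin(result):
    """Assume the input was Pinyin if only tone digits or spaces were rejected."""
    _, invalid = result
    return not classify_leftovers(invalid)["other"]


for text, result in SAMPLES.items():
    print(text, looks_like_pinyin(result))
```

With these samples, only `good luck` is flagged, since its leftover list contains letters that are neither tone digits nor whitespace.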

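Continuous Pinyin to Chinese Characters (Pinyin2Hanzi)

First split the continuous Pinyin with this library (pinyintokenizer), then convert the resulting syllables to Chinese characters with the Pinyin2Hanzi library.

Example: examples/pinyin2hanzi_demo.py: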

```python
import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyinsentence):
    # Split the raw string into syllables, discarding the invalid characters.
    pinyinlist, _ = PinyinTokenizer().tokenize(pinyinsentence)
    # Keep only the single best-scoring path from the DAG decoder.
    result = dag(dagparams, pinyinlist, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```

output:

我
今天天气不错
刘德华

Contact

Issues (suggestions):
Email: xuming: xuming624@qq.com
WeChat: add my WeChat ID xuming624 to join the Python-NLP discussion group, with the note: name-company-NLP

Citation

If you use pinyin-tokenizer in your research, please cite it in the following format:

APA:
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer

BibTeX:
@misc{pinyin-tokenizer,
  title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
  author={Xu Ming},
  year={2022},
  howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}

License

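The project is licensed under the Apache License 2.0, which permits free commercial use. Please include a link to pinyin-tokenizer and the license in your product documentation.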

Contribute

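The project code is still rough. If you have improved it, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:

Add corresponding unit tests under tests
Run all unit tests with python -m pytest and make sure every test passes

After that, you can submit a PR.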

Related Projects

Chinese characters to Pinyin: pypinyin
Pinyin to Chinese characters: Pinyin2Hanzi

Owner

Name: xuming
Login: shibing624
Kind: user
Location: Beijing, China
Company: @tencent

Website: https://blog.csdn.net/mingzai624
Repositories: 32
Profile: https://github.com/shibing624

Senior Researcher, Machine Learning Developer, Advertising Risk Control.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
title: "pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP"
url: "https://github.com/shibing624/pinyin-tokenizer"
date-released: 2022-12-26
version: 0.0.1

GitHub Events

Total

Issues event: 2
Watch event: 4
Issue comment event: 6
Push event: 2

Last Year

Issues event: 2
Watch event: 4
Issue comment event: 6
Push event: 2

Committers

Last synced: over 3 years ago

All Time

Total Commits: 7
Total Committers: 1
Avg Commits per committer: 7.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
shibing624	s**4@1**m	7

Committer Domains (Top 20 + Academic)

126.com: 1

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 2
Total pull requests: 0
Average time to close issues: 2 days
Average time to close pull requests: N/A
Total issue authors: 2
Total pull request authors: 0
Average comments per issue: 4.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: 5 days
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 5.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0


Top Authors

Issue Authors

Vange95 (1)
demoliisher (1)

Pull Request Authors

Top Labels

Issue Labels

Pull Request Labels

Packages

Total packages: 1
Total downloads:
- pypi 1,035 last-month

Total dependent packages: 0
Total dependent repositories: 0
Total versions: 3
Total maintainers: 1

pypi.org: pinyintokenizer

Pinyin Tokenizer, chinese pinyin tokenizer

Homepage: https://github.com/shibing624/pinyin-tokenizer
Documentation: https://pinyintokenizer.readthedocs.io/
License: Apache 2.0
Latest release: 0.0.3
published over 1 year ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 0
Downloads: 1,035 Last month

Rankings

Dependent packages count: 6.6%

Stargazers count: 21.8%

Forks count: 23.2%

Average: 26.4%

Dependent repos count: 30.6%

Downloads: 49.9%

Maintainers (1)

shibing624

Last synced: 11 months ago

Dependencies

requirements.txt pypi

six *

setup.py pypi

six *
