pinyintokenizer
pinyintokenizer is a pinyin tokenizer that splits a continuous pinyin string into a list of single-character pinyin syllables.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (7.7%) to scientific vocabulary
Repository
Statistics
- Stars: 31
- Watchers: 2
- Forks: 2
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Pinyin Tokenizer
Pinyin Tokenizer splits a continuous pinyin string into a list of single-character pinyin syllables. It works out of the box and is written in Python 3.
Feature
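- Uses a prefix tree (PyTrie) to split continuous pinyin quickly and efficiently into a list of single-character syllables, ready for downstream steps such as pinyin-to-Hanzi conversion.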
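The prefix-tree idea behind this feature can be sketched as follows. This is a minimal illustration, not the library's implementation: the `TrieNode`, `build_trie`, and `tokenize` names, the small `SYLLABLES` subset, and the greedy longest-match strategy are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Illustrative subset only; a real tokenizer would load the full syllable table.
SYLLABLES = ["wo", "ni", "hao", "liu", "de", "hua",
             "xi", "an", "xian", "lv", "you", "o", "lu"]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_end = False  # True when the path from the root spells a syllable


def build_trie(words: List[str]) -> TrieNode:
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


def tokenize(text: str, root: TrieNode) -> Tuple[List[str], List[str]]:
    """Greedy longest-match split (assumed strategy for this sketch)."""
    syllables: List[str] = []
    invalid: List[str] = []
    i = 0
    while i < len(text):
        node, j, longest = root, i, 0
        # Walk the trie as far as the input allows, remembering the
        # longest prefix that ends on a complete syllable.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.is_end:
                longest = j - i
        if longest:
            syllables.append(text[i:i + longest])
            i += longest
        else:
            # No syllable starts here: record the character as invalid.
            invalid.append(text[i])
            i += 1
    return syllables, invalid


if __name__ == "__main__":
    trie = build_trie(SYLLABLES)
    print(tokenize("nihao", trie))
    print(tokenize("good luck", trie))
```

Under greedy matching, `xian` is kept as one syllable, while `xi an` splits into `xi` and `an` with the space reported as an invalid character. A tokenizer that needs to resolve such ambiguities differently would require extra logic beyond this sketch.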
Install
- Requirements and Installation
pip install pinyintokenizer
or
git clone https://github.com/shibing624/pinyin-tokenizer.git
cd pinyin-tokenizer
python setup.py install
Usage
Pinyin segmentation (Pinyin Tokenizer)
example: examples/pinyin_tokenize_demo.py:
```python
import sys

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

if __name__ == '__main__':
    m = PinyinTokenizer()
    print(f"{m.tokenize('wo3')}")
    print(f"{m.tokenize('nihao')}")
    print(f"{m.tokenize('lv3you2')}")  # 旅游
    print(f"{m.tokenize('liudehua')}")
    print(f"{m.tokenize('liu de hua')}")  # 刘德华
    print(f"{m.tokenize('womenzuogelvyougongnue')}")  # 我们做个旅游攻略
    print(f"{m.tokenize('xi anjiaotongdaxue')}")  # 西安交通大学

    # English input is not supported
    print(f"{m.tokenize('good luck')}")
```
output:
(['wo'], ['3'])
(['ni', 'hao'], [])
(['lv', 'you'], ['3', '2'])
(['liu', 'de', 'hua'], [])
(['liu', 'de', 'hua'], [' ', ' '])
(['wo', 'men', 'zuo', 'ge', 'lv', 'you', 'gong', 'nue'], [])
(['xi', 'an', 'jiao', 'tong', 'da', 'xue'], [' '])
(['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])
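- The `tokenize` method returns two results: the first is the list of pinyin syllables, the second is the list of invalid (non-pinyin) characters.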
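One way to consume the two-part return value is sketched below. The `summarize` helper is hypothetical and not part of pinyintokenizer, and treating tone digits and whitespace as harmless leftovers is an assumption of this example. The sample tuples are copied from the output above, so the snippet runs without the library installed.

```python
from typing import Dict, List, Tuple


def summarize(result: Tuple[List[str], List[str]]) -> Dict[str, object]:
    """Split a tokenize() result into syllables and unexpected leftovers."""
    syllables, invalid = result
    # Assumption: tone digits and spaces are acceptable; anything else is unexpected.
    unexpected = [ch for ch in invalid if not (ch.isspace() or ch.isdigit())]
    return {
        "syllables": syllables,
        "unexpected": unexpected,
        "clean": not unexpected,
    }


if __name__ == "__main__":
    # Tuples copied from the demo output, not produced by calling the library here.
    print(summarize((['lv', 'you'], ['3', '2'])))
    print(summarize((['o', 'o', 'lu'], ['g', 'd', ' ', 'c', 'k'])))
```

A check like this could be used to reject English or otherwise malformed input before passing the syllables to a later step.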
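Continuous pinyin to Hanzi (Pinyin2Hanzi)
First use this library (pinyintokenizer) to split the continuous pinyin, then use the Pinyin2Hanzi library to convert the syllables to Hanzi.
example: examples/pinyin2hanzi_demo.py: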
```python
import sys

from Pinyin2Hanzi import DefaultDagParams
from Pinyin2Hanzi import dag

sys.path.append('..')
from pinyintokenizer import PinyinTokenizer

dagparams = DefaultDagParams()


def pinyin2hanzi(pinyinsentence):
    # Keep only the valid syllables; the list of invalid characters is discarded
    pinyinlist, _ = PinyinTokenizer().tokenize(pinyinsentence)
    result = dag(dagparams, pinyinlist, path_num=1)
    return ''.join(result[0].path)


if __name__ == '__main__':
    print(f"{pinyin2hanzi('wo3')}")
    print(f"{pinyin2hanzi('jintianxtianqibucuo')}")
    print(f"{pinyin2hanzi('liudehua')}")
```
output:
我
今天天气不错
刘德华
Citation
If you use pinyin-tokenizer in your research, please cite it in the following format:
APA:
Xu, M. pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP (Version 0.0.1) [Computer software]. https://github.com/shibing624/pinyin-tokenizer
BibTeX:
@misc{pinyin-tokenizer,
title={pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP},
author={Xu Ming},
year={2022},
howpublished={\url{https://github.com/shibing624/pinyin-tokenizer}},
}
License
Licensed under the Apache License 2.0, which permits free commercial use. Please include a link to pinyin-tokenizer and its license in your product documentation.
Contribute
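The project code is still rough. If you improve it, you are welcome to contribute your changes back to this project. Before submitting, note the following two points:
- Add corresponding unit tests under `tests`
- Run all unit tests with `python -m pytest` and make sure every test passes
After that, you can submit a PR.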
Related Projects
- Hanzi to pinyin: pypinyin
- Pinyin to Hanzi: Pinyin2Hanzi
Owner
- Name: xuming
- Login: shibing624
- Kind: user
- Location: Beijing, China
- Company: @tencent
- Website: https://blog.csdn.net/mingzai624
- Repositories: 32
- Profile: https://github.com/shibing624
Senior Researcher, Machine Learning Developer, Advertising Risk Control.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Xu"
  given-names: "Ming"
title: "pinyin-tokenizer: Chinese Pinyin tokenizer toolkit for NLP"
url: "https://github.com/shibing624/pinyin-tokenizer"
data-released: 2022-12-26
version: 0.0.1
GitHub Events
Total
- Issues event: 2
- Watch event: 4
- Issue comment event: 6
- Push event: 2
Last Year
- Issues event: 2
- Watch event: 4
- Issue comment event: 6
- Push event: 2
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 7
- Total Committers: 1
- Avg Commits per committer: 7.0
- Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| shibing624 | s****4@1****m | 7 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: 2 days
- Average time to close pull requests: N/A
- Total issue authors: 2
- Total pull request authors: 0
- Average comments per issue: 4.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: 5 days
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 5.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- Vange95 (1)
- demoliisher (1)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 1,035 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 3
- Total maintainers: 1
pypi.org: pinyintokenizer
Pinyin Tokenizer, chinese pinyin tokenizer
- Homepage: https://github.com/shibing624/pinyin-tokenizer
- Documentation: https://pinyintokenizer.readthedocs.io/
- License: Apache 2.0
- Latest release: 0.0.3 (published about 1 year ago)
Dependencies
- six *
