Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (5.7%) to scientific vocabulary
Repository
A small package to fuzzy match Chinese words
Basic Info
- Host: GitHub
- Owner: znwang25
- License: bsd-3-clause
- Language: Python
- Default Branch: master
- Homepage: https://fuzzychinese.zenan-wang.com
- Size: 1.81 MB
Statistics
- Stars: 89
- Watchers: 1
- Forks: 10
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
fuzzychinese
Fuzzy matching for visually similar Chinese words
A simple tool to fuzzy match Chinese words, particularly useful for matching proper names and addresses.
Installation
pip install fuzzychinese
Quickstart
First, train a model on the target list of words you want to match against.
Then use `FuzzyChineseMatch.transform(raw_words, n)` to find the top `n` most similar target words for each word in `raw_words`.
Three analyzers are available when training a model: `stroke`, `radical`, and `char`. You can also adjust `ngram_range` to fine-tune the model.
After matching, the similarity scores, the matched words, and their indices in the target list are returned.
```python
import pandas as pd
from fuzzychinese import FuzzyChineseMatch

test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市'])
raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
assert '汩罗市' != '汨罗市'  # They are not the same!

# Train on the target words using trigrams of strokes
fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
fcm.fit(test_dict)
# Find the two most similar target words for each raw word
top2_similar = fcm.transform(raw_word, n=2)
# Put the raw words, matches, scores, and target indices side by side
res = pd.concat([
    raw_word,
    pd.DataFrame(top2_similar, columns=['top1', 'top2']),
    pd.DataFrame(
        fcm.get_similarity_score(),
        columns=['top1_score', 'top2_score']),
    pd.DataFrame(
        fcm.get_index(),
        columns=['top1_index', 'top2_index'])],
    axis=1)
```
| | top1 | top2 | top1_score | top2_score | top1_index | top2_index |
| ---------- | ------------------ | ---------------- | ---------- | ---------- | ---------- | ---------- |
| 达茂联合旗 | 达尔罕茂明安联合旗 | 长白朝鲜族自治县 | 0.824751 | 0.287237 | 3 | 0 |
| 长阳县 | 长阳土家族自治县 | 长白朝鲜族自治县 | 0.610285 | 0.475000 | 1 | 0 |
| 汩罗市 | 汨罗市 | 长白朝鲜族自治县 | 1.000000 | 0.152093 | 4 | 0 |
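To make `ngram_range` and the ranked output more concrete, here is a minimal pure-Python sketch of n-gram matching. It is an illustration under assumptions, not the implementation used by fuzzychinese: the helper names `ngrams`, `cosine`, and `top_n` are hypothetical, the ranking works on characters rather than strokes, and raw n-gram counts with cosine similarity stand in for whatever weighting the library applies.

```python
from collections import Counter
from math import sqrt


def ngrams(seq, ngram_range=(3, 3)):
    """Return all contiguous n-grams of `seq` for n in [n_min, n_max]."""
    n_min, n_max = ngram_range
    return [
        seq[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(seq) - n + 1)
    ]


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def top_n(raw_word, targets, n=2, ngram_range=(1, 2)):
    """Rank `targets` by similarity to `raw_word`; return (word, score, index) triples."""
    query = Counter(ngrams(raw_word, ngram_range))
    scored = [
        (word, cosine(query, Counter(ngrams(word, ngram_range))), index)
        for index, word in enumerate(targets)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:n]


targets = ['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县', '达尔罕茂明安联合旗', '汨罗市']

# Trigrams of the stroke sequence that the README shows for 像
stroke_trigrams = ngrams('㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏', (3, 3))

# Character-level stand-in for fit + transform(n=2) on one raw word
best = top_n('长阳县', targets, n=2)
for word, score, index in best:
    print(word, round(score, 3), index)
```

With this character-level stand-in, `top_n('长阳县', targets)` ranks 长阳土家族自治县 (index 1) first and 长白朝鲜族自治县 (index 0) second. The scores differ from those in the table above because no stroke decomposition is involved.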
Other use
- Directly use `Stroke` and `Radical` to decompose a Chinese character into strokes or radicals:

      from fuzzychinese import Stroke, Radical

      stroke = Stroke()
      radical = Radical()
      print("像", stroke.get_stroke("像"))
      print("像", radical.get_radical("像"))

  Output:

      像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
      像 人象

- Use `FuzzyChineseMatch.compare_two_columns(X, Y)` to compare the pair of words in each row and get a similarity score.

See the documentation for details.
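The shape of a row-wise comparison can be pictured with a stand-in from the Python standard library. This sketch does not call fuzzychinese and makes no claim about what `compare_two_columns` returns: `compare_rows` is a hypothetical helper, and `difflib.SequenceMatcher` is a plain character-level matcher swapped in for illustration.

```python
from difflib import SequenceMatcher


def compare_rows(left, right):
    """Return one similarity score in [0, 1] per (left[i], right[i]) pair."""
    if len(left) != len(right):
        raise ValueError("both columns must have the same number of rows")
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(left, right)]


X = ['达茂联合旗', '长阳县', '汩罗市']
Y = ['达尔罕茂明安联合旗', '长阳土家族自治县', '汨罗市']

# One score per row: X[i] is compared only with Y[i]
for a, b, score in zip(X, Y, compare_rows(X, Y)):
    print(a, b, round(score, 3))
```

Because this stand-in only sees whole characters, it scores 汩罗市 against 汨罗市 below 1: 汩 and 汨 are different code points. The stroke-based example in the Quickstart reports a score of 1.0 for the same pair, which is the kind of near-identical shape that character-level matching misses.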
Credits
Owner
- Login: znwang25
- Kind: user
- Location: San Francisco
- Repositories: 3
- Profile: https://github.com/znwang25
Citation (CITATION.cff)
cff-version: 1.2.0
title: FuzzyChinese
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Zenan
    family-names: Wang
    orcid: 'https://orcid.org/0000-0001-6337-6548'
repository-code: 'https://github.com/znwang25/fuzzychinese'
abstract: >-
  A simple tool to fuzzy match Chinese words, particular
  useful for proper name matching and address matching.
keywords:
  - text-processing
  - chinese
  - fuzzy-matching
  - nlp
license: BSD-3-Clause
version: '0.1.5 '
date-released: '2019-04-29'
GitHub Events
Total
- Watch event: 9
Last Year
- Watch event: 9
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 17
- Total Committers: 1
- Avg Commits per committer: 17.0
- Development Distribution Score (DDS): 0.0
Top Committers
| Name | Email | Commits |
|---|---|---|
| znwang25 | z****5@g****m | 17 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 5
- Total pull requests: 3
- Average time to close issues: 8 months
- Average time to close pull requests: 2 minutes
- Total issue authors: 5
- Total pull request authors: 1
- Average comments per issue: 1.0
- Average comments per pull request: 0.0
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- lingvisa (1)
- marcusau (1)
- ahbon123 (1)
- znwang25 (1)
- Veekshit (1)
Pull Request Authors
- znwang25 (3)
Packages
- Total packages: 1
- Total downloads:
  - pypi: 759 last month
- Total dependent packages: 0
- Total dependent repositories: 2
- Total versions: 3
- Total maintainers: 1
pypi.org: fuzzychinese
A small package to fuzzy match Chinese words
- Homepage: https://github.com/znwang25/fuzzychinese
- Documentation: https://fuzzychinese.readthedocs.io/
- License: BSD License
- Latest release: 0.1.5 (published almost 7 years ago)
Maintainers (1)
Dependencies
- numpy *
- pandas *
- scikit-learn *