fuzzychinese

A small package to fuzzy match Chinese words

https://github.com/znwang25/fuzzychinese

Keywords

chinese fuzzy-matching natural-language python radicals strokes text-processing

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: znwang25
License: bsd-3-clause
Language: Python
Default Branch: master
Homepage: https://fuzzychinese.zenan-wang.com
Size: 1.81 MB

Statistics

Stars: 89
Watchers: 1
Forks: 10
Open Issues: 1
Releases: 0

Topics

chinese fuzzy-matching natural-language python radicals strokes text-processing

Created over 7 years ago · Last pushed over 3 years ago

Metadata Files

Readme License Citation

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool to fuzzy match Chinese words, particularly useful for matching proper names and addresses.

Installation

pip install fuzzychinese

Quickstart

First, train a model on the target list of words you want to match against.

Then call FuzzyChineseMatch.transform(raw_words, n) to find the n most similar target words for each entry in raw_words.

Three analyzers are available when training a model: stroke, radical, and char. You can also adjust ngram_range to fine-tune the model.
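To make the role of ngram_range more concrete, here is a minimal sketch of how a stroke sequence could be cut into n-grams. It is not the library's code: the helper name `ngrams` is hypothetical, and the assumption is that ngram_range is an inclusive (min_n, max_n) pair, as in scikit-learn vectorizers. The stroke string is the decomposition of 像 printed in this README.

```python
def ngrams(sequence, ngram_range=(3, 3)):
    """Return every contiguous n-gram of `sequence` for each n in the inclusive range.

    Hypothetical helper for illustration only; not part of fuzzychinese.
    """
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the sequence.
        for start in range(len(sequence) - n + 1):
            grams.append(sequence[start:start + n])
    return grams


# Stroke sequence for 像, copied from the Stroke example in this README.
strokes = "㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏"

trigrams = ngrams(strokes, ngram_range=(3, 3))
print(len(strokes), "strokes give", len(trigrams), "trigrams")
print(trigrams[:3])
```

A wider range such as (2, 4) would produce more, partly overlapping features per word, which is presumably the kind of trade-off that adjusting ngram_range lets you explore.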

After matching, the matched words, their similarity scores, and their indices in the target list are available.

```python
import pandas as pd
from fuzzychinese import FuzzyChineseMatch

# Target words to match against
test_dict = pd.Series(['长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县',
                       '达尔罕茂明安联合旗', '汨罗市'])
# Raw words to look up
raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
assert '汩罗市' != '汨罗市'  # They are not the same!

fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
fcm.fit(test_dict)
top2_similar = fcm.transform(raw_word, n=2)
# Combine the raw words, their top-2 matches, the scores, and the target indices
res = pd.concat([
    raw_word,
    pd.DataFrame(top2_similar, columns=['top1', 'top2']),
    pd.DataFrame(
        fcm.get_similarity_score(),
        columns=['top1_score', 'top2_score']),
    pd.DataFrame(
        fcm.get_index(),
        columns=['top1_index', 'top2_index'])],
    axis=1)
```

| | top1 | top2 | top1_score | top2_score | top1_index | top2_index |
| ---------- | ------------------ | ---------------- | ---------- | ---------- | ---------- | ---------- |
| 达茂联合旗 | 达尔罕茂明安联合旗 | 长白朝鲜族自治县 | 0.824751 | 0.287237 | 3 | 0 |
| 长阳县 | 长阳土家族自治县 | 长白朝鲜族自治县 | 0.610285 | 0.475000 | 1 | 0 |
| 汩罗市 | 汨罗市 | 长白朝鲜族自治县 | 1.000000 | 0.152093 | 4 | 0 |
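One way the scores might be used downstream is to accept confident matches automatically and send the rest for manual review. The sketch below works on the top1 values copied from the result table above; the helper `split_by_threshold` and the cutoff of 0.8 are hypothetical choices for illustration, not recommendations from the package.

```python
# (raw word, best match, best score) copied from the result table above.
results = [
    ("达茂联合旗", "达尔罕茂明安联合旗", 0.824751),
    ("长阳县", "长阳土家族自治县", 0.610285),
    ("汩罗市", "汨罗市", 1.000000),
]


def split_by_threshold(rows, threshold):
    """Split rows into (accepted, needs_review) by their similarity score.

    Hypothetical helper; the threshold is an assumption to tune per dataset.
    """
    accepted = [row for row in rows if row[2] >= threshold]
    needs_review = [row for row in rows if row[2] < threshold]
    return accepted, needs_review


accepted, needs_review = split_by_threshold(results, threshold=0.8)
print("accepted:", [raw for raw, _, _ in accepted])
print("needs review:", [raw for raw, _, _ in needs_review])
```

With this cutoff, 长阳县 lands in the review list even though its top match is correct, which suggests that any threshold should be checked against real data before it is relied on.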

Other use

Use Stroke and Radical directly to decompose Chinese characters into strokes or radicals.

```python
from fuzzychinese import Stroke, Radical

stroke = Stroke()
radical = Radical()
print("像", stroke.get_stroke("像"))
print("像", radical.get_radical("像"))
```

```
像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
像 人象
```

Use FuzzyChineseMatch.compare_two_columns(X, Y) to compare the pair of words in each row and get a similarity score for each pair.

See the documentation for details.
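For intuition about what a row-wise similarity score can look like, here is a standard-library sketch that scores each pair of words by the cosine similarity of their character bigram counts. This is an illustration under stated assumptions, not the package's implementation: the function names are hypothetical, and the README does not specify how fuzzychinese computes its scores.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Contiguous character n-grams; a word shorter than n is kept whole."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def cosine_similarity(a, b, n=2):
    """Cosine similarity between the n-gram count vectors of two words."""
    va, vb = Counter(char_ngrams(a, n)), Counter(char_ngrams(b, n))
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def compare_pairs(xs, ys, n=2):
    """Score the pair of words found on each row of two equal-length columns."""
    return [cosine_similarity(x, y, n) for x, y in zip(xs, ys)]


left = ["长阳县", "汩罗市", "汨罗市"]
right = ["长阳土家族自治县", "汨罗市", "汨罗市"]
for x, y, score in zip(left, right, compare_pairs(left, right)):
    print(x, y, round(score, 3))
```

Note that this character-level sketch scores 汩罗市 against 汨罗市 at only about 0.5, because 汩 and 汨 are different characters. The stroke analyzer in the quickstart table gives that pair 1.0, which shows why decomposing characters helps with visually similar words.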

Credits

The data for Chinese radicals comes from 漢語拆字字典 (a Chinese character decomposition dictionary) by 開放詞典網.

Owner

Login: znwang25
Kind: user
Location: San Francisco

Repositories: 3
Profile: https://github.com/znwang25

Citation (CITATION.cff)

cff-version: 1.2.0
title: FuzzyChinese
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Zenan
    family-names: Wang
    orcid: 'https://orcid.org/0000-0001-6337-6548'
repository-code: 'https://github.com/znwang25/fuzzychinese'
abstract: >-
  A simple tool to fuzzy match Chinese words, particular
  useful for proper name matching and address matching.
keywords:
  - text-processing
  - chinese
  - fuzzy-matching
  - nlp
license: BSD-3-Clause
version: '0.1.5 '
date-released: '2019-04-29'

GitHub Events

Total

Watch event: 9

Last Year

Watch event: 9

Committers

Last synced: over 3 years ago

All Time

Total Commits: 17
Total Committers: 1
Avg Commits per committer: 17.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
znwang25	z**5@g**m	17

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 5
Total pull requests: 3
Average time to close issues: 8 months
Average time to close pull requests: 2 minutes
Total issue authors: 5
Total pull request authors: 1
Average comments per issue: 1.0
Average comments per pull request: 0.0
Merged pull requests: 3
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

lingvisa (1)
marcusau (1)
ahbon123 (1)
znwang25 (1)
Veekshit (1)

Pull Request Authors

znwang25 (3)

Top Labels

Issue Labels

bug (1)

Pull Request Labels

enhancement (3)

Packages

Total packages: 1
Total downloads:
- pypi 759 last-month

Total dependent packages: 0
Total dependent repositories: 2
Total versions: 3
Total maintainers: 1

pypi.org: fuzzychinese

A small package to fuzzy match Chinese words

Homepage: https://github.com/znwang25/fuzzychinese
Documentation: https://fuzzychinese.readthedocs.io/
License: BSD License
Latest release: 0.1.5
published about 7 years ago

Versions: 3
Dependent Packages: 0
Dependent Repositories: 2
Downloads: 759 Last month

Rankings

Downloads: 2.9%

Stargazers count: 8.7%

Average: 9.0%

Dependent packages count: 10.1%

Dependent repos count: 11.6%

Forks count: 11.9%

Maintainers (1)

znwang25

Last synced: 11 months ago

fuzzychinese

Science Score: 44.0%
