fuzzychinese

A small package to fuzzy match chinese words

https://github.com/znwang25/fuzzychinese

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (5.7%) to scientific vocabulary

Keywords

chinese fuzzy-matching natural-language python radicals strokes text-processing
Last synced: 6 months ago

Repository

A small package to fuzzy match chinese words

Basic Info
Statistics
  • Stars: 89
  • Watchers: 1
  • Forks: 10
  • Open Issues: 1
  • Releases: 0
Topics
chinese fuzzy-matching natural-language python radicals strokes text-processing
Created almost 7 years ago · Last pushed almost 3 years ago
Metadata Files
Readme License Citation

README.md

fuzzychinese

Fuzzy matching for visually similar Chinese words

A simple tool for fuzzy matching Chinese words, particularly useful for matching proper names and addresses.

Installation

pip install fuzzychinese

Quickstart

First, train a model on the target list of words you want to match against.

Then call FuzzyChineseMatch.transform(raw_words, n) to find the n most similar words in the target list for each entry in raw_words.

There are three analyzers to choose from when training a model: stroke, radical, and char. You can also change ngram_range to fine-tune the model.
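To make the roles of the analyzer and ngram_range more concrete, here is a minimal standard-library sketch of n-gram based similarity. It is an illustration only and not the package's implementation: the helper names `ngrams` and `cosine_similarity` are hypothetical, the vectors hold raw counts with no weighting, and the second part works on characters (similar in spirit to the char analyzer) rather than on strokes or radicals.

```python
from collections import Counter
from math import sqrt


def ngrams(sequence, ngram_range=(3, 3)):
    """Return all contiguous n-grams of `sequence` for n in [low, high]."""
    low, high = ngram_range
    return [
        sequence[i:i + n]
        for n in range(low, high + 1)
        for i in range(len(sequence) - n + 1)
    ]


def cosine_similarity(a, b, ngram_range=(1, 2)):
    """Cosine similarity between raw n-gram count vectors of two strings."""
    va = Counter(ngrams(a, ngram_range))
    vb = Counter(ngrams(b, ngram_range))
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


# Stroke sequence of "像" as quoted in this README; with ngram_range=(3, 3)
# every run of three consecutive strokes becomes one feature.
strokes = "㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏"
print(ngrams(strokes, (3, 3))[:3])

# Character-level comparison: the shared prefix and suffix raise the score.
print(cosine_similarity("长阳县", "长阳土家族自治县"))
print(cosine_similarity("长阳县", "城步苗族自治县"))
```

A wider ngram_range produces more features per word, which makes matching more sensitive to local order at the cost of larger vectors.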

After matching, the matched words, their similarity scores, and their indices in the target list are available.

```python
import pandas as pd
from fuzzychinese import FuzzyChineseMatch

# Target words to match against
test_dict = pd.Series([
    '长白朝鲜族自治县', '长阳土家族自治县', '城步苗族自治县',
    '达尔罕茂明安联合旗', '汨罗市'])
# Raw words to look up in the target list
raw_word = pd.Series(['达茂联合旗', '长阳县', '汩罗市'])
assert '汩罗市' != '汨罗市'  # They are not the same!

fcm = FuzzyChineseMatch(ngram_range=(3, 3), analyzer='stroke')
fcm.fit(test_dict)
top2_similar = fcm.transform(raw_word, n=2)

# Combine the raw words, the top-2 matches, their scores and their indices
res = pd.concat([
    raw_word,
    pd.DataFrame(top2_similar, columns=['top1', 'top2']),
    pd.DataFrame(
        fcm.get_similarity_score(),
        columns=['top1_score', 'top2_score']),
    pd.DataFrame(
        fcm.get_index(),
        columns=['top1_index', 'top2_index'])],
    axis=1)
```

| | top1 | top2 | top1_score | top2_score | top1_index | top2_index |
| ---------- | ------------------ | ---------------- | ---------- | ---------- | ---------- | ---------- |
| 达茂联合旗 | 达尔罕茂明安联合旗 | 长白朝鲜族自治县 | 0.824751 | 0.287237 | 3 | 0 |
| 长阳县 | 长阳土家族自治县 | 长白朝鲜族自治县 | 0.610285 | 0.475000 | 1 | 0 |
| 汩罗市 | 汨罗市 | 长白朝鲜族自治县 | 1.000000 | 0.152093 | 4 | 0 |
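One common follow-up step is to accept only matches whose score clears a cutoff and to flag the rest for manual review. The sketch below does this with plain Python on the top-1 values copied from the result table above. The cutoff of 0.7 and the variable names are assumptions chosen for illustration; a suitable threshold depends on the data and the analyzer.

```python
# (raw word, top1 match, top1 score, top1 index), copied from the table above.
results = [
    ("达茂联合旗", "达尔罕茂明安联合旗", 0.824751, 3),
    ("长阳县", "长阳土家族自治县", 0.610285, 1),
    ("汩罗市", "汨罗市", 1.000000, 4),
]

THRESHOLD = 0.7  # hypothetical cutoff, not a value recommended by the package

# Keep confident matches together with their index in the target list.
accepted = {
    raw: (match, index)
    for raw, match, score, index in results
    if score >= THRESHOLD
}
# Everything below the cutoff is left for manual review.
needs_review = [raw for raw, _, score, _ in results if score < THRESHOLD]

print(accepted)
print(needs_review)
```

With this cutoff, 长阳县 is sent to review even though its top match is the intended one, which shows why the threshold should be tuned on real examples.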

Other uses

  • Use Stroke and Radical directly to decompose a Chinese character into strokes or radicals.

    ```python
    from fuzzychinese import Stroke, Radical

    stroke = Stroke()
    radical = Radical()
    print("像", stroke.get_stroke("像"))
    print("像", radical.get_radical("像"))
    ```

    Output:

    ```
    像 ㇒〡㇒㇇〡㇕一㇒㇁㇒㇒㇒㇏
    像 人象
    ```
  • Use FuzzyChineseMatch.compare_two_columns(X, Y) to compare the two words in each row and get a similarity score for each pair.

  • See documentation for details.
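The row-wise comparison mentioned in the list above pairs the i-th word of one column with the i-th word of the other and yields one score per row. The sketch below shows that shape using only the standard library. It is a stand-in and not the package's method: `compare_rows` is a hypothetical helper, and the score comes from `difflib.SequenceMatcher`, which compares characters and knows nothing about strokes or radicals.

```python
from difflib import SequenceMatcher


def compare_rows(left, right):
    """Return one similarity score in [0, 1] per (left[i], right[i]) pair."""
    if len(left) != len(right):
        raise ValueError("both columns must have the same length")
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(left, right)]


x = ["长阳县", "汨罗市"]
y = ["长阳土家族自治县", "汨罗市"]
scores = compare_rows(x, y)
print(scores)
```

Unlike a top-n search against a whole target list, this compares each pair in isolation, so it suits checking two columns that are already aligned.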

Credits

Data for Chinese radicals are from 漢語拆字字典 by 開放詞典網.

Owner

  • Login: znwang25
  • Kind: user
  • Location: San Francisco

Citation (CITATION.cff)

cff-version: 1.2.0
title: FuzzyChinese
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Zenan
    family-names: Wang
    orcid: 'https://orcid.org/0000-0001-6337-6548'
repository-code: 'https://github.com/znwang25/fuzzychinese'
abstract: >-
  A simple tool to fuzzy match Chinese words, particular
  useful for proper name matching and address matching.
keywords:
  - text-processing
  - chinese
  - fuzzy-matching
  - nlp
license: BSD-3-Clause
version: '0.1.5 '
date-released: '2019-04-29'

GitHub Events

Total
  • Watch event: 9
Last Year
  • Watch event: 9

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 17
  • Total Committers: 1
  • Avg Commits per committer: 17.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
znwang25 z****5@g****m 17

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 5
  • Total pull requests: 3
  • Average time to close issues: 8 months
  • Average time to close pull requests: 2 minutes
  • Total issue authors: 5
  • Total pull request authors: 1
  • Average comments per issue: 1.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 3
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • lingvisa (1)
  • marcusau (1)
  • ahbon123 (1)
  • znwang25 (1)
  • Veekshit (1)
Pull Request Authors
  • znwang25 (3)
Top Labels
Issue Labels
bug (1)
Pull Request Labels
enhancement (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 759 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 2
  • Total versions: 3
  • Total maintainers: 1
pypi.org: fuzzychinese

A small package to fuzzy match Chinese words (Chinese fuzzy matching)

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 2
  • Downloads: 759 Last month
Rankings
Downloads: 2.9%
Stargazers count: 8.7%
Average: 9.0%
Dependent packages count: 10.1%
Dependent repos count: 11.6%
Forks count: 11.9%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • numpy *
  • pandas *
  • scikit-learn *