fugashi

A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.

https://github.com/polm/fugashi

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
    1 of 10 committers (10.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.8%) to scientific vocabulary

Keywords

cython-wrapper japanese mecab nlp tokenizer

Keywords from Contributors

transformer cryptocurrency cryptography jax
Last synced: 6 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: polm
  • License: mit
  • Language: C++
  • Default Branch: main
  • Homepage:
  • Size: 492 KB
Statistics
  • Stars: 459
  • Watchers: 7
  • Forks: 40
  • Open Issues: 9
  • Releases: 13
Topics
cython-wrapper japanese mecab nlp tokenizer
Created over 6 years ago · Last pushed 9 months ago
Metadata Files
Readme Funding License Citation

README.md


fugashi


fugashi is a Cython wrapper for MeCab, a Japanese tokenizer and morphological analysis tool. Wheels are provided for Linux, OSX (Intel), and Win64, and UniDic is easy to install.

You do not need to write issues in English.

Check out the interactive demo, see the blog post for background on why fugashi exists and some of the design decisions, or see this guide for a basic introduction to Japanese tokenization.

If you are on a platform for which wheels are not provided, you'll need to install MeCab first. It's recommended you install from source. If you need to build from source on Windows, @chezou's fork is recommended; see issue #44 for an explanation of the problems with the official repo.

Known platforms without wheels:

  • musl-based distros like Alpine (see issue #77)
  • PowerPC
  • Windows 32bit

Usage

```python
from fugashi import Tagger

tagger = Tagger('-Owakati')
text = "麩菓子は、麩を主材料とした日本の菓子。"
tagger.parse(text)
# => '麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。'

for word in tagger(text):
    # "feature" is the Unidic feature data as a named tuple
    print(word, word.feature.lemma, word.pos, sep='\t')
```
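As a small illustration of working with the wakati output, the sketch below splits the string shown in the example above into a token list using only the standard library. It does not call fugashi; the `wakati` string is copied from the output shown above, and the variable names are illustrative.

```python
from collections import Counter

# Input sentence and wakati output, copied from the usage example above.
text = "麩菓子は、麩を主材料とした日本の菓子。"
wakati = "麩 菓子 は 、 麩 を 主材 料 と し た 日本 の 菓子 。"

# Wakati output separates tokens with single spaces, so str.split() recovers them.
tokens = wakati.split()

# The input sentence contains no spaces, so joining the tokens gives it back.
roundtrip = "".join(tokens)

# Simple token frequency count.
counts = Counter(tokens)

print(tokens)
print(roundtrip == text)
print(counts.most_common(2))
```

Splitting on whitespace is only safe when the input itself contains no spaces; for anything beyond quick inspection, iterating over `tagger(text)` as shown above gives direct access to each token.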

Installing a Dictionary

fugashi requires a dictionary. UniDic is recommended, and two easy-to-install versions are provided.

  • unidic-lite, a slightly modified version of UniDic 2.1.2 (from 2013) that's relatively small
  • unidic, the latest UniDic 3.1.0, which is 770MB on disk and requires a separate download step

If you just want to make sure things work you can start with unidic-lite, but for more serious processing unidic is recommended. For production use you'll generally want to generate your own dictionary too; for details see the MeCab documentation.

To get either of these dictionaries, install it directly with pip, or install it as a fugashi extra as shown below:

```sh
pip install 'fugashi[unidic-lite]'

# The full version of UniDic requires a separate download step
pip install 'fugashi[unidic]'
python -m unidic download
```
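To confirm which dictionary package ended up in the current environment, a quick check like the following may help. This is a minimal sketch that assumes the packages are importable as `unidic_lite` and `unidic`; the helper name `installed_dictionaries` is hypothetical and not part of fugashi.

```python
import importlib.util


def installed_dictionaries():
    """Report which dictionary packages can be imported in this environment."""
    # Assumed import names for the unidic-lite and unidic packages.
    candidates = ("unidic_lite", "unidic")
    return {
        name: importlib.util.find_spec(name) is not None
        for name in candidates
    }


status = installed_dictionaries()
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'not installed'}")
```

Note that this only checks that the Python package is present. For the full `unidic` package, the dictionary data still has to be fetched with the separate download step shown above.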

For more information on the different MeCab dictionaries available, see this article.

Dictionary Use

fugashi is written with the assumption you'll use UniDic to process Japanese, but it supports arbitrary dictionaries.

If you're using a dictionary other than UniDic, you can use the GenericTagger like this:

```python
from fugashi import GenericTagger
tagger = GenericTagger()

# parse can be used as normal
text = 'something'
tagger.parse(text)

# features from the dictionary can be accessed by field numbers
for word in tagger(text):
    print(word.surface, word.feature[0])
```

You can also create a dictionary wrapper to get feature information as a named tuple.

```python
from fugashi import GenericTagger, create_feature_wrapper

CustomFeatures = create_feature_wrapper('CustomFeatures', 'alpha beta gamma')
tagger = GenericTagger(wrapper=CustomFeatures)
for word in tagger.parseToNodeList(text):
    print(word.surface, word.feature.alpha)
```
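To make the named-tuple idea concrete, here is a standard-library sketch of mapping a comma-separated feature string onto named fields. It is not fugashi's implementation: the `parse_features` helper and the sample feature strings are hypothetical, and the field names simply mirror the `alpha beta gamma` example above.

```python
import csv
from collections import namedtuple

# Named tuple with the same field names as the example above.
# Defaults of None cover dictionary entries that provide fewer fields.
CustomFeatures = namedtuple(
    "CustomFeatures", "alpha beta gamma", defaults=(None, None, None)
)


def parse_features(raw):
    """Split a comma-separated feature string and map it onto named fields."""
    # csv.reader handles quoted fields that themselves contain commas.
    fields = next(csv.reader([raw]))
    # Ignore any extra trailing fields beyond those the tuple defines.
    return CustomFeatures(*fields[: len(CustomFeatures._fields)])


full = parse_features("noun,common,*")
short = parse_features("noun")

print(full.alpha, full[0])  # access by name or by field number
print(short)
```

Named access keeps calling code readable when a dictionary has many feature columns, while positional access still works for quick checks.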

Citation

If you use fugashi in research, it would be appreciated if you cite this paper. You can read it at the ACL Anthology or on arXiv.

@inproceedings{mccann-2020-fugashi,
    title = "fugashi, a Tool for Tokenizing {J}apanese in Python",
    author = "McCann, Paul",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.7",
    pages = "44--51",
    abstract = "Recent years have seen an increase in the number of large-scale multilingual NLP projects. However, even in such projects, languages with special processing requirements are often excluded. One such language is Japanese. Japanese is written without spaces, tokenization is non-trivial, and while high quality open source tokenizers exist they can be hard to use and lack English documentation. This paper introduces fugashi, a MeCab wrapper for Python, and gives an introduction to tokenizing Japanese.",
}

Alternatives

If you have a problem with fugashi feel free to open an issue. However, there are some cases where it might be better to use a different library.

  • If you don't want to deal with installing MeCab at all, try SudachiPy.
  • If you need to work with Korean, try pymecab-ko or KoNLPy.

License and Copyright Notice

fugashi is released under the terms of the MIT license. Please copy it far and wide.

fugashi is a wrapper for MeCab, and fugashi wheels include MeCab binaries. MeCab is copyrighted free software by Taku Kudo <taku@chasen.org> and Nippon Telegraph and Telephone Corporation, and is redistributed under the BSD License.

Owner

  • Name: Paul O'Leary McCann
  • Login: polm
  • Kind: user
  • Location: 🗼Tokyo
  • Company: Cotonoha

These days I mostly just take pictures of flowers.

Citation (CITATION.cff)

cff-version: 1.2.0
preferred-citation:
  type: article
  message: "If you use fugashi in research, it would be appreciated if you cite this paper."
  authors:
  - family-names: "McCann"
    given-names: "Paul"
    orcid: "https://orcid.org/0000-0003-3376-8772"
  title: "fugashi, a Tool for Tokenizing Japanese in Python"
  doi: "10.18653/v1/2020.nlposs-1.7"
  journal: "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)"
  year: 2020
  month: 11
  start: 44
  end: 51

  

GitHub Events

Total
  • Issues event: 15
  • Watch event: 65
  • Delete event: 1
  • Issue comment event: 43
  • Push event: 12
  • Pull request review event: 11
  • Pull request review comment event: 17
  • Pull request event: 10
  • Fork event: 7
  • Create event: 15
Last Year
  • Issues event: 15
  • Watch event: 65
  • Delete event: 1
  • Issue comment event: 43
  • Push event: 12
  • Pull request review event: 11
  • Pull request review comment event: 17
  • Pull request event: 10
  • Fork event: 7
  • Create event: 15

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 255
  • Total Committers: 10
  • Avg Commits per committer: 25.5
  • Development Distribution Score (DDS): 0.055
Top Committers
| Name | Email | Commits |
| --- | --- | --- |
| Paul O'Leary McCann | p****m@d****m | 241 |
| Hiromu Hota | h****a@h****m | 3 |
| Ashlynn Anderson | g****b@p****h | 3 |
| Aki Ariga | c****b@g****m | 2 |
| Ronny Pfannschmidt | o****e@r****e | 1 |
| Koichi Yasuoka | y****a@k****p | 1 |
| Teo Wen Shen | 3****n@u****m | 1 |
| odidev | o****v@p****m | 1 |
| Yohei Tamura | t****y@g****m | 1 |
| Deyong Zheng | z****y@m****m | 1 |
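
The summary figures above can be reproduced from the per-committer counts. The sketch below assumes the Development Distribution Score is defined as one minus the top committer's share of all commits, which matches the reported 0.055; that definition is an assumption and is not stated on this page.

```python
# Commit counts per committer, taken from the Top Committers list above.
commits = [241, 3, 3, 2, 1, 1, 1, 1, 1, 1]

total = sum(commits)
average = total / len(commits)

# Assumed definition: DDS = 1 - (commits by top committer / total commits).
dds = 1 - max(commits) / total

print(total)          # 255
print(average)        # 25.5
print(round(dds, 3))  # 0.055
```

A score this close to zero indicates that nearly all commits come from a single committer.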

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 82
  • Total pull requests: 30
  • Average time to close issues: about 2 months
  • Average time to close pull requests: 7 days
  • Total issue authors: 59
  • Total pull request authors: 16
  • Average comments per issue: 3.9
  • Average comments per pull request: 2.1
  • Merged pull requests: 27
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 10
  • Pull requests: 11
  • Average time to close issues: 6 days
  • Average time to close pull requests: 7 days
  • Issue authors: 9
  • Pull request authors: 4
  • Average comments per issue: 1.2
  • Average comments per pull request: 3.27
  • Merged pull requests: 10
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • polm (13)
  • HiromuHota (3)
  • garfieldnate (3)
  • pedrominicz (3)
  • KoichiYasuoka (2)
  • roblframpton (2)
  • baker-ling (2)
  • joshdavham (2)
  • rabbit19981023 (2)
  • bekim-bajraktari (1)
  • gembleman (1)
  • tamuhey (1)
  • Skimige (1)
  • ghost (1)
  • yurivict (1)
Pull Request Authors
  • sabonerune (7)
  • lambdadog (4)
  • HiromuHota (3)
  • polm (3)
  • kino-ma (2)
  • eginhard (2)
  • yihong0618 (2)
  • chezou (2)
  • sophiefy (1)
  • KoichiYasuoka (1)
  • tamuhey (1)
  • zdyh (1)
  • RonnyPfannschmidt (1)
  • teowenshen (1)
  • odidev (1)
Top Labels
Issue Labels
windows (13) osx (5) help wanted (5) enhancement (4) question (3) bug (3) conda (1) korean (1) missing-dll (1)
Pull Request Labels
enhancement (2)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 398,371 last-month
  • Total docker downloads: 426,662
  • Total dependent packages: 32
  • Total dependent repositories: 243
  • Total versions: 87
  • Total maintainers: 1
pypi.org: fugashi

Cython MeCab wrapper for fast, pythonic Japanese tokenization.

  • Versions: 87
  • Dependent Packages: 32
  • Dependent Repositories: 243
  • Downloads: 398,371 Last month
  • Docker Downloads: 426,662
Rankings
Dependent packages count: 0.4%
Downloads: 0.8%
Docker downloads count: 1.0%
Dependent repos count: 1.0%
Average: 2.3%
Stargazers count: 3.5%
Forks count: 7.4%
Maintainers (1)
23
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • Cython >=0.29.13
.github/workflows/actions/build-manylinux/action.yml actions
  • docker://quay.io/pypa/manylinux2014_x86_64 * docker
.github/workflows/actions/build-manylinux-aarch64/action.yml actions
  • docker://quay.io/pypa/manylinux2014_aarch64 * docker
.github/workflows/manylinux1.yml actions
  • ./.github/workflows/actions/build-manylinux-aarch64/ * composite
  • ./.github/workflows/actions/build-manylinux/ * composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v1 composite
  • docker/setup-qemu-action v1 composite
.github/workflows/osx.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v1 composite
.github/workflows/test_manylinux.yml actions
  • actions/checkout v3 composite
.github/workflows/windows.yml actions
  • actions/cache v1 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v1 composite
Dockerfile docker
  • quay.io/pypa/manylinux2014_x86_64 latest build
setup.py pypi