bunkai

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

https://github.com/megagonlabs/bunkai

Science Score: 54.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

japanese python sentence-boundary-detection sentence-tokenizer

Keywords from Contributors

interactive diffusers mesh interpretability profiles distribution sequences generic projection standardization
Last synced: 6 months ago

Repository

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Basic Info
Statistics
  • Stars: 193
  • Watchers: 4
  • Forks: 10
  • Open Issues: 18
  • Releases: 14
Topics
japanese python sentence-boundary-detection sentence-tokenizer
Created almost 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Bunkai


Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

Install

```console
$ pip install -U bunkai
```

Disambiguation without Models

```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```

  • Feed each document as a single line, representing line breaks with ▁ (U+2581).
  • The output marks sentence boundaries with │ (U+2502).
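Both markers are plain Unicode characters, so pre- and post-processing around the CLI needs no special tooling. A minimal sketch of the convention in Python (independent of bunkai itself; the helper names are my own):

```python
# Bunkai's CLI I/O convention: one document per input line, with the
# document's own line breaks encoded as ▁ (U+2581); sentence boundaries
# in the output are marked with │ (U+2502).

LINE_BREAK = "\u2581"  # ▁
BOUNDARY = "\u2502"    # │

def encode_document(text: str) -> str:
    """Turn a multi-line document into a single input line for bunkai."""
    return text.replace("\n", LINE_BREAK)

def split_output(line: str) -> list[str]:
    """Split one line of bunkai output into sentences."""
    return line.split(BOUNDARY)

doc = "2文書目の先頭行です。\n改行はU+2581で表現します。"
print(encode_document(doc))  # 2文書目の先頭行です。▁改行はU+2581で表現します。

# bunkai's output for that input (taken from the example above):
out = "2文書目の先頭行です。▁│改行はU+2581で表現します。"
print(split_output(out))     # ['2文書目の先頭行です。▁', '改行はU+2581で表現します。']
```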

Disambiguation for Line Breaks with a Model

If you want to disambiguate sentence boundaries at line breaks as well, add the --model option with the path to a model.

First, install the extras required by the --model option.

```console
$ pip install -U 'bunkai[lb]'
```

Second, set up a model. This will take some time.

```console
$ bunkai --model bunkai-model-directory --setup
```

Then, point bunkai at the model directory and run it.

```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```

Morphological Analysis Result

You can get morphological analysis results with the --ma option.

It can be combined with the --model option.

```console
$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma --model bunkai-model-directory
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
```
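The token lines resemble MeCab's default output: a surface form, a delimiter, then comma-separated features (part of speech, conjugation, readings, ...). Assuming a tab delimiter (MeCab's default; adjust if your output differs), each line can be parsed with a sketch like this:

```python
# Parse one token line of `bunkai --ma` output.
# Assumption: surface and features are tab-separated, MeCab-style.

def parse_token(line: str) -> tuple[str, list[str]]:
    surface, _, features = line.partition("\t")
    return surface, features.split(",")

surface, features = parse_token("形態素\t名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ")
print(surface)      # 形態素
print(features[0])  # 名詞 (part of speech)
```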

Python Library

You can also use Bunkai as a Python library.

```python
from bunkai import Bunkai

bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
```

If you also want to disambiguate line breaks, pass the path of the model directory you set up.

```python
from pathlib import Path

from bunkai import Bunkai

bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
for sentence in bunkai("そうなんです▁このように▁pythonライブラリとしても▁使えます!"):
    print(sentence)

"""
Output:
そうなんです▁
このように▁pythonライブラリとしても▁使えます!
"""
```

For more information, see the examples.

Documents

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75. November 2020. [PDF] [bib]

License

Apache License 2.0

Owner

  • Name: Megagon Labs
  • Login: megagonlabs
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work in a project of yours and write about it, please cite our paper using the following citation data."
authors:
  - family-names: Hayashibe
    given-names: Yuta
  - family-names: Mitsuzawa
    given-names: Kensuke
title: "Bunkai"
url: https://github.com/megagonlabs/bunkai
preferred-citation:
  type: conference-paper
  title: Sentence Boundary Detection on Line Breaks in Japanese
  authors:
    - family-names: Hayashibe
      given-names: Yuta
    - family-names: Mitsuzawa
      given-names: Kensuke
  doi: 10.18653/v1/2020.wnut-1.10
  collection-title: Proceedings of The 6th Workshop on Noisy User-generated Text
  year: 2020
  month: 11
  publisher: 
    name: Association for Computational Linguistics
  url: https://aclanthology.org/2020.wnut-1.10/
  start: 71
  end: 75
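The preferred citation above maps to BibTeX roughly as follows. The entry key and field formatting are my own choices; the data (authors, title, venue, DOI, pages) comes from the CFF block:

```bibtex
@inproceedings{hayashibe-mitsuzawa-2020-sentence,
    title     = "Sentence Boundary Detection on Line Breaks in {J}apanese",
    author    = "Hayashibe, Yuta and Mitsuzawa, Kensuke",
    booktitle = "Proceedings of The 6th Workshop on Noisy User-generated Text",
    year      = "2020",
    month     = nov,
    publisher = "Association for Computational Linguistics",
    doi       = "10.18653/v1/2020.wnut-1.10",
    url       = "https://aclanthology.org/2020.wnut-1.10/",
    pages     = "71--75",
}
```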

GitHub Events

Total
  • Watch event: 8
Last Year
  • Watch event: 8

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 554
  • Total Committers: 4
  • Avg Commits per committer: 138.5
  • Development Distribution Score (DDS): 0.264
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Yuta Hayashibe y****a@h****p 408
dependabot[bot] 4****] 140
r-terada r****3@g****m 5
t-yamamura t****a@p****p 1
Committer Domains (Top 20 + Academic)
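The Development Distribution Score above can be reproduced from the committer table: a common definition is 1 minus the top committer's share of all commits, and I assume that is the formula used here since it matches the reported 0.264:

```python
# Development Distribution Score (DDS): 1 - (top committer's commits / total).
# Commit counts taken from the committer table above.
commits = {
    "Yuta Hayashibe": 408,
    "dependabot[bot]": 140,
    "r-terada": 5,
    "t-yamamura": 1,
}

total = sum(commits.values())            # 554
dds = 1 - max(commits.values()) / total
print(round(dds, 3))                     # 0.264
print(total / len(commits))              # 138.5 avg commits per committer
```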

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 1
  • Total pull requests: 194
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.46
  • Merged pull requests: 90
  • Bot issues: 0
  • Bot pull requests: 190
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • yosato (1)
Pull Request Authors
  • dependabot[bot] (181)
  • shirayu (3)
  • mh-northlander (1)
Top Labels
Issue Labels
Pull Request Labels
  • Type: Dependencies (181)
  • python (76)
  • github_actions (58)
  • javascript (47)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 605 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 3
  • Total versions: 17
  • Total maintainers: 1
pypi.org: bunkai

Sentence boundary disambiguation tool for Japanese texts

  • Versions: 17
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 605 Last month
Rankings
Stargazers count: 5.3%
Downloads: 8.0%
Average: 8.8%
Dependent repos count: 8.9%
Dependent packages count: 10.1%
Forks count: 11.9%
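The "Average" ranking appears to be the arithmetic mean of the five listed percentile ranks; this is an assumption, but it matches the reported 8.8%:

```python
# Mean of the listed percentile rankings (lower = better).
ranks = {
    "stargazers": 5.3,
    "downloads": 8.0,
    "dependent_repos": 8.9,
    "dependent_packages": 10.1,
    "forks": 11.9,
}
average = sum(ranks.values()) / len(ranks)
print(round(average, 1))  # 8.8
```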
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • actions/setup-python v4 composite
  • snok/install-poetry v1.3.4 composite
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/typos.yml actions
  • actions/checkout v3 composite
  • crate-ci/typos v1.16.2 composite
package-lock.json npm
  • @isaacs/cliui 8.0.2 development
  • @pkgjs/parseargs 0.11.0 development
  • ansi-regex 6.0.1 development
  • ansi-regex 5.0.1 development
  • ansi-styles 6.2.1 development
  • ansi-styles 4.3.0 development
  • argparse 2.0.1 development
  • balanced-match 1.0.2 development
  • brace-expansion 2.0.1 development
  • color-convert 2.0.1 development
  • color-name 1.1.4 development
  • commander 11.0.0 development
  • cross-spawn 7.0.3 development
  • deep-extend 0.6.0 development
  • eastasianwidth 0.2.0 development
  • emoji-regex 9.2.2 development
  • emoji-regex 8.0.0 development
  • entities 3.0.1 development
  • foreground-child 3.1.1 development
  • fsevents 2.3.2 development
  • get-stdin 9.0.0 development
  • glob 10.2.7 development
  • ignore 5.2.4 development
  • ini 3.0.1 development
  • is-fullwidth-code-point 3.0.0 development
  • isexe 2.0.0 development
  • jackspeak 2.2.2 development
  • js-yaml 4.1.0 development
  • jsonc-parser 3.2.0 development
  • linkify-it 4.0.1 development
  • lru-cache 10.0.0 development
  • markdown-it 13.0.1 development
  • markdownlint 0.29.0 development
  • markdownlint-cli 0.35.0 development
  • markdownlint-micromark 0.1.5 development
  • mdurl 1.0.1 development
  • minimatch 9.0.3 development
  • minimist 1.2.8 development
  • minipass 6.0.2 development
  • path-key 3.1.1 development
  • path-scurry 1.10.1 development
  • pyright 1.1.321 development
  • run-con 1.2.12 development
  • shebang-command 2.0.0 development
  • shebang-regex 3.0.0 development
  • signal-exit 4.1.0 development
  • string-width 5.1.2 development
  • string-width 4.2.3 development
  • string-width-cjs 4.2.3 development
  • strip-ansi 6.0.1 development
  • strip-ansi 7.1.0 development
  • strip-ansi-cjs 6.0.1 development
  • strip-json-comments 3.1.1 development
  • uc.micro 1.0.6 development
  • which 2.0.2 development
  • wrap-ansi 8.1.0 development
  • wrap-ansi-cjs 7.0.0 development
package.json npm
  • markdown-it >=13.0.0 development
  • markdownlint-cli ^0.35.0 development
  • pyright <1.1.322 development
poetry.lock pypi
  • attrs 23.1.0
  • black 23.7.0
  • certifi 2023.7.22
  • cffconvert 2.0.0
  • charset-normalizer 3.2.0
  • click 8.1.6
  • cmake 3.27.1
  • colorama 0.4.6
  • coverage 7.2.7
  • dataclasses-json 0.5.14
  • docopt 0.6.2
  • emoji 2.7.0
  • emojis 0.7.0
  • filelock 3.12.2
  • flake8 5.0.4
  • fsspec 2023.6.0
  • huggingface-hub 0.16.4
  • idna 3.4
  • isort 5.12.0
  • janome 0.5.0
  • jinja2 3.1.2
  • joblib 1.3.1
  • jsonschema 3.2.0
  • lit 16.0.6
  • markupsafe 2.1.3
  • marshmallow 3.20.1
  • mccabe 0.7.0
  • mock 5.1.0
  • more-itertools 10.1.0
  • mpmath 1.3.0
  • mypy-extensions 1.0.0
  • networkx 3.1
  • numpy 1.24.4
  • nvidia-cublas-cu11 11.10.3.66
  • nvidia-cuda-cupti-cu11 11.7.101
  • nvidia-cuda-nvrtc-cu11 11.7.99
  • nvidia-cuda-runtime-cu11 11.7.99
  • nvidia-cudnn-cu11 8.5.0.96
  • nvidia-cufft-cu11 10.9.0.58
  • nvidia-curand-cu11 10.2.10.91
  • nvidia-cusolver-cu11 11.4.0.1
  • nvidia-cusparse-cu11 11.7.4.91
  • nvidia-nccl-cu11 2.14.3
  • nvidia-nvtx-cu11 11.7.91
  • packaging 23.1
  • pathspec 0.11.2
  • platformdirs 3.10.0
  • pycodestyle 2.9.1
  • pydocstyle 6.3.0
  • pyflakes 2.5.0
  • pykwalify 1.8.0
  • pyrsistent 0.19.3
  • python-dateutil 2.8.2
  • pyyaml 6.0.1
  • regex 2023.8.8
  • requests 2.31.0
  • ruamel-yaml 0.17.32
  • ruamel-yaml-clib 0.2.7
  • safetensors 0.3.2
  • scikit-learn 1.3.0
  • scipy 1.10.1
  • seqeval 1.2.2
  • setuptools 68.0.0
  • six 1.16.0
  • snowballstemmer 2.2.0
  • spans 1.1.1
  • sympy 1.12
  • threadpoolctl 3.2.0
  • tokenizers 0.13.3
  • toml 0.10.2
  • tomli 2.0.1
  • torch 2.0.0
  • tqdm 4.66.0
  • transformers 4.31.0
  • triton 2.0.0
  • typing-extensions 4.7.1
  • typing-inspect 0.9.0
  • urllib3 2.0.4
  • wheel 0.41.1
  • yamllint 1.32.0
pyproject.toml pypi
  • black >=21.10b0 develop
  • cffconvert ^2.0.0 develop
  • coverage >=5.3 develop
  • flake8 >=3.8.4 develop
  • isort >=5.9.3 develop
  • mock >=4.0.2 develop
  • pydocstyle >=5.1.1 develop
  • yamllint >=1.25.0 develop
  • dataclasses-json >=0.5.2
  • emoji >=2.0.0
  • emojis >=0.6.0
  • janome >=0.4.1
  • more_itertools >=8.6.0
  • numpy >=1.16.0
  • python >=3.8,<3.12
  • regex !=2022.7.24
  • requests ^2.27.1
  • seqeval >=1.2.2
  • spans >=1.1.0
  • toml >=0.10.2
  • torch >=1.3.0,!=2.0.1
  • tqdm *
  • transformers >=4.22.0