bunkai

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

https://github.com/megagonlabs/bunkai

Science Score: 54.0%

This score indicates how likely this project is to be science-related, based on the following indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    1 of 4 committers (25.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Keywords

japanese python sentence-boundary-detection sentence-tokenizer

Keywords from Contributors

interactive diffusers mesh interpretability profiles distribution sequences generic projection standardization
Last synced: 6 months ago

Repository

Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)

Basic Info
Statistics
  • Stars: 193
  • Watchers: 4
  • Forks: 10
  • Open Issues: 18
  • Releases: 14
Topics
japanese python sentence-boundary-detection sentence-tokenizer
Created almost 5 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

Bunkai


Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.

Quick Start

Install

```console
$ pip install -U bunkai
```

Disambiguation without Models

```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```

  • Feed each document as a single line, representing line breaks with ▁ (U+2581).
  • The output marks sentence boundaries with │ (U+2502).
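Both markers are plain Unicode characters, so pre- and post-processing around the CLI needs no special tooling. A minimal sketch of the convention in Python (independent of bunkai itself; the helper names are my own):

```python
# Bunkai's CLI I/O convention: one document per input line, with the
# document's own line breaks encoded as ▁ (U+2581); sentence boundaries
# in the output are marked with │ (U+2502).

LINE_BREAK = "\u2581"  # ▁
BOUNDARY = "\u2502"    # │

def encode_document(text: str) -> str:
    """Turn a multi-line document into a single input line for bunkai."""
    return text.replace("\n", LINE_BREAK)

def split_output(line: str) -> list[str]:
    """Split one line of bunkai output into sentences."""
    return line.split(BOUNDARY)

doc = "2文書目の先頭行です。\n改行はU+2581で表現します。"
print(encode_document(doc))  # 2文書目の先頭行です。▁改行はU+2581で表現します。

# bunkai's output for that input (taken from the example above):
out = "2文書目の先頭行です。▁│改行はU+2581で表現します。"
print(split_output(out))     # ['2文書目の先頭行です。▁', '改行はU+2581で表現します。']
```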

Disambiguation for Line Breaks with a Model

If you want to disambiguate sentence boundaries at line breaks as well, add the --model option with the path to a model.

First, install the extras required by the --model option.

```console
$ pip install -U 'bunkai[lb]'
```

Second, set up a model. This will take some time.

```console
$ bunkai --model bunkai-model-directory --setup
```

Then, point bunkai at the model directory and run it.

```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```

Morphological Analysis Result

You can get morphological analysis results with the --ma option.

It can be combined with the --model option.

```console
$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma --model bunkai-model-directory
形態素	名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析	名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。	記号,句点,*,*,*,*,。,。,。
EOS
結果	名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を	助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
 	記号,空白,*,*,*,*, ,*,*
表示	名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し	動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます	助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
!	記号,一般,*,*,*,*,!,!,!
EOS
```
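The token lines resemble MeCab's default output: a surface form, a delimiter, then comma-separated features (part of speech, conjugation, readings, ...). Assuming a tab delimiter (MeCab's default; adjust if your output differs), each line can be parsed with a sketch like this:

```python
# Parse one token line of `bunkai --ma` output.
# Assumption: surface and features are tab-separated, MeCab-style.

def parse_token(line: str) -> tuple[str, list[str]]:
    surface, _, features = line.partition("\t")
    return surface, features.split(",")

surface, features = parse_token("形態素\t名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ")
print(surface)      # 形態素
print(features[0])  # 名詞 (part of speech)
```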

Python Library

You can also use Bunkai as a Python library.

```python
from bunkai import Bunkai

bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
```

If you also want to disambiguate line breaks, pass the path of the model directory you set up.

```python
from pathlib import Path

from bunkai import Bunkai

bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
for sentence in bunkai("そうなんです▁このように▁pythonライブラリとしても▁使えます!"):
    print(sentence)

"""
Output:
そうなんです▁
このように▁pythonライブラリとしても▁使えます!
"""
```

For more information, see the examples.

Documents

References

  • Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp.71-75. November 2020. [PDF] [bib]

License

Apache License 2.0

Owner

  • Name: Megagon Labs
  • Login: megagonlabs
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work in a project of yours and write about it, please cite our paper using the following citation data."
authors:
  - family-names: Hayashibe
    given-names: Yuta
  - family-names: Mitsuzawa
    given-names: Kensuke
title: "Bunkai"
url: https://github.com/megagonlabs/bunkai
preferred-citation:
  type: conference-paper
  title: Sentence Boundary Detection on Line Breaks in Japanese
  authors:
    - family-names: Hayashibe
      given-names: Yuta
    - family-names: Mitsuzawa
      given-names: Kensuke
  doi: 10.18653/v1/2020.wnut-1.10
  collection-title: Proceedings of The 6th Workshop on Noisy User-generated Text
  year: 2020
  month: 11
  publisher: 
    name: Association for Computational Linguistics
  url: https://aclanthology.org/2020.wnut-1.10/
  start: 71
  end: 75
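The preferred citation above maps to BibTeX roughly as follows. The entry key and field formatting are my own choices; the data (authors, title, venue, DOI, pages) comes from the CFF block:

```bibtex
@inproceedings{hayashibe-mitsuzawa-2020-sentence,
    title     = "Sentence Boundary Detection on Line Breaks in {J}apanese",
    author    = "Hayashibe, Yuta and Mitsuzawa, Kensuke",
    booktitle = "Proceedings of The 6th Workshop on Noisy User-generated Text",
    year      = "2020",
    month     = nov,
    publisher = "Association for Computational Linguistics",
    doi       = "10.18653/v1/2020.wnut-1.10",
    url       = "https://aclanthology.org/2020.wnut-1.10/",
    pages     = "71--75",
}
```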

GitHub Events

Total
  • Watch event: 8
Last Year
  • Watch event: 8

Committers

Last synced: 9 months ago

All Time
  • Total Commits: 554
  • Total Committers: 4
  • Avg Commits per committer: 138.5
  • Development Distribution Score (DDS): 0.264
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Yuta Hayashibe y****a@h****p 408
dependabot[bot] 4****] 140
r-terada r****3@g****m 5
t-yamamura t****a@p****p 1
Committer Domains (Top 20 + Academic)
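The Development Distribution Score above can be reproduced from the committer table: a common definition is 1 minus the top committer's share of all commits, and I assume that is the formula used here since it matches the reported 0.264:

```python
# Development Distribution Score (DDS): 1 - (top committer's commits / total).
# Commit counts taken from the committer table above.
commits = {
    "Yuta Hayashibe": 408,
    "dependabot[bot]": 140,
    "r-terada": 5,
    "t-yamamura": 1,
}

total = sum(commits.values())            # 554
dds = 1 - max(commits.values()) / total
print(round(dds, 3))                     # 0.264
print(total / len(commits))              # 138.5 avg commits per committer
```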

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 1
  • Total pull requests: 194
  • Average time to close issues: N/A
  • Average time to close pull requests: 3 days
  • Total issue authors: 1
  • Total pull request authors: 3
  • Average comments per issue: 3.0
  • Average comments per pull request: 0.46
  • Merged pull requests: 90
  • Bot issues: 0
  • Bot pull requests: 190
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • yosato (1)
Pull Request Authors
  • dependabot[bot] (181)
  • shirayu (3)
  • mh-northlander (1)
Top Labels
Issue Labels
Pull Request Labels
  • Type: Dependencies (181)
  • python (76)
  • github_actions (58)
  • javascript (47)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 605 last-month
  • Total dependent packages: 1
  • Total dependent repositories: 3
  • Total versions: 17
  • Total maintainers: 1
pypi.org: bunkai

Sentence boundary disambiguation tool for Japanese texts

  • Versions: 17
  • Dependent Packages: 1
  • Dependent Repositories: 3
  • Downloads: 605 Last month
Rankings
Stargazers count: 5.3%
Downloads: 8.0%
Average: 8.8%
Dependent repos count: 8.9%
Dependent packages count: 10.1%
Forks count: 11.9%
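The "Average" ranking appears to be the arithmetic mean of the five listed percentile ranks; this is an assumption, but it matches the reported 8.8%:

```python
# Mean of the listed percentile rankings (lower = better).
ranks = {
    "stargazers": 5.3,
    "downloads": 8.0,
    "dependent_repos": 8.9,
    "dependent_packages": 10.1,
    "forks": 11.9,
}
average = sum(ranks.values()) / len(ranks)
print(round(average, 1))  # 8.8
```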
Maintainers (1)
Last synced: 6 months ago

Dependencies

.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • actions/setup-python v4 composite
  • snok/install-poetry v1.3.4 composite
.github/workflows/codeql-analysis.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/typos.yml actions
  • actions/checkout v3 composite
  • crate-ci/typos v1.16.2 composite
package-lock.json npm
  • @isaacs/cliui 8.0.2 development
  • @pkgjs/parseargs 0.11.0 development
  • ansi-regex 6.0.1 development
  • ansi-regex 5.0.1 development
  • ansi-styles 6.2.1 development
  • ansi-styles 4.3.0 development
  • argparse 2.0.1 development
  • balanced-match 1.0.2 development
  • brace-expansion 2.0.1 development
  • color-convert 2.0.1 development
  • color-name 1.1.4 development
  • commander 11.0.0 development
  • cross-spawn 7.0.3 development
  • deep-extend 0.6.0 development
  • eastasianwidth 0.2.0 development
  • emoji-regex 9.2.2 development
  • emoji-regex 8.0.0 development
  • entities 3.0.1 development
  • foreground-child 3.1.1 development
  • fsevents 2.3.2 development
  • get-stdin 9.0.0 development
  • glob 10.2.7 development
  • ignore 5.2.4 development
  • ini 3.0.1 development
  • is-fullwidth-code-point 3.0.0 development
  • isexe 2.0.0 development
  • jackspeak 2.2.2 development
  • js-yaml 4.1.0 development
  • jsonc-parser 3.2.0 development
  • linkify-it 4.0.1 development
  • lru-cache 10.0.0 development
  • markdown-it 13.0.1 development
  • markdownlint 0.29.0 development
  • markdownlint-cli 0.35.0 development
  • markdownlint-micromark 0.1.5 development
  • mdurl 1.0.1 development
  • minimatch 9.0.3 development
  • minimist 1.2.8 development
  • minipass 6.0.2 development
  • path-key 3.1.1 development
  • path-scurry 1.10.1 development
  • pyright 1.1.321 development
  • run-con 1.2.12 development
  • shebang-command 2.0.0 development
  • shebang-regex 3.0.0 development
  • signal-exit 4.1.0 development
  • string-width 5.1.2 development
  • string-width 4.2.3 development
  • string-width-cjs 4.2.3 development
  • strip-ansi 6.0.1 development
  • strip-ansi 7.1.0 development
  • strip-ansi-cjs 6.0.1 development
  • strip-json-comments 3.1.1 development
  • uc.micro 1.0.6 development
  • which 2.0.2 development
  • wrap-ansi 8.1.0 development
  • wrap-ansi-cjs 7.0.0 development
package.json npm
  • markdown-it >=13.0.0 development
  • markdownlint-cli ^0.35.0 development
  • pyright <1.1.322 development
poetry.lock pypi
  • attrs 23.1.0
  • black 23.7.0
  • certifi 2023.7.22
  • cffconvert 2.0.0
  • charset-normalizer 3.2.0
  • click 8.1.6
  • cmake 3.27.1
  • colorama 0.4.6
  • coverage 7.2.7
  • dataclasses-json 0.5.14
  • docopt 0.6.2
  • emoji 2.7.0
  • emojis 0.7.0
  • filelock 3.12.2
  • flake8 5.0.4
  • fsspec 2023.6.0
  • huggingface-hub 0.16.4
  • idna 3.4
  • isort 5.12.0
  • janome 0.5.0
  • jinja2 3.1.2
  • joblib 1.3.1
  • jsonschema 3.2.0
  • lit 16.0.6
  • markupsafe 2.1.3
  • marshmallow 3.20.1
  • mccabe 0.7.0
  • mock 5.1.0
  • more-itertools 10.1.0
  • mpmath 1.3.0
  • mypy-extensions 1.0.0
  • networkx 3.1
  • numpy 1.24.4
  • nvidia-cublas-cu11 11.10.3.66
  • nvidia-cuda-cupti-cu11 11.7.101
  • nvidia-cuda-nvrtc-cu11 11.7.99
  • nvidia-cuda-runtime-cu11 11.7.99
  • nvidia-cudnn-cu11 8.5.0.96
  • nvidia-cufft-cu11 10.9.0.58
  • nvidia-curand-cu11 10.2.10.91
  • nvidia-cusolver-cu11 11.4.0.1
  • nvidia-cusparse-cu11 11.7.4.91
  • nvidia-nccl-cu11 2.14.3
  • nvidia-nvtx-cu11 11.7.91
  • packaging 23.1
  • pathspec 0.11.2
  • platformdirs 3.10.0
  • pycodestyle 2.9.1
  • pydocstyle 6.3.0
  • pyflakes 2.5.0
  • pykwalify 1.8.0
  • pyrsistent 0.19.3
  • python-dateutil 2.8.2
  • pyyaml 6.0.1
  • regex 2023.8.8
  • requests 2.31.0
  • ruamel-yaml 0.17.32
  • ruamel-yaml-clib 0.2.7
  • safetensors 0.3.2
  • scikit-learn 1.3.0
  • scipy 1.10.1
  • seqeval 1.2.2
  • setuptools 68.0.0
  • six 1.16.0
  • snowballstemmer 2.2.0
  • spans 1.1.1
  • sympy 1.12
  • threadpoolctl 3.2.0
  • tokenizers 0.13.3
  • toml 0.10.2
  • tomli 2.0.1
  • torch 2.0.0
  • tqdm 4.66.0
  • transformers 4.31.0
  • triton 2.0.0
  • typing-extensions 4.7.1
  • typing-inspect 0.9.0
  • urllib3 2.0.4
  • wheel 0.41.1
  • yamllint 1.32.0
pyproject.toml pypi
  • black >=21.10b0 develop
  • cffconvert ^2.0.0 develop
  • coverage >=5.3 develop
  • flake8 >=3.8.4 develop
  • isort >=5.9.3 develop
  • mock >=4.0.2 develop
  • pydocstyle >=5.1.1 develop
  • yamllint >=1.25.0 develop
  • dataclasses-json >=0.5.2
  • emoji >=2.0.0
  • emojis >=0.6.0
  • janome >=0.4.1
  • more_itertools >=8.6.0
  • numpy >=1.16.0
  • python >=3.8,<3.12
  • regex !=2022.7.24
  • requests ^2.27.1
  • seqeval >=1.2.2
  • spans >=1.1.0
  • toml >=0.10.2
  • torch >=1.3.0,!=2.0.1
  • tqdm *
  • transformers >=4.22.0