bunkai
Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 4 committers (25.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (9.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: megagonlabs
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://pypi.org/project/bunkai/
- Size: 1.18 MB
Statistics
- Stars: 193
- Watchers: 4
- Forks: 10
- Open Issues: 18
- Releases: 14
Metadata Files
README.md
Bunkai
Bunkai is a sentence boundary (SB) disambiguation tool for Japanese texts.
Quick Start
Install
```console
$ pip install -U bunkai
```
Disambiguation without Models
```console
$ echo -e '宿を予約しました♪!まだ2ヶ月も先だけど。早すぎかな(笑)楽しみです★\n2文書目の先頭行です。▁改行はU+2581で表現します。' \
    | bunkai
宿を予約しました♪!│まだ2ヶ月も先だけど。│早すぎかな(笑)│楽しみです★
2文書目の先頭行です。▁│改行はU+2581で表現します。
```
- Each input line is one document. Represent line breaks inside a document with ▁ (U+2581).
- The output marks sentence boundaries with │ (U+2502).
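The input and output conventions above can be handled with a few lines of standard-library Python. The following is a minimal sketch, not part of Bunkai itself: the helper names `to_bunkai_line` and `split_bunkai_output` are hypothetical, and the code only assumes the two marker characters described above.

```python
from typing import List

LINE_BREAK = "\u2581"  # ▁ stands for a line break inside a document
BOUNDARY = "\u2502"    # │ marks a sentence boundary in the output


def to_bunkai_line(document: str) -> str:
    """Collapse a multi-line document into a single input line.

    Hypothetical helper: every line terminator recognized by
    str.splitlines is replaced with U+2581; a trailing one is dropped.
    """
    return LINE_BREAK.join(document.splitlines())


def split_bunkai_output(line: str) -> List[str]:
    """Split one output line into sentences at each U+2502 marker."""
    return [s for s in line.rstrip("\n").split(BOUNDARY) if s]


if __name__ == "__main__":
    print(to_bunkai_line("一行目です。\n二行目です。"))
    for sentence in split_bunkai_output("早すぎかな(笑)" + BOUNDARY + "楽しみです★"):
        print(sentence)
```

Sentences split this way still contain ▁ wherever the original document had a line break.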
Disambiguation for Line Breaks with a Model
To also disambiguate sentence boundaries at line breaks, pass the --model option with the path to a model.
First, install the extras required by the --model option.
```console
$ pip install -U 'bunkai[lb]'
```
Second, set up a model. This takes some time.
```console
$ bunkai --model bunkai-model-directory --setup
```
Then run bunkai with the model directory specified.
```console
$ echo -e "文の途中で改行を▁入れる文章ってありますよね▁それも対象です。" | bunkai --model bunkai-model-directory
文の途中で改行を▁入れる文章ってありますよね▁│それも対象です。
```
Morphological Analysis Result
Pass the --ma option to print morphological analysis results.
It can be combined with the --model option.
```console
$ echo -e '形態素解析し▁ます。結果を 表示します!' | bunkai --ma --model bunkai-model-directory
形態素 名詞,一般,*,*,*,*,形態素,ケイタイソ,ケイタイソ
解析 名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
▁
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
。 記号,句点,*,*,*,*,。,。,。
EOS
結果 名詞,副詞可能,*,*,*,*,結果,ケッカ,ケッカ
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
記号,空白,*,*,*,*, ,*,*
表示 名詞,サ変接続,*,*,*,*,表示,ヒョウジ,ヒョージ
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ます 助動詞,*,*,*,特殊・マス,基本形,ます,マス,マス
! 記号,一般,*,*,*,*,!,!,!
EOS
```
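Output in this shape can be read back into per-sentence token lists. The sketch below is an illustration rather than anything Bunkai provides: `parse_ma_output` is a hypothetical helper, and it assumes that the surface form and the comma-separated features are separated by a tab (as in MeCab-style output), that each sentence ends with an `EOS` line, and that the ▁ line-break marker appears on a line of its own without features.

```python
from typing import List, Tuple

Token = Tuple[str, List[str]]  # (surface form, feature fields)


def parse_ma_output(text: str) -> List[List[Token]]:
    """Group morphological analysis lines into sentences.

    Assumes tab-separated "surface<TAB>features" lines and an "EOS"
    line after each sentence; both are assumptions of this sketch.
    """
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.split("\n"):
        if line == "":
            continue  # skip blank lines such as the trailing one
        if line == "EOS":
            sentences.append(current)
            current = []
            continue
        surface, sep, features = line.partition("\t")
        # Lines without a tab (for example the ▁ marker) carry no features.
        current.append((surface, features.split(",") if sep else []))
    if current:  # keep tokens not followed by a final EOS
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = "解析\t名詞,サ変接続,*,*,*,*,解析,カイセキ,カイセキ\n\u2581\nEOS\n"
    for tokens in parse_ma_output(sample):
        print([surface for surface, _ in tokens])
```

Keeping the features as a plain list avoids committing to field names, which depend on the dictionary used by the analyzer.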
Python Library
You can also use Bunkai as a Python library.
```python
from bunkai import Bunkai

bunkai = Bunkai()
for sentence in bunkai("はい。このようにpythonライブラリとしても使えます!"):
    print(sentence)
```
To disambiguate line breaks as well, pass the path of the model directory you set up.
```python
from pathlib import Path

from bunkai import Bunkai

bunkai = Bunkai(path_model=Path("bunkai-model-directory"))
for sentence in bunkai("そうなんです▁このように▁pythonライブラリとしても▁使えます!"):
    print(sentence)

"""
Output:
そうなんです▁
このように▁pythonライブラリとしても▁使えます!
"""
```
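As the output above shows, the returned sentences keep the ▁ marker. If downstream code needs real line breaks, the marker can be converted back afterwards. This is a minimal sketch under that assumption: `restore_line_breaks` is a hypothetical helper, and it works on any iterable of strings, so it runs without Bunkai installed. The sample sentences are copied from the output shown above.

```python
from typing import Iterable, Iterator

LINE_BREAK = "\u2581"  # ▁ stands for a line break inside a document


def restore_line_breaks(sentences: Iterable[str]) -> Iterator[str]:
    """Yield each sentence with U+2581 replaced by a newline."""
    for sentence in sentences:
        yield sentence.replace(LINE_BREAK, "\n")


if __name__ == "__main__":
    sentences = [
        "そうなんです\u2581",
        "このように\u2581pythonライブラリとしても\u2581使えます!",
    ]
    for restored in restore_line_breaks(sentences):
        print(repr(restored))
```

Because the function accepts any iterable, the result of calling a `Bunkai` instance could be passed to it directly.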
For more examples, see the examples directory.
Documents
References
- Yuta Hayashibe and Kensuke Mitsuzawa. Sentence Boundary Detection on Line Breaks in Japanese. Proceedings of The 6th Workshop on Noisy User-generated Text (W-NUT 2020), pp. 71-75. November 2020.
License
Apache License 2.0
Owner
- Name: Megagon Labs
- Login: megagonlabs
- Kind: organization
- Website: https://www.megagon.ai
- Repositories: 23
- Profile: https://github.com/megagonlabs
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this work in a project of yours and write about it, please cite our paper using the following citation data."
authors:
  - family-names: Hayashibe
    given-names: Yuta
  - family-names: Mitsuzawa
    given-names: Kensuke
title: "Bunkai"
url: https://github.com/megagonlabs/bunkai
preferred-citation:
  type: conference-paper
  title: Sentence Boundary Detection on Line Breaks in Japanese
  authors:
    - family-names: Hayashibe
      given-names: Yuta
    - family-names: Mitsuzawa
      given-names: Kensuke
  doi: 10.18653/v1/2020.wnut-1.10
  collection-title: Proceedings of The 6th Workshop on Noisy User-generated Text
  year: 2020
  month: 11
  publisher:
    name: Association for Computational Linguistics
  url: https://aclanthology.org/2020.wnut-1.10/
  start: 71
  end: 75
GitHub Events
Total
- Watch event: 8
Last Year
- Watch event: 8
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Yuta Hayashibe | y****a@h****p | 408 |
| dependabot[bot] | 4****] | 140 |
| r-terada | r****3@g****m | 5 |
| t-yamamura | t****a@p****p | 1 |
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 1
- Total pull requests: 194
- Average time to close issues: N/A
- Average time to close pull requests: 3 days
- Total issue authors: 1
- Total pull request authors: 3
- Average comments per issue: 3.0
- Average comments per pull request: 0.46
- Merged pull requests: 90
- Bot issues: 0
- Bot pull requests: 190
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- yosato (1)
Pull Request Authors
- dependabot[bot] (181)
- shirayu (3)
- mh-northlander (1)
Packages
- Total packages: 1
- Total downloads: 605 last month (PyPI)
- Total dependent packages: 1
- Total dependent repositories: 3
- Total versions: 17
- Total maintainers: 1
pypi.org: bunkai
Sentence boundary disambiguation tool for Japanese texts
- Homepage: https://github.com/megagonlabs/bunkai
- Documentation: https://bunkai.readthedocs.io/
- License: Apache-2.0
- Latest release: 1.5.7 (published about 3 years ago)
Dependencies
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/setup-node v3 composite
- actions/setup-python v4 composite
- snok/install-poetry v1.3.4 composite
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/autobuild v2 composite
- github/codeql-action/init v2 composite
- actions/checkout v3 composite
- crate-ci/typos v1.16.2 composite
- @isaacs/cliui 8.0.2 development
- @pkgjs/parseargs 0.11.0 development
- ansi-regex 6.0.1 development
- ansi-regex 5.0.1 development
- ansi-styles 6.2.1 development
- ansi-styles 4.3.0 development
- argparse 2.0.1 development
- balanced-match 1.0.2 development
- brace-expansion 2.0.1 development
- color-convert 2.0.1 development
- color-name 1.1.4 development
- commander 11.0.0 development
- cross-spawn 7.0.3 development
- deep-extend 0.6.0 development
- eastasianwidth 0.2.0 development
- emoji-regex 9.2.2 development
- emoji-regex 8.0.0 development
- entities 3.0.1 development
- foreground-child 3.1.1 development
- fsevents 2.3.2 development
- get-stdin 9.0.0 development
- glob 10.2.7 development
- ignore 5.2.4 development
- ini 3.0.1 development
- is-fullwidth-code-point 3.0.0 development
- isexe 2.0.0 development
- jackspeak 2.2.2 development
- js-yaml 4.1.0 development
- jsonc-parser 3.2.0 development
- linkify-it 4.0.1 development
- lru-cache 10.0.0 development
- markdown-it 13.0.1 development
- markdownlint 0.29.0 development
- markdownlint-cli 0.35.0 development
- markdownlint-micromark 0.1.5 development
- mdurl 1.0.1 development
- minimatch 9.0.3 development
- minimist 1.2.8 development
- minipass 6.0.2 development
- path-key 3.1.1 development
- path-scurry 1.10.1 development
- pyright 1.1.321 development
- run-con 1.2.12 development
- shebang-command 2.0.0 development
- shebang-regex 3.0.0 development
- signal-exit 4.1.0 development
- string-width 5.1.2 development
- string-width 4.2.3 development
- string-width-cjs 4.2.3 development
- strip-ansi 6.0.1 development
- strip-ansi 7.1.0 development
- strip-ansi-cjs 6.0.1 development
- strip-json-comments 3.1.1 development
- uc.micro 1.0.6 development
- which 2.0.2 development
- wrap-ansi 8.1.0 development
- wrap-ansi-cjs 7.0.0 development
- markdown-it >=13.0.0 development
- markdownlint-cli ^0.35.0 development
- pyright <1.1.322 development
- attrs 23.1.0
- black 23.7.0
- certifi 2023.7.22
- cffconvert 2.0.0
- charset-normalizer 3.2.0
- click 8.1.6
- cmake 3.27.1
- colorama 0.4.6
- coverage 7.2.7
- dataclasses-json 0.5.14
- docopt 0.6.2
- emoji 2.7.0
- emojis 0.7.0
- filelock 3.12.2
- flake8 5.0.4
- fsspec 2023.6.0
- huggingface-hub 0.16.4
- idna 3.4
- isort 5.12.0
- janome 0.5.0
- jinja2 3.1.2
- joblib 1.3.1
- jsonschema 3.2.0
- lit 16.0.6
- markupsafe 2.1.3
- marshmallow 3.20.1
- mccabe 0.7.0
- mock 5.1.0
- more-itertools 10.1.0
- mpmath 1.3.0
- mypy-extensions 1.0.0
- networkx 3.1
- numpy 1.24.4
- nvidia-cublas-cu11 11.10.3.66
- nvidia-cuda-cupti-cu11 11.7.101
- nvidia-cuda-nvrtc-cu11 11.7.99
- nvidia-cuda-runtime-cu11 11.7.99
- nvidia-cudnn-cu11 8.5.0.96
- nvidia-cufft-cu11 10.9.0.58
- nvidia-curand-cu11 10.2.10.91
- nvidia-cusolver-cu11 11.4.0.1
- nvidia-cusparse-cu11 11.7.4.91
- nvidia-nccl-cu11 2.14.3
- nvidia-nvtx-cu11 11.7.91
- packaging 23.1
- pathspec 0.11.2
- platformdirs 3.10.0
- pycodestyle 2.9.1
- pydocstyle 6.3.0
- pyflakes 2.5.0
- pykwalify 1.8.0
- pyrsistent 0.19.3
- python-dateutil 2.8.2
- pyyaml 6.0.1
- regex 2023.8.8
- requests 2.31.0
- ruamel-yaml 0.17.32
- ruamel-yaml-clib 0.2.7
- safetensors 0.3.2
- scikit-learn 1.3.0
- scipy 1.10.1
- seqeval 1.2.2
- setuptools 68.0.0
- six 1.16.0
- snowballstemmer 2.2.0
- spans 1.1.1
- sympy 1.12
- threadpoolctl 3.2.0
- tokenizers 0.13.3
- toml 0.10.2
- tomli 2.0.1
- torch 2.0.0
- tqdm 4.66.0
- transformers 4.31.0
- triton 2.0.0
- typing-extensions 4.7.1
- typing-inspect 0.9.0
- urllib3 2.0.4
- wheel 0.41.1
- yamllint 1.32.0
- black >=21.10b0 develop
- cffconvert ^2.0.0 develop
- coverage >=5.3 develop
- flake8 >=3.8.4 develop
- isort >=5.9.3 develop
- mock >=4.0.2 develop
- pydocstyle >=5.1.1 develop
- yamllint >=1.25.0 develop
- dataclasses-json >=0.5.2
- emoji >=2.0.0
- emojis >=0.6.0
- janome >=0.4.1
- more_itertools >=8.6.0
- numpy >=1.16.0
- python >=3.8,<3.12
- regex !=2022.7.24
- requests ^2.27.1
- seqeval >=1.2.2
- spans >=1.1.0
- toml >=0.10.2
- torch >=1.3.0,!=2.0.1
- tqdm *
- transformers >=4.22.0