Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Repository
Code for COLING 2020 Paper
Basic Info
Statistics
- Stars: 13
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Cohesion: A Japanese cohesion analyzer
Description
This project provides a system to perform the following analyses in a multi-task manner.
- Verbal predicate-argument structure analysis
- Nominal predicate-argument structure analysis
- Bridging reference resolution
- Coreference resolution
The process is as follows.
- Apply Juman++ and KNP to an input text and split the text into base phrases.
- Extract target base phrases to analyze by referring to the features added by KNP, such as <用言> and <体言>.
- For each target base phrase, select its arguments (or antecedents).

For more information, please refer to the original paper.
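As a rough illustration of the second step: KNP emits its features as angle-bracketed tags on each analysis line, so target base phrases can be identified by scanning for tags such as <用言> and <体言>. The sketch below uses only the standard library on a shortened, hypothetical tag line; the project's actual extraction logic may differ.

```python
import re

# A shortened, hypothetical KNP tag line carrying features (illustration only).
knp_tag_line = "+ 3D <文頭><人名><体言><係:未格><正規化代表表記:太郎/たろう>"

def extract_features(line: str) -> list[str]:
    """Return the names of the <...> features found on a KNP tag line."""
    return re.findall(r"<([^<>]+)>", line)

features = extract_features(knp_tag_line)
# A base phrase is a candidate analysis target if it is a predicate (用言)
# or a nominal (体言).
is_target = "用言" in features or "体言" in features
print(features)   # ['文頭', '人名', '体言', '係:未格', '正規化代表表記:太郎/たろう']
print(is_target)  # True
```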
Demo
https://lotus.kuee.kyoto-u.ac.jp/cohesion-analysis/pub/

Requirements
- Python 3.9+
- Dependencies: See pyproject.toml.
- Juman++ 2.0.0-rc4 (optional)
- KNP 5.0 (optional)
- KWJA 2.3.0 (optional)
Getting started
Create a virtual environment and install dependencies.
```shell
$ poetry env use /path/to/python
$ poetry install
```

Log in to wandb (optional).

```shell
$ wandb login
```
Quick Start
- Install Juman++/KNP or KWJA.
- Download pre-trained models.

```shell
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_base.bin   # trained checkpoint (base)
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_large.bin  # trained checkpoint (large)
$ ls model_*.bin
model_base.bin model_large.bin
```
- Run prediction.

```shell
$ poetry run python src/predict.py checkpoint=model_large.bin input_file=<(echo "太郎はパンを買って食べた。") [devices=1] > analyzed.knp; rhoknp show -r analyzed.knp
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/01/01 SCORE:59.00000
太郎は─────┐
  パンを─┐ │
    買って─┤ ガ:太郎 ヲ:パン
      食べた。 ガ:太郎 ヲ:パン
```
The output of predict.py is in the KNP format, which looks like the following:
```
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/05/05 SCORE:59.00000
* 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
+ 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0 "代表表記:太郎/たろう 人名:日本:名:45:0.00106" <代表表記:太郎/たろう><人名:日本:名:45:0.00106><正規化代表表記:太郎/たろう><漢字><かな漢字><名詞相当語><文頭><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は><正規化代表表記:は/は><かな漢字><ひらがな><付属>
* 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
+ 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
パン ぱん パン 名詞 6 普通名詞 1 * 0 * 0 "代表表記:パン/ぱん ドメイン:料理・食事 カテゴリ:人工物-食べ物" <代表表記:パン/ぱん><ドメイン:料理・食事><カテゴリ:人工物-食べ物><正規化代表表記:パン/ぱん><記英数カ><カタカナ><名詞相当語><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
を を を 助詞 9 格助詞 1 * 0 * 0 "代表表記:を/を" <代表表記:を/を><正規化代表表記:を/を><かな漢字><ひらがな><付属>
...
```
You can read a KNP format file with rhoknp.
```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```
For more details about KNP format, see rhoknp documentation.
Building a dataset
```shell
$ OUT_DIR=data/dataset [JOBS=4] ./scripts/build_dataset.sh
$ ls data/dataset
fuman/ kwdlc/ wac/
```
Create a .env file and set DATA_DIR.

```shell
echo 'DATA_DIR="data/dataset"' >> .env
```
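The scripts read DATA_DIR from the environment (populated from .env). A minimal sketch of how such a lookup typically works; `resolve_data_dir` and the fallback default are hypothetical names for illustration, not the project's actual code:

```python
import os
from pathlib import Path

def resolve_data_dir(default: str = "data/dataset") -> Path:
    """Return the dataset directory from DATA_DIR, falling back to a default."""
    return Path(os.environ.get("DATA_DIR", default))

# Simulate what loading .env would do.
os.environ["DATA_DIR"] = "data/dataset"
print(resolve_data_dir())  # data/dataset
```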
Training
```shell
poetry run python src/train.py -cn default datamodule=all_wo_kc devices=[0,1] max_batches_per_device=4
```
Here are commonly used options:
- `-cn`: Config name (default: `default`).
- `devices`: GPUs to use (default: `0`).
- `max_batches_per_device`: Maximum number of batches to process per device (default: `4`).
- `compile`: JIT-compile the model with torch.compile for faster training (default: `false`).
- `model_name_or_path`: Path to a pre-trained model or model identifier from the Huggingface Hub (default: `ku-nlp/deberta-v2-large-japanese`).
For more options, see YAML config files under configs.
Testing
```shell
poetry run python src/test.py checkpoint=/path/to/trained/checkpoint eval_set=valid devices=[0,1]
```
Debugging
```shell
poetry run python src/train.py -cn debug
```
If you are on a machine with MPS devices (e.g. Apple M1), specify trainer=cpu.debug to use CPU.
```shell
poetry run python scripts/train.py -cn debug trainer=cpu.debug
```
If you are on a machine with GPUs, you can specify the GPUs to use with the devices option.
```shell
poetry run python scripts/train.py -cn debug devices=[0]
```
Environment Variables
- COHESION_CACHE_DIR: A directory where processed documents are cached. Default: /tmp/$USER/cohesion_cache.
- COHESION_OVERWRITE_CACHE: If set, the data loader does not load the cache even if it exists.
- COHESION_DISABLE_CACHE: If set, the data loader neither loads nor saves the cache.
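A minimal sketch of how these three variables could interact in a data loader, using only the standard library. The helper names (`cache_dir`, `should_load_cache`, `should_save_cache`) are hypothetical; the project's actual caching logic may differ:

```python
import os
from pathlib import Path

def cache_dir() -> Path:
    """Resolve the cache directory, honoring COHESION_CACHE_DIR."""
    user = os.environ.get("USER", "user")
    default = Path("/tmp") / user / "cohesion_cache"
    return Path(os.environ.get("COHESION_CACHE_DIR", str(default)))

def should_load_cache() -> bool:
    """The cache is not loaded if either OVERWRITE or DISABLE is set."""
    return ("COHESION_OVERWRITE_CACHE" not in os.environ
            and "COHESION_DISABLE_CACHE" not in os.environ)

def should_save_cache() -> bool:
    """The cache is still written under OVERWRITE, but not under DISABLE."""
    return "COHESION_DISABLE_CACHE" not in os.environ

os.environ["COHESION_CACHE_DIR"] = "/tmp/demo/cohesion_cache"
print(cache_dir())  # /tmp/demo/cohesion_cache
```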
Dataset
- Kyoto University Web Document Leads Corpus (KWDLC)
- Kyoto University Text Corpus (KyotoCorpus)
- Annotated FKC Corpus
- Wikipedia Annotated Corpus
Reference
- BERT-based Cohesion Analysis of Japanese Texts [Ueda et al., COLING, 2020]
- BERTに基づく統合的日本語結束性解析 (BERT-based integrated Japanese cohesion analysis) [植田 (Ueda), master's thesis, 2021]
Author
Nobuhiro Ueda
Owner
- Name: Nobuhiro Ueda
- Login: nobu-g
- Kind: user
- Location: Kyoto, Japan
- Company: Kyoto University
- Website: https://nobu-g.github.io/
- Repositories: 6
- Profile: https://github.com/nobu-g
A Ph.D. student at Kyoto University.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "cohesion-analysis: A BERT based Japanese cohesion analyzer"
authors:
- family-names: Ueda
given-names: Nobuhiro
version: 2.0.0
license: MIT
repository-code: "https://github.com/nobu-g/cohesion-analysis"
GitHub Events
Total
- Watch event: 1
- Push event: 18
- Pull request event: 12
Last Year
- Watch event: 1
- Push event: 18
- Pull request event: 12
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: about 12 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 6
Past Year
- Issues: 0
- Pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: about 12 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 6
Top Authors
Issue Authors
Pull Request Authors
- pre-commit-ci[bot] (13)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 107 dependencies
- ipdb ^0.13.9 develop
- pytest ^8.2 develop
- cohesion-tools ^0.7.1
- dataclasses-json ^0.6.1
- hydra-core ^1.3
- jaconv ^0.3.4
- lightning ~2.2.2
- omegaconf ^2.3
- pandas ^2.0
- python ^3.9
- rhoknp ~1.7.0
- rich ^13.3
- tokenizers ^0.19.1
- torch >=2.1.1
- torchmetrics ^1.1
- transformers ~4.40.0
- typing-extensions >=4.4
- wandb ^0.16.0
- fastapi >=0.95.1 server
- pyhumps ^3.8 server
- uvicorn >=0.22.0 server