Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.1%) to scientific vocabulary
Repository
Code for COLING 2020 Paper
Basic Info
Statistics
- Stars: 13
- Watchers: 2
- Forks: 1
- Open Issues: 1
- Releases: 0
Metadata Files
README.md
Cohesion: A Japanese cohesion analyzer
Description
This project provides a system to perform the following analyses in a multi-task manner.
- Verbal predicate-argument structure analysis
- Nominal predicate-argument structure analysis
- Bridging reference resolution
- Coreference resolution
The process is as follows.
- Apply Juman++ and KNP to an input text and split the text into base phrases.
- Extract target base phrases to analyze by referring to the features added by KNP, such as <用言> and <体言>.
- For each target base phrase, select its arguments (or antecedents).

For more information, please refer to the original paper.
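As a rough illustration of the second step: KNP emits its features as angle-bracketed tags on each analysis line, so target base phrases can be identified by scanning for tags such as <用言> and <体言>. The sketch below uses only the standard library on a shortened, hypothetical tag line; the project's actual extraction logic may differ.

```python
import re

# A shortened, hypothetical KNP tag line carrying features (illustration only).
knp_tag_line = "+ 3D <文頭><人名><体言><係:未格><正規化代表表記:太郎/たろう>"

def extract_features(line: str) -> list[str]:
    """Return the names of the <...> features found on a KNP tag line."""
    return re.findall(r"<([^<>]+)>", line)

features = extract_features(knp_tag_line)
# A base phrase is a candidate analysis target if it is a predicate (用言)
# or a nominal (体言).
is_target = "用言" in features or "体言" in features
print(features)   # ['文頭', '人名', '体言', '係:未格', '正規化代表表記:太郎/たろう']
print(is_target)  # True
```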
Demo
https://lotus.kuee.kyoto-u.ac.jp/cohesion-analysis/pub/

Requirements
- Python 3.9+
- Dependencies: See pyproject.toml.
- Juman++ 2.0.0-rc4 (optional)
- KNP 5.0 (optional)
- KWJA 2.3.0 (optional)
Getting started
Create a virtual environment and install dependencies.
```shell
$ poetry env use /path/to/python
$ poetry install
```

Log in to wandb (optional).

```shell
$ wandb login
```
Quick Start
- Install Juman++/KNP or KWJA.
- Download pre-trained models.

```shell
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_base.bin   # trained checkpoint (base)
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_large.bin  # trained checkpoint (large)
$ ls model_*.bin
model_base.bin model_large.bin
```
- Run prediction.

```shell
$ poetry run python src/predict.py checkpoint=model_large.bin input_file=<(echo "太郎はパンを買って食べた。") [devices=1] > analyzed.knp; rhoknp show -r analyzed.knp
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/01/01 SCORE:59.00000
太郎は─────┐
  パンを─┐ │
    買って─┤ ガ:太郎 ヲ:パン
      食べた。 ガ:太郎 ヲ:パン
```
The output of predict.py is in the KNP format, which looks like the following:
```
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/05/05 SCORE:59.00000
* 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
+ 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0 "代表表記:太郎/たろう 人名:日本:名:45:0.00106" <代表表記:太郎/たろう><人名:日本:名:45:0.00106><正規化代表表記:太郎/たろう><漢字><かな漢字><名詞相当語><文頭><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は><正規化代表表記:は/は><かな漢字><ひらがな><付属>
* 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
+ 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
パン ぱん パン 名詞 6 普通名詞 1 * 0 * 0 "代表表記:パン/ぱん ドメイン:料理・食事 カテゴリ:人工物-食べ物" <代表表記:パン/ぱん><ドメイン:料理・食事><カテゴリ:人工物-食べ物><正規化代表表記:パン/ぱん><記英数カ><カタカナ><名詞相当語><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
を を を 助詞 9 格助詞 1 * 0 * 0 "代表表記:を/を" <代表表記:を/を><正規化代表表記:を/を><かな漢字><ひらがな><付属>
...
```
You can read a KNP format file with rhoknp.
```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```
For more details about KNP format, see rhoknp documentation.
Building a dataset
```shell
$ OUT_DIR=data/dataset [JOBS=4] ./scripts/build_dataset.sh
$ ls data/dataset
fuman/ kwdlc/ wac/
```
Create a .env file and set DATA_DIR.

```shell
echo 'DATA_DIR="data/dataset"' >> .env
```
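The scripts read DATA_DIR from the environment (populated from .env). A minimal sketch of how such a lookup typically works; `resolve_data_dir` and the fallback default are hypothetical names for illustration, not the project's actual code:

```python
import os
from pathlib import Path

def resolve_data_dir(default: str = "data/dataset") -> Path:
    """Return the dataset directory from DATA_DIR, falling back to a default."""
    return Path(os.environ.get("DATA_DIR", default))

# Simulate what loading .env would do.
os.environ["DATA_DIR"] = "data/dataset"
print(resolve_data_dir())  # data/dataset
```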
Training
```shell
poetry run python src/train.py -cn default datamodule=all_wo_kc devices=[0,1] max_batches_per_device=4
```
Here are commonly used options:
- `-cn`: Config name (default: `default`).
- `devices`: GPUs to use (default: `0`).
- `max_batches_per_device`: Maximum number of batches to process per device (default: `4`).
- `compile`: JIT-compile the model with torch.compile for faster training (default: `false`).
- `model_name_or_path`: Path to a pre-trained model or model identifier from the Huggingface Hub (default: `ku-nlp/deberta-v2-large-japanese`).
For more options, see YAML config files under configs.
Testing
```shell
poetry run python src/test.py checkpoint=/path/to/trained/checkpoint eval_set=valid devices=[0,1]
```
Debugging
```shell
poetry run python src/train.py -cn debug
```
If you are on a machine with MPS devices (e.g. Apple M1), specify trainer=cpu.debug to use CPU.
```shell
poetry run python scripts/train.py -cn debug trainer=cpu.debug
```
If you are on a machine with GPUs, you can specify the GPUs to use with the devices option.
```shell
poetry run python scripts/train.py -cn debug devices=[0]
```
Environment Variables
- COHESION_CACHE_DIR: A directory where processed documents are cached. Default: /tmp/$USER/cohesion_cache.
- COHESION_OVERWRITE_CACHE: If set, the data loader does not load the cache even if it exists.
- COHESION_DISABLE_CACHE: If set, the data loader neither loads nor saves the cache.
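A minimal sketch of how these three variables could interact in a data loader, using only the standard library. The helper names (`cache_dir`, `should_load_cache`, `should_save_cache`) are hypothetical; the project's actual caching logic may differ:

```python
import os
from pathlib import Path

def cache_dir() -> Path:
    """Resolve the cache directory, honoring COHESION_CACHE_DIR."""
    user = os.environ.get("USER", "user")
    default = Path("/tmp") / user / "cohesion_cache"
    return Path(os.environ.get("COHESION_CACHE_DIR", str(default)))

def should_load_cache() -> bool:
    """The cache is not loaded if either OVERWRITE or DISABLE is set."""
    return ("COHESION_OVERWRITE_CACHE" not in os.environ
            and "COHESION_DISABLE_CACHE" not in os.environ)

def should_save_cache() -> bool:
    """The cache is still written under OVERWRITE, but not under DISABLE."""
    return "COHESION_DISABLE_CACHE" not in os.environ

os.environ["COHESION_CACHE_DIR"] = "/tmp/demo/cohesion_cache"
print(cache_dir())  # /tmp/demo/cohesion_cache
```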
Dataset
- Kyoto University Web Document Leads Corpus (KWDLC)
- Kyoto University Text Corpus (KyotoCorpus)
- Annotated FKC Corpus
- Wikipedia Annotated Corpus
Reference
- BERT-based Cohesion Analysis of Japanese Texts [Ueda et al., COLING, 2020]
- BERTに基づく統合的日本語結束性解析 (BERT-based integrated Japanese cohesion analysis) [植田 (Ueda), master's thesis, 2021]
Author
Nobuhiro Ueda
Owner
- Name: Nobuhiro Ueda
- Login: nobu-g
- Kind: user
- Location: Kyoto, Japan
- Company: Kyoto University
- Website: https://nobu-g.github.io/
- Repositories: 6
- Profile: https://github.com/nobu-g
A Ph.D. student at Kyoto University.
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "cohesion-analysis: A BERT based Japanese cohesion analyzer"
authors:
- family-names: Ueda
given-names: Nobuhiro
version: 2.0.0
license: MIT
repository-code: "https://github.com/nobu-g/cohesion-analysis"
GitHub Events
Total
- Watch event: 1
- Push event: 18
- Pull request event: 12
Last Year
- Watch event: 1
- Push event: 18
- Pull request event: 12
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 0
- Total pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: about 12 hours
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 6
Past Year
- Issues: 0
- Pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: about 12 hours
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 5
- Bot issues: 0
- Bot pull requests: 6
Top Authors
Issue Authors
Pull Request Authors
- pre-commit-ci[bot] (13)
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- 107 dependencies
- ipdb ^0.13.9 develop
- pytest ^8.2 develop
- cohesion-tools ^0.7.1
- dataclasses-json ^0.6.1
- hydra-core ^1.3
- jaconv ^0.3.4
- lightning ~2.2.2
- omegaconf ^2.3
- pandas ^2.0
- python ^3.9
- rhoknp ~1.7.0
- rich ^13.3
- tokenizers ^0.19.1
- torch >=2.1.1
- torchmetrics ^1.1
- transformers ~4.40.0
- typing-extensions >=4.4
- wandb ^0.16.0
- fastapi >=0.95.1 server
- pyhumps ^3.8 server
- uvicorn >=0.22.0 server