cohesion-analysis

Code for COLING 2020 Paper

https://github.com/nobu-g/cohesion-analysis

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: nobu-g
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 525 KB
Statistics
  • Stars: 13
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created over 5 years ago · Last pushed 6 months ago
Metadata Files
Readme License Citation

README.md

Cohesion: A Japanese cohesion analyzer

Description

This project provides a system to perform the following analyses in a multi-task manner.

  • Verbal predicate-argument structure analysis
  • Nominal predicate-argument structure analysis
  • Bridging reference resolution
  • Coreference resolution

The process is as follows.

  1. Apply Juman++ and KNP to an input text and split the text into base phrases.
  2. Extract target base phrases to analyze by referring to the features added by KNP such as <用言> and <体言>.
  3. For each target base phrase, select its arguments (or antecedents).
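Step 2 above can be illustrated with a minimal, standard-library-only sketch. It assumes that base-phrase lines in KNP output start with `+ ` and carry their features as `<...>` tags, and it treats a base phrase as a target when it has a `用言` (predicate) or `体言` (nominal) feature. The function name `extract_targets` and the trimmed sample input are hypothetical; the project itself may select targets differently (for example, through rhoknp).

```python
import re

# Matches one KNP feature tag such as <体言> or <用言:動>.
FEATURE_RE = re.compile(r"<([^<>]+)>")

# Hand-written, heavily trimmed KNP-style input (not real analyzer output).
SAMPLE_KNP = """# S-ID:0-1
* 1D <体言>
+ 1D <体言>
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0 "代表表記:太郎/たろう"
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は"
* -1D <用言:動>
+ -1D <用言:動>
食べた たべた 食べる 動詞 2 * 0 母音動詞 1 タ形 10 "代表表記:食べる/たべる"
EOS"""


def extract_targets(knp_text):
    """Return (surface, kind) pairs for base phrases tagged 用言 or 体言."""
    base_phrases = []
    current = None  # base phrase whose morpheme lines are being read
    for line in knp_text.splitlines():
        if line.startswith("+ "):
            # Feature keys are the part before the first ":" (用言:動 -> 用言).
            keys = {f.split(":", 1)[0] for f in FEATURE_RE.findall(line)}
            current = {"surface": "", "keys": keys}
            base_phrases.append(current)
        elif line.startswith(("#", "* ")) or line == "EOS":
            current = None  # comment, phrase header, or sentence end
        elif current is not None:
            # Morpheme line: the first space-separated field is the surface form.
            current["surface"] += line.split(" ", 1)[0]
    return [
        (bp["surface"], kind)
        for bp in base_phrases
        for kind in ("用言", "体言")
        if kind in bp["keys"]
    ]


for surface, kind in extract_targets(SAMPLE_KNP):
    print(surface, kind)
```

Each returned pair is a candidate for step 3, where the model selects arguments or antecedents for the target.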

For more information, please refer to the original paper.

Demo

https://lotus.kuee.kyoto-u.ac.jp/cohesion-analysis/pub/

Requirements

Getting started

  • Create a virtual environment and install dependencies.

    ```shell
    $ poetry env use /path/to/python
    $ poetry install
    ```

  • Log in to wandb (optional).

    ```shell
    $ wandb login
    ```

Quick Start

  • Install Juman++/KNP or KWJA.

    • Juman++/KNP

      ```shell
      docker pull kunlp/jumanpp-knp:latest
      echo 'docker run -i --rm --platform linux/amd64 kunlp/jumanpp-knp jumanpp' > /somewhere/in/your/path/jumanpp
      echo 'docker run -i --rm --platform linux/amd64 kunlp/jumanpp-knp knp' > /somewhere/in/your/path/knp
      ```

    • KWJA

      ```shell
      pipx install kwja
      ```

  • Download pre-trained models.

```shell
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_base.bin  # trained checkpoint (base)
$ wget https://lotus.kuee.kyoto-u.ac.jp/~ueda/dist/cohesion_analysis_v2/model_large.bin  # trained checkpoint (large)
$ ls model_*.bin
model_base.bin  model_large.bin
```

  • Run prediction.

```shell
$ poetry run python src/predict.py checkpoint=model_large.bin input_file=<(echo "太郎はパンを買って食べた。") [devices=1] > analyzed.knp; rhoknp show -r analyzed.knp
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/01/01 SCORE:59.00000
太郎は─────┐
パンを─┐ │
買って─┤ ガ:太郎 ヲ:パン
食べた。 ガ:太郎 ヲ:パン
```

The output of predict.py is in the KNP format, which looks like the following:

```
# S-ID:0-1 KNP:5.0-25425d33 DATE:2024/05/05 SCORE:59.00000
* 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
+ 3D <文頭><人名><ハ><助詞><体言><係:未格><提題><区切:3-5><主題表現><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:太郎/たろう><主辞代表表記:太郎/たろう>
太郎 たろう 太郎 名詞 6 人名 5 * 0 * 0 "代表表記:太郎/たろう 人名:日本:名:45:0.00106" <代表表記:太郎/たろう><人名:日本:名:45:0.00106><正規化代表表記:太郎/たろう><漢字><かな漢字><名詞相当語><文頭><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は><正規化代表表記:は/は><かな漢字><ひらがな><付属>
* 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
+ 2D <BGH:パン/ぱん><ヲ><助詞><体言><係:ヲ格><区切:0-0><格要素><連用要素><名詞項候補><先行詞候補><正規化代表表記:パン/ぱん><主辞代表表記:パン/ぱん>
パン ぱん パン 名詞 6 普通名詞 1 * 0 * 0 "代表表記:パン/ぱん ドメイン:料理・食事 カテゴリ:人工物-食べ物" <代表表記:パン/ぱん><ドメイン:料理・食事><カテゴリ:人工物-食べ物><正規化代表表記:パン/ぱん><記英数カ><カタカナ><名詞相当語><自立><内容語><タグ単位始><文節始><固有キー><文節主辞>
を を を 助詞 9 格助詞 1 * 0 * 0 "代表表記:を/を" <代表表記:を/を><正規化代表表記:を/を><かな漢字><ひらがな><付属>
...
```

You can read a KNP format file with rhoknp.

```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```

For more details about the KNP format, see the rhoknp documentation.
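To make the layout of a morpheme line in the KNP output more concrete, here is a small standard-library sketch that splits one such line into its space-separated fields, the quoted semantic information, and the trailing `<...>` feature tags. The field names in `FIELD_NAMES` and the function `parse_morpheme_line` are labels chosen for this illustration, not part of rhoknp or this project; for real use, reading the file with rhoknp as described above is the more robust route.

```python
import re

# Illustrative names for the 11 leading space-separated fields of a morpheme line.
FIELD_NAMES = (
    "surface", "reading", "lemma",
    "pos", "pos_id", "subpos", "subpos_id",
    "conj_type", "conj_type_id", "conj_form", "conj_form_id",
)


def parse_morpheme_line(line):
    """Split one KNP morpheme line into named fields, semantics, and features."""
    parts = line.split(" ", len(FIELD_NAMES))
    if len(parts) < len(FIELD_NAMES):
        raise ValueError(f"not a morpheme line: {line!r}")
    morpheme = dict(zip(FIELD_NAMES, parts))
    # Everything after the 11th field: "semantic info" followed by <feature> tags.
    rest = parts[len(FIELD_NAMES)] if len(parts) > len(FIELD_NAMES) else ""
    quoted = re.match(r'"([^"]*)"', rest)
    morpheme["semantics"] = quoted.group(1) if quoted else ""
    morpheme["features"] = re.findall(r"<([^<>]+)>", rest)
    return morpheme


line = (
    'は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" '
    "<代表表記:は/は><正規化代表表記:は/は><かな漢字><ひらがな><付属>"
)
parsed = parse_morpheme_line(line)
print(parsed["surface"], parsed["pos"], parsed["subpos"], parsed["features"])
```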

Building a dataset

```shell
$ OUT_DIR=data/dataset [JOBS=4] ./scripts/build_dataset.sh
$ ls data/dataset
fuman/  kwdlc/  wac/
```

Create a .env file and set DATA_DIR.

```shell
echo 'DATA_DIR="data/dataset"' >> .env
```
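Before training, it can help to confirm that `DATA_DIR` points at a directory containing the three corpora listed above. The following is a minimal sanity-check sketch using only the standard library. The helper names `read_env_file` and `missing_corpora` are hypothetical, and the simple `KEY=value` parsing is an assumption for illustration; the README does not say how the project itself loads `.env`.

```python
from pathlib import Path

# Sub-directories that the build script is shown to produce.
EXPECTED_CORPORA = ("fuman", "kwdlc", "wac")


def read_env_file(path):
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding single or double quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_corpora(data_dir):
    """Return the expected corpus directories that do not exist under data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_CORPORA if not (root / name).is_dir()]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.exists():
        data_dir = read_env_file(env_path).get("DATA_DIR", "")
        print("DATA_DIR:", data_dir, "missing:", missing_corpora(data_dir))
```

An empty `missing` list suggests the dataset was built where the `.env` file says it is.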

Training

```shell
poetry run python src/train.py -cn default datamodule=all_wo_kc devices=[0,1] max_batches_per_device=4
```

Here are commonly used options:

  • -cn: Config name (default: default).
  • devices: GPUs to use (default: 0).
  • max_batches_per_device: Maximum number of batches to process per device (default: 4).
  • compile: JIT-compile the model with torch.compile for faster training (default: false).
  • model_name_or_path: Path to a pre-trained model or model identifier from the Huggingface Hub (default: ku-nlp/deberta-v2-large-japanese).

For more options, see YAML config files under configs.
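The `key=value` arguments in the training command look like Hydra-style overrides (hydra-core is among the listed dependencies). As a rough illustration of how such arguments map onto configuration values, the toy parser below turns them into a Python dictionary. It is not Hydra's actual override grammar: `parse_overrides` is a hypothetical helper, and flags such as `-cn` are deliberately not handled.

```python
import ast


def _parse_value(raw):
    """Interpret a raw override value as bool, Python literal, or plain string."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        # Handles ints, floats, and lists such as [0,1].
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Bare words and paths (e.g. all_wo_kc) stay as strings.
        return raw


def parse_overrides(args):
    """Convert ['key=value', ...] into a dict of parsed values."""
    config = {}
    for arg in args:
        key, sep, raw = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        config[key] = _parse_value(raw)
    return config


overrides = parse_overrides(
    ["datamodule=all_wo_kc", "devices=[0,1]", "max_batches_per_device=4", "compile=false"]
)
print(overrides)
```

Seen this way, `devices=[0,1]` selects two GPUs as a list, while `max_batches_per_device=4` is a plain integer.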

Testing

```shell
poetry run python src/test.py checkpoint=/path/to/trained/checkpoint eval_set=valid devices=[0,1]
```

Debugging

```shell
poetry run python src/train.py -cn debug
```

If you are on a machine with MPS devices (e.g. Apple M1), specify trainer=cpu.debug to use CPU.

```shell
poetry run python src/train.py -cn debug trainer=cpu.debug
```

If you are on a machine with GPUs, you can specify the GPUs to use with the devices option.

```shell
poetry run python src/train.py -cn debug devices=[0]
```

Environment Variables

  • COHESION_CACHE_DIR: A directory where processed documents are cached. Default value is /tmp/$USER/cohesion_cache.
  • COHESION_OVERWRITE_CACHE: If set, the data loader does not load cache even if it exists.
  • COHESION_DISABLE_CACHE: If set, the data loader does not load or save cache.
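The interaction of the three variables can be summarized in a short sketch. It assumes that "if set" means the variable is present in the environment regardless of its value, and that a missing `USER` falls back to a placeholder; both are assumptions, and `resolve_cache_settings` is a hypothetical helper rather than the project's own code.

```python
import os
from pathlib import Path


def resolve_cache_settings(env=None):
    """Derive cache directory and load/save behavior from environment variables."""
    env = os.environ if env is None else env
    # Documented default: /tmp/$USER/cohesion_cache
    default_dir = Path("/tmp") / env.get("USER", "unknown") / "cohesion_cache"
    cache_dir = Path(env.get("COHESION_CACHE_DIR", default_dir))
    disabled = "COHESION_DISABLE_CACHE" in env    # neither load nor save
    overwrite = "COHESION_OVERWRITE_CACHE" in env  # skip loading, still save
    return {
        "cache_dir": cache_dir,
        "load": not disabled and not overwrite,
        "save": not disabled,
    }


print(resolve_cache_settings({"USER": "alice"}))
print(resolve_cache_settings({"USER": "alice", "COHESION_OVERWRITE_CACHE": "1"}))
print(resolve_cache_settings({"USER": "alice", "COHESION_DISABLE_CACHE": "1"}))
```

Under these assumptions, `COHESION_OVERWRITE_CACHE` forces reprocessing while still refreshing the cache, and `COHESION_DISABLE_CACHE` bypasses the cache entirely.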

Dataset

Reference

Author

Nobuhiro Ueda

Owner

  • Name: Nobuhiro Ueda
  • Login: nobu-g
  • Kind: user
  • Location: Kyoto, Japan
  • Company: Kyoto University

A Ph.D. student at Kyoto University.

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it using these metadata."
title: "cohesion-analysis: A BERT based Japanese cohesion analyzer"
authors:
  - family-names: Ueda
    given-names: Nobuhiro
version: 2.0.0
license: MIT
repository-code: "https://github.com/nobu-g/cohesion-analysis"

GitHub Events

Total
  • Watch event: 1
  • Push event: 18
  • Pull request event: 12
Last Year
  • Watch event: 1
  • Push event: 18
  • Pull request event: 12

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 0
  • Total pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: about 12 hours
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 6
Past Year
  • Issues: 0
  • Pull requests: 6
  • Average time to close issues: N/A
  • Average time to close pull requests: about 12 hours
  • Issue authors: 0
  • Pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 5
  • Bot issues: 0
  • Bot pull requests: 6
Top Authors
Issue Authors
Pull Request Authors
  • pre-commit-ci[bot] (13)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

poetry.lock pypi
  • 107 dependencies
pyproject.toml pypi
  • ipdb ^0.13.9 develop
  • pytest ^8.2 develop
  • cohesion-tools ^0.7.1
  • dataclasses-json ^0.6.1
  • hydra-core ^1.3
  • jaconv ^0.3.4
  • lightning ~2.2.2
  • omegaconf ^2.3
  • pandas ^2.0
  • python ^3.9
  • rhoknp ~1.7.0
  • rich ^13.3
  • tokenizers ^0.19.1
  • torch >=2.1.1
  • torchmetrics ^1.1
  • transformers ~4.40.0
  • typing-extensions >=4.4
  • wandb ^0.16.0
  • fastapi >=0.95.1 server
  • pyhumps ^3.8 server
  • uvicorn >=0.22.0 server