kwja

An integrated Japanese analyzer based on foundation models

https://github.com/ku-nlp/kwja

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 7 committers (42.9%) from academic institutions
  • Institutional organization owner
    Organization ku-nlp has institutional domain (nlp.ist.i.kyoto-u.ac.jp)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: ku-nlp
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 56.9 MB
Statistics
  • Stars: 134
  • Watchers: 4
  • Forks: 7
  • Open Issues: 4
  • Releases: 24
Created almost 4 years ago · Last pushed 7 months ago
Metadata Files
Readme Changelog Contributing License Citation

README.md

KWJA: Kyoto-Waseda Japanese Analyzer[^1]

[^1]: Pronunciation: /kuʒa/


[Paper (ja)] [Paper (en)] [Slides]

KWJA is an integrated Japanese text analyzer based on foundation models. KWJA performs many text analysis tasks, including:

  • Typo correction
  • Sentence segmentation
  • Word segmentation
  • Word normalization
  • Morphological analysis
  • Word feature tagging
  • Base phrase feature tagging
  • NER (Named Entity Recognition)
  • Dependency parsing
  • Predicate-argument structure (PAS) analysis
  • Bridging reference resolution
  • Coreference resolution
  • Discourse relation analysis

Requirements

  • Python: 3.9+
  • Dependencies: See pyproject.toml.
  • GPUs with CUDA (optional)
  • GPUs with MPS (optional)

Getting Started

Install KWJA with pip:

```shell
$ pip install kwja
```

Perform language analysis with the kwja command (the result is in the KNP format):

```shell
# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze text files and write the result to a file
$ kwja --filename path/to/file1.txt --filename path/to/file2.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD
```

If you use Windows with PowerShell, set the PYTHONUTF8 environment variable to 1 before running kwja:

```shell
$env:PYTHONUTF8 = "1"
kwja ...
```

The output is in the KNP format, which looks like the following:

```
# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <体言><NE:ARTIFACT:KWJA>
KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
```
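As a rough illustration of how KNP output is laid out, the sketch below labels each line by its leading token: `#` marks a sentence-level comment line, `*` a phrase, `+` a base phrase, `EOS` the end of a sentence, and any other line is treated as a morpheme. The function name `classify_knp_lines` is hypothetical and is not part of KWJA or rhoknp. The check is deliberately naive (a morpheme whose surface form is itself `*`, `+`, or `#` would be mislabeled), so a real KNP parser such as rhoknp remains the better choice for actual use.

```python
from collections import Counter


def classify_knp_lines(knp_text: str) -> list[tuple[str, str]]:
    """Label each non-empty line of KNP-format text by its line type."""
    labeled = []
    for line in knp_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            kind = "comment"  # sentence ID and other metadata
        elif line.startswith("* "):
            kind = "phrase"
        elif line.startswith("+ "):
            kind = "base_phrase"
        elif line.strip() == "EOS":
            kind = "end_of_sentence"
        else:
            kind = "morpheme"  # naive fallback; see the caveat above
        labeled.append((kind, line))
    return labeled


# A shortened sample modeled on the output above; the EOS line is added
# here for illustration because the excerpt above is truncated.
sample = "\n".join(
    [
        "# S-ID:202210010000-0-0 kwja:1.0.2",
        "* 2D",
        "+ 5D <体言>",
        "KWJA KWJA KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>",
        'は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>',
        "EOS",
    ]
)

counts = Counter(kind for kind, _ in classify_knp_lines(sample))
print(dict(counts))
```

Counting line types like this can serve as a quick sanity check that a file contains the expected number of sentences before handing it to a full parser.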

The kwja command accepts the following options:

  • --text: Text to be analyzed.

  • --filename: Path to a text file to be analyzed. You can specify this option multiple times.

  • --model-size: Model size to be used. Specify one of tiny, base (default), and large.

  • --device: Device to be used. Specify cpu, cuda, or mps. If not specified, the device is automatically selected.

  • --typo-batch-size: Batch size for typo module.

  • --char-batch-size: Batch size for character module.

  • --seq2seq-batch-size: Batch size for seq2seq module.

  • --word-batch-size: Batch size for word module.

  • --tasks: Tasks to be performed. Specify one or more of the following values separated by commas:

    • typo: Typo correction
    • char: Sentence segmentation, Word segmentation, and Word normalization
    • seq2seq: Word segmentation, Word normalization, Reading prediction, Lemmatization, and Canonicalization
    • word: Morphological analysis, Named entity recognition, Word feature tagging, Dependency parsing, PAS analysis, Bridging reference resolution, and Coreference resolution

  • --config-file: Path to a custom configuration file.
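When kwja is driven from a script, it can help to assemble the argument list programmatically. The following is a minimal sketch that uses only the options listed above (`--text`, `--model-size`, `--device`, `--tasks`). The helper `build_kwja_command` and the validation tuples are hypothetical names introduced for illustration; the sketch does not check whether a given combination of tasks is accepted by kwja.

```python
from typing import Optional, Sequence

# Allowed values as documented in the option list above.
VALID_MODEL_SIZES = ("tiny", "base", "large")
VALID_DEVICES = ("cpu", "cuda", "mps")
VALID_TASKS = ("typo", "char", "seq2seq", "word")


def build_kwja_command(
    text: str,
    model_size: str = "base",
    device: Optional[str] = None,
    tasks: Optional[Sequence[str]] = None,
) -> list[str]:
    """Return an argv list for the kwja CLI (hypothetical helper)."""
    if model_size not in VALID_MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    command = ["kwja", "--text", text, "--model-size", model_size]
    if device is not None:
        if device not in VALID_DEVICES:
            raise ValueError(f"unknown device: {device!r}")
        command += ["--device", device]
    if tasks:
        unknown = [task for task in tasks if task not in VALID_TASKS]
        if unknown:
            raise ValueError(f"unknown tasks: {unknown}")
        # --tasks takes a comma-separated list of task names.
        command += ["--tasks", ",".join(tasks)]
    return command


command = build_kwja_command("こんにちは", device="cpu", tasks=["char", "word"])
print(command)

# To run the command (requires kwja to be installed):
# import subprocess
# result = subprocess.run(command, capture_output=True, text=True, check=True)
# print(result.stdout)  # KNP-format output
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with Japanese text and punctuation.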

You can read a KNP format file with rhoknp.

```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```

For more details about the KNP format, see Reference.

Usage from Python

Make sure the kwja command is on your PATH:

```shell
$ which kwja
/path/to/kwja
```
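The same check can be made from Python before constructing the analyzer. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_kwja` is hypothetical.

```python
import shutil
from typing import Optional


def find_kwja(executable: str = "kwja") -> Optional[str]:
    """Return the full path of the kwja executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_kwja()
if path is None:
    print("kwja was not found on PATH; install it with `pip install kwja`.")
else:
    print(f"kwja found at {path}")
```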

Install rhoknp:

```shell
$ pip install rhoknp
```

Perform language analysis with a KWJA instance:

```python
from rhoknp import KWJA

kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
```

Configuration

kwja can be configured with a configuration file that sets the default options. See Config file example for details.

Config file location

On non-Windows systems, kwja follows the XDG Base Directory Specification for the location of the configuration file. Within the XDG configuration directory, kwja uses a subdirectory named kwja and reads a file named config.yaml from it. For most people, it should be enough to put the config file at ~/.config/kwja/config.yaml. You can also point kwja to a configuration file in a non-standard location with the environment variable KWJA_CONFIG_FILE or the command line option --config-file.
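The lookup described above might be sketched as follows for non-Windows systems. The function `resolve_config_path` is a hypothetical illustration, not KWJA's actual implementation. In particular, the precedence used here (command line option first, then `KWJA_CONFIG_FILE`, then the XDG default) is an assumption, since the text does not say which one wins when both are given. The fallback to `~/.config` when `XDG_CONFIG_HOME` is unset comes from the XDG Base Directory Specification.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(
    cli_config_file: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the config file path a kwja-like tool might use (illustrative only)."""
    env = os.environ if env is None else env
    # Assumed precedence: explicit --config-file value first.
    if cli_config_file:
        return Path(cli_config_file)
    # Then the KWJA_CONFIG_FILE environment variable.
    if env.get("KWJA_CONFIG_FILE"):
        return Path(env["KWJA_CONFIG_FILE"])
    # Otherwise the XDG default: $XDG_CONFIG_HOME, or ~/.config if unset.
    xdg_config_home = env.get("XDG_CONFIG_HOME")
    config_home = Path(xdg_config_home) if xdg_config_home else Path.home() / ".config"
    return config_home / "kwja" / "config.yaml"


print(resolve_config_path())
```

Accepting the environment as a parameter keeps the function easy to test without modifying the real process environment.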

Config file example

```yaml
model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
char_batch_size: 1
seq2seq_batch_size: 1
word_batch_size: 1
```

Performance Table

  • typo, character, seq2seq, and word modules
    • The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
    • We set the learning rate of RoBERTa-large (word) to 2e-5 because fine-tuning failed with a higher learning rate. The other hyperparameters are the same as described in configs, which are tuned for DeBERTa-base.
  • seq2seq module
    • The performance on each task is the mean over all the corpora (KC, KWDLC, Fuman, and WAC).
    • * denotes results of a single run
    • Scores are calculated using a separate script from the character and word modules.
| Task | Subtask | v1.x base (char, word) | v2.x base (char, word / seq2seq) | v1.x large (char, word) | v2.x large (char, word / seq2seq) |
|---|---|---|---|---|---|
| Typo Correction | | 79.0 | 76.7 | 80.8 | 83.1 |
| Sentence Segmentation | | - | 98.4 | - | 98.6 |
| Word Segmentation | | 98.5 | 98.1 / 98.2* | 98.7 | 98.4 / 98.4* |
| Word Normalization | | 44.0 | 15.3 | 39.8 | 48.6 |
| Morphological Analysis | POS | 99.3 | 99.4 | 99.3 | 99.4 |
| | sub-POS | 98.1 | 98.5 | 98.2 | 98.5 |
| | conjtype | 99.4 | 99.6 | 99.2 | 99.6 |
| | conjform | 99.5 | 99.7 | 99.4 | 99.7 |
| | reading | 95.5 | 95.4 / 96.2* | 90.8 | 95.6 / 96.8* |
| | lemma | - | - / 97.8* | - | - / 98.1* |
| | canon | - | - / 95.2* | - | - / 95.9* |
| Named Entity Recognition | | 83.0 | 84.6 | 82.1 | 85.9 |
| Linguistic Feature Tagging | word | 98.3 | 98.6 | 98.5 | 98.6 |
| | base phrase | 86.6 | 93.6 | 86.4 | 93.4 |
| Dependency Parsing | | 92.9 | 93.5 | 93.8 | 93.6 |
| PAS Analysis | | 74.2 | 76.9 | 75.3 | 77.5 |
| Bridging Reference Resolution | | 66.5 | 67.3 | 65.2 | 67.5 |
| Coreference Resolution | | 74.9 | 78.6 | 75.9 | 79.2 |
| Discourse Relation Analysis | | 42.2 | 39.2 | 41.3 | 44.3 |
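The averaging described in the notes before the table (a mean over the corpora and over three runs with different seeds) can be sketched as below. The scores are made-up placeholders chosen only to show the arithmetic and are not KWJA results; only the corpus names come from the notes. The function name is hypothetical. With the same number of runs per corpus, averaging per corpus first and then across corpora gives the same value as averaging all runs at once.

```python
from statistics import mean

# Hypothetical per-corpus scores for three random seeds.
# These are placeholder numbers, NOT taken from KWJA's experiments.
scores_by_corpus = {
    "KC": [90.0, 91.0, 92.0],
    "KWDLC": [88.0, 89.0, 90.0],
    "Fuman": [80.0, 82.0, 84.0],
    "WAC": [85.0, 85.0, 85.0],
}


def mean_over_corpora_and_runs(scores: dict[str, list[float]]) -> float:
    """Average over runs within each corpus, then over corpora."""
    per_corpus_means = [mean(runs) for runs in scores.values()]
    return mean(per_corpus_means)


overall = mean_over_corpora_and_runs(scores_by_corpus)
print(f"{overall:.2f}")
```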

Citation

```bibtex
@InProceedings{Ueda2023a,
  author    = {Nobuhiro Ueda and Kazumasa Omura and Takashi Kodama and Hirokazu Kiyomaru and Yugo Murawaki and Daisuke Kawahara and Sadao Kurohashi},
  title     = {KWJA: A Unified Japanese Analyzer Based on Foundation Models},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year      = {2023},
  address   = {Toronto, Canada},
}
```

```bibtex
@InProceedings{植田2022,
  author    = {植田 暢大 and 大村 和正 and 児玉 貴志 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {KWJA:汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}
```

```bibtex
@InProceedings{児玉2023,
  author    = {児玉 貴志 and 植田 暢大 and 大村 和正 and 清丸 寛一 and 村脇 有吾 and 河原 大輔 and 黒橋 禎夫},
  title     = {テキスト生成モデルによる日本語形態素解析},
  booktitle = {言語処理学会 第29回年次大会},
  year      = {2023},
  address   = {沖縄},
}
```

License

This software is released under the MIT License; see LICENSE.

Reference

Owner

  • Name: Language Media Processing Lab, Kyoto University
  • Login: ku-nlp
  • Kind: organization
  • Location: Kyoto, Japan

We are working on making NLP better

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "KWJA: Kyoto-Waseda Japanese Analyzer"
authors:
  - family-names: Ueda
    given-names: Nobuhiro
  - family-names: Omura
    given-names: Kazumasa
  - family-names: Kodama
    given-names: Takashi
  - family-names: Kiyomaru
    given-names: Hirokazu
  - family-names: Murawaki
    given-names: Yugo
  - family-names: Kawahara
    given-names: Daisuke
  - family-names: Kurohashi
    given-names: Sadao
version: 2.0.0
repository-code: "https://github.com/ku-nlp/kwja"
date-released: 2022-09-28

GitHub Events

Total
  • Create event: 7
  • Issues event: 2
  • Release event: 2
  • Watch event: 4
  • Delete event: 4
  • Issue comment event: 9
  • Push event: 40
  • Pull request review event: 2
  • Pull request review comment event: 1
  • Pull request event: 18
  • Fork event: 2
Last Year
  • Create event: 7
  • Issues event: 2
  • Release event: 2
  • Watch event: 4
  • Delete event: 4
  • Issue comment event: 9
  • Push event: 40
  • Pull request review event: 2
  • Pull request review comment event: 1
  • Pull request event: 18
  • Fork event: 2

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 1,206
  • Total Committers: 7
  • Avg Commits per committer: 172.286
  • Development Distribution Score (DDS): 0.541
Top Committers
Name Email Commits
nobu-g u****7@h****p 553
omura o****a@n****p 227
Hirokazu Kiyomaru h****u@g****m 213
Taka008 k****a@n****p 182
MURAWAKI Yugo m****i@i****p 21
Takashi Kodama 4****8@u****m 8
Yuta Hayashibe y****a@h****p 2

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 38
  • Total pull requests: 117
  • Average time to close issues: 3 months
  • Average time to close pull requests: 7 days
  • Total issue authors: 7
  • Total pull request authors: 8
  • Average comments per issue: 0.89
  • Average comments per pull request: 0.65
  • Merged pull requests: 104
  • Bot issues: 0
  • Bot pull requests: 17
Past Year
  • Issues: 2
  • Pull requests: 14
  • Average time to close issues: about 14 hours
  • Average time to close pull requests: 28 days
  • Issue authors: 2
  • Pull request authors: 4
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.57
  • Merged pull requests: 7
  • Bot issues: 0
  • Bot pull requests: 8
Top Authors
Issue Authors
  • hkiyomaru (17)
  • nobu-g (13)
  • Taka008 (3)
  • anhnami (1)
  • murawaki (1)
  • omukazu (1)
  • kanagoon (1)
Pull Request Authors
  • nobu-g (31)
  • dependabot[bot] (27)
  • Taka008 (26)
  • hkiyomaru (23)
  • omukazu (22)
  • pre-commit-ci[bot] (5)
  • shirayu (1)
  • murawaki (1)
Top Labels
Issue Labels
enhancement (5) bug (4) high priority (1) documentation (1)
Pull Request Labels
dependencies (27) python (20) github_actions (4)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 821 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 24
  • Total maintainers: 1
pypi.org: kwja

A unified Japanese analyzer based on foundation models

  • Versions: 24
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 821 Last month
Rankings
Stargazers count: 6.7%
Downloads: 8.6%
Dependent packages count: 10.1%
Average: 12.3%
Forks count: 14.3%
Dependent repos count: 21.6%
Maintainers (1)
Last synced: 7 months ago