kwja

An integrated Japanese analyzer based on foundation models

https://github.com/ku-nlp/kwja

Science Score: 62.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
✓ Committers with academic emails: 3 of 7 committers (42.9%) from academic institutions
✓ Institutional organization owner: organization ku-nlp has institutional domain (nlp.ist.i.kyoto-u.ac.jp)
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (13.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

An integrated Japanese analyzer based on foundation models

Basic Info

Host: GitHub
Owner: ku-nlp
License: mit
Language: Python
Default Branch: main
Homepage:
Size: 56.9 MB

Statistics

Stars: 134
Watchers: 4
Forks: 7
Open Issues: 4
Releases: 24

Created about 4 years ago · Last pushed 10 months ago

Metadata Files

Readme Changelog Contributing License Citation

KWJA: Kyoto-Waseda Japanese Analyzer[^1]

[^1]: Pronunciation: /kuʒa/

[Paper (ja)] [Paper (en)] [Slides]

KWJA is an integrated Japanese text analyzer based on foundation models. KWJA performs many text analysis tasks, including:

- Typo correction
- Sentence segmentation
- Word segmentation
- Word normalization
- Morphological analysis
- Word feature tagging
- Base phrase feature tagging
- NER (Named Entity Recognition)
- Dependency parsing
- Predicate-argument structure (PAS) analysis
- Bridging reference resolution
- Coreference resolution
- Discourse relation analysis

Requirements

Python: 3.9+
Dependencies: See pyproject.toml.
GPUs with CUDA (optional)
GPUs with MPS (optional)

Getting Started

Install KWJA with pip:

```shell
$ pip install kwja
```

Perform language analysis with the kwja command (the result is in the KNP format):

```shell
# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze text files and write the result to a file
$ kwja --filename path/to/file1.txt --filename path/to/file2.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD
```

If you use Windows and PowerShell, you need to set the PYTHONUTF8 environment variable to 1:

```shell
$env:PYTHONUTF8 = "1"
kwja ...
```

The output is in the KNP format, which looks like the following:

```
# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <体言><NE:ARTIFACT:KWJA>
KWJA ＫWＪＡ KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
```
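To make the layout of the KNP format concrete, here is a minimal sketch of a line classifier written with the standard library only. It assumes the usual KNP conventions: lines starting with `#` are comments, `* <head><type>` lines open a phrase (bunsetsu), `+ <head><type>` lines open a base phrase, `EOS` closes a sentence, and every other line is a morpheme with eleven space-separated leading fields. The function name `parse_knp`, the field names, and the record layout are illustrative assumptions, not part of KWJA. For real work, rhoknp is the better choice.

```python
import re

# Phrase ("* 2D") and base-phrase ("+ 5D") headers: marker, head index, dependency type.
HEADER = re.compile(r"^([*+]) (-?\d+)([DPAI])(?: (.*))?$")

# Hypothetical names for the eleven leading fields of a morpheme line.
MORPHEME_FIELDS = (
    "surface", "reading", "lemma", "pos", "pos_id", "subpos", "subpos_id",
    "conjtype", "conjtype_id", "conjform", "conjform_id",
)


def parse_knp(text):
    """Group KNP lines into sentences, each a list of typed records."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line:
            continue
        if line == "EOS":  # end of sentence
            sentences.append(current)
            current = []
        elif line.startswith("#"):
            current.append({"kind": "comment", "text": line[1:].strip()})
        elif (m := HEADER.match(line)):
            kind = "phrase" if m.group(1) == "*" else "base_phrase"
            current.append({
                "kind": kind,
                "head": int(m.group(2)),
                "dep_type": m.group(3),
                "features": m.group(4) or "",
            })
        else:
            # Split off the fixed fields; the remainder holds semantics and features.
            parts = line.split(" ", len(MORPHEME_FIELDS))
            record = dict(zip(MORPHEME_FIELDS, parts))
            record["kind"] = "morpheme"
            record["rest"] = parts[-1] if len(parts) > len(MORPHEME_FIELDS) else ""
            current.append(record)
    if current:  # tolerate input without a trailing EOS
        sentences.append(current)
    return sentences


# A short excerpt in the same format as the output above.
SAMPLE = "\n".join([
    "# S-ID:202210010000-0-0 kwja:1.0.2",
    "* 2D",
    "+ 2D <体言>",
    '日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>',
    "EOS",
])

for record in parse_knp(SAMPLE)[0]:
    print(record["kind"], record.get("surface", ""))
```

The sketch does not handle every corner of the format (for example, a morpheme whose surface form is itself a space), which is one reason to prefer a dedicated parser.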

The kwja command accepts the following options:

- `--text`: Text to be analyzed.
- `--filename`: Path to a text file to be analyzed. You can specify this option multiple times.
- `--model-size`: Model size to be used. Specify one of `tiny`, `base` (default), and `large`.
- `--device`: Device to be used. Specify `cpu`, `cuda`, or `mps`. If not specified, the device is automatically selected.
- `--typo-batch-size`: Batch size for typo module.
- `--char-batch-size`: Batch size for character module.
- `--seq2seq-batch-size`: Batch size for seq2seq module.
- `--word-batch-size`: Batch size for word module.
- `--tasks`: Tasks to be performed. Specify one or more of the following values, separated by commas:
  - `typo`: Typo correction
  - `char`: Sentence segmentation, Word segmentation, and Word normalization
  - `seq2seq`: Word segmentation, Word normalization, Reading prediction, Lemmatization, and Canonicalization
  - `word`: Morphological analysis, Named entity recognition, Word feature tagging, Dependency parsing, PAS analysis, Bridging reference resolution, and Coreference resolution
- `--config-file`: Path to a custom configuration file.
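When kwja is driven from a script, it can help to assemble the argument list programmatically and reject invalid values before starting the process. The following is a minimal sketch that uses only the flags and values listed above. The helper names `build_kwja_command` and `run_kwja` are hypothetical and are not part of KWJA.

```python
import subprocess

# Allowed values, taken from the option list above.
MODEL_SIZES = {"tiny", "base", "large"}
DEVICES = {"cpu", "cuda", "mps"}
TASKS = {"typo", "char", "seq2seq", "word"}


def build_kwja_command(text=None, filenames=(), model_size=None, device=None, tasks=None):
    """Return an argv list for the kwja command (hypothetical helper)."""
    argv = ["kwja"]
    if text is not None:
        argv += ["--text", text]
    for name in filenames:  # --filename may be repeated
        argv += ["--filename", str(name)]
    if model_size is not None:
        if model_size not in MODEL_SIZES:
            raise ValueError(f"unknown model size: {model_size!r}")
        argv += ["--model-size", model_size]
    if device is not None:
        if device not in DEVICES:
            raise ValueError(f"unknown device: {device!r}")
        argv += ["--device", device]
    if tasks:
        unknown = [t for t in tasks if t not in TASKS]
        if unknown:
            raise ValueError(f"unknown tasks: {unknown}")
        argv += ["--tasks", ",".join(tasks)]  # comma-separated, as documented
    return argv


def run_kwja(argv):
    """Run kwja and return its standard output (the KNP-format result)."""
    result = subprocess.run(
        argv, capture_output=True, text=True, encoding="utf-8", check=True
    )
    return result.stdout


print(build_kwja_command(text="こんにちは", model_size="tiny", tasks=["char", "word"]))
```

Passing the arguments as a list avoids shell quoting problems with Japanese text. `run_kwja` is only defined here, not called, because it requires kwja to be installed.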

You can read a KNP format file with rhoknp.

```python
from rhoknp import Document

with open("analyzed.knp") as f:
    parsed_document = Document.from_knp(f.read())
```

For more details about KNP format, see Reference.

Usage from Python

Make sure the kwja command is on your PATH:

```shell
$ which kwja
/path/to/kwja
```
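The same check can be done from Python, which is useful on systems where `which` is not available. This is a small sketch using the standard library's `shutil.which`. The function name `find_kwja` is an illustrative assumption.

```python
import shutil


def find_kwja(command="kwja"):
    """Return the absolute path of the command, or None if it is not on PATH."""
    return shutil.which(command)


path = find_kwja()
if path is None:
    print("kwja was not found on PATH; try `pip install kwja`.")
else:
    print(f"kwja found at {path}")
```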

Install rhoknp:

```shell
$ pip install rhoknp
```

Perform language analysis with a KWJA instance:

```python
from rhoknp import KWJA

kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
```

Configuration

kwja can be configured with a configuration file that sets the default options. See the Config file example section for details.

Config file location

On non-Windows systems kwja follows the XDG Base Directory Specification convention for the location of the configuration file. The configuration dir kwja uses is itself named kwja. In that directory it refers to a file named config.yaml. For most people it should be enough to put their config file at ~/.config/kwja/config.yaml. You can also provide a configuration file in a non-standard location with an environment variable KWJA_CONFIG_FILE or a command line option --config-file.
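The lookup described above can be sketched as follows. This is an illustration of the XDG convention, not KWJA's actual implementation. In particular, the precedence order (command line option, then `KWJA_CONFIG_FILE`, then the XDG location) is an assumption, and the function name `resolve_config_path` is hypothetical. The fallback to `~/.config` when `XDG_CONFIG_HOME` is unset follows the XDG Base Directory Specification.

```python
import os
from pathlib import Path


def resolve_config_path(cli_path=None, env=None, home=None):
    """Return the config file path for a non-Windows system (illustrative sketch)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Assumed precedence: an explicit --config-file value wins.
    if cli_path:
        return Path(cli_path)
    # Next, the KWJA_CONFIG_FILE environment variable.
    if env.get("KWJA_CONFIG_FILE"):
        return Path(env["KWJA_CONFIG_FILE"])
    # Otherwise, $XDG_CONFIG_HOME/kwja/config.yaml, defaulting to ~/.config.
    xdg = env.get("XDG_CONFIG_HOME")
    base = Path(xdg) if xdg else home / ".config"
    return base / "kwja" / "config.yaml"


print(resolve_config_path(env={}, home="/home/user"))
```

Passing `env` and `home` explicitly keeps the function easy to test without touching the real environment.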

Config file example

```yaml
model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
char_batch_size: 1
seq2seq_batch_size: 1
word_batch_size: 1
```

Performance Table

- typo, character, seq2seq, and word modules
  - The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
  - We set the learning rate of RoBERTa_LARGE (word) to 2e-5 because we failed to fine-tune it with a higher learning rate. Other hyperparameters are the same as those described in configs, which are tuned for DeBERTa_BASE.
- seq2seq module
  - The performance on each task is the mean over all the corpora (KC, KWDLC, Fuman, and WAC).
  - `*` denotes results of a single run.
  - Scores are calculated using a separate script from the one used for the character and word modules.

| Task | Subtask | v1.x base (char, word) | v2.x base (char, word / seq2seq) | v1.x large (char, word) | v2.x large (char, word / seq2seq) |
| --- | --- | --- | --- | --- | --- |
| Typo Correction | | 79.0 | 76.7 | 80.8 | 83.1 |
| Sentence Segmentation | | - | 98.4 | - | 98.6 |
| Word Segmentation | | 98.5 | 98.1 / 98.2* | 98.7 | 98.4 / 98.4* |
| Word Normalization | | 44.0 | 15.3 | 39.8 | 48.6 |
| Morphological Analysis | POS | 99.3 | 99.4 | 99.3 | 99.4 |
| | sub-POS | 98.1 | 98.5 | 98.2 | 98.5 |
| | conjtype | 99.4 | 99.6 | 99.2 | 99.6 |
| | conjform | 99.5 | 99.7 | 99.4 | 99.7 |
| | reading | 95.5 | 95.4 / 96.2* | 90.8 | 95.6 / 96.8* |
| | lemma | - | - / 97.8* | - | - / 98.1* |
| | canon | - | - / 95.2* | - | - / 95.9* |
| Named Entity Recognition | | 83.0 | 84.6 | 82.1 | 85.9 |
| Linguistic Feature Tagging | word | 98.3 | 98.6 | 98.5 | 98.6 |
| | base phrase | 86.6 | 93.6 | 86.4 | 93.4 |
| Dependency Parsing | | 92.9 | 93.5 | 93.8 | 93.6 |
| PAS Analysis | | 74.2 | 76.9 | 75.3 | 77.5 |
| Bridging Reference Resolution | | 66.5 | 67.3 | 65.2 | 67.5 |
| Coreference Resolution | | 74.9 | 78.6 | 75.9 | 79.2 |
| Discourse Relation Analysis | | 42.2 | 39.2 | 41.3 | 44.3 |

Citation

```bibtex
@InProceedings{Ueda2023a,
  author    = {Nobuhiro Ueda and Kazumasa Omura and Takashi Kodama and Hirokazu Kiyomaru and Yugo Murawaki and Daisuke Kawahara and Sadao Kurohashi},
  title     = {KWJA: A Unified Japanese Analyzer Based on Foundation Models},
  booktitle = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  year      = {2023},
  address   = {Toronto, Canada},
}
```

```bibtex
@InProceedings{植田2022,
  author    = {植田暢大 and 大村和正 and 児玉貴志 and 清丸寛一 and 村脇有吾 and 河原大輔 and 黒橋禎夫},
  title     = {KWJA：汎用言語モデルに基づく日本語解析器},
  booktitle = {第253回自然言語処理研究会},
  year      = {2022},
  address   = {京都},
}
```

```bibtex
@InProceedings{児玉2023,
  author    = {児玉貴志 and 植田暢大 and 大村和正 and 清丸寛一 and 村脇有吾 and 河原大輔 and 黒橋禎夫},
  title     = {テキスト生成モデルによる日本語形態素解析},
  booktitle = {言語処理学会第29回年次大会},
  year      = {2023},
  address   = {沖縄},
}
```

License

This software is released under the MIT License, see LICENSE.

Reference

KNP format

Owner

Name: Language Media Processing Lab, Kyoto University
Login: ku-nlp
Kind: organization
Location: Kyoto, Japan

Website: https://nlp.ist.i.kyoto-u.ac.jp/EN/
Repositories: 42
Profile: https://github.com/ku-nlp

We are working on making NLP better

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "KWJA: Kyoto-Waseda Japanese Analyzer"
authors:
  - family-names: Ueda
    given-names: Nobuhiro
  - family-names: Omura
    given-names: Kazumasa
  - family-names: Kodama
    given-names: Takashi
  - family-names: Kiyomaru
    given-names: Hirokazu
  - family-names: Murawaki
    given-names: Yugo
  - family-names: Kawahara
    given-names: Daisuke
  - family-names: Kurohashi
    given-names: Sadao
version: 2.0.0
repository-code: "https://github.com/ku-nlp/kwja"
date-released: 2022-09-28

GitHub Events

Total

Create event: 7
Issues event: 2
Release event: 2
Watch event: 4
Delete event: 4
Issue comment event: 9
Push event: 40
Pull request review event: 2
Pull request review comment event: 1
Pull request event: 18
Fork event: 2

Last Year

Create event: 7
Issues event: 2
Release event: 2
Watch event: 4
Delete event: 4
Issue comment event: 9
Push event: 40
Pull request review event: 2
Pull request review comment event: 1
Pull request event: 18
Fork event: 2

Committers

Last synced: over 3 years ago

All Time

Total Commits: 1,206
Total Committers: 7
Avg Commits per committer: 172.286
Development Distribution Score (DDS): 0.541

Top Committers

Name	Email	Commits
nobu-g	u**7@h**p	553
omura	o**a@n**p	227
Hirokazu Kiyomaru	h**u@g**m	213
Taka008	k**a@n**p	182
MURAWAKI Yugo	m**i@i**p	21
Takashi Kodama	4**8@u**m	8
Yuta Hayashibe	y**a@h**p	2

Committer Domains (Top 20 + Academic)

nlp.ist.i.kyoto-u.ac.jp: 2
hayashibe.jp: 1
i.kyoto-u.ac.jp: 1
hotmail.co.jp: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 38
Total pull requests: 117
Average time to close issues: 3 months
Average time to close pull requests: 7 days
Total issue authors: 7
Total pull request authors: 8
Average comments per issue: 0.89
Average comments per pull request: 0.65
Merged pull requests: 104
Bot issues: 0
Bot pull requests: 17

Past Year

Issues: 2
Pull requests: 14
Average time to close issues: about 14 hours
Average time to close pull requests: 28 days
Issue authors: 2
Pull request authors: 4
Average comments per issue: 1.5
Average comments per pull request: 0.57
Merged pull requests: 7
Bot issues: 0
Bot pull requests: 8

Top Authors

Issue Authors

hkiyomaru (17)
nobu-g (13)
Taka008 (3)
anhnami (1)
murawaki (1)
omukazu (1)
kanagoon (1)

Pull Request Authors

nobu-g (31)
dependabot[bot] (27)
Taka008 (26)
hkiyomaru (23)
omukazu (22)
pre-commit-ci[bot] (5)
shirayu (1)
murawaki (1)

Top Labels

Issue Labels

enhancement (5)
bug (4)
high priority (1)
documentation (1)

Pull Request Labels

dependencies (27)
python (20)
github_actions (4)

Packages

Total packages: 1
Total downloads:
- pypi 821 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 24
Total maintainers: 1

pypi.org: kwja

A unified Japanese analyzer based on foundation models

Homepage: https://github.com/ku-nlp/kwja
Documentation: https://kwja.readthedocs.io/
License: MIT
Latest release: 2.5.1
published 10 months ago

Versions: 24
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 821 Last month

Rankings

Stargazers count: 6.7%

Downloads: 8.6%

Dependent packages count: 10.1%

Average: 12.3%

Forks count: 14.3%

Dependent repos count: 21.6%

Maintainers (1)

nobu-g

Last synced: 10 months ago
