ginza

A Japanese NLP Library using spaCy as framework based on Universal Dependencies

https://github.com/megagonlabs/ginza

Keywords from Contributors

transformer cryptocurrency cryptography jax

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: megagonlabs
License: mit
Language: Python
Default Branch: develop
Size: 1.02 MB

Statistics

Stars: 806
Watchers: 30
Forks: 58
Open Issues: 12
Releases: 27

Created over 7 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

GiNZA NLP Library

An Open Source Japanese NLP Library, based on Universal Dependencies

Please read the Important changes before you upgrade GiNZA.

The Japanese page is here

License

GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models are distributed under the MIT License. You must agree to and follow the MIT License to use GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models.

Explosion / spaCy

spaCy is the key framework of GiNZA.

spaCy LICENSE PAGE

Works Applications Enterprise / Sudachi/SudachiPy - SudachiDict - chiVe

SudachiPy provides highly accurate tokenization and POS tagging.

Sudachi LICENSE PAGE, SudachiPy LICENSE PAGE, SudachiDict LEGAL PAGE, chiVe LICENSE PAGE

Hugging Face / transformers

The GiNZA v5 Transformers model (ja_ginza_electra) is trained using Hugging Face Transformers as the framework for pretrained models.

transformers LICENSE PAGE

Training Datasets

UD Japanese BCCWJ r2.8

The parsing model of GiNZA v5 is trained on a part of UD Japanese BCCWJ r2.8 (Omura and Asahara:2018). This model was developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.

GSK2014-A (2019) BCCWJ edition

The named entity recognition model of GiNZA v5 is trained on a part of GSK2014-A (2019) BCCWJ edition (Hashimoto, Inui, and Murakami:2008). We use two named entity label systems: Sekine's Extended Named Entity Hierarchy and extended OntoNotes5. This model was developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.

mC4

The GiNZA v5 Transformers model (ja_ginza_electra) is trained using transformers-ud-japanese-electra-base-discriminator, which is pretrained on more than 200 million Japanese sentences extracted from mC4.

Contains information from mC4 which is made available under the ODC Attribution License.

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}

Runtime Environment

This project is developed with Python>=3.8 and its pip. We do not recommend using an Anaconda environment because the pip install step may not work properly.

Please also see the Development Environment section below.

Runtime set up

1. Install GiNZA NLP Library with Transformer-based Model

Uninstall previous versions of the ginza and ja_ginza_electra packages:

```console
$ pip uninstall ginza ja_ginza_electra
```

Then, install the latest versions of ginza and ja_ginza_electra:

```console
$ pip install -U ginza ja_ginza_electra
```

The package of ja_ginza_electra does not include pytorch_model.bin due to PyPI's archive size restrictions. This large model file will be automatically downloaded at the first run time, and the locally cached file will be used for subsequent runs.

If you need to install ja_ginza_electra along with pytorch_model.bin at install time, you can specify a direct link to the GitHub release archive as follows:

```console
$ pip install -U ginza https://github.com/megagonlabs/ginza/releases/download/latest/ja_ginza_electra-latest-with-model.tar.gz
```

To accelerate the transformer-based models on GPUs with CUDA support, install spaCy with the extra that matches your CUDA version:

```console
pip install -U "spacy[cuda117]"
```

You also need to install a version of PyTorch that is consistent with the CUDA version.

2. Install GiNZA NLP Library with Standard Model

Uninstall the previous version:

```console
$ pip uninstall ginza ja_ginza
```

Then, install the latest versions of ginza and ja_ginza:

```console
$ pip install -U ginza ja_ginza
```

When using Apple Silicon such as M1 or M2, you can accelerate the analysis process by installing thinc-apple-ops:

```console
$ pip install torch thinc-apple-ops
```

Execute ginza command

Run the ginza command from the console, then input some Japanese text. After you press the Enter key, you will get the parsed results in CoNLL-U Syntactic Annotation format.

```console
$ ginza
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
```

The `ginzame` command provides a tokenization function like [MeCab](https://taku910.github.io/mecab/). The output format of `ginzame` is almost the same as that of `mecab`, but the last `pronunciation` field is always '*'.

```console
$ ginzame
銀座でランチをご一緒しましょう。
銀座	名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*
で	助詞,格助詞,*,*,*,*,で,デ,*
ランチ	名詞,普通名詞,一般,*,*,*,ランチ,ランチ,*
を	助詞,格助詞,*,*,*,*,を,ヲ,*
ご	接頭辞,*,*,*,*,*,御,ゴ,*
一緒	名詞,普通名詞,サ変可能,*,*,*,一緒,イッショ,*
し	動詞,非自立可能,*,*,サ行変格,連用形-一般,為る,シ,*
ましょう	助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*
。	補助記号,句点,*,*,*,*,。,。,*
EOS
```

spaCy's JSON format is available by specifying `-f 3` or `-f json` for the `ginza` command.

```console
$ ginza -f json
銀座でランチをご一緒しましょう。
[
 {
  "paragraphs": [
   {
    "raw": "銀座でランチをご一緒しましょう。",
    "sentences": [
     {
      "tokens": [
       {"id": 1, "orth": "銀座", "tag": "名詞-固有名詞-地名-一般", "pos": "PROPN", "lemma": "銀座", "head": 5, "dep": "obl", "ner": "B-City"},
       {"id": 2, "orth": "で", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "で", "head": -1, "dep": "case", "ner": "O"},
       {"id": 3, "orth": "ランチ", "tag": "名詞-普通名詞-一般", "pos": "NOUN", "lemma": "ランチ", "head": 3, "dep": "obj", "ner": "O"},
       {"id": 4, "orth": "を", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "を", "head": -1, "dep": "case", "ner": "O"},
       {"id": 5, "orth": "ご", "tag": "接頭辞", "pos": "NOUN", "lemma": "ご", "head": 1, "dep": "compound", "ner": "O"},
       {"id": 6, "orth": "一緒", "tag": "名詞-普通名詞-サ変可能", "pos": "VERB", "lemma": "一緒", "head": 0, "dep": "ROOT", "ner": "O"},
       {"id": 7, "orth": "し", "tag": "動詞-非自立可能", "pos": "AUX", "lemma": "する", "head": -1, "dep": "advcl", "ner": "O"},
       {"id": 8, "orth": "ましょう", "tag": "助動詞", "pos": "AUX", "lemma": "ます", "head": -2, "dep": "aux", "ner": "O"},
       {"id": 9, "orth": "。", "tag": "補助記号-句点", "pos": "PUNCT", "lemma": "。", "head": -3, "dep": "punct", "ner": "O"}
      ]
     }
    ]
   }
  ]
 }
]
```

If you want output like [`cabocha -f1`](https://taku910.github.io/cabocha/) (lattice style), add the `-f 1` or `-f cabocha` option to the `ginza` command. The format is almost the same as `cabocha -f1`, but the `func_index` field (after the slash) is slightly different. Our `func_index` field indicates the boundary where the `自立語` (content words) end in each `文節` (bunsetu), and where the `機能語` (function words) might start. The functional token filter is also slightly different between `cabocha -f1` and `ginza -f cabocha`.

```console
$ ginza -f cabocha
銀座でランチをご一緒しましょう。
* 0 2D 0/1 0.000000
銀座	名詞,固有名詞,地名,一般,,銀座,ギンザ,*	B-City
で	助詞,格助詞,*,*,,で,デ,*	O
* 1 2D 0/1 0.000000
ランチ	名詞,普通名詞,一般,*,,ランチ,ランチ,*	O
を	助詞,格助詞,*,*,,を,ヲ,*	O
* 2 -1D 0/2 0.000000
ご	接頭辞,*,*,*,,ご,ゴ,*	O
一緒	名詞,普通名詞,サ変可能,*,,一緒,イッショ,*	O
し	動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,*	O
ましょう	助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*	O
。	補助記号,句点,*,*,,。,。,*	O
EOS
```
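The role of the `func_index` field can be made concrete with a small parser. The following is a minimal sketch and not part of GiNZA: `Bunsetu` and `parse_cabocha` are hypothetical names, and the code assumes the header layout `* <index> <link>D <head>/<func> <score>` with tab-separated token lines, as in the `ginza -f cabocha` sample above.

```python
from dataclasses import dataclass, field


@dataclass
class Bunsetu:
    index: int       # chunk id from the header line
    head_chunk: int  # id of the chunk this one depends on (-1 for the root)
    func_index: int  # number of tokens before the function words start
    tokens: list = field(default_factory=list)

    @property
    def content_words(self):
        return self.tokens[: self.func_index]

    @property
    def function_words(self):
        return self.tokens[self.func_index:]


def parse_cabocha(text):
    """Parse cabocha -f1 style text into a list of Bunsetu (hypothetical helper)."""
    chunks = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue
        if line.startswith("* "):
            # Header line: "* 0 2D 0/1 0.000000"
            _, idx, link, head_func, _score = line.split(" ")
            func_index = int(head_func.split("/")[1])
            chunks.append(Bunsetu(int(idx), int(link.rstrip("D")), func_index))
        else:
            # Token line: surface form comes before the first tab
            chunks[-1].tokens.append(line.split("\t")[0])
    return chunks


SAMPLE = "\n".join([
    "* 0 2D 0/1 0.000000",
    "銀座\t名詞,固有名詞,地名,一般,,銀座,ギンザ,*\tB-City",
    "で\t助詞,格助詞,*,*,,で,デ,*\tO",
    "* 1 2D 0/1 0.000000",
    "ランチ\t名詞,普通名詞,一般,*,,ランチ,ランチ,*\tO",
    "を\t助詞,格助詞,*,*,,を,ヲ,*\tO",
    "* 2 -1D 0/2 0.000000",
    "ご\t接頭辞,*,*,*,,ご,ゴ,*\tO",
    "一緒\t名詞,普通名詞,サ変可能,*,,一緒,イッショ,*\tO",
    "し\t動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,*\tO",
    "ましょう\t助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*\tO",
    "。\t補助記号,句点,*,*,,。,。,*\tO",
    "EOS",
])

chunks = parse_cabocha(SAMPLE)
for chunk in chunks:
    print(chunk.index, chunk.head_chunk, chunk.content_words, chunk.function_words)
```

In the last chunk the header `0/2` puts the boundary after two tokens, so `ご` and `一緒` are treated as content words and the remaining three tokens as function words.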

Multi-processing (Experimental)

The -p NUM_PROCESS option is available from GiNZA v3.0. Set NUM_PROCESS to the number of analyzing processes. To use all the CPU cores for GiNZA, execute ginza -p 0. The memory requirement is about 130MB/process (to be improved).
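As a rough sizing aid, the memory figure above can be turned into a quick estimate. This is only a back-of-the-envelope sketch: `estimated_memory_mb` is a hypothetical helper, the 130 MB figure is the approximate value quoted above, and treating `0` as "one process per CPU core" mirrors the description of `ginza -p 0` rather than GiNZA's actual implementation.

```python
import os

MB_PER_PROCESS = 130  # approximate per-process figure quoted in the text


def estimated_memory_mb(num_process):
    """Estimate total memory in MB for a given -p value (0 = all CPU cores)."""
    n = num_process if num_process > 0 else (os.cpu_count() or 1)
    return n * MB_PER_PROCESS


print(estimated_memory_mb(4))  # four worker processes
print(estimated_memory_mb(0))  # depends on the machine's core count
```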

Coding example

The following code prints the dependency parsing results, with 'EOS' marking each sentence boundary.

```python
import spacy

nlp = spacy.load('ja_ginza_electra')
doc = nlp('銀座でランチをご一緒しましょう。')
for sent in doc.sents:
    for token in sent:
        print(
            token.i,
            token.orth_,
            token.lemma_,
            token.norm_,
            token.morph.get("Reading"),
            token.pos_,
            token.morph.get("Inflection"),
            token.tag_,
            token.dep_,
            token.head.i,
        )
    print('EOS')
```
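The example above prints `token.head.i`, which is a document-level index, and spaCy marks the root token as its own head. CoNLL-U instead uses 1-based, sentence-relative head indices with 0 for the root. The sketch below shows one way to convert between the two; `conllu_head` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for spaCy tokens and spans so that the snippet runs without a model installed.

```python
from types import SimpleNamespace


def conllu_head(token, sent):
    """Return the 1-based, sentence-relative head index (0 for the root)."""
    if token.head.i == token.i:  # spaCy's root token is its own head
        return 0
    return token.head.i - sent.start + 1


# Stand-ins for a sentence that starts at document index 3.
root = SimpleNamespace(i=5)
root.head = root
child = SimpleNamespace(i=3, head=root)
sent = SimpleNamespace(start=3)

root_head = conllu_head(root, sent)
child_head = conllu_head(child, sent)
print(root_head, child_head)
```

With real spaCy objects, the same function can be called as `conllu_head(token, sent)` inside the `for sent in doc.sents` loop.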

User Dictionary

The user dictionary files should be set in the userDict field of sudachi.json in the installed package directory of the ja_ginza_dict package.

Please read the official documents to compile user dictionaries with the sudachipy command:
- SudachiPy - User defined Dictionary
- Sudachi User Dictionary Construction (Japanese Only)
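Editing the userDict field can be scripted. The sketch below is an illustration under assumptions: `add_user_dict` is a hypothetical helper, the userDict field is assumed to hold a JSON array of dictionary paths, and the file contents and paths are made up for the demo (it writes to a temporary directory, not to an installed package).

```python
import json
import tempfile
from pathlib import Path


def add_user_dict(config_path, dict_path):
    """Append dict_path to the userDict list of a sudachi.json file."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    user_dicts = config.setdefault("userDict", [])  # assumed to be a list of paths
    if dict_path not in user_dicts:
        user_dicts.append(dict_path)
    config_path.write_text(
        json.dumps(config, ensure_ascii=False, indent=4), encoding="utf-8"
    )
    return config


# Demo against a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sudachi.json"
    path.write_text('{"systemDict": "system.dic"}', encoding="utf-8")
    updated = add_user_dict(path, "/path/to/user.dic")
    reloaded = json.loads(path.read_text(encoding="utf-8"))

print(updated)
```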

Releases

version 5.x

ginza-5.2.0

2024-03-31
Require python>=3.8
Migrate to spaCy v3.7
New functionality
- add Japanese clause recognition API (experimental)

ginza-5.1.3

2023-09-25
Migrate to spaCy v3.6
Beta release of ja_ginza_bert_large

ginza-5.1.2

2022-03-12
Migrate to spaCy v3.4

ginza-5.1.1

2022-03-12
Improvements
- auto deploy for pypi by @nimiusrd in #184
- modify github actions: trigger by tagging, stop uploading test pypi by @r-terada in #233

ginza-5.1.0

2021-12-10, Euclase
Important changes
- Upgrade: spaCy v3.2 and Sudachi.rs(SudachiPy v0.6.2)
- Change token information fields #208 #209
  - doc.user_data["reading_forms"][token.i] -> token.morph.get("Reading")
  - doc.user_data["inflections"][token.i] -> token.morph.get("Inflection")
  - force_using_normalized_form_as_lemma(True) -> token.norm_
- All spaCy models, including non-Japanese, are now available with the ginza command #217
- Download and analyze the model at once by specifying the model name in the following form #219
  - ginza -m en_core_web_md
- Change ginza --require_gpu and ginza -g to take a gpu_id argument
  - The default gpu_id value is -1 which uses only CPUs
- The ginza -f json option always analyzes lines that start with #, regardless of the value of the -c option. #215
Improvements
- Batch analysis processing speeds up by 50-60% in GPU environment and 10-40% in CPU environment
- Improved processing efficiency of parallel execution options (ginza -p {n_process} and ginzame) of ginza command #204
- add tests #198 #210 #214
- add benchmark #207 #220
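Code that has to work on both sides of the 5.1.0 field change could hide the difference behind a small accessor. The following is a minimal sketch, not an official GiNZA API: `reading_of` is a hypothetical helper, and the stand-in classes only imitate the attributes named in the migration list above so the snippet runs without GiNZA installed.

```python
from types import SimpleNamespace


def reading_of(token):
    """Return the reading of a token, trying the v5.1+ location first."""
    morph = getattr(token, "morph", None)
    if morph is not None:
        readings = morph.get("Reading")  # a list of strings in spaCy v3
        if readings:
            return readings[0]
    # Fallback for the older layout: doc.user_data["reading_forms"][token.i]
    forms = token.doc.user_data.get("reading_forms")
    return forms[token.i] if forms is not None else None


class FakeMorph:
    """Stand-in for spaCy's MorphAnalysis (illustration only)."""

    def __init__(self, feats):
        self._feats = feats

    def get(self, name):
        return self._feats.get(name, [])


new_token = SimpleNamespace(
    i=0,
    morph=FakeMorph({"Reading": ["ギンザ"]}),
    doc=SimpleNamespace(user_data={}),
)
old_doc = SimpleNamespace(user_data={"reading_forms": ["ギンザ", "デ"]})
old_token = SimpleNamespace(i=1, doc=old_doc)

print(reading_of(new_token), reading_of(old_token))
```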

ginza-5.0.3

2021-10-15
Bug fix
- Bunsetu span should not cross the sentence boundary #195

ginza-5.0.2

2021-09-06
Bug fix
- Command Line -s option and set_split_mode() not working in v5.0.x #185

ginza-5.0.1

2021-08-26
Bug fix
- ginzame not working in ginza ver. 5 #179
- Command Line -d option not working in v5.0.0 #178
Improvement
- accept ja-ginza and ja-ginza-electra for -m option of ginza command

ginza-5.0.0

2021-08-26, Demantoid
Important changes
- Upgrade spaCy to v3
- Release transformer-based ja-ginza-electra model
- Improve UPOS accuracy of the standard ja-ginza model by adding morphologizer to the tail of spaCy pipeline
- Need to install analysis model along with ginza package
  - High accuracy model (>=16GB memory needed)
    - pip install -U ginza ja-ginza-electra
  - Speed oriented model
    - pip install -U ginza ja-ginza
- Change component names of CompoundSplitter and BunsetuRecognizer to compound_splitter and bunsetu_recognizer respectively
- Also see spaCy v3 Backwards Incompatibilities
Improvements
- Add command line options
  - -n
    - Force using SudachiPy's normalized_form as Token.lemma_
  - -m (ja_ginza|ja_ginza_electra)
    - Select model package
- Revise ENE category name
  - Degital_Game to Digital_Game
version 4.x

ginza-4.0.6

2021-06-01
Bug fix
- Issue #160: IndexError: list assignment index out of range for empty string

ginza-4.0.5

2020-10-01
Improvements
- Add -d option, which disables spaCy's sentence separator, to ginza command line tool

ginza-4.0.4

2020-09-11
Improvements
- ginza command line tool works correctly without BunsetuRecognizer in the pipeline

ginza-4.0.3

2020-09-10
Improve bunsetu head identification accuracy over inconsistent deps in ent spans

ginza-4.0.2

2020-09-04
Improvements
- Serialization of CompoundSplitter for nlp.to_disk()
- Bunsetu span detection accuracy

ginza-4.0.1

2020-08-30
Debug
- Add type arguments for singledispatch register annotations (for Python 3.6)

ginza-4.0.0

2020-08-16, Chrysoberyl
Important changes
- Replace Japanese model with spacy.lang.ja of spaCy v2.3
- Replace values of Token.lemma_ with the output of SudachiPy's Morpheme.dictionary_form()
- Replace ja_ginza_dict with official SudachiDict-core package
  - You can safely delete the ja_ginza_dict package
- Change options and misc field contents of output of command line tool
  - delete use_sentence_separator(-s)
  - NE(OntoNotes) BI labels as B-GPE
  - Add subfields: Reading, Inf(inflection) and ENE(Extended NE)
- Obsolete Token._.* and add some entries for Doc.user_data[] and accessors
  - inflections (ginza.inflection(Token))
  - reading_forms (ginza.reading_form(Token))
  - bunsetu_bi_labels (ginza.bunsetu_bi_label(Token))
  - bunsetu_position_types (ginza.bunsetu_position_type(Token))
  - bunsetu_heads (ginza.is_bunsetu_head(Token))
- Change pipeline architecture
  - JapaneseCorrector was obsoleted
  - Add CompoundSplitter and BunsetuRecognizer
- Upgrade UD_JAPANESE-BCCWJ to v2.6
- Change word2vec to chiVe mc90
API Changes
- Add bunsetu-unit APIs (from ginza import *)
  - bunsetu(Token)
  - phrase(Token)
  - sub_phrases(Token)
  - phrases(Span)
  - bunsetu_spans(Span)
  - bunsetu_phrase_spans(Span)
  - bunsetu_head_list(Span)
  - bunsetu_head_tokens(Span)
  - bunsetu_bi_labels(Span)
  - bunsetu_position_types(Span)
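To illustrate what bunsetu-unit grouping means, the sketch below rebuilds bunsetu groups from per-token BI labels in plain Python. It is not GiNZA's implementation: `group_by_bi_labels` is a hypothetical helper, and it assumes that `B` starts a new bunsetu and `I` continues the current one. The tokens and labels are written by hand to match the three bunsetu of the sample sentence.

```python
def group_by_bi_labels(tokens, labels):
    """Group tokens into bunsetu, starting a new group at every 'B' label."""
    groups = []
    for token, label in zip(tokens, labels):
        if label == "B" or not groups:
            groups.append([token])
        else:
            groups[-1].append(token)
    return groups


tokens = ["銀座", "で", "ランチ", "を", "ご", "一緒", "し", "ましょう", "。"]
labels = ["B", "I", "B", "I", "B", "I", "I", "I", "I"]

groups = group_by_bi_labels(tokens, labels)
print(["".join(g) for g in groups])
```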

version 3.x

ginza-3.1.2

2020-02-12
Debug
- Fix: degrade of cabocha mode

ginza-3.1.1

2020-01-19
API Changes
- Extension fields
- The values of Token._.sudachi field would be set after calling SudachipyTokenizer.set_enable_ex_sudachi(True), to avoid serialization errors

```python
import spacy
import pickle

nlp = spacy.load('ja_ginza')
doc1 = nlp('This example will be serialized correctly.')
doc1.to_bytes()
with open('sample1.pickle', 'wb') as f:
    pickle.dump(doc1, f)

nlp.tokenizer.set_enable_ex_sudachi(True)
doc2 = nlp('This example will cause a serialization error.')
doc2.to_bytes()
with open('sample2.pickle', 'wb') as f:
    pickle.dump(doc2, f)
```

ginza-3.1.0

2020-01-16
Important changes
- Distribute ja_ginza_dict from PyPI
API Changes
- commands
  - ginza and ginzame
    - add -i option to initialize the files of ja_ginza_dict

ginza-3.0.0

2020-01-15, Benitoite
Important changes
- Distribute ginza and ja_ginza from PyPI
  - Simple installation; pip install ginza, and run ginza
  - The model package, ja_ginza, is also available from PyPI.
- Model improvements
  - Change NER training data-set to GSK2014-A (2019) BCCWJ edition
    - Improved accuracy of NER
    - token.ent_type_ value is changed to Sekine's Extended Named Entity Hierarchy
    - Add ENE7 attribute to the last field of the output of ginza
    - Move OntoNotes5-based label to token._.ne
    - We extended the OntoNotes5 named entity labels with PHONE, EMAIL, URL, and PET_NAME
  - Overall accuracy is improved by executing spacy pretrain over 100 epochs
    - Multi-task learning of spacy train effectively working on UD Japanese BCCWJ
  - The newest SudachiDict_core-20191224
- ginzame
  - Execute sudachipy by multiprocessing.Pool and output results with mecab like format
  - Now sudachipy command requires additional SudachiDict package installation
Breaking API Changes
- commands
  - ginza (ginza.command_line.main_ginza)
    - change option mode to sudachipy_mode
    - drop options: disable_pipes and recreate_corrector
    - add options: hash_comment, parallel, files
    - add mecab to the choices for the argument of -f option
    - add parallel NUM_PROCESS option (EXPERIMENTAL)
    - add ENE7 attribute to conllu miscellaneous field
    - ginza.ent_type_mapping.ENE_NE_MAPPING is used to convert ENE7 label to NE
  - add ginzame (ginza.command_line.main_ginzame)
    - a multi-process tokenizer providing mecab like output format
- spaCy field extensions
  - add token._.ne for ner label
- ginza/sudachipy_tokenizer.py
  - change SudachiTokenizer to SudachipyTokenizer
  - use SUDACHI_DEFAULT_SPLIT_MODE instead of SUDACHI_DEFAULT_SPLITMODE or SUDACHI_DEFAULT_MODE
Dependencies
- upgrade spacy to v2.2.3
- upgrade sudachipy to v0.4.2
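The ENE-to-NE conversion mentioned under Breaking API Changes amounts to a label lookup that keeps the B/I prefix. The sketch below illustrates the idea only: `EXAMPLE_MAPPING` is a made-up two-entry table, not the contents of ginza.ent_type_mapping.ENE_NE_MAPPING, and `convert_label` is a hypothetical helper. Returning "O" for unmapped labels is also an assumption of this sketch.

```python
# Illustrative entries only; the real mapping lives in ginza.ent_type_mapping.
EXAMPLE_MAPPING = {"City": "GPE", "Person": "PERSON"}


def convert_label(label, mapping):
    """Convert a BI-prefixed ENE label such as 'B-City' to an NE label."""
    if label == "O":
        return "O"
    prefix, _, ene = label.partition("-")
    ne = mapping.get(ene)
    # Unmapped categories fall back to "O" (a choice made for this sketch).
    return f"{prefix}-{ne}" if ne else "O"


converted = [convert_label(l, EXAMPLE_MAPPING) for l in ["B-City", "I-City", "O", "B-Unknown"]]
print(converted)
```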

version 2.x

ginza-2.2.1

2019-10-28
Improvements
- JapaneseCorrector can merge the as_* type dependencies completely
Bug fixes
- command line tool failed at the specific situations

ginza-2.2.0

2019-10-04, Ametrine
Important changes
- split_mode has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
  - This bug caused split_mode incompatibility between the training phase and the ginza command.
  - split_mode was set to 'B' for training phase and python APIs, but 'C' for ginza command.
  - We fixed this bug by setting the default split_mode to 'C' entirely.
  - This fix may cause the word segmentation incompatibilities during upgrading GiNZA from v2.0.0 to v2.2.0.
New features
- Add -f and --output-format option to ginza command:
  - -f 0 or -f conllu : CoNLL-U Syntactic Annotation format
  - -f 1 or -f cabocha: cabocha -f1 compatible format
- Add custom token fields:
  - bunsetu_index : bunsetu index starting from 0
  - reading: reading of token (not a pronunciation)
  - sudachi: SudachiPy's morpheme instance (or its list when the tokens are gathered by JapaneseCorrector)
Performance improvements
- Tokenizer
  - Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
  - Use Cythonized SudachiPy (v0.4.0)
- Dependency parser
  - Apply spacy pretrain command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
  - Apply multitask objectives by using -pt 'tag,dep' option of spacy train
- New model file
  - ja_ginza-2.2.0.tar.gz

ginza-2.0.0

2019-07-08
Add ginza command
- run ginza from the console
Change package structure
- module package as ginza
- language model package as ja_ginza
- spacy.lang.ja is overridden by ginza
Remove sudachipy related directories
- SudachiPy and its dictionary are installed via pip during ginza installation
User dictionary available
- See Customized dictionary - SudachiPy
Token extension fields
- Added
  - token._.bunsetu_bi_label, token._.bunsetu_position_type
- Remained
  - token._.inf
- Removed
  - pos_detail (same value is set to token.tag_)

version 1.x

ja_ginza_nopn-1.0.2

2019-04-07
Set the head token index of the root to 0 to conform to the CoNLL-U format definition

ja_ginza_nopn-1.0.1

2019-04-02
Add new Japanese era 'reiwa' to system_core.dic.

ja_ginza_nopn-1.0.0

2019-04-01
First release version

Development Environment

Development set up

1. Clone from github

```console
$ git clone 'https://github.com/megagonlabs/ginza.git'
```

2. Run python setup.py

For a normal environment:

```console
$ python setup.py develop
```

3. Set up system.dic

Copy system.dic from installed package directory of ja_ginza_dict to ./ja_ginza_dict/sudachidict/.

Training models

The analysis model of GiNZA is trained by the spacy train command.

```console
$ python -m spacy train ja ja_ginza-4.0.0 corpus/ja_ginza-ud-train.json corpus/ja_ginza-ud-dev.json -b ja_vectors_chive_mc90_35k/ -ovl 0.3 -n 100 -m meta.json.ginza -V 4.0.0
```

Run tests

GiNZA uses the pytest framework for testing, and you can run the tests via setup.py without installing the test requirements explicitly. Some tests depend on the default GiNZA models (ja-ginza, ja-ginza-electra), so install them before running the tests.

```console
$ pip install ja-ginza ja-ginza-electra
$ pip install -e .

# full test
$ python setup.py test

# test single file
$ python setup.py test --addopts ginza/tests/test_analyzer.py
```

Owner

Name: Megagon Labs
Login: megagonlabs
Kind: organization

Website: https://www.megagon.ai
Repositories: 23
Profile: https://github.com/megagonlabs

Citation (CITATION)

@ARTICLE{GiNZA NLP,
   AUTHOR  = {Hiroshi, Mai and Masayuki},
   TITLE   = {短単位品詞の用法曖昧性解決と依存関係ラベリングの同時学習},
   YEAR    = {2019},
   JOURNAL = {言語処理学会第25回年次大会},
   URL     = {http://www.anlp.jp/proceedings/annual_meeting/2019/pdf_dir/F2-3.pdf}
}

GitHub Events

Total

Issues event: 1
Watch event: 56
Fork event: 1

Last Year

Issues event: 1
Watch event: 56
Fork event: 1

Committers

Last synced: about 1 year ago

All Time

Total Commits: 380
Total Committers: 15
Avg Commits per committer: 25.333
Development Distribution Score (DDS): 0.153

Past Year

Commits: 0
Committers: 0
Avg Commits per committer: 0.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
hiroshi	h**a@m**i	322
r-terada	r**3@g**m	39
Yuta Hayashibe	s****u	3
wafuwafu13	m**e@i**p	3
Shin Uozumi	s****u	2
Yudai Udagawa	n**d@g**m	2
Koichi Yasuoka	y**a@k**p	1
Kuni88	k**3@g**m	1
Paul O'Leary McCann	p**m@d**m	1
Sorami Hisamoto	s@8****o	1
Yohei Tamura	t**y@g**m	1
nikkie	t**p@g**m	1
wataruhashimoto52	w**e@g**m	1
Sorami Hisamoto	h**s@w**p	1
Yusuke Yaguchi	m**e@m**l	1
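The All Time summary figures can be re-derived from this table. The sketch below assumes that the Development Distribution Score is defined as 1 minus the top committer's share of all commits. That definition is not stated on this page, although it reproduces the reported value.

```python
# Commit counts copied from the Top Committers table above.
commits = [322, 39, 3, 3, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]

total = sum(commits)
average = total / len(commits)
dds = 1 - max(commits) / total  # assumed definition of the DDS

print(total, len(commits), round(average, 3), round(dds, 3))
```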

Committer Domains (Top 20 + Academic)

worksap.co.jp: 1
89.io: 1
dampfkraft.com: 1
kanji.zinbun.kyoto-u.ac.jp: 1
i.softbank.jp: 1
megagon.ai: 1

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 21
Total pull requests: 93
Average time to close issues: about 1 month
Average time to close pull requests: 3 days
Total issue authors: 14
Total pull request authors: 8
Average comments per issue: 0.71
Average comments per pull request: 0.16
Merged pull requests: 88
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 1
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 1
Pull request authors: 0
Average comments per issue: 0.0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

hiroshi-matsuda-rit (7)
cidrugHug8 (2)
YuseiYokoyama (1)
divyadilip91 (1)
ftnext (1)
vincentmichael089 (1)
PyVCEchecker (1)
adamkolar (1)
ShoSoejima (1)
TatsuyaShirakawa (1)
hungnmai (1)
lemonov (1)
tadashikumano (1)
borh (1)

Pull Request Authors

hiroshi-matsuda-rit (69)
r-terada (10)
shirayu (3)
wafuwafu13 (3)
nimiusrd (1)
wataruhashimoto52 (1)
ftnext (1)
sinozu (1)

Packages

Total packages: 4
Total downloads:
- pypi 91,948 last-month
Total docker downloads: 1,950

Total dependent packages: 9
(may contain duplicates)
Total dependent repositories: 118
(may contain duplicates)
Total versions: 34
Total maintainers: 1

pypi.org: ginza

GiNZA, An Open Source Japanese NLP Library, based on Universal Dependencies

Homepage: https://github.com/megagonlabs/ginza
Documentation: https://ginza.readthedocs.io/
License: MIT
Latest release: 5.2.0
published about 2 years ago

Versions: 20
Dependent Packages: 6
Dependent Repositories: 63
Downloads: 48,426 Last month
Docker Downloads: 975

Rankings

Dependent packages count: 1.6%

Dependent repos count: 1.9%

Docker downloads count: 2.0%

Downloads: 2.1%

Stargazers count: 2.4%

Average: 2.6%

Forks count: 5.7%

Maintainers (1)

megagonlabs

Last synced: 10 months ago

pypi.org: ja-ginza

Japanese multi-task CNN trained on UD-Japanese BCCWJ r2.8 + GSK2014-A(2019). Assigns word2vec token vectors. Components: tok2vec, parser, ner, morphologizer, attribute_ruler, compound_splitter, bunsetu_recognizer.

Homepage: https://github.com/megagonlabs/ginza
Documentation: https://ja-ginza.readthedocs.io/
License: MIT License
Latest release: 5.2.0
published about 2 years ago

Versions: 8
Dependent Packages: 3
Dependent Repositories: 40
Downloads: 36,970 Last month
Docker Downloads: 975

Rankings

Docker downloads count: 2.0%

Dependent repos count: 2.3%

Downloads: 2.3%

Stargazers count: 2.4%

Average: 3.0%

Dependent packages count: 3.2%

Forks count: 5.7%

Maintainers (1)

megagonlabs

Last synced: 10 months ago

pypi.org: ja-ginza-electra

Japanese multi-task CNN trained on UD-Japanese BCCWJ r2.8 + GSK2014-A(2019) + transformers-ud-japanese-electra-base. Components: transformer, parser, attribute_ruler, ner, morphologizer, compound_splitter, bunsetu_recognizer.

Homepage: https://github.com/megagonlabs/ginza
Documentation: https://ja-ginza-electra.readthedocs.io/
License: MIT License
Latest release: 5.2.0
published about 2 years ago

Versions: 5
Dependent Packages: 0
Dependent Repositories: 7
Downloads: 6,504 Last month

Rankings

Stargazers count: 2.4%

Downloads: 5.5%

Dependent repos count: 5.5%

Forks count: 5.7%

Average: 5.8%

Dependent packages count: 10.1%

Maintainers (1)

megagonlabs

Last synced: 10 months ago

pypi.org: ja-ginza-dict

SudachiDict for ja_ginza (SudachiDict is originally developed by Works Applications Tokushima Laboratory of AI and NLP)

Homepage: https://github.com/megagonlabs/ginza
Documentation: https://ja-ginza-dict.readthedocs.io/
License: MIT
Latest release: 3.1.0
published over 6 years ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 8
Downloads: 48 Last month

Rankings

Stargazers count: 2.4%

Dependent repos count: 5.2%

Forks count: 5.7%

Average: 8.3%

Dependent packages count: 10.0%

Downloads: 18.2%

Maintainers (1)

megagonlabs

Last synced: 10 months ago

ginza

Science Score: 41.0%
