Recent Releases of ginza
ginza - Release v5.2.0
What's Changed
- Require python>=3.8
- Migrate to spaCy v3.7
- New functionality
  - add Japanese clause recognition API (experimental)
Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.3...v5.2.0
How to Use ja_ginza_bert_large β1
- Prepare
Create a virtual-env to separate `ja_ginza_bert_large` from other GiNZA model environments. (`ja_ginza_bert_large` requires the latest `spacy-transformers` version, which is not compatible with `ja_ginza` or `ja_ginza_electra`.)
```Console
$ python -m venv venv_bert_large
$ source venv_bert_large/bin/activate
```
- Install
```Console
$ pip install "https://github.com/megagonlabs/ginza/releases/download/v5.2.0/ja_ginza_bert_large-5.2.0b1-py3-none-any.whl"
```
For CUDA environments, you need to upgrade spacy with the CUDA version number as follows:
```Console
$ pip install -U spacy[cuda117]
```
- Analyze
```Console
$ ginza -g 0 -b ja_ginza_bert_large 銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City|ClauseHead=6
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ|ClauseHead=6
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ|ClauseHead=6
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ|ClauseHead=6
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ|ClauseHead=6
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ|ClauseHead=6
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ|ClauseHead=6
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ|ClauseHead=6
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。|ClauseHead=6
```
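The last column of each output line packs several `key=value` subfields separated by `|`. As a rough illustration of how that column could be unpacked downstream, here is a minimal sketch in plain Python. The helper name `parse_misc` is hypothetical and not part of GiNZA; it only assumes the `|` and `=` separators visible in the output above and treats items without `=` as boolean flags.

```python
def parse_misc(misc: str) -> dict:
    """Parse a CoNLL-U MISC column into a dict; bare flags map to True."""
    fields = {}
    if misc == "_":  # CoNLL-U writes "_" for an empty column
        return fields
    for item in misc.split("|"):
        # partition() splits on the first "=" only, so values may contain "=".
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


# MISC column of token 8 (ましょう) from the sample output.
misc = ("SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|"
        "Inf=助動詞-マス,意志推量形|Reading=マショウ|ClauseHead=6")
fields = parse_misc(misc)
print(fields["Reading"], fields["BunsetuPositionType"], fields["ClauseHead"])
```

All values are kept as strings, so a numeric subfield such as `ClauseHead` would still need an explicit `int()` conversion by the caller.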
- Python
Published by hiroshi-matsuda-rit about 2 years ago
ginza - Release v5.1.3
What's Changed
- Migrate to spaCy v3.6
- Beta release of `ja_ginza_bert_large`
Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.2...v5.1.3
How to Use ja_ginza_bert_large β1
- Prepare
Create a virtual-env to separate `ja_ginza_bert_large` from other GiNZA model environments. (`ja_ginza_bert_large` requires the latest `spacy-transformers` version, which is not compatible with `ja_ginza` or `ja_ginza_electra`.)
```Console
$ python -m venv venv_bert_large
$ source venv_bert_large/bin/activate
```
- Install
```Console
$ pip install "https://github.com/megagonlabs/ginza/releases/download/v5.1.3/ja_ginza_bert_large-5.1.3b1-py3-none-any.whl"
```
For CUDA environments, you need to upgrade spacy with the CUDA version number as follows:
```Console
$ pip install -U spacy[cuda117]
```
- Analyze
```Console
$ ginza -g 0 -b ja_ginza_bert_large 銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
```
- Python
Published by hiroshi-matsuda-rit almost 3 years ago
ginza - Release v5.1.2
What's Changed
- add pytest github actions workflow by @r-terada in https://github.com/megagonlabs/ginza/pull/241
- Migrate to spaCy v3.4 by @hiroshi-matsuda-rit in https://github.com/megagonlabs/ginza/pull/250
New Contributors
- @ftnext made their first contribution in https://github.com/megagonlabs/ginza/pull/239
- @wafuwafu13 made their first contribution in https://github.com/megagonlabs/ginza/pull/244
Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.1...v5.1.2
- Python
Published by hiroshi-matsuda-rit almost 4 years ago
ginza - Release v5.1.1
What's Changed
- auto deploy for pypi by @nimiusrd in https://github.com/megagonlabs/ginza/pull/184
- modify github actions: trigger by tagging, stop uploading test pypi by @r-terada in https://github.com/megagonlabs/ginza/pull/233
New Contributors
- @sinozu made their first contribution in https://github.com/megagonlabs/ginza/pull/230
- @wataruhashimoto52 made their first contribution in https://github.com/megagonlabs/ginza/pull/236
Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.0...v5.1.1
- Python
Published by hiroshi-matsuda-rit over 4 years ago
ginza - Release v5.1.0
ginza-5.1.0
- 2021-12-10, Euclase
- Important changes
  - Upgrade: spaCy v3.2 and Sudachi.rs (SudachiPy v0.6.2)
  - Change token information fields #208 #209
    - `doc.user_data["reading_forms"][token.i]` -> `token.morph.get("Reading")`
    - `doc.user_data["inflections"][token.i]` -> `token.morph.get("Inflection")`
    - `force_using_normalized_form_as_lemma(True)` -> `token.norm_`
  - All spaCy models, including non-Japanese, are now available with the ginza command #217
    - Download and analyze the model at once by specifying the model name in the following form #219
      - `ginza -m en_core_web_md`
  - `ginza -f json` option always analyzes lines that start with `#`, regardless of the value of the `-c` option. #215
- Improvements
  - Batch analysis processing speeds up by 50-60% in GPU environment and 10-40% in CPU environment
    - Improved processing efficiency of parallel execution options (`ginza -p {n_process}` and `ginzame`) of ginza command #204
  - add tests #198 #210 #214
  - add benchmark #207 #220
- Python
Published by hiroshi-matsuda-rit over 4 years ago
ginza - Release v5.0.3
ginza-5.0.3
- 2021-10-15
- Bug fix
  - Bunsetu span should not cross the sentence boundary #195
- Python
Published by hiroshi-matsuda-rit over 4 years ago
ginza - Release v5.0.2
ginza-5.0.2
- 2021-09-06
- Bug fix
  - Command Line -s option and set_split_mode() not working in v5.0.x #185
- Python
Published by hiroshi-matsuda-rit over 4 years ago
ginza - Release v5.0.1
ginza-5.0.1
- 2021-08-26
- Bug fix
  - ginzame not working in ginza ver. 5 #179
  - Command Line -d option not working in v5.0.0 #178
- Improvement
  - accept `ja-ginza` and `ja-ginza-electra` for `-m` option of `ginza` command
- Python
Published by hiroshi-matsuda-rit almost 5 years ago
ginza - Release v5.0.0
ginza-5.0.0
- 2021-08-26, Demantoid
- Important changes
  - Upgrade spaCy to v3
    - Release transformer-based `ja-ginza-electra` model
    - Improve UPOS accuracy of the standard `ja-ginza` model by adding `morphologizer` to the tail of the spaCy pipeline
  - Need to install an analysis model along with the `ginza` package
    - High accuracy model (>=16GB memory needed)
      - `pip install -U ginza ja-ginza-electra`
    - Speed oriented model
      - `pip install -U ginza ja-ginza`
  - Change component names of `CompoundSplitter` and `BunsetuRecognizer` to `compound_splitter` and `bunsetu_recognizer` respectively
  - Also see spaCy v3 Backwards Incompatibilities
- Improvements
  - Add command line options
    - `-n`
      - Force using SudachiPy's `normalized_form` as `Token.lemma_`
    - `-m (ja_ginza|ja_ginza_electra)`
      - Select model package
  - Revise ENE category name
    - `Degital_Game` to `Digital_Game`
- Python
Published by hiroshi-matsuda-rit almost 5 years ago
ginza - ginza-4.0.5
ginza-4.0.5
- 2020-10-01
- Improvements
  - Add `-d` option, which disables spaCy's sentence separator, to `ginza` command line tool
- Python
Published by hiroshi-matsuda-rit over 5 years ago
ginza - ginza-4.0.4
ginza-4.0.4
- 2020-09-11
- Improvements
  - `ginza` command line tool works correctly without BunsetuRecognizer in the pipeline
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - ginza-4.0.3
ginza-4.0.3
- 2020-09-10
- Improve bunsetu head identification accuracy over inconsistent deps in ent spans
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - ginza-4.0.2
ginza-4.0.2
- 2020-09-04
- Improvements
  - Serialization of `CompoundSplitter` for `nlp.to_disk()`
  - Bunsetu span detection accuracy
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - ginza-4.0.1
ginza-4.0.1
- 2020-08-30
- Debug
  - Add type arguments for singledispatch register annotations (for Python 3.6)
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - LUW-4.0.0
The Ninjal's LUW (long-unit-word) NER model for GiNZA v4 and SudachiPy mode A. The license of this model is the same as GiNZA and its models.
Usage: $ ginza -b ja_luw-4.0.0/
Accuracy:
```
ent_f1:SPAN_LABEL=0.9551,SPAN_ONLY=0.9784
ent_recall:SPAN_LABEL=0.9524,SPAN_ONLY=0.9757
ent_precision:SPAN_LABEL=0.9578,SPAN_ONLY=0.9812
ent_confusion URL(36): URL=30, _=3, 名詞-固有名詞-一般=2, 名詞-固有名詞-人名-一般=1 web誤脱(31): _=14, 名詞-普通名詞-一般=7, 動詞-一般=4, 助動詞=2, 助詞-接続助詞=1, 助詞-格助詞=1, 助詞-終助詞=1, 感動詞-一般=1 代名詞(1664): 代名詞=1606, _=43, 名詞-普通名詞-一般=11, 名詞-固有名詞-人名-名=1, 形状詞-一般=1, 副詞=1, 動詞-一般=1 副詞(2841): 副詞=2604, _=114, 名詞-普通名詞-一般=53, 動詞-一般=17, 接続詞=14, 形容詞-一般=11, 名詞-固有名詞-人名-一般=10, 名詞-数詞=6, 形状詞-一般=4, 助詞-格助詞=2, 代名詞=2, 感動詞-一般=2, 連体詞=1, 名詞-固有名詞-一般=1 助動詞(15394): 助動詞=15097, _=140, 助詞-格助詞=111, 助詞-副助詞=13, 動詞-一般=10, 助詞-終助詞=6, 形容詞-一般=6, 名詞-普通名詞-一般=5, 助詞-接続助詞=4, 助詞-準体助詞=1, 接続詞=1 助詞-係助詞(4989): 助詞-係助詞=4906, _=80, 助詞-副助詞=1, 助動詞=1, 感動詞-一般=1 助詞-副助詞(1841): 助詞-副助詞=1790, 助詞-終助詞=24, _=17, 助詞-接続助詞=4, 副詞=2, 動詞-一般=1, 形容詞-一般=1, 助動詞=1, 接続詞=1 助詞-接続助詞(3354): 助詞-接続助詞=3201, _=105, 助詞-格助詞=41, 助動詞=5, 名詞-普通名詞-一般=1, 助詞-終助詞=1 助詞-格助詞(21539): 助詞-格助詞=21268, _=159, 助動詞=72, 助詞-接続助詞=30, 助詞-準体助詞=4, 助詞-終助詞=3, 接続詞=2, 名詞-固有名詞-人名-名=1 助詞-準体助詞(576): 助詞-準体助詞=565, _=5, 助詞-格助詞=5, 助詞-終助詞=1 助詞-終助詞(1483): 助詞-終助詞=1443, _=14, 助詞-副助詞=9, 助動詞=5, 名詞-普通名詞-一般=5, 助詞-接続助詞=2, 形容詞-一般=2, 助詞-準体助詞=1, 副詞=1, 助詞-格助詞=1 動詞-一般(12483): 動詞-一般=12005, _=364, 名詞-普通名詞-一般=72, 形容詞-一般=21, 副詞=10, 助動詞=2, 感動詞-一般=2, 形状詞-一般=2, 連体詞=1, 接続詞=1, 助詞-副助詞=1, 代名詞=1, 名詞-固有名詞-地名-一般=1 名詞-助動詞語幹(29): 名詞-助動詞語幹=27, 形状詞-助動詞語幹=2 名詞-固有名詞-一般(540): 名詞-固有名詞-一般=306, 名詞-普通名詞-一般=141, _=61, 名詞-固有名詞-地名-一般=13, 名詞-固有名詞-人名-一般=9, 副詞=4, 名詞-固有名詞-人名-姓=2, 名詞-固有名詞-地名-国=1, 名詞-固有名詞-人名-名=1, 動詞-一般=1, 形状詞-一般=1 名詞-固有名詞-人名-一般(459): 名詞-普通名詞-一般=174, 名詞-固有名詞-人名-一般=141, _=55, 名詞-固有名詞-人名-姓=32, 名詞-固有名詞-一般=16, 代名詞=15, 名詞-固有名詞-人名-名=13, 名詞-固有名詞-地名-一般=5, 記号-文字=2, 補助記号-括弧開 =2, 接尾辞-名詞的-一般=1, 副詞=1, 形容詞-一般=1, 動詞-一般=1 名詞-固有名詞-人名-名(497): 名詞-普通名詞-一般=345, 名詞-固有名詞-人名-名=73, 名詞-固有名詞-人名-姓=44, _=18, 名詞-固有名詞-人名-一般=7, 動詞-一般=3, 副詞=2, 名詞-固有名詞-地名-一般=1, 形容詞-一般=1, 名詞-数詞=1, 名詞-固有名詞-一 般=1, 感動詞-一般=1 名詞-固有名詞-人名-姓(364): 名詞-固有名詞-人名-姓=194, 名詞-普通名詞-一般=108, 名詞-固有名詞-人名-名=26, 名詞-固有名詞-地名-一般=14, _=12, 名詞-固有名詞-人名-一般=2, 名詞-固有名詞-一般=2, 副詞=2, 形容詞-一般=2, 動詞-一般=1, 名詞- 数詞=1 名詞-固有名詞-地名-一般(409): 名詞-固有名詞-地名-一般=257, 名詞-普通名詞-一般=89, _=24, 名詞-固有名詞-人名-一般=14, 名詞-固有名詞-一般=11, 副詞=4, 名詞-固有名詞-地名-国=2, 名詞-固有名詞-人名-姓=2, 
形状詞-一般=1, 動詞-一般=1, 形容 詞-一般=1, 感動詞-一般=1, 接続詞=1, 名詞-数詞=1 名詞-固有名詞-地名-国(230): 名詞-固有名詞-地名-国=215, _=7, 名詞-普通名詞-一般=3, 名詞-固有名詞-人名-一般=2, 名詞-固有名詞-地名-一般=2, 名詞-固有名詞-一般=1 名詞-数詞(2308): 名詞-数詞=2096, _=165, 名詞-普通名詞-一般=34, 補助記号-AA-顔文字=4, 補助記号-一般=3, 名詞-固有名詞-地名-一般=2, 感動詞-一般=1, 接続詞=1, 補助記号-括弧閉=1, 副詞=1 名詞-普通名詞-一般(24746): 名詞-普通名詞-一般=23234, _=1019, 名詞-固有名詞-一般=121, 形状詞-一般=83, 動詞-一般=64, 名詞-固有名詞-人名-一般=51, 副詞=42, 名詞-数詞=36, 名詞-固有名詞-地名-一般=27, 形容詞-一般=15, 名詞-固有名詞-人名- 姓=14, 名詞-固有名詞-人名-名=11, 感動詞-一般=5, 補助記号-一般=4, 名詞-固有名詞-地名-国=4, 代名詞=3, 形状詞-助動詞語幹=3, 助詞-終助詞=2, 助詞-格助詞=2, 助詞-副助詞=2, 形状詞-タリ=1, 補助記号-句点=1, 接続詞=1, 補助記号-括弧閉=1 形容詞-一般(1646): 形容詞-一般=1515, _=56, 動詞-一般=33, 名詞-普通名詞-一般=20, 副詞=10, 助動詞=5, 形状詞-一般=2, 名詞-固有名詞-人名-一般=2, 感動詞-一般=1, 代名詞=1, 名詞-数詞=1 形状詞-タリ(18): 形状詞-タリ=7, 名詞-普通名詞-一般=5, _=3, 代名詞=1, 動詞-一般=1, 感動詞-一般=1 形状詞-一般(1582): 形状詞-一般=1454, 名詞-普通名詞-一般=84, _=27, 副詞=5, 形容詞-一般=5, 動詞-一般=3, 名詞-固有名詞-人名-姓=1, 連体詞=1, 感動詞-一般=1, 助動詞=1 形状詞-助動詞語幹(465): 形状詞-助動詞語幹=450, _=10, 名詞-助動詞語幹=3, 副詞=2 感動詞-フィラー(5): 感動詞-一般=3, 感動詞-フィラー=1, 接続詞=1 感動詞-一般(161): 感動詞-一般=120, 名詞-普通名詞-一般=12, _=7, 形状詞-一般=4, 形容詞-一般=4, 動詞-一般=3, 副詞=3, 接続詞=2, 代名詞=2, 補助記号-一般=2, 助詞-終助詞=1, 名詞-固有名詞-人名-一般=1 接尾辞-名詞的-一般(17): 接尾辞-名詞的-一般=7, 名詞-普通名詞-一般=6, _=4 接尾辞-形容詞的(1): 名詞-普通名詞-一般=1 接続詞(814): 接続詞=768, 副詞=31, _=11, 助詞-格助詞=2, 形容詞-一般=1, 名詞-普通名詞-一般=1 接頭辞(2): _=1, 名詞-普通名詞-一般=1 未知語(9): 名詞-固有名詞-一般=7, _=2 漢文(1): 助詞-係助詞=1 英単語(35): _=21, 名詞-固有名詞-一般=7, 名詞-普通名詞-一般=5, 補助記号-一般=1, 名詞-固有名詞-地名-一般=1 補助記号-一般(1926): 補助記号-一般=1730, _=183, 補助記号-括弧閉=2, 助詞-終助詞=2, 助動詞=2, 動詞-一般=2, 感動詞-一般=2, 補助記号-AA-顔文字=1, 形容詞-一般=1, 名詞-固有名詞-人名-一般=1 補助記号-句点(6322): 補助記号-句点=6215, _=107 補助記号-括弧閉(2104): 補助記号-括弧閉=2100, _=4 補助記号-括弧開(2067): 補助記号-括弧開=2064, _=3 補助記号-読点(6992): 補助記号-読点=6991, _=1 補助記号-AA-顔文字(103): 補助記号-AA-顔文字=73, _=8, 感動詞-一般=4, 名詞-普通名詞-一般=4, 副詞=4, 補助記号-一般=3, 名詞-数詞=3, 補助記号-括弧開=2, 補助記号-句点=1, 補助記号-AA-一般=1 言いよどみ(2): 助詞-終助詞=1, 名詞-普通名詞-一般=1 記号-一般(47): _=33, 名詞-普通名詞-一般=5, 名詞-固有名詞-一般=3, 記号-一般=3, 名詞-数詞=1, 英単語=1, 補助記号-一般=1 記号-文字(103): _=58, 記号-文字=34, 名詞-普通名詞-一般=8, 補助記号-AA-一般=2, 名詞-数詞=1 連体詞(1088): 連体詞=1069, _=13, 動詞-一般=4, 
形容詞-一般=1, 副詞=1
```
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - ginza-4.0.0
ginza-4.0.0
- 2020-08-16, Chrysoberyl
- Important changes
  - Replace Japanese model with `spacy.lang.ja` of spaCy v2.3
    - Replace values of `Token.lemma_` with the output of SudachiPy's `Morpheme.dictionary_form()`
  - Replace ja_ginza_dict with official SudachiDict-core package
    - You can safely delete the `ja_ginza_dict` package
  - Change options and misc field contents of output of command line tool
    - Delete use_sentence_separator(-s) option
    - NE(OntoNotes) BI labels as `B-GPE`
    - Add subfields: Reading, Inf(inflection) and ENE(Extended NE)
  - Obsolete `Token._.*` and add some entries for `Doc.user_data[]` and accessors
    - inflections (`ginza.inflection(Token)`)
    - reading_forms (`ginza.reading_form(Token)`)
    - bunsetu_bi_labels (`ginza.bunsetu_bi_label(Token)`)
    - bunsetu_position_types (`ginza.bunsetu_position_type(Token)`)
    - bunsetu_heads (`ginza.is_bunsetu_head(Token)`)
  - Change pipeline architecture
    - JapaneseCorrector was obsoleted
    - Add CompoundSplitter and BunsetuRecognizer
  - Upgrade UD_JAPANESE-BCCWJ to v2.6
  - Change word2vec to chiVe mc90
- API Changes
  - Add bunsetu-unit APIs (`from ginza import *`)
    - bunsetu(Token)
    - phrase(Token)
    - sub_phrases(Token)
    - phrases(Span)
    - bunsetu_spans(Span)
    - bunsetu_phrase_spans(Span)
    - bunsetu_head_list(Span)
    - bunsetu_head_tokens(Span)
    - bunsetu_bi_labels(Span)
    - bunsetu_position_types(Span)
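The bunsetu BI labels listed above mark where each bunsetu begins (`B`) and continues (`I`). To illustrate what those labels encode, here is a minimal sketch in plain Python that groups surface forms into bunsetu chunks. It does not call GiNZA at all; `group_bunsetu` is a hypothetical helper, and in real code the bunsetu-unit APIs such as `bunsetu_spans(Span)` would be the natural choice. The tokens and labels are copied from the v5.2.0 sample output in this changelog.

```python
def group_bunsetu(tokens, bi_labels):
    """Concatenate token texts into bunsetu chunks using B/I labels."""
    chunks = []
    for text, label in zip(tokens, bi_labels):
        # "B" opens a new bunsetu; "I" extends the current one.
        # A leading "I" with no open chunk is treated as a start as well.
        if label == "B" or not chunks:
            chunks.append(text)
        else:
            chunks[-1] += text
    return chunks


tokens = ["銀座", "で", "ランチ", "を", "ご", "一緒", "し", "ましょう", "。"]
labels = ["B", "I", "B", "I", "B", "I", "I", "I", "I"]
print(group_bunsetu(tokens, labels))
```

Concatenating without a separator suits Japanese text, which the sample output marks with `SpaceAfter=No` on every token.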
- Python
Published by hiroshi-matsuda-rit almost 6 years ago
ginza - ginza-3.1.1
ginza-3.1.1
- 2020-01-19
- API Changes
  - Extension fields
    - The values of the `Token._.sudachi` field are set only after calling `SudachipyTokenizer.set_enable_ex_sudachi(True)`, to avoid serialization errors
```
import pickle

import spacy

nlp = spacy.load('ja_ginza')

# Default: Token._.sudachi is not populated, so the Doc serializes cleanly.
doc1 = nlp('This example will be serialized correctly.')
doc1.to_bytes()
with open('sample1.pickle', 'wb') as f:
    pickle.dump(doc1, f)

# After enabling the extension field, serialization fails.
nlp.tokenizer.set_enable_ex_sudachi(True)
doc2 = nlp('This example will cause a serialization error.')
doc2.to_bytes()
with open('sample2.pickle', 'wb') as f:
    pickle.dump(doc2, f)
```
- Python
Published by hiroshi-matsuda-rit over 6 years ago
ginza - ginza-3.1.0
ginza-3.1.0
- 2020-01-16
- Important changes
  - Distribute `ja_ginza_dict` from PyPI
- API Changes
  - commands
    - `ginza` and `ginzame`
      - add `-i` option to initialize the files of `ja_ginza_dict`
- Python
Published by hiroshi-matsuda-rit over 6 years ago
ginza - ginza-3.0.0
ginza-3.0.0
- 2020-01-15
- Important changes
  - Distribute `ginza` and `ja_ginza` from PyPI
    - Simple installation; `pip install ginza`, and run `ginza`
    - The model package, `ja_ginza`, is also available from PyPI.
  - Model improvements
    - Change NER training data-set to GSK2014-A (2019) BCCWJ edition
      - Improved accuracy of NER
      - `token.ent_type_` value is changed to Sekine's Extended Named Entity Hierarchy
        - Add `ENE7` attribute to the last field of the output of `ginza`
      - Move OntoNotes5-based label to `token._.ne`
        - We extended the OntoNotes5 named entity labels with `PHONE`, `EMAIL`, `URL`, and `PET_NAME`
    - Overall accuracy is improved by executing `spacy pretrain` over 100 epochs
      - Multi-task learning of `spacy train` effectively working on UD Japanese BCCWJ
    - The newest `SudachiDict_core-20191224`
  - `ginzame`
    - Execute `sudachipy` by `multiprocessing.Pool` and output results with `mecab` like format
    - Now `sudachipy` command requires additional SudachiDict package installation
- Breaking API Changes
  - commands
    - `ginza` (`ginza.command_line.main_ginza`)
      - change option `mode` to `sudachipy_mode`
      - drop options: `disable_pipes` and `recreate_corrector`
      - add options: `hash_comment`, `parallel`, `files`
      - add `mecab` to the choices for the argument of `-f` option
      - add `parallel NUM_PROCESS` option (EXPERIMENTAL)
      - add `ENE7` attribute to conllu miscellaneous field
        - `ginza.ent_type_mapping.ENE_NE_MAPPING` is used to convert `ENE7` label to `NE`
    - add `ginzame` (`ginza.command_line.main_ginzame`)
      - a multi-process tokenizer providing `mecab` like output format
  - spaCy field extensions
    - add `token._.ne` for ner label
  - `ginza/sudachipy_tokenizer.py`
    - change `SudachiTokenizer` to `SudachipyTokenizer`
    - use `SUDACHI_DEFAULT_SPLIT_MODE` instead of `SUDACHI_DEFAULT_SPLITMODE` or `SUDACHI_DEFAULT_MODE`
- Dependencies
  - upgrade `spacy` to v2.2.3
  - upgrade `sudachipy` to v0.4.2
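The notes above say that a mapping table converts `ENE7` labels to `NE` labels. As a rough picture of what such a conversion involves, here is a minimal sketch in plain Python. The function `ene_to_ne` and the one-entry `toy_mapping` are hypothetical stand-ins: the contents and key format of the real `ginza.ent_type_mapping.ENE_NE_MAPPING` are not shown in these notes, and the `City` to `GPE` pair is only inferred from the sample output elsewhere in this changelog, where `ENE=B-City` appears next to `NE=B-GPE`. Leaving unknown categories unchanged is also an assumption of this sketch.

```python
def ene_to_ne(ene_label, mapping):
    """Map a BIO-prefixed ENE label (e.g. "B-City") to an NE label."""
    prefix, sep, category = ene_label.partition("-")
    if not sep:
        # No BIO prefix ("O", "_" or a bare category): map the label itself.
        return mapping.get(ene_label, ene_label)
    # Keep the B/I prefix and translate only the category part.
    return prefix + "-" + mapping.get(category, category)


# Hypothetical one-entry mapping used purely for illustration.
toy_mapping = {"City": "GPE"}
print(ene_to_ne("B-City", toy_mapping))
print(ene_to_ne("O", toy_mapping))
```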
- Python
Published by hiroshi-matsuda-rit over 6 years ago
ginza - ginza-2.2.1
ginza-2.2.1
- 2019-10-28
- Improvements
  - JapaneseCorrector can merge the `as_*` type dependencies completely
- Bug fixes
  - command line tool failed in specific situations
- Python
Published by hiroshi-matsuda-rit over 6 years ago
ginza - ginza-2.2.0
ginza-2.2.0
- 2019-10-04, Ametrine
- Important changes
  - `split_mode` has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
    - This bug caused `split_mode` incompatibility between the training phase and the `ginza` command.
    - `split_mode` was set to 'B' for the training phase and Python APIs, but 'C' for the `ginza` command.
    - We fixed this bug by setting the default `split_mode` to 'C' entirely.
    - This fix may cause word segmentation incompatibilities when upgrading GiNZA from v2.0.0 to v2.2.0.
- New features
  - Add `-f` and `--output-format` option to `ginza` command:
    - `-f 0` or `-f conllu`: CoNLL-U Syntactic Annotation format
    - `-f 1` or `-f cabocha`: cabocha -f1 compatible format
  - Add custom token fields:
    - `bunsetu_index`: bunsetu index starting from 0
    - `reading`: reading of token (not a pronunciation)
    - `sudachi`: SudachiPy's morpheme instance (or its list when the tokens are gathered by JapaneseCorrector)
- Performance improvements
  - Tokenizer
    - Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
    - Use Cythonized SudachiPy (v0.4.0)
  - Dependency parser
    - Apply `spacy pretrain` command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
    - Apply multitask objectives by using `-pt 'tag,dep'` option of `spacy train`
- New model file
  - ja_ginza-2.2.0.tar.gz
- Python
Published by hiroshi-matsuda-rit over 6 years ago
ginza - ginza-2.0.0
ginza-2.0.0 (2019-07-08)
- Add `ginza` command
  - run `ginza` from the console
- Change package structure
  - module package as `ginza`
  - language model package as `ja_ginza`
  - `spacy.lang.ja` is overridden by `ginza`
- Remove `sudachipy` related directories
  - SudachiPy and its dictionary are installed via `pip` during `ginza` installation
- User dictionary available
- Token extension fields
  - Added
    - `token._.bunsetu_bi_label`, `token._.bunsetu_position_type`
  - Remained
    - `token._.inf`
  - Removed
    - `pos_detail` (same value is set to `token.tag_`)
- Python
Published by hiroshi-matsuda-rit almost 7 years ago
ginza - Add new era 'reiwa' to system_core.dic
- Python
Published by hiroshi-matsuda-rit about 7 years ago
ginza - GiNZA NLP formal release version
- Python
Published by hiroshi-matsuda-rit about 7 years ago