Recent Releases of ginza

ginza - Release v5.2.0

What's Changed

  • Require python>=3.8
  • Migrate to spaCy v3.7
  • New functionality
    • add Japanese clause recognition API (experimental)

Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.3...v5.2.0

How to Use ja_ginza_bert_large β1

  • Prepare
    Create a virtual-env to separate ja_ginza_bert_large from other GiNZA model environments. (ja_ginza_bert_large requires the latest spacy-transformers version, which is not compatible with ja_ginza or ja_ginza_electra.)
    $ python -m venv venv_bert_large
    $ source venv_bert_large/bin/activate

  • Install
    $ pip install "https://github.com/megagonlabs/ginza/releases/download/v5.2.0/ja_ginza_bert_large-5.2.0b1-py3-none-any.whl"

    For CUDA environments, you need to upgrade spacy with the CUDA version number as follows:
    $ pip install -U spacy[cuda117]

  • Analyze
```Console
$ ginza -g 0 -b ja_ginza_bert_large
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City|ClauseHead=6
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ|ClauseHead=6
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ|ClauseHead=6
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ|ClauseHead=6
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ|ClauseHead=6
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ|ClauseHead=6
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ|ClauseHead=6
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ|ClauseHead=6
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。|ClauseHead=6
```
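The last column of each output line (the CoNLL-U MISC field) packs several `Key=Value` entries separated by `|`. As a minimal sketch of how that column could be unpacked in Python, the helper below (`parse_misc` is a hypothetical name, not part of GiNZA) turns it into a dictionary. Treating entries without `=` as boolean flags is an assumption made for this illustration.

```python
def parse_misc(misc: str) -> dict:
    """Parse a CoNLL-U MISC string into a dict (hypothetical helper, not a GiNZA API)."""
    fields = {}
    if misc == "_":  # CoNLL-U uses "_" for an empty field
        return fields
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        # Entries without "=" are stored as True (an assumption of this sketch)
        fields[key] = value if sep else True
    return fields


# MISC column of token 6 (一緒) from the output above
misc = "SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ|ClauseHead=6"
fields = parse_misc(misc)
print(fields["Reading"], fields["BunsetuPositionType"], fields["ClauseHead"])
```

Values are kept as strings, so a numeric field such as `ClauseHead` needs an explicit `int(...)` if it is to be compared with token IDs.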

- Python
Published by hiroshi-matsuda-rit about 2 years ago

ginza - Release v5.1.3

What's Changed

  • Migrate to spaCy v3.6
  • Beta release of ja_ginza_bert_large

Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.2...v5.1.3

How to Use ja_ginza_bert_large β1

  • Prepare
    Create a virtual-env to separate ja_ginza_bert_large from other GiNZA model environments. (ja_ginza_bert_large requires the latest spacy-transformers version, which is not compatible with ja_ginza or ja_ginza_electra.)
    $ python -m venv venv_bert_large
    $ source venv_bert_large/bin/activate

  • Install
    $ pip install "https://github.com/megagonlabs/ginza/releases/download/v5.1.3/ja_ginza_bert_large-5.1.3b1-py3-none-any.whl"

    For CUDA environments, you need to upgrade spacy with the CUDA version number as follows:
    $ pip install -U spacy[cuda117]

  • Analyze
```Console
$ ginza -g 0 -b ja_ginza_bert_large
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
```
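The `BunsetuBILabel` entries in the output mark where each bunsetu begins (`B`) and continues (`I`). The following is a rough sketch, assuming tab-separated CoNLL-U text as input, of how the token forms could be regrouped into bunsetu chunks. `bunsetu_chunks` is a hypothetical helper written for this illustration and is not a GiNZA function.

```python
from typing import List

# Sample taken from the output above; it is written with single spaces for
# readability and converted to tabs, the separator CoNLL-U normally uses.
SAMPLE = """\
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 obl _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6 一緒 一緒 VERB 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
""".replace(" ", "\t")


def bunsetu_chunks(conllu: str) -> List[str]:
    """Concatenate token forms into bunsetu chunks using BunsetuBILabel (hypothetical helper)."""
    chunks: List[str] = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines such as "# text = ..."
        columns = line.split("\t")
        form, misc = columns[1], columns[9]
        label = None
        for item in misc.split("|"):
            if item.startswith("BunsetuBILabel="):
                label = item.split("=", 1)[1]
        if label == "B" or not chunks:
            chunks.append(form)  # a "B" label opens a new bunsetu
        else:
            chunks[-1] += form   # any other label extends the current bunsetu
    return chunks


chunks = bunsetu_chunks(SAMPLE)
print(" / ".join(chunks))
```

Inside Python, the bunsetu-unit APIs listed in the ginza-4.0.0 notes (for example `bunsetu_spans(Span)`) are the more direct route. This sketch only covers the case where nothing but the command-line output is available.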

- Python
Published by hiroshi-matsuda-rit almost 3 years ago

ginza - Release v5.1.2

What's Changed

  • add pytest github actions workflow by @r-terada in https://github.com/megagonlabs/ginza/pull/241
  • Migrate to spaCy v3.4 by @hiroshi-matsuda-rit in https://github.com/megagonlabs/ginza/pull/250

New Contributors

  • @ftnext made their first contribution in https://github.com/megagonlabs/ginza/pull/239
  • @wafuwafu13 made their first contribution in https://github.com/megagonlabs/ginza/pull/244

Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.1...v5.1.2

- Python
Published by hiroshi-matsuda-rit almost 4 years ago

ginza - Release v5.1.1

What's Changed

  • auto deploy for pypi by @nimiusrd in https://github.com/megagonlabs/ginza/pull/184
  • modify github actions: trigger by tagging, stop uploading test pypi by @r-terada in https://github.com/megagonlabs/ginza/pull/233

New Contributors

  • @sinozu made their first contribution in https://github.com/megagonlabs/ginza/pull/230
  • @wataruhashimoto52 made their first contribution in https://github.com/megagonlabs/ginza/pull/236

Full Changelog: https://github.com/megagonlabs/ginza/compare/v5.1.0...v5.1.1

- Python
Published by hiroshi-matsuda-rit over 4 years ago

ginza - Release v5.1.0

ginza-5.1.0

  • 2021-12-10, Euclase
  • Important changes
    • Upgrade: spaCy v3.2 and Sudachi.rs(SudachiPy v0.6.2)
    • Change token information fields #208 #209
      • doc.user_data["reading_forms"][token.i] -> token.morph.get("Reading")
      • doc.user_data["inflections"][token.i] -> token.morph.get("Inflection")
      • force_using_normalized_form_as_lemma(True) -> token.norm_
    • All spaCy models, including non-Japanese, are now available with the ginza command #217
    • Download the model and run the analysis in one step by specifying the model name in the following form #219
      • ginza -m en_core_web_md
    • ginza -f json option always analyzes lines that start with #, regardless of the value of the -c option. #215
  • Improvements
    • Batch analysis processing speeds up by 50-60% in GPU environment and 10-40% in CPU environment
    • Improved processing efficiency of parallel execution options (ginza -p {n_process} and ginzame) of ginza command #204
    • add tests #198 #210 #214
    • add benchmark #207 #220
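
Code that has to run against both the old and the new token field layout described under "Change token information fields" could use a small compatibility helper. The sketch below is an illustration only. `reading_of` is a hypothetical name, and the stand-in objects built with `SimpleNamespace` merely imitate the attributes named in the notes above (`token.morph.get(...)`, `doc.user_data[...]`, `token.i`) so that the example runs without a model installed.

```python
from types import SimpleNamespace


def reading_of(token):
    """Return a token's reading, preferring the v5.1.0 location (hypothetical helper)."""
    values = token.morph.get("Reading")  # spaCy v3 returns a list of strings
    if values:
        return values[0]
    # Fall back to the pre-5.1.0 location in doc.user_data
    readings = token.doc.user_data.get("reading_forms")
    return readings[token.i] if readings else None


# Stand-ins for spaCy objects, used here only so the sketch is self-contained
new_style = SimpleNamespace(
    i=0,
    morph=SimpleNamespace(get=lambda field: ["ギンザ"] if field == "Reading" else []),
    doc=SimpleNamespace(user_data={}),
)
old_style = SimpleNamespace(
    i=0,
    morph=SimpleNamespace(get=lambda field: []),
    doc=SimpleNamespace(user_data={"reading_forms": ["ギンザ"]}),
)
print(reading_of(new_style), reading_of(old_style))
```

With a real pipeline the token would come from `nlp(text)` instead of the stand-ins.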

- Python
Published by hiroshi-matsuda-rit over 4 years ago

ginza - Release v5.0.3

ginza-5.0.3

  • 2021-10-15
  • Bug fix
    • Bunsetu span should not cross the sentence boundary #195

- Python
Published by hiroshi-matsuda-rit over 4 years ago

ginza - Release v5.0.2

ginza-5.0.2

  • 2021-09-06
  • Bug fix
    • Command Line -s option and set_split_mode() not working in v5.0.x #185

- Python
Published by hiroshi-matsuda-rit over 4 years ago

ginza - Release v5.0.1

ginza-5.0.1

  • 2021-08-26
  • Bug fix
    • ginzame not working in ginza ver. 5 #179
    • Command Line -d option not working in v5.0.0 #178
  • Improvement
    • accept ja-ginza and ja-ginza-electra for -m option of ginza command

- Python
Published by hiroshi-matsuda-rit almost 5 years ago

ginza - Release v5.0.0

ginza-5.0.0

  • 2021-08-26, Demantoid
  • Important changes
    • Upgrade spaCy to v3
    • Release transformer-based ja-ginza-electra model
    • Improve UPOS accuracy of the standard ja-ginza model by adding morphologizer to the tail of spaCy pipeline
    • Need to install an analysis model along with the ginza package
    • High accuracy model (>=16GB memory needed)
      • pip install -U ginza ja-ginza-electra
    • Speed oriented model
      • pip install -U ginza ja-ginza
    • Change component names of CompoundSplitter and BunsetuRecognizer to compound_splitter and bunsetu_recognizer respectively
    • Also see spaCy v3 Backwards Incompatibilities
  • Improvements
    • Add command line options
      • -n
        • Force using SudachiPy's normalized_form as Token.lemma_
      • -m (ja_ginza|ja_ginza_electra)
        • Select model package
    • Revise ENE category name
      • Degital_Game to Digital_Game

- Python
Published by hiroshi-matsuda-rit almost 5 years ago

ginza - v4.0.6

ginza-4.0.6

  • 2021-06-01
  • Bug fix
    • Issue #160: IndexError: list assignment index out of range for empty string

- Python
Published by hiroshi-matsuda-rit about 5 years ago

ginza - ginza-4.0.5

ginza-4.0.5

  • 2020-10-01
  • Improvements
    • Add -d option, which disables spaCy's sentence separator, to ginza command line tool

- Python
Published by hiroshi-matsuda-rit over 5 years ago

ginza - ginza-4.0.4

ginza-4.0.4

  • 2020-09-11
  • Improvements
    • ginza command line tool works correctly without BunsetuRecognizer in the pipeline

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - ginza-4.0.3

ginza-4.0.3

  • 2020-09-10
  • Improve bunsetu head identification accuracy over inconsistent deps in ent spans

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - ginza-4.0.2

ginza-4.0.2

  • 2020-09-04
  • Improvements
    • Serialization of CompoundSplitter for nlp.to_disk()
    • Bunsetu span detection accuracy

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - ginza-4.0.1

ginza-4.0.1

  • 2020-08-30
  • Debug
    • Add type arguments for singledispatch register annotations (for Python 3.6)

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - LUW-4.0.0

The Ninjal's LUW (long-unit-word) NER model for GiNZA v4 and SudachiPy mode A. The license of this model is the same as GiNZA and its models.

Usage: $ ginza -b ja_luw-4.0.0/

Accuracy:
```
ent_f1:SPAN_LABEL=0.9551,SPAN_ONLY=0.9784
ent_recall:SPAN_LABEL=0.9524,SPAN_ONLY=0.9757
ent_precision:SPAN_LABEL=0.9578,SPAN_ONLY=0.9812

ent_confusion
URL(36): URL=30, _=3, 名詞-固有名詞-一般=2, 名詞-固有名詞-人名-一般=1
web誤脱(31): _=14, 名詞-普通名詞-一般=7, 動詞-一般=4, 助動詞=2, 助詞-接続助詞=1, 助詞-格助詞=1, 助詞-終助詞=1, 感動詞-一般=1
代名詞(1664): 代名詞=1606, _=43, 名詞-普通名詞-一般=11, 名詞-固有名詞-人名-名=1, 形状詞-一般=1, 副詞=1, 動詞-一般=1
副詞(2841): 副詞=2604, _=114, 名詞-普通名詞-一般=53, 動詞-一般=17, 接続詞=14, 形容詞-一般=11, 名詞-固有名詞-人名-一般=10, 名詞-数詞=6, 形状詞-一般=4, 助詞-格助詞=2, 代名詞=2, 感動詞-一般=2, 連体詞=1, 名詞-固有名詞-一般=1
助動詞(15394): 助動詞=15097, _=140, 助詞-格助詞=111, 助詞-副助詞=13, 動詞-一般=10, 助詞-終助詞=6, 形容詞-一般=6, 名詞-普通名詞-一般=5, 助詞-接続助詞=4, 助詞-準体助詞=1, 接続詞=1
助詞-係助詞(4989): 助詞-係助詞=4906, _=80, 助詞-副助詞=1, 助動詞=1, 感動詞-一般=1
助詞-副助詞(1841): 助詞-副助詞=1790, 助詞-終助詞=24, _=17, 助詞-接続助詞=4, 副詞=2, 動詞-一般=1, 形容詞-一般=1, 助動詞=1, 接続詞=1
助詞-接続助詞(3354): 助詞-接続助詞=3201, _=105, 助詞-格助詞=41, 助動詞=5, 名詞-普通名詞-一般=1, 助詞-終助詞=1
助詞-格助詞(21539): 助詞-格助詞=21268, _=159, 助動詞=72, 助詞-接続助詞=30, 助詞-準体助詞=4, 助詞-終助詞=3, 接続詞=2, 名詞-固有名詞-人名-名=1
助詞-準体助詞(576): 助詞-準体助詞=565, _=5, 助詞-格助詞=5, 助詞-終助詞=1
助詞-終助詞(1483): 助詞-終助詞=1443, _=14, 助詞-副助詞=9, 助動詞=5, 名詞-普通名詞-一般=5, 助詞-接続助詞=2, 形容詞-一般=2, 助詞-準体助詞=1, 副詞=1, 助詞-格助詞=1
動詞-一般(12483): 動詞-一般=12005, _=364, 名詞-普通名詞-一般=72, 形容詞-一般=21, 副詞=10, 助動詞=2, 感動詞-一般=2, 形状詞-一般=2, 連体詞=1, 接続詞=1, 助詞-副助詞=1, 代名詞=1, 名詞-固有名詞-地名-一般=1
名詞-助動詞語幹(29): 名詞-助動詞語幹=27, 形状詞-助動詞語幹=2
名詞-固有名詞-一般(540): 名詞-固有名詞-一般=306, 名詞-普通名詞-一般=141, _=61, 名詞-固有名詞-地名-一般=13, 名詞-固有名詞-人名-一般=9, 副詞=4, 名詞-固有名詞-人名-姓=2, 名詞-固有名詞-地名-国=1, 名詞-固有名詞-人名-名=1, 動詞-一般=1, 形状詞-一般=1
名詞-固有名詞-人名-一般(459): 名詞-普通名詞-一般=174, 名詞-固有名詞-人名-一般=141, _=55, 名詞-固有名詞-人名-姓=32, 名詞-固有名詞-一般=16, 代名詞=15, 名詞-固有名詞-人名-名=13, 名詞-固有名詞-地名-一般=5, 記号-文字=2, 補助記号-括弧開=2, 接尾辞-名詞的-一般=1, 副詞=1, 形容詞-一般=1, 動詞-一般=1
名詞-固有名詞-人名-名(497): 名詞-普通名詞-一般=345, 名詞-固有名詞-人名-名=73, 名詞-固有名詞-人名-姓=44, _=18, 名詞-固有名詞-人名-一般=7, 動詞-一般=3, 副詞=2, 名詞-固有名詞-地名-一般=1, 形容詞-一般=1, 名詞-数詞=1, 名詞-固有名詞-一般=1, 感動詞-一般=1
名詞-固有名詞-人名-姓(364): 名詞-固有名詞-人名-姓=194, 名詞-普通名詞-一般=108, 名詞-固有名詞-人名-名=26, 名詞-固有名詞-地名-一般=14, _=12, 名詞-固有名詞-人名-一般=2, 名詞-固有名詞-一般=2, 副詞=2, 形容詞-一般=2, 動詞-一般=1, 名詞-数詞=1
名詞-固有名詞-地名-一般(409): 名詞-固有名詞-地名-一般=257, 名詞-普通名詞-一般=89, _=24, 名詞-固有名詞-人名-一般=14, 名詞-固有名詞-一般=11, 副詞=4, 名詞-固有名詞-地名-国=2, 名詞-固有名詞-人名-姓=2, 形状詞-一般=1, 動詞-一般=1, 形容詞-一般=1, 感動詞-一般=1, 接続詞=1, 名詞-数詞=1
名詞-固有名詞-地名-国(230): 名詞-固有名詞-地名-国=215, _=7, 名詞-普通名詞-一般=3, 名詞-固有名詞-人名-一般=2, 名詞-固有名詞-地名-一般=2, 名詞-固有名詞-一般=1
名詞-数詞(2308): 名詞-数詞=2096, _=165, 名詞-普通名詞-一般=34, 補助記号-AA-顔文字=4, 補助記号-一般=3, 名詞-固有名詞-地名-一般=2, 感動詞-一般=1, 接続詞=1, 補助記号-括弧閉=1, 副詞=1
名詞-普通名詞-一般(24746): 名詞-普通名詞-一般=23234, _=1019, 名詞-固有名詞-一般=121, 形状詞-一般=83, 動詞-一般=64, 名詞-固有名詞-人名-一般=51, 副詞=42, 名詞-数詞=36, 名詞-固有名詞-地名-一般=27, 形容詞-一般=15, 名詞-固有名詞-人名-姓=14, 名詞-固有名詞-人名-名=11, 感動詞-一般=5, 補助記号-一般=4, 名詞-固有名詞-地名-国=4, 代名詞=3, 形状詞-助動詞語幹=3, 助詞-終助詞=2, 助詞-格助詞=2, 助詞-副助詞=2, 形状詞-タリ=1, 補助記号-句点=1, 接続詞=1, 補助記号-括弧閉=1
形容詞-一般(1646): 形容詞-一般=1515, _=56, 動詞-一般=33, 名詞-普通名詞-一般=20, 副詞=10, 助動詞=5, 形状詞-一般=2, 名詞-固有名詞-人名-一般=2, 感動詞-一般=1, 代名詞=1, 名詞-数詞=1
形状詞-タリ(18): 形状詞-タリ=7, 名詞-普通名詞-一般=5, _=3, 代名詞=1, 動詞-一般=1, 感動詞-一般=1
形状詞-一般(1582): 形状詞-一般=1454, 名詞-普通名詞-一般=84, _=27, 副詞=5, 形容詞-一般=5, 動詞-一般=3, 名詞-固有名詞-人名-姓=1, 連体詞=1, 感動詞-一般=1, 助動詞=1
形状詞-助動詞語幹(465): 形状詞-助動詞語幹=450, _=10, 名詞-助動詞語幹=3, 副詞=2
感動詞-フィラー(5): 感動詞-一般=3, 感動詞-フィラー=1, 接続詞=1
感動詞-一般(161): 感動詞-一般=120, 名詞-普通名詞-一般=12, _=7, 形状詞-一般=4, 形容詞-一般=4, 動詞-一般=3, 副詞=3, 接続詞=2, 代名詞=2, 補助記号-一般=2, 助詞-終助詞=1, 名詞-固有名詞-人名-一般=1
接尾辞-名詞的-一般(17): 接尾辞-名詞的-一般=7, 名詞-普通名詞-一般=6, _=4
接尾辞-形容詞的(1): 名詞-普通名詞-一般=1
接続詞(814): 接続詞=768, 副詞=31, _=11, 助詞-格助詞=2, 形容詞-一般=1, 名詞-普通名詞-一般=1
接頭辞(2): _=1, 名詞-普通名詞-一般=1
未知語(9): 名詞-固有名詞-一般=7, _=2
漢文(1): 助詞-係助詞=1
英単語(35): _=21, 名詞-固有名詞-一般=7, 名詞-普通名詞-一般=5, 補助記号-一般=1, 名詞-固有名詞-地名-一般=1
補助記号-一般(1926): 補助記号-一般=1730, _=183, 補助記号-括弧閉=2, 助詞-終助詞=2, 助動詞=2, 動詞-一般=2, 感動詞-一般=2, 補助記号-AA-顔文字=1, 形容詞-一般=1, 名詞-固有名詞-人名-一般=1
補助記号-句点(6322): 補助記号-句点=6215, _=107
補助記号-括弧閉(2104): 補助記号-括弧閉=2100, _=4
補助記号-括弧開(2067): 補助記号-括弧開=2064, _=3
補助記号-読点(6992): 補助記号-読点=6991, _=1
補助記号-AA-顔文字(103): 補助記号-AA-顔文字=73, _=8, 感動詞-一般=4, 名詞-普通名詞-一般=4, 副詞=4, 補助記号-一般=3, 名詞-数詞=3, 補助記号-括弧開=2, 補助記号-句点=1, 補助記号-AA-一般=1
言いよどみ(2): 助詞-終助詞=1, 名詞-普通名詞-一般=1
記号-一般(47): _=33, 名詞-普通名詞-一般=5, 名詞-固有名詞-一般=3, 記号-一般=3, 名詞-数詞=1, 英単語=1, 補助記号-一般=1
記号-文字(103): _=58, 記号-文字=34, 名詞-普通名詞-一般=8, 補助記号-AA-一般=2, 名詞-数詞=1
連体詞(1088): 連体詞=1069, _=13, 動詞-一般=4, 形容詞-一般=1, 副詞=1
```
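
The reported F1 values can be cross-checked against the precision and recall figures, since F1 is their harmonic mean. The sketch below does that, and also parses one confusion row to compute the share of items that kept their own label. Both function names are hypothetical. Reading each row as "label(total): counts per assigned label" is an assumption based on the layout of the listing, not something the release notes state.

```python
import re


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def parse_confusion_row(row: str):
    """Split 'LABEL(TOTAL): A=n, B=m' into (label, total, counts) (hypothetical helper)."""
    match = re.fullmatch(r"(.+)\((\d+)\):\s*(.+)", row.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {row!r}")
    label, total, body = match.group(1), int(match.group(2)), match.group(3)
    counts = {}
    for item in body.split(","):
        name, _, value = item.strip().rpartition("=")
        counts[name] = int(value)
    return label, total, counts


# Precision and recall as reported above
span_label_f1 = f1_score(0.9578, 0.9524)
span_only_f1 = f1_score(0.9812, 0.9757)
print(round(span_label_f1, 4), round(span_only_f1, 4))

label, total, counts = parse_confusion_row(
    "URL(36): URL=30, _=3, 名詞-固有名詞-一般=2, 名詞-固有名詞-人名-一般=1"
)
print(label, counts[label] / total)
```

Rounded to four digits, the recomputed values agree with the reported `ent_f1` figures (0.9551 and 0.9784), and the counts in the URL row add up to its stated total of 36.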

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - ginza-4.0.0

ginza-4.0.0

  • 2020-08-16, Chrysoberyl
  • Important changes
    • Replace Japanese model with spacy.lang.ja of spaCy v2.3
    • Replace values of Token.lemma_ with the output of SudachiPy's Morpheme.dictionary_form()
    • Replace ja_ginza_dict with official SudachiDict-core package
      • You can safely delete the ja_ginza_dict package
    • Change options and misc field contents of output of command line tool
      • Delete use_sentence_separator (-s) option
      • NE(OntoNotes) BI labels as B-GPE
      • Add subfields: Reading, Inf(inflection) and ENE(Extended NE)
    • Obsolete Token._.* and add some entries for Doc.user_data[] and accessors
      • inflections (ginza.inflection(Token))
      • reading_forms (ginza.reading_form(Token))
      • bunsetu_bi_labels (ginza.bunsetu_bi_label(Token))
      • bunsetu_position_types (ginza.bunsetu_position_type(Token))
      • bunsetu_heads (ginza.is_bunsetu_head(Token))
    • Change pipeline architecture
      • JapaneseCorrector was obsoleted
      • Add CompoundSplitter and BunsetuRecognizer
    • Upgrade UD_JAPANESE-BCCWJ to v2.6
    • Change word2vec to chiVe mc90
  • API Changes
    • Add bunsetu-unit APIs (from ginza import *)
      • bunsetu(Token)
      • phrase(Token)
      • sub_phrases(Token)
      • phrases(Span)
      • bunsetu_spans(Span)
      • bunsetu_phrase_spans(Span)
      • bunsetu_head_list(Span)
      • bunsetu_head_tokens(Span)
      • bunsetu_bi_labels(Span)
      • bunsetu_position_types(Span)

- Python
Published by hiroshi-matsuda-rit almost 6 years ago

ginza - ginza-3.1.1

ginza-3.1.1

  • 2020-01-19
  • API Changes
    • Extension fields
      • The values of the Token._.sudachi field are set only after calling SudachipyTokenizer.set_enable_ex_sudachi(True), to avoid serialization errors
```
import spacy
import pickle

nlp = spacy.load('ja_ginza')
doc1 = nlp('This example will be serialized correctly.')
doc1.to_bytes()
with open('sample1.pickle', 'wb') as f:
    pickle.dump(doc1, f)

# After this call, Token._.sudachi is populated and serialization fails
nlp.tokenizer.set_enable_ex_sudachi(True)
doc2 = nlp('This example will cause a serialization error.')
doc2.to_bytes()
with open('sample2.pickle', 'wb') as f:
    pickle.dump(doc2, f)
```

- Python
Published by hiroshi-matsuda-rit over 6 years ago

ginza - ginza-3.1.0

ginza-3.1.0

  • 2020-01-16
  • Important changes
    • Distribute ja_ginza_dict from PyPI
  • API Changes
    • commands
    • ginza and ginzame
      • add -i option to initialize the files of ja_ginza_dict

- Python
Published by hiroshi-matsuda-rit over 6 years ago

ginza - ginza-3.0.0

ginza-3.0.0

  • 2020-01-15
  • Important changes
    • Distribute ginza and ja_ginza from PyPI
    • Simple installation; pip install ginza, and run ginza
    • The model package, ja_ginza, is also available from PyPI.
    • Model improvements
    • Change NER training data-set to GSK2014-A (2019) BCCWJ edition
      • Improved accuracy of NER
      • token.ent_type_ value is changed to Sekine's Extended Named Entity Hierarchy
      • Add ENE7 attribute to the last field of the output of ginza
      • Move OntoNotes5 -based label to token._.ne
      • We extended the OntoNotes5 named entity labels with PHONE, EMAIL, URL, and PET_NAME
    • Overall accuracy is improved by executing spacy pretrain over 100 epochs
      • Multi-task learning of spacy train effectively working on UD Japanese BCCWJ
    • The newest SudachiDict_core-20191224
    • ginzame
    • Execute sudachipy by multiprocessing.Pool and output results with mecab like format
    • Now sudachipy command requires additional SudachiDict package installation
  • Breaking API Changes
    • commands
    • ginza (ginza.command_line.main_ginza)
      • change option mode to sudachipy_mode
      • drop options: disable_pipes and recreate_corrector
      • add options: hash_comment, parallel, files
      • add mecab to the choices for the argument of -f option
      • add parallel NUM_PROCESS option (EXPERIMENTAL)
      • add ENE7 attribute to conllu miscellaneous field
      • ginza.ent_type_mapping.ENE_NE_MAPPING is used to convert ENE7 label to NE
    • add ginzame (ginza.command_line.main_ginzame)
      • a multi-process tokenizer providing mecab like output format
    • spaCy field extensions
    • add token._.ne for ner label
    • ginza/sudachipy_tokenizer.py
    • change SudachiTokenizer to SudachipyTokenizer
    • use SUDACHI_DEFAULT_SPLIT_MODE instead of SUDACHI_DEFAULT_SPLITMODE or SUDACHI_DEFAULT_MODE
  • Dependencies
    • upgrade spacy to v2.2.3
    • upgrade sudachipy to v0.4.2

- Python
Published by hiroshi-matsuda-rit over 6 years ago

ginza - ginza-2.2.1

ginza-2.2.1

  • 2019-10-28
  • Improvements
    • JapaneseCorrector can merge the as_* type dependencies completely
  • Bug fixes
    • command line tool failed at the specific situations

- Python
Published by hiroshi-matsuda-rit over 6 years ago

ginza - ginza-2.2.0

ginza-2.2.0

  • 2019-10-04, Ametrine
  • Important changes
    • split_mode has been set incorrectly to sudachipy.tokenizer from v2.0.0 (#43)
    • This bug caused split_mode incompatibility between the training phase and the ginza command.
    • split_mode was set to 'B' for training phase and python APIs, but 'C' for ginza command.
    • We fixed this bug by setting the default split_mode to 'C' entirely.
    • This fix may cause the word segmentation incompatibilities during upgrading GiNZA from v2.0.0 to v2.2.0.
  • New features
    • Add -f and --output-format option to ginza command:
    • -f 0 or -f conllu : CoNLL-U Syntactic Annotation format
    • -f 1 or -f cabocha: cabocha -f1 compatible format
    • Add custom token fields:
    • bunsetu_index : bunsetu index starting from 0
    • reading: reading of token (not a pronunciation)
    • sudachi: SudachiPy's morpheme instance (or a list of them when the tokens are gathered by JapaneseCorrector)
  • Performance improvements
    • Tokenizer
    • Use latest SudachiDict (SudachiDict_core-20190927.tar.gz)
    • Use Cythonized SudachiPy (v0.4.0)
    • Dependency parser
    • Apply spacy pretrain command to capture the language model from UD-Japanese BCCWJ, UD_Japanese-PUD and KWDLC.
    • Apply multitask objectives by using -pt 'tag,dep' option of spacy train
    • New model file
    • ja_ginza-2.2.0.tar.gz

- Python
Published by hiroshi-matsuda-rit over 6 years ago

ginza - ginza-latest

v5.1.1

- Python
Published by hiroshi-matsuda-rit almost 7 years ago

ginza - ginza-2.0.0

ginza-2.0.0 (2019-07-08)

  • Add ginza command
    • run ginza from the console
  • Change package structure
    • module package as ginza
    • language model package as ja_ginza
    • spacy.lang.ja is overridden by ginza
  • Remove sudachipy related directories
    • SudachiPy and its dictionary are installed via pip during ginza installation
  • User dictionary available
  • Token extension fields
    • Added
    • token._.bunsetu_bi_label, token._.bunsetu_position_type
    • Remained
    • token._.inf
    • Removed
    • pos_detail (same value is set to token.tag_)

- Python
Published by hiroshi-matsuda-rit almost 7 years ago

ginza -

- Python
Published by hiroshi-matsuda-rit about 7 years ago

ginza - Add new era 'reiwa' to system_core.dic

- Python
Published by hiroshi-matsuda-rit about 7 years ago

ginza - GiNZA NLP formal release version

- Python
Published by hiroshi-matsuda-rit about 7 years ago