jrte-corpus

Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)

https://github.com/megagonlabs/jrte-corpus

Keywords

corpus japanese-language natural-language-processing sentiment-polarity textual-entailment

Repository

Basic Info
  • Host: GitHub
  • Owner: megagonlabs
  • License: other
  • Language: Python
  • Default Branch: master
  • Size: 1.71 MB
Statistics
  • Stars: 76
  • Watchers: 4
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created over 5 years ago · Last pushed over 2 years ago

README.md

Japanese Realistic Textual Entailment Corpus

Overview

Each example in this corpus is labeled according to whether the premise entails the hypothesis, as in the following example.

    Hypothesis: 部屋から海が見える。 (You can see the ocean from your room.)
    Premise   : 部屋はオーシャンビューで景色がよかったです。 (The room had an ocean view and a nice view.)
    Label     : Entailment

All examples use text from Japanese hotel reviews posted on Jalan, a travel information website. The corpus also contains sentences annotated with a sentiment polarity label and a label indicating whether the text expresses hotel reputation, as in the following example.

    Text            : 朝食が美味しいです。 (The breakfast is delicious.)
    Sentiment       : Positive
    Hotel reputation: True

Because some of the data have been removed for various reasons, this corpus does not exactly correspond to the one used in the reference papers.

Description

All files are in tab-separated values (TSV) format. All texts are normalized to Unicode NFKC.
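Text that will be compared against the corpus may need the same normalization. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `to_nfkc` and the sample string are illustrative assumptions and do not come from the repository.

```python
import unicodedata


def to_nfkc(text: str) -> str:
    """Return the NFKC-normalized form of text."""
    return unicodedata.normalize("NFKC", text)


# Illustrative input (not taken from the corpus): half-width katakana
# and a full-width digit, both of which NFKC folds to standard forms.
sample = "ﾎﾃﾙから駅まで５分"
normalized = to_nfkc(sample)
print(normalized)
```

Normalizing external input this way before matching it against corpus text avoids mismatches caused only by character width variants.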

data/rte.*.tsv

Data for textual entailment.

| # | Explanation | Samples |
| --- | --- | --- |
| 0 | ID of the example | rteXYZq00001 |
| 1 | Label | 1 (Entailment), 0 (Non-entailment) |
| 2 | Hypothesis | 駅まで近い。 |
| 3 | Premise | 温泉は、肌がスベスベになります。 |
| 4 | Judges (JSON format) | [{"0": 5, "1": 0}, {"0": 0, "1": 2}] |
| 5 | Reasoning (JSON format) | [["駅まで", 3], ["近い", 0], ["<unknown>", 1], ["<1>", 1]] |
| 6 | Usage | train, dev, test |
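The column layout above could be read along the following lines. This is only a sketch: the `RteExample` class and `read_rte` function are hypothetical names, the code assumes the files have no header row and no quoted fields, and the sample row is assembled from the sample values in the table (with an assumed label of 0) rather than copied from a real file.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Any, Iterator, TextIO


@dataclass
class RteExample:
    example_id: str
    label: int        # 1 (Entailment) or 0 (Non-entailment)
    hypothesis: str
    premise: str
    judges: Any       # parsed JSON: dict, list of dicts, or None
    reasoning: Any    # parsed JSON: list of [token, number] pairs, or None
    usage: str        # train, dev, or test


def read_rte(handle: TextIO) -> Iterator[RteExample]:
    """Yield one RteExample per non-empty TSV row."""
    # QUOTE_NONE keeps the double quotes inside the JSON columns intact.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        yield RteExample(
            example_id=row[0],
            label=int(row[1]),
            hypothesis=row[2],
            premise=row[3],
            judges=json.loads(row[4]),
            reasoning=json.loads(row[5]),
            usage=row[6],
        )


# Illustrative row built from the sample values in the table above.
sample_line = "\t".join([
    "rteXYZq00001",
    "0",
    "駅まで近い。",
    "温泉は、肌がスベスベになります。",
    '[{"0": 5, "1": 0}, {"0": 0, "1": 2}]',
    '[["駅まで", 3], ["近い", 0], ["<unknown>", 1], ["<1>", 1]]',
    "train",
]) + "\n"

examples = list(read_rte(io.StringIO(sample_line)))
print(examples[0].hypothesis, examples[0].label)
```

To read an actual file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`.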

Description of "Judges"

This is the collection of the annotators' binary judgments, represented as a dictionary. Each key is a choice and each value is the number of annotators who selected that choice. When the binary annotation was not performed, this is null. When the annotators were asked more than once, this is a list of dictionaries. In general the label is decided by majority vote, but some labels were corrected manually.
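As a rough illustration of the majority vote, the sketch below computes the winning choice for each annotation round separately. The function name `majority_per_round` is hypothetical, and the README does not say how several rounds are combined into the final label, so no combination rule is attempted here. Because some labels were corrected manually, the result may differ from the Label column.

```python
import json
from typing import List, Optional


def majority_per_round(judges_json: str) -> Optional[List[Optional[int]]]:
    """Return the majority choice of each round, or None for a tie.

    Returns None when the Judges column is null (no annotation).
    """
    judges = json.loads(judges_json)
    if judges is None:
        return None
    # A single dictionary is treated as a one-round list.
    rounds = judges if isinstance(judges, list) else [judges]
    results: List[Optional[int]] = []
    for counts in rounds:
        best = max(counts.values())
        winners = [int(choice) for choice, n in counts.items() if n == best]
        results.append(winners[0] if len(winners) == 1 else None)
    return results


print(majority_per_round('[{"0": 5, "1": 0}, {"0": 0, "1": 2}]'))
print(majority_per_round('{"0": 1, "1": 2}'))
print(majority_per_round("null"))
```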

Description of "Reasoning"

This is the annotators' selection of tokens from the Hypothesis for Non-entailment examples, represented as a list. When the annotation was not performed, this is null. There are two special tokens: <1> (the label is entailment) and <unknown> (the tokens are difficult to specify).
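A small sketch of separating ordinary tokens from the two special tokens follows. The name `split_reasoning` is hypothetical, and the code assumes that the second element of each pair is the number of annotators who selected the token; the README shows the numbers in its sample but does not state their meaning.

```python
import json
from typing import Dict, Optional, Tuple

SPECIAL_TOKENS = {"<1>", "<unknown>"}


def split_reasoning(
    reasoning_json: str,
) -> Optional[Tuple[Dict[str, int], Dict[str, int]]]:
    """Split the Reasoning column into (hypothesis tokens, special tokens).

    Returns None when the column is null (no annotation).
    """
    reasoning = json.loads(reasoning_json)
    if reasoning is None:
        return None
    tokens: Dict[str, int] = {}
    specials: Dict[str, int] = {}
    for token, count in reasoning:
        target = specials if token in SPECIAL_TOKENS else tokens
        target[token] = count
    return tokens, specials


sample = '[["駅まで", 3], ["近い", 0], ["<unknown>", 1], ["<1>", 1]]'
tokens, specials = split_reasoning(sample)
print(tokens)
print(specials)
```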

Description of "Usage"

The split to which the example was assigned in the papers. In the reference papers, examples labeled dev were also used for training because no hyperparameters were tuned.
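The split handling described above might be reproduced as in the following sketch. The function `split_by_usage` and its parameters are hypothetical, and the rows are shortened placeholders, not corpus data; `usage_index=6` matches the column layout of the textual entailment files.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def split_by_usage(
    rows: Sequence[Sequence[str]],
    usage_index: int = 6,
    merge_dev_into_train: bool = True,
) -> Dict[str, List[Sequence[str]]]:
    """Group rows by their Usage column.

    With merge_dev_into_train=True, rows marked dev are added to train,
    mirroring the setting described for the reference papers.
    """
    splits: Dict[str, List[Sequence[str]]] = defaultdict(list)
    for row in rows:
        usage = row[usage_index]
        if merge_dev_into_train and usage == "dev":
            usage = "train"
        splits[usage].append(row)
    return dict(splits)


# Placeholder rows: only the ID and the Usage column are filled in.
rows = [
    ["id1", "", "", "", "", "", "train"],
    ["id2", "", "", "", "", "", "dev"],
    ["id3", "", "", "", "", "", "test"],
]
splits = split_by_usage(rows)
print({name: len(items) for name, items in splits.items()})
```

Passing `merge_dev_into_train=False` keeps the three splits separate, which is the natural choice when hyperparameters are tuned on dev.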

Files

  • rte.nlp2020*.tsv: Data used in "NLP 2020"
    • rte.nlp2020_base.tsv: BASE
    • rte.nlp2020_append.tsv: APPEND
  • rte.lrec2020*.tsv: Data used in "LREC 2020"
    • rte.lrec2020_surf.tsv: Surf in BASE
    • rte.lrec2020_sem_short.tsv: SemShort in BASE
    • rte.lrec2020_sem_long.tsv: SemLong in BASE
    • rte.lrec2020_me.tsv: ME
    • rte.lrec2020_mlm.tsv: MLM

data/operation.rte.lrec2020_mlm.tsv

An explanation of how we generated the MLM data.

| # | Explanation | Samples |
| --- | --- | --- |
| 0 | ID of the example | rteXYZq00001 |
| 1 | ID of the original example | rteABCq00001 |
| 2 | Operation | insert, replace |
| 3 | Target | hypothesis, premise |
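One possible use of this file is to look up which original example an MLM example was derived from. The sketch below assumes no header row; `read_operations` is a hypothetical helper, and the sample row combines the sample values from the table, with the operation and target picked arbitrarily for illustration.

```python
import csv
import io
from typing import Dict, TextIO


def read_operations(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Map each generated example ID to its origin and edit operation."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    operations: Dict[str, Dict[str, str]] = {}
    for row in reader:
        if not row:
            continue
        operations[row[0]] = {
            "original_id": row[1],
            "operation": row[2],   # insert or replace
            "target": row[3],      # hypothesis or premise
        }
    return operations


# Illustrative row using the placeholder IDs from the table above.
sample = "rteXYZq00001\trteABCq00001\treplace\thypothesis\n"
operations = read_operations(io.StringIO(sample))
print(operations["rteXYZq00001"]["original_id"])
```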

data/rhr.tsv

Data for recognition of hotel reputation.

| # | Explanation | Samples |
| --- | --- | --- |
| 0 | ID of the example | rhrXYZq00001 |
| 1 | Label | 1 (Hotel reputation), 0 (Not hotel reputation) |
| 2 | Text | お風呂が最高でした。, 1人旅で利用しました。 |
| 3 | Judges (JSON format) | {"0": 1, "1": 2} |
| 4 | Usage | train, dev, test |

data/pn.tsv

Data for sentiment analysis.

| # | Explanation | Samples |
| --- | --- | --- |
| 0 | ID of the example | pnXYZq00001 |
| 1 | Label | 1 (Positive), 0 (Neutral), -1 (Negative) |
| 2 | Text | 駅まで近い。 |
| 3 | Judges (JSON format) | {"0": 1, "1": 4} |
| 4 | Usage | train, dev, test |
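The sentiment file and the hotel reputation file share the same five-column layout, so a single row parser could cover both. The following is a sketch under the assumption that the files have no header row; `parse_single_text_row` and `PN_LABELS` are hypothetical names, and the sample row is assembled from the sample values in the table with an assumed label of 1.

```python
import json
from typing import Any, Dict

# Label names for data/pn.tsv as listed in the table above.
PN_LABELS = {1: "Positive", 0: "Neutral", -1: "Negative"}


def parse_single_text_row(line: str) -> Dict[str, Any]:
    """Parse one row of a five-column file (pn.tsv or rhr.tsv layout)."""
    example_id, label, text, judges, usage = line.rstrip("\n").split("\t")
    return {
        "id": example_id,
        "label": int(label),
        "text": text,
        "judges": json.loads(judges),  # dict of choice -> count, or None
        "usage": usage,
    }


# Illustrative row built from the sample values in the table above.
sample = 'pnXYZq00001\t1\t駅まで近い。\t{"0": 1, "1": 4}\ttrain\n'
row = parse_single_text_row(sample)
print(row["text"], PN_LABELS[row["label"]])
```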

References

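  1. Yuta Hayashibe (林部祐太). 知識の整理のための根拠付き自然文間含意関係コーパスの構築 (Construction of a textual entailment corpus with rationales for organizing knowledge). Proceedings of the 26th Annual Meeting of the Association for Natural Language Processing, pp. 820-823. 2020. (NLP 2020)
  2. Yuta Hayashibe. Japanese Realistic Textual Entailment Corpus. Proceedings of The 12th Language Resources and Evaluation Conference, pp. 6829-6836. 2020. (LREC 2020)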

Notes

  • 株式会社リクルート(以下「リクルート」といいます。)は自然言語処理の研究に貢献する目的で、言語的注釈が付与されたデータセット(以下「本データセット」といいます。)を公開いたします。
  • Recruit Co., Ltd. (hereinafter referred to as "Recruit") publishes the data set with linguistic annotations (hereinafter referred to as this "Data Set") for the purpose of contributing to research in natural language processing.

  • 本データセットには、クチコミデータから抽出した文、それらを加工した文、アノテーション作業者が付与した判定ラベルが含まれます。ラベルは作業者によって付与されたものであり、クチコミ投稿者の体験や評価、もしくはリクルートの評価を反映したものではありません。

  • This Data Set contains sentences extracted from Customer Reviews, sentences derived by editing them, and judgment labels assigned by annotators. The labels were assigned by the annotators and do not reflect the experience or assessment of the review contributors, or the assessment of Recruit.

  • 事実と異なる内容が含まれる場合があります。

  • This Data Set may contain content that is contrary to the facts.

  • 本データセットは通知なく変更・削除される場合があります。

  • This Data Set is subject to change or deletion without notice.

License and Attribution

  • 本データセットに含まれる「じゃらんクチコミデータ」の著作権は、リクルートに帰属します。
  • The copyrights to Customer Reviews included in this Data Set belong to Recruit.

  • 本データセットを用いた研究発表を行う際は、Referencesの論文を引用し、次のようにデータの入手元も記述してください。

    • 文例: 本研究では株式会社リクルートが提供する"Japanese Realistic Textual Entailment Corpus" (https://github.com/megagonlabs/jrte-corpus)を利用しました。
  • When publishing a study using this dataset, please cite papers in References and describe the source of the data as follows.

    • Example: To conduct this study, we used "Japanese Realistic Textual Entailment Corpus" (https://github.com/megagonlabs/jrte-corpus) provided by Recruit Co., Ltd.
  • 本データセットのライセンスはクリエイティブ・コモンズ・ライセンス (表示-非営利-継承 4.0 国際)です。

  • The license of this Data Set is in the same scope as Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

Prohibitions

  • リクルートは本データセットを非営利的な公共利用のために公開しています。分析・研究・その成果を発表するために必要な範囲を超えて利用すること(営利目的利用)は固く禁じます。
  • Recruit publishes this Data Set for non-profit public use. Use beyond the scope necessary for analysis, research, and the presentation of their results (use for profit purposes) is strictly prohibited.

  • 利用者は、研究成果の公表といえども、前項の出版物等の資料に、適正な例示の範囲を超えてデータセット中のデータを掲載してはならず、犯罪その他の違法行為を積極的に助長・推奨する内容や公序良俗に違反する情報等を記述しないでください。

  • Even when publishing research results, users must not reproduce data from the Data Set in the publications and other materials set forth in the preceding paragraph beyond the range appropriate for examples. Users must also not include content that actively promotes or encourages criminal or other illegal acts, or information that violates public order and morals.

Contact

If you have any inquiries about the dataset, encounter problems with it, or notice a mistake, please contact the NLP Data Support Team: nlp_data_support at r.recruit.co.jp.

Owner

  • Name: Megagon Labs
  • Login: megagonlabs
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this work in a project of yours and write about it, please cite our paper using the following citation data."
authors:
  - family-names: Hayashibe
    given-names: Yuta
title: Japanese Realistic Textual Entailment Corpus
url: https://github.com/megagonlabs/jrte-corpus
preferred-citation:
  type: conference-paper
  title: Japanese Realistic Textual Entailment Corpus
  authors:
    - family-names: Hayashibe
      given-names: Yuta
  isbn: 979-10-95546-34-4
  collection-title: Proceedings of The 12th Language Resources and Evaluation Conference
  year: 2020
  month: 5
  publisher: 
    name: European Language Resources Association
  url: https://aclanthology.org/2020.lrec-1.843
  start: 6827
  end: 6834

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Issues and Pull Requests

All Time
  • Total issues: 0
  • Total pull requests: 1
  • Average time to close issues: N/A
  • Average time to close pull requests: less than a minute
  • Total issue authors: 0
  • Total pull request authors: 1
  • Average comments per issue: 0
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • shirayu (1)

Dependencies

requirements.txt pypi
  • autopep8 >=1.5.4
  • flake8 >=3.8.4
  • isort >=5.6.3
  • mypy >=0.790
  • yamllint >=1.25.0
.github/workflows/ci.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
  • actions/setup-python v4 composite
  • snok/install-poetry v1.3.3 composite
.github/workflows/typos.yml actions
  • actions/checkout v3 composite
  • crate-ci/typos v1.12.14 composite
package-lock.json npm
  • argparse 2.0.1 development
  • balanced-match 1.0.2 development
  • brace-expansion 2.0.1 development
  • commander 9.4.1 development
  • deep-extend 0.6.0 development
  • entities 3.0.1 development
  • fs.realpath 1.0.0 development
  • get-stdin 9.0.0 development
  • glob 8.0.3 development
  • ignore 5.2.0 development
  • inflight 1.0.6 development
  • inherits 2.0.4 development
  • ini 3.0.1 development
  • js-yaml 4.1.0 development
  • jsonc-parser 3.1.0 development
  • linkify-it 4.0.1 development
  • markdown-it 13.0.1 development
  • markdownlint 0.26.2 development
  • markdownlint-cli 0.32.2 development
  • markdownlint-rule-helpers 0.17.2 development
  • mdurl 1.0.1 development
  • minimatch 5.1.0 development
  • minimist 1.2.7 development
  • once 1.4.0 development
  • pyright 1.1.280 development
  • run-con 1.2.11 development
  • strip-json-comments 3.1.1 development
  • uc.micro 1.0.6 development
  • wrappy 1.0.2 development
package.json npm
  • markdown-it >=12.3.2 development
  • markdownlint-cli ^0.32.2 development
  • pyright ^1.1.280 development
poetry.lock pypi
  • black 22.10.0 develop
  • click 8.1.3 develop
  • colorama 0.4.6 develop
  • coverage 6.5.0 develop
  • flake8 5.0.4 develop
  • isort 5.10.1 develop
  • mccabe 0.7.0 develop
  • mypy-extensions 0.4.3 develop
  • pathspec 0.10.2 develop
  • platformdirs 2.5.4 develop
  • pycodestyle 2.9.1 develop
  • pydocstyle 6.1.1 develop
  • pyflakes 2.5.0 develop
  • pyyaml 6.0 develop
  • setuptools 65.5.1 develop
  • snowballstemmer 2.2.0 develop
  • toml 0.10.2 develop
  • tomli 2.0.1 develop
  • typing-extensions 4.4.0 develop
  • yamllint 1.28.0 develop
pyproject.toml pypi
  • black >=21.12b0 develop
  • coverage >=5.3 develop
  • flake8 >=3.8.4 develop
  • isort >=5.6.4 develop
  • pydocstyle >=5.1.1 develop
  • toml ^0.10.2 develop
  • yamllint >=1.25.0 develop
  • python ^3.9