diverse_cdcr_datasets

Towards Evaluation of Cross-document Coreference Resolution Models using Datasets with Diverse Annotation Schemes

https://github.com/anastasia-zhukova/diverse_cdcr_datasets

Keywords

coreference-resolution dataset natural-language-processing



Basic Info

Host: GitHub
Owner: anastasia-zhukova
License: apache-2.0
Language: Perl
Default Branch: master
Homepage:
Size: 266 MB

Statistics

Stars: 2
Watchers: 2
Forks: 0
Open Issues: 2
Releases: 0


Created about 4 years ago · Last pushed about 2 years ago

Metadata Files

Readme License Citation

Cross-document coreference resolution (CDCR) datasets with diverse annotation schemes

The repository contains the code used to report the results in the LREC 2022 paper Zhukova A., Hamborg F., Gipp B. "Towards Evaluation of Cross-document Coreference Resolution Models Using Datasets with Diverse Annotation Schemes".
Please use this .bib to cite the paper:

@inproceedings{Zhukova2022a,
  title        = {{T}owards {E}valuation of {C}ross-document {C}oreference {R}esolution {M}odels {U}sing {D}atasets with {D}iverse {A}nnotation {S}chemes},
  author       = {Zhukova, Anastasia and Hamborg, Felix and Gipp, Bela},
  year         = 2022,
  month        = {June},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  location     = {Marseille, France}
}

The repository contains code that parses the original formats of CDCR datasets into a common format (CoNLL format for coreference resolution plus a separate list of mentions) and calculates summary values that enable comparison of the datasets.

Each dataset has its own folder containing its parsing scripts, whereas the summary script is located in the root folder. The parsed datasets are available in this repository in the folders listed below.

Installation

1) Python 3.8 is required.
2) Creating a venv is recommended.
3) Install libraries: pip install -r requirements.txt
4) Download the datasets and required libraries from spacy: python setup.py
5) Download and install Perl. Add perl to PATH, restart your computer, and check that perl has been correctly installed.
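
The last installation step asks you to confirm that perl is reachable. A minimal sketch of such a check in Python follows. The function name `check_environment` is hypothetical and not part of this repository, and treating "Python 3.8 required" as "3.8 or newer" is an assumption.

```python
import shutil
import sys


def check_environment(min_python=(3, 8)):
    """Return a list of problems found; an empty list means both checks passed.

    Hypothetical helper, not part of the repository. It assumes that
    "Python 3.8 required" can be read as "3.8 or newer".
    """
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted} or newer is expected")
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which("perl") is None:
        problems.append("perl was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the file prints one warning per failed check and nothing when the environment looks fine.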

Dataset information

The parsing scripts and output folders are located here:

Each dataset contains three output files suitable for a CDCR model:

1) *dataset_name*.conll
2) entity_mentions.json
3) event_mentions.json

CoNLL format (simplified)

CoNLL format is a standard input format for within-document coreference resolution. The original format contains multiple columns with information for each token, e.g., POS tags and NER labels. We use a simplified format (based on the format of the input files used by Barhom et al. 2019) that contains tokens, their identifiers in the text (e.g., docid, sentid), and labels of coref chains:

| Column ID | Type | Description |
| :--- | :----: | :--- |
| 0 | string | Composed document id: topic/subtopic/doc ("-" is used if there is no subtopic) |
| 1 | int | Sentence ID |
| 2 | int | Token ID |
| 3 | string | Token |
| 4 | string | Coreference chain |

Each document is enclosed in begin and end tags, and sentences are separated by new lines (warning: some new line delimiters can be tokens themselves, e.g., in NewsWCL50).

Example:

```
begin document 0/-/0_LL; part 000
0/-/0_LL 0 0 This -
0/-/0_LL 0 1 is -
0/-/0_LL 0 2 Jim (1)
0/-/0_LL 0 3 . -

0/-/0_LL 1 0 He (1)
0/-/0_LL 1 1 likes -
0/-/0_LL 1 2 sports -
0/-/0_LL 1 3 . -
end document

begin document 1/1ecb/12; part 000
1/1ecb/12 0 0 This -
1/1ecb/12 0 1 is -
1/1ecb/12 0 2 Anna (2)
1/1ecb/12 0 3 . -

1/1ecb/12 1 0 She (2)
1/1ecb/12 1 1 likes -
1/1ecb/12 1 2 singing -
1/1ecb/12 1 3 . -
end document
```
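
To illustrate how the five columns can be consumed, here is a minimal reading sketch in Python. It is not the repository's own parser: the function names are hypothetical, columns are assumed to be whitespace-separated, and multi-token mentions are assumed to follow the usual CoNLL bracket convention (`(1` opens a mention, `1)` closes it, `|` separates several labels). Rows that do not split into exactly five fields, such as rows whose token is itself a new line, are skipped.

```python
from collections import defaultdict

SAMPLE = """\
begin document 0/-/0_LL; part 000
0/-/0_LL 0 0 This -
0/-/0_LL 0 1 is -
0/-/0_LL 0 2 Jim (1)
0/-/0_LL 0 3 . -

0/-/0_LL 1 0 He (1)
0/-/0_LL 1 1 likes -
0/-/0_LL 1 2 sports -
0/-/0_LL 1 3 . -
end document
"""


def parse_simplified_conll(text):
    """Return a flat list of (doc_id, sent_id, token_id, token, label) rows."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank sentence separators and the document begin/end tags
        # (with or without a leading '#').
        marker = stripped.lstrip("#").lstrip()
        if not stripped or marker.startswith(("begin document", "end document")):
            continue
        parts = stripped.split()
        if len(parts) != 5:
            continue  # simplification: whitespace tokens are not handled
        doc_id, sent_id, token_id, token, label = parts
        rows.append((doc_id, int(sent_id), int(token_id), token, label))
    return rows


def extract_mentions(rows):
    """Return (chain, doc_id, sent_id, tokens) for every closed mention."""
    open_spans = defaultdict(list)  # chain id -> stack of in-progress token lists
    mentions = []
    for doc_id, sent_id, _token_id, token, label in rows:
        parts = [] if label == "-" else label.split("|")
        for part in parts:
            if part.startswith("("):
                open_spans[part.strip("()")].append([])
        # The current token belongs to every mention that is open right now.
        for stack in open_spans.values():
            for span in stack:
                span.append(token)
        for part in parts:
            if part.endswith(")"):
                chain = part.strip("()")
                if open_spans[chain]:
                    mentions.append((chain, doc_id, sent_id, open_spans[chain].pop()))
    return mentions


def group_by_chain(mentions):
    """Map each chain id to the surface strings of its mentions."""
    chains = defaultdict(list)
    for chain, _doc_id, _sent_id, tokens in mentions:
        chains[chain].append(" ".join(tokens))
    return dict(chains)


if __name__ == "__main__":
    print(group_by_chain(extract_mentions(parse_simplified_conll(SAMPLE))))
```

The sketch keeps rows in a flat list so that mentions from different documents can be grouped under the same chain ID, which is what makes the chains cross-document. It does not reset open mentions at document boundaries, so it expects well-formed input.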

Mentions.json

The format is adapted and extended from WEC-Eng and from the mention format used by Barhom et al. 2019.

Example:

    [{
      "coref_chain": "0_Denuclearization_MISC",
      "tokens_number": [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
      "doc_id": "0_L",
      "score": -1,
      "sent_id": 21,
      "mention_type": "MISC",
      "mention_full_type": "MISC",
      "mention_id": "0_L_21_49_VZrL",
      "topic_id": 0,
      "topic": "0_CIADirectorMikePompeoMeetingNorthKorea",
      "subtopic": "-",
      "description": "Denuclearization",
      "coref_type": "STRICT",
      "mention_ner": "O",
      "mention_head_pos": "PUNCT",
      "mention_head_lemma": "\"",
      "mention_head": "\"",
      "mention_head_id": 49,
      "is_continuous": true,
      "is_singleton": false,
      "mention_context": ["newspaper", ",", "Munhwa", "Ilbo", ",", "reported", "that", "the", "two", "countries", "were", "negotiating", "an", "announcement", "\"", "to", "ease", "military", "tensions", "and", "end", "a", "military", "confrontation", ",", "\"", "as", "part", "of", "the", "summit", "meeting", "planned", "between", "Mr.", "Kim", "and", "President", "Moon", "Jae", "-", "in", "of", "South", "Korea", ".", "\n", "That", "could", "involve", "pulling", "troops", "out", "of", "the", "Demilitarized", "Zone", ",", "making", "it", "a", "genuinely", "\"", "Demilitarized", "Zone", ".", "\"", "A", "South", "Korean", "government", "official", "later", "played", "down", "the", "report", ",", "saying", "it", "was", "too", "soon", "to", "tell", "what", "a", "joint", "statement", "by", "Mr.", "Moon", "and", "Mr.", "Kim", "would", "contain", ",", "other", "than", "broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\"", "\n", "But", "analysts", "said", "South", "Korea", "was", "aiming", "for", "a", "comprehensive", "deal", ",", "in", "which", "the", "North", "agreed", "to", "give", "up", "its", "weapons", "in", "return", "for", "a", "security", "guarantee", ",", "including", "a", "peace", "treaty", ".", "Mr.", "Trump", "'s", "comments", "suggested", "he", "backed", "that", "effort", ".", "\n", "\"", "They", "do", "have", "my", "blessing", "to", "discuss", "the", "end", "of", "the", "war", ",", "\"", "he", "said", ".", "\"", "People", "do", "n't", "realize", "that", "the", "Korean", "War", "has", "not", "ended", ".", "It", "'s", "going", "on", "right", "now", ".", "And", "they", "are", "discussing", "an", "end", "to", "war", ".", "Subject", "to", "a", "deal", ",", "they"],
      "tokens_str": "broad and \"abstract\" statements about the need for North Korea to \"denuclearize.\" ",
      "tokens_text": ["broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\""],
      "conll_doc_key": "0/-/0_L"
    }]

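A mentions file can be loaded with the standard `json` module. The sketch below is one possible way to inspect it, not code from this repository: it counts mentions per `coref_chain` and flags records whose `tokens_number` and `tokens_text` lengths disagree. The function name `summarize_mentions` is hypothetical, and the three records are invented toy data that use only a subset of the fields shown above.

```python
import json
from collections import Counter

# Toy records (invented for illustration; real files contain many more fields).
SAMPLE_JSON = """
[
  {"coref_chain": "chain_A", "mention_id": "m1", "doc_id": "0_L", "sent_id": 0,
   "tokens_number": [2], "tokens_text": ["Jim"]},
  {"coref_chain": "chain_A", "mention_id": "m2", "doc_id": "0_L", "sent_id": 1,
   "tokens_number": [0], "tokens_text": ["He"]},
  {"coref_chain": "chain_B", "mention_id": "m3", "doc_id": "1_R", "sent_id": 0,
   "tokens_number": [2, 3], "tokens_text": ["Anna"]}
]
"""


def summarize_mentions(mentions):
    """Return (mentions per chain, ids of mentions with mismatched token lists)."""
    chain_sizes = Counter(m["coref_chain"] for m in mentions)
    inconsistent = [
        m["mention_id"]
        for m in mentions
        if len(m["tokens_number"]) != len(m["tokens_text"])
    ]
    return dict(chain_sizes), inconsistent


if __name__ == "__main__":
    # For a real file, replace json.loads(...) with json.load(open(path, encoding="utf-8")).
    chain_sizes, inconsistent = summarize_mentions(json.loads(SAMPLE_JSON))
    print(chain_sizes)
    print(inconsistent)
```

The third toy record is deliberately inconsistent (two token IDs, one token string) so that the check has something to report.
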
Dataset summary metrics

The following values enable comparison of the CDCR datasets on dataset and topic levels.

The results of the dataset comparison are available in the /summary folder.

Owner

Name: Anastasia Zhukova
Login: anastasia-zhukova
Kind: user
Location: Göttingen
Company: University of Göttingen & University of Wuppertal

Website: https://gipplab.org/team/anastasia-zhukova/
Twitter: ana_m_zhukova
Repositories: 3
Profile: https://github.com/anastasia-zhukova

Doctoral Researcher at GippLab: Scientific Information Analytics

Citation (CITATION.BIB)

@inproceedings{Zhukova2022a,
  title        = {{T}owards {E}valuation of {C}ross-document {C}oreference {R}esolution {M}odels {U}sing {D}atasets with {D}iverse {A}nnotation {S}chemes},
  author       = {Zhukova, Anastasia and Hamborg, Felix and Gipp, Bela},
  year         = 2022,
  month        = {June},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  location     = {Marseille, France}
}

GitHub Events

Total

Issues event: 1
Pull request event: 1

Last Year

Issues event: 1
Pull request event: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 2
Total pull requests: 1
Average time to close issues: over 1 year
Average time to close pull requests: about 2 years
Total issue authors: 1
Total pull request authors: 1
Average comments per issue: 0.0
Average comments per pull request: 0.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

diverse_cdcr_datasets

Science Score: 18.0%
