diverse_cdcr_datasets

Towards Evaluation of Cross-document Coreference Resolution Models using Datasets with Diverse Annotation Schemes

https://github.com/anastasia-zhukova/diverse_cdcr_datasets

Science Score: 18.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.4%) to scientific vocabulary

Keywords

coreference-resolution dataset natural-language-processing
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: anastasia-zhukova
  • License: apache-2.0
  • Language: Perl
  • Default Branch: master
  • Homepage:
  • Size: 266 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 0
  • Open Issues: 2
  • Releases: 0
Created about 4 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

Cross-document coreference resolution (CDCR) datasets with diverse annotation schemes

The repository contains the code used to report the results in the LREC 2022 paper Zhukova A., Hamborg F., Gipp B. "Towards Evaluation of Cross-document Coreference Resolution Models Using Datasets with Diverse Annotation Schemes".
Please use this .bib to cite the paper:

@inproceedings{Zhukova2022a,
  title        = {{T}owards {E}valuation of {C}ross-document {C}oreference {R}esolution {M}odels {U}sing {D}atasets with {D}iverse {A}nnotation {S}chemes},
  author       = {Zhukova, Anastasia and Hamborg, Felix and Gipp, Bela},
  year         = 2022,
  month        = {June},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  location     = {Marseille, France}
}

The repository contains code that parses the original formats of CDCR datasets into one common format (CoNLL format for coreference resolution plus a separate list of mentions) and calculates summary values that enable comparison of the datasets.

Each dataset has its own folder that contains its parsing script, whereas the summary script is located in the root folder. The parsed datasets are available in this repository in the folders listed below.

Installation

1) Python 3.8 is required.
2) Creating a virtual environment (venv) is recommended.
3) Install the libraries: `pip install -r requirements.txt`
4) Download the datasets and the required spaCy libraries: `python setup.py`
5) Download and install Perl. Add perl to PATH, restart your computer, and check that perl has been installed correctly.
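The last installation step asks you to check that perl is reachable from PATH. A minimal way to do this from Python, using only the standard library, is sketched below; the function name `find_perl` is illustrative and not part of this repository.

```python
import shutil
from typing import Optional


def find_perl() -> Optional[str]:
    """Return the full path of the perl executable found on PATH, or None."""
    return shutil.which("perl")


perl_path = find_perl()
if perl_path is None:
    print("perl was not found on PATH; install it or fix PATH and try again.")
else:
    print(f"perl found at: {perl_path}")
```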

Dataset information

The parsing scripts and output folders are located here:

| Dataset | Parsing script | Output files |
| :--- | :---- | :--- |
| ECB+ | ECBplus-prep/parse_ecbplus.py | ECBplus-prep/output_data |
| NewsWCL50 | NewsWCL50-prep/parse_newswcl50.py | NewsWCL50-prep/output_data |

Each dataset contains three output files suitable for a CDCR model:

1) *dataset_name*.conll
2) entity_mentions.json
3) event_mentions.json

CoNLL format (simplified)

CoNLL format is a standard input format for within-document coreference resolution. The original format has multiple columns with information about each token, e.g., POS tags and NER labels. We use a simplified format (based on the format of the input files used by Barhom et al. 2019) that contains the tokens, their identifiers in the text (e.g., doc_id, sent_id), and the labels of the coref chains:

| Column ID | Type | Description |
| :--- | :----: | :--- |
| 0 | string | Composed document id: topic/subtopic/doc ("-" is used if there is no subtopic) |
| 1 | int | Sentence ID |
| 2 | int | Token ID |
| 3 | string | Token |
| 4 | string | Coreference chain |

Each document is enclosed in a begin tag and an end tag, and sentences are separated by new lines. Warning: some new-line delimiters can be tokens themselves (e.g., in NewsWCL50).

Example:
```
begin document 0/-/0_LL; part 000
0/-/0_LL 0 0 This -
0/-/0_LL 0 1 is -
0/-/0_LL 0 2 Jim (1)
0/-/0_LL 0 3 . -

0/-/0_LL 1 0 He (1)
0/-/0_LL 1 1 likes -
0/-/0_LL 1 2 sports -
0/-/0_LL 1 3 . -
end document

begin document 1/1ecb/12; part 000
1/1ecb/12 0 0 This -
1/1ecb/12 0 1 is -
1/1ecb/12 0 2 Anna (2)
1/1ecb/12 0 3 . -

1/1ecb/12 1 0 She (2)
1/1ecb/12 1 1 likes -
1/1ecb/12 1 2 singing -
1/1ecb/12 1 3 . -
end document
```
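A minimal sketch of reading the simplified CoNLL format into Python follows. It assumes that the five columns are separated by whitespace, that each token sits on its own line, and that the begin and end tags may or may not carry a leading `#`. The names `Token` and `parse_simplified_conll` are illustrative and not part of this repository. Lines that do not have exactly five columns (for instance, tokens that are new-line characters themselves) are skipped here rather than repaired.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    doc_key: str   # column 0: topic/subtopic/doc
    sent_id: int   # column 1
    token_id: int  # column 2
    text: str      # column 3
    coref: str     # column 4: "-" or a chain label such as "(1)"


def parse_simplified_conll(text: str) -> Dict[str, List[Token]]:
    """Group tokens by composed document id (assumes whitespace-separated columns)."""
    docs: Dict[str, List[Token]] = defaultdict(list)
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line: sentence boundary
        if stripped.lstrip("#").startswith(("begin document", "end document")):
            continue  # document tags carry no tokens
        parts = stripped.split()
        if len(parts) != 5:
            continue  # irregular line; skipped in this sketch
        doc_key, sent_id, token_id, token, coref = parts
        docs[doc_key].append(Token(doc_key, int(sent_id), int(token_id), token, coref))
    return dict(docs)


SAMPLE = """begin document 1/1ecb/12; part 000
1/1ecb/12 0 0 This -
1/1ecb/12 0 1 is -
1/1ecb/12 0 2 Anna (2)
1/1ecb/12 0 3 . -

1/1ecb/12 1 0 She (2)
1/1ecb/12 1 1 likes -
1/1ecb/12 1 2 singing -
1/1ecb/12 1 3 . -
end document
"""

documents = parse_simplified_conll(SAMPLE)
labelled = [t.text for t in documents["1/1ecb/12"] if t.coref != "-"]
print(labelled)
```

Keeping the coreference column as a raw string avoids guessing how multi-token or nested chain labels are encoded; a consumer can interpret the label once the conventions of a given dataset are known.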

Mentions.json

The format is adapted and extended from WEC-Eng and from the mention format used by Barhom et al. 2019.

| Field | Type | Description |
| :--- | :----: | :--- |
| coref_chain | string | Unique identifier of the coreference chain to which this mention belongs. |
| description | string | Description of the coreference chain. |
| coref_type | string | Type of the coreference link, e.g., strict identity. |
| mention_id | string | Mention ID. |
| mention_type | string | Short form of the mention type, e.g., HUM. |
| mention_full_type | string | Long form of the mention type, e.g., HUMAN_PART_PER. |
| tokens_str | string | The full mention string, i.e., all consecutive characters of the mention as found in the text. |
| tokens_text | list of strings | The mention split into a list of tokens: the text of each token. |
| tokens_number | list of int | The mention split into a list of tokens: the ID of each token (as it occurs in the sentence). |
| mention_head | string | The head of the mention's phrase, e.g., Barack Obama. |
| mention_head_id | int | Token ID of the head of the mention's phrase. |
| mention_head_pos | string | POS tag of the head token of the mention's phrase. |
| mention_head_lemma | string | Lemma of the head token of the mention's phrase. |
| sent_id | int | Sentence ID. |
| topic_id | int | Topic ID. |
| topic | string | Topic description. |
| subtopic | string | Subtopic name. |
| doc_id | string | Document ID. |
| is_continuous | bool | Whether all tokens of the annotated mention occur continuously in the text. |
| is_singleton | bool | Whether the coreference chain consists of only one mention. |
| mention_context | list of strings | N tokens before and N tokens after the mention (N=100). |
| conll_doc_key | string | A compositional key for one-to-one mapping of documents between the .conll and .json files. |
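As a sketch of how these fields can be used, the snippet below loads mention records with the standard `json` module, groups them by `coref_chain`, and builds (document key, sentence ID, token ID) triples that point back into the .conll file through `conll_doc_key`. The two records are shortened, hypothetical examples that keep only the fields needed here, and the function names are illustrative.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical, shortened records; real files contain all fields listed above.
MENTIONS_JSON = """
[
  {"coref_chain": "0_Denuclearization_MISC", "mention_id": "m1",
   "conll_doc_key": "0/-/0_L", "sent_id": 21, "tokens_number": [47],
   "tokens_text": ["denuclearize"]},
  {"coref_chain": "0_Denuclearization_MISC", "mention_id": "m2",
   "conll_doc_key": "0/-/0_L", "sent_id": 3, "tokens_number": [5, 6],
   "tokens_text": ["nuclear", "disarmament"]}
]
"""


def load_mentions(raw: str) -> List[dict]:
    """Parse a JSON array of mention records."""
    return json.loads(raw)


def group_by_chain(mentions: List[dict]) -> Dict[str, List[dict]]:
    """Collect all mentions that share the same coref_chain identifier."""
    chains: Dict[str, List[dict]] = defaultdict(list)
    for mention in mentions:
        chains[mention["coref_chain"]].append(mention)
    return dict(chains)


def token_keys(mention: dict) -> List[Tuple[str, int, int]]:
    """Return (conll_doc_key, sent_id, token_id) for every token of a mention."""
    return [
        (mention["conll_doc_key"], mention["sent_id"], token_id)
        for token_id in mention["tokens_number"]
    ]


mentions = load_mentions(MENTIONS_JSON)
chains = group_by_chain(mentions)
for chain_id, members in chains.items():
    print(chain_id, [" ".join(m["tokens_text"]) for m in members])
print(token_keys(mentions[1]))
```

For a real file, the same functions can be applied to the text read from entity_mentions.json or event_mentions.json.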

Example:

[{
  "coref_chain": "0_Denuclearization_MISC",
  "tokens_number": [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
  "doc_id": "0_L",
  "score": -1,
  "sent_id": 21,
  "mention_type": "MISC",
  "mention_full_type": "MISC",
  "mention_id": "0_L_21_49_VZrL",
  "topic_id": 0,
  "topic": "0_CIADirectorMikePompeoMeetingNorthKorea",
  "subtopic": "-",
  "description": "Denuclearization",
  "coref_type": "STRICT",
  "mention_ner": "O",
  "mention_head_pos": "PUNCT",
  "mention_head_lemma": "\"",
  "mention_head": "\"",
  "mention_head_id": 49,
  "is_continuous": true,
  "is_singleton": false,
  "mention_context": ["newspaper", ",", "Munhwa", "Ilbo", ",", "reported", "that", "the", "two", "countries", "were", "negotiating", "an", "announcement", "\"", "to", "ease", "military", "tensions", "and", "end", "a", "military", "confrontation", ",", "\"", "as", "part", "of", "the", "summit", "meeting", "planned", "between", "Mr.", "Kim", "and", "President", "Moon", "Jae", "-", "in", "of", "South", "Korea", ".", "\n", "That", "could", "involve", "pulling", "troops", "out", "of", "the", "Demilitarized", "Zone", ",", "making", "it", "a", "genuinely", "\"", "Demilitarized", "Zone", ".", "\"", "A", "South", "Korean", "government", "official", "later", "played", "down", "the", "report", ",", "saying", "it", "was", "too", "soon", "to", "tell", "what", "a", "joint", "statement", "by", "Mr.", "Moon", "and", "Mr.", "Kim", "would", "contain", ",", "other", "than", "broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\"", "\n", "But", "analysts", "said", "South", "Korea", "was", "aiming", "for", "a", "comprehensive", "deal", ",", "in", "which", "the", "North", "agreed", "to", "give", "up", "its", "weapons", "in", "return", "for", "a", "security", "guarantee", ",", "including", "a", "peace", "treaty", ".", "Mr.", "Trump", "'s", "comments", "suggested", "he", "backed", "that", "effort", ".", "\n", "\"", "They", "do", "have", "my", "blessing", "to", "discuss", "the", "end", "of", "the", "war", ",", "\"", "he", "said", ".", "\"", "People", "do", "n't", "realize", "that", "the", "Korean", "War", "has", "not", "ended", ".", "It", "'s", "going", "on", "right", "now", ".", "And", "they", "are", "discussing", "an", "end", "to", "war", ".", "Subject", "to", "a", "deal", ",", "they"],
  "tokens_str": "broad and \"abstract\" statements about the need for North Korea to \"denuclearize.\" ",
  "tokens_text": ["broad", "and", "\"", "abstract", "\"", "statements", "about", "the", "need", "for", "North", "Korea", "to", "\"", "denuclearize", ".", "\""],
  "conll_doc_key": "0/-/0_L"
}]

Dataset summary metrics

The following values enable comparison of the CDCR datasets on dataset and topic levels.

| Field | Type | Description |
| :--- | :----: | :--- |
| dataset | string | Name of the dataset |
| topic | string | Topic name (empty for the line that contains the stats for the full dataset) |
| articles | int | Number of articles in a dataset/topic |
| tokens | int | Number of tokens in a dataset/topic |
| corefchain | int | Number of coref chains in a dataset/topic |
| mentions | int | Number of all mentions in a dataset/topic |
| eventmentions | int | Number of event mentions in a dataset/topic |
| entitymentions | int | Number of entity mentions in a dataset/topic |
| singletons | int | Number of singleton coref chains in a dataset/topic |
| averagesize | float | Average number of mentions in a coref chain, i.e., chain size |
| uniquelemmasall | float | Lexical diversity measurement: the number of unique mention lemmas in a chain. Calculated on all coref chains. |
| uniquelemmaswosingl | float | Same measurement as uniquelemmasall, calculated on non-singleton chains only. |
| phrasingdiversityweightedall | float | Lexical diversity measurement: phrasing diversity (see the LREC paper). Measures the diversity of the mentions given the variation and frequency of the chains' mentions. Calculated on all mentions. |
| phrasingdiversityweightedwosingl | float | Same measurement as phrasingdiversityweightedall, calculated on non-singleton chains only. |
| F1CONLLall | float | F1 CoNLL (average of B3, MUC, and CEAFe) calculated on the simple same-lemma baseline. Calculated on all coref chains. |
| F1CONLLwosingl | float | Same measurement as F1CONLLall, calculated on non-singleton chains only. |

The results of the dataset comparison are available in the /summary folder.
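To make the count-based summary values more concrete, the sketch below derives the number of chains, the number of singletons, the average chain size, and the average number of unique head lemmas per chain from a list of mention records. It also shows the clustering step of a same-lemma baseline, which puts all mentions with the same lowercased head lemma into one predicted chain. This is an illustration under stated assumptions, not the repository's summary script: the dictionary keys and function names are hypothetical, the per-chain lemma counts are assumed to be averaged over chains, and phrasing diversity and the F1 CoNLL scoring are left out.

```python
from collections import defaultdict
from typing import Dict, List


def mean(values: List[float]) -> float:
    """Arithmetic mean that returns 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def summarize(mentions: List[dict]) -> Dict[str, float]:
    """Count-based summary values (assumed definitions, see the text above)."""
    lemmas_per_chain: Dict[str, List[str]] = defaultdict(list)
    for m in mentions:
        lemmas_per_chain[m["coref_chain"]].append(m["mention_head_lemma"].lower())
    chains = list(lemmas_per_chain.values())
    non_singletons = [lemmas for lemmas in chains if len(lemmas) > 1]
    return {
        "chains": len(chains),
        "mentions": len(mentions),
        "singletons": len(chains) - len(non_singletons),
        "average_size": mean([len(lemmas) for lemmas in chains]),
        "unique_lemmas_all": mean([len(set(lemmas)) for lemmas in chains]),
        "unique_lemmas_wo_singl": mean([len(set(lemmas)) for lemmas in non_singletons]),
    }


def same_lemma_clusters(mentions: List[dict]) -> Dict[str, List[str]]:
    """Same-lemma baseline: one predicted chain per lowercased head lemma."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for m in mentions:
        clusters[m["mention_head_lemma"].lower()].append(m["mention_id"])
    return dict(clusters)


# Hypothetical toy data: three gold chains, one of them a singleton.
TOY_MENTIONS = [
    {"mention_id": "m1", "coref_chain": "A", "mention_head_lemma": "Obama"},
    {"mention_id": "m2", "coref_chain": "A", "mention_head_lemma": "president"},
    {"mention_id": "m3", "coref_chain": "A", "mention_head_lemma": "Obama"},
    {"mention_id": "m4", "coref_chain": "B", "mention_head_lemma": "meeting"},
    {"mention_id": "m5", "coref_chain": "C", "mention_head_lemma": "summit"},
    {"mention_id": "m6", "coref_chain": "C", "mention_head_lemma": "meeting"},
]

stats = summarize(TOY_MENTIONS)
baseline = same_lemma_clusters(TOY_MENTIONS)
print(stats)
print(baseline)
```

In the toy data, the baseline merges m4 and m6 because both have the head lemma "meeting", although they belong to different gold chains. Chains with more varied wording are split or merged incorrectly in this way, which is why the baseline's score is informative about the lexical diversity of a dataset.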

Owner

  • Name: Anastasia Zhukova
  • Login: anastasia-zhukova
  • Kind: user
  • Location: Göttingen
  • Company: University of Göttingen & University of Wuppertal

Doctoral Researcher at GippLab: Scientific Information Analytics

Citation (CITATION.BIB)

@inproceedings{Zhukova2022a,
  title        = {{T}owards {E}valuation of {C}ross-document {C}oreference {R}esolution {M}odels {U}sing {D}atasets with {D}iverse {A}nnotation {S}chemes},
  author       = {Zhukova, Anastasia and Hamborg, Felix and Gipp, Bela},
  year         = 2022,
  month        = {June},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  location     = {Marseille, France}
}

GitHub Events

Total
  • Issues event: 1
  • Pull request event: 1
Last Year
  • Issues event: 1
  • Pull request event: 1

Issues and Pull Requests

All Time
  • Total issues: 2
  • Total pull requests: 1
  • Average time to close issues: over 1 year
  • Average time to close pull requests: about 2 years
  • Total issue authors: 1
  • Total pull request authors: 1
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • sago2693 (2)
Pull Request Authors
  • lavanya-m-k (1)

Dependencies

requirements.txt pypi
  • gdown ==4.4.0
  • nltk ==3.7
  • numpy *
  • pandas ==1.4.2
  • shortuuid ==1.0.8
  • spacy ==3.3.0
  • tqdm *