mulms-az-codi2023

Code and resources for our CODI paper "MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain"

https://github.com/boschresearch/mulms-az-codi2023

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: researchgate.net, ieee.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

bcai paper-resource
Last synced: 6 months ago

Repository

Code and resources for our CODI paper "MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain"

Basic Info
  • Host: GitHub
  • Owner: boschresearch
  • License: agpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 14 MB
Statistics
  • Stars: 2
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
bcai paper-resource
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog License Citation

README.md

MuLMS-AZ - Experiment Resources

This repository contains the companion material for the following publication:

Timo Pierre Schrader, Teresa Bürkle, Sophie Henning, Sherry Tan, Matteo Finco, Stefan Grünewald, Maira Indrikova, Felix Hildebrand, Annemarie Friedrich. MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain. CODI 2023.

Please cite this paper if using the dataset or the code, and direct any questions regarding the dataset to Annemarie Friedrich, and any questions regarding the code to Timo Schrader.

Purpose of this Software

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The Multi-Layer Materials Science Argumentative Zoning Corpus (MuLMS-AZ)

The Multi-Layer Materials Science Argumentative Zoning (MuLMS-AZ) corpus consists of 50 documents (licensed CC BY) from the materials science domain, spanning the following 7 subareas: "Electrolysis", "Graphene", "Polymer Electrolyte Fuel Cell (PEMFC)", "Solid Oxide Fuel Cell (SOFC)", "Polymers", "Semiconductors" and "Steel". There are sentence-level and token-level annotations for several NLP tasks, including Argumentative Zoning (AZ). Every sentence in the dataset is labeled with one or multiple argumentative zones. The dataset can be used to train classifiers and text mining systems for argumentative zoning in the materials science domain.

You can find information about all papers and their authors in the MuLMSCorpusMetadata.csv files, including links to copies of their respective licenses.

Argumentative Zoning

Argumentative zones (AZ) describe the rhetorical function of sentences within scientific publications. We therefore model this task on the sentence level. We use a BERT-based transformer to generate contextualized embeddings, use the [CLS] token embedding as the sentence representation, and feed it into a linear classification layer. To compensate for class imbalance, we implement the ML-ROS [1] algorithm, which dynamically clones training instances that belong to so-called minority classes.
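The classification setup described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the encoder that produces the [CLS] embedding is omitted, and the dimensions and threshold are assumptions.

```python
import torch
import torch.nn as nn

class AZClassifier(nn.Module):
    """Sketch of a multi-label AZ head: a linear layer over the [CLS] embedding."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 13):
        super().__init__()
        # In the real model a BERT-style LM (e.g. SciBERT) would produce the
        # [CLS] embedding; here we only show the classification head.
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: shape (batch, hidden) -- one vector per sentence.
        # One logit per AZ label; sigmoid yields independent probabilities,
        # so a sentence can receive multiple zones at once.
        return self.classifier(cls_embedding)

model = AZClassifier()
logits = model(torch.randn(4, 768))  # batch of 4 sentence embeddings
probs = torch.sigmoid(logits)
predicted = probs > 0.5              # multi-label prediction per sentence
```

With 13 output units (one per AZ label listed below) and a sigmoid rather than a softmax, the head naturally supports sentences carrying several argumentative zones.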

On top of that, we provide code for multi-tasking and data augmentation experiments. Whereas our multi-task models use different output heads on top of the language model (one per dataset), data augmentation refers to adding training samples from another dataset into the same training set by mapping that dataset's label set onto ours.
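The label-mapping step for data augmentation can be sketched as below. The mapping itself is purely hypothetical, invented for illustration; the actual correspondences used for each external dataset are described in the paper.

```python
# Hypothetical mapping from an external AZ label set onto MuLMS-AZ labels.
# These external label names are invented for illustration only.
EXTERNAL_TO_MULMS = {
    "OWN_RES": "Results",
    "OWN_CONC": "Conclusion",
    "BKG": "Background",
    "OTH": "Background_PriorWork",
}

def map_labels(external_labels: list[str]) -> list[str]:
    """Keep only labels that have a counterpart in the MuLMS-AZ label set."""
    return [EXTERNAL_TO_MULMS[l] for l in external_labels if l in EXTERNAL_TO_MULMS]

mapped = map_labels(["OWN_RES", "TXT"])  # "TXT" has no counterpart and is dropped
```

Samples whose labels map onto the MuLMS-AZ label set can then be appended to the training split as additional instances.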

We used the following publicly available AZ datasets:

  • AZ-CL [2,3]
  • DRI [4,5]
  • ART [6]
  • Materials science-related subset of PubMed/MEDLINE [7]

Due to legal reasons, we cannot provide the files directly in this repo. Please refer to the README.md in data/additional_arg_zoning_datasets in order to prepare the additional data.

The following argumentative zones are modeled in our dataset: Experiment, Results, ExpPreparation, ExpCharacterization, Background_PriorWork, Explanation, Conclusion, Motivation, Background, Metadata, Caption, Heading, Abstract

We conduct the following experiments for which there are all necessary scripts in this repository:

  • Plain Argumentative Zoning ("AZ")
  • Argumentative Zoning with oversampling ("OS")
  • Multi-tasking with the ART dataset ("MT")

For dataset statistics and counts, please refer to our paper.

Data Format

We provide our dataset in the form of a HuggingFace dataset. Therefore, it can be easily loaded using:

```python
from datasets import Dataset, load_dataset

train_dataset: Dataset = load_dataset(
    MULMS_DATASET_READER_PATH.__str__(),
    data_dir=MULMS_PATH.__str__(),
    data_files=MULMS_FILES,
    name="MuLMS_Corpus",
    split="train",
)
```

It is constructed from UIMA CAS XMI files that can be read and adapted in annotation tools. It operates on a sentence-level basis.

Data Fields

Note: There are more fields in the dataset than relevant to the argumentative zoning task. They are part of another work of ours.

The following fields in the dataset are relevant to argumentative zoning:

  • docid: Identifier of the document that allows for lookup in the MuLMSCorpusMetadata.csv
  • sentence: The raw sentence string.
  • tokens: The pre-tokenized version of the sentence.
  • beginOffset: The character offset of the first token of the sentence in the document.
  • endOffset: The character offset of the last token of the sentence in the document.
  • AZlabels: List of (multiple) AZ labels that belong to the current instance.
  • data_split: The split (train1/2/3/4/5, dev or test) which the current document belongs to.
  • category: One of the seven presented categories which the document is part of.
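To make the field layout concrete, a single row might look as follows. All values here are invented for demonstration; only the field names come from the documentation above.

```python
# Invented example row showing the AZ-relevant fields of the dataset.
row = {
    "docid": "mulms_0001",                 # invented identifier
    "sentence": "The samples were annealed at 800 C.",
    "tokens": ["The", "samples", "were", "annealed", "at", "800", "C", "."],
    "beginOffset": 1042,
    "endOffset": 1078,
    "AZlabels": ["Experiment", "ExpPreparation"],  # a sentence can carry several zones
    "data_split": "train",
    "category": "Steel",
}

# Multi-label access: check membership rather than equality.
is_experiment = "Experiment" in row["AZlabels"]
```

Because AZlabels is a list, downstream code should treat the task as multi-label classification rather than assuming exactly one zone per sentence.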

Split Setup

Our dataset is divided into several splits; please refer to the paper for further explanation:

  • train
  • tune1/tune2/tune3/tune4/tune5
  • dev
  • test
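Selecting the rows of one split then amounts to filtering on the data_split field. The snippet below uses toy stand-in data; on an actual loaded HuggingFace dataset the equivalent would be its filter method.

```python
# Toy stand-in for the loaded dataset; only the fields needed here are shown.
rows = [
    {"docid": "d1", "data_split": "train"},
    {"docid": "d2", "data_split": "dev"},
    {"docid": "d3", "data_split": "test"},
]

# Keep only documents belonging to the dev split.
dev_rows = [r for r in rows if r["data_split"] == "dev"]
dev_ids = [r["docid"] for r in dev_rows]
```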

Setup

Please install all dependencies or create the environment as listed in environment.yml and make sure to have Python 3.9 installed (we recommend 3.9.11). You should also add the root folder of this project to the $PYTHONPATH environment variable; this enables all scripts to automatically find their imports.

NOTE: This code requires Python 3.9. It does **not** support Python 3.8 and below or 3.10 and above due to type hinting and package dependencies.

Code

We provide bash scripts in scripts for each AZ task separately. Furthermore, for subtasks (e.g., multi-tasking), there are additional scripts that contain all necessary parameters. Use these scripts to reproduce the results from our paper and adapt them if you want to run additional experiments. Moreover, you can check all available settings in each Python file via python <script_name.py> --help. Every flag is described individually; some are not used in the bash scripts but might still be of interest for further experiments.

Reproducibility

In order to reproduce the numbers in our paper, we provide the list of all hyperparameters and seeds for each experiment reported in Table 5 individually (only the best run and only with SciBERT). Please note that different GPUs might produce slightly different numbers due to differences in floating-point arithmetic. For further configurations, please refer to our paper. If you need help or have any questions regarding the experiments, feel free to contact us! Also, if you find better configurations, you are invited to let us know!

|                           | AZ   | OS   | MT (+ ART) | MT (+ AZ-CL) |
|---------------------------|------|------|------------|--------------|
| Learning Rate             | 3e-6 | 2e-6 | 2e-6       | 2e-6         |
| Batch Size                | 32   | 32   | 16         | 16           |
| Oversample Percentage     | x    | 0.2  | 0.2        | 0.2          |
| Multi-Task Subsample Rate | x    | x    | 0.4        | 0.4          |
| Seed                      | 3784 | 1848 | 7491       | 725          |

Note: You might obtain slightly different results for multi-tasking depending on your self-prepared training data.

Transformer-based Models

We use BERT-based language models, namely BERT, SciBERT and MatSciBERT, as the contextualized transformer LM at the basis of all our models. Moreover, we implement task-specific output layers on top of the LM. All PyTorch models can be found in the models subdirectory.

Evaluation

Use the aggregatecvscores.py script in the evaluation subdirectory to evaluate the performance of trained models across all five folds.
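Aggregating across folds amounts to computing a mean and standard deviation over the per-fold scores. The sketch below illustrates this with invented example values; the actual aggregatecvscores.py script may compute and report scores differently.

```python
import statistics

# Invented per-fold F1 scores, one per cross-validation fold.
fold_f1_scores = [0.71, 0.69, 0.73, 0.70, 0.72]

mean_f1 = statistics.mean(fold_f1_scores)
std_f1 = statistics.stdev(fold_f1_scores)  # sample standard deviation

print(f"F1 across 5 folds: {mean_f1:.3f} +/- {std_f1:.3f}")
```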

License

This software is open-sourced under the AGPL-3.0 license; see the LICENSE file for details. The MuLMS-AZ corpus is released under the CC BY-SA 4.0 license. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.

Citation

If you use our software or dataset in your scientific work, please cite our paper:

```bibtex
@inproceedings{schrader-etal-2023-mulms,
    title = "{M}u{LMS}-{AZ}: An Argumentative Zoning Dataset for the Materials Science Domain",
    author = {Schrader, Timo and B{\"u}rkle, Teresa and Henning, Sophie and Tan, Sherry and Finco, Matteo and Gr{\"u}newald, Stefan and Indrikova, Maira and Hildebrand, Felix and Friedrich, Annemarie},
    booktitle = "Proceedings of the 4th Workshop on Computational Approaches to Discourse (CODI 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.codi-1.1",
    doi = "10.18653/v1/2023.codi-1.1",
    pages = "1--15",
}
```

References

[1] Addressing imbalance in multilabel classification: Measures and random resampling algorithms (Charte et al., 2015)

[2] An annotation scheme for discourse-level argumentation in research articles (Teufel et al., EACL 1999)

[3] Discourse-level argumentation in scientific articles: human and automatic annotation (Teufel & Moens, 1999)

[4] A Multi-Layered Annotated Corpus of Scientific Papers (Fisas et al., LREC 2016)

[5] On the Discoursive Structure of Computer Graphics Research Papers (Fisas et al., LAW 2015)

[6] An ontology methodology and CISP-the proposed Core Information about Scientific Papers (Soldatova and Liakata, 2007)

[7] Using LSTM Encoder-Decoder for Rhetorical Structure Prediction (de Moura and Feltrim, 2018)

Owner

  • Name: Bosch Research
  • Login: boschresearch
  • Kind: organization
  • Email: opensource@bosch.com

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software or dataset, please cite it as below."
authors:
  - family-names: Schrader
    given-names: Timo Pierre
    orcid: https://orcid.org/0009-0006-5466-4939
  - family-names: Bürkle
    given-names: Teresa
  - family-names: Henning
    given-names: Sophie
    orcid:
  - family-names: Tan
    given-names: Sherry
  - family-names: Finco
    given-names: Matteo
    orcid: https://orcid.org/0000-0003-3559-8340
  - family-names: Grünewald
    given-names: Stefan
  - family-names: Indrikova
    given-names: Maira
    orcid:
  - family-names: Hildebrand
    given-names: Felix
    orcid:
  - family-names: Friedrich
    given-names: Annemarie
    orcid: https://orcid.org/0000-0001-8771-7634
title: "MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain"
version: 1.0.0
doi: 10.18653/v1/2023.codi-1.1
date-released: 2023-07-06

GitHub Events

Total
  • Push event: 3
Last Year
  • Push event: 3

Dependencies

environment.yml conda
  • pip 23.1.2.*
  • python 3.9.11.*