mulms-wiesp2023

Code and resources for our WIESP paper "MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain"

https://github.com/boschresearch/mulms-wiesp2023

Keywords

bcai paper-resource

Repository

Basic Info

Host: GitHub
Owner: boschresearch
License: agpl-3.0
Language: Python
Default Branch: master
Homepage:
Size: 1.32 MB

Statistics

Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Releases: 0

Created over 2 years ago · Last pushed about 2 years ago

Multi-Layer Materials Science Corpus - Experiment Resources

This repository contains the companion material for the following publication:

Timo Pierre Schrader, Matteo Finco, Stefan Grünewald, Felix Hildebrand, Annemarie Friedrich. MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain. WIESP 2023.

Please cite this paper if using the dataset or the code, and direct any questions regarding the dataset to Annemarie Friedrich, and any questions regarding the code to Timo Schrader.

Purpose of this Software

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

The Multi-Layer Materials Science Corpus (MuLMS)

The Multi-Layer Materials Science corpus (MuLMS) consists of 50 documents (licensed CC BY) from the materials science domain, spanning across the following 7 subareas: "Electrolysis", "Graphene", "Polymer Electrolyte Fuel Cell (PEMFC)", "Solid Oxide Fuel Cell (SOFC)", "Polymers", "Semiconductors" and "Steel".

For a detailed description, please refer to our HuggingFace Dataset and our paper.

NOTE: This code requires Python 3.9. It does **not support Python 3.8 and below**, and the Setup section notes that 3.10 and above are unsupported as well.

Available Experiments

Named Entity Recognition

Named entity (NE) recognition is a token-level tagging task that identifies and labels named entities. For instance, "WO3" is an example of a "Material" in our dataset. Because named entities occur at the token level and can span multiple input tokens in a sentence, we model this task using two different approaches: the BILOU tagging scheme with a CRF classification layer, and dependency parsing, where we treat NEs as dependencies between the first and last token.

Furthermore, we provide datasets for multi-task experiments in which we incorporate other related datasets and their named entities to support the classifiers in learning our NEs.
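
The two encodings mentioned above can be illustrated with a minimal sketch. This is not the repository's implementation: the function names, the example sentence, and its annotations are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

# (first_token, last_token, label); both indices are inclusive.
Span = Tuple[int, int, str]


def spans_to_bilou(num_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping entity spans as one BILOU tag per token."""
    tags = ["O"] * num_tokens
    for first, last, label in spans:
        if first == last:
            tags[first] = f"U-{label}"  # Unit: single-token entity
        else:
            tags[first] = f"B-{label}"  # Begin
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"  # Inside
            tags[last] = f"L-{label}"  # Last
    return tags


def spans_to_arc_matrix(
    num_tokens: int, spans: List[Span]
) -> List[List[Optional[str]]]:
    """Encode each entity as a labeled arc from its first to its last token."""
    matrix: List[List[Optional[str]]] = [
        [None] * num_tokens for _ in range(num_tokens)
    ]
    for first, last, label in spans:
        matrix[first][last] = label
    return matrix


# Illustrative sentence and annotations (hypothetical, not taken from the corpus).
tokens = ["The", "WO3", "thin", "film", "was", "annealed"]
spans = [(1, 1, "MAT"), (2, 3, "FORM")]

bilou_tags = spans_to_bilou(len(tokens), spans)
arc_matrix = spans_to_arc_matrix(len(tokens), spans)
print(list(zip(tokens, bilou_tags)))
```

One reason to consider the arc-style encoding is that it can represent nested or overlapping entities, which a single tag sequence per sentence cannot.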

The following named entities are modeled in our dataset: MAT, NUM, VALUE, UNIT, PROPERTY, CITE, TECHNIQUE, RANGE, INSTRUMENT, SAMPLE, FORM, DEV, MEASUREMENT
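
As a rough sketch, the entity types listed above can be expanded into a BILOU tag vocabulary for a tagging classifier. The helper name and the ordering of the tags are assumptions; the repository may build its label set differently.

```python
from typing import Dict, List

# Entity types as listed in this README.
ENTITY_TYPES = [
    "MAT", "NUM", "VALUE", "UNIT", "PROPERTY", "CITE", "TECHNIQUE",
    "RANGE", "INSTRUMENT", "SAMPLE", "FORM", "DEV", "MEASUREMENT",
]


def build_bilou_tagset(entity_types: List[str]) -> List[str]:
    """Return the 'O' tag plus B-, I-, L-, and U- tags for every entity type."""
    tags = ["O"]
    for label in entity_types:
        tags.extend(f"{prefix}-{label}" for prefix in "BILU")
    return tags


TAGS = build_bilou_tagset(ENTITY_TYPES)
# Hypothetical tag-to-index mapping, as a classifier output layer would need.
tag2id: Dict[str, int] = {tag: idx for idx, tag in enumerate(TAGS)}
print(len(TAGS))
```

With 13 entity types and four positional prefixes each, plus the outside tag, this sketch yields 53 tags.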

Relation Extraction

MuLMS provides relations between pairs of entities. There are two types of relations: measurement-related relations and further relations. The first type always starts at Measurement trigger spans; the second type does not start at a specific Measurement annotation.

There are the following relation types in MuLMS: hasForm, measuresProperty, usedAs, propertyValue, conditionProperty, conditionSample, conditionPropertyValue, usesTechnique, measuresPropertyValue, usedTogether, conditionEnv, usedIn, conditionInstrument, takenFrom, dopedBy
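
One possible in-memory representation of such relations is sketched below. The class and function names are hypothetical and the example relation is invented for illustration; only the relation type names and the rule that measurement-related relations start at a Measurement span come from the description above.

```python
from dataclasses import dataclass

# Relation types as listed in this README.
RELATION_TYPES = frozenset({
    "hasForm", "measuresProperty", "usedAs", "propertyValue",
    "conditionProperty", "conditionSample", "conditionPropertyValue",
    "usesTechnique", "measuresPropertyValue", "usedTogether",
    "conditionEnv", "usedIn", "conditionInstrument", "takenFrom", "dopedBy",
})


@dataclass(frozen=True)
class EntitySpan:
    first: int  # index of the first token (inclusive)
    last: int   # index of the last token (inclusive)
    label: str  # entity type, e.g. "MEASUREMENT"


@dataclass(frozen=True)
class Relation:
    head: EntitySpan  # span at which the relation starts
    tail: EntitySpan  # span at which the relation ends
    label: str

    def __post_init__(self) -> None:
        if self.label not in RELATION_TYPES:
            raise ValueError(f"Unknown relation type: {self.label}")


def starts_at_measurement(relation: Relation) -> bool:
    """True if the relation starts at a Measurement trigger span."""
    return relation.head.label == "MEASUREMENT"


# Hypothetical example: a measurement trigger linked to the property it measures.
trigger = EntitySpan(first=4, last=4, label="MEASUREMENT")
prop = EntitySpan(first=1, last=2, label="PROPERTY")
example = Relation(head=trigger, tail=prop, label="measuresProperty")
print(starts_at_measurement(example))
```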

Measurement Classification

This task is about classifying experiment-describing sentences as qualitative or quantitative. Whereas a quantitative sentence describes technical details about measurement procedures and experiments, a qualitative sentence describes them only at a high level, leaving out important details.

We model this task on a sentence-level basis as a ternary classification task using the tagset MEASUREMENT, QUAL_MEAS and NONE.
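
To make the ternary setup concrete, the following sketch maps one score per class to a predicted label. The label order, the function name, and the example scores are assumptions; the actual models in this repository produce their predictions with a transformer-based classifier.

```python
from typing import Dict, List, Sequence

# Tagset from this README; the index order is an arbitrary choice for this sketch.
LABELS: List[str] = ["MEASUREMENT", "QUAL_MEAS", "NONE"]
label2id: Dict[str, int] = {label: idx for idx, label in enumerate(LABELS)}


def predict_label(scores: Sequence[float]) -> str:
    """Return the label with the highest score for one sentence."""
    if len(scores) != len(LABELS):
        raise ValueError(f"Expected {len(LABELS)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda idx: scores[idx])
    return LABELS[best]


# Hypothetical classifier scores for a single sentence.
print(predict_label([0.2, 1.7, -0.4]))
```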

Argumentative Zoning

For the argumentative zoning (AZ) part of MuLMS that is presented in the related publication MuLMS-AZ: An Argumentative Zoning Dataset for the Materials Science Domain, please refer to the separate repository on GitHub, which is a submodule of this repository and hence does not need an extra download.

Data Format

Split Setup

Our dataset is divided into several splits; please refer to the paper for further explanation:

train
tune1/tune2/tune3/tune4/tune5
dev
test
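
If the five tune splits are used as cross-validation folds, with one fold held out per run (an assumption made here; the paper defines the actual protocol), the fold configurations could be enumerated as in this sketch. The function name and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Union

TUNE_SPLITS: List[str] = [f"tune{i}" for i in range(1, 6)]

FoldConfig = Dict[str, Union[str, List[str]]]


def cv_configurations(tune_splits: List[str]) -> List[FoldConfig]:
    """Build one configuration per fold: hold out one split, train on the rest.

    ASSUMPTION: this leave-one-fold-out setup is illustrative only and is not
    confirmed by this README.
    """
    configs: List[FoldConfig] = []
    for held_out in tune_splits:
        train_on = [split for split in tune_splits if split != held_out]
        configs.append({"train_on": train_on, "held_out": held_out})
    return configs


for config in cv_configurations(TUNE_SPLITS):
    print(config)
```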

Setup

Please install all dependencies or the environment as listed in environment.yml and make sure to have Python 3.9 installed (we recommend 3.9.11). You might also add the root folder of this project to the $PYTHONPATH environment variable. This enables all scripts to automatically find the imports.

Important: Also clone the Git submodule in this repo that points to the MuLMS-AZ repo. The code files there are required to run the experiments in this repo. Do this by appending the Git flag --recurse-submodules to git clone.

NOTE: This code really requires Python 3.9. It does **not support Python 3.8 and below or 3.10 and above due to type hinting and package dependencies.**
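
A small guard like the following could be used to check the interpreter version before running any experiment. It is a sketch and not part of the repository; the function name and the constant are assumptions.

```python
import sys
from typing import Sequence

REQUIRED_VERSION = (3, 9)  # major and minor version stated in this README


def is_supported(version_info: Sequence[int] = tuple(sys.version_info[:3])) -> bool:
    """Return True only if the interpreter is a Python 3.9.x release."""
    return tuple(version_info[:2]) == REQUIRED_VERSION


status = "supported" if is_supported() else "not supported"
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}.")
```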

Code

We provide bash scripts in the scripts folder for each NLP task separately. Furthermore, for subtasks (e.g., multi-tasking), there are additional scripts that contain all necessary parameters. Use these scripts to reproduce the results from our paper and adapt them if you want to run additional experiments. Moreover, you can check all available settings of each Python file via python <script_name.py> --help.

Transformer-based Models

We use BERT-based language models, namely BERT, SciBERT and MatSciBERT, as the contextualized transformer LMs underlying all our models. Moreover, we implement task-specific output layers on top of the LM. The PyTorch models can be found in the models subdirectory of each task.

Multi-task Datasets

Download the SOFC corpus and place the contents in data/.
For MSPT, you need to first convert the corpus to UIMA CAS format using for example INCEpTION. You can find all UIMA CAS span types in source/datahandling/msptdataset.py.

Evaluation

Use the aggregate_cv_score.py scripts in the evaluation subdirectory of each task to evaluate the performance of trained models across all five folds.
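
Conceptually, aggregating over folds amounts to summarizing one score per fold. The sketch below is not the code of aggregate_cv_score.py; the function name and the example scores are hypothetical, and whether the repository reports the sample or the population standard deviation is not stated here.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def aggregate_fold_scores(fold_scores: Sequence[float]) -> Dict[str, float]:
    """Summarize per-fold scores by their mean and sample standard deviation."""
    if len(fold_scores) < 2:
        raise ValueError("At least two fold scores are needed for a standard deviation")
    return {"mean": mean(fold_scores), "std": stdev(fold_scores)}


# Hypothetical scores from five folds (not results from the paper).
example_scores = [0.70, 0.72, 0.68, 0.71, 0.69]
summary = aggregate_fold_scores(example_scores)
print(f"{summary['mean']:.3f} +/- {summary['std']:.3f}")
```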

License

This software is open-sourced under the AGPL-3.0 license. See the LICENSE file for details. The MuLMS-AZ corpus is released under the CC BY-SA 4.0 license. See the LICENSE file for details. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.

Citation

If you use our software or dataset in your scientific work, please cite our paper:

@misc{schrader2023mulms,
      title={MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain},
      author={Timo Pierre Schrader and Matteo Finco and Stefan Grünewald and Felix Hildebrand and Annemarie Friedrich},
      year={2023},
      eprint={2310.15569},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Owner

Name: Bosch Research
Login: boschresearch
Kind: organization
Email: opensource@bosch.com

Website: https://www.bosch.com/research
Repositories: 91
Profile: https://github.com/boschresearch

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software or dataset, please cite it as below."
authors:
  - family-names: Schrader
    given-names: Timo Pierre
    orcid: https://orcid.org/0009-0006-5466-4939
  - family-names: Finco
    given-names: Matteo
    orcid: https://orcid.org/0000-0003-3559-8340
  - family-names: Grünewald
    given-names: Stefan
  - family-names: Hildebrand
    given-names: Felix
    orcid:
  - family-names: Friedrich
    given-names: Annemarie
    orcid: https://orcid.org/0000-0001-8771-7634
title: "MuLMS: A Multi-Layer Annotated Text Corpus for Information Extraction in the Materials Science Domain"
version: 1.0.0
# doi: TODO
date-released:

GitHub Events

Total

Watch event: 1

Last Year

Watch event: 1

Dependencies

environment.yml pypi

datasets ==2.4.0
lxml ==4.7.1
nltk ==3.8.1
numpy ==1.23.5
pandas ==1.3.2
pre-commit ==3.3.3
pyarrow ==6.0.1
scikit_learn ==1.1.1
seaborn ==0.12.0
torch ==2.1.0
tqdm ==4.63.0
transformers ==4.30.0
