lrebench

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

https://github.com/zjunlp/lrebench

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

benchmark chinese data-augmentation data-augumentation dataset efficient emnlp few-shot information-extraction kg knowledge-graph knowprompt long-tail low-resource lrebench prompt re relation-extraction self-training
Last synced: 6 months ago

Repository

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

Basic Info
Statistics
  • Stars: 34
  • Watchers: 6
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
benchmark chinese data-augmentation data-augumentation dataset efficient emnlp few-shot information-extraction kg knowledge-graph knowprompt long-tail low-resource lrebench prompt re relation-extraction self-training
Created almost 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

LREBench: A low-resource relation extraction benchmark.

This repo is the official implementation of the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study of building relation extraction systems in low-resource settings. Based on recent PLMs, three schemes are comprehensively investigated to evaluate performance in low-resource settings: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation techniques and self-training to generate more labeled in-domain data.


Contents

Environment

To install requirements:

```shell
conda create -n LREBench python=3.9
conda activate LREBench
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
```

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and need to be placed in the dataset folder. The expected files for each dataset are rel2id.json, train.json and test.json.
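For orientation, here is a hedged sketch of how these files are typically consumed; the contents of the instances in train.json are assumptions based on common sentence-level RE formats, not confirmed by the repo:

```python
# Hypothetical illustration of the expected dataset layout. rel2id.json
# maps relation labels to integer ids; train.json/test.json hold the
# instances. Field details beyond the file names are assumptions.
import json

with open("dataset/semeval/rel2id.json") as f:
    rel2id = json.load(f)   # e.g. {"no_relation": 0, "Cause-Effect(e1,e2)": 1, ...}

with open("dataset/semeval/train.json") as f:
    train = json.load(f)    # list of instances: sentence + entity spans + relation

print(len(rel2id), "relations,", len(train), "training instances")
```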

Normal Prompt-based Tuning


1 Initialize Answer Words

Use the command below to get answer words first.

```shell
python get_label_word.py --model_path roberta-large --dataset semeval
```

The {model_path}_{dataset}.pt file will be saved in the dataset folder; set model_path and dataset to the names of the pre-trained language model and the dataset to be used.
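For background, a minimal sketch of the "answer words" idea behind prompt-based RE: each relation label is verbalized into vocabulary tokens so the MLM head can score them at the masked position. The mapping below is purely illustrative, not the repo's actual initialization:

```python
# Hedged illustration of label-word (verbalizer) initialization; the
# relation-to-word mapping here is invented for the example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
rel2word = {"Cause-Effect(e1,e2)": "cause", "Component-Whole(e1,e2)": "component"}
answer_ids = {rel: tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + word))
              for rel, word in rel2word.items()}
print(answer_ids)  # token ids the MLM head scores at the masked position
```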

2 Split Datasets

We provide sampling code for obtaining 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances serve as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, yielding new_test.json and new_rel2id.json. A sketch of this sampling logic follows the example below.

```shell
python sample_8shot.py -h
usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_dir INPUT_DIR, -i INPUT_DIR
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.

python sample_10.py -h
usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.

```

For example:

```shell
python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
cd dataset/semeval
mkdir 8-1
cp 8-shot/new_rel2id.json 8-1/rel2id.json
cp 8-shot/new_test.json 8-1/test.json
cp 8-shot/train8_1.json 8-1/train.json
cp 8-shot/unlabel8_1.json 8-1/label.json
```
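As promised above, a minimal sketch of the K-shot sampling logic (group by relation, drop classes with fewer than K instances, sample K per class); this is an assumption-laden illustration, not the repo's sample_8shot.py:

```python
# Hedged K-shot sampling sketch; assumes each instance carries a
# "relation" field and that leftover instances become unlabeled data.
import random
from collections import defaultdict

def sample_kshot(instances, k=8, seed=0):
    random.seed(seed)
    by_rel = defaultdict(list)
    for ins in instances:
        by_rel[ins["relation"]].append(ins)
    kept = {rel: xs for rel, xs in by_rel.items() if len(xs) >= k}  # drop rare classes
    train = [x for xs in kept.values() for x in random.sample(xs, k)]
    unlabeled = [x for xs in kept.values() for x in xs if x not in train]
    return train, unlabeled
```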

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE and ChemProt with the following commands:

```shell
bash scripts/semeval.sh    # RoBERTa-large
bash scripts/CMeIE.sh      # Chinese RoBERTa-large
bash scripts/ChemProt.sh   # BioBERT-large
```

4 Different prompts


Simply add parameters to the scripts.

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to PTR

Balancing


1 Re-sampling

  • Create the re-sampled training file from the 10% training set with resample.py (a re-sampling sketch follows the example below).

```shell
python resample.py -h
usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE

  optional arguments:
    -h, --help            show this help message and exit
    --input_file INPUT_FILE, -i INPUT_FILE
                          The path of the training file.
    --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                          The directory of the sampled files.
    --rel_file REL_FILE, -r REL_FILE
                          the path of the relation file

```

For example,

```shell
mkdir dataset/semeval/10sa-1
python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
cd dataset/semeval
cp rel2id.json test.json 10sa-1/
cp sa/sa1.json 10sa-1/train.json
```
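For intuition, a hedged sketch of one common re-sampling strategy (oversampling tail classes up to the head-class count); resample.py may differ in details:

```python
# Illustrative class-balanced oversampling; assumes a "relation" field.
import random
from collections import defaultdict

def oversample(instances, seed=0):
    random.seed(seed)
    by_rel = defaultdict(list)
    for ins in instances:
        by_rel[ins["relation"]].append(ins)
    target = max(len(xs) for xs in by_rel.values())          # head-class size
    out = []
    for xs in by_rel.values():
        out.extend(xs)
        out.extend(random.choices(xs, k=target - len(xs)))   # duplicate tail instances
    random.shuffle(out)
    return out
```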

2 Re-weighting Loss

Simply add the --useloss parameter to the script to choose a re-weighting loss.

For example: --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
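As background, a minimal multi-class focal loss sketch; the repo's MultiFocalLoss may differ in class weighting and reduction details:

```python
# Hedged focal-loss sketch: down-weights well-classified examples so
# training focuses on hard (often tail-class) instances.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of gold class
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()

# usage (shapes: logits [N, num_relations], targets [N]):
# loss = focal_loss(model_logits, gold_label_ids)
```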

Data Augmentation


1 Prepare the environment

```shell
pip install nlpaug nlpcda
```

Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).

2 Try different DA methods

We provide many data augmentation methods:

  • English (nlpaug): TF-IDF, contextual word embeddings (BERT and RoBERTa), and WordNet synonyms (-lan en, -d).
  • Chinese (nlpcda): synonyms (-lan cn).
  • All DA methods can be applied to contexts, entities, or both (--locations).
  • Generate augmented data:

```shell
python DA.py -h
usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
             [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
             [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
             [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM]
             [--change_rate CHANGE_RATE]

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        the training set file
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.
  --language {en,cn}, -lan {en,cn}
                        DA for English or Chinese
  --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                        List of positions that you want to manipulate
  --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                        Data augmentation method
  --model_dir MODEL_DIR, -m MODEL_DIR
                        the path of pretrained models used in DA methods
  --model_name MODEL_NAME, -mn MODEL_NAME
                        model from huggingface
  --create_num CREATE_NUM, -cn CREATE_NUM
                        The number of samples augmented from one instance.
  --change_rate CHANGE_RATE, -cr CHANGE_RATE
                        the changing rate of text
```

Take context-level DA based on contextual word embeddings on ChemProt as an example:

```shell
python DA.py \
    -i dataset/ChemProt/10/train10per_1.json \
    -o dataset/ChemProt/aug \
    -d word_embedding_bert \
    -mn dmis-lab/biobert-large-cased-v1.1 \
    -l sent1 sent2 sent3
```
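Under the hood this kind of augmentation relies on nlpaug; a hedged sketch of contextual-word-embedding substitution with that library (model name and sentence are illustrative):

```python
# Minimal nlpaug sketch: substitute words using a masked LM's contextual
# predictions. DA.py wraps calls of this kind; details here are assumptions.
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",   # e.g. a BioBERT checkpoint for ChemProt
    action="substitute",
)
print(aug.augment("The drug inhibits the kinase activity of the enzyme."))
```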

  • Delete repeated instances and get the final augmented data (a hedged dedup sketch follows the example below):

```shell
python merge_dataset.py -h
usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
                        List of input files containing datasets to merge
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output file containing merged dataset
```

For example:

```shell
python merge_dataset.py \
    -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
    -o dataset/ChemProt/aug/merge.json
```
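A hedged sketch of the duplicate-removal idea (serialize each instance to a canonical JSON string and keep first occurrences); merge_dataset.py's actual logic may differ:

```python
# Illustrative merge-and-deduplicate; assumes each file holds a JSON list.
import json

def merge_datasets(paths, out_path):
    seen, merged = set(), []
    for path in paths:
        with open(path) as f:
            for ins in json.load(f):
                key = json.dumps(ins, sort_keys=True, ensure_ascii=False)
                if key not in seen:        # skip exact repeats
                    seen.add(key)
                    merged.append(ins)
    with open(out_path, "w") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
```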

Self-training for Semi-supervised learning

  • Train a teacher model on a few labeled data (8-shot or 10%).
  • Place the unlabeled data label.json in the corresponding dataset folder.
  • Assign pseudo labels with the trained teacher model: add --labeling True to the script to obtain the pseudo-labeled dataset label2.json.
  • Put the gold-labeled and pseudo-labeled data together. For example:

```shell
python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
cd dataset/semeval
cp rel2id.json test.json 10la-1/
```

  • Train the final student model: add --stutrain True to the script. A hedged sketch of the combine step follows this list.
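As referenced in the last step above, a minimal hedged sketch of combining gold and pseudo-labeled data; self-train_combine.py may filter or weight pseudo labels differently:

```python
# Illustrative gold + pseudo-label combination; assumes JSON-list files.
import json, os

def combine(gold_path, pseudo_path, out_dir):
    with open(gold_path) as f:
        gold = json.load(f)
    with open(pseudo_path) as f:
        pseudo = json.load(f)            # instances labeled by the teacher
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.json"), "w") as f:
        json.dump(gold + pseudo, f, ensure_ascii=False, indent=2)
```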

Standard Fine-tuning Baseline


Fine-tuning

Citation

If you use the code, please cite the following paper:

```bibtex
@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin and Chen, Xiang and Zhang, Ningyu and Xie, Xin and Chen, Xi and Chen, Huajun",
    editor = "Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}
```

Owner

  • Name: ZJUNLP
  • Login: zjunlp
  • Kind: organization
  • Location: China

An NLP & KG Group of Zhejiang University

Citation (CITATION.cff)

cff-version: "1.0.0"
message: "If you use the code, please cite the following paper:"
title: "LREBench"
repository-code: "https://github.com/zjunlp/LREBench"
authors: 
  - family-names: Xu
    given-names: Xin
  - family-names: Chen
    given-names: Xiang
  - family-names: Zhang
    given-names: Ningyu
  - family-names: Xie
    given-names: Xin
  - family-names: Chen
    given-names: Xi
  - family-names: Chen
    given-names: Huajun
preferred-citation:
  type: article
  title: "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study"
  authors:
  - family-names: Xu
    given-names: Xin
  - family-names: Chen
    given-names: Xiang
  - family-names: Zhang
    given-names: Ningyu
  - family-names: Xie
    given-names: Xin
  - family-names: Chen
    given-names: Xi
  - family-names: Chen
    given-names: Huajun
  journal: "Conference on Empirical Methods in Natural Language Processing (EMNLP)"
  year: 2022

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 66
  • Total Committers: 2
  • Avg Commits per committer: 33.0
  • Development Distribution Score (DDS): 0.136
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Xin Xu x****2@1****m 57
Eric z****0@v****m 9
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • PyYAML ==5.4.1
  • activations ==0.1.0
  • dataclasses *
  • file_utils ==0.0.1
  • flax ==0.3.4
  • numpy *
  • pytest *
  • pytorch_lightning ==1.3.1
  • regex ==2021.4.4
  • scikit-learn *
  • tokenizers ==0.10.3
  • torch ==1.11.0
  • torchmetrics ==0.5
  • torchsampler *
  • tqdm ==4.49.0
  • transformers ==4.7.0
  • utils ==1.0.1
.github/workflows/python-package-conda.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 5.1
  • backcall 0.2.0
  • beautifulsoup4 4.11.1
  • blas 1.0
  • blessings 1.7
  • bottleneck 1.3.5
  • brotli 1.0.9
  • brotli-bin 1.0.9
  • brotlipy 0.7.0
  • ca-certificates 2022.10.11
  • certifi 2022.12.7
  • cffi 1.15.1
  • contourpy 1.0.5
  • cryptography 38.0.1
  • cudatoolkit 11.3.1
  • cycler 0.11.0
  • dbus 1.13.18
  • decorator 5.1.1
  • defusedxml 0.7.1
  • entrypoints 0.4
  • expat 2.4.9
  • fontconfig 2.14.1
  • freetype 2.11.0
  • giflib 5.2.1
  • glib 2.69.1
  • gpustat 0.6.0
  • gst-plugins-base 1.14.0
  • gstreamer 1.14.0
  • icu 58.2
  • intel-openmp 2022.0.1
  • ipykernel 6.15.2
  • ipython_genutils 0.2.0
  • jedi 0.18.1
  • jinja2 3.1.2
  • jpeg 9e
  • jupyter_client 7.4.8
  • jupyter_core 4.11.2
  • jupyter_server 1.23.4
  • jupyterlab_pygments 0.1.2
  • jupyterlab_server 2.16.3
  • krb5 1.19.2
  • lcms2 2.12
  • ld_impl_linux-64 2.38
  • lerc 3.0
  • libbrotlicommon 1.0.9
  • libbrotlidec 1.0.9
  • libbrotlienc 1.0.9
  • libclang 10.0.1
  • libdeflate 1.8
  • libedit 3.1.20221030
  • libevent 2.1.12
  • libffi 3.3
  • libgcc-ng 11.2.0
  • libgfortran-ng 11.2.0
  • libgfortran5 11.2.0
  • libgomp 11.2.0
  • libllvm10 10.0.1
  • libopenblas 0.3.21
  • libpng 1.6.37
  • libpq 12.9
  • libsodium 1.0.18
  • libstdcxx-ng 11.2.0
  • libtiff 4.4.0
  • libuv 1.40.0
  • libwebp 1.2.4
  • libwebp-base 1.2.4
  • libxcb 1.15
  • libxkbcommon 1.0.1
  • libxml2 2.9.14
  • libxslt 1.1.35
  • lxml 4.9.1
  • lz4-c 1.9.4
  • markupsafe 2.1.1
  • matplotlib-base 3.6.2
  • matplotlib-inline 0.1.6
  • mkl 2022.0.1
  • munkres 1.1.4
  • ncurses 6.3
  • nest-asyncio 1.5.5
  • nspr 4.33
  • nss 3.74
  • numexpr 2.8.4
  • numpy-base 1.23.4
  • nvidia-ml 7.352.0
  • openssl 1.1.1s
  • pandocfilters 1.5.0
  • parso 0.8.3
  • pcre 8.45
  • pexpect 4.8.0
  • pickleshare 0.7.5
  • pip 21.2.4
  • ply 3.11
  • prometheus_client 0.14.1
  • psutil 5.8.0
  • ptyprocess 0.7.0
  • pure_eval 0.2.2
  • pycparser 2.21
  • pyopenssl 22.0.0
  • pyparsing 3.0.9
  • pyqt 5.15.7
  • pyqt5-sip 12.11.0
  • pysocks 1.7.1
  • python 3.9.12
  • python-dateutil 2.8.2
  • python-fastjsonschema 2.16.2
  • pytorch-mutex 1.0
  • qt-main 5.15.2
  • qt-webengine 5.15.9
  • qtwebkit 5.212
  • readline 8.1.2
  • send2trash 1.8.0
  • setuptools 61.2.0
  • sip 6.6.2
  • six 1.16.0
  • sniffio 1.2.0
  • soupsieve 2.3.2.post1
  • sqlite 3.39.3
  • stack_data 0.2.0
  • tk 8.6.12
  • toml 0.10.2
  • tomli 2.0.1
  • tornado 6.2
  • typing_extensions 4.1.1
  • tzdata 2022a
  • wcwidth 0.2.5
  • wheel 0.37.1
  • xz 5.2.5
  • zeromq 4.3.4
  • zipp 3.8.0
  • zlib 1.2.12
  • zstd 1.5.2