lrebench

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

https://github.com/zjunlp/lrebench

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 1 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.6%) to scientific vocabulary

Keywords

benchmark chinese data-augmentation data-augumentation dataset efficient emnlp few-shot information-extraction kg knowledge-graph knowprompt long-tail low-resource lrebench prompt re relation-extraction self-training
Last synced: 6 months ago

Repository

[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study

Basic Info
Statistics
  • Stars: 34
  • Watchers: 6
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Topics
benchmark chinese data-augmentation data-augumentation dataset efficient emnlp few-shot information-extraction kg knowledge-graph knowprompt long-tail low-resource lrebench prompt re relation-extraction self-training
Created almost 4 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

LREBench: A low-resource relation extraction benchmark.

This repo is the official implementation of the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].

This paper presents an empirical study of building relation extraction systems in low-resource settings. Based on recent PLMs, three schemes are comprehensively investigated to evaluate performance in low-resource settings: $(i)$ different types of prompt-based methods with few-shot labeled data; $(ii)$ diverse balancing methods to address the long-tailed distribution issue; $(iii)$ data augmentation techniques and self-training to generate more labeled in-domain data.


Contents

Environment

To install requirements:

```shell
conda create -n LREBench python=3.9
conda activate LREBench
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
```

Datasets

We provide 8 benchmark datasets and prompts used in our experiments.

All processed full-shot datasets can be downloaded and need to be placed in the dataset folder. The expected files for each dataset are rel2id.json, train.json and test.json.
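For orientation, here is a hedged sketch of how these files are typically consumed; the contents of the instances in train.json are assumptions based on common sentence-level RE formats, not confirmed by the repo:

```python
# Hypothetical illustration of the expected dataset layout. rel2id.json
# maps relation labels to integer ids; train.json/test.json hold the
# instances. Field details beyond the file names are assumptions.
import json

with open("dataset/semeval/rel2id.json") as f:
    rel2id = json.load(f)   # e.g. {"no_relation": 0, "Cause-Effect(e1,e2)": 1, ...}

with open("dataset/semeval/train.json") as f:
    train = json.load(f)    # list of instances: sentence + entity spans + relation

print(len(rel2id), "relations,", len(train), "training instances")
```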

Normal Prompt-based Tuning


1 Initialize Answer Words

Use the command below to get answer words first.

```shell
python get_label_word.py --model_path roberta-large --dataset semeval
```

The {model_path}_{dataset}.pt file will be saved in the dataset folder; set model_path and dataset to the names of the pre-trained language model and the dataset to be used.
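For background, a minimal sketch of the "answer words" idea behind prompt-based RE: each relation label is verbalized into vocabulary tokens so the MLM head can score them at the masked position. The mapping below is purely illustrative, not the repo's actual initialization:

```python
# Hedged illustration of label-word (verbalizer) initialization; the
# relation-to-word mapping here is invented for the example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
rel2word = {"Cause-Effect(e1,e2)": "cause", "Component-Whole(e1,e2)": "component"}
answer_ids = {rel: tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + word))
              for rel, word in rel2word.items()}
print(answer_ids)  # token ids the MLM head scores at the masked position
```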

2 Split Datasets

We provide sampling code for obtaining 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances serve as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, yielding new_test.json and new_rel2id.json. A sketch of this sampling logic follows the example below.

```shell
python sample_8shot.py -h
usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_dir INPUT_DIR, -i INPUT_DIR
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.

python sample_10.py -h
usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.

```

For example:

```shell
python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
cd dataset/semeval
mkdir 8-1
cp 8-shot/new_rel2id.json 8-1/rel2id.json
cp 8-shot/new_test.json 8-1/test.json
cp 8-shot/train8_1.json 8-1/train.json
cp 8-shot/unlabel8_1.json 8-1/label.json
```
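As promised above, a minimal sketch of the K-shot sampling logic (group by relation, drop classes with fewer than K instances, sample K per class); this is an assumption-laden illustration, not the repo's sample_8shot.py:

```python
# Hedged K-shot sampling sketch; assumes each instance carries a
# "relation" field and that leftover instances become unlabeled data.
import random
from collections import defaultdict

def sample_kshot(instances, k=8, seed=0):
    random.seed(seed)
    by_rel = defaultdict(list)
    for ins in instances:
        by_rel[ins["relation"]].append(ins)
    kept = {rel: xs for rel, xs in by_rel.items() if len(xs) >= k}  # drop rare classes
    train = [x for xs in kept.values() for x in random.sample(xs, k)]
    unlabeled = [x for xs in kept.values() for x in xs if x not in train]
    return train, unlabeled
```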

3 Prompt-based Tuning

All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE and ChemProt with the following commands:

```shell
bash scripts/semeval.sh    # RoBERTa-large
bash scripts/CMeIE.sh      # Chinese RoBERTa-large
bash scripts/ChemProt.sh   # BioBERT-large
```

4 Different prompts


Simply add parameters to the scripts.

Template Prompt: --use_template_words 0

Schema Prompt: --use_template_words 0 --use_schema_prompt True

PTR: refer to PTR

Balancing


1 Re-sampling

  • Create the re-sampled training file from the 10% training set with resample.py (a re-sampling sketch follows the example below).

```shell
python resample.py -h
usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE

  optional arguments:
    -h, --help            show this help message and exit
    --input_file INPUT_FILE, -i INPUT_FILE
                          The path of the training file.
    --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                          The directory of the sampled files.
    --rel_file REL_FILE, -r REL_FILE
                          the path of the relation file

```

For example,

```shell
mkdir dataset/semeval/10sa-1
python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
cd dataset/semeval
cp rel2id.json test.json 10sa-1/
cp sa/sa1.json 10sa-1/train.json
```
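For intuition, a hedged sketch of one common re-sampling strategy (oversampling tail classes up to the head-class count); resample.py may differ in details:

```python
# Illustrative class-balanced oversampling; assumes a "relation" field.
import random
from collections import defaultdict

def oversample(instances, seed=0):
    random.seed(seed)
    by_rel = defaultdict(list)
    for ins in instances:
        by_rel[ins["relation"]].append(ins)
    target = max(len(xs) for xs in by_rel.values())          # head-class size
    out = []
    for xs in by_rel.values():
        out.extend(xs)
        out.extend(random.choices(xs, k=target - len(xs)))   # duplicate tail instances
    random.shuffle(out)
    return out
```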

2 Re-weighting Loss

Simply add the --useloss parameter to the script to choose a re-weighting loss.

For example: --useloss MultiFocalLoss (choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss).
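As background, a minimal multi-class focal loss sketch; the repo's MultiFocalLoss may differ in class weighting and reduction details:

```python
# Hedged focal-loss sketch: down-weights well-classified examples so
# training focuses on hard (often tail-class) instances.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log prob of gold class
    pt = log_pt.exp()
    return (-((1 - pt) ** gamma) * log_pt).mean()

# usage (shapes: logits [N, num_relations], targets [N]):
# loss = focal_loss(model_logits, gold_label_ids)
```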

Data Augmentation


1 Prepare the environment

```shell
pip install nlpaug nlpcda
```

Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).

2 Try different DA methods

We provide many data augmentation methods:

  • English (nlpaug): TF-IDF, contextual word embeddings (BERT and RoBERTa), and WordNet synonyms (-lan en, -d).
  • Chinese (nlpcda): synonyms (-lan cn).
  • All DA methods can be applied to contexts, entities, or both (--locations).
  • Generate augmented data:

```shell
python DA.py -h
usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
             [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
             [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
             [--model_dir MODEL_DIR] [--model_name MODEL_NAME] [--create_num CREATE_NUM]
             [--change_rate CHANGE_RATE]

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        the training set file
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.
  --language {en,cn}, -lan {en,cn}
                        DA for English or Chinese
  --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                        List of positions that you want to manipulate
  --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                        Data augmentation method
  --model_dir MODEL_DIR, -m MODEL_DIR
                        the path of pretrained models used in DA methods
  --model_name MODEL_NAME, -mn MODEL_NAME
                        model from huggingface
  --create_num CREATE_NUM, -cn CREATE_NUM
                        The number of samples augmented from one instance.
  --change_rate CHANGE_RATE, -cr CHANGE_RATE
                        the changing rate of text
```

Take context-level DA based on contextual word embeddings on ChemProt as an example:

```shell
python DA.py \
    -i dataset/ChemProt/10/train10per_1.json \
    -o dataset/ChemProt/aug \
    -d word_embedding_bert \
    -mn dmis-lab/biobert-large-cased-v1.1 \
    -l sent1 sent2 sent3
```
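Under the hood this kind of augmentation relies on nlpaug; a hedged sketch of contextual-word-embedding substitution with that library (model name and sentence are illustrative):

```python
# Minimal nlpaug sketch: substitute words using a masked LM's contextual
# predictions. DA.py wraps calls of this kind; details here are assumptions.
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",   # e.g. a BioBERT checkpoint for ChemProt
    action="substitute",
)
print(aug.augment("The drug inhibits the kinase activity of the enzyme."))
```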

  • Delete repeated instances and get the final augmented data (a hedged dedup sketch follows the example below):

```shell
python merge_dataset.py -h
usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
                        List of input files containing datasets to merge
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output file containing merged dataset
```

For example:

```shell
python merge_dataset.py \
    -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
    -o dataset/ChemProt/aug/merge.json
```
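A hedged sketch of the duplicate-removal idea (serialize each instance to a canonical JSON string and keep first occurrences); merge_dataset.py's actual logic may differ:

```python
# Illustrative merge-and-deduplicate; assumes each file holds a JSON list.
import json

def merge_datasets(paths, out_path):
    seen, merged = set(), []
    for path in paths:
        with open(path) as f:
            for ins in json.load(f):
                key = json.dumps(ins, sort_keys=True, ensure_ascii=False)
                if key not in seen:        # skip exact repeats
                    seen.add(key)
                    merged.append(ins)
    with open(out_path, "w") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
```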

Self-training for Semi-supervised learning

  • Train a teacher model on a few labeled data (8-shot or 10%).
  • Place the unlabeled data label.json in the corresponding dataset folder.
  • Assign pseudo labels with the trained teacher model: add --labeling True to the script to obtain the pseudo-labeled dataset label2.json.
  • Put the gold-labeled and pseudo-labeled data together. For example:

```shell
python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
cd dataset/semeval
cp rel2id.json test.json 10la-1/
```

  • Train the final student model: add --stutrain True to the script. A hedged sketch of the combine step follows this list.
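As referenced in the last step above, a minimal hedged sketch of combining gold and pseudo-labeled data; self-train_combine.py may filter or weight pseudo labels differently:

```python
# Illustrative gold + pseudo-label combination; assumes JSON-list files.
import json, os

def combine(gold_path, pseudo_path, out_dir):
    with open(gold_path) as f:
        gold = json.load(f)
    with open(pseudo_path) as f:
        pseudo = json.load(f)            # instances labeled by the teacher
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.json"), "w") as f:
        json.dump(gold + pseudo, f, ensure_ascii=False, indent=2)
```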

Standard Fine-tuning Baseline


Fine-tuning

Citation

If you use the code, please cite the following paper:

```bibtex
@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin and Chen, Xiang and Zhang, Ningyu and Xie, Xin and Chen, Xi and Chen, Huajun",
    editor = "Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}
```

Owner

  • Name: ZJUNLP
  • Login: zjunlp
  • Kind: organization
  • Location: China

An NLP & KG Group of Zhejiang University

Citation (CITATION.cff)

cff-version: "1.0.0"
message: "If you use the code, please cite the following paper:"
title: "LREBench"
repository-code: "https://github.com/zjunlp/LREBench"
authors: 
  - family-names: Xu
    given-names: Xin
  - family-names: Chen
    given-names: Xiang
  - family-names: Zhang
    given-names: Ningyu
  - family-names: Xie
    given-names: Xin
  - family-names: Chen
    given-names: Xi
  - family-names: Chen
    given-names: Huajun
preferred-citation:
  type: article
  title: "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study"
  authors:
  - family-names: Xu
    given-names: Xin
  - family-names: Chen
    given-names: Xiang
  - family-names: Zhang
    given-names: Ningyu
  - family-names: Xie
    given-names: Xin
  - family-names: Chen
    given-names: Xi
  - family-names: Chen
    given-names: Huajun
  journal: "Conference on Empirical Methods in Natural Language Processing (EMNLP)"
  year: 2022

GitHub Events

Total
  • Watch event: 1
Last Year
  • Watch event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 66
  • Total Committers: 2
  • Avg Commits per committer: 33.0
  • Development Distribution Score (DDS): 0.136
Past Year
  • Commits: 0
  • Committers: 0
  • Avg Commits per committer: 0.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Xin Xu x****2@1****m 57
Eric z****0@v****m 9
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 8 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

requirements.txt pypi
  • PyYAML ==5.4.1
  • activations ==0.1.0
  • dataclasses *
  • file_utils ==0.0.1
  • flax ==0.3.4
  • numpy *
  • pytest *
  • pytorch_lightning ==1.3.1
  • regex ==2021.4.4
  • scikit-learn *
  • tokenizers ==0.10.3
  • torch ==1.11.0
  • torchmetrics ==0.5
  • torchsampler *
  • tqdm ==4.49.0
  • transformers ==4.7.0
  • utils ==1.0.1
.github/workflows/python-package-conda.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
environment.yml conda
  • _libgcc_mutex 0.1
  • _openmp_mutex 5.1
  • backcall 0.2.0
  • beautifulsoup4 4.11.1
  • blas 1.0
  • blessings 1.7
  • bottleneck 1.3.5
  • brotli 1.0.9
  • brotli-bin 1.0.9
  • brotlipy 0.7.0
  • ca-certificates 2022.10.11
  • certifi 2022.12.7
  • cffi 1.15.1
  • contourpy 1.0.5
  • cryptography 38.0.1
  • cudatoolkit 11.3.1
  • cycler 0.11.0
  • dbus 1.13.18
  • decorator 5.1.1
  • defusedxml 0.7.1
  • entrypoints 0.4
  • expat 2.4.9
  • fontconfig 2.14.1
  • freetype 2.11.0
  • giflib 5.2.1
  • glib 2.69.1
  • gpustat 0.6.0
  • gst-plugins-base 1.14.0
  • gstreamer 1.14.0
  • icu 58.2
  • intel-openmp 2022.0.1
  • ipykernel 6.15.2
  • ipython_genutils 0.2.0
  • jedi 0.18.1
  • jinja2 3.1.2
  • jpeg 9e
  • jupyter_client 7.4.8
  • jupyter_core 4.11.2
  • jupyter_server 1.23.4
  • jupyterlab_pygments 0.1.2
  • jupyterlab_server 2.16.3
  • krb5 1.19.2
  • lcms2 2.12
  • ld_impl_linux-64 2.38
  • lerc 3.0
  • libbrotlicommon 1.0.9
  • libbrotlidec 1.0.9
  • libbrotlienc 1.0.9
  • libclang 10.0.1
  • libdeflate 1.8
  • libedit 3.1.20221030
  • libevent 2.1.12
  • libffi 3.3
  • libgcc-ng 11.2.0
  • libgfortran-ng 11.2.0
  • libgfortran5 11.2.0
  • libgomp 11.2.0
  • libllvm10 10.0.1
  • libopenblas 0.3.21
  • libpng 1.6.37
  • libpq 12.9
  • libsodium 1.0.18
  • libstdcxx-ng 11.2.0
  • libtiff 4.4.0
  • libuv 1.40.0
  • libwebp 1.2.4
  • libwebp-base 1.2.4
  • libxcb 1.15
  • libxkbcommon 1.0.1
  • libxml2 2.9.14
  • libxslt 1.1.35
  • lxml 4.9.1
  • lz4-c 1.9.4
  • markupsafe 2.1.1
  • matplotlib-base 3.6.2
  • matplotlib-inline 0.1.6
  • mkl 2022.0.1
  • munkres 1.1.4
  • ncurses 6.3
  • nest-asyncio 1.5.5
  • nspr 4.33
  • nss 3.74
  • numexpr 2.8.4
  • numpy-base 1.23.4
  • nvidia-ml 7.352.0
  • openssl 1.1.1s
  • pandocfilters 1.5.0
  • parso 0.8.3
  • pcre 8.45
  • pexpect 4.8.0
  • pickleshare 0.7.5
  • pip 21.2.4
  • ply 3.11
  • prometheus_client 0.14.1
  • psutil 5.8.0
  • ptyprocess 0.7.0
  • pure_eval 0.2.2
  • pycparser 2.21
  • pyopenssl 22.0.0
  • pyparsing 3.0.9
  • pyqt 5.15.7
  • pyqt5-sip 12.11.0
  • pysocks 1.7.1
  • python 3.9.12
  • python-dateutil 2.8.2
  • python-fastjsonschema 2.16.2
  • pytorch-mutex 1.0
  • qt-main 5.15.2
  • qt-webengine 5.15.9
  • qtwebkit 5.212
  • readline 8.1.2
  • send2trash 1.8.0
  • setuptools 61.2.0
  • sip 6.6.2
  • six 1.16.0
  • sniffio 1.2.0
  • soupsieve 2.3.2.post1
  • sqlite 3.39.3
  • stack_data 0.2.0
  • tk 8.6.12
  • toml 0.10.2
  • tomli 2.0.1
  • tornado 6.2
  • typing_extensions 4.1.1
  • tzdata 2022a
  • wcwidth 0.2.5
  • wheel 0.37.1
  • xz 5.2.5
  • zeromq 4.3.4
  • zipp 3.8.0
  • zlib 1.2.12
  • zstd 1.5.2