lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 1 DOI reference(s) in README
- ✓ Academic publication links: links to arxiv.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.6%) to scientific vocabulary
Repository
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
Basic Info
- Host: GitHub
- Owner: zjunlp
- License: MIT
- Language: Python
- Default Branch: main
- Homepage: https://zjunlp.github.io/project/LREBench
- Size: 1.18 MB
Statistics
- Stars: 34
- Watchers: 6
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
LREBench: A low-resource relation extraction benchmark.
This repo is the official implementation of the EMNLP 2022 (Findings) paper Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study [poster].
This paper presents an empirical study of building relation extraction systems in low-resource settings. Building on recent PLMs, we comprehensively investigate three schemes for evaluating performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; (iii) data augmentation techniques and self-training to generate more labeled in-domain data.
Contents
- LREBench
- Environment
- Datasets
- Normal Prompt-based Tuning
- 1 Initialize Answer Words
- 2 Split Datasets
- 3 Prompt-based Tuning
- 4 Different prompts
- Balancing
- 1 Re-sampling
- 2 Re-weighting Loss
- Data Augmentation
- 1 Prepare the environment
- 2 Try different DA methods
- Self-training for Semi-supervised learning
- Standard Fine-tuning Baseline
Environment
To install requirements:
```shell
conda create -n LREBench python=3.9
conda activate LREBench
pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113
```
Datasets
We provide 8 benchmark datasets and prompts used in our experiments.
All processed full-shot datasets can be downloaded and should be placed in the dataset folder. Each dataset is expected to contain rel2id.json, train.json, and test.json.
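As a quick sanity check (a minimal sketch; the paths assume SemEval was placed under dataset/semeval, and any of the 8 datasets works the same way):

```shell
# Verify the expected files are in place after downloading.
ls dataset/semeval
# rel2id.json  train.json  test.json
```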
Normal Prompt-based Tuning
1 Initialize Answer Words
Use the command below to get answer words first.
```shell
python get_label_word.py --modelpath roberta-large --dataset semeval
```
The file {modelpath}_{dataset}.pt will be saved in the dataset folder; set modelpath and dataset to the names of the pre-trained language model and the dataset you want to use.
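For the command above, the naming rule gives, for example:

```shell
# {modelpath}_{dataset}.pt produced by the command above
# (shown under dataset/, per the naming rule stated in the README):
ls dataset/
# roberta-large_semeval.pt  semeval/  ...
```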
2 Split Datasets
We provide sampling code for obtaining the 8-shot (sample_8shot.py) and 10% (sample_10.py) datasets; the remaining instances are used as unlabeled data for self-training. When sampling 8-shot datasets, classes with fewer than 8 instances are removed from the training and test sets, yielding new_test.json and new_rel2id.json.
```shell
python sample_8shot.py -h
usage: sample_8shot.py [-h] --input_dir INPUT_DIR --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_dir INPUT_DIR, -i INPUT_DIR
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.

python sample_10.py -h
usage: sample_10.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        The directory of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.
```
For example:
```shell
python sample_8shot.py -i dataset/semeval -o dataset/semeval/8-shot
cd dataset/semeval
mkdir 8-1
cp 8-shot/new_rel2id.json 8-1/rel2id.json
cp 8-shot/new_test.json 8-1/test.json
cp 8-shot/train_8_1.json 8-1/train.json
cp 8-shot/unlabel_8_1.json 8-1/label.json
```
3 Prompt-based Tuning
All running scripts for each dataset are in the scripts folder. For example, train KnowPrompt on SemEval, CMeIE, and ChemProt with the following commands:
```shell
bash scripts/semeval.sh    # RoBERTa-large
bash scripts/CMeIE.sh      # Chinese RoBERTa-large
bash scripts/ChemProt.sh   # BioBERT-large
```
4 Different prompts
Simply add the corresponding parameters to the scripts (a sketch follows this list).
Template Prompt: --use_template_words 0
Schema Prompt: --use_template_words 0 --use_schema_prompt True
PTR: refer to PTR
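For example, a minimal sketch of what this looks like inside a script; the entry point main.py shown here is hypothetical, since each scripts/*.sh wraps its own training command:

```shell
# Hypothetical sketch: the training command inside scripts/semeval.sh with the
# template/schema prompt flags appended; the real entry point and its fixed
# arguments may differ in the actual script.
python main.py --dataset semeval --use_template_words 0 --use_schema_prompt True
```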
Balancing
1 Re-sampling
- Create the re-sampled training file from the 10% training set with resample.py.
```shell
python resample.py -h
usage: resample.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --rel_file REL_FILE

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        The path of the training file.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.
  --rel_file REL_FILE, -r REL_FILE
                        The path of the relation file.
```
For example,
```shell
mkdir dataset/semeval/10sa-1
python resample.py -i dataset/semeval/10/train10per_1.json -r dataset/semeval/rel2id.json -o dataset/semeval/sa
cd dataset/semeval
cp rel2id.json test.json 10sa-1/
cp sa/sa_1.json 10sa-1/train.json
```
2 Re-weighting Loss
Simply add the useloss parameter to the script to choose among the re-weighting losses.
For example: --useloss MultiFocalLoss.
(choices: MultiDSCLoss, MultiFocalLoss, GHMC_Loss, LDAMLoss)
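A minimal sketch, assuming the run scripts forward extra command-line flags to the underlying trainer:

```shell
# Sketch only: re-run the SemEval script with a re-weighting loss selected.
# Whether scripts/semeval.sh forwards extra arguments is an assumption here.
bash scripts/semeval.sh --useloss MultiFocalLoss
```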
Data Augmentation
1 Prepare the environment
```shell
pip install nlpaug nlpcda
```
Please follow the instructions from nlpaug and nlpcda for more information (Thanks a lot!).
2 Try different DA methods
We provide many data augmentation methods:
- English (nlpaug): TF-IDF, contextual word embeddings (BERT and RoBERTa), and WordNet synonyms (-lan en, chosen via -d).
- Chinese (nlpcda): synonym replacement (-lan cn).
- All DA methods can be applied to contexts, entities, or both (--locations).
- Generate augmented data:
```shell
python DA.py -h
usage: DA.py [-h] --input_file INPUT_FILE --output_dir OUTPUT_DIR --language {en,cn}
             [--locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]]
             [--DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}]
             [--model_dir MODEL_DIR] [--model_name MODEL_NAME]
             [--create_num CREATE_NUM] [--change_rate CHANGE_RATE]

optional arguments:
  -h, --help            show this help message and exit
  --input_file INPUT_FILE, -i INPUT_FILE
                        the training set file
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        The directory of the sampled files.
  --language {en,cn}, -lan {en,cn}
                        DA for English or Chinese
  --locations {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...], -l {sent1,sent2,sent3,ent1,ent2} [{sent1,sent2,sent3,ent1,ent2} ...]
                        List of positions that you want to manipulate
  --DAmethod {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}, -d {word2vec,TF-IDF,word_embedding_bert,word_embedding_roberta,random_swap,synonym}
                        Data augmentation method
  --model_dir MODEL_DIR, -m MODEL_DIR
                        the path of pretrained models used in DA methods
  --model_name MODEL_NAME, -mn MODEL_NAME
                        model from huggingface
  --create_num CREATE_NUM, -cn CREATE_NUM
                        The number of samples augmented from one instance.
  --change_rate CHANGE_RATE, -cr CHANGE_RATE
                        the changing rate of text
```
Take context-level DA based on contextual word embeddings on ChemProt as an example:
```shell
python DA.py \
    -i dataset/ChemProt/10/train10per_1.json \
    -o dataset/ChemProt/aug \
    -d word_embedding_bert \
    -mn dmis-lab/biobert-large-cased-v1.1 \
    -l sent1 sent2 sent3
```
- Delete repeated instances and get the final augmented data
```shell
python merge_dataset.py -h
usage: merge_dataset.py [-h] [--input_files INPUT_FILES [INPUT_FILES ...]] [--output_file OUTPUT_FILE]

optional arguments:
  -h, --help            show this help message and exit
  --input_files INPUT_FILES [INPUT_FILES ...], -i INPUT_FILES [INPUT_FILES ...]
                        List of input files containing datasets to merge
  --output_file OUTPUT_FILE, -o OUTPUT_FILE
                        Output file containing merged dataset
```
For example:
```shell
python merge_dataset.py \
    -i dataset/ChemProt/train10per_1.json dataset/ChemProt/aug/aug.json \
    -o dataset/ChemProt/aug/merge.json
```
Self-training for Semi-supervised learning
- Train a teacher model on a small amount of labeled data (8-shot or 10%).
- Place the unlabeled data label.json in the corresponding dataset folder.
- Assign pseudo labels with the trained teacher model: add --labeling True to the script and obtain the pseudo-labeled dataset label2.json.
- Put the gold-labeled data and pseudo-labeled data together. For example:
```shell
python self-train_combine.py -g dataset/semeval/10-1/train.json -p dataset/semeval/10-1/label2.json -la dataset/semeval/10la-1
cd dataset/semeval
cp rel2id.json test.json 10la-1/
```
- Train the final student model: add --stu_train True to the script (see the consolidated sketch below).
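A hedged end-to-end sketch on SemEval 10% (split 1); that scripts/semeval.sh forwards the extra flags shown above is an assumption, not something the README confirms:

```shell
# End-to-end self-training sketch (assumptions: scripts/semeval.sh accepts the
# extra flags shown above and reads/writes the 10-1 and 10la-1 folders).
bash scripts/semeval.sh                      # 1. train the teacher on the gold 10% split
bash scripts/semeval.sh --labeling True      # 2. pseudo-label label.json -> label2.json
python self-train_combine.py \
    -g dataset/semeval/10-1/train.json \
    -p dataset/semeval/10-1/label2.json \
    -la dataset/semeval/10la-1               # 3. combine gold and pseudo-labeled data
cp dataset/semeval/rel2id.json dataset/semeval/test.json dataset/semeval/10la-1/
bash scripts/semeval.sh --stu_train True     # 4. train the final student model
```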
Standard Fine-tuning Baseline
Citation
If you use the code, please cite the following paper:
```bibtex
@inproceedings{xu-etal-2022-towards-realistic,
    title = "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study",
    author = "Xu, Xin and Chen, Xiang and Zhang, Ningyu and Xie, Xin and Chen, Xi and Chen, Huajun",
    editor = "Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.29",
    doi = "10.18653/v1/2022.findings-emnlp.29",
    pages = "413--427"
}
```
Owner
- Name: ZJUNLP
- Login: zjunlp
- Kind: organization
- Location: China
- Website: http://zjukg.org
- Repositories: 19
- Profile: https://github.com/zjunlp
An NLP & KG group of Zhejiang University
Citation (CITATION.cff)
cff-version: "1.0.0"
message: "If you use the code, please cite the following paper:"
title: "LREBench"
repository-code: "https://github.com/zjunlp/LREBench"
authors:
- family-names: Xu
given-names: Xin
- family-names: Chen
given-names: Xiang
- family-names: Zhang
given-names: Ningyu
- family-names: Xie
given-names: Xin
- family-names: Chen
given-names: Xi
- family-names: Chen
given-names: Huajun
preferred-citation:
type: article
title: "Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study"
authors:
- family-names: Xu
given-names: Xin
- family-names: Chen
given-names: Xiang
- family-names: Zhang
given-names: Ningyu
- family-names: Xie
given-names: Xin
- family-names: Chen
given-names: Xi
- family-names: Chen
given-names: Huajun
journal: "Conference on Empirical Methods in Natural Language Processing (EMNLP)"
year: 2022
GitHub Events
Total
- Watch event: 1
Last Year
- Watch event: 1
Issues and Pull Requests
Last synced: 8 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- PyYAML ==5.4.1
- activations ==0.1.0
- dataclasses *
- file_utils ==0.0.1
- flax ==0.3.4
- numpy *
- pytest *
- pytorch_lightning ==1.3.1
- regex ==2021.4.4
- scikit-learn *
- tokenizers ==0.10.3
- torch ==1.11.0
- torchmetrics ==0.5
- torchsampler *
- tqdm ==4.49.0
- transformers ==4.7.0
- utils ==1.0.1
- actions/checkout v3 composite
- actions/setup-python v3 composite
- _libgcc_mutex 0.1
- _openmp_mutex 5.1
- backcall 0.2.0
- beautifulsoup4 4.11.1
- blas 1.0
- blessings 1.7
- bottleneck 1.3.5
- brotli 1.0.9
- brotli-bin 1.0.9
- brotlipy 0.7.0
- ca-certificates 2022.10.11
- certifi 2022.12.7
- cffi 1.15.1
- contourpy 1.0.5
- cryptography 38.0.1
- cudatoolkit 11.3.1
- cycler 0.11.0
- dbus 1.13.18
- decorator 5.1.1
- defusedxml 0.7.1
- entrypoints 0.4
- expat 2.4.9
- fontconfig 2.14.1
- freetype 2.11.0
- giflib 5.2.1
- glib 2.69.1
- gpustat 0.6.0
- gst-plugins-base 1.14.0
- gstreamer 1.14.0
- icu 58.2
- intel-openmp 2022.0.1
- ipykernel 6.15.2
- ipython_genutils 0.2.0
- jedi 0.18.1
- jinja2 3.1.2
- jpeg 9e
- jupyter_client 7.4.8
- jupyter_core 4.11.2
- jupyter_server 1.23.4
- jupyterlab_pygments 0.1.2
- jupyterlab_server 2.16.3
- krb5 1.19.2
- lcms2 2.12
- ld_impl_linux-64 2.38
- lerc 3.0
- libbrotlicommon 1.0.9
- libbrotlidec 1.0.9
- libbrotlienc 1.0.9
- libclang 10.0.1
- libdeflate 1.8
- libedit 3.1.20221030
- libevent 2.1.12
- libffi 3.3
- libgcc-ng 11.2.0
- libgfortran-ng 11.2.0
- libgfortran5 11.2.0
- libgomp 11.2.0
- libllvm10 10.0.1
- libopenblas 0.3.21
- libpng 1.6.37
- libpq 12.9
- libsodium 1.0.18
- libstdcxx-ng 11.2.0
- libtiff 4.4.0
- libuv 1.40.0
- libwebp 1.2.4
- libwebp-base 1.2.4
- libxcb 1.15
- libxkbcommon 1.0.1
- libxml2 2.9.14
- libxslt 1.1.35
- lxml 4.9.1
- lz4-c 1.9.4
- markupsafe 2.1.1
- matplotlib-base 3.6.2
- matplotlib-inline 0.1.6
- mkl 2022.0.1
- munkres 1.1.4
- ncurses 6.3
- nest-asyncio 1.5.5
- nspr 4.33
- nss 3.74
- numexpr 2.8.4
- numpy-base 1.23.4
- nvidia-ml 7.352.0
- openssl 1.1.1s
- pandocfilters 1.5.0
- parso 0.8.3
- pcre 8.45
- pexpect 4.8.0
- pickleshare 0.7.5
- pip 21.2.4
- ply 3.11
- prometheus_client 0.14.1
- psutil 5.8.0
- ptyprocess 0.7.0
- pure_eval 0.2.2
- pycparser 2.21
- pyopenssl 22.0.0
- pyparsing 3.0.9
- pyqt 5.15.7
- pyqt5-sip 12.11.0
- pysocks 1.7.1
- python 3.9.12
- python-dateutil 2.8.2
- python-fastjsonschema 2.16.2
- pytorch-mutex 1.0
- qt-main 5.15.2
- qt-webengine 5.15.9
- qtwebkit 5.212
- readline 8.1.2
- send2trash 1.8.0
- setuptools 61.2.0
- sip 6.6.2
- six 1.16.0
- sniffio 1.2.0
- soupsieve 2.3.2.post1
- sqlite 3.39.3
- stack_data 0.2.0
- tk 8.6.12
- toml 0.10.2
- tomli 2.0.1
- tornado 6.2
- typing_extensions 4.1.1
- tzdata 2022a
- wcwidth 0.2.5
- wheel 0.37.1
- xz 5.2.5
- zeromq 4.3.4
- zipp 3.8.0
- zlib 1.2.12
- zstd 1.5.2