https://github.com/compnet/ddaugner
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.8%, to scientific vocabulary)
Repository
Basic Info
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Domain Data Augmentation for NER
Setup and dependencies
The project uses Poetry to manage dependencies. With Poetry installed, you can install all dependencies with:

```sh
poetry install
```
Further commands in this README assume that you have activated the resulting Python environment, either manually or with `poetry shell`.
Alternatively, you can manage everything in your own environment using the provided `requirements.txt` file (run `pip install -r requirements.txt` to install the dependencies).
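A minimal sketch of that pip-based setup (the dedicated virtual environment is a suggestion here, not a project requirement):

```sh
# Optional: create and activate an isolated virtual environment.
python -m venv .venv
source .venv/bin/activate
# Install the pinned dependencies shipped with the repository.
pip install -r requirements.txt
```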
Literary test corpus
We re-annotated the corpus of Dekker et al., 2019 to fix some errors. Due to copyright issues, tokens from the datasets are not directly available in this repository, but they can be retrieved with a script:

```sh
python setup_dekker_dataset.py --dekker-etal-repo-path /path/to/dekker/repository
```
If you don't specify a path to the Dekker et al. repository, the script will attempt to download it automatically using git.
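For instance, to let the script fetch the repository on its own (this assumes git is available on your PATH):

```sh
# No --dekker-etal-repo-path given: the script downloads the
# Dekker et al. repository itself.
python setup_dekker_dataset.py
```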
Generating documentation
Some API documentation is available using Sphinx. Go to the `docs` directory and run `make html` to generate the documentation under `docs/_build/html`.
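Concretely (assuming Sphinx is installed in the active environment, as in the pinned dependencies):

```sh
cd docs
make html
# The generated pages land in docs/_build/html; open index.html in a browser.
```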
Training a model
Use the `train.py` script to train a model. To see all the possible options, run `python train.py --help`.
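For example:

```sh
# List every available training option.
python train.py --help
```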
Evaluating a model
The `extract_metrics.py` script can be used to evaluate a model. See `python extract_metrics.py --help` for more information.
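For example:

```sh
# List every available evaluation option.
python extract_metrics.py --help
```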
Published Articles
Data Augmentation for Robust Character Detection in Fantasy Novels
Main Results
The following command trains a model without any augmentation:
```sh
python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --model-path model.pth
```
While the following trains a model with our *The Elder Scrolls* augmentation, as in the article:
```sh
for aug_rate in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    python train.py \
        --epochs-nb 2 \
        --dynamic-epochs-nb \
        --batch-size 4 \
        --context-size 0 \
        --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
        --data-aug-frequencies "{\"PER\": [${aug_rate}]}" \
        --model-path "augmented_model_${aug_rate}.pth"
done
```
After training a model, you can see its performance on the dataset with the `extract_metrics.py` script:

```sh
python extract_metrics.py \
    --model-path model.pth \
    --global-metrics \
    --context-size 0 \
    --book-group "fantasy" \
    --fix-sent-tokenization \
    --output-file results.json
```
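The layout of `results.json` depends on the options passed above, so a simple way to inspect it is to pretty-print the file (this uses only the Python standard library, no extra tooling):

```sh
# Pretty-print the JSON metrics written by extract_metrics.py.
python -m json.tool results.json
```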
Alternative augmentation methods
You can reproduce results shown in Figure 3 using the following:
```sh
for aug_rate in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    for aug_method in 'balance_upsample' 'replace'; do
        python train.py \
            --epochs-nb 2 \
            --dynamic-epochs-nb \
            --batch-size 4 \
            --context-size 0 \
            --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
            --data-aug-frequencies "{\"PER\": [${aug_rate}]}" \
            --data-aug-method "${aug_method}" \
            --model-path "augmented_model_${aug_method}_${aug_rate}.pth"
    done
done
```
Context size
Results in Figure 4 can be reproduced using:
```sh
for context_size in 1 2 3 4 5 6 7 8 9; do
    python train.py \
        --epochs-nb 2 \
        --dynamic-epochs-nb \
        --batch-size 4 \
        --context-size ${context_size} \
        --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
        --data-aug-frequencies "{\"PER\": [0.6]}" \
        --model-path "augmented_model_${context_size}.pth"
done
```
Citation
Please cite this work as follows:
```bibtex
@InProceedings{amalvy:hal-03972448,
  title       = {{Data Augmentation for Robust Character Detection in Fantasy Novels}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.science/hal-03972448},
  booktitle   = {{Workshop on Computational Methods in the Humanities 2022}},
  year        = {2022},
  hal_id      = {hal-03972448},
  hal_version = {v1},
}
```
Remplacement de mentions pour l'adaptation d'un corpus de reconnaissance d'entités nommées à un domaine cible (Mention replacement for adapting a named-entity recognition corpus to a target domain)
All augmentation configurations can be tested as in the article:
```sh
for i in 0.05 0.1 0.5 1.0; do
    for aug in conll wgold morrowind dekker; do
        python train.py \
            --epochs-nb 2 \
            --batch-size 4 \
            --context-size 1 \
            --data-aug-strategies "{\"PER\": [\"${aug}\"]}" \
            --data-aug-frequencies "{\"PER\": [${i}]}" \
            --model-path augmented_model.pth
        # Evaluate the model that was just trained.
        python extract_metrics.py \
            --model-path augmented_model.pth \
            --global-metrics \
            --context-size 1 \
            --book-group "fantasy" \
            --output-file "results_${aug}_${i}.json"
    done
done
```
Citation
Please cite this work as follows:

```bibtex
@InProceedings{amalvy:hal-03651510,
  title       = {{Remplacement de mentions pour l'adaptation d'un corpus de reconnaissance d'entit{\'e}s nomm{\'e}es {\`a} un domaine cible}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.archives-ouvertes.fr/hal-03651510},
  booktitle   = {{29{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)}},
  year        = {2022},
  hal_id      = {hal-03651510},
  hal_version = {v3},
}
```
BERT meets d'Artagnan: Data Augmentation for Robust Character Detection in Novels
The following command trains a model without any augmentation:
```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --model-path model.pth
```
While the following trains a model with our `morrowind` augmentation, as in the article:

```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --data-aug-strategies '{"PER": ["morrowind"]}' \
    --data-aug-frequencies '{"PER": [0.1]}' \
    --model-path augmented_model.pth
```
Replace `morrowind` with `word_names` to use our word names augmentation.
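For instance, the same command with the word names augmentation (only the strategy string changes; the 0.1 frequency is kept from the example above):

```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --data-aug-strategies '{"PER": ["word_names"]}' \
    --data-aug-frequencies '{"PER": [0.1]}' \
    --model-path augmented_model.pth
```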
After training a model, you can see its performance on the dataset with the `extract_metrics.py` script:

```sh
poetry run python extract_metrics.py \
    --model-path model.pth \
    --global-metrics \
    --context-size 1 \
    --book-group "fantasy" \
    --output-file results.json
```
Citation
Please cite this work as follows:

```bibtex
@InProceedings{amalvy:hal-03617722,
  title       = {{BERT meets d'Artagnan: Data Augmentation for Robust Character Detection in Novels}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.archives-ouvertes.fr/hal-03617722},
  booktitle   = {{Workshop on Computational Methods in the Humanities (COMHUM)}},
  year        = {2022},
  hal_id      = {hal-03617722},
  hal_version = {v2},
}
```
About the CoNLL-2003 Corpus
This repository contains a modified version of the CoNLL-2003 corpus. Copyrights are defined on the Reuters Corpus page; see the CoNLL-2003 shared task page for more details.
Owner
- Name: Complex Networks
- Login: CompNet
- Kind: organization
- Location: Avignon, France
- Website: http://lia.univ-avignon.fr
- Repositories: 44
- Profile: https://github.com/CompNet
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Dependencies
- attrs 21.4.0
- certifi 2021.10.8
- charset-normalizer 2.0.11
- click 8.0.3
- colorama 0.4.4
- commonmark 0.9.1
- cycler 0.11.0
- filelock 3.4.2
- fonttools 4.29.1
- huggingface-hub 0.4.0
- hypothesis 6.36.1
- idna 3.3
- joblib 1.1.0
- kiwisolver 1.3.2
- matplotlib 3.5.1
- more-itertools 8.12.0
- nameparser 1.1.1
- nltk 3.7
- numpy 1.22.2
- packaging 21.3
- pillow 9.0.1
- pygments 2.11.2
- pyparsing 3.0.7
- python-dateutil 2.8.2
- pyyaml 6.0
- regex 2022.1.18
- requests 2.27.1
- rich 11.1.0
- sacremoses 0.0.47
- scikit-learn 1.0.2
- scipy 1.6.1
- seqeval 1.2.2
- setuptools-scm 6.4.2
- six 1.16.0
- sortedcontainers 2.4.0
- threadpoolctl 3.1.0
- tokenizers 0.11.4
- tomli 2.0.0
- torch 1.10.2
- tqdm 4.62.3
- transformers 4.16.2
- typing-extensions 4.0.1
- urllib3 1.26.8
- hypothesis ^6.36.1
- matplotlib ^3.5.1
- more-itertools ^8.12.0
- nameparser ^1.1.0
- nltk ^3.7
- python ^3.8
- rich ^11.0.0
- seqeval ^1.2.2
- torch ^1.10.1
- tqdm ^4.62.3
- transformers ^4.15.0
- alabaster ==0.7.12
- attrs ==21.4.0
- babel ==2.10.3
- certifi ==2021.10.8
- charset-normalizer ==2.0.11
- click ==8.0.3
- colorama ==0.4.5
- commonmark ==0.9.1
- cycler ==0.11.0
- docutils ==0.19
- filelock ==3.4.2
- fonttools ==4.29.1
- huggingface-hub ==0.4.0
- hypothesis ==6.36.1
- idna ==3.3
- imagesize ==1.4.1
- importlib-metadata ==5.0.0
- jinja2 ==3.1.2
- joblib ==1.1.0
- kiwisolver ==1.3.2
- markupsafe ==2.1.1
- matplotlib ==3.5.1
- more-itertools ==8.12.0
- nameparser ==1.1.1
- nltk ==3.7
- numpy ==1.22.2
- packaging ==21.3
- pillow ==9.0.1
- pygments ==2.13.0
- pyparsing ==3.0.7
- python-dateutil ==2.8.2
- pytz ==2022.4
- pyyaml ==6.0
- regex ==2022.1.18
- requests ==2.27.1
- rich ==11.1.0
- sacremoses ==0.0.47
- scikit-learn ==1.0.2
- scipy ==1.6.1
- seqeval ==1.2.2
- setuptools ==65.5.0
- setuptools-scm ==6.4.2
- six ==1.16.0
- snowballstemmer ==2.2.0
- sortedcontainers ==2.4.0
- sphinx ==5.3.0
- sphinx-autodoc-typehints ==1.19.4
- sphinxcontrib-applehelp ==1.0.2
- sphinxcontrib-devhelp ==1.0.2
- sphinxcontrib-htmlhelp ==2.0.0
- sphinxcontrib-jsmath ==1.0.1
- sphinxcontrib-qthelp ==1.0.3
- sphinxcontrib-serializinghtml ==1.1.5
- threadpoolctl ==3.1.0
- tokenizers ==0.11.4
- tomli ==2.0.0
- torch ==1.10.2
- tqdm ==4.62.3
- transformers ==4.16.2
- typing-extensions ==4.0.1
- urllib3 ==1.26.8
- zipp ==3.9.0