https://github.com/compnet/ddaugner
Science Score: 13.0%
This score indicates how likely this project is to be science-related, based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity (low similarity, 12.8%, to scientific vocabulary)
Repository
Basic Info
Statistics
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 1
Metadata Files
README.md
Domain Data Augmentation for NER
Setup and dependencies
The project uses Poetry to manage dependencies. With Poetry installed, you can install all dependencies with:

```sh
poetry install
```
Further commands in this README assume that you have activated the resulting Python environment, either manually or with `poetry shell`.
Alternatively, you can manage everything in your own environment using the provided `requirements.txt` file (run `pip install -r requirements.txt` to install the dependencies).
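A minimal sketch of that pip-based setup (the dedicated virtual environment is a suggestion here, not a project requirement):

```sh
# Optional: create and activate an isolated virtual environment.
python -m venv .venv
source .venv/bin/activate
# Install the pinned dependencies shipped with the repository.
pip install -r requirements.txt
```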
Literary test corpus
We re-annotated the corpus of Dekker et al., 2019 to fix some errors. Due to copyright issues, tokens from the datasets are not directly available in this repository, but they can be retrieved with a script:

```sh
python setup_dekker_dataset.py --dekker-etal-repo-path /path/to/dekker/repository
```
If you don't specify a path to the Dekker et al. repository, the script will attempt to download it automatically using git.
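For instance, to let the script fetch the repository on its own (this assumes git is available on your PATH):

```sh
# No --dekker-etal-repo-path given: the script downloads the
# Dekker et al. repository itself.
python setup_dekker_dataset.py
```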
Generating documentation
Some API documentation is available using Sphinx. Go to the `docs` directory and run `make html` to generate the documentation under `docs/_build/html`.
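Concretely (assuming Sphinx is installed in the active environment, as in the pinned dependencies):

```sh
cd docs
make html
# The generated pages land in docs/_build/html; open index.html in a browser.
```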
Training a model
Use the `train.py` script to train a model. To see all the possible options, run `python train.py --help`.
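For example:

```sh
# List every available training option.
python train.py --help
```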
Evaluating a model
The `extract_metrics.py` script can be used to evaluate a model. See `python extract_metrics.py --help` for more information.
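For example:

```sh
# List every available evaluation option.
python extract_metrics.py --help
```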
Published Articles
Data Augmentation for Robust Character Detection in Fantasy Novels
Main Results
The following command trains a model without any augmentation:
```sh
python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --model-path model.pth
```
While the following trains a model with our *The Elder Scrolls* augmentation, as in the article:
```sh
for aug_rate in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    python train.py \
        --epochs-nb 2 \
        --dynamic-epochs-nb \
        --batch-size 4 \
        --context-size 0 \
        --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
        --data-aug-frequencies "{\"PER\": [${aug_rate}]}" \
        --model-path "augmented_model_${aug_rate}.pth"
done
```
After training a model, you can see its performance on the dataset with the `extract_metrics.py` script:

```sh
python extract_metrics.py \
    --model-path model.pth \
    --global-metrics \
    --context-size 0 \
    --book-group "fantasy" \
    --fix-sent-tokenization \
    --output-file results.json
```
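The layout of `results.json` depends on the options passed above, so a simple way to inspect it is to pretty-print the file (this uses only the Python standard library, no extra tooling):

```sh
# Pretty-print the JSON metrics written by extract_metrics.py.
python -m json.tool results.json
```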
Alternative augmentation methods
You can reproduce results shown in Figure 3 using the following:
```sh
for aug_rate in 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0; do
    for aug_method in 'balance_upsample' 'replace'; do
        python train.py \
            --epochs-nb 2 \
            --dynamic-epochs-nb \
            --batch-size 4 \
            --context-size 0 \
            --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
            --data-aug-frequencies "{\"PER\": [${aug_rate}]}" \
            --data-aug-method "${aug_method}" \
            --model-path "augmented_model_${aug_method}_${aug_rate}.pth"
    done
done
```
Context size
Results in Figure 4 can be reproduced using:
```sh
for context_size in 1 2 3 4 5 6 7 8 9; do
    python train.py \
        --epochs-nb 2 \
        --dynamic-epochs-nb \
        --batch-size 4 \
        --context-size ${context_size} \
        --data-aug-strategies '{"PER": ["the_elder_scrolls"]}' \
        --data-aug-frequencies "{\"PER\": [0.6]}" \
        --model-path "augmented_model_${context_size}.pth"
done
```
Citation
Please cite this work as follows:
```bibtex
@InProceedings{amalvy:hal-03972448,
  title       = {{Data Augmentation for Robust Character Detection in Fantasy Novels}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.science/hal-03972448},
  booktitle   = {{Workshop on Computational Methods in the Humanities 2022}},
  year        = {2022},
  hal_id      = {hal-03972448},
  hal_version = {v1},
}
```
Remplacement de mentions pour l'adaptation d'un corpus de reconnaissance d'entités nommées à un domaine cible (Mention replacement for adapting a named-entity recognition corpus to a target domain)
All augmentation configurations can be tested as in the article:
```sh
for i in 0.05 0.1 0.5 1.0; do
    for aug in conll wgold morrowind dekker; do
        python train.py \
            --epochs-nb 2 \
            --batch-size 4 \
            --context-size 1 \
            --data-aug-strategies "{\"PER\": [\"${aug}\"]}" \
            --data-aug-frequencies "{\"PER\": [${i}]}" \
            --model-path augmented_model.pth
        # Evaluate the model that was just trained.
        python extract_metrics.py \
            --model-path augmented_model.pth \
            --global-metrics \
            --context-size 1 \
            --book-group "fantasy" \
            --output-file "results_${aug}_${i}.json"
    done
done
```
Citation
Please cite this work as follows:

```bibtex
@InProceedings{amalvy:hal-03651510,
  title       = {{Remplacement de mentions pour l'adaptation d'un corpus de reconnaissance d'entit{\'e}s nomm{\'e}es {\`a} un domaine cible}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.archives-ouvertes.fr/hal-03651510},
  booktitle   = {{29{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)}},
  year        = {2022},
  hal_id      = {hal-03651510},
  hal_version = {v3},
}
```
BERT meets d'Artagnan: Data Augmentation for Robust Character Detection in Novels
The following command trains a model without any augmentation:
```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --model-path model.pth
```
While the following trains a model with our `morrowind` augmentation, as in the article:

```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --data-aug-strategies '{"PER": ["morrowind"]}' \
    --data-aug-frequencies '{"PER": [0.1]}' \
    --model-path augmented_model.pth
```
Replace `morrowind` with `word_names` to use our word names augmentation.
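For instance, the same command with the word names augmentation (only the strategy string changes; the 0.1 frequency is kept from the example above):

```sh
poetry run python train.py \
    --epochs-nb 2 \
    --batch-size 4 \
    --context-size 1 \
    --data-aug-strategies '{"PER": ["word_names"]}' \
    --data-aug-frequencies '{"PER": [0.1]}' \
    --model-path augmented_model.pth
```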
After training a model, you can see its performance on the dataset with the `extract_metrics.py` script:

```sh
poetry run python extract_metrics.py \
    --model-path model.pth \
    --global-metrics \
    --context-size 1 \
    --book-group "fantasy" \
    --output-file results.json
```
Citation
Please cite this work as follows:

```bibtex
@InProceedings{amalvy:hal-03617722,
  title       = {{BERT meets d'Artagnan: Data Augmentation for Robust Character Detection in Novels}},
  author      = {Amalvy, Arthur and Labatut, Vincent and Dufour, Richard},
  url         = {https://hal.archives-ouvertes.fr/hal-03617722},
  booktitle   = {{Workshop on Computational Methods in the Humanities (COMHUM)}},
  year        = {2022},
  hal_id      = {hal-03617722},
  hal_version = {v2},
}
```
About the CoNLL-2003 Corpus
This repository contains a modified version of the CoNLL-2003 corpus. Copyrights are defined on the Reuters Corpus page; see the CoNLL-2003 shared task page for more details.
Owner
- Name: Complex Networks
- Login: CompNet
- Kind: organization
- Location: Avignon, France
- Website: http://lia.univ-avignon.fr
- Repositories: 44
- Profile: https://github.com/CompNet
GitHub Events
Total
- Fork event: 1
Last Year
- Fork event: 1
Dependencies
- attrs 21.4.0
- certifi 2021.10.8
- charset-normalizer 2.0.11
- click 8.0.3
- colorama 0.4.4
- commonmark 0.9.1
- cycler 0.11.0
- filelock 3.4.2
- fonttools 4.29.1
- huggingface-hub 0.4.0
- hypothesis 6.36.1
- idna 3.3
- joblib 1.1.0
- kiwisolver 1.3.2
- matplotlib 3.5.1
- more-itertools 8.12.0
- nameparser 1.1.1
- nltk 3.7
- numpy 1.22.2
- packaging 21.3
- pillow 9.0.1
- pygments 2.11.2
- pyparsing 3.0.7
- python-dateutil 2.8.2
- pyyaml 6.0
- regex 2022.1.18
- requests 2.27.1
- rich 11.1.0
- sacremoses 0.0.47
- scikit-learn 1.0.2
- scipy 1.6.1
- seqeval 1.2.2
- setuptools-scm 6.4.2
- six 1.16.0
- sortedcontainers 2.4.0
- threadpoolctl 3.1.0
- tokenizers 0.11.4
- tomli 2.0.0
- torch 1.10.2
- tqdm 4.62.3
- transformers 4.16.2
- typing-extensions 4.0.1
- urllib3 1.26.8
- hypothesis ^6.36.1
- matplotlib ^3.5.1
- more-itertools ^8.12.0
- nameparser ^1.1.0
- nltk ^3.7
- python ^3.8
- rich ^11.0.0
- seqeval ^1.2.2
- torch ^1.10.1
- tqdm ^4.62.3
- transformers ^4.15.0
- alabaster ==0.7.12
- attrs ==21.4.0
- babel ==2.10.3
- certifi ==2021.10.8
- charset-normalizer ==2.0.11
- click ==8.0.3
- colorama ==0.4.5
- commonmark ==0.9.1
- cycler ==0.11.0
- docutils ==0.19
- filelock ==3.4.2
- fonttools ==4.29.1
- huggingface-hub ==0.4.0
- hypothesis ==6.36.1
- idna ==3.3
- imagesize ==1.4.1
- importlib-metadata ==5.0.0
- jinja2 ==3.1.2
- joblib ==1.1.0
- kiwisolver ==1.3.2
- markupsafe ==2.1.1
- matplotlib ==3.5.1
- more-itertools ==8.12.0
- nameparser ==1.1.1
- nltk ==3.7
- numpy ==1.22.2
- packaging ==21.3
- pillow ==9.0.1
- pygments ==2.13.0
- pyparsing ==3.0.7
- python-dateutil ==2.8.2
- pytz ==2022.4
- pyyaml ==6.0
- regex ==2022.1.18
- requests ==2.27.1
- rich ==11.1.0
- sacremoses ==0.0.47
- scikit-learn ==1.0.2
- scipy ==1.6.1
- seqeval ==1.2.2
- setuptools ==65.5.0
- setuptools-scm ==6.4.2
- six ==1.16.0
- snowballstemmer ==2.2.0
- sortedcontainers ==2.4.0
- sphinx ==5.3.0
- sphinx-autodoc-typehints ==1.19.4
- sphinxcontrib-applehelp ==1.0.2
- sphinxcontrib-devhelp ==1.0.2
- sphinxcontrib-htmlhelp ==2.0.0
- sphinxcontrib-jsmath ==1.0.1
- sphinxcontrib-qthelp ==1.0.3
- sphinxcontrib-serializinghtml ==1.1.5
- threadpoolctl ==3.1.0
- tokenizers ==0.11.4
- tomli ==2.0.0
- torch ==1.10.2
- tqdm ==4.62.3
- transformers ==4.16.2
- typing-extensions ==4.0.1
- urllib3 ==1.26.8
- zipp ==3.9.0