https://github.com/apmoore1/ner

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary

Last synced: 9 months ago · JSON representation

Repository

Basic Info

Host: GitHub
Owner: apmoore1
License: apache-2.0
Language: Python
Default Branch: master
Size: 353 KB

Statistics

Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 7 years ago · Last pushed almost 7 years ago


Random Seeds Problem within NER

Requirements

NOTE: This has only been tested on Ubuntu 16.04

Python 3.6.1 or above
Install the pip requirements with pip install -r requirements.txt

Download the CoNLL 2003 NER dataset and store it within ./original_dataset, with the train, dev, and test splits at the following respective paths: ./original_dataset/train.txt, ./original_dataset/dev.txt, and ./original_dataset/test.txt. NOTE: Ensure that all of the splits have been pre-processed so that they are in BIO format and not IOB format.
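The repository does not describe the pre-processing step, so the following is only a minimal sketch of one way to do it, assuming that "IOB" here refers to the IOB1 scheme used in the original CoNLL 2003 distribution, where B- only marks an entity that directly follows another entity of the same type. The function name iob1_to_bio is hypothetical and is not part of this repository.

```python
from typing import List


def iob1_to_bio(tags: List[str]) -> List[str]:
    """Convert the IOB1 tags of one sentence to BIO (IOB2) tags.

    Hypothetical helper, not part of the repository. In BIO every
    entity starts with a "B-" tag, whereas IOB1 starts most entities
    with "I-".
    """
    bio_tags = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An "I-" tag opens a new entity when the previous token was
            # outside any entity or belonged to a different entity type.
            starts_entity = previous == "O" or previous[2:] != entity_type
            bio_tags.append(("B-" if starts_entity else "I-") + entity_type)
        else:
            # "O" tags and existing "B-" tags are already valid BIO tags.
            bio_tags.append(tag)
        previous = tag
    return bio_tags


example = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG"]
print(iob1_to_bio(example))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'B-ORG']
```

The function works on the tag column of a single sentence, so reading and writing the CoNLL files themselves (and resetting at sentence boundaries) is left to the caller.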

Results of the NER models on different train, validation, and test splits

First we must create different train, validation, and test splits, with a separate directory for each new random split. In the paper we use 150 different random splits, which are created using the following command:

python creating_data_sets.py ./original_dataset ./copy 150

Here ./copy is the new directory that will store 150 folders named 0 to 149, each of which contains three files: train.txt, dev.txt, and test.txt.
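Before starting the long-running experiments it can be worth confirming that the directory layout described above was actually produced. The following is a small sketch of such a check; the function missing_split_files is hypothetical and not part of the repository, and it assumes the folders are numbered consecutively from 0, as creating_data_sets.py is described to do.

```python
from pathlib import Path
from typing import List

# File names taken from the layout described in this README.
SPLIT_FILES = ("train.txt", "dev.txt", "test.txt")


def missing_split_files(copy_dir: Path, number_of_splits: int) -> List[Path]:
    """Return every expected split file that does not exist.

    Hypothetical helper: expects folders named 0 to number_of_splits - 1
    inside copy_dir, each containing the three files in SPLIT_FILES.
    """
    missing = []
    for split_index in range(number_of_splits):
        for file_name in SPLIT_FILES:
            expected = copy_dir / str(split_index) / file_name
            if not expected.is_file():
                missing.append(expected)
    return missing


if __name__ == "__main__":
    problems = missing_split_files(Path("./copy"), 150)
    print(f"{len(problems)} expected files are missing")
```

An empty list means all 450 files are in place.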

After creating the 150 new random datasets, we run the 2 models 5 times on each of them, using a different seed each time. The results are stored in the respective dataset folder in the file results.json. To do this, run the following command:

./datasets_and_random_seeds.sh 0 150 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE

PATH_TO_GLOVE_FILE -- This is the path to the 100 dimension GloVe file, which can be downloaded from here.
PATH_TO_TEMP_DIR -- Can be any folder, but most likely a folder within the /tmp directory.
PATH_TO_PYTHON_RUNNABLE -- An example would be /home/andrew/Envs/NER/bin/python

The results from our experiments can be found in ./results/ner_dataset_and_random_seeds.json. This single results file can be generated from all of the different dataset results files by using the following script:

python join_dataset_results.py ./copy ./results/ner_dataset_and_random_seeds.json
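For readers who want to inspect partial results before every dataset has finished, the sketch below gathers whatever results.json files exist so far. It is not the logic of join_dataset_results.py: the function collect_results is hypothetical, and keying the combined dictionary by folder name is an assumption, since the format of the joined file is not described here.

```python
import json
from pathlib import Path
from typing import Any, Dict


def collect_results(copy_dir: Path) -> Dict[str, Any]:
    """Load every <folder>/results.json under copy_dir.

    Hypothetical helper. The returned dictionary maps each dataset
    folder name to the parsed content of its results.json; folders
    without a results.json are skipped.
    """
    combined = {}
    for results_file in sorted(copy_dir.glob("*/results.json")):
        with results_file.open() as handle:
            combined[results_file.parent.name] = json.load(handle)
    return combined


# Example usage (assumes ./copy exists):
# finished = collect_results(Path("./copy"))
# print(f"{len(finished)} datasets have results so far")
```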

Our different split data

If you would like our original 150 different train, validation, and test splits, you can download the zip file here. Note that even though the folder names go beyond 150 (they go up to 199), there are only 150 folders.

Troubleshooting when running the code

It is better to run the following script multiple times, processing a different range of folders each time, as it can run out of hard disk space because it does not remove temporary directories on the fly, e.g.

./datasets_and_random_seeds.sh 0 10 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE
./datasets_and_random_seeds.sh 11 20 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE
etc.
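Writing out every batch by hand is error-prone, so the sketch below prints one command per batch. It assumes the second argument is an exclusive end index, which the 0 150 invocation for folders 0 to 149 suggests but which this README does not confirm; if the script treats the end index as inclusive, adjacent batches would overlap by one folder, so check the script before relying on this. The function batch_ranges is hypothetical.

```python
from typing import List, Tuple


def batch_ranges(start: int, end: int, batch_size: int) -> List[Tuple[int, int]]:
    """Split [start, end) into consecutive (batch_start, batch_end) pairs.

    Hypothetical helper. ASSUMPTION: batch_end is exclusive, so the next
    batch starts at the previous batch_end.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        (low, min(low + batch_size, end))
        for low in range(start, end, batch_size)
    ]


# Print the 15 commands needed to cover folders 0 to 149 in batches of 10.
for low, high in batch_ranges(0, 150, 10):
    print(
        f"./datasets_and_random_seeds.sh {low} {high} ./copy "
        "PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE"
    )
```

Clearing PATH_TO_TEMP_DIR between batches is what actually frees the disk space, so the printed commands are best run one at a time rather than chained.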

To create the visualisation from the paper

The visualisation is saved in the ./image directory. The violin plot showing the distribution of results when changing the random seed and the data split for the CNN and LSTM models can be produced using the following command:

python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot.png test 0 --remove_bottom_5

To get the exact plot in the paper, which shows 3 additional violin plots for a fixed data split but different random seeds, run the following command:

python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot_with_fixed.png test 3 --remove_bottom_5

The plots above exclude the bottom five results for each encoder, because some of the runs did not converge. To create the plot that includes all of the results, run the following command:

python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot_bottom_5.png test 0

Owner

Name: Andrew Moore
Login: apmoore1
Kind: user
Location: Lancaster
Company: Lancaster University

Website: https://apmoore1.github.io/
Repositories: 55
Profile: https://github.com/apmoore1

PhD student and researcher. Main interests: Target/Aspect based sentiment analysis, Semi-Supervised Learning.


