https://github.com/apmoore1/ner
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (10.9%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: apmoore1
- License: apache-2.0
- Language: Python
- Default Branch: master
- Size: 353 KB
Statistics
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Random Seeds Problem within NER
Requirements
NOTE: This has only been tested on Ubuntu 16.04
- Python 3.6.1 or above
- Install the pip requirements:
pip install -r requirements.txt
Download the CoNLL 2003 NER dataset and store it within ./original_dataset, so that the train, dev, and test splits are at the following respective paths: ./original_dataset/train.txt, ./original_dataset/dev.txt, and ./original_dataset/test.txt. NOTE: Ensure that all of the splits have been pre-processed so that they are in BIO format and not IOB format.
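The BIO pre-processing step can be sketched as follows. This is a minimal sketch, assuming that "IOB" here means the IOB1 scheme (an entity starts with I- unless it directly follows another entity of the same type) and that "BIO" means the scheme where every entity starts with B-. The function name `iob1_to_bio` is hypothetical and is not part of this repository.

```python
from typing import List


def iob1_to_bio(tags: List[str]) -> List[str]:
    """Convert the tags of one sentence from IOB1 to BIO (hypothetical helper)."""
    bio_tags = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag opens a new entity when it follows O or a different type.
            if previous == "O" or previous[2:] != entity_type:
                tag = "B-" + entity_type
        bio_tags.append(tag)
        previous = tag
    return bio_tags


print(iob1_to_bio(["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-ORG"]))
# ['B-PER', 'I-PER', 'O', 'B-LOC', 'B-LOC', 'B-ORG']
```

The function works on one sentence at a time, so sentence boundaries in the data files need to be respected when applying it.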
Results of the NER models on different train, validation, and test splits
First we must create different train, validation, and test splits. To do this we create a separate directory for each new random train, validation, and test split. In the paper we have 150 different random splits, which are created using the following command:
python creating_data_sets.py ./original_dataset ./copy 150
Here ./copy is the new directory that will store 150 folders named 0 to 149, each containing three files: train.txt, dev.txt, and test.txt.
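One way such a random split could be produced is sketched below. This is an assumption for illustration, not a description of what creating_data_sets.py actually does: it pools all sentences, shuffles them, and cuts new splits of the original sizes. The function name `random_resplit` and the toy data are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_resplit(train: Sequence[T], dev: Sequence[T], test: Sequence[T],
                   seed: int) -> Tuple[List[T], List[T], List[T]]:
    """Pool the three splits, shuffle, and re-cut them with the original sizes."""
    pooled = list(train) + list(dev) + list(test)
    rng = random.Random(seed)  # local generator so the split is reproducible
    rng.shuffle(pooled)
    n_train, n_dev = len(train), len(dev)
    return (pooled[:n_train],
            pooled[n_train:n_train + n_dev],
            pooled[n_train + n_dev:])


# Toy "sentences" standing in for the real CoNLL sentences.
new_train, new_dev, new_test = random_resplit(
    ["s0", "s1", "s2", "s3", "s4", "s5"], ["s6", "s7"], ["s8", "s9"], seed=0)
print(len(new_train), len(new_dev), len(new_test))
```

Calling the function with a different seed per folder would give a different split for each of the numbered directories.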
After creating the 150 new random datasets, we run the 2 models 5 times on each of them, using a different random seed each time. The results are stored in the respective dataset folder in the file results.json. To do this run the following command:
./datasets_and_random_seeds.sh 0 150 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE
PATH_TO_GLOVE_FILE -- This is the path to the 100-dimensional GloVe file, which can be downloaded from here
PATH_TO_TEMP_DIR -- Can be any folder, but most likely a folder in the /tmp directory
PATH_TO_PYTHON_RUNNABLE -- An example would be /home/andrew/Envs/NER/bin/python
The results from our experiments can be found in ./results/ner_dataset_and_random_seeds.json. This single results file can be generated from all of the individual dataset results files using the following script:
python join_dataset_results.py ./copy ./results/ner_dataset_and_random_seeds.json
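The joining step can be pictured roughly as below. This is a sketch under assumptions: the structure of results.json is not described in this README, so each file is treated as opaque JSON and keyed by its folder name. The function name `join_results` and the demo contents are hypothetical and may not match join_dataset_results.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def join_results(datasets_dir: Path) -> Dict[str, Any]:
    """Collect every <folder>/results.json under datasets_dir into one dict."""
    joined: Dict[str, Any] = {}
    for folder in sorted(datasets_dir.iterdir()):
        results_file = folder / "results.json"
        # Skip folders that have not been processed yet.
        if folder.is_dir() and results_file.is_file():
            with results_file.open() as handle:
                joined[folder.name] = json.load(handle)
    return joined


# Demonstration on a temporary directory with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("0", "1"):
        (root / name).mkdir()
        (root / name / "results.json").write_text(json.dumps({"example": int(name)}))
    (root / "2").mkdir()  # a folder without results is ignored
    joined = join_results(root)

print(sorted(joined))
```

The combined dictionary could then be written to a single file with `json.dump`.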
Our different split data
If you would like to use our original 150 different train, validation, and test splits, you can download the zip file here. Note that even though the folder names go beyond 150, e.g. up to 199, there are only 150 folders.
Troubleshooting when running the code
It is better to run the following script several times, processing a different range of folders each time, as it can run out of hard disk space because it does not remove temporary directories on the fly, e.g.
./datasets_and_random_seeds.sh 0 10 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE
./datasets_and_random_seeds.sh 11 20 ./copy PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE
etc
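The batching idea can be sketched as a small helper that produces folder ranges to pass to the script one at a time. Whether the script treats the end index as inclusive or exclusive is not stated in this README; the sketch assumes exclusive, since `0 150` is described as covering folders 0 to 149, so the ranges should be checked against the script before use. The function name `folder_batches` is hypothetical.

```python
from typing import Iterator, Tuple


def folder_batches(start: int, end: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (batch_start, batch_end) pairs covering [start, end), end exclusive."""
    for batch_start in range(start, end, batch_size):
        yield batch_start, min(batch_start + batch_size, end)


# Print one command per batch; the PATH_... placeholders are left as in the README.
for batch_start, batch_end in folder_batches(0, 150, 10):
    print(f"./datasets_and_random_seeds.sh {batch_start} {batch_end} ./copy "
          "PATH_TO_GLOVE_FILE PATH_TO_TEMP_DIR PATH_TO_PYTHON_RUNNABLE")
```

Clearing PATH_TO_TEMP_DIR between batches keeps disk usage bounded.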
To create the visualisation from the paper
The visualisation is saved in the ./image directory. The violin plot showing the distribution of results when changing the random seed and the data split for the CNN and LSTM models can be produced using the following command:
python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot.png test 0 --remove_bottom_5
To get the exact plot in the paper, which shows 3 additional violin plots for a fixed data split but different random seeds, run the following command:
python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot_with_fixed.png test 3 --remove_bottom_5
The following violin plot shows all the results, including the bottom five results for each encoder. The bottom 5 were removed from the plots above because some of the methods did not converge. To create the plot without this removal run the following command:
python violin_plot_of_ner_scores.py ./results/ner_dataset_and_random_seeds.json ./image/ner_violin_plot_bottom_5.png test 0
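The effect of removing the bottom results can be illustrated with a short sketch. It assumes that --remove_bottom_5 drops the five lowest scores for each encoder before plotting; the actual logic in violin_plot_of_ner_scores.py may differ. The function name `remove_bottom_k` and the scores are made up for illustration.

```python
from typing import List


def remove_bottom_k(scores: List[float], k: int = 5) -> List[float]:
    """Return the scores sorted in ascending order with the k lowest dropped."""
    return sorted(scores)[k:]


# Made-up scores: two runs that failed to converge sit far below the rest.
example_scores = [0.90, 0.0, 0.88, 0.10, 0.91, 0.89, 0.92]
print(remove_bottom_k(example_scores, k=2))
# [0.88, 0.89, 0.9, 0.91, 0.92]
```

Without the removal, the few non-converged runs stretch the lower tail of the violin and compress the region where most scores lie.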
Owner
- Name: Andrew Moore
- Login: apmoore1
- Kind: user
- Location: Lancaster
- Company: Lancaster University
- Website: https://apmoore1.github.io/
- Repositories: 55
- Profile: https://github.com/apmoore1
PhD student and researcher. Main interests: Target/Aspect based sentiment analysis, Semi-Supervised Learning.
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0