https://github.com/apmoore1/bella-allennlp
Allen NLP models and datasets for Bella
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: apmoore1
- Language: Python
- Default Branch: master
- Size: 5.79 MB
Statistics
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
TODO
- Add tests for the augmented data iterator
Target Extraction
We treat this problem as sequence labelling. The datasets are not pre-tokenised, so they must first be tokenised; the tokeniser we use is spaCy. Because the text is not pre-tokenised, we first check how many of the tokens line up with the span offsets of the target words that must be predicted in this task. To do this we created the ./tokens_and_targets.py script, which prints the number of targets (samples) whose target word(s) do not fit neatly within the tokens produced by the tokeniser. We call these tokenisation errors. An example of this error can be seen below:
Turned on BBCQT and thought I was watching a translation of a Greek political show.Anti austerity from the SNP obviously funded on oil#bbcqt
Here the target is bbcqt, but the token that spaCy found was oil#bbcqt, so predicting that token as the target would be incorrect because it covers more than just the target. We count these errors by running the following commands:
```bash
python tokens_and_targets.py ~/.Bella/Datasets/ Laptop
python tokens_and_targets.py ~/.Bella/Datasets/ Restaurant
python tokens_and_targets.py ~/.Bella/Datasets/ Election
```
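The alignment check described above can be illustrated with a minimal sketch. It is not the code in tokens_and_targets.py: the function names are hypothetical, and a plain whitespace tokeniser stands in for spaCy as an assumption so that the example has no dependencies. A target counts as aligned only when its start offset is the start of some token and its end offset is the end of some token.

```python
import re
from typing import List, Tuple


def whitespace_token_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-delimited tokens.

    Assumption: this simple tokeniser is only a stand-in for spaCy.
    """
    return [match.span() for match in re.finditer(r"\S+", text)]


def target_aligns(token_spans: List[Tuple[int, int]],
                  target_span: Tuple[int, int]) -> bool:
    """True when the target starts at a token start and ends at a token end."""
    starts = {start for start, _ in token_spans}
    ends = {end for _, end in token_spans}
    target_start, target_end = target_span
    return target_start in starts and target_end in ends


text = ("Turned on BBCQT and thought I was watching a translation of a Greek "
        "political show.Anti austerity from the SNP obviously funded on oil#bbcqt")
# str.index is case sensitive, so this finds the lowercase target at the end.
target_start = text.index("bbcqt")
target_span = (target_start, target_start + len("bbcqt"))

# The target begins in the middle of the token "oil#bbcqt", so it is not aligned.
is_aligned = target_aligns(whitespace_token_spans(text), target_span)
```

Counting the targets for which such a check fails, and dividing by the total number of targets, would give a percentage of the kind reported for each dataset.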
For each of the datasets we find the following: 0.94%, 0.18%, and 1.53% of the test datasets have tokenisation errors, which is minimal but noteworthy. We also found that in the Laptop training dataset one of the targets (sentence id 1436) included a space. We only report errors on the test sets because, for the training and validation sets, we can force a space between the target word and the surrounding text and adjust the span offsets, which removes the tokenisation errors, as shown by the following commands:
```bash
python tokens_and_targets.py ~/.Bella/Datasets/ Laptop --force_space
python tokens_and_targets.py ~/.Bella/Datasets/ Restaurant --force_space
python tokens_and_targets.py ~/.Bella/Datasets/ Election --force_space
```
The reason we do not do this for the test datasets is that we want a fair comparison with previous work, and thus we do not change the test datasets in any way to avoid the tokenisation errors.
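To make the idea of forcing a space concrete, here is a rough sketch of how a space could be inserted around a target and the span offsets shifted to match. How --force_space is actually implemented is not shown here, so the `force_space` function below and its behaviour are assumptions made for illustration only.

```python
from typing import Tuple


def force_space(text: str, target_span: Tuple[int, int]) -> Tuple[str, Tuple[int, int]]:
    """Insert a space on either side of the target where one is missing.

    Hypothetical helper: returns the new text and the adjusted target span.
    """
    start, end = target_span
    target = text[start:end]
    before, after = text[:start], text[end:]
    # A space added before the target shifts the target one character right.
    if before and not before[-1].isspace():
        before += " "
        start += 1
    # A space added after the target does not move the target itself.
    if after and not after[0].isspace():
        after = " " + after
    return before + target + after, (start, start + len(target))


text = "funded on oil#bbcqt"
start = text.index("bbcqt")
new_text, new_span = force_space(text, (start, start + len("bbcqt")))
# new_text[new_span[0]:new_span[1]] still selects the target word.
```

After this change a whitespace-based tokeniser would produce the target as its own token, which is why the adjustment removes the tokenisation error for that sample.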
Owner
- Name: Andrew Moore
- Login: apmoore1
- Kind: user
- Location: Lancaster
- Company: Lancaster University
- Website: https://apmoore1.github.io/
- Repositories: 55
- Profile: https://github.com/apmoore1
PhD student and researcher. Main interests: Target/Aspect based sentiment analysis, Semi-Supervised Learning.
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 2
- Total pull requests: 0
- Average time to close issues: about 21 hours
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- apmoore1 (2)