https://github.com/apmoore1/bella-allennlp

Allen NLP models and datasets for Bella

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.8%) to scientific vocabulary
Last synced: 9 months ago

Repository

Allen NLP models and datasets for Bella

Basic Info
  • Host: GitHub
  • Owner: apmoore1
  • Language: Python
  • Default Branch: master
  • Size: 5.79 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 7 years ago · Last pushed almost 7 years ago
Metadata Files
Readme

README.md

TODO

  1. Add tests for the augmented data iterator

Target Extraction

We treat target extraction as a sequence labelling problem. The given datasets are not pre-tokenised, so they must be tokenised first; the tokeniser we use is spaCy. Because the text is not pre-tokenised, we first want to see how many of the tokens line up with the span offsets that mark the target word(s) to be predicted in this task. To do this we created the ./tokens_and_targets.py script, which prints the number of targets (samples) where the target word(s) do not fit neatly within the tokens created by the tokeniser. We call these tokenisation errors. An example of this error can be seen below:

Turned on BBCQT and thought I was watching a translation of a Greek political show.Anti austerity from the SNP obviously funded on oil#bbcqt

Here the target is bbcqt, but the token that spaCy found was oil#bbcqt. Predicting that token as the target would be incorrect, as it incorporates more than just the target. Running the following commands:

    python tokens_and_targets.py ~/.Bella/Datasets/ Laptop
    python tokens_and_targets.py ~/.Bella/Datasets/ Restaurant
    python tokens_and_targets.py ~/.Bella/Datasets/ Election

we find that 0.94%, 0.18%, and 1.53% of the Laptop, Restaurant, and Election test datasets respectively have tokenisation errors, which is minimal but noteworthy. One thing we did find is that, in the Laptop training dataset, one of the targets (in sentence id 1436) included a space.

Here we only report errors on the test sets. For the training and validation sets we can force a space between the target word and the surrounding text, change the span offsets accordingly, and thus remove the tokenisation errors, as shown by the following commands:

    python tokens_and_targets.py ~/.Bella/Datasets/ Laptop --force_space
    python tokens_and_targets.py ~/.Bella/Datasets/ Restaurant --force_space
    python tokens_and_targets.py ~/.Bella/Datasets/ Election --force_space

We do not do this for the test datasets because we want a fair comparison with previous work, and so we do not change the test data in any way to avoid the tokenisation errors.
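To make the alignment check and the forced-space fix concrete, here is a minimal sketch in Python. It is not the code in tokens_and_targets.py: the function names (token_spans, target_aligns, force_space) are hypothetical, and a plain whitespace split stands in for the spaCy tokeniser, so the error counts it would give on the real datasets may differ from those reported above.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def token_spans(text: str) -> List[Span]:
    """Character spans of whitespace-delimited tokens.

    Assumption: a whitespace split is a stand-in for the spaCy tokeniser.
    """
    return [match.span() for match in re.finditer(r"\S+", text)]


def target_aligns(text: str, target: Span) -> bool:
    """True if the target starts on a token start and ends on a token end."""
    spans = token_spans(text)
    starts = {start for start, _ in spans}
    ends = {end for _, end in spans}
    return target[0] in starts and target[1] in ends


def force_space(text: str, target: Span) -> Tuple[str, Span]:
    """Insert a space on each side of the target where it touches other
    characters, and return the new text with the shifted target span."""
    start, end = target
    before, word, after = text[:start], text[start:end], text[end:]
    if before and not before[-1].isspace():
        before += " "
    if after and not after[0].isspace():
        after = " " + after
    new_start = len(before)
    return before + word + after, (new_start, new_start + len(word))


# The example sentence from the Election dataset quoted above.
text = ("Turned on BBCQT and thought I was watching a translation of a "
        "Greek political show.Anti austerity from the SNP obviously "
        "funded on oil#bbcqt")
start = text.rindex("bbcqt")
target = (start, start + len("bbcqt"))

print(target_aligns(text, target))          # tokenisation error if False
new_text, new_target = force_space(text, target)
print(target_aligns(new_text, new_target))  # aligned once a space is forced
```

With this stand-in tokeniser the original sentence counts as a tokenisation error, because bbcqt sits inside the token oil#bbcqt. After force_space the target becomes its own token and the offsets are shifted by the one inserted character. The same alignment test works for multi-word targets, since it only checks the two boundaries of the span.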

Owner

  • Name: Andrew Moore
  • Login: apmoore1
  • Kind: user
  • Location: Lancaster
  • Company: Lancaster University

PhD student and researcher. Main interests: Target/Aspect based sentiment analysis, Semi-Supervised Learning.

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 2
  • Total pull requests: 0
  • Average time to close issues: about 21 hours
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 0.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • apmoore1 (2)