https://github.com/apmoore1/tdsa_comparisons

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: apmoore1
Language: Jupyter Notebook
Default Branch: master
Size: 14.7 MB

Statistics

Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Releases: 0

Created over 6 years ago · Last pushed about 5 years ago

Metadata Files

Readme

TDSA Comparisons

This code base explores how advancements in and outside of Target Dependent Sentiment Analysis (TDSA) have improved TDSA models generally. At the moment we only focus on English datasets.

A table of results for TDSA methods that use a Contextualised Word Representation (CWR; ELMo and BERT are both CWR models) can be found in OverviewofTDSAmethodsthatuseCWR.csv. It compares the methods by their results on different datasets, whether they fine-tune the CWR model, whether they add their own custom neural network layer, called here a Task Specific Architecture (TSA), and whether the CWR is pre-trained on additional data (Pre-Trained). The abbreviations used in the table are: ASC = Aspect Sentiment Classification, AE = Aspect Extraction, ASAE = the joint task of Aspect Extraction and Aspect Sentiment Classification, L = Laptop SemEval dataset, and R = Restaurant SemEval dataset.

Datasets

All of the data is stored in a private folder within ./data. The datasets that we will use are the following:

1. Twitter Election dataset from Wang et al. 2017
2. SemEval 2014 task 4 subtask 2 Laptop, of which the training data can be found here and the test data here
3. SemEval 2014 task 4 subtask 2 Restaurant, of which the training and test data can be found in the same place as the Laptop data.

Only the SemEval datasets need downloading into the ./data folder; the Election dataset is downloaded automatically by the code. To create the training, validation, and test splits for each of the datasets, run the following bash script (it uses the dataset splitter from the target_extraction code base):

bash ./tdsa_comparisons/splitting_data/create_splits.sh

The Election, Laptop, and Restaurant dataset splits can be found in their respective folders: ./data/election_dataset, ./data/laptop_dataset, and ./data/restaurant_dataset. The targets in each text are ordered so that the first target that occurs in the sentence is also the first target within the TargetText object. This re-ordering is done so that methods that rely on this ordering can be used, such as the model from Hazarika et al. 2018, which encodes the aspect/target representations in sequential order using an LSTM.
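The target re-ordering described above can be illustrated with a minimal sketch. It uses a plain dictionary rather than the TargetText class from target_extraction, and the field names (`targets`, `spans`, `target_sentiments`) are assumptions made for illustration only:

```python
from typing import Any, Dict


def order_targets_by_position(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `sample` whose targets are sorted by where they start in the text."""
    # Zip the target-level fields together so they stay aligned while sorting.
    rows = list(zip(sample["spans"], sample["targets"], sample["target_sentiments"]))
    rows.sort(key=lambda row: row[0][0])  # sort on the start offset of each span
    spans, targets, sentiments = zip(*rows) if rows else ((), (), ())
    return {
        **sample,
        "spans": list(spans),
        "targets": list(targets),
        "target_sentiments": list(sentiments),
    }


# Hypothetical sample: the targets are listed out of order on purpose.
sample = {
    "text": "The battery is great but the screen is dim",
    "targets": ["screen", "battery"],
    "spans": [(29, 35), (4, 11)],
    "target_sentiments": ["negative", "positive"],
}
ordered = order_targets_by_position(sample)
print(ordered["targets"])
```

After sorting, the position of a target in the list matches its order of appearance in the sentence, which is what a sequential encoder over targets relies on.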

Word Embeddings

For all of the experiments the 840 billion token, 300 dimension GloVe word vectors will be used. These word vectors should be downloaded to the resources directory under ./resources/embeddings/glove.840B.300d.txt. The only time these word vectors will not be used is during the Contextualised Word Representation (CWR) experiments, where the CWR will be used instead.

The CWR models use the Transformer ELMo architecture and should be downloaded to the ./resources/CWR/ directory. Each dataset has its own CWR because Rietzler et al. 2019 showed that domain specific CWR outperform non-domain specific CWR by a large margin. The three CWR models we use can be found at https://ucrel-web.lancs.ac.uk/moorea/research/phd_thesis/resources/CWR/ and are each named after their dataset:

restaurant_model.tar.gz
laptop_model.tar.gz
election_model.tar.gz

For details on how these CWR were fine-tuned to the domain using a language modelling objective, see the apmoore1/language-model GitHub repository. For a summary of how they were created, see the CWR Model Zoo section.

Analysis of the datasets

Before performing any of the experiments, the datasets are analysed by looking at the different error splits that can be found within TDSA. The notebook exploring these splits can be found at the path below. It requires uploading both the original train and test data from the Restaurant and Laptop datasets (XML files) and the train, validation, and test data created by the dataset splitting above (JSON files):

./analysis/TDSAErrorAnalysis.ipynb

Lastly, the script below generates the general dataset statistics shown in the following table:

python general_dataset_stats.py

| Dataset    | Train         | Validation    | Test          |   Total |
|:-----------|:--------------|:--------------|:--------------|--------:|
| Election   | 6811 (57.24%) | 2547 (21.41%) | 2541 (21.35%) |   11899 |
| Laptop     | 1661 (56.29%) | 652 (22.09%)  | 638 (21.62%)  |    2951 |
| Restaurant | 2490 (52.73%) | 1112 (23.55%) | 1120 (23.72%) |    4722 |
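The percentages in the table are each split's share of the dataset total. The sketch below recomputes them from the raw counts in the table. It is not the general_dataset_stats.py script itself, and the function name is made up for illustration:

```python
from typing import Dict, Tuple

# Split sizes copied from the table above.
SPLIT_SIZES: Dict[str, Dict[str, int]] = {
    "Election": {"Train": 6811, "Validation": 2547, "Test": 2541},
    "Laptop": {"Train": 1661, "Validation": 652, "Test": 638},
    "Restaurant": {"Train": 2490, "Validation": 1112, "Test": 1120},
}


def format_split_row(sizes: Dict[str, int]) -> Tuple[Dict[str, str], int]:
    """Return each split as 'count (share%)' together with the dataset total."""
    total = sum(sizes.values())
    cells = {name: f"{count} ({count / total:.2%})" for name, count in sizes.items()}
    return cells, total


rows = {name: format_split_row(sizes) for name, sizes in SPLIT_SIZES.items()}
for name, (cells, total) in rows.items():
    print(name, cells["Train"], cells["Validation"], cells["Test"], total)
```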

Experiments

In all of the experiments we are going to use the following 3 TDSA models:

1. TDLSTM
2. IAN
3. AE (in the thesis/paper it is called AE-Attention) -- A model that is the same as the AE model from Wang et al. 2016 but with an attention layer after the LSTM encoder. This model is also the same as the inter-aspect model (from now on called Inter-AE) from Hazarika et al. 2018 but without the LSTM aspect encoder (phase 2 in figure 1) that models other targets from the same context/text.

The 4 main experiments are the following:

1. Baseline -- TDLSTM, IAN, and AE as is, without modification, using the GloVe word vectors. Also included in these experiments is a baseline CNN text classifier which does not take into account the target.
2. Inter-Aspect -- TDLSTM, IAN, and InterAE with Inter-Aspect modelling. To incorporate Inter-Aspect modelling we will adopt the method of Hazarika et al. 2018, which uses an LSTM, hence why the AE model is now called the InterAE model.
3. Position -- Run the IAN and AE models with the position of the target encoded. TDLSTM is not used in this experiment as the model already encodes position information via the network architecture.
4. CWR -- Replace the GloVe vectors with domain specific CWR for TDLSTM, IAN, and AE.

All of the default training configurations for each of the 3 models can be found here.

Furthermore, all of the experiments in the experiment section are run through bash scripts, each of which takes two arguments:

1. The number of times to run each model. In all of the experiments this is 8.
2. The directory in which to store a saved model for each model run on each dataset.

Baseline Text classification experiments

Before performing all of the experiments on the target based models we want to set a benchmark on these datasets using standard text classification models that have no knowledge of the target. In these experiments we have one CNN based model from Kim 2014, which takes word embeddings as input and passes them through 3 filters (window sizes 3, 4, and 5), each with a filter map of 100. This model is going to have two versions:

1. Trained on all of the sentences from the TDSA datasets, where a sentence with multiple targets and sentiments is assigned its most frequent sentiment (ties decided by random choice).
2. Trained on only the sentences from the TDSA datasets that have a single sentiment associated with them.
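The two labelling schemes can be sketched as follows. This is an illustration rather than the code used in this repository: the function names are hypothetical, each sentence is assumed to have at least one target, and "a single sentiment" is read as all targets in the sentence sharing the same label.

```python
import random
from collections import Counter
from typing import List, Optional


def average_label(target_sentiments: List[str], rng: random.Random) -> str:
    """Most frequent target sentiment in the sentence, ties broken at random."""
    counts = Counter(target_sentiments)
    highest = max(counts.values())
    # Sort the tied labels so the random choice is reproducible for a given seed.
    tied = sorted(label for label, count in counts.items() if count == highest)
    return rng.choice(tied)


def single_label(target_sentiments: List[str]) -> Optional[str]:
    """The sentence label if every target shares it, otherwise None (sentence is skipped)."""
    distinct = set(target_sentiments)
    return next(iter(distinct)) if len(distinct) == 1 else None


rng = random.Random(0)
print(average_label(["positive", "positive", "negative"], rng))
print(single_label(["neutral", "neutral"]))
print(single_label(["positive", "negative"]))
```

Under this reading, the single version trains on fewer sentences than the average version, because sentences with mixed sentiments are dropped rather than relabelled.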

From now on the two versions will be called CNN(average) and CNN(single) respectively. The metadata associated with the results of these baselines will be the same as that from the target based methods, which is described in the results section. The only extra metadata added for these experiments is the data-trained-on field within the predicted_target_sentiment_key dictionary. It can only have two values, single and average, which represent the two model versions CNN(single) and CNN(average).

Before training these two model versions we need to create two new training and validation datasets based on the different sentiment labels (single and average). To do this easily we create new data directories for each of the datasets (Election, Restaurant, and Laptop). These data directories can be found in ./data/text_classification/single and ./data/text_classification/average for the single and average sentiment labels respectively. To create these data directories, run the following bash script:

bash ./tdsa_comparisons/splitting_data/text_classification_dataset_creator.sh

Running this bash script multiple times will not change the data, but it will print the statistics of the text sentiment labels in the training and validation datasets. For the training dataset these statistics are better shown through this notebook. The main difference with these dataset directories is that they have an extra validation dataset called train_val.json, which is used for early stopping of the text classification models. However, predictions for the TDSA task are made, as in all of the other experiments, on the val.json and test.json TDSA data.

Run the following bash script to re-create the CNN text classification experiments:

bash ./tdsa_comparisons/experiments/non_target_baseline.sh 8 ./saved_models/

The results from the CNN text classification experiments can be seen in the associated notebook. CNN (average) was better than CNN (single) on 2 of the 3 datasets for accuracy and on all 3 datasets for macro F1, across both validation and test splits. Thus CNN (average) will be used as the baseline to compare against the TDSA methods.

Baseline Experiments

To run all 3 TDSA methods for the baseline experiments on the 3 datasets run the following bash script:

bash ./tdsa_comparisons/experiments/baseline.sh 8 ./saved_models/

Inter-Aspect Experiments

To run all 3 TDSA methods for the inter-aspect experiments on the 3 datasets run the following bash script:

bash ./tdsa_comparisons/experiments/sequential_inter_aspect.sh 8 ./saved_models/

Position Embedding Experiments

To run the 2 TDSA methods for the position embedding experiments on the 3 datasets run the following:

bash ./tdsa_comparisons/experiments/position_embeddings.sh 8 ./saved_models/

Position Weighting and Inter-Aspect Experiments

To run the 2 TDSA methods for the position weighting and inter-aspect experiments on the 3 datasets run the following:

bash ./tdsa_comparisons/experiments/sequential_inter_aspect_and_position_weighting.sh 8 ./saved_models/

CWR Experiments

To run the 3 TDSA methods for the CWR experiments on the 3 datasets run the following:

bash ./tdsa_comparisons/experiments/cwr.sh 8 ./saved_models/

We also ran the CNN (average) model with CWR to see the difference between the TDSA and text classification methods when CWR are used. To run the CNN model:

bash ./tdsa_comparisons/experiments/non_target_cwr_baseline.sh 8 ./saved_models/

The results are saved to their own directory, ./data/text_classification/average; later we show how to merge these results with the TDSA results.

{'metadata':{'predictions':{'targetsentiment{wordvector}{position}_{}}}

Results

In the results section the predictions generated from all of the experiments are examined through their respective notebooks, which can be found in the analysis folder. The predictions from the experiments are saved to the original data within the data folder (./data) and are then released anonymised (no text) within this GitHub repository in the saved results folder.

Thus, before any analysis is performed, we first anonymise the results so that the analysis can be performed. We then explain how the predictions and the metadata are stored, and lastly we state which notebooks show which analysis.

Anonymise the results

All of the results from these models are released somewhat anonymised (the text data from the dataset is removed), in a JSON format nearly identical to their original format. The results from all of the experiments for the target based models and the best performing non-target baseline can be found in the following folders:

1. ./saved_results/main/restaurant
2. ./saved_results/main/laptop
3. ./saved_results/main/election

The results for the two CNN non-target baseline versions (CNN(single) and CNN(average)) can be found within the following two folders, both of which contain additional sub-folders for each dataset:

1. ./saved_results/non_target_baselines/single
2. ./saved_results/non_target_baselines/average

Each dataset folder contains test.json, val.json, and train.json files. The first two hold the test and validation results for the associated dataset, and train.json holds the training dataset so that error analysis can be performed.

Before the anonymisation we want to merge the results from the best performing CNN text classifier (CNN (average)) and the CNN using CWR with the TDSA results. This is done so that the analysis is easier. To do this, run the following script before the anonymisation:

python merge_text_and_tdsa_results.py ./data/text_classification/average ./data

To create the anonymised results for the target based models and the best performing non-target baselines, run the following:

python anonymise_dataset_folder.py ./data ./saved_results/main

To create the anonymised results for the two CNN non-target baseline versions, run the following:

python anonymise_dataset_folder.py ./data/text_classification/single ./saved_results/non_target_baselines/single
python anonymise_dataset_folder.py ./data/text_classification/average ./saved_results/non_target_baselines/average
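Conceptually, anonymising a result file means dropping the text from every sample while keeping the labels and predictions. The sketch below shows that idea only. It is not the anonymise_dataset_folder.py script, and it assumes one JSON object per line with the sentence stored under a hypothetical `text` key:

```python
import json
from typing import Iterable, List, Sequence


def anonymise_lines(lines: Iterable[str], text_keys: Sequence[str] = ("text",)) -> List[str]:
    """Remove the text fields from every JSON line and return the rewritten lines."""
    anonymised = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        for key in text_keys:
            record.pop(key, None)  # assumption: these keys hold the dataset text
        anonymised.append(json.dumps(record))
    return anonymised


# Hypothetical input line; the field names are illustrative only.
raw_lines = [
    '{"text_id": "demo-1", "text": "The screen is dim", "target_sentiments": ["negative"]}',
]
anonymised = anonymise_lines(raw_lines)
print(anonymised[0])
```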

Result/Prediction Data and Metadata

Each validation and test .json result file (this applies to all data within ./saved_results and ./data) contains metadata, which is stored on the last line of the .json file. The metadata contains the following keys:

1. name -- Name of the dataset, in this case either Laptop, Restaurant, or Election
2. split -- The dataset split, either Validation or Test
3. predicted_target_sentiment_key -- This contains a dictionary of dictionaries where each dictionary key links to a predicted sentiment key in each sample, e.g. predicted_target_sentiment_IAN_GloVe_None_None. This key then has a dictionary as its value describing the model that generated those predictions. This dictionary has the following keys:
    * CWR -- Whether the model used Contextualised Word Representations; if False then GloVe vectors were used.
    * Inter-Aspect -- Whether the model took into account inter-aspect/target modelling; if so the value is the name of this modelling, else False. The only valid name is sequential, for the Hazarika et al. 2018 LSTM method.
    * Position -- Whether or not target position weighting or embedding was used; if not this is False. The valid names that can appear here are Weighting or Embedding.
    * Model -- The name of the TDSA model used. Valid names that can appear are the following: AE, TDLSTM, and IAN

An example of this metadata is shown below:

{"name": "Laptop", "split": "Test", "predicted_target_sentiment_key": {"predicted_target_sentiment_IAN_GloVe_None_None": {"CWR": false, "Position": false, "Inter-Aspect": false, "Model": "IAN"}}}
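Since the metadata sits on the last line of each result file, it can be read without parsing the samples. The following is a minimal sketch, assuming every line of the file is one JSON object and the last non-empty line is the metadata object in the form shown above. The helper name and the demo sample line are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def read_metadata(result_file: Path) -> dict:
    """Parse the last non-empty line of a result file as the metadata object."""
    lines = [
        line
        for line in result_file.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    if not lines:
        raise ValueError(f"{result_file} is empty")
    return json.loads(lines[-1])


# Demo on a throwaway file: one made-up sample line followed by the metadata line.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "test.json"
    sample_line = json.dumps({"text_id": "demo-1"})
    metadata_line = json.dumps({
        "name": "Laptop",
        "split": "Test",
        "predicted_target_sentiment_key": {
            "predicted_target_sentiment_IAN_GloVe_None_None": {
                "CWR": False, "Position": False, "Inter-Aspect": False, "Model": "IAN",
            }
        },
    })
    demo_file.write_text(sample_line + "\n" + metadata_line + "\n", encoding="utf-8")
    metadata = read_metadata(demo_file)

for key, description in metadata["predicted_target_sentiment_key"].items():
    print(key, description["Model"])
```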

The metadata requires you to know the prediction key to find the associated metadata. The prediction keys have the following structure:

predicted_target_sentiment_$ Model Name $_$ Word Representation $_$ Position Encoding $_$ Inter Aspect Encoding $

The values within the dollar signs can be the following:

1. Model Name -- IAN, AE, TDLSTM, CNN
2. Word Representation -- CWR or GloVe.
3. Position Encoding -- None, Weighted, or Embedding. Where None represents no position encoding.
4. Inter Aspect Encoding -- None, or sequential. Where None represents no inter aspect encoding.

Thus the following prediction key is for an IAN model that has been trained with GloVe and no position or inter-aspect encodings: predicted_target_sentiment_IAN_GloVe_None_None
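Because the key has a fixed four-part structure, it can be built and split mechanically. The helpers below are a sketch with hypothetical names, and they assume that none of the four values contains an underscore, which holds for the values listed above:

```python
from typing import NamedTuple

PREFIX = "predicted_target_sentiment_"


class PredictionKey(NamedTuple):
    model: str
    word_representation: str
    position: str
    inter_aspect: str


def build_key(parts: PredictionKey) -> str:
    """Join the four parts into a prediction key."""
    return PREFIX + "_".join(parts)


def parse_key(key: str) -> PredictionKey:
    """Split a prediction key back into its four parts."""
    if not key.startswith(PREFIX):
        raise ValueError(f"not a prediction key: {key!r}")
    fields = key[len(PREFIX):].split("_")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields after the prefix, got {len(fields)}")
    return PredictionKey(*fields)


key = build_key(PredictionKey("IAN", "GloVe", "None", "None"))
print(key)
print(parse_key(key))
```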

Generating the Metric results

For all of the experiments apart from those between the two versions of the CNN model, we generate the metric results into a TSV file so that they can be better analysed. To generate these metric results, which include results for all of the error splits and their associated subsets, run the following script:

python create_error_subsets_data.py ./saved_results/main/ ./saved_results/main/results.tsv accuracy

This gathers the results from all of the datasets in ./saved_results/main/, generates the metric scores, and saves them in a TSV file stored at ./saved_results/main/results.tsv.

Furthermore, to generate the results for the section "Exploring why the NT split does not show a consistent trend" in the baseline results notebook, run the following script. The results from this script can also be found at ./saved_results/nt_subset_results.tsv.

python create_nt_error_subset_results.py ./saved_results/main/ ./saved_results/nt_subset_results.tsv

Also, to generate the results that test whether the position models improve as the number of targets in a sentence increases, which are used in the following notebook, run the following:

python create_nt_error_subset_results_for_position.py ./saved_results/main/ ./saved_results/position_nt_subset_results.tsv

Within the baseline results notebook there is a section describing the results on the DS subsets that uses Macro F1 scores instead of the accuracy scores used in all of the other subset experiments. To get the Macro F1 scores for the subsets the following script was run, and the results can be found at ./saved_results/main/macro_f1_results.tsv:

python create_error_subsets_data.py ./saved_results/main/ ./saved_results/main/macro_f1_results.tsv macro_f1
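For reference, the two metrics named on the command line can be sketched in plain Python. This shows the standard definitions, not the implementation in create_error_subsets_data.py. Macro F1 is taken here as the unweighted mean of the per-class F1 over every label that appears in the gold or predicted labels, which is an assumption about how the repository computes it:

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and the same length")
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


gold = ["positive", "positive", "negative", "neutral"]
predicted = ["positive", "negative", "negative", "neutral"]
print(accuracy(gold, predicted))
print(round(macro_f1(gold, predicted), 4))
```

Macro F1 weights every class equally, so it is more sensitive than accuracy to errors on the less frequent sentiment classes.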

Baseline Results

The following notebook shows the Baseline results

Owner

Name: Andrew Moore
Login: apmoore1
Kind: user
Location: Lancaster
Company: Lancaster University

Website: https://apmoore1.github.io/
Repositories: 55
Profile: https://github.com/apmoore1

PhD student and researcher. Main interests: Target/Aspect based sentiment analysis, Semi-Supervised Learning.

Issues and Pull Requests

Last synced: over 1 year ago

All Time

Total issues: 0
Total pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Total issue authors: 0
Total pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 0
Average time to close issues: N/A
Average time to close pull requests: N/A
Issue authors: 0
Pull request authors: 0
Average comments per issue: 0
Average comments per pull request: 0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0