thesis_sentiment_analysis

https://github.com/marcnhwu/thesis_sentiment_analysis

Science Score: 57.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
✓ DOI references: Found 2 DOI reference(s) in README
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (8.2%) to scientific vocabulary

Keywords

machine-learning natural-language-processing sentiment-analysis text-classification

Repository

Basic Info

Host: GitHub
Owner: marcnhwu
Language: R
Default Branch: main
Homepage:
Size: 1.31 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 5 years ago · Last pushed almost 5 years ago

Sentiment Analysis of Chinese Reviews Using Morphosyntactic Patterns

This is the supplementary code and datasets for the machine learning experiments in my thesis project, titled Sentiment Analysis of Chinese Reviews Using Morphosyntactic Patterns.

Abstract of the Thesis Project

Sentiment analysis is one of the most commonly discussed topics in the field of Natural Language Processing. While the traditional bag-of-words approach using n-grams is generally adopted for sentiment analysis tasks such as sentiment classification, studies have suggested that features beyond bag-of-words, such as grammatical and textual features, are crucial to the classifier’s performance. In particular, this study investigates to what extent linguistically-motivated morphosyntactic patterns may contribute to sentiment classification by analyzing their impact on the sentiment polarity of lexical features such as sentiment words in Chinese online movie reviews. We adopt pattern grammar as our theoretical framework to qualitatively encode patterns and the Wilcoxon rank-sum test to quantitatively determine significant patterns and their sentiment preferences.
Our analyses show that morphosyntactic patterns demonstrate two prominent types of modulation of lexical sentiment polarity: intensifying the positive lexical sentiment or mitigating the negative lexical sentiment. Our post-hoc collexeme analyses of these patterns also show that sentiment-intensifying patterns attract more positive words and that sentiment-mitigating patterns attract more negative words. These preferences reveal how Chinese speakers utilize morphosyntactic patterns to modulate the sentiment in their opinions and establish their credibility in online movie reviews. Finally, we train a series of Support Vector Machine models and perform two document classification experiments to validate the effectiveness of morphosyntactic patterns in comparison to the traditional bag-of-words models. In the first experiment, we examine whether our linguistically-motivated morphosyntactic patterns can capture a comparable amount of beyond-single-word information compared with sentiment-word-embedded n-grams, which are traditional n-grams that specifically contain sentiment words. In the second experiment, we test whether sentiment-modulating morphosyntactic patterns contribute to sentiment classification on top of the traditional n-gram-based model. Results of the first experiment suggest that morphosyntactic patterns can encode a wider range of the crucial morphosyntactic properties of sentiment words more efficiently than sentiment-word-embedded n-grams. The second experiment shows that morphosyntactic patterns improved the traditional n-gram-based model comprising unigrams and bigrams. Moreover, we obtained an averaged F1 score of 87.80 when considering morphosyntactic patterns with other features such as n-grams and sentiment words in the classifier.
We conclude that handcrafted, linguistically-motivated morphosyntactic patterns can provide an alternative to the brute-force n-gram methods that have been commonly employed in building classifiers for sentiment classification tasks.

If you'd like to use the material, please cite this repository:
Wu, N. (2021). Supplementary Material for Sentiment Analysis of Chinese Reviews Using Morphosyntactic Patterns (Version 1.0.0) [Computer software]. https://doi.org/10.6345/NTNU202100009

Table of Contents

Dataset

ANTUSD.csv: the sentiment dictionary ANTUSD
ANTUSD_PosNeg.csv: only the positive and negative words in ANTUSD
YahooChineseReviews_Corpus.csv: the Chinese movie reviews corpus collected by the author

Code

coll.analysis.R: the Collexeme Analysis
Experiment I.R: Experiment I for machine learning
Experiment II.R: Experiment II for machine learning
patternsforranksum.R: the morphosyntactic patterns with significant sentiment modulation
preprocessingforcoll.analysis.R: preprocessing data in order to perform the Collexeme Analysis
weighting.R: weighting the co-occurrence of sentiment words in patterns based on the document-feature matrix (dfm) of sentiment words
wilcoxon_ranksum.R: quantifying the pattern and non-pattern contexts & performing the Wilcoxon rank-sum test

Experiment Procedure

Wilcoxon rank-sum test >> Experiment I >> Experiment II >> Post-hoc Analysis

Wilcoxon rank-sum test

Run wilcoxon_ranksum.R for the rank-sum test.
1. Load & pre-process data: run lines 1–18.
2. Quantify the pattern context: run lines 21–49.
3. Quantify the NON-pattern context: run lines 51–79.
4. Multiply the token frequencies by rating: run lines 82–117.
5. Perform the Wilcoxon rank-sum test & determine the modulation of patterns: run lines 120–160.

To run the rank-sum test on another pattern, open patternsforranksum.R, copy the code for that pattern, and use it to replace lines 25–35 in wilcoxon_ranksum.R. Then run all of the code in wilcoxon_ranksum.R.
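As a rough illustration of the final step, the sketch below compares a pattern context with a non-pattern context using base R's `wilcox.test`. It is not the code in wilcoxon_ranksum.R: the variable names, the toy numbers, the 0.05 threshold, and the median-based direction rule are all assumptions made for this example.

```r
# Toy rating-weighted token frequencies. These values are invented for
# illustration and do not come from the thesis corpus.
pattern_context     <- c(12, 15, 9, 20, 18, 14, 16, 11)
non_pattern_context <- c(7, 10, 6, 13, 9, 8, 12, 5)

# Two-sided Wilcoxon rank-sum test (base R, stats package).
# exact = FALSE requests the normal approximation, which tolerates ties.
result <- wilcox.test(pattern_context, non_pattern_context,
                      alternative = "two.sided", exact = FALSE)

# Hypothetical decision rule: if the difference is significant, report
# which context has the higher median.
alpha <- 0.05
direction <- if (result$p.value >= alpha) {
  "no significant modulation"
} else if (median(pattern_context) > median(non_pattern_context)) {
  "pattern context ranks higher"
} else {
  "pattern context ranks lower"
}

print(result$p.value)
print(direction)
```

How a significant difference maps onto "intensifying" or "mitigating" depends on how the ratings are coded in the real script, so the labels above are deliberately neutral.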

Experiment I

Run Experiment I.R for Experiment I.
1. Load & pre-process data: run lines 1–37.
2. Create feature sets: run lines 40–162.
3. Baseline models:
   - To build the SW+SW-Bigrams model, run lines 168–177 first, and then lines 180–190; finally, run lines 241–295 for the classification experiment with the SW+SW-Bigrams model.
   - To build the SW+SW-Trigrams model, run lines 193–203; then, run lines 241–295 for the classification experiment with the SW+SW-Trigrams model.
   - To build the SW+SW-Four-grams model, run lines 206–216; then, run lines 241–295 for the classification experiment with the SW+SW-Four-grams model.
   - To build the SW+SW-Five-grams model, run lines 219–229; then, run lines 241–295 for the classification experiment with the SW+SW-Five-grams model.
4. Proposed model:
   - First, open weighting.R to perform weighting on the document-feature matrix based on the sentiment word features. Run all of the code in weighting.R.
   - Return to Experiment I.R and run lines 232–235 to build the SW+MorphosyntacticPatterns model. Finally, run lines 241–295 for the classification experiment with the SW+MorphosyntacticPatterns model.
5. Post-hoc t-test:
   - Change the models to compare in line 306 and line 317.
   - Run lines 315–321 for the adjusted R-squared value.
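The baseline models above rely on sentiment-word-embedded n-grams, that is, n-grams that contain at least one sentiment word. The base-R sketch below shows one way such features could be extracted from a tokenized review. The function names, the pinyin toy tokens, and the one-word lexicon are hypothetical, and the actual feature construction in Experiment I.R may differ.

```r
# Build all n-grams of length n from a token vector, joined with "_".
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Keep only the n-grams in which at least one token is a sentiment word.
sw_ngrams <- function(tokens, n, sentiment_words) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  keep <- vapply(starts,
                 function(i) any(tokens[i:(i + n - 1)] %in% sentiment_words),
                 logical(1))
  make_ngrams(tokens, n)[keep]
}

# Toy example: a segmented review in pinyin and a one-word toy lexicon.
tokens <- c("zhe", "bu", "dianying", "feichang", "haokan")
sentiment_words <- c("haokan")

print(sw_ngrams(tokens, 2, sentiment_words))
print(sw_ngrams(tokens, 3, sentiment_words))
```

Raising n grows the feature space quickly, which is the cost that the abstract's efficiency comparison between patterns and sentiment-word-embedded n-grams refers to.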

Experiment II

Run Experiment II.R for Experiment II.
1. Load & pre-process data: run lines 1–37.
2. Create feature sets: run lines 40–82.
3. Baseline models:
   - To build the Unigrams–Bigrams model, run lines 88–99 first, and then run lines 138–192 for the classification experiment with the Unigrams–Bigrams model.
   - To build the Unigrams–Bigrams–SentimentWord model, run lines 102–117, and then run lines 138–192 for the classification experiment with the Unigrams–Bigrams–SentimentWord model.
   - To build the Unigrams–Bigrams–SentimentWord–MorphosyntacticPatterns model, run lines 120–132 (provided you have already run weighting.R to perform weighting on the document-feature matrix based on the sentiment word features). Then, run lines 138–192 for the classification experiment with the Unigrams–Bigrams–SentimentWord–MorphosyntacticPatterns model.
4. Post-hoc t-test: run lines 198–210.
   - Change the models to compare in line 203.
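For readers unfamiliar with the evaluation step, here is a minimal base-R sketch of computing an F1 score and comparing two models with a t-test. Everything in it is an assumption for illustration: the `f1_score` helper, the toy labels, the placeholder per-fold scores (which are not results from the thesis), and the choice of a paired test. The script itself defines the actual procedure.

```r
# F1 for one class, computed from gold and predicted label vectors.
f1_score <- function(gold, pred, positive) {
  tp <- sum(pred == positive & gold == positive)
  fp <- sum(pred == positive & gold != positive)
  fn <- sum(pred != positive & gold == positive)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Toy labels (invented): two of three reviews correct in each class.
gold <- c("pos", "pos", "pos", "neg", "neg", "neg")
pred <- c("pos", "pos", "neg", "neg", "neg", "pos")
macro_f1 <- mean(c(f1_score(gold, pred, "pos"),
                   f1_score(gold, pred, "neg")))
print(macro_f1)

# Placeholder per-fold F1 scores for two models. These are NOT thesis
# results; they only show the shape of the comparison.
f1_baseline <- c(0.80, 0.82, 0.79, 0.81, 0.83)
f1_proposed <- c(0.84, 0.85, 0.82, 0.86, 0.85)

# Paired t-test, assuming both models were evaluated on the same folds.
tt <- t.test(f1_proposed, f1_baseline, paired = TRUE)
print(tt$p.value)
```

A paired test is only appropriate when both models are scored on identical folds; with independent splits, an unpaired test would be the safer choice.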

Post-hoc Collexeme Analysis

Run preprocessingforcoll.analysis.R first. This will generate a contingency table that includes the token frequency of every collexeme in a given pattern and the raw frequency of these collexemes in the corpus.
1. Load & pre-process data: run lines 1–10.
2. To get the contingency table for [feichang SW] (the prototypical sentiment-intensifying pattern), run lines 13–80. This will generate [feichangSW]table.csv.
3. To get the contingency table for [youdian SW] (the prototypical sentiment-mitigating pattern), run lines 85–133. This will generate [youdianSW]table.csv.
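The table described above can be pictured with a small base-R sketch. It counts the words that directly follow `feichang` in a toy token stream and pairs each with its overall corpus frequency. The tokens, the column names, and the "next token" matching rule are assumptions for illustration; the real script works on the review corpus and may require the slot word to be a sentiment word, and the column layout expected by coll.analysis.R should be checked against that script.

```r
# Toy token stream in pinyin (invented, not from the corpus).
corpus_tokens <- c("feichang", "haokan", "feichang", "jingcai", "haokan",
                   "youdian", "wuliao", "haokan")

# Words in the slot right after "feichang" (a simplified pattern match).
idx <- which(head(corpus_tokens, -1) == "feichang")
slot_words <- corpus_tokens[idx + 1]

# Frequency of each collexeme inside the pattern and in the whole corpus.
in_pattern <- table(slot_words)
in_corpus  <- table(corpus_tokens)[names(in_pattern)]

# Hypothetical column names for the resulting contingency table.
tab <- data.frame(word            = names(in_pattern),
                  freq_in_pattern = as.integer(in_pattern),
                  freq_in_corpus  = as.integer(in_corpus),
                  stringsAsFactors = FALSE)
print(tab)

# To save the table, one could call, for example:
# write.csv(tab, "toy_table.csv", row.names = FALSE)
```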

Run coll.analysis.R for the collexeme analysis and answer the prompts as follows:
- What is the word W / the name of the construction C you investigate (without spaces)? >> [feichang SW] / [youdian SW]
- Enter the size of the corpus (in constructions or words) without digit grouping symbols! >> 92041
- Enter the frequency of [feichang SW] in the corpus you investigate (without digit grouping symbols) >> 190 for [feichang SW] / 55 for [youdian SW]
- Which index of association strength do you want to compute? >> -log10 (Fisher-Yates exact, one-tailed)
- How do you want to sort the output? >> collostruction strength
- Enter the number of decimals you'd like to see in the results >> 99
- Choose this text file with the raw data! >> select [feichangSW]table.csv or [youdianSW]table.csv
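The association index selected above, -log10 of a one-tailed Fisher exact p-value, can be approximated for a single collexeme with base R's `fisher.test`. The sketch below is not coll.analysis.R itself: the function name and the per-word frequencies are hypothetical, while the corpus size (92041) and the frequency of [feichang SW] (190) are the values given in the instructions.

```r
# Collostruction strength for one word: -log10 of the one-tailed
# Fisher exact p-value, testing for attraction to the construction.
collostruction_strength <- function(word_in_cxn, word_in_corpus,
                                    cxn_freq, corpus_size) {
  word_elsewhere  <- word_in_corpus - word_in_cxn
  others_in_cxn   <- cxn_freq - word_in_cxn
  others_elsewhere <- corpus_size - cxn_freq - word_elsewhere
  # Rows: this word / all other words.
  # Columns: inside the construction / elsewhere in the corpus.
  m <- matrix(c(word_in_cxn,   word_elsewhere,
                others_in_cxn, others_elsewhere),
              nrow = 2, byrow = TRUE)
  p <- fisher.test(m, alternative = "greater")$p.value
  -log10(p)
}

# Corpus size and construction frequency come from the instructions;
# the two word frequencies (20 or 1 of 500 tokens) are invented.
attracted <- collostruction_strength(20, 500, 190, 92041)
weak      <- collostruction_strength(1, 500, 190, 92041)
print(attracted)
print(weak)
```

A value above roughly 1.3 corresponds to p < 0.05, which is a common reading of this index; a word that occurs in the pattern about as often as chance predicts stays well below that threshold.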

Author

@marcnhwu Contact: nienhengwu@gmail.com

Owner

Name: Marc Wu
Login: marcnhwu
Kind: user
Location: Taipei, Taiwan
Company: Institute of Linguistics, Academia Sinica

Repositories: 1
Profile: https://github.com/marcnhwu

Citation (CITATION.cff)

cff-version: 1.0.3
message: "If you use this software, please cite it using these metadata."
authors:
- family-names: "Wu"
  given-names: "Nien-Heng"
  orcid: ""
title: "Supplementary Material for Sentiment Analysis of Chinese Reviews Using Morphosyntactic Patterns"
version: 1.0.0
doi: 10.6345/NTNU202100009
date-released: 2021-01-15
url: "https://github.com/marcnhwu/Thesis_Sentiment_Analysis"
