wisesight-sentiment

Thai social media text sentiment dataset

https://github.com/pythainlp/wisesight-sentiment

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 8 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (7.9%) to scientific vocabulary

Keywords

classification corpus sentiment-analysis thai tokenization

Repository

Basic Info
  • Host: GitHub
  • Owner: PyThaiNLP
  • License: cc0-1.0
  • Language: Jupyter Notebook
  • Default Branch: master
  • Homepage:
  • Size: 10.9 MB
Statistics
  • Stars: 83
  • Watchers: 6
  • Forks: 33
  • Open Issues: 0
  • Releases: 0
Created almost 7 years ago · Last pushed about 1 year ago

README.md


SPDX-License-Identifier: CC0-1.0

Wisesight Sentiment Corpus

Social media messages in Thai with sentiment labels (positive, neutral, negative, question). The corpus contains 26,737 messages and is dedicated to the public domain under CC0 1.0 Universal.

Changelog

  • 2024-11-07: Released v1.1 with updated copyright text, contributors, and a software bill of materials (SBOM).
  • 2020-12-01: Added Hugging Face format (PR #7).
  • 2019-10-01: Fixed path in data preparation notebook (PR #6).
  • 2019-08-22: Added tokenization annotation for ~1,000 samples (PR #4).
  • 2019-07-03: Added tokenization annotation for 160 samples (PR #2).
  • 2019-03-31: Updated data.

Data characteristics and preprocessing

This corpus does not claim to be a statistically representative sample of the Thai language register.

General information:

  • Size: 26,737 messages.
  • Language: Central Thai.
  • Style: Informal and conversational, with some news headlines and advertisements.

Data coverage:

  • Time period: Approximately 2016 to early 2019, with a small amount from other periods.
  • Domains: Mixed, with a majority focusing on consumer products and services (restaurants, cosmetics, drinks, cars, hotels). Some current affairs topics are also included.

Privacy:

  • Messages were collected from publicly available online sources only (websites, blogs, social network sites).
  • For Facebook data, this includes public comments on public pages.
  • The dataset does not contain private/protected messages or messages from groups, chats, and inboxes.
  • Personally identifiable information has been removed or masked.

Data alterations and modifications:

  • A large portion of messages are not in their original form:
    • Usernames and non-public figure names are removed.
    • Phone numbers are masked (e.g., 088-888-8888, 09-9999-9999, 0-2222-2222).
    • Duplicated, leading, and trailing whitespace is removed.
    • Other punctuation, symbols, and emojis are retained.
    • Misspellings remain uncorrected.
  • Messages longer than 2,000 characters and non-Thai messages are removed.
  • Duplicate messages (exact matches) are removed.
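
The clean-up rules above can be approximated in a few lines of Python. This is a minimal sketch, not the maintainers' actual pipeline: the function names (`normalize_whitespace`, `contains_thai`, `clean_messages`) are hypothetical, and "non-Thai" is assumed to mean "contains no character from the Unicode Thai block". Username removal and phone-number masking are left out because the exact rules are not given here.

```python
import re

MAX_CHARS = 2000  # length limit stated in the list above
# Assumption: a message counts as Thai if it has at least one
# character from the Unicode Thai block (U+0E00 to U+0E7F).
THAI_CHAR = re.compile(r"[\u0E00-\u0E7F]")
WHITESPACE_RUN = re.compile(r"\s+")


def normalize_whitespace(text):
    """Collapse whitespace runs to one space and trim both ends."""
    return WHITESPACE_RUN.sub(" ", text).strip()


def contains_thai(text):
    """Return True if the text has at least one Thai character."""
    return THAI_CHAR.search(text) is not None


def clean_messages(messages):
    """Normalize, filter, and de-duplicate messages, keeping order."""
    seen = set()
    cleaned = []
    for raw in messages:
        text = normalize_whitespace(raw)
        # Drop empty, over-long, and non-Thai messages.
        if not text or len(text) > MAX_CHARS or not contains_thai(text):
            continue
        # Drop exact duplicates (compared after normalization).
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Punctuation, symbols, emojis, and misspellings pass through untouched, which matches the list above. Duplicates are compared after whitespace normalization; whether the original process compared before or after is not stated.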

Further exploration:

  • Refer to sbom.spdx3.json for a machine-readable data bill of materials (BOM) in SPDX 3.0 format.
  • Explore additional data characteristics using this notebook.

Annotation methodology

  • Sentiment values are assigned by human annotators.
  • Each annotator makes a best effort to assign exactly one of the four labels to a message.
  • A message can be ambiguous. When possible, the judgement is based solely on the text itself.
    • In some situations, such as when the context is missing, the annotator may have to rely on their own world knowledge and guess.
    • In some cases, the annotator may have had access to the message's context, such as an image. This additional information is not included in the corpus.
  • Agreement, enjoyment, and satisfaction are positive. Disagreement, sadness, and disappointment are negative.
  • Showing interest in a topic or a product is counted as positive.
    • In this sense, a question about a particular product can have a positive sentiment value if it shows interest in the product.
  • Saying that another product or service is better is counted as negative.
  • General information and news titles tend to be counted as neutral.
  • For word tokenization annotation methodology, please refer to word-tokenization/README.md.

Corpus file structure

  • All files are UTF-8 encoded plaintext.
  • One message per line.
  • A newline character in the original message is replaced with a space.
  • q.txt: Questions (575 messages).
  • neg.txt: Messages with negative sentiment (6,823 messages).
  • neu.txt: Messages with neutral sentiment (14,561 messages).
  • pos.txt: Messages with positive sentiment (4,778 messages).
  • huggingface directory contains an archive file meant to be fetched by Hugging Face Datasets.
  • kaggle-competition/ directory contains the legacy dataset in Kaggle competition format:
    • train.txt: Messages for training (24,066 messages).
    • train_label.txt: Labels for training. Each line is the label corresponding to the same line in train.txt.
    • test.txt: Messages for testing (2,674 messages).
    • test_label.txt: Labels for testing. Each line is the label corresponding to the same line in test.txt.
    • test_majority.csv: Sample submission in Kaggle format. Predicts the neu class for every message.
    • test_solution.csv: Test solution in Kaggle format.
    • Sample code for data exploration, training, and prediction is provided.
  • word-tokenization directory contains wisesight-160 and wisesight-1000 datasets, which are samples from the full corpus in a tokenized form.
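
Given the layout above, the files can be read with the Python standard library alone. The following is a minimal sketch under stated assumptions: the helper names (`read_lines`, `load_by_class`, `load_parallel`, `label_shares`) are hypothetical, and the short label strings (`pos`, `neu`, `neg`, `q`) are taken from the file names rather than from a documented label scheme.

```python
from collections import Counter
from pathlib import Path

# File name -> label. The label strings are an assumption
# derived from the file names listed above.
LABEL_FILES = {"pos.txt": "pos", "neu.txt": "neu", "neg.txt": "neg", "q.txt": "q"}


def read_lines(path):
    """Read a UTF-8 text file and return one string per line."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_by_class(corpus_dir):
    """Load (text, label) pairs from the four per-class files."""
    corpus_dir = Path(corpus_dir)
    pairs = []
    for filename, label in LABEL_FILES.items():
        for text in read_lines(corpus_dir / filename):
            pairs.append((text, label))
    return pairs


def load_parallel(text_path, label_path):
    """Pair line i of a text file with line i of its label file."""
    texts = read_lines(text_path)
    labels = [label.strip() for label in read_lines(label_path)]
    if len(texts) != len(labels):
        raise ValueError(
            f"line count mismatch: {len(texts)} texts, {len(labels)} labels"
        )
    return list(zip(texts, labels))


def label_shares(pairs):
    """Return the fraction of messages carrying each label."""
    counts = Counter(label for _, label in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

For example, `load_by_class(".")` run from the corpus root would return all labeled messages, and `load_parallel("kaggle-competition/train.txt", "kaggle-competition/train_label.txt")` would return the training pairs. Taking the per-file counts above at face value, the neu share is 14,561 / 26,737, or about 54.5%, which is the accuracy an always-neu predictor would reach on the full corpus.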

Copyright and disclaimer

This dataset contains social media text extracted from publicly accessible sources on the internet. The selection, organization, curation, and transformation of this dataset are original works that were previously copyrighted. However, the copyright holder has waived all rights to this dataset and dedicated it to the public domain under the Creative Commons Zero v1.0 Universal Public Domain Dedication.

Any trademarks or trade names appearing in the messages belong to their respective owners.

Wisesight (Thailand) Co., Ltd. has assisted in the collection and sentiment labeling of this dataset, but does not necessarily endorse the labels assigned by human annotators. These annotations are for research purposes only and do not represent the professional work Wisesight performs for its clients.

Please note that human annotators may not personally agree or disagree with the messages they label. Additionally, the labels assigned do not necessarily reflect their personal opinions on the content.

You are free to use this dataset for any purpose, without any restrictions.

Citation

Please cite the following if you make use of the dataset:

Suriyawongkul, Arthit, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. “PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label”. Zenodo, 7 November 2024. https://doi.org/10.5281/zenodo.3457446.

BibTeX:

@misc{Suriyawongkul_PyThaiNLP_Wisesight_Sentiment_Corpus_2020,
  author = {Suriyawongkul, Arthit and Chuangsuwanich, Ekapol and Chormai, Pattarawat and Chantarapratin, Nitchakarn and Prasertsom, Ponrawee and Sawatphol, Jitkapat and Yamada, Nozomi and Rutherford, Attapol and Polpanumas, Charin and Udomcharoenchaikit, Can},
  doi = {10.5281/zenodo.3457446},
  license = {CC0-1.0},
  month = nov,
  publisher = {Zenodo},
  title = {{PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label}},
  url = {https://doi.org/10.5281/zenodo.3457446},
  version = {v1.1},
  year = 2024
}

Acknowledgement

We would like to thank:

Additional resources

Owner

  • Name: PyThaiNLP
  • Login: PyThaiNLP
  • Kind: organization
  • Location: Thailand

We build Thai NLP.

Citation (CITATION.cff)

cff-version: "1.2.0"
type: dataset
message: |
  If you use this software, please cite it as below:

  Suriyawongkul, Arthit, Ekapol Chuangsuwanich, Pattarawat Chormai, Nitchakarn Chantarapratin, Ponrawee Prasertsom, Jitkapat Sawatphol, Nozomi Yamada, Attapol Rutherford, Charin Polpanumas, and Can Udomcharoenchaikit. “PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label”. Zenodo, 7 November 2024. https://doi.org/10.5281/zenodo.3457446.
authors:
- family-names: "Suriyawongkul"
  given-names: "Arthit"
  orcid: "https://orcid.org/0000-0002-9698-1899"
- family-names: "Chuangsuwanich"
  given-names: "Ekapol"
  orcid: "https://orcid.org/0000-0001-6104-4857"
- family-names: "Chormai"
  given-names: "Pattarawat"
  orcid: "https://orcid.org/0000-0002-7582-4667"
- family-names: "Chantarapratin"
  given-names: "Nitchakarn"
- family-names: "Prasertsom"
  given-names: "Ponrawee"
- family-names: "Sawatphol"
  given-names: "Jitkapat"
- family-names: "Yamada"
  given-names: "Nozomi"
- family-names: "Rutherford"
  given-names: "Attapol"
  orcid: "https://orcid.org/0000-0003-2270-6082"
- family-names: "Polpanumas"
  given-names: "Charin"
  orcid: "https://orcid.org/0000-0001-7822-4600"
- family-names: "Udomcharoenchaikit"
  given-names: "Can"
  orcid: "https://orcid.org/0000-0002-7090-0536"
title: "PyThaiNLP/Wisesight Sentiment Corpus with Word Tokenization Label"
version: v1.1
license: CC0-1.0
identifiers:
  - description: This is the collection of archived snapshots of all versions of the dataset
    type: doi
    value: "10.5281/zenodo.3457446"
  - description: This is the archived snapshot of version 1.0 of the dataset
    type: doi
    value: "10.5281/zenodo.3457447"
  - description: This is the archived snapshot of version 1.1 of the dataset
    type: doi
    value: "10.5281/zenodo.14052576"
repository: "https://github.com/PyThaiNLP/wisesight-sentiment/"
date-released: 2024-11-07

GitHub Events

Total
  • Release event: 2
  • Watch event: 8
  • Delete event: 1
  • Push event: 22
  • Fork event: 1
  • Create event: 2
Last Year
  • Release event: 2
  • Watch event: 8
  • Delete event: 1
  • Push event: 22
  • Fork event: 1
  • Create event: 2

Issues and Pull Requests

All Time
  • Total issues: 1
  • Total pull requests: 6
  • Average time to close issues: 7 days
  • Average time to close pull requests: 4 days
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.5
  • Merged pull requests: 6
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • c4n (1)
Pull Request Authors
  • p16i (3)
  • ekapolc (1)
  • cstorm125 (1)
  • c4n (1)