german_reddit_stance_dataset

A dataset of stance-annotated Reddit comments

https://github.com/felixwoestmann/german_reddit_stance_dataset

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.5%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: felixwoestmann
  • Default Branch: main
  • Size: 188 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created over 1 year ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

German Reddit stance dataset

This repository contains a dataset of Reddit comments from the German-speaking subreddit r/de, focusing on discussions about railway worker strikes in Germany from November to December 2023. The dataset was built to evaluate how well LLMs such as GPT-4 or Claude 3.5 Sonnet classify stance in German Reddit comments.

Dataset Overview

  • Source: Comments extracted from submissions to the r/de subreddit
  • Time period: November 9, 2023 to December 19, 2023
  • Total comments: 1150
  • Number of submissions: 7
  • Language: German

Dataset Structure

The dataset is organized as a series of JSON files, each representing a single submission and its associated comments. The file names correspond to the unique submission IDs.

Submission Object Structure

| Field | Description | Example |
|-------|-------------|---------|
| id | Unique identifier | 17v3qsh |
| title | Submission title | Deutsche Bahn: Lokführergewerkschaft GDL kündigt Streik an |
| score | Submission score | 349 |
| created | Creation date and time | 2023-11-14 15:45:00 |
| url | URL to the posted news article | https://www.tagesschau.de/eilmeldung/eilmeldung-7516.html |
| author | Author username | u/Digag |
| branches | List of top-level comment objects | [Comment objects] |

Comment Object Structure

| Field | Description | Example |
|-------|-------------|---------|
| id | Unique identifier | k97ud84 |
| author | Author username | u/Noodleholz |
| body | Comment text | Da bei uns die Bahn abwechselnd wegen... |
| created | Creation date and time | 2023-11-14 15:48 |
| score | Comment score | 123 |
| parentid | ID of parent (submission or comment) | t317v3qsh |
| linkid | ID of submission | t317v3qsh |
| branches | Replies to the comment | [Comment objects] or [] |
| stanceOnSubmission | Stance towards submission | positive |
| stanceOnParent | Stance towards parent comment | negative |
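Because replies are nested inside each comment's `branches` list, reading the dataset usually means walking a tree. The following is a minimal sketch of a depth-first traversal, not code from this repository. The helper names `iter_comments` and `load_submission`, the UTF-8 encoding, and the sample IDs `c1` to `c3` are assumptions made for illustration; only the `branches` field comes from the tables above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_comments(branches: list[dict], depth: int = 0) -> Iterator[tuple[int, dict]]:
    """Yield (depth, comment) pairs in depth-first order."""
    for comment in branches:
        yield depth, comment
        # Replies live in the comment's own `branches` list (possibly empty).
        yield from iter_comments(comment.get("branches", []), depth + 1)


def load_submission(path: Path) -> dict:
    """Read one submission file; UTF-8 encoding is an assumption."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


# Small in-memory stand-in for a submission file (comment IDs are made up).
example = {
    "id": "17v3qsh",
    "branches": [
        {"id": "c1", "body": "top-level", "branches": [
            {"id": "c2", "body": "reply", "branches": []},
        ]},
        {"id": "c3", "body": "another top-level", "branches": []},
    ],
}

flat = [(depth, comment["id"]) for depth, comment in iter_comments(example["branches"])]
print(flat)
```

With a real file, `load_submission(Path(...))["branches"]` would take the place of `example["branches"]`. The depth value makes it easy to separate top-level comments (depth 0) from replies.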

Data Collection and Processing

  1. Raw data acquired from the ArcticShift project, which mirrors Reddit data.
  2. Submissions filtered using keywords related to railway strikes.
  3. Comments for selected submissions extracted recursively.
  4. Deleted comments and their descendants removed to respect user privacy.
  5. Data manually annotated for stances using a custom web application.
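Step 4 can be pictured as pruning subtrees from the comment tree. The sketch below is one possible way to do this and is not the author's actual pipeline. It assumes that deleted comments are recognizable by a body of `[deleted]` or `[removed]`, which the README does not confirm, and the function name `prune_deleted` is hypothetical.

```python
# Assumption: these body values mark deleted comments; the README does not say.
DELETED_MARKERS = {"[deleted]", "[removed]"}


def prune_deleted(branches: list[dict]) -> list[dict]:
    """Return a copy of the tree without deleted comments and their descendants."""
    kept = []
    for comment in branches:
        if comment.get("body") in DELETED_MARKERS:
            # Skipping the comment here also drops its entire reply subtree.
            continue
        pruned = dict(comment)  # shallow copy so the input stays untouched
        pruned["branches"] = prune_deleted(comment.get("branches", []))
        kept.append(pruned)
    return kept


# Illustrative tree: "b" is deleted, so its reply "c" must disappear as well.
tree = [
    {"id": "a", "body": "kept", "branches": [
        {"id": "b", "body": "[deleted]", "branches": [
            {"id": "c", "body": "reply to a deleted comment", "branches": []},
        ]},
        {"id": "d", "body": "also kept", "branches": []},
    ]},
]

cleaned = prune_deleted(tree)
print([child["id"] for child in cleaned[0]["branches"]])
```

Dropping the whole subtree, rather than only the deleted node, means no reply is left without the parent it was responding to.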

Annotation Schema

Comments are annotated with two types of stances:

  1. Stance towards submission: {positive, negative, neither, null}

    • Interpreted as: In Favor, Against, Neither
    • null indicates absence of annotation
  2. Stance towards parent comment: {positive, negative, neither, null}

    • Interpreted as: Agrees, Disagrees, Neither
    • null indicates absence of annotation or top-level comment
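The schema above can be turned into a small lookup when analysing the annotations. This is a hedged sketch: the field names `stanceOnSubmission` and `stanceOnParent` and the label values come from this README, while the helper names `interpret` and `stance_counts` and the sample comments are made up for illustration. JSON `null` is assumed to arrive as Python `None`, which is how the standard `json` module parses it.

```python
from collections import Counter

# Interpretation of the raw labels, as described in the annotation schema.
SUBMISSION_LABELS = {"positive": "In Favor", "negative": "Against", "neither": "Neither"}
PARENT_LABELS = {"positive": "Agrees", "negative": "Disagrees", "neither": "Neither"}


def interpret(value, labels):
    """Map a raw label to its interpretation; None (JSON null) stays None."""
    if value is None:
        return None
    return labels[value]


def stance_counts(comments, field, labels):
    """Count interpreted stances over a flat list of comments, skipping nulls."""
    counts = Counter()
    for comment in comments:
        label = interpret(comment.get(field), labels)
        if label is not None:
            counts[label] += 1
    return counts


# Illustrative comments; the first is top-level, so its parent stance is null.
comments = [
    {"stanceOnSubmission": "positive", "stanceOnParent": None},
    {"stanceOnSubmission": "negative", "stanceOnParent": "negative"},
    {"stanceOnSubmission": "positive", "stanceOnParent": "neither"},
    {"stanceOnSubmission": None, "stanceOnParent": "positive"},
]

submission_counts = stance_counts(comments, "stanceOnSubmission", SUBMISSION_LABELS)
parent_counts = stance_counts(comments, "stanceOnParent", PARENT_LABELS)
print(dict(submission_counts))
print(dict(parent_counts))
```

Keeping `null` separate from `neither` matters here: `neither` is an annotated judgment, while `null` means no annotation exists (or, for the parent stance, that the comment is top-level).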

Ethical Considerations

  • User names have been pseudonymized to protect privacy.
  • Deleted comments and their descendants have been removed from the dataset.
  • The r/de subreddit's rules were respected in the data collection process.

Limitations

  • All annotations come from a single annotator, so the labels may reflect that annotator's subjective judgment.

Usage and Citation

To cite this dataset, use the metadata in the repository's CITATION.cff file (DOI: 10.5281/zenodo.13819863).

Owner

  • Name: Felix Wöstmann
  • Login: felixwoestmann
  • Kind: user
  • Location: somewhere in europe
  • Company: hubblr

Computational Social Scientist

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
- family-names: "Wöstmann"
  given-names: "Felix"
  orcid: "https://orcid.org/0009-0006-7577-4993"
title: "German Reddit stance dataset"
version: 1.0.0
doi: 10.5281/zenodo.13819863
date-released: 2024-09-20
url: "https://github.com/felixwoestmann/stance_annotator_app"
