german_reddit_stance_dataset
A dataset of stance-annotated Reddit comments
https://github.com/felixwoestmann/german_reddit_stance_dataset
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (8.5%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: felixwoestmann
- Default Branch: main
- Size: 188 KB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
German Reddit stance dataset
This repository contains a dataset of Reddit comments from the German-speaking subreddit r/de, focusing on discussions about railway worker strikes in Germany from November to December 2023. The dataset was built to evaluate how well LLMs such as GPT-4 or Claude 3.5 Sonnet classify the stance of German Reddit comments.
Dataset Overview
- Source: Comments extracted from submissions to the r/de subreddit
- Time period: November 9, 2023 to December 19, 2023
- Total comments: 1150
- Number of submissions: 7
- Language: German
Dataset Structure
The dataset is organized as a series of JSON files, each representing a single submission and its associated comments. The file names correspond to the unique submission IDs.
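The layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the JSON files sit together in one local directory and are UTF-8 encoded; the function name `load_submissions` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def load_submissions(data_dir):
    """Return a dict mapping submission ID (the file stem) to the parsed JSON object."""
    submissions = {}
    # Each file is named after its submission ID, e.g. 17v3qsh.json.
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            submissions[path.stem] = json.load(handle)
    return submissions
```

Calling `load_submissions("path/to/files")` with the directory that holds the JSON files should then yield one entry per submission, keyed by its ID.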
Submission Object Structure
| Field | Description | Example |
|-------|-------------|---------|
| id | Unique identifier | 17v3qsh |
| title | Submission title | Deutsche Bahn: Lokführergewerkschaft GDL kündigt Streik an |
| score | Submission score | 349 |
| created | Creation date and time | 2023-11-14 15:45:00 |
| url | URL to the posted news article | https://www.tagesschau.de/eilmeldung/eilmeldung-7516.html |
| author | Author username | u/Digag |
| branches | List of top-level comment objects | [Comment objects] |
Comment Object Structure
| Field | Description | Example |
|-------|-------------|---------|
| id | Unique identifier | k97ud84 |
| author | Author username | u/Noodleholz |
| body | Comment text | Da bei uns die Bahn abwechselnd wegen... |
| created | Creation date and time | 2023-11-14 15:48 |
| score | Comment score | 123 |
| parentid | ID of parent (submission or comment) | t317v3qsh |
| linkid | ID of submission | t317v3qsh |
| branches | Replies to the comment | [Comment objects] or [] |
| stanceOnSubmission | Stance towards submission | positive |
| stanceOnParent | Stance towards parent comment | negative |
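Because replies are nested inside `branches`, most analyses start by flattening the comment tree. The sketch below walks a submission depth-first; the helper name `iter_comments` and the sample object are illustrative (hand-made, not real data), and it relies only on the `branches`, `id`, and `stanceOnSubmission` fields listed above.

```python
def iter_comments(node, depth=0):
    """Yield (depth, comment) for every comment below a submission or comment."""
    for child in node.get("branches", []):
        yield depth, child
        # Recurse into the replies of this comment.
        yield from iter_comments(child, depth + 1)


# Hand-made example shaped like the structures above (not real data).
submission = {
    "id": "abc123",
    "title": "Example submission",
    "branches": [
        {
            "id": "c1",
            "body": "Top-level comment",
            "stanceOnSubmission": "positive",
            "stanceOnParent": None,
            "branches": [
                {
                    "id": "c2",
                    "body": "Reply",
                    "stanceOnSubmission": "neither",
                    "stanceOnParent": "negative",
                    "branches": [],
                }
            ],
        }
    ],
}

# Print each comment ID indented by its depth in the thread.
for depth, comment in iter_comments(submission):
    print("  " * depth + comment["id"], comment["stanceOnSubmission"])
```

The same generator works on a comment object, since submissions and comments both carry a `branches` list.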
Data Collection and Processing
- Raw data acquired from the ArcticShift project, which mirrors Reddit data.
- Submissions filtered using keywords related to railway strikes.
- Comments for selected submissions extracted recursively.
- Deleted comments and their descendants removed to respect user privacy.
- Data manually annotated for stances using a custom web application.
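The removal of deleted comments together with their descendants can be sketched as a recursive prune. This is not the author's actual pipeline: the `[deleted]` marker in `body` is an assumption about how deleted comments appear in the raw data, and `prune_deleted` is a hypothetical name.

```python
# Assumption: deleted comments carry this placeholder as their body text.
DELETED_MARKERS = {"[deleted]"}


def prune_deleted(node):
    """Return a copy of node without deleted comments and their descendants."""
    kept = []
    for child in node.get("branches", []):
        if child.get("body") in DELETED_MARKERS:
            continue  # skipping the child also drops every reply below it
        kept.append(prune_deleted(child))
    return {**node, "branches": kept}


# Hand-made example (not real data): comment "b" disappears with its deleted parent "a".
raw = {
    "id": "s1",
    "branches": [
        {"id": "a", "body": "[deleted]", "branches": [
            {"id": "b", "body": "Antwort", "branches": []},
        ]},
        {"id": "c", "body": "Hallo", "branches": []},
    ],
}
clean = prune_deleted(raw)
print([comment["id"] for comment in clean["branches"]])
```

Returning a new object instead of mutating the input keeps the raw tree available for comparison.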
Annotation Schema
Comments are annotated with two types of stances:
Stance towards submission: {positive, negative, neither, null}
- Interpreted as: In Favor, Against, Neither
- null indicates absence of annotation
Stance towards parent comment: {positive, negative, neither, null}
- Interpreted as: Agrees, Disagrees, Neither
- null indicates absence of annotation or top-level comment
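To make the schema concrete, the sketch below maps the stored values to their interpretations and counts them for one submission. It assumes JSON `null` is loaded as Python `None`; the names `SUBMISSION_LABELS`, `PARENT_LABELS`, `walk`, and `stance_counts` are hypothetical, and the sample object is hand-made.

```python
from collections import Counter

# Interpretations taken from the annotation schema above.
SUBMISSION_LABELS = {"positive": "In Favor", "negative": "Against", "neither": "Neither"}
PARENT_LABELS = {"positive": "Agrees", "negative": "Disagrees", "neither": "Neither"}


def walk(node):
    """Yield every comment nested below a submission or comment."""
    for child in node.get("branches", []):
        yield child
        yield from walk(child)


def stance_counts(submission, field, labels):
    """Count interpreted labels for one stance field, skipping null (None) values."""
    counts = Counter()
    for comment in walk(submission):
        value = comment.get(field)
        if value is not None:
            counts[labels[value]] += 1  # raises KeyError on an unexpected value
    return counts


# Hand-made example (not real data).
sample = {
    "id": "s1",
    "branches": [
        {"id": "c1", "stanceOnSubmission": "positive", "stanceOnParent": None, "branches": [
            {"id": "c2", "stanceOnSubmission": "neither", "stanceOnParent": "negative", "branches": []},
        ]},
    ],
}
print(stance_counts(sample, "stanceOnSubmission", SUBMISSION_LABELS))
print(stance_counts(sample, "stanceOnParent", PARENT_LABELS))
```

Top-level comments have no parent comment, so they contribute nothing to the `stanceOnParent` counts.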
Ethical Considerations
- User names have been pseudonymized to protect privacy.
- Deleted comments and their descendants have been removed from the dataset.
- The r/de subreddit's rules were respected in the data collection process.
Limitations
- Single annotator, which may introduce subjectivity.
Usage and Citation
To cite this dataset, use the metadata in the repository's CITATION.cff file (DOI: 10.5281/zenodo.13819863).
Owner
- Name: Felix Wöstmann
- Login: felixwoestmann
- Kind: user
- Location: somewhere in europe
- Company: hubblr
- Website: https://felixwoestmann.me
- Repositories: 23
- Profile: https://github.com/felixwoestmann
Computational Social Scientist
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this dataset, please cite it as below."
authors:
  - family-names: "Wöstmann"
    given-names: "Felix"
    orcid: "https://orcid.org/0009-0006-7577-4993"
title: "German Reddit stance dataset"
version: 1.0.0
doi: 10.5281/zenodo.13819863
date-released: 2024-09-20
url: "https://github.com/felixwoestmann/stance_annotator_app"