https://github.com/atomashevic/rmadoc

R package to easily download and combine MADOC dataset files


Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.7%) to scientific vocabulary
Last synced: 7 months ago

Repository


Basic Info
  • Host: GitHub
  • Owner: atomashevic
  • Language: R
  • Default Branch: main
  • Size: 21.5 KB
Statistics
  • Stars: 2
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 10 months ago
Metadata Files
Readme

README.md

rMADOC

R package to easily download and combine parts of the MADOC dataset from Zenodo. The MADOC dataset contains social media posts from multiple platforms (Reddit, Voat, Bluesky, and Koo), which makes it suitable for studying cross-platform content and community dynamics.

Installation

You can install the package directly from GitHub:

```r
install.packages("devtools")
devtools::install_github("atomashevic/rMADOC")
```

Usage

List Available Data

To see what data is available in the MADOC dataset:

```r
library(rMADOC)
list_available_data()
```

This will show you all available platforms and communities, along with their file sizes.

Download Individual Files

There are two ways to download files:

  1. Save to disk and load later (recommended, especially for Reddit data):

     ```r
     # Download and save files
     filename <- download_file("reddit", "gaming", output_dir = "data")

     # Load the downloaded file later when needed
     df <- load_local(filename)
     ```

  2. Load directly into memory (suitable for smaller files like Voat, Bluesky, or Koo):

     ```r
     # Load Bluesky data directly as a data.frame
     bluesky_df <- download_file("bluesky", as_dataframe = TRUE)

     # Load Voat gaming data directly (small file)
     voat_gaming_df <- download_file("voat", "gaming", as_dataframe = TRUE)
     ```

Note: Direct memory loading (as_dataframe = TRUE) is not recommended for Reddit data because some communities have very large files (up to 9.1 GB). For Reddit data, always download to disk first and then load the files as needed.
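Because the package lists arrow as its parquet dependency, one way to look at a large file that is already on disk without reading all of it is to open it lazily. The following is a minimal sketch under that assumption: `peek_parquet` is a hypothetical helper, not part of rMADOC, and the column names it reports depend on the actual MADOC files.

```r
# Hypothetical helper (not part of rMADOC): inspect a downloaded parquet file
# without loading every row into memory. Requires the arrow and dplyr packages.
peek_parquet <- function(path, n = 1000L) {
  stopifnot(file.exists(path))
  # open_dataset() does not read the row data up front
  ds <- arrow::open_dataset(path, format = "parquet")
  list(
    columns = names(ds),                     # column names from the file schema
    n_rows  = nrow(ds),                      # total number of rows in the file
    preview = dplyr::collect(head(ds, n))    # only the first n rows are materialised
  )
}

# Usage, assuming `filename` is the path of a file saved to disk earlier:
# info <- peek_parquet(filename, n = 500L)
# info$columns
```

The preview is an ordinary data frame, so it can be used to decide which columns or rows are worth loading in full.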

Download and Combine Community Pairs

You can download and combine Reddit-Voat pairs for the same community:

  1. Save to disk and load later (recommended):

     ```r
     # Download both files
     files <- download_community_pair("gaming", output_dir = "data")

     # Load them later when needed
     reddit_df <- load_local(files$reddit_file)
     voat_df <- load_local(files$voat_file)
     ```

  2. Load directly into memory (not recommended for most communities due to Reddit file sizes):

     ```r
     # Only use this for communities with smaller file sizes
     combined_df <- download_community_pair("gaming", as_dataframe = TRUE)
     ```
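If you take the save-to-disk route but still want a single data frame, the two loaded frames can be stacked by hand. This is a rough sketch, not the package's own combining logic: `combine_pair` and the `origin` column name are hypothetical, and it assumes neither data frame already has a column with that name.

```r
# Hypothetical helper (not part of rMADOC): stack two data frames and record
# which platform each row came from. Requires the dplyr package.
combine_pair <- function(reddit_df, voat_df, id_col = "origin") {
  # The argument names ("reddit", "voat") become the values of the id column;
  # columns present in only one input are filled with NA in the other.
  dplyr::bind_rows(reddit = reddit_df, voat = voat_df, .id = id_col)
}

# Usage, assuming both data frames were loaded from disk beforehand:
# combined_df <- combine_pair(reddit_df, voat_df)
# table(combined_df$origin)
```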

Available Data

Standalone Platforms

  • Bluesky (~449.3 MB)
  • Koo (~774.3 MB)

Communities (available on both Reddit and Voat)

  • CringeAnarchy (Reddit: 951.7 MB, Voat: 476.2 KB)
  • fatpeoplehate (Reddit: 214.5 MB, Voat: 61.9 MB)
  • funny (Reddit: 9.1 GB, Voat: 18.8 MB)
  • gaming (Reddit: 7.2 GB, Voat: 12.7 MB)
  • gifs (Reddit: 3.1 GB, Voat: 2.8 MB)
  • greatawakening (Reddit: 179.3 MB, Voat: 76.1 MB)
  • KotakuInAction (Reddit: 1.5 GB, Voat: 1.8 MB)
  • MensRights (Reddit: 797.8 MB, Voat: 792.6 KB)
  • milliondollarextreme (Reddit: 170.2 MB, Voat: 3.5 MB)
  • pics (Reddit: 8.3 GB, Voat: 5.3 MB)
  • technology (Reddit: 2.5 GB, Voat: 15.2 MB)
  • videos (Reddit: 6.5 GB, Voat: 102.4 KB)

As you can see, Reddit files are significantly larger than their Voat counterparts. Use list_available_data() to see this information in a formatted table.
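To make the size comparison concrete, the sketch below converts the Reddit sizes listed above into bytes and picks the communities that fall under a chosen limit. The helper `parse_size`, the 1 GB limit, and the use of binary multiples (1 KB = 1024 bytes) are assumptions for illustration; the package may count sizes differently.

```r
# Reddit file sizes copied from the list above.
reddit_sizes <- c(
  CringeAnarchy = "951.7 MB", fatpeoplehate = "214.5 MB", funny = "9.1 GB",
  gaming = "7.2 GB", gifs = "3.1 GB", greatawakening = "179.3 MB",
  KotakuInAction = "1.5 GB", MensRights = "797.8 MB",
  milliondollarextreme = "170.2 MB", pics = "8.3 GB",
  technology = "2.5 GB", videos = "6.5 GB"
)

# Hypothetical helper: turn strings such as "9.1 GB" into a number of bytes.
# Binary multiples are an assumption, not something the README states.
parse_size <- function(x) {
  units <- c(KB = 1024, MB = 1024^2, GB = 1024^3)
  parts <- strsplit(trimws(x), "\\s+")
  bytes <- vapply(parts, function(p) as.numeric(p[1]) * units[[p[2]]],
                  numeric(1), USE.NAMES = FALSE)
  stats::setNames(bytes, names(x))
}

size_bytes <- parse_size(reddit_sizes)
limit <- 1024^3  # illustrative 1 GB threshold; adjust to your available RAM
in_memory_ok <- names(size_bytes)[size_bytes < limit]
print(in_memory_ok)
```

Communities above the limit are the ones that should be downloaded to disk first.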

Features

  • Option to load data directly into memory or save to disk
  • Helper function to load saved parquet files
  • Progress bars with download speed
  • Support for both individual platform downloads and Reddit-Voat community pairs
  • Automatic file size verification
  • Human-readable file size formatting
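The last two features can be illustrated in base R. This is a sketch of the general idea only, not the package's internal code: `size_matches` and `format_size` are hypothetical names, and the 1024-based units are an assumption.

```r
# Hypothetical check: does the file on disk have the expected number of bytes?
size_matches <- function(path, expected_bytes) {
  actual <- file.size(path)  # NA when the file does not exist
  !is.na(actual) && actual == expected_bytes
}

# Hypothetical formatter: express a byte count in the largest fitting unit.
format_size <- function(bytes) {
  units <- c("B", "KB", "MB", "GB", "TB")
  i <- 1
  while (bytes >= 1024 && i < length(units)) {
    bytes <- bytes / 1024
    i <- i + 1
  }
  sprintf("%.1f %s", bytes, units[i])
}

# Usage with a small temporary file of 2048 zero bytes:
tmp <- tempfile()
writeBin(raw(2048), tmp)
size_matches(tmp, 2048)
format_size(file.size(tmp))
```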

Dependencies

  • arrow (for parquet file support)
  • dplyr
  • curl
  • utils

Citation

If you use this package or the MADOC dataset in your research, please cite:

Mitrovic Dankulov, M., Tomašević, A., Maletic, S., Andjelkovic, M., Vranic, A., Cvetkovic, D., Stupovski, B., Vudragovic, D., Major, S., & Bogojević, A. (2025). MADOC: Multi-Platform Aggregated Dataset of Online Communities (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14637314

Or BibTeX:

```
@MISC{mitrovicDankulov2025madoc,
  title     = "MADOC: Multi-Platform Aggregated Dataset of Online Communities",
  author    = "Mitrovic Dankulov, Marija and Tomašević, Aleksandar and Maletic, Slobodan and Andjelkovic, Miroslav and Vranic, Ana and Cvetkovic, Darja and Stupovski, Boris and Vudragovic, Dusan and Major, Sara and Bogojević, Aleksandar",
  publisher = "Zenodo",
  abstract  = "The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing a unified, standardized dataset for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four distinct platforms: Bluesky, Koo, Reddit, and Voat, spanning from 2012 to 2024. The dataset includes 18.9 million posts, 236 million comments, and data from 23.1 million unique users across all platforms, with a particular focus on understanding community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. By providing standardized data structures and FAIR-compliant access through Zenodo, MADOC enables researchers to conduct comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. The dataset's unique value lies in its cross-platform scope, standardized structure, and rich metadata, making it particularly suitable for studying societal phenomena such as community formation, toxic behavior propagation, and user migration patterns in response to platform moderation policies.",
  year      = 2025,
  url       = "https://zenodo.org/records/14637314",
  keywords  = "Social Media; Online Social Networking; Social Network Analysis",
  doi       = "10.5281/ZENODO.14637314"
}
```

License

MIT

Owner

  • Name: Aleksandar Tomašević
  • Login: atomashevic
  • Kind: user
  • Location: Novi Sad, Serbia
  • Company: University of Novi Sad

GitHub Events

Total
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 4
  • Public event: 1
  • Fork event: 1
Last Year
  • Issues event: 2
  • Watch event: 1
  • Issue comment event: 1
  • Push event: 4
  • Public event: 1
  • Fork event: 1

Committers

Last synced: 10 months ago

All Time
  • Total Commits: 8
  • Total Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 8
  • Committers: 1
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Aleksandar Tomašević 3****c 8

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 1
  • Total pull requests: 0
  • Average time to close issues: 3 days
  • Average time to close pull requests: N/A
  • Total issue authors: 1
  • Total pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 1
  • Pull requests: 0
  • Average time to close issues: 3 days
  • Average time to close pull requests: N/A
  • Issue authors: 1
  • Pull request authors: 0
  • Average comments per issue: 1.0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • stefanolocci (1)