https://github.com/atomashevic/pymadoc

Python package to download and combine parts of MADOC dataset

Science Score: 49.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 3 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

bluesky data-science dataset koo pypi-package python reddit voat
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 3
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed 8 months ago
Metadata Files
Readme

README.md

pyMADOC

A Python package to download and combine parts of the MADOC dataset from Zenodo. The MADOC dataset contains social media posts from four platforms (Reddit, Voat, Bluesky, and Koo), which supports the study of cross-platform content and community dynamics.

Features

  • Easy download of platform-specific data files
  • Automatic pairing of Reddit-Voat community data
  • Both Python API and Command Line Interface
  • Support for direct DataFrame loading
  • Efficient parquet file format

Installation

`pip install pymadoc`

Usage

As a Python Package

```python
from pymadoc import list_available_data, download_file, download_community_pair

# List available platforms and communities
data_info = list_available_data()
print(data_info["platforms"])    # ['reddit', 'voat', 'bluesky', 'koo']
print(data_info["communities"])  # ['CringeAnarchy', 'fatpeoplehate', ...]

# Download a specific file
# For Reddit/Voat, specify both platform and community
file_path = download_file("reddit", community="funny", output_dir="data")

# For Bluesky/Koo, specify only platform
file_path = download_file("bluesky", output_dir="data")

# Load directly as DataFrame
df = download_file("reddit", community="funny", as_dataframe=True)

# Download and combine Reddit-Voat community pair
# As files
reddit_file, voat_file = download_community_pair("funny", output_dir="data")

# As combined DataFrame
combined_df = download_community_pair("funny", as_dataframe=True)
```
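When several community pairs are needed, it can help to loop over them and record failures instead of stopping at the first error. The sketch below is one possible way to do that, not part of the package: `download_pairs` and `fake_download_pair` are hypothetical names, and the downloader is passed in as an argument so the example runs without network access. In real use you would pass the package's community-pair download function instead of the stand-in.

```python
from typing import Callable, Dict, Iterable, Tuple


def download_pairs(
    communities: Iterable[str],
    download_pair: Callable[..., object],
    output_dir: str = "data",
) -> Tuple[Dict[str, object], Dict[str, str]]:
    """Call download_pair once per community and collect results and errors."""
    results: Dict[str, object] = {}
    errors: Dict[str, str] = {}
    for community in communities:
        try:
            results[community] = download_pair(community, output_dir=output_dir)
        except Exception as exc:
            # Keep going so one failed community does not abort the batch.
            errors[community] = str(exc)
    return results, errors


def fake_download_pair(community, output_dir="data"):
    """Hypothetical stand-in for the real downloader; returns placeholder labels."""
    return (f"reddit/{community}", f"voat/{community}")


done, failed = download_pairs(["funny", "gaming"], fake_download_pair)
print("downloaded:", sorted(done))
print("failed:", failed)
```

Returning the errors as a dictionary keeps the decision about retrying or reporting with the caller.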

Command Line Interface

List available platforms and communities: `pymadoc list`

Download a specific file:

```bash
# Reddit/Voat (requires community)
pymadoc download reddit --community funny --output-dir data

# Bluesky/Koo
pymadoc download bluesky --output-dir data
```

Download Reddit-Voat community pair: `pymadoc pair funny --output-dir data`
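For scripted use, the download commands above can be assembled as argument lists. The following is a minimal sketch rather than anything the package provides: `build_download_command` is a hypothetical helper, it only uses the subcommand and flags shown above, and rejecting a community for Bluesky/Koo is a choice made for this sketch.

```python
import shlex
from typing import List, Optional

COMMUNITY_PLATFORMS = {"reddit", "voat"}  # these need --community
PLATFORM_WIDE = {"bluesky", "koo"}        # these are downloaded as a whole


def build_download_command(
    platform: str,
    community: Optional[str] = None,
    output_dir: str = "data",
) -> List[str]:
    """Build the argv list for a `pymadoc download` call."""
    if platform in COMMUNITY_PLATFORMS:
        if not community:
            raise ValueError(f"{platform} requires a community")
        return ["pymadoc", "download", platform,
                "--community", community, "--output-dir", output_dir]
    if platform in PLATFORM_WIDE:
        if community:
            raise ValueError(f"{platform} does not take a community")
        return ["pymadoc", "download", platform, "--output-dir", output_dir]
    raise ValueError(f"unknown platform: {platform}")


# Print the commands in shell form for inspection.
print(shlex.join(build_download_command("reddit", community="funny")))
print(shlex.join(build_download_command("bluesky")))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` once the package is installed.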

Available Data

Platforms

  • Reddit: Community-specific posts and comments
  • Voat: Community-specific posts and comments
  • Bluesky: Platform-wide posts
  • Koo: Platform-wide posts

Communities (Reddit/Voat only)

  • CringeAnarchy
  • fatpeoplehate
  • funny
  • gaming
  • gifs
  • greatawakening
  • KotakuInAction
  • MensRights
  • milliondollarextreme
  • pics
  • technology
  • videos
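The platform and community lists above can be turned into an explicit list of download targets. This sketch assumes every listed community exists on both Reddit and Voat, which the lists suggest but do not state; `download_targets` and `normalize_community` are hypothetical helpers, and whether the package itself accepts community names case-insensitively is not known.

```python
from typing import List, Optional, Tuple

PAIRED_PLATFORMS = ["reddit", "voat"]
PLATFORM_WIDE = ["bluesky", "koo"]
COMMUNITIES = [
    "CringeAnarchy", "fatpeoplehate", "funny", "gaming", "gifs",
    "greatawakening", "KotakuInAction", "MensRights",
    "milliondollarextreme", "pics", "technology", "videos",
]


def download_targets() -> List[Tuple[str, Optional[str]]]:
    """List (platform, community) targets; community is None for platform-wide data."""
    targets: List[Tuple[str, Optional[str]]] = [
        (platform, community)
        for community in COMMUNITIES
        for platform in PAIRED_PLATFORMS
    ]
    targets.extend((platform, None) for platform in PLATFORM_WIDE)
    return targets


def normalize_community(name: str) -> str:
    """Map a user-typed name to the spelling used in the list, ignoring case."""
    lookup = {community.lower(): community for community in COMMUNITIES}
    try:
        return lookup[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown community: {name}") from None


# 12 communities x 2 platforms + 2 platform-wide entries, under the assumption above.
print(len(download_targets()))
```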

Data Format

All files are stored in parquet format for efficient storage and fast loading. Each file contains the following columns:

  • Platform-specific post/comment IDs
  • Content text
  • Timestamps
  • User information
  • Engagement metrics
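Once files have been downloaded, they can be read back with pandas. The sketch below assumes the downloaded files carry a `.parquet` extension and share a column layout so they can be stacked; both are assumptions, and `find_parquet_files` and `load_parquet_files` are hypothetical helper names. Reading parquet with pandas also needs a parquet engine such as pyarrow.

```python
from pathlib import Path
from typing import Iterable, List


def find_parquet_files(directory: str) -> List[Path]:
    """Return all .parquet files directly inside `directory`, sorted by path."""
    return sorted(Path(directory).glob("*.parquet"))


def load_parquet_files(paths: Iterable[Path]):
    """Read each file with pandas and stack the rows into one DataFrame."""
    import pandas as pd  # imported lazily so the file search works without pandas

    path_list = list(paths)
    if not path_list:
        raise FileNotFoundError("no parquet files to load")
    frames = [pd.read_parquet(path) for path in path_list]
    return pd.concat(frames, ignore_index=True)
```

A typical call would be `load_parquet_files(find_parquet_files("data"))`, followed by `df.columns` to inspect the actual column names.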

Requirements

  • Python 3.6 or higher
  • pandas
  • requests
  • tqdm

Citation

If you use this package or the MADOC dataset in your research, please cite:

Mitrovic Dankulov, M., Tomašević, A., Maletic, S., Andjelkovic, M., Vranic, A., Cvetkovic, D., Stupovski, B., Vudragovic, D., Major, S., & Bogojević, A. (2025). MADOC: Multi-Platform Aggregated Dataset of Online Communities (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14637314

Or BibTeX:

```
@MISC{mitrovicDankulov2025madoc,
  title     = "MADOC: Multi-Platform Aggregated Dataset of Online Communities",
  author    = "Mitrovic Dankulov, Marija and Tomašević, Aleksandar and Maletic, Slobodan and Andjelkovic, Miroslav and Vranic, Ana and Cvetkovic, Darja and Stupovski, Boris and Vudragovic, Dusan and Major, Sara and Bogojević, Aleksandar",
  publisher = "Zenodo",
  abstract  = "The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing a unified, standardized dataset for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four distinct platforms: Bluesky, Koo, Reddit, and Voat, spanning from 2012 to 2024. The dataset includes 18.9 million posts, 236 million comments, and data from 23.1 million unique users across all platforms, with a particular focus on understanding community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. By providing standardized data structures and FAIR-compliant access through Zenodo, MADOC enables researchers to conduct comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. The dataset's unique value lies in its cross-platform scope, standardized structure, and rich metadata, making it particularly suitable for studying societal phenomena such as community formation, toxic behavior propagation, and user migration patterns in response to platform moderation policies.",
  year      = 2025,
  url       = "https://zenodo.org/records/14637314",
  keywords  = "Social Media; Online Social Networking; Social Network Analysis",
  doi       = "10.5281/ZENODO.14637314"
}
```

License

MIT License

Owner

  • Name: Aleksandar Tomašević
  • Login: atomashevic
  • Kind: user
  • Location: Novi Sad, Serbia
  • Company: University of Novi Sad

GitHub Events

Total
  • Watch event: 1
  • Public event: 1
Last Year
  • Watch event: 1
  • Public event: 1

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 7
  • Total Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Past Year
  • Commits: 7
  • Committers: 1
  • Avg Commits per committer: 7.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Aleksandar Tomasevic a****c@g****m 7
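
The committer figures above can be reproduced with a few lines of arithmetic. This sketch assumes the commonly used definition of the Development Distribution Score, one minus the top committer's share of all commits; the page itself does not state the formula, and the function names are illustrative.

```python
from typing import Sequence


def average_commits(commits_per_committer: Sequence[int]) -> float:
    """Mean number of commits per committer."""
    return sum(commits_per_committer) / len(commits_per_committer)


def development_distribution_score(commits_per_committer: Sequence[int]) -> float:
    """1 - (commits by the top committer / total commits); assumed definition."""
    total = sum(commits_per_committer)
    if total == 0:
        return 0.0
    return 1 - max(commits_per_committer) / total


# One committer with 7 commits, as in the table above.
commits = [7]
print(average_commits(commits))
print(development_distribution_score(commits))
```

With a single committer the top committer holds every commit, so the score is zero regardless of the commit count.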

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 2
  • Total maintainers: 1
pypi.org: pymadoc

  • Versions: 2
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 17 Last month
Rankings
Dependent packages count: 9.8%
Average: 32.4%
Dependent repos count: 54.9%
Maintainers (1)
Last synced: 6 months ago