https://github.com/atomashevic/pymadoc
Python package to download and combine parts of MADOC dataset
Science Score: 49.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 3 DOI reference(s) in README
- ✓ Academic publication links: Links to: zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary
Repository
Python package to download and combine parts of MADOC dataset
Basic Info
- Host: GitHub
- Owner: atomashevic
- Language: Python
- Default Branch: main
- Homepage: https://pypi.org/project/pymadoc/
- Size: 8.41 MB
Statistics
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
pyMADOC
Python package to download and combine parts of MADOC dataset from Zenodo. The MADOC dataset contains social media posts from multiple platforms (Reddit, Voat, Bluesky, and Koo), making it easy to study cross-platform content and community dynamics.
Features
- Easy download of platform-specific data files
- Automatic pairing of Reddit-Voat community data
- Both Python API and Command Line Interface
- Support for direct DataFrame loading
- Efficient parquet file format
Installation
```bash
pip install pymadoc
```
Usage
As a Python Package
```python
from pymadoc import list_available_data, download_file, download_community_pair

# List available platforms and communities
data_info = list_available_data()
print(data_info["platforms"])    # ['reddit', 'voat', 'bluesky', 'koo']
print(data_info["communities"])  # ['CringeAnarchy', 'fatpeoplehate', ...]

# Download a specific file
# For Reddit/Voat, specify both platform and community
file_path = download_file("reddit", community="funny", output_dir="data")

# For Bluesky/Koo, specify only platform
file_path = download_file("bluesky", output_dir="data")

# Load directly as DataFrame
df = download_file("reddit", community="funny", as_dataframe=True)

# Download and combine Reddit-Voat community pair
# As files
reddit_file, voat_file = download_community_pair("funny", output_dir="data")

# As combined DataFrame
combined_df = download_community_pair("funny", as_dataframe=True)
```
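To fetch several Reddit-Voat pairs in one run, the pair download can be wrapped in a small loop that records failures instead of stopping at the first one. The following is only a sketch: `download_pairs`, `PairDownloader`, and `_default_downloader` are hypothetical names that are not part of pyMADOC, and the default downloader assumes the function is exposed as `download_community_pair` and returns two file paths, as in the usage example above.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical downloader type: (community, output_dir) -> (reddit_path, voat_path).
PairDownloader = Callable[[str, str], Tuple[str, str]]


def _default_downloader(community: str, output_dir: str) -> Tuple[str, str]:
    # Imported lazily so the helper can be exercised without pymadoc installed.
    # Assumption: name and return shape follow the README usage example.
    from pymadoc import download_community_pair

    return download_community_pair(community, output_dir=output_dir)


def download_pairs(
    communities: Iterable[str],
    output_dir: str = "data",
    downloader: Optional[PairDownloader] = None,
) -> Tuple[Dict[str, Tuple[str, str]], Dict[str, str]]:
    """Download several community pairs; return (successes, failures)."""
    fetch = downloader or _default_downloader
    done: Dict[str, Tuple[str, str]] = {}
    failed: Dict[str, str] = {}
    for name in communities:
        try:
            done[name] = fetch(name, output_dir)
        except Exception as exc:  # keep going and record the reason
            failed[name] = str(exc)
    return done, failed
```

With pyMADOC installed, `download_pairs(["funny", "gaming"])` would attempt both pairs and return one dictionary of downloaded paths and one of error messages. Passing a custom `downloader` makes the loop easy to test without network access.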
Command Line Interface
List available platforms and communities:
```bash
pymadoc list
```
Download a specific file:
```bash
# Reddit/Voat (requires community)
pymadoc download reddit --community funny --output-dir data

# Bluesky/Koo
pymadoc download bluesky --output-dir data
```
Download Reddit-Voat community pair:
```bash
pymadoc pair funny --output-dir data
```
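When scripting the CLI, it can help to generate the `pymadoc pair` invocations for a list of communities and review them before running anything. This sketch uses only the subcommand and the `--output-dir` flag shown above. The helper names `pair_commands` and `run_all` are hypothetical, and the dry-run behaviour belongs to this sketch, not to pyMADOC.

```python
import shlex
import subprocess
from typing import Iterable, List


def pair_commands(communities: Iterable[str], output_dir: str = "data") -> List[List[str]]:
    """Build one `pymadoc pair` argument list per community."""
    return [["pymadoc", "pair", name, "--output-dir", output_dir] for name in communities]


def run_all(commands: Iterable[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status.
            subprocess.run(argv, check=True)
```

Calling `run_all(pair_commands(["funny", "gaming"]))` only prints the commands. Passing `dry_run=False` would execute them and requires the `pymadoc` executable on `PATH`.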
Available Data
Platforms
- Reddit: Community-specific posts and comments
- Voat: Community-specific posts and comments
- Bluesky: Platform-wide posts
- Koo: Platform-wide posts
Communities (Reddit/Voat only)
- CringeAnarchy
- fatpeoplehate
- funny
- gaming
- gifs
- greatawakening
- KotakuInAction
- MensRights
- milliondollarextreme
- pics
- technology
- videos
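The platform and community lists above imply a simple rule: Reddit and Voat requests need one of the listed communities, while Bluesky and Koo are platform-wide. A minimal sketch of checking a request against that rule before downloading is shown below. `check_request` and the constants are hypothetical helpers, not part of pyMADOC. The exact-case matching and the rejection of a community for Bluesky/Koo are assumptions of this sketch, and the package itself may behave differently.

```python
from typing import Optional

# Values copied from the lists above.
PLATFORMS_WITH_COMMUNITIES = ("reddit", "voat")
PLATFORM_WIDE = ("bluesky", "koo")
COMMUNITIES = (
    "CringeAnarchy", "fatpeoplehate", "funny", "gaming", "gifs",
    "greatawakening", "KotakuInAction", "MensRights",
    "milliondollarextreme", "pics", "technology", "videos",
)


def check_request(platform: str, community: Optional[str] = None) -> None:
    """Raise ValueError if the platform/community combination is not listed."""
    if platform in PLATFORMS_WITH_COMMUNITIES:
        if community is None:
            raise ValueError(f"{platform} requires a community")
        if community not in COMMUNITIES:
            raise ValueError(f"unknown community: {community}")
    elif platform in PLATFORM_WIDE:
        if community is not None:
            raise ValueError(f"{platform} data is platform-wide; omit community")
    else:
        raise ValueError(f"unknown platform: {platform}")
```

For example, `check_request("reddit", "funny")` and `check_request("koo")` pass silently, while `check_request("reddit")` raises `ValueError`.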
Data Format
All files are stored in parquet format for efficient storage and fast loading. Each file contains the following columns:
- Platform-specific post/comment IDs
- Content text
- Timestamps
- User information
- Engagement metrics
Requirements
- Python 3.6 or higher
- pandas
- requests
- tqdm
Citation
If you use this package or the MADOC dataset in your research, please cite:
Mitrovic Dankulov, M., Tomašević, A., Maletic, S., Andjelkovic, M., Vranic, A., Cvetkovic, D., Stupovski, B., Vudragovic, D., Major, S., & Bogojević, A. (2025). MADOC: Multi-Platform Aggregated Dataset of Online Communities (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.14637314
Or BibTeX:
```
@MISC{mitrovicDankulov2025madoc,
  title     = "MADOC: Multi-Platform Aggregated Dataset of Online Communities",
  author    = "Mitrovic Dankulov, Marija and Tomašević, Aleksandar and Maletic, Slobodan and Andjelkovic, Miroslav and Vranic, Ana and Cvetkovic, Darja and Stupovski, Boris and Vudragovic, Dusan and Major, Sara and Bogojević, Aleksandar",
  publisher = "Zenodo",
  abstract  = "The Multi-platform Aggregated Dataset of Online Communities (MADOC) is a comprehensive dataset that facilitates computational social science research by providing a unified, standardized dataset for cross-platform analysis of online social dynamics. MADOC aggregates and standardizes data from four distinct platforms: Bluesky, Koo, Reddit, and Voat, spanning from 2012 to 2024. The dataset includes 18.9 million posts, 236 million comments, and data from 23.1 million unique users across all platforms, with a particular focus on understanding community dynamics, user migration patterns, and the evolution of toxic behavior across platforms. By providing standardized data structures and FAIR-compliant access through Zenodo, MADOC enables researchers to conduct comparative analyses of user behavior, interaction networks, and content sentiment across diverse social media environments. The dataset's unique value lies in its cross-platform scope, standardized structure, and rich metadata, making it particularly suitable for studying societal phenomena such as community formation, toxic behavior propagation, and user migration patterns in response to platform moderation policies.",
  year      = 2025,
  url       = "https://zenodo.org/records/14637314",
  keywords  = "Social Media; Online Social Networking; Social Network Analysis",
  doi       = "10.5281/ZENODO.14637314"
}
```
License
MIT License
Owner
- Name: Aleksandar Tomašević
- Login: atomashevic
- Kind: user
- Location: Novi Sad, Serbia
- Company: University of Novi Sad
- Website: www.atomasevic.com
- Twitter: atomasevic
- Repositories: 2
- Profile: https://github.com/atomashevic
GitHub Events
Total
- Watch event: 1
- Public event: 1
Last Year
- Watch event: 1
- Public event: 1
Committers
Last synced: 8 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Aleksandar Tomasevic | a****c@g****m | 7 |
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Packages
- Total packages: 1
- Total downloads:
  - pypi: 17 last-month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 2
- Total maintainers: 1
pypi.org: pymadoc
Python package to download and combine parts of MADOC dataset
- Homepage: https://github.com/atomashevic/pyMADOC
- Documentation: https://pymadoc.readthedocs.io/
- License: MIT License
- Latest release: 0.1.1 (published about 1 year ago)