https://github.com/centrefordigitalhumanities/gabber
A project for the Data School.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Keywords
Repository
A project for the Data School.
Basic Info
- Host: GitHub
- Owner: CentreForDigitalHumanities
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Homepage: https://dataschool.nl
- Size: 43 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Gabber - data-analysis tools for gab.ai
This repository aims to provide a set of tools for data-driven media studies on the gab.ai platform.
Requirements
These tools require Python 3 and access to a MongoDB server. On a Debian system, run:
sudo apt-get install python3-pymongo python3-igraph python3-nltk python3-scipy mongodb-server
pip3 install hatesonar gensim pyLDAvis
Scraping
The minegab.py script is meant for scraping data from the gab.ai platform. All scraped data is stored in MongoDB for further parsing/analysis.
Usage
Scraping data from gab.ai starts at a particular account, whose username has to be manually provided to the script:
./minegab.py -u <username>
From there, the script will discover other accounts through reposts, follow-relations, comments, and quotes. Once the first account has been processed, the -a parameter will tell the script to scrape data from all the discovered accounts. In doing so, more accounts will likely be discovered:
./minegab.py -a
Keep running the script with -a until no new accounts are discovered; at that point, the giant graph within gab.ai has been scraped. The minegab.py script will give verbose output with the -d flag. Note that this output might contain special characters that are problematic to print on some terminals:
export PYTHONIOENCODING=UTF-8
./minegab.py -da
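The iterative discovery described above is essentially a breadth-first crawl of the interaction graph. A minimal stdlib sketch of that control flow, with a hypothetical `fetch_connections` stub standing in for the actual scraping (the real script stores results in MongoDB instead):

```python
from collections import deque

# Hypothetical stand-in for the real scraper: returns usernames connected
# to `user` via reposts, follow-relations, comments, and quotes.
SAMPLE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": [],
    "dave": ["carol"],
}

def fetch_connections(user):
    return SAMPLE_GRAPH.get(user, [])

def crawl(seed):
    """Breadth-first discovery: keep scraping until no new accounts
    appear, mirroring repeated `./minegab.py -a` runs."""
    discovered = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for neighbour in fetch_connections(user):
            if neighbour not in discovered:
                discovered.add(neighbour)
                queue.append(neighbour)
    return discovered
```

Starting from a single seed account, the loop terminates once an `-a` pass discovers nothing new, which is exactly the stopping condition described above.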
To keep a logfile of the scraping, you could use the following command:
./minegab.py -a | tee -a ./scrapelog.txt
To redo scraping of accounts, first remove the account from the profiles collection, and then scrape it again:
./minegab.py -d <username> ; ./minegab.py -u <username>
To scrape the news section, simply run:
./minegab.py -n
Performance
Performance increases when multiple scrapers run simultaneously. Ideally, the scrapers would use different outbound IP addresses to reduce the impact of rate limiting, but performance already improves considerably when running multiple scrapers from the same node. Note that running scrapers from multiple nodes requires replication of the MongoDB backend.
Limitations
The minegab.py script cannot scrape beyond the giant graph that the manually provided accounts are part of. It will not find communities that are completely isolated from the accounts provided to the script.
Furthermore, the minegab.py script does not retrieve any media content. It stores links to media assets in the database, which could be used as input for a downloading script, but this functionality is not provided. Note that scraping all media content requires considerable bandwidth and storage capacity.
Finally, the 'groups' section of gab is mostly ignored. Group metadata is shown in the posts, but group membership is not scraped.
Processing
Communities
The gabcommunities.py script reads a GraphML file generated by the gabgraph.py script. It detects communities and can write its output to file as well as to MongoDB.
Usage:
./gabcommunities.py -i <graphml file> [-n <community type>] [-p] [-o output directory]
The script outputs the modularity score on the command line.
If the -p parameter is given, the script will calculate the PageRank for each user within the detected communities.
If the -n parameter is given, user profiles in MongoDB will be enriched with the community id and, optionally, the PageRank. The parameter expects a name for the edge type the community is based on, e.g., follow, quote, repost, or comment. Values are written under the communities attribute of the user profile.
If the -o parameter is given, an output directory will be created and a GraphML file for each detected community will be written to this directory. The filenames match the 'id' field written to MongoDB if the -n parameter was given.
Once you are done with all community detection, run the com2posts.py script to copy the community metadata from the profiles collection to the actuser attribute of every post and the user attribute of every comment.
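The -n enrichment step described above can be sketched with plain dicts in place of MongoDB documents. The exact field names (`communities`, `id`, `pagerank`) are an assumption based on the description above:

```python
# Sketch of enriching a user profile with community metadata, keyed by
# the edge type the community detection was based on.
def enrich_profile(profile, edge_type, community_id, pagerank=None):
    communities = profile.setdefault("communities", {})
    entry = {"id": community_id}
    if pagerank is not None:
        # Only stored when PageRank was calculated (the -p parameter).
        entry["pagerank"] = pagerank
    communities[edge_type] = entry
    return profile

profile = enrich_profile({"username": "alice"}, "follow", 3, pagerank=0.042)
```

In the real pipeline, this would be a MongoDB update against the profiles collection rather than an in-memory dict; the sketch only illustrates the resulting document shape.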
Groups
The gabgroups.py script gathers all group metadata found in the scraped posts and fills a Mongo collection named groups. It also adds a post count to the metadata.
By default, gabgroups.py only considers original posts. Use the -r parameter to also include reposts when gathering groups and counting posts.
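The gathering and counting logic can be sketched over plain dicts standing in for MongoDB post documents. The field names (`group`, `is_repost`) are assumptions for illustration:

```python
from collections import Counter

# Hypothetical post documents; only the fields relevant to grouping.
posts = [
    {"id": 1, "group": {"id": "g1", "title": "News"}, "is_repost": False},
    {"id": 2, "group": {"id": "g1", "title": "News"}, "is_repost": True},
    {"id": 3, "group": {"id": "g2", "title": "Memes"}, "is_repost": False},
]

def gather_groups(posts, include_reposts=False):
    """Collect group metadata and a per-group post count; reposts are
    skipped unless include_reposts is set (the -r parameter)."""
    groups, counts = {}, Counter()
    for post in posts:
        if post.get("group") is None:
            continue  # post was not made in a group
        if post["is_repost"] and not include_reposts:
            continue
        gid = post["group"]["id"]
        groups[gid] = dict(post["group"])
        counts[gid] += 1
    return {gid: {**meta, "post_count": counts[gid]}
            for gid, meta in groups.items()}
```

With `include_reposts=False` the repost of post 1 is not counted; with `include_reposts=True` it is, mirroring the -r behaviour described above.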
Hatespeech
The gabhate.py script uses the HateSonar library to detect hate speech and offensive speech in all English posts and comments. Other languages are not supported. Classification and confidence are stored in the hateometer attribute of all affected posts and comments.
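The annotation step can be sketched as follows. The result shape of the stub classifier mimics what HateSonar returns (an assumption); a stub stands in for the real model so the sketch stays self-contained:

```python
# Stub classifier imitating HateSonar's result shape (an assumption):
# a top class plus per-class confidences.
def classify_stub(text):
    return {
        "top_class": "neither",
        "classes": [
            {"class_name": "hate_speech", "confidence": 0.05},
            {"class_name": "offensive_language", "confidence": 0.10},
            {"class_name": "neither", "confidence": 0.85},
        ],
    }

def annotate(doc):
    """Store classification and confidence under the hateometer
    attribute; non-English documents are left untouched."""
    if doc.get("lang") != "en":
        return doc  # only English is supported
    result = classify_stub(doc["body"])
    doc["hateometer"] = {
        "class": result["top_class"],
        "confidence": max(c["confidence"] for c in result["classes"]),
    }
    return doc
```

In the real script, the stub would be replaced by HateSonar and the annotated documents written back to MongoDB.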
Topics
The gabtopics.py script uses LDA modelling to generate topics for a specific community. It outputs plain text as well as an HTML visualisation. Be sure to have run com2posts.py first. Usage:
./gabtopics.py -l [language] -e [edgetype] -c [community id] -t [number of topics] -o [output file]
Currently only English, Dutch, and German are supported. Note that running this script on larger communities requires serious computational resources, in particular large amounts of memory.
Exporting
Activity
The gabactivity.py script exports a CSV with per-month counts of total active users, total posts, total reposts, and total comments. Use the -o parameter to specify the output file.
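The monthly aggregation can be sketched with the stdlib alone. The document fields (`created_at`, `is_repost`) are assumptions, and active-user counting is omitted for brevity:

```python
import csv
import io
from collections import defaultdict

# Hypothetical post and comment documents with ISO dates.
posts = [
    {"created_at": "2018-07-03", "is_repost": False},
    {"created_at": "2018-07-15", "is_repost": True},
    {"created_at": "2018-08-01", "is_repost": False},
]
comments = [{"created_at": "2018-07-20"}]

def monthly_activity(posts, comments):
    """Count posts, reposts, and comments per YYYY-MM bucket."""
    stats = defaultdict(lambda: {"posts": 0, "reposts": 0, "comments": 0})
    for p in posts:
        month = p["created_at"][:7]
        stats[month]["reposts" if p["is_repost"] else "posts"] += 1
    for c in comments:
        stats[c["created_at"][:7]]["comments"] += 1
    return stats

def to_activity_csv(stats):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["month", "posts", "reposts", "comments"])
    for month in sorted(stats):
        s = stats[month]
        writer.writerow([month, s["posts"], s["reposts"], s["comments"]])
    return buf.getvalue()
```

Each row is one month, so the export can be plotted directly as a time series.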
GraphML
The gabgraph.py script exports a GraphML file for further processing with, for instance, igraph or (if you have a powerful desktop) Gephi. It supports four edge types: follow edges, repost edges, quote edges, and comment edges. Run:
./gabgraph.py -h
to see all possible parameters.
Note that the language attribute is taken from Gab itself; take these values with a grain of salt.
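A heavily simplified sketch of such a GraphML export, using only the stdlib XML tools (the real gabgraph.py output will carry more attributes, such as language):

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes, edges):
    """Serialize a directed graph (usernames as node ids, edges as
    (source, target) pairs) to a minimal GraphML document."""
    root = ET.Element("graphml",
                      xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")

doc = to_graphml(["alice", "bob"], [("alice", "bob")])
```

A file like this loads directly into igraph (`Graph.Read_GraphML`) or Gephi for further analysis.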
Groups
The groups2csv.py script exports group metadata to a CSV file. Use the -o parameter to specify the output file.
The export format is comma-separated, single-quote-delimited CSV.
Hashtags
The gabhashtags.py script exports a sorted list of all hashtags used in posts and comments on gab, including a count of how often each was used. Use the -o parameter to specify the output file.
The export format is comma-separated, single-quote-delimited CSV.
Note that no weighting is applied in the hashtag count.
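The counting and export format can be sketched with the stdlib `re`, `collections`, and `csv` modules (the hashtag pattern is an assumption; the real script reads from MongoDB):

```python
import csv
import io
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_counts(texts):
    """Return (hashtag, count) pairs sorted by frequency, unweighted:
    every occurrence counts as one, regardless of author or reach."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts.most_common()

def to_hashtag_csv(rows):
    buf = io.StringIO()
    # Comma-separated, single-quote-delimited, matching the export format.
    writer = csv.writer(buf, quotechar="'", quoting=csv.QUOTE_ALL)
    writer.writerow(["hashtag", "count"])
    for tag, count in rows:
        writer.writerow([tag, count])
    return buf.getvalue()
```

Lowercasing before counting merges casing variants of the same tag; whether the real script does this is an assumption.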
Hate statistics
The gabhatestats.py script outputs statistics on the overall amount of hate speech and offensive speech detected by the gabhate.py script, as well as statistics per community detected by the gabcommunities.py script. Note that these statistics only cover English posts and comments.
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Gabber
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: Utrecht University Data School
    address: Drift 13
    city: Utrecht
    country: NL
    post-code: 3512 BR
    email: dataschool@uu.nl
    website: 'https://dataschool.nl/'
repository-code: 'https://github.com/CentreForDigitalHumanities/gabber'
abstract: >-
  A set of tools for data-driven media studies on the gab.ai
  platform.
license: GPL-3.0
GitHub Events
Total
- Member event: 1
- Push event: 1
Last Year
- Member event: 1
- Push event: 1