https://github.com/centrefordigitalhumanities/gabber
A project for the Data School.
Science Score: 31.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (11.0%) to scientific vocabulary
Keywords
Repository
A project for the Data School.
Basic Info
- Host: GitHub
- Owner: CentreForDigitalHumanities
- License: gpl-3.0
- Language: Python
- Default Branch: master
- Homepage: https://dataschool.nl
- Size: 43 KB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Topics
Metadata Files
README.md
Gabber - data-analysis tools for gab.ai
This repository aims to provide a set of tools for data-driven media studies on the gab.ai platform.
Requirements
These tools require Python 3 and access to a MongoDB server. On a Debian system, run:
sudo apt-get install python3-pymongo python3-igraph python3-nltk python3-scipy mongodb-server
pip3 install hatesonar gensim pyLDAvis
Scraping
The minegab.py script is meant for scraping data from the gab.ai platform. All scraped data is stored in MongoDB for further parsing/analysis.
Usage
Scraping data from gab.ai starts at a particular account, whose username has to be manually provided to the script:
./minegab.py -u <username>
From there, the script will discover other accounts through reposts, follow-relations, comments, and quotes. Once the first account has been processed, the -a parameter will tell the script to scrape data from all the discovered accounts. In doing so, more accounts will likely be discovered:
./minegab.py -a
Keep running the script with -a until no new accounts are discovered; at that point, the giant graph within gab.ai has been scraped. The minegab.py script will give verbose output with the -d flag. Note that this output might contain special characters that are problematic to print on some terminals:
export PYTHONIOENCODING=UTF-8
./minegab.py -da
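The iterative discovery described above is essentially a breadth-first crawl of the interaction graph. A minimal stdlib sketch of that control flow, with a hypothetical `fetch_connections` stub standing in for the actual scraping (the real script stores results in MongoDB instead):

```python
from collections import deque

# Hypothetical stand-in for the real scraper: returns usernames connected
# to `user` via reposts, follow-relations, comments, and quotes.
SAMPLE_GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": [],
    "dave": ["carol"],
}

def fetch_connections(user):
    return SAMPLE_GRAPH.get(user, [])

def crawl(seed):
    """Breadth-first discovery: keep scraping until no new accounts
    appear, mirroring repeated `./minegab.py -a` runs."""
    discovered = {seed}
    queue = deque([seed])
    while queue:
        user = queue.popleft()
        for neighbour in fetch_connections(user):
            if neighbour not in discovered:
                discovered.add(neighbour)
                queue.append(neighbour)
    return discovered
```

Starting from a single seed account, the loop terminates once an `-a` pass discovers nothing new, which is exactly the stopping condition described above.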
To keep a logfile of the scraping, you could use the following command:
./minegab.py -a | tee -a ./scrapelog.txt
To redo scraping of accounts, first remove the account from the profiles collection, and then scrape it again:
./minegab.py -d <username> ; ./minegab.py -u <username>
To scrape the news section, simply run:
./minegab.py -n
Performance
Performance increases when multiple scrapers run simultaneously. Ideally, the scrapers would use different outbound IP addresses to reduce the impact of rate limiting, but performance already improves considerably when running multiple scrapers from the same node. Note that running scrapers from multiple nodes requires replication of the MongoDB backend.
Limitations
The minegab.py script cannot scrape beyond the giant graph that the manually provided accounts are part of. It will not find communities that are completely isolated from the accounts provided to the script.
Furthermore, the minegab.py script does not retrieve any media content. It stores links to media assets in the database, which could be used as input for a downloading script, but this functionality is not provided. Note that scraping all media content requires considerable bandwidth and storage capacity.
Finally, the 'groups' section of gab is mostly ignored. Group metadata is shown in the posts, but group membership is not scraped.
Processing
Communities
The gabcommunities.py script reads a GraphML file generated by the gabgraph.py script. It detects communities and can write its output to file as well as to MongoDB.
Usage:
./gabcommunities.py -i <graphml file> [-n <community type>] [-p] [-o output directory]
The script outputs the modularity score on the command line.
If the -p parameter is given, the script will calculate the PageRank for each user within the detected communities.
If the -n parameter is given, user profiles in MongoDB will be enriched with the community id and, optionally, the PageRank. The parameter expects a name for the edge type the community is based on, e.g., follow, quote, repost, or comment. Values are written under the communities attribute of the user profile.
If the -o parameter is given, an output directory will be created and a GraphML file for each detected community will be written to this directory. The filenames match the 'id' field written to MongoDB if the -n parameter was given.
Once you are done with all community detection, run the com2posts.py script to copy the community metadata from the profiles collection to the actuser attribute of every post and the user attribute of every comment.
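The -n enrichment step described above can be sketched with plain dicts in place of MongoDB documents. The exact field names (`communities`, `id`, `pagerank`) are an assumption based on the description above:

```python
# Sketch of enriching a user profile with community metadata, keyed by
# the edge type the community detection was based on.
def enrich_profile(profile, edge_type, community_id, pagerank=None):
    communities = profile.setdefault("communities", {})
    entry = {"id": community_id}
    if pagerank is not None:
        # Only stored when PageRank was calculated (the -p parameter).
        entry["pagerank"] = pagerank
    communities[edge_type] = entry
    return profile

profile = enrich_profile({"username": "alice"}, "follow", 3, pagerank=0.042)
```

In the real pipeline, this would be a MongoDB update against the profiles collection rather than an in-memory dict; the sketch only illustrates the resulting document shape.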
Groups
The gabgroups.py script gathers all group metadata found in the scraped posts and fills a Mongo collection named groups. It also adds a post count to the metadata.
By default, gabgroups.py only considers original posts. Use the -r parameter to also include reposts when gathering groups and counting posts.
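The gathering and counting logic can be sketched over plain dicts standing in for MongoDB post documents. The field names (`group`, `is_repost`) are assumptions for illustration:

```python
from collections import Counter

# Hypothetical post documents; only the fields relevant to grouping.
posts = [
    {"id": 1, "group": {"id": "g1", "title": "News"}, "is_repost": False},
    {"id": 2, "group": {"id": "g1", "title": "News"}, "is_repost": True},
    {"id": 3, "group": {"id": "g2", "title": "Memes"}, "is_repost": False},
]

def gather_groups(posts, include_reposts=False):
    """Collect group metadata and a per-group post count; reposts are
    skipped unless include_reposts is set (the -r parameter)."""
    groups, counts = {}, Counter()
    for post in posts:
        if post.get("group") is None:
            continue  # post was not made in a group
        if post["is_repost"] and not include_reposts:
            continue
        gid = post["group"]["id"]
        groups[gid] = dict(post["group"])
        counts[gid] += 1
    return {gid: {**meta, "post_count": counts[gid]}
            for gid, meta in groups.items()}
```

With `include_reposts=False` the repost of post 1 is not counted; with `include_reposts=True` it is, mirroring the -r behaviour described above.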
Hatespeech
The gabhate.py script uses the HateSonar library to detect hate speech and offensive speech in all English posts and comments. Other languages are not supported. Classification and confidence are stored in the hateometer attribute of all affected posts and comments.
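The annotation step can be sketched as follows. The result shape of the stub classifier mimics what HateSonar returns (an assumption); a stub stands in for the real model so the sketch stays self-contained:

```python
# Stub classifier imitating HateSonar's result shape (an assumption):
# a top class plus per-class confidences.
def classify_stub(text):
    return {
        "top_class": "neither",
        "classes": [
            {"class_name": "hate_speech", "confidence": 0.05},
            {"class_name": "offensive_language", "confidence": 0.10},
            {"class_name": "neither", "confidence": 0.85},
        ],
    }

def annotate(doc):
    """Store classification and confidence under the hateometer
    attribute; non-English documents are left untouched."""
    if doc.get("lang") != "en":
        return doc  # only English is supported
    result = classify_stub(doc["body"])
    doc["hateometer"] = {
        "class": result["top_class"],
        "confidence": max(c["confidence"] for c in result["classes"]),
    }
    return doc
```

In the real script, the stub would be replaced by HateSonar and the annotated documents written back to MongoDB.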
Topics
The gabtopics.py script uses LDA modelling to generate topics for a specific community. It outputs plain text as well as an HTML visualisation. Be sure to have run com2posts.py first. Usage:
./gabtopics.py -l [language] -e [edgetype] -c [community id] -t [number of topics] -o [output file]
Currently only English, Dutch, and German are supported. Note that running this script on larger communities requires serious computational resources, in particular large amounts of memory.
Exporting
Activity
The gabactivity.py script exports a CSV with per-month counts of total active users, total posts, total reposts, and total comments. Use the -o parameter to specify the output file.
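The monthly aggregation can be sketched with the stdlib alone. The document fields (`created_at`, `is_repost`) are assumptions, and active-user counting is omitted for brevity:

```python
import csv
import io
from collections import defaultdict

# Hypothetical post and comment documents with ISO dates.
posts = [
    {"created_at": "2018-07-03", "is_repost": False},
    {"created_at": "2018-07-15", "is_repost": True},
    {"created_at": "2018-08-01", "is_repost": False},
]
comments = [{"created_at": "2018-07-20"}]

def monthly_activity(posts, comments):
    """Count posts, reposts, and comments per YYYY-MM bucket."""
    stats = defaultdict(lambda: {"posts": 0, "reposts": 0, "comments": 0})
    for p in posts:
        month = p["created_at"][:7]
        stats[month]["reposts" if p["is_repost"] else "posts"] += 1
    for c in comments:
        stats[c["created_at"][:7]]["comments"] += 1
    return stats

def to_activity_csv(stats):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["month", "posts", "reposts", "comments"])
    for month in sorted(stats):
        s = stats[month]
        writer.writerow([month, s["posts"], s["reposts"], s["comments"]])
    return buf.getvalue()
```

Each row is one month, so the export can be plotted directly as a time series.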
GraphML
The gabgraph.py script exports a GraphML file for further processing with, for instance, igraph or (if you have a powerful desktop) Gephi. It supports four edge types: follow edges, repost edges, quote edges, and comment edges. Run:
./gabgraph.py -h
to see all possible parameters.
Note that the language attribute is taken from Gab itself; take these values with a grain of salt.
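A heavily simplified sketch of such a GraphML export, using only the stdlib XML tools (the real gabgraph.py output will carry more attributes, such as language):

```python
import xml.etree.ElementTree as ET

def to_graphml(nodes, edges):
    """Serialize a directed graph (usernames as node ids, edges as
    (source, target) pairs) to a minimal GraphML document."""
    root = ET.Element("graphml",
                      xmlns="http://graphml.graphdrawing.org/xmlns")
    graph = ET.SubElement(root, "graph", edgedefault="directed")
    for node_id in nodes:
        ET.SubElement(graph, "node", id=node_id)
    for source, target in edges:
        ET.SubElement(graph, "edge", source=source, target=target)
    return ET.tostring(root, encoding="unicode")

doc = to_graphml(["alice", "bob"], [("alice", "bob")])
```

A file like this loads directly into igraph (`Graph.Read_GraphML`) or Gephi for further analysis.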
Groups
The groups2csv.py script exports group metadata to a CSV file. Use the -o parameter to specify the output file.
The export format is comma-separated, single-quote-delimited CSV.
Hashtags
The gabhashtags.py script exports a sorted list of all hashtags used in posts and comments on gab, including a count of how often each was used. Use the -o parameter to specify the output file.
The export format is comma-separated, single-quote-delimited CSV.
Note that no weighting is applied in the hashtag count.
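The counting and export format can be sketched with the stdlib `re`, `collections`, and `csv` modules (the hashtag pattern is an assumption; the real script reads from MongoDB):

```python
import csv
import io
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def hashtag_counts(texts):
    """Return (hashtag, count) pairs sorted by frequency, unweighted:
    every occurrence counts as one, regardless of author or reach."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts.most_common()

def to_hashtag_csv(rows):
    buf = io.StringIO()
    # Comma-separated, single-quote-delimited, matching the export format.
    writer = csv.writer(buf, quotechar="'", quoting=csv.QUOTE_ALL)
    writer.writerow(["hashtag", "count"])
    for tag, count in rows:
        writer.writerow([tag, count])
    return buf.getvalue()
```

Lowercasing before counting merges casing variants of the same tag; whether the real script does this is an assumption.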
Hate statistics
The gabhatestats.py script outputs statistics on the overall amount of hate speech and offensive speech detected by the gabhate.py script, as well as statistics per community detected by the gabcommunities.py script. Note that these statistics only cover English posts and comments.
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Gabber
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - name: Utrecht University Data School
    address: Drift 13
    city: Utrecht
    country: NL
    post-code: 3512 BR
    email: dataschool@uu.nl
    website: 'https://dataschool.nl/'
repository-code: 'https://github.com/CentreForDigitalHumanities/gabber'
abstract: >-
  A set of tools for data-driven media studies on the gab.ai
  platform.
license: GPL-3.0
GitHub Events
Total
- Member event: 1
- Push event: 1
Last Year
- Member event: 1
- Push event: 1