widesky

Tool for Bulk collection of the BSky Firehose into Postgres

https://github.com/jhculb/widesky

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.6%) to scientific vocabulary

Keywords

bluesky bulk-data collector firehose postgres social-media social-sciences-data
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: jhculb
  • License: lgpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 71.3 KB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 11
  • Releases: 0
Created 9 months ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

WideSky

Description

A containerised, Python-based listener that ingests the Bluesky firehose and exports the processed data to a Postgres database, making it easy to collect samples from Bluesky for research and other purposes.

Statement of Need

Bluesky is an up-and-coming social media site based on the AT Protocol, which may merit further study. However, the AT Protocol can be difficult to interpret and code around, so this application is designed to lower the barrier to entry for bulk collection and analysis of Bluesky data.

How to use

  1. Install Docker
  2. Navigate to the root folder of the repository
  3. Run `docker compose up`
  4. Connect via your preferred method to the PostgreSQL database hosted locally at port 5432
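
As one way to carry out step 4 from Python, the sketch below uses psycopg (listed in the project's requirements.txt) to count the rows in a table. The database name, user and password are placeholders, not values taken from this repository; use whatever is configured in docker-compose.yml. The helper names `connection_params` and `count_rows`, and the lowercase table names, are likewise assumptions made for illustration.

```python
def connection_params(host="localhost", port=5432, dbname="postgres",
                      user="postgres", password=None):
    """Collect keyword arguments for psycopg.connect().

    Only the port (5432) comes from the README; dbname, user and password
    are placeholder assumptions to be replaced with your compose settings.
    """
    params = {"host": host, "port": port, "dbname": dbname, "user": user}
    if password is not None:
        params["password"] = password
    return params


def count_rows(params, table):
    """Return the number of rows in `table` (for example "posts")."""
    # Imported lazily so connection_params() works without psycopg installed.
    import psycopg
    from psycopg import sql

    # sql.Identifier quotes the table name safely instead of string formatting.
    query = sql.SQL("SELECT count(*) FROM {}").format(sql.Identifier(table))
    with psycopg.connect(**params) as conn:
        return conn.execute(query).fetchone()[0]
```

With the containers running, calling `count_rows(connection_params(password=your_password), "posts")` a few times should show the count growing as the listener writes batches.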

Please note: The bind-mounted volumes are difficult to delete due to security features within Docker. You will need to use a command such as `docker exec --privileged --user root <CONTAINER_ID> chown -R "$(id -u):$(id -g)" <TARGET_DIR>`; more details here.

Please note: Some instability has been observed when running WideSky for the first time. If the logs show connection errors between the Python and Postgres containers after building the project for the first time, please restart the application.

Output

The schema for the Postgres database is as follows:

Users Table

| did | firstknownas | alsoknownas |
| ---------------- | -------------- | ------------- |
| TEXT PRIMARY KEY | TEXT | TEXT |

Posts Table

| cid | createdat | did | commit | text | langs | facets | hasembed | embedtype | embedrefs | externaluri | hasrecord | recordcid | recorduri | isreply | replyrootcid | replyrooturi | replyparentcid | replyparent_uri |
| ---------------- | ------------------------ | ---- | ------ | ---- | ---------- | ------ | --------- | ---------- | ---------- | ------------ | ---------- | ---------- | ---------- | -------- | -------------- | -------------- | ---------------- | ---------------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT ARRAY | JSONB | BOOLEAN | TEXT | TEXT ARRAY | TEXT | BOOLEAN | TEXT | TEXT | BOOLEAN | TEXT | TEXT | TEXT | TEXT |

Likes Table

| cid | createdat | did | commit | subjectcid | subject_url |
| ---------------- | ------------------------ | ---- | ------ | ----------- | ----------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |

Reposts Table

| cid | createdat | did | commit | subjectcid | subject_uri |
| ---------------- | ------------------------ | ---- | ------ | ----------- | ----------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |
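
To illustrate how the tables above might be queried, here is a minimal sketch of a parameterised query that selects posts in a given language since a given time. It assumes the table is named `posts` and that the column names are exactly as written above; the real schema may differ (for example in its use of underscores), so check it before relying on this. `POSTS_BY_LANG` and `posts_by_lang_params` are hypothetical names.

```python
from datetime import datetime, timezone

# Named placeholders in the %(name)s style accepted by psycopg.
# langs is a TEXT ARRAY, so membership is tested with = ANY(langs).
POSTS_BY_LANG = """
    SELECT cid, createdat, did, text
    FROM posts
    WHERE %(lang)s = ANY(langs)
      AND createdat >= %(since)s
    ORDER BY createdat
    LIMIT %(limit)s
"""


def posts_by_lang_params(lang, since, limit=100):
    """Build the parameter dict for POSTS_BY_LANG.

    createdat is TIMESTAMP WITH TIME ZONE, so a naive datetime is rejected
    here rather than being silently interpreted in the session time zone.
    """
    if since.tzinfo is None:
        raise ValueError("since must be timezone-aware")
    return {"lang": lang, "since": since, "limit": limit}


params = posts_by_lang_params("en", datetime(2025, 1, 1, tzinfo=timezone.utc))
```

The pair can then be passed to a psycopg connection as `conn.execute(POSTS_BY_LANG, params)`.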

(Planned) Features

Current Features

  • Async functionality
  • Exponential backoff for reconnections to firehose and reattempts for plc.directory
  • Rotating logging, bind-mounted to a widesky/logs folder
  • Async workers for processing and batching to Postgres
  • Batched Postgres saving
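
The following is a standard-library sketch of two of the ideas listed above: exponential backoff on reconnection, and an async worker that batches records before saving. It is not WideSky's own code (the project lists tenacity as a dependency and may well use it for retries); the function names, the `None` sentinel, and the default batch size and timings are all assumptions made for illustration.

```python
import asyncio


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay before retrying after failed attempt number `attempt` (0-based).

    Doubles with every attempt and never exceeds `cap` seconds.
    """
    return min(cap, base * (2 ** attempt))


async def retry_with_backoff(operation, max_attempts=5, base=1.0, cap=60.0,
                             sleep=asyncio.sleep):
    """Await `operation()` until it succeeds, sleeping longer after each failure.

    Only ConnectionError is retried; the final failure is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            await sleep(backoff_delay(attempt, base, cap))


async def batch_worker(queue, flush, batch_size=500, max_wait=1.0):
    """Collect items from `queue` and hand them to `flush` in batches.

    A batch is flushed when it reaches `batch_size`, or when no new item
    arrives within `max_wait` seconds. A `None` item (a hypothetical
    sentinel) flushes the remainder and stops the worker.
    """
    batch = []
    while True:
        try:
            item = await asyncio.wait_for(queue.get(), timeout=max_wait)
        except asyncio.TimeoutError:
            # Quiet period: write out whatever has accumulated so far.
            if batch:
                await flush(batch)
                batch = []
            continue
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            await flush(batch)
            batch = []
    if batch:
        await flush(batch)
```

In a collector of this kind, `flush` would typically wrap a single multi-row insert, so that each batch costs one round trip to Postgres rather than one per record.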

To-do

  • Implement graph.list post type
  • Implement embed types
    • images#main
    • selectionQuote
    • secret
    • Others I have not seen?
  • Improve error handling
  • Add testing
  • Capture PostgreSQL logs in logs/postgres
  • Add webserver with metrics and ability to configure capture protocols
  • Integrate with a crawler to backfill full activity records for active users where these are not already present in the data
  • Add option to prevent HTTPX logging clogging up the logs
  • Capture delete and other non-create events

Acknowledgements

Particular thanks to David Peck, who implemented a lovely decoder for the CBOR format and whose work I have included in the firehose_utils.py file. Please see his work here: https://gist.github.com/davepeck/8ada49d42d44a5632b540a4093225719 and https://github.com/davepeck.

How to Cite

A technical paper will be released soon. For now, please mention the GitHub repository and, in academic works, my ORCID (https://orcid.org/0009-0000-1581-4021).

License

This work is licensed under the LGPL-3.0.

Contact Details

In case of questions please contact @jhculb.

For contributions and bug reports, open an issue here.

Owner

  • Name: Jack H. Culbert
  • Login: jhculb
  • Kind: user
  • Location: Cologne
  • Company: GESIS

Coffee......

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: WideSky
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jack H.
    family-names: Culbert
    email: jack.culbert@gesis.org
    orcid: 'https://orcid.org/0009-0000-1581-4021'
    affiliation: GESIS - Leibniz Institute for the Social Sciences
abstract: >-
  A tool that ingests the BlueSky firehose for bulk analysis
  and collection.
keywords:
  - BlueSky
  - BSky
  - Firehose
  - ATProto
  - Collector
license: LGPL-3.0-only

GitHub Events

Total
  • Issues event: 20
  • Watch event: 1
  • Delete event: 1
  • Issue comment event: 14
  • Push event: 6
  • Pull request event: 4
  • Fork event: 1
  • Create event: 8
Last Year
  • Issues event: 20
  • Watch event: 1
  • Delete event: 1
  • Issue comment event: 14
  • Push event: 6
  • Pull request event: 4
  • Fork event: 1
  • Create event: 8

Issues and Pull Requests

Last synced: 9 months ago

All Time
  • Total issues: 15
  • Total pull requests: 1
  • Average time to close issues: about 16 hours
  • Average time to close pull requests: less than a minute
  • Total issue authors: 3
  • Total pull request authors: 1
  • Average comments per issue: 0.73
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 15
  • Pull requests: 1
  • Average time to close issues: about 16 hours
  • Average time to close pull requests: less than a minute
  • Issue authors: 3
  • Pull request authors: 1
  • Average comments per issue: 0.73
  • Average comments per pull request: 0.0
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • jhculb (12)
  • shyamgupta196 (3)
  • taimoorkhan-nlp (1)
Pull Request Authors
  • jhculb (2)
  • taimoorkhan-nlp (2)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

docker-compose.yml docker
  • postgres 15
widesky/Dockerfile docker
  • python 3.13-slim build
widesky/requirements.txt pypi
  • aiocache *
  • httpx *
  • httpx-ws *
  • psycopg *
  • tenacity *
  • websockets *