widesky
Tool for Bulk collection of the BSky Firehose into Postgres
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 11
- Releases: 0
README.md
WideSky
Description
A containerised, Python-based listener that ingests the BlueSky firehose and exports the processed data to a Postgres database. It is intended to make it easy to collect samples from BlueSky for research and wider purposes.
Statement of Need
BlueSky is an up-and-coming social media site based on the AT Protocol, which may merit further study. However, the AT Protocol can be difficult to interpret and code around, so this application is designed to lower the barrier to entry for bulk collection and analysis of BlueSky data.
How to use
- Install Docker
- Navigate to the root folder of the repository
- Run `docker compose up`
- Connect via your preferred method to the PostgreSQL database hosted locally at port 5432
Please note: The bind-mounted volumes are difficult to delete because of security features within Docker.
You will need to use a command such as `docker exec --privileged --user root <CONTAINER_ID> chown -R "$(id -u):$(id -g)" <TARGET_DIR>`; more details here.
Please note: Some instability has been observed when running WideSky for the first time. If the logs show connection errors between the Python and Postgres containers after building the project for the first time, please restart the application.
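Once the containers are running, a connection from Python could look like the minimal sketch below. The user name, password, database name, and the table name `posts` are placeholders assumed for illustration and are not taken from this repository; check the compose file for the real values. `psycopg` is used because it already appears in the dependency list.

```python
# Sketch: connect to the locally hosted WideSky database and count rows.
# ASSUMPTIONS: user, password, dbname and the table name "posts" are
# placeholders; take the real values from the repository's compose file.


def build_conninfo(host="localhost", port=5432, user="postgres",
                   password="postgres", dbname="postgres"):
    """Build a libpq-style connection string (default values are placeholders)."""
    return (f"host={host} port={port} user={user} "
            f"password={password} dbname={dbname}")


def count_rows(conninfo, table="posts"):
    """Return the number of rows in `table` (needs a running database)."""
    import psycopg  # imported lazily so build_conninfo works without it
    from psycopg import sql

    # sql.Identifier quotes the table name safely instead of string pasting.
    query = sql.SQL("SELECT count(*) FROM {}").format(sql.Identifier(table))
    with psycopg.connect(conninfo) as conn:
        return conn.execute(query).fetchone()[0]
```

With the stack up, `count_rows(build_conninfo())` should return the current number of collected posts, provided the placeholder credentials are replaced with the configured ones.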
Output
The schema for the Postgres database is as follows:
Users Table
| did | firstknownas | alsoknownas |
| ---------------- | ------------ | ----------- |
| TEXT PRIMARY KEY | TEXT | TEXT |
Posts Table
| cid | createdat | did | commit | text | langs | facets | hasembed | embedtype | embedrefs | externaluri | hasrecord | recordcid | recorduri | isreply | replyrootcid | replyrooturi | replyparentcid | replyparent_uri |
| ---------------- | ------------------------ | ---- | ------ | ---- | ---------- | ------ | -------- | --------- | ---------- | ----------- | --------- | --------- | --------- | ------- | ------------ | ------------ | -------------- | --------------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT ARRAY | JSONB | BOOLEAN | TEXT | TEXT ARRAY | TEXT | BOOLEAN | TEXT | TEXT | BOOLEAN | TEXT | TEXT | TEXT | TEXT |
Likes Table
| cid | createdat | did | commit | subjectcid | subject_url |
| ---------------- | ------------------------ | ---- | ------ | ---------- | ----------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |
Reposts Table
| cid | createdat | did | commit | subjectcid | subject_uri |
| ---------------- | ------------------------ | ---- | ------ | ---------- | ----------- |
| TEXT PRIMARY KEY | TIMESTAMP WITH TIME ZONE | TEXT | TEXT | TEXT | TEXT |
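To illustrate how these tables relate, here is a small sketch of a typical analysis query that joins posts to users on `did`. It runs against an in-memory SQLite database as a stand-in, so it works without the containers. The table names `users` and `posts`, the sample rows, and the reduced column set are assumptions made for the example; the query itself is plain SQL and should carry over to Postgres largely unchanged.

```python
# Sketch: an analysis query over the documented schema, run against an
# in-memory SQLite stand-in so it works without the containers.
# ASSUMPTIONS: the table names "users" and "posts" and all sample rows are
# made up; only a subset of the documented columns is recreated here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (did TEXT PRIMARY KEY, firstknownas TEXT, alsoknownas TEXT);
    CREATE TABLE posts (cid TEXT PRIMARY KEY, createdat TEXT, did TEXT,
                        text TEXT, isreply BOOLEAN);
""")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    ("did:plc:aaa", "alice.example", "alice.example"),
    ("did:plc:bbb", "bob.example", "bob.example"),
])
conn.executemany("INSERT INTO posts VALUES (?, ?, ?, ?, ?)", [
    ("cid1", "2025-01-01T10:00:00+00:00", "did:plc:aaa", "hello", False),
    ("cid2", "2025-01-01T11:00:00+00:00", "did:plc:aaa", "a reply", True),
    ("cid3", "2025-01-01T12:00:00+00:00", "did:plc:bbb", "hi", False),
])

# Count non-reply posts per account, mapping each DID to its firstknownas value.
QUERY = """
    SELECT u.firstknownas, count(*) AS n_posts
    FROM posts AS p
    JOIN users AS u ON u.did = p.did
    WHERE NOT p.isreply
    GROUP BY u.firstknownas
    ORDER BY u.firstknownas
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The `did` column is the natural join key here because it appears in every table, whereas `cid` identifies an individual record.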
(Planned) Features
Current Features
- Async functionality
- Exponential backoff for reconnections to firehose and reattempts for plc.directory
- Rotating logs, bind-mounted to a widesky/logs folder
- Async workers for processing and batching to Postgres
- Batched Postgres saving
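The backoff and batching features listed above can be sketched as follows. This is an illustration of the general techniques, not WideSky's implementation: every name (`backoff_delays`, `batch_worker`, `save_batch`), the batch size, and the delay values are hypothetical. The dependency list includes tenacity, which presumably handles retries in the project itself; the sketch uses only the standard library.

```python
# Sketch: exponential backoff delays plus an async batching worker.
# ASSUMPTIONS: every name here (backoff_delays, batch_worker, save_batch)
# and all sizes and delays are hypothetical, not taken from WideSky.
import asyncio


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=6):
    """Return the wait (seconds) before each retry: base * factor**n, capped."""
    return [min(cap, base * factor ** n) for n in range(attempts)]


async def batch_worker(queue, save_batch, batch_size=3):
    """Drain `queue`, calling `save_batch` once per full batch.

    A `None` item is a shutdown signal: any partial batch is flushed and
    the worker returns.
    """
    batch = []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            await save_batch(batch)
            batch = []
    if batch:
        await save_batch(batch)


async def demo():
    saved = []

    async def save_batch(rows):
        # Stand-in for a batched INSERT into Postgres.
        saved.append(list(rows))

    queue = asyncio.Queue()
    for i in range(7):
        queue.put_nowait(i)
    queue.put_nowait(None)  # tell the worker to flush and stop
    await batch_worker(queue, save_batch)
    return saved


batches = asyncio.run(demo())
print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(batches)           # [[0, 1, 2], [3, 4, 5], [6]]
```

Batching trades a little latency for far fewer database round trips, and capping the backoff keeps a long outage from producing unbounded waits. A production version would usually add random jitter to the delays.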
To-do
- Implement graph.list post type
- Implement embed types
  - images#main
  - selectionQuote
  - secret
  - Others I have not seen?
- Improve error handling
- Add testing
- Capture PostgreSQL logs in logs/postgres
- Add webserver with metrics and ability to configure capture protocols
- Integrate with a crawler to reach back for full activity records of active users where not present already in data
- Add option to prevent HTTPX logging clogging up the logs
- Capture delete and other non-create events
Acknowledgements
Particular thanks go to David Peck, who implemented a lovely decoder for the CBOR protocol; his work is included in the firehose_utils.py file. Please see his work here: https://gist.github.com/davepeck/8ada49d42d44a5632b540a4093225719 and https://github.com/davepeck.
How to Cite
A technical paper will be released soon. For now, please cite the GitHub repository, and in academic works please also mention my ORCID (https://orcid.org/0009-0000-1581-4021).
License
This work is licensed under the LGPL-3.0.
Contact Details
In case of questions please contact @jhculb.
For contributions and bug reports, open an issue here.
Owner
- Name: Jack H. Culbert
- Login: jhculb
- Kind: user
- Location: Cologne
- Company: GESIS
- Repositories: 1
- Profile: https://github.com/jhculb
Coffee......
Citation (CITATION.cff)
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: WideSky
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Jack H.
    family-names: Culbert
    email: jack.culbert@gesis.org
    orcid: 'https://orcid.org/0009-0000-1581-4021'
    affiliation: GESIS - Leibniz Institute for the Social Sciences
abstract: >-
  A tool that ingests the BlueSky firehose for bulk analysis
  and collection.
keywords:
  - BlueSky
  - BSky
  - Firehose
  - ATProto
  - Collector
license: LGPL-3.0-only
GitHub Events
Total
- Issues event: 20
- Watch event: 1
- Delete event: 1
- Issue comment event: 14
- Push event: 6
- Pull request event: 4
- Fork event: 1
- Create event: 8
Last Year
- Issues event: 20
- Watch event: 1
- Delete event: 1
- Issue comment event: 14
- Push event: 6
- Pull request event: 4
- Fork event: 1
- Create event: 8
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 15
- Total pull requests: 1
- Average time to close issues: about 16 hours
- Average time to close pull requests: less than a minute
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 0.73
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 15
- Pull requests: 1
- Average time to close issues: about 16 hours
- Average time to close pull requests: less than a minute
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.73
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- jhculb (12)
- shyamgupta196 (3)
- taimoorkhan-nlp (1)
Pull Request Authors
- jhculb (2)
- taimoorkhan-nlp (2)
Dependencies
- postgres 15
- python 3.13-slim build
- aiocache *
- httpx *
- httpx-ws *
- psycopg *
- tenacity *
- websockets *