mseep-qanon-mcp

A machine readable JSON QAnon dataset, archiving all QAnon drops for research only

https://github.com/jkingsman/json-qanon

Last synced: 9 months ago

Repository

Basic Info

Host: GitHub
Owner: jkingsman
License: other
Language: JavaScript
Default Branch: main
Homepage:
Size: 5.19 MB

Statistics

Stars: 25
Watchers: 2
Forks: 7
Open Issues: 0
Releases: 0

Created over 5 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

QAnon is a dangerous cult. This archive is for research purposes only, and I do not endorse any material in this repo.

QAnon Post Dataset

posts.json contains all QAnon posts as collated from multiple sources, up to the most recent drop on 2022-11-27. The JSON has been dumped with ensure_ascii=False so should be UTF-8 but there may be some encoding gotchas I haven't caught (text fields contain text with line breaks as literal \n). I did my best in terms of avoiding capture glitches/bad logic but I didn't read through all 5k posts so caveat emptor; I make no guarantees of data integrity.
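A minimal sketch of reading the dump with an explicit encoding, assuming the top-level `posts` key described in the Schema section. The `load_posts` helper is hypothetical and not part of this repo. The `\n` sequences in the raw file are JSON escapes, so a JSON parser turns them into real newline characters on load.

```python
import json
from pathlib import Path


def load_posts(path="posts.json"):
    """Load the dump as UTF-8 and return the list under the 'posts' key."""
    # An explicit encoding avoids platform-dependent defaults (e.g. on Windows).
    with open(path, encoding="utf-8") as f:
        return json.load(f)["posts"]


if __name__ == "__main__":
    # Only runs when a copy of the dump is in the working directory.
    if Path("posts.json").exists():
        posts = load_posts()
        print(f"{len(posts)} posts loaded")
```

Passing the encoding explicitly matters here because the file was written with `ensure_ascii=False`, so non-ASCII characters appear as raw UTF-8 bytes rather than `\uXXXX` escapes.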

Where is this data from?

There are a variety of sites around the web which collate historical Q drops, as well as, of course, archived and still-live primary sources for the posts. This dataset is an amalgamation of data from various archival sites to enrich the data. It has been spot checked against the original posts, but not exhaustively validated against primary sources -- however, for better or worse, archival efforts are near-fanatical in their attention to detail, so I have strong confidence in this dataset. If you spot an error of substance (e.g. beyond simple unicode encoding errors, which I'm sure there are a few of), please open an issue.

The collate.py script originally parsed a scraped copy of https://qposts.online; that site has since gone down. I retained the site's Q drop content, and also utilize other archives of sites to enrich that data. The script is now useless without a copy of the site, and I've chosen not to publicize the other sites and scripts that I retain archival copies of and use to enrich the qposts data with additional primary-source information.

Important Notes

Images

Posts reference images which I have opted not to include in this repo due to their distasteful content; the text is already quite enough and then some. As the original site I scraped the images from is down, you may no longer scrape the images directly yourself. The original filenames of the images both as published by Q and as found on the original source archive are included in the dump, and they can often be found around the web.

If you are an academic researcher and can prove valid research interest (a university email is table stakes; a university email with a link to your page on a sociology department webpage, current relevant research focus, or equivalent proof of research beyond "I'm a college student please give me Q content", all the better), you may contact me and I may, at my discretion, provide you with my image scrape archive. Emails which do not unequivocally establish academic credentials and a reasonable, contextualized need for the content will not receive a response.

`posts.json` used to fix URLs; now it doesn't

...and represents posts accurately to the original posting.

The collate.py script prior to the commit which introduced this paragraph consolidated links with spaces in the middle (making them invalid) into links without spaces (for example https:// twitter. com/ became https://twitter.com/). Q's original posts contained these spaces; I previously elected to remove them for the sake of functioning links. As the source material is becoming increasingly difficult to find, I've elected to currently represent these links faithfully to the original, although they result in broken links.
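As a rough illustration of the simplest part of that consolidation, the sketch below closes the gap directly after the scheme. It is not the code from collate.py; `close_scheme_gap` is a hypothetical name, and spaces elsewhere in a link (such as `twitter. com/`) would need per-URL rules that are not reproduced here.

```python
import re

# Matches a scheme followed by one or more whitespace characters,
# e.g. "https:// twitter.com".
SCHEME_GAP = re.compile(r"(https?://)\s+")


def close_scheme_gap(text):
    """Remove whitespace that directly follows http:// or https://."""
    return SCHEME_GAP.sub(r"\1", text)


# prints: see https://twitter.com/example
print(close_scheme_gap("see https:// twitter.com/example"))
```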

If you want fixed versions, use posts.url-normalized.json or posts.url-normalized.yml which have had the following regexes applied:

```bash
# -E enables extended regexes so that + means "one or more" (GNU sed)

# Remove spaces after http://
sed -E -i 's|http://\s+|http://|g' posts.json

# Remove spaces after https://
sed -E -i 's|https://\s+|https://|g' posts.json

# Fix specific spaced URLs
```

Scrape Instruction Removal

Prior to the commit introducing this paragraph, there were instructions for running the basic scrape and then collating the data from it yourself. As the source site has gone down and data has subsequently been enriched from other sources, those instructions have been removed from the README. Please view them in the git history if they are relevant to your work.

HTML Viewer

viewer.html will render a simple display of all posts with their basic information. This page utilizes posts.js which is simply posts.json assigned to the variable QPOSTS. This page will optionally serve the images if you scraped them from https://qposts.online before it went down.

If you have the images scraped from https://qposts.online, the location the script expects to find them in is in IMAGE_BASE. If your location isn't ./images, change it in the JS. If you don't have images, they'll just fail to render and not affect the rest of the display. Again, if you didn't scrape the sources in the early days of this repo, this door is unfortunately shut for you. See above under Important Notes for access if you are an academic researcher.

The resulting HTML can then be saved as a complete webpage with most web browsers, or printed to PDF for more visual analysis.
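If posts.js ever needs to be regenerated from posts.json, something along these lines should work. This is a sketch under assumptions: `write_posts_js` is a hypothetical helper, and the exact declaration used in the repo's posts.js (`var`, `const`, trailing semicolon) is a guess.

```python
import json
from pathlib import Path


def write_posts_js(json_path="posts.json", js_path="posts.js"):
    """Wrap the JSON dump in a JavaScript assignment to QPOSTS."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    with open(js_path, "w", encoding="utf-8") as f:
        # Assumption: a plain `var` declaration; the repo's posts.js may differ.
        f.write("var QPOSTS = ")
        json.dump(data, f, ensure_ascii=False)
        f.write(";\n")


if __name__ == "__main__":
    # Only runs when a copy of the dump is in the working directory.
    if Path("posts.json").exists():
        write_posts_js()
```

Parsing and re-dumping, rather than concatenating strings, has the side benefit of failing loudly if posts.json is not valid JSON.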

Schema

The JSON takes the form of an array of post objects under the posts key. A machine-readable JSON schema is available in posts.schema.json.

A post consists of:

- post_metadata: an object containing misc. information about the post (object)
  - id: the ordinal ID of the post (sequentially from 1 forwards in time; generated and not present on original posts) (integer)
  - author: the author of the post; usually Q or Anonymous (string)
  - author_id: AKA "poster ID" -- a numerical identifier for a particular poster generated from a hash of the thread ID, the user's IP address, and other information by the board it was posted on (string)
  - tripcode: the tripcode of the post, if included (string, optional)
  - source: an object containing information about the post's origin (object)
    - board: the chan board the post came from (string)
    - site: one of 4ch, 8ch, or 8kun, indicating the site the post is from (4chan, 8chan, or 8kun) (string)
    - link: link to the original post (string, optional)
  - time: epoch timestamp of posting time (integer/timestamp)
- text: the text of the post with newlines delimited by literal \n (string, optional)
- images: an array of objects indicating images used in the post (object, optional)
  - file: the name of the image file itself as archived from https://qposts.online (now defunct) (string)
  - name: the name of the image as it was named when posted to the image board (string)
- referenced_posts: an array of objects indicating replied-to posts within Q's post (i.e. >>8251669) (object, optional)
  - reference: the string within the text of the main post that referred to this one (string)
  - text: the text of the referenced post with newlines delimited by literal \n (string, optional)
  - author_id: AKA "poster ID" -- a numerical identifier for a particular poster generated from a hash of the thread ID, the user's IP address, and other information (string)
  - images: an array of objects indicating images used in the post (object, optional)
    - file: the name of the image file itself as archived from https://qposts.online (now defunct) (string)
    - name: the name of the image as it was named when posted to the image board (string)
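To make the nesting concrete, here is a small sketch that walks one post and pulls out a few fields. It assumes `time` is in seconds and sits inside `post_metadata` alongside `source`; `summarize` and the sample post are made up for illustration and are not real data from the archive.

```python
from datetime import datetime, timezone


def summarize(post):
    """Flatten a few fields of one post object into a plain dict."""
    meta = post["post_metadata"]
    source = meta.get("source", {})
    # Assumption: the epoch timestamp is in seconds, interpreted as UTC.
    posted = datetime.fromtimestamp(meta["time"], tz=timezone.utc)
    return {
        "id": meta["id"],
        "where": f'{source.get("site", "?")}/{source.get("board", "?")}',
        "posted_utc": posted.isoformat(),
        "has_text": "text" in post,  # text is optional (image-only posts)
        "image_names": [img["name"] for img in post.get("images", [])],
        "references": [r["reference"] for r in post.get("referenced_posts", [])],
    }


# Synthetic example post, shaped like the schema but not taken from the dataset.
sample = {
    "post_metadata": {
        "id": 1,
        "author": "Q",
        "source": {"board": "pol", "site": "4ch"},
        "time": 86400,
    },
    "text": "placeholder",
    "referenced_posts": [{"reference": ">>123"}],
}
print(summarize(sample))
```

Optional keys are read with `.get()` so that posts without images, referenced posts, or a link do not raise `KeyError`.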

Debugging

If the filename key is present in post_metadata, it is a string indicating the HTML file that particular post was pulled from. This can be combined with the other commented-out debugging blocks, which can restrict the parsing to a single post from a single file; that is helpful when debugging extraction, formatting, etc.

Misc. Analysis Snippets

Extract all Q posts to file

To extract all of Q and only Q's posts without regard to images, referenced posts, etc., this jq command can be used (the | select(.) carves out null values generated by extracting the non-existent text key from posts with images only):

cat posts.json | jq --raw-output '.posts[].text | select(.)' > aggregated.txt

Get top ten most linked-to domains in Q posts (and count of links)

To get the top ten domains, we'll first extract the post body and grep for all URL-like structures. Then we'll use awk to set both : and / as field separators and extract the fourth field (the domain). Then we'll use the common idiom of sort | uniq -c | sort -nr: sort will alphabetically sort the posts so that uniq -c can count the number of unique occurrences (in actuality uniq -c only provides a count of line repeats, but since it's sorted the number of repeats will be the number of times the unique line occurred) and then sort -nr will sort numerically in reverse order, giving us occurrences listed by descending count. Finally, head -10 will extract the first ten lines from the results.

Note that this will give garbled results without using the url-normalized version (posts.url-normalized.json).

cat posts.json | jq --raw-output '.posts[].text | select(.)' | grep -Eo "(http\S*)" | awk -F'[/:]' '{print $4}' | sort | uniq -c | sort -nr | head -10
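For anyone who prefers to stay in Python, the same count can be approximated as follows. This is a sketch, not part of the repo: `top_domains` is a hypothetical helper, and it uses `urllib.parse` to pull out the host instead of splitting on `/` and `:`, so results may differ slightly from the shell pipeline on malformed links.

```python
import json
import re
from collections import Counter
from pathlib import Path
from urllib.parse import urlsplit

# Same idea as the grep step: anything starting with http(s):// up to whitespace.
URL_RE = re.compile(r"https?://\S+")


def top_domains(posts, n=10):
    """Return the n most frequently linked hosts as (host, count) pairs."""
    counts = Counter()
    for post in posts:
        # text is optional, so fall back to an empty string.
        for url in URL_RE.findall(post.get("text") or ""):
            try:
                host = urlsplit(url).hostname
            except ValueError:
                continue  # skip links that cannot be parsed at all
            if host:
                counts[host] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    # Use the URL-normalized dump; the raw one contains spaced-out links.
    path = Path("posts.url-normalized.json")
    if path.exists():
        with open(path, encoding="utf-8") as f:
            for host, count in top_domains(json.load(f)["posts"]):
                print(count, host)
```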

Iterate with python

```python
import json

# The dump is UTF-8 (written with ensure_ascii=False), so read it as such.
with open('posts.json', encoding='utf-8') as f:
    data = json.load(f)

for post in data['posts']:
    # do things here

    # example of printing all post texts where they exist
    if 'text' in post:
        print(post['text'])
```

Fine Print

I provide this data for data analysis use only; the content is distasteful and misleading to put it charitably and I do not endorse it.

There are no warranties of fitness for purpose or correctness of this data -- I've done a best effort collation, and I make no guarantees my work is correct or complete.

The code in my extraction script and any other original components of this repo are licensed under MIT (and please cite me if my script or its results are utilized as part of academic research -- I'd love to read a preprint!); as the extracted posts are not my content, I obviously cannot license them in any degree.

Cite this work

If you use JSON-QAnon in a paper, check out the CITATION.cff file for the correct citation.

@misc{JSON-QANON,
  title={JSON-QAnon},
  author={Kingsman, Jack},
  year={2023},
  url={https://github.com/jkingsman/JSON-QAnon},
  month={Jan},
  doi="10.13140/RG.2.2.28778.32964",
  note={{\url{https://www.kaggle.com/datasets/jkingsman/qanondrops}}}
}

Citations

Eames, William J., III. "Changing Tides: Online Conspiracy Theory Use by Radical Violent Extremist Groups Over Time." Master's thesis, University of North Florida, 2023.
Olson, Liz. "From 4-chan to the Capitol: A Text-as-Data Analysis of QAnon." Student paper, Columbia School of International and Public Affairs, April 2021.
Thuland, Tora. "01chan.org." Art and Machine Learning project, Tisch School of the Arts, New York University, 2022. Previously available at https://www.00110000chan.org/. Information available at https://itp.nyu.edu/thesis2022/?tora-thuland.

Owner

Name: Jack Kingsman
Login: jkingsman
Kind: user
Location: United States

Website: http://jacksbrain.com
Twitter: jwkingsman
Repositories: 89
Profile: https://github.com/jkingsman

All work here is mine, and not my employer's. Like my work? Donate $5 to pay for a week of server time for one of my projects! https://ko-fi.com/jackkingsman

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: JSON-QAnon
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Jack
    family-names: Kingsman
    email: jack.kingsman@gmail.com
    orcid: 'https://orcid.org/0009-0003-7942-8922'
identifiers:
  - type: doi
    value: 10.13140/RG.2.2.28778.32964
repository-code: 'https://github.com/jkingsman/JSON-QAnon'
repository: 'https://www.kaggle.com/datasets/jkingsman/qanondrops'
abstract: JSON and YAML representations of all QAnon drops.
keywords:
  - QAnon
  - Q-Anon
  - conspiracy
license: MIT

GitHub Events

Total

Issues event: 2
Watch event: 6
Issue comment event: 4
Push event: 22
Pull request event: 1
Fork event: 1

Last Year

Issues event: 2
Watch event: 6
Issue comment event: 4
Push event: 22
Pull request event: 1
Fork event: 1

Packages

Total packages: 2
Total downloads: unknown

Total dependent packages: 0
(may contain duplicates)
Total dependent repositories: 0
(may contain duplicates)
Total versions: 2

pypi.org: mseep-unknown

MCP server for QAnon drops for sociological research

Documentation: https://mseep-unknown.readthedocs.io/
License: other
Latest release: 0.3.0
published about 1 year ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 9.3%

Average: 30.9%

Dependent repos count: 52.6%

Last synced: about 1 year ago

pypi.org: mseep-qanon-mcp

MCP server for QAnon drops for sociological research

Documentation: https://mseep-qanon-mcp.readthedocs.io/
License: other
Latest release: 0.3.0
published about 1 year ago

Versions: 1
Dependent Packages: 0
Dependent Repositories: 0

Rankings

Dependent packages count: 9.3%

Average: 31.0%

Dependent repos count: 52.6%

Last synced: about 1 year ago

Science Score: 57.0%
