1. Connections
    # connect to my Hetzner instance
    ssh pecuchet@[your-server-address] -i ~/.ssh/id_ed25519
    # connect with neo4j and jupyter tunnels
    ssh -N -L 8888:[your-server-address]:8888 -L 7474:[your-server-address]:7474 -L 7687:[your-server-address]:7687 [your-server-address] -i ~/.ssh/id_ed25519
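A quick way to confirm the tunnels are listening is to probe the forwarded local ports. The following is a minimal sketch, not part of the original setup: the helper names `port_open` and `tunnel_status` are hypothetical, and the port numbers are taken from the tunnel command above. Note that `ssh -L` opens the local port even when the remote service is down, so this only checks the local end of the tunnel.

```python
import socket

# Local ports forwarded by the tunnel command above:
# 8888 (Jupyter), 7474 (Neo4j Browser), 7687 (Neo4j Bolt).
TUNNEL_PORTS = {"jupyter": 8888, "neo4j-browser": 7474, "neo4j-bolt": 7687}


def port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def tunnel_status(ports=None):
    """Map each service name to whether its local forwarded port is reachable."""
    if ports is None:
        ports = TUNNEL_PORTS
    return {name: port_open(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_up in tunnel_status().items():
        print(f"{name}: {'up' if is_up else 'down'}")
```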
2. Python and jupyter
2.1. Create venv and install jupyter in it
    cd /home/pecuchet/UOC
    python3 -m venv tfg-venv
    source tfg-venv/bin/activate
    # python3 -m pip install jupyter
    jupyter lab
To start Jupyter afterwards, keep this script ready:
    # start_tfg_jupyter
    source /home/pecuchet/UOC/tfg-venv/bin/activate
    nohup jupyter lab &
To reach Jupyter remotely without an SSH tunnel, follow [these instructions](https://dbusteed.github.io/setup-jupyter-lab-on-remote-server/ "instructions").
Then access it from the local machine at [http://[your-server-address]:8000/lab](http://[your-server-address]:8000/lab).
3. Neo4j
Access the Neo4j Browser running on the server through an SSH tunnel from your local machine.
    # use sudo for restarting neo4j
    cypher-shell
    # usr neo4j pw [PASSWORD]
    # ssh tunnel for browser
    ssh -N -L 7474:[your-server-address]:7474 -L 7687:[your-server-address]:7687 [your-server-address] -i ~/.ssh/id_ed25519
    # visit on local machine
    http://localhost:7474/browser/
3.1. Data model
A sample user and server would be:
User: {
  "identity": 0,
  "labels": [
    "User"
  ],
  "properties": {
    "server": "https://mastodon.eugasser.com",
    "followers": 11,
    "following": 62,
    "name": "pecuchet",
    "uri": "https://mastodon.eugasser.com/users/pecuchet"
  },
  "elementId": "0"
}
Server: {
  "identity": 3856,
  "labels": [
    "Server"
  ],
  "properties": {
    "url": "https://mastodon.eugasser.com"
  },
  "elementId": "3856"
}
We have the following relationships.
    (:User)-[:FOLLOWS]->(:User)
    (:User)-[:IN_COMUNITY]->(:Server)
    (:User)-[:SCRAPED_ON]->(:Round)
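As an illustration of how this model could be written to the database without creating duplicates, here is a minimal sketch that builds parameterized Cypher `MERGE` statements. It assumes that `uri` identifies a user uniquely and `url` identifies a server uniquely; the function names `user_merge_query` and `follows_query` are hypothetical and not taken from the project code.

```python
def user_merge_query(user):
    """Build a Cypher statement that upserts a user, its server and their link.

    Assumption: `uri` is the unique key for :User and `url` for :Server.
    """
    query = (
        "MERGE (u:User {uri: $uri}) "
        "SET u.name = $name, u.server = $server, "
        "u.followers = $followers, u.following = $following "
        "MERGE (s:Server {url: $server}) "
        "MERGE (u)-[:IN_COMUNITY]->(s)"
    )
    keys = ("uri", "name", "server", "followers", "following")
    params = {key: user[key] for key in keys}
    return query, params


def follows_query(follower_uri, followed_uri):
    """Build a Cypher statement that records one FOLLOWS edge between two users."""
    query = (
        "MERGE (a:User {uri: $follower}) "
        "MERGE (b:User {uri: $followed}) "
        "MERGE (a)-[:FOLLOWS]->(b)"
    )
    return query, {"follower": follower_uri, "followed": followed_uri}


if __name__ == "__main__":
    sample_user = {
        "server": "https://mastodon.eugasser.com",
        "followers": 11,
        "following": 62,
        "name": "pecuchet",
        "uri": "https://mastodon.eugasser.com/users/pecuchet",
    }
    query, params = user_merge_query(sample_user)
    print(query)
    print(params)
```

With the official Neo4j Python driver, the returned pair could be passed to `session.run(query, params)`. Because `MERGE` matches before creating, running the same statement twice should leave a single node per `uri`.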
3.2. Concurrency/Threading
4. State of things / TODOs
- DONE Solve duplicate entries in DB
- PARALLELIZE [3/4]:
- DONE Queue object is SetQueue. Allows control of duplicates.
- DONE Clean users.py and implement mastodon api usage on 401 response on AP endpoint
- TODO Implement some sort of max retries/endless loop control
- DONE Implement a "done" marker in Neo4j as a FINISHEDON relationship. Create nodes of type :Timestamp with a timestamp attribute
- TODO Fix request response bugs
- TODO Improve general scrape speed and keep the number of worker threads consistent.
- TODO Neo4j hangs every 24 hours. Find out why and fix.
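Two of the items above, the duplicate-free queue and the pending max-retries control, can be sketched as follows. This is only an illustration under assumptions: the project's actual `SetQueue` is not shown in these notes, so the class below is a hypothetical reimplementation that wraps `queue.Queue` and remembers every item it has ever accepted, and `call_with_retries` is a hypothetical helper name.

```python
import queue
import threading
import time


class SetQueue:
    """Thread-safe FIFO queue that accepts each item at most once."""

    def __init__(self):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def put(self, item):
        """Enqueue item unless it was seen before; return True if enqueued."""
        with self._lock:
            if item in self._seen:
                return False
            self._seen.add(item)
        self._queue.put(item)
        return True

    def get(self, timeout=None):
        """Block until an item is available (raises queue.Empty on timeout)."""
        return self._queue.get(timeout=timeout)

    def task_done(self):
        self._queue.task_done()

    def join(self):
        self._queue.join()


def call_with_retries(func, max_attempts=3, base_delay=0.0):
    """Call func() until it succeeds, giving up after max_attempts tries.

    Waits base_delay * 2**(attempt - 1) seconds between tries and re-raises
    the last exception, so a failing request cannot loop forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    q = SetQueue()
    print(q.put("https://mastodon.eugasser.com/users/pecuchet"))
    print(q.put("https://mastodon.eugasser.com/users/pecuchet"))
```

Keeping the seen-set outside `queue.Queue` avoids touching its internal task counter, so `join()` still returns once every accepted item has been marked with `task_done()`. The trade-off is that the set grows with every URI ever queued.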