Science Score: 52.0%
This score indicates how likely this project is to be science-related based on various indicators:
-
✓CITATION.cff file
Found CITATION.cff file -
✓codemeta.json file
Found codemeta.json file -
✓.zenodo.json file
Found .zenodo.json file -
○DOI references
-
○Academic publication links
-
○Academic email domains
-
✓Institutional organization owner
Organization rcgsheffield has institutional domain (www.sheffield.ac.uk) -
○JOSS paper metadata
-
○Scientific vocabulary similarity
Low similarity (10.8%) to scientific vocabulary
Keywords
Repository
AIRBODS data pipelines and storage
Basic Info
- Host: GitHub
- Owner: rcgsheffield
- License: mit
- Language: Python
- Default Branch: main
- Homepage: https://www.sheffield.ac.uk/it-services/research
- Size: 274 KB
Statistics
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 7
- Releases: 1
Topics
Metadata Files
README.md
AIRBODS data pipeline
Data pipelines and data storage for AIRBODS air measurement experiments. The primary data pipeline downloads all the raw data available on a single Datacake "workspace" (specified by its unique identifier) and copies those data onto a database. It then transforms that data to make a "clean" data set available for research use.
This document contains some descriptions of how the system is built and how to administer and maintain it.
The data are described in the Metadata section below.
Code examples are contained the the examples directory that can be used to retrieve data in various languages.#
This repository contains an Ansible project as described below.
For an overview, see this blog post "Supporting airborne infection research with data engineering" by the Research & Innovation IT team in IT Services at the University of Sheffield.
The prototype of this system is in rcgsheffield/ventilation-for-hospitality.
Overview
Architecture
There is an overview of the system in the architecture diagram.
```mermaid flowchart LR s1[Sensor] s2[Sensor] s3[Sensor] dc("Cloud data system (Datacake)") wf["Workflow orchestrator (Airflow)"] sql[("SQL Database (PostgreSQL)")] gd["Metadata registry (Google Sheets)"]
s1 --> dc s2 --> dc s3 --> dc dc --> wf gd --> wf wf --> sql ```
The system comprises two major parts:
- Workflow Orchestrator: The data pipelines are implemented using a workflow orchestrator called Apache Airflow.
- Database: The data are stored in a relational database management system implemented using PostgreSQL.
- A web-based database administration interface is available to manage this database.
The workflow orchestrator comprises the subcomponents inside the black dotted line. The user roles are represented by red person icons. The cloud services are shown as blue clouds. The services are represented as green rectangles. The databases are shown as yellow cylinders.
There are three typical user roles:
- End user (researcher) will access the SQL database using a certain security role.
- Data engineer will access the workflow orchestrator using its web portal.
- A system administrator will access the virtual machine(s) on which the system runs.
This is a more detailed diagram of the different services and the infrastructure they run on.
```mermaid flowchart LR sheets datacake subgraph "Campus network" subgraph "Virtual machine 1" web["Administration tool"] sched["Scheduler"] backend[("Result backend")] broker[("Message broker")] end subgraph "Virtual machine 2" worker["Worker(s)"] end subgraph "Virtual machine 3" db[("PostgreSQL database")] dbadmin["Database administration tool"] end end
web <--> sched sched <--> broker broker <--> worker backend <--> worker worker --> db dbadmin <--> db sheets --> worker datacake --> worker ```
Workflow orchestrator
The data pipelines are defined using three directed acyclic graphs (DAGs) that are defined in airflow/dags/*.py.
datacakedownloads, stores and transforms data retrieved via the Datacake API.datacake_devicesretrieves and stores metadata about sensors from the Datacake APIdeploymentsretrieves and stores information about sensors and their deployments at certain locations from a Google Sheets spreadsheet.
The configuration for these is stored in airflow/variables.json.
There is a custom operator used to access GraphQL APIs via HTTP in airflow/plugins/operators/graphql.py that is used in the workflows that interact with Datacake, which exposes this kind of API.
Database
The RDMS service contains three databases:
airbodscontains the research data and is accessible by the end users. The tables are described in the metadata section below.airflowcontains the metadata and system data for the workflow orchestrator. This may be accessed by the user account (role) namedairflow.postgrescontains system data for the database service itself
The database settings are specified in the Ansible playbook and the configuration files postgresql.conf and pg_hba.conf as discussed in the PostgreSQL 12 documentation.
Installation
The system comprises multiple subsystems as described above that are installed using an automated deployment tool to facilitate continuous integration/continuous deployment (CICD).
Ansible
Automated deployment is implemented using Ansible, which defines a workflow comprising a series of tasks that install software and configure services. This workflow is executed on one or more remote machines. See their docs: Executing playbooks for troubleshooting. Most of the service configuration files are in the files directory (Ansible will automatically search this directory for files to upload.) Variables are defined in the group_vars/all YAML file.
You may need to install Ansible in a Python virtual environment if you need to use a more up-to-date version of the software, which is written in Python. To use the tool you'll need to activate the virtual environment with a command similar to the one below, assuming the environment was created in the directory ~/ansible:
bash
source ~/ansible/bin/activate
Secrets
The secret passwords used by each subsystem are defined in local files in the ./secrets directory, which isn't included in this repository for security reasons.
The secrets/variables.json file contains the fernet key used for encryption of passwords by Airflow. This file looks like this, where the Fernet key may be generated using the Python code snippet below.
json
{
"fernet_key": "********************"
}
To generate a Fernet key:
python
import cryptography.fernet
fernet_key = cryptography.fernet.Fernet.generate_key()
print(fernet_key.decode())
Key generation
Security keys, certificate requests and certificates may be generated using openssl. Certificate-authority-signed certificates were retrieved via ITS Helpdesk. This is particularly important for the SQL database because this service will be exposed to the risks associated with access via the public internet.
Airflow webserver
Generate key and certificate:
openssl req -nodes -x509 -newkey rsa:4096 -keyout secrets/webserver_key.pem -out secrets/webserver.pem -days 365
-nodes means "No DES" or an unencrypted private key which has no password.
Deployment
The deployment script installs the necessary software subsystems on a remote machine. The script should be idempotent so it can be run repeatedly without causing problems.
Secure shell access
The private key must be installed and configured on the target machine so that the control node may connect using secure shell (SSH). See this tutorial: How To Configure SSH Key-Based Authentication on a Linux Server. The ssh-copy-id tool is useful here to install your key information on a remote host.
bash
ssh-copy-id $USER@airbods.my-domain.com
ssh-copy-id $USER@airbods01.my-domain.com
ssh-copy-id $USER@airbods02.my-domain.com
You can view the installed files like so:
bash
ssh $USER@airbods.my-domain.com -i ~/.ssh/id_rsa "ls -la ~/.ssh/"
The ida_rsa file is that user's private key. The authorized_keys file is used to list the public keys that can automatically connect. These files would be stored in the directory ~/.ssh for the user you use to connect.
You can test the connection like so:
bash
ssh $USER@airbods.my-domain.com -i ~/.ssh/id_rsa "echo OK"
The tool ssh-agent is useful to save time by storing the password for encrypted private keys so you don't have to repeatedly type it in. See the Ansible docs connection info.
bash
ssh-agent bash
ssh-add ~/.ssh/id_rsa
Superuser access
It's assumed that you log into the remote system as a user (with username $USER) that is a member of the system administrator's group airbodsadmins. Contact the ITS Unix team with queries about this.
You'll also need to be able to run executables as other users. To test this, try to run the following commands (which will only work after the relevant users have been created.)
bash
sudo -u postgres id
sudo -u airflow id
Running Ansible
The variable $USER is used to specify which username to use to connect to the remote host.
bash
USER=<sa_username>
Use the shell variable $INVENTORY to select the target environment:
```bash INVENTORY=hosts-dev.yaml
INVENTORY=hosts-prod.yaml
```
Check Ansible is working. (You probably need to use ssh-agent as described above.)
```bash
View Ansible package version
ansible --version
View inventory
ansible --inventory $INVENTORY --list-hosts all
Ping all remote hosts
ansible --inventory $INVENTORY --user $USER all -m ping
Run a custom command on each host
ansible --inventory $INVENTORY --user $USER -a "echo OK" all ```
To run the deployment script, we need to use the deployment script which is defined as an Ansible "playbook" using the ansible-playbook command (see ansible-playbook CLI docs).
The --inventory option determines which inventory of hosts will be targeted, where hosts.yaml contains the development environment and hosts-prod.yaml points to the production environment. You need to use a different inventory file which will point the deployment script to a different machine using the --inventory argument of the ansible-playbook command.
```bash
Check that the Ansible Playbook command is working
ansible-playbook --version
Check a playbook
ansible-playbook --inventory $INVENTORY --user $USER --ask-become-pass airbods.yaml --check ```
Install services (inside the selected environment) which assumes that all the secret password files have been created in the ./secrets directory.
bash
ansible-playbook --inventory $INVENTORY --user $USER --ask-become-pass airbods.yaml
Run automated tests:
bash
ansible-playbook --inventory $INVENTORY --user $USER --ask-become-pass test.yaml
Troubleshooting
If problems occur, check the logs and try the following steps:
- Re-run the deployment script
- Check service status (see the Monitoring section below) and logs.
- Check that each service can connect to the other relevant services.
- Ensure that the Ansible notify handler feature is enabled for any changes you may have have made.
- Restart the services on the remote host (or perhaps reboot the entire remote system manually)
- Use Ansible's verbose mode and other debugging features
- If you encounter a "file not found" error, it may be due to a missing secrets file which should be saved in the
./secrets/directory and contain a random password.
Scaling
Currently this deployment is designed to run a single machine, assuming that it's a small-scale instance. To scale this up to multiple machines, the hosts and deployment script will need to be adapted so that each machine performs a different role.
Maintenance
Routine maintenance must be performed on the server operating system and the various services that make up the system. There is a script called maintenance.yaml to automate some of this. This can be run as follows:
bash
ansible-playbook --inventory hosts-prod.yaml --user $USER --ask-become-pass maintenance.yaml
This will clear out various aspects of the system but isn't a substitute for a proper maintenance routine.
Usage
There are several ways to interact with and control each part of the system.
Data access
The research data are contained in a Structured Query Language (SQL) database. There are several ways to connect to this SQL database, either using software packages or programming languages. See the examples directory for a list of available tools.
To connect to the database machine, use the connection string is as follows:
uri
postgresql://$USERNAME:$PASSWORD@airbods.my-domain.com/airbods
PostgreSQL interactive terminal
You can connect to the database from your local computer with psql as follows:
bash
psql --host=airbods.my-domain.com --dbname=airbods --username=$USERNAME
You may need to change the username to something else. You can enter the password manually or use a pgpass file.
You can run a command by using the shell or as follows:
bash
psql --host=airbods.my-domain.com --dbname=airbods --username=$USERNAME --command "SELECT now();"
Backup
See PostgreSQL docs Chapter 25. Backup and Restore.
First, log into the database machine as an administrator, and run the following pg_dump command as the postgres user:
```bash
Log in with database service account
sudo su - postgres --shell /bin/bash
pg_dump airbods --verbose --file "/tmp/airbods-date -I.sql"
```
Airflow CLI
You can use the Airflow Using the Command Line Interface to view, control and administer the workflow orchestrator. You should run this tool as the service user airflow or as root.
```bash
Log in as system user (or use sudo -u airflow )
sudo su - airflow
/opt/airflow/bin/airflow --help
List users
/opt/airflow/bin/airflow users list
List DAGs
/opt/airflow/bin/airflow dags list ```
Testing
Airflow
See Testing a DAG.
To check that the DAG code may be imported:
bash
python dags/datacake.py
Test a specific task:
bash
airflow task test <dag_id> <task_id> <date>
airflow tasks test datacake all_devices_history 2021-06-01
Run unit tests:
bash
python -m unittest --failfast
Database administration
The database service may be controlled using systemd:
```bash
Restart PostgreSQL service
sudo systemctl restart postgresql ```
To connect to the SQL database, see code examples.
pgAdmin web interface
The pgAdmin tool is available by opening airbods.my-domain.com using a web browser.
Configure server
The connection to the database must be configured using the Server Dialogue.
- If no servers are configured, click "Create Server"
- On the "General" tab, populate these fields:
- Name:
Airbods - Shared:
Yes
- Name:
- On the "Connection" tab, populate these fields:
- Host name:
localhost - Username:
pgadmin - Password:
***************(this must be kept secret) - Save password:
Yes
- Host name:
- Click "Save"
Add pgAdmin user
These user accounts are used to access pgAdmin only (not the SQL database itself.)
- Once logged in, to open the User Management panel, click on the arrow in the top-right corner of the screen.
- Click "Users"
- Click the "+" icon on the top-right of the "User Management" window
- Fill in the form fields
- Role = Administrator
- hit "Enter"
Create users using pgAdmin
Follow the steps below to create a new login (user credentials) to access the SQL database.
- Log into the pgAdmin tool
- On the browser (left pane) select the Airbods server
- Right-click on the server and click on "Create" -> "Login/Group Role"
- On the first tab called "General" in the name field, enter a username (don't use special characters)
- On the second tab, "Definition," enter a unique password
- On the "Privileges" tab, set "Can login?" to "Yes"
- On the "Membership" tab, next to "Member of" click on the "Select roles" box and select the "researcher" role.
- Click "Save"
As a check, the SQL tab should show something like this:
```sql CREATE ROLE joebloggs WITH LOGIN NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT NOREPLICATION CONNECTION LIMIT -1 PASSWORD 'xxxxxx';
GRANT researcher TO joebloggs; ```
PostgreSQL CLI
The PostgreSQL database can be administered using psql.
```bash sudo -u postgres psql
List databases
sudo -u postgres psql -c "\l"
List users
sudo -u postgres psql -c "\du" ```
User management
A database role exists for end users called researcher.
Create users using the CLI
Create new user credentials using the createuser shell command:
bash
createuser --pwprompt --role=researcher
You could also do this using CREATE ROLE:
sql
CREATE USER joebloggs LOGIN PASSWORD 'xxxxxx' IN ROLE researcher;
Role membership can also be managed for existing users.
Data pipeline management
The data pipelines are managed using Apache Airflow. See the Airflow tutorial.
Airflow web interface
The is an Airflow GUI available via the webserver service which is available via SSH tunnel. To test that this is available using curl using the command below, the --insecure flag is used to prevent the certificate being checked (because the certificate used is not signed by a certificate authority (CA) and is self-signed.)
```bash
Create SSH tunnel
ssh -L 4443:127.0.0.1:4443 $USER@airbods01.my-domain.com ```
This can be opened at this URL: https://localhost:4443. To view this interface in your web browser you'll need to skip any security warning messages about this certificate issue.
You can check it's worked by running this command from your local machine:
```bash
Check the web page is accessible
curl --insecure --head https://localhost:4443 ```
The DAG calendar view is useful to give an overview of the entire history of the workflow.
Airflow command line interface
To run these commands, you must log in to a worker machine and run the following commands as the user airflow (that is, prefix the commands with sudo -u airflow to run them as that service user.)
```bash
Log in as Airflow service user
sudo su - airflow --shell /bin/bash
Add the Airflow binary directory to the shell context ($PATH)
export PATH="/opt/airflow/bin:$PATH" ```
To check it's working:
```bash
Check Airflow CLI is working as expected
airflow version
Get environment information
airflow info
Check the metadata database connection
airflow db check ```
List workflows
To show the available workflows:
bash
airflow dags list
Clear
The state of failed tasks may be cleared using the GUI under Browse > DAG Runs. You can also use the CLI with the tasks clear command. This may be applied to an entire DAG run, or a subset of tasks, for a specified time range.
```bash
Clear all tasks within a time range
airflow tasks clear --start-date "YYYY-MM-DD" --end-date "YYYY-MM-DD" datacake
```
To only clear failed tasks:
bash
airflow tasks clear datacake --only-failed
Backfill
Using the Airflow CLI, use the backfill command (see CLI backfill docs) to run historic data pipelines:
```bash
airflow dags backfill $DAGID -s $STARTDATE -e $ENDDATE -t <taskregex>
airflow dags backfill datacake --start-date "$START_DATE" --end-date "$(date -I)" ```
Datacake
The workflows use the Datacake GraphQL API, which communicates over authenticated HTTPS. Please read those documents for an introduction to using this. This is implemented using an Airflow custom operator called GraphQLHttpOperator which is a subclass of the built-in SimpleHttpOperator. There is a browser-based interface at https://api.datacake.co/graphql/ where you need to input the access token and you can run test GraphQL queries.
The raw data are stored on the virtual machine in the directory /home/airflow/data as JSON files, one for each hourly batch operation.
Metadata
The following are the items in the database. There are two types of object: tables and views. Tables contain rows of data and views are predefined SQL queries that display, merge or process that data in a certain way.
The SQL DDL used to define and create this schema is contained in SQL files in the directory ansible/playbooks/files/database and is run by the deployment script.
Tables
rawcontains a copy of the original data retrieved from Datacake. It unadvisable to use this data for research analysis, usecleanor one of the views instead. Each row corresponds to a data capture reading by a sensor at a certain time. It has a timestamp column, several columns describing the sensor used and one column per metric.cleancontains the transformed data, taken fromraw, that is ready for usedevicecontains one row per sensor and has the following columns:device_idis a unique identifier created by Datacake, one per sensor entry on that platform. You can view the information about a device using its URL:https://app.datacake.de/airbods/devices/d/<device_id>serial_numberis the serial number taken from the physical device. This is the best way to uniquely identify a sensor because it won't change.verbose_nameis a human-readable label applied to each device which change over time.objectcontains a JSON object with all the additional device information stored on Datacake.
deploymentcontains one row for each time a sensor is moved to a new location for a period of time. There may be multiple rows per device, one for each combination of location and time period. Theserial_numbercolumn maps to the column of the same name on thedevicetable. Thestart_timeis the time when the device was deployed. Theend_timeis when this deployment ended, or is blank if this deployment continues. The other columns describe the location of the deployment.
Views
readingmerges the tablesclean,deviceanddeploymentto present the context for each clean data row, in addition to calculating some aggregated statistics. The deployment information is taken from the relevant deployment for that sensor at the time the reading was recorded. It contains the following columns:serial_numberis the sensor identifier.time_utcandtime_europe_londonis the time of each reading, displayed in each time zonecity,site,areaetc. are columns from thedeploymenttable describing the sensor location.air_quality,co2and other physical measurementsco2_zone_min,humidity_area_mean,temperature_zone_maxand other similar columns contain the aggregate statistics for each metric, partitioned over the location and day. The aggregate functions are the minimummin, averagemeanand maximummaxfor that deployment. The location may be the zone and area where the sensor was deployed. The day is the calendar date in UTC (GMT).
device_deploymentmerges the tablesdeviceanddeploymentbut contains one row per sensor for latest deployment. The columns are the same as those on the two source tables.
Quality Assurance
There are a number of ways to check the validity and integrity of the data in the database.
To check the time range of the raw data:
sql
SELECT DISTINCT left(raw.time_, 10) AS day
FROM airbods.public.raw
ORDER BY 1;
Similarly for the days in the clean data:
sql
SELECT DISTINCT clean.time_::DATE AS day
FROM airbods.public.clean
ORDER BY 1;
Monitoring
The system comprises several services which are managed using systemd.
View service status
You can view the status of the services using the following commands:
```bash systemctl status airflow-webserver systemctl status airflow-scheduler systemctl status airflow-worker systemctl status redis
RabbitMQ
sudo systemctl status rabbitmq-server sudo -u rabbitmq rabbitmq-diagnostics status
View PostgreSQL cluster status
systemctl status postgresql pg_lsclusters
pgAdmin4 web server
sudo systemctl status apache2
Airflow
sudo -u airflow /opt/airflow/bin/airflow info ```
View logs
Service logs are available using the journalctl command. The ---since option is used to filter by time.
```bash
Airflow service logs
sudo journalctl -u airflow-worker --reverse # most recent first sudo journalctl -u airflow-webserver --since "$(date -I)" # current day sudo journalctl -u airflow-scheduler --since "$(date -I) 12:00" # this afternoon ```
PostgreSQL database service logs:
```bash sudo systemctl status postgresql
List available log files (they're rotated regularly)
sudo ls -l /var/log/postgresql sudo tail /var/log/postgresql/postgresql-12-main.log ```
pgAdmin logs:
bash
sudo journalctl -u gunicorn --since "1 hour ago"
RabbitMQ message broker service logs:
```bash sudo journalctl -u rabbitmq-server --since "1 hour ago"
View logs for a particular node
sudo -u rabbitmq tail /var/log/rabbitmq/rabbit@localhost.log
List RabbitMQ log files
ls -lt /var/log/rabbitmq/
List RabbitMQ crash log files
sudo ls -lt /var/log/rabbitmq/log/ ```
Redis database service logs:
bash
sudo journalctl -u redis --since "$(date -I)"
System resource utilisation
Task summary
bash
top
Processor-related statistics:
bash
mpstat
Memory usage in MB:
bash
free --mega
Memory usage by process:
bash
htop --sort-key PERCENT_MEM
Storage space usage
bash
df -h
There are several tools to analyse disk usage, such as ncdu:
bash
sudo apt install ncdu
sudo ncdu /home
Worker monitoring
You can look at the workers and tasks using Flower, a celery monitoring tool. (Also see Airflow Flower docs.) This can be accessed using an SSH tunnel for port 5555:
bash
ssh -L 5555:127.0.0.1:5555 $USER@airbods01.my-domain.com
Then open http://localhost:5555 in a web browser on your computer.
Message broker
See: RabbitMQ rabbitmqctl and Management Command Line Tool. Examples:
```bash
Check it's installed
sudo -u rabbitmq rabbitmqctl version
View node status
sudo -u rabbitmq rabbitmqctl status
List users
sudo -u rabbitmq rabbitmqctl list_users ```
Management console
SSH tunnel port 15672 on the remote machine 127.0.0.1:15672 using the ssh command or putty.
bash
ssh -L 15672:127.0.0.1:15672 $USER@airbods01.my-domain.com
Then open http://localhost:15672 on your local machine.
Troubleshooting
See RabbitMQ docs Troubleshooting Guidance.
If there is a problem, try restarting the service:
bash
sudo systemctl restart rabbitmq-server
Security certificates
Check that private key fingerprint matches that of the certificate:
bash
openssl rsa -noout -modulus -in secrets/airbods_my_domain_com.key
openssl req -noout -modulus -in secrets/airbods_my_domain_com.csr
openssl x509 -noout -modulus -in files/airbods_my_domain_com_cert.cer
Check certificate expiration dates:
bash
openssl x509 -noout -dates -in secrets/webserver.pem
openssl x509 -noout -dates -in files/airbods_my_domain_com_cert.cer
Load balancer
The ITS load balancer airbods-lb.my-domain.com points to https://airbods.my-domain.com:443.
Owner
- Name: IT Services Research & Innovation team at the University of Sheffield
- Login: rcgsheffield
- Kind: organization
- Location: Sheffield, UK
- Website: https://www.sheffield.ac.uk/it-services/research
- Twitter: RIT_Sheffield
- Repositories: 12
- Profile: https://github.com/rcgsheffield
Citation (CITATION.cff)
cff-version: 1.2.0 message: "If you use this software, please cite it as below." authors: - family-names: "Heffer" given-names: "Joe Salvador Thornhill" orcid: "https://orcid.org/0000-0001-8733-1117" title: "Airbods" version: 0.0.1 date-released: 2021-08-12 url: "https://github.com/Joe-Heffer-Shef/flairbods"