academic-observatory-workflows

Telescopes, Workflows and Data Services for the Academic Observatory

https://github.com/the-academic-observatory/academic-observatory-workflows

Science Score: 64.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    2 of 14 committers (14.3%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.4%) to scientific vocabulary

Keywords

academic data higher-education research-evaluation science workflow

Keywords from Contributors

ebooks airflow observatory-platform
Last synced: 6 months ago

Repository

Basic Info
Statistics
  • Stars: 16
  • Watchers: 5
  • Forks: 2
  • Open Issues: 6
  • Releases: 12
Topics
academic data higher-education research-evaluation science workflow
Created over 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Contributing License Code of conduct Citation Zenodo

README.md

Academic Observatory Workflows

Academic Observatory Workflows provides Apache Airflow workflows for fetching, processing and analysing data about academic institutions.

Telescope Workflows

A telescope is a type of workflow that ingests data from different data sources, and can also run workflows that process data and output it to other places. Workflows are built on top of Apache Airflow's DAGs.

The workflows include: Crossref Events, Crossref Fundref, Crossref Metadata, Geonames, OpenAlex, Open Citations, ORCID, PubMed, ROR, Scopus, Unpaywall and Web of Science.

| Telescope Workflow | Description |
| ------------- | ------------- |
| Crossref Funder Registry | The Crossref Funder Registry is an open registry of grant-giving organization names and identifiers, which can be used to find funder IDs and include them as part of metadata deposits. It is a freely-downloadable RDF file. It is CC0-licensed and available to integrate with your own systems. Funder names from acknowledgements should be matched with the corresponding unique funder ID from the Funder Registry. |
| Crossref Metadata | Crossref is a not-for-profit membership organisation working on making scholarly communications better. It is an official Digital Object Identifier (DOI) Registration Agency of the International DOI Foundation. It provides metadata for every DOI that is registered with Crossref. |
| OpenAlex | OpenAlex is a free and open catalog of the global research system. |
| ORCID | ORCID is a non-profit organization that provides researchers with a unique digital identifier, which eliminates the risk of confusing one researcher with another who has the same name. ORCID provides a record that supports automatic links among all the researcher's professional activities. |
| PubMed | PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health, both globally and personally. |
| ROR | ROR is a global, community-led registry of open persistent identifiers for research organizations. |
| Scopus | Scopus is an Elsevier bibliometrics database containing abstracts and citations of journals, books and conference proceedings. |
| Unpaywall | Unpaywall is an open database of free scholarly articles. It includes data from open indexes like Crossref and DOAJ where it exists. Data comes from “monitoring over 50,000 unique online content hosting locations, including Gold OA journals, Hybrid journals, institutional repositories, and disciplinary repositories.” |

Documentation

For detailed documentation about the Academic Observatory, see the Read the Docs website: https://academic-observatory-workflows.readthedocs.io

Installation

Install using pip. From the root directory:

```bash
pip install -e ./academic-observatory-workflows[tests] --constraint https://raw.githubusercontent.com/apache/airflow/constraints-2.10.5/constraints-no-providers-3.10.txt
```
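
The constraints URL in the install command encodes two versions: the Airflow release (2.10.5) and the Python minor version (3.10). As a minimal sketch, assuming you would rather derive the URL from those two values than hard-code it, a helper could look like the following. The function name `constraints_url` is hypothetical and is not part of this repository.

```bash
# Hypothetical helper (not part of this repository): build the Airflow
# constraints URL from an Airflow version and a Python minor version.
constraints_url() {
  local airflow_version="$1" python_version="$2"
  printf 'https://raw.githubusercontent.com/apache/airflow/constraints-%s/constraints-no-providers-%s.txt\n' \
    "$airflow_version" "$python_version"
}

# Print (rather than run) the install command for the versions used above.
echo "pip install -e ./academic-observatory-workflows[tests] --constraint $(constraints_url 2.10.5 3.10)"
```

If you change either version, check that the resulting constraints file actually exists before installing.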

Deployment

These instructions show how to deploy the workflows to Google Cloud and Astronomer.io.

Prerequisites

You should have the following set up already:

* A Google Cloud Project.
* A Google Cloud Shell instance, which pre-installs gsutil, gcloud and kubectl.
* A GKE Autopilot Cluster.
* An Astronomer.io Airflow deployment, using Google Cloud.
* The Astronomer.io CLI: https://www.astronomer.io/docs/astro/cli/install-cli
* yq: https://github.com/mikefarah/yq (don't use sudo apt install yq, it installs the wrong tool)

The GKE Autopilot Cluster, the Astronomer.io deployment and the Google Cloud buckets (which the script below creates) should all be in the same region. The Cloud Storage buckets should be single-region, not dual- or multi-region; otherwise you will pay network costs for replication.
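
The region requirement can be checked before deploying. The sketch below is an assumption-laden illustration, not part of this repository: the function name `same_single_region` is hypothetical, and the pattern it uses for a single region (for example `us-central1`) is an assumed naming convention that rejects multi-region names such as `us` and dual-region codes such as `nam4`.

```bash
# Hypothetical helper: succeed only if every location passed in is the same
# single region. Names that do not look like "<area>-<name><digits>" are
# rejected, which excludes multi-regions ("us") and dual-regions ("nam4").
same_single_region() {
  local first="" loc
  for loc in "$@"; do
    # Normalise case, since some tools print locations in upper case.
    loc=$(printf '%s' "$loc" | tr '[:upper:]' '[:lower:]')
    # Assumed naming pattern for a single region.
    printf '%s' "$loc" | grep -Eq '^[a-z]+-[a-z]+[0-9]+$' || return 1
    if [ -z "$first" ]; then first="$loc"; fi
    [ "$loc" = "$first" ] || return 1
  done
  # Fail when no locations were given at all.
  [ -n "$first" ]
}

# Example usage (bucket name is a placeholder; the location field name is
# assumed from the gcloud describe output):
#   same_single_region \
#     "$(gcloud storage buckets describe gs://my-bucket --format='value(location)')" \
#     us-central1
```

The check is deliberately strict: any location it cannot recognise as a single region is treated as a failure, so it errs towards flagging a setup for manual review.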

Setup Google Cloud Project

In a Google Cloud Shell, run the following script to set up your Google Cloud Project:

```bash
./bin/setup-gcloud-project.sh gcp-project-id gke-cluster-name gke-namespace gcp-download-bucket-name gcp-transform-bucket-name
```

The script outputs information that you need for subsequent steps:

* AO Astro Service Account: required to set up the 'Customer Managed Identity' in Astronomer.io.
* Kube Config Path: required to configure the gke_cluster Airflow Connection.

If you are using additional buckets, you can enable GKE and/or Astro to access them with the following command:

```bash
./bin/setup-bucket-permissions.sh bucket-name service-account-email
```

Astronomer.io Customer Managed Identity

The AO Astro Service Account needs to be attached to the Astronomer.io deployment as a "Customer Managed Identity".

Please follow these steps to set it up: https://www.astronomer.io/docs/astro/authorize-deployments-to-your-cloud/?tab=gcp#setup

Step 6 is not necessary.

Astronomer.io Airflow Variables and Connections

The Airflow workflows are configured with a config file that is stored as an Airflow Variable. Copy config-example.yaml to config-prod.yaml and customise the settings.

Then deploy your config with the following command:

```bash
./bin/deploy-config astro-deployment-id gcp-project-id config-prod.yaml
```
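
Before deploying, it can help to confirm that config-prod.yaml exists and has actually been customised. The following is a minimal sketch under that assumption; `prepare_config` is a hypothetical helper, not a script shipped in `./bin`.

```bash
# Hypothetical pre-flight check (not part of this repository): create
# config-prod.yaml from the example if it is missing, then report whether it
# has been customised, i.e. whether it differs from the example.
prepare_config() {
  local example="${1:-config-example.yaml}" prod="${2:-config-prod.yaml}"
  [ -f "$example" ] || { echo "missing $example" >&2; return 2; }
  # Never overwrite an existing production config.
  [ -f "$prod" ] || cp "$example" "$prod"
  if cmp -s "$example" "$prod"; then
    echo "$prod is still identical to $example; edit it before deploying" >&2
    return 1
  fi
}

# Example usage:
#   prepare_config && ./bin/deploy-config astro-deployment-id gcp-project-id config-prod.yaml
```

The helper only compares the two files byte for byte; it does not validate the settings themselves.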

You will also need to create the following Airflow Connections, depending on what workflows you are using:

| Connection ID | Type | Login | Password | Host | Namespace | Kube config (JSON format) | Notes |
|---------------------------|--------------|----------|----------|----------|-----------|---------------------------|-------|
| awsopenalex | aws | required | required | | | | OpenAlex Telescope |
| awsorcid | aws | required | required | | | | ORCID Telescope |
| crossrefmetadata | http | | required | | | | Crossref Metadata Telescope |
| oadashboardgithubtoken | http | | required | | | | OA Dashboard Workflow |
| oadashboardzenodotoken | http | | required | | | | OA Dashboard Workflow |
| scopuskey1 | http | | required | | | | Scopus Telescope |
| unpaywall | http | | required | | | | Unpaywall Telescope |
| slack | slackwebhook | | required | required | | | Enables failure notifications to be sent to Slack |
| gke_cluster | kubernetes | | | | required | required | Enables communication with the GKE Autopilot Cluster. Required for Crossref Metadata, OpenAlex, PubMed, ORCID and Unpaywall. |
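
To work out which connections a given set of workflows needs, the table can be expressed as a lookup. This is a sketch only: the function names and the lower-case workflow keys are hypothetical, the connection IDs are copied as they appear in the table, and the Kubernetes connection is written `gke_cluster` as in the "Setup Google Cloud Project" section.

```bash
# Hypothetical lookup: map a workflow key to the connection IDs listed for it
# in the table above. The workflow keys are illustrative, not official names.
required_connections() {
  case "$1" in
    crossref-metadata) echo "crossrefmetadata gke_cluster" ;;
    openalex)          echo "awsopenalex gke_cluster" ;;
    orcid)             echo "awsorcid gke_cluster" ;;
    pubmed)            echo "gke_cluster" ;;
    unpaywall)         echo "unpaywall gke_cluster" ;;
    scopus)            echo "scopuskey1" ;;
    oa-dashboard)      echo "oadashboardgithubtoken oadashboardzenodotoken" ;;
    *) echo "unknown workflow: $1" >&2; return 1 ;;
  esac
}

# Print the de-duplicated, sorted set of connections for several workflows,
# one per line. Unknown workflows produce a warning on stderr and are skipped.
connections_for() {
  local workflow
  for workflow in "$@"; do
    required_connections "$workflow"
  done | tr ' ' '\n' | sort -u
}

# Example usage:
#   connections_for openalex pubmed scopus
```

The `slack` connection is left out of the lookup because it enables failure notifications rather than a specific workflow.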

Kubernetes Secrets

Kubernetes Pods can't access Airflow Connections, so workflows that need secrets inside a Pod also need those secrets stored as Kubernetes secrets. You can create them with the commands below.

Create Unpaywall API key secret:

```bash
kubectl create secret generic unpaywall \
  --from-literal=api-key=value \
  --namespace my-gke-namespace
```

Create Crossref Metadata API secret:

```bash
kubectl create secret generic crossref-metadata \
  --from-literal=api-key=value \
  --namespace my-gke-namespace
```
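
The two commands differ only in the secret name, so they can be wrapped in one function. The sketch below assumes that is convenient for your setup; the function name `create_api_key_secret`, the `DRY_RUN` switch and the environment variable names in the usage comments are all hypothetical.

```bash
# Hypothetical wrapper: create one generic secret holding a single "api-key"
# entry, mirroring the two commands above. With DRY_RUN=1 the kubectl command
# is printed (with the value masked) instead of being executed.
create_api_key_secret() {
  local name="$1" value="$2" namespace="$3"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "kubectl create secret generic $name --from-literal=api-key=*** --namespace $namespace"
  else
    kubectl create secret generic "$name" \
      --from-literal=api-key="$value" \
      --namespace "$namespace"
  fi
}

# Example usage (environment variable names are placeholders):
#   create_api_key_secret unpaywall "$UNPAYWALL_API_KEY" my-gke-namespace
#   create_api_key_secret crossref-metadata "$CROSSREF_METADATA_API_KEY" my-gke-namespace
```

Reading the key from an environment variable keeps the literal value out of the script itself, and the dry-run mode lets you review the secret name and namespace before anything is created.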

Deploy code to the Astronomer.io deployment

To deploy the project to Astronomer.io:

```bash
./bin/deploy.sh gcp-project-id astro-deployment-id
```

Owner

  • Name: The Academic Observatory
  • Login: The-Academic-Observatory
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Academic Observatory Workflows"
url: "https://github.com/The-Academic-Observatory/academic-observatory-workflows"
doi: 10.5281/zenodo.6366694
license: Apache-2.0 License
authors:
  - given-names: "Richard"
    family-names: "Hosking"
  - given-names: "James P"
    family-names: "Diprose"
  - given-names: "Aniek"
    family-names: "Roelofs"
  - given-names: "Tuan-Yow"
    family-names: "Chien"
  - given-names: "Alex"
    family-names: "Massen-Hane"
  - given-names: "Keegan R"
    family-names: "Smith"        
  - given-names: "Rebecca N"
    family-names: "Handcock"
  - given-names: "Bianca"
    family-names: "Kramer"
  - given-names: "Kathryn R"
    family-names: "Napier"
  - given-names: "Julian"
    family-names: "Tonti-Filippini"
  - given-names: "Lucy"
    family-names: "Montgomery"
  - given-names: "Cameron"
    family-names: "Neylon"
    

GitHub Events

Total
  • Create event: 26
  • Release event: 8
  • Issues event: 7
  • Delete event: 24
  • Member event: 1
  • Issue comment event: 5
  • Push event: 141
  • Pull request review comment event: 28
  • Pull request review event: 35
  • Pull request event: 45
  • Fork event: 1
Last Year
  • Create event: 26
  • Release event: 8
  • Issues event: 7
  • Delete event: 24
  • Member event: 1
  • Issue comment event: 5
  • Push event: 141
  • Pull request review comment event: 28
  • Pull request review event: 35
  • Pull request event: 45
  • Fork event: 1

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 303
  • Total Committers: 14
  • Avg Commits per committer: 21.643
  • Development Distribution Score (DDS): 0.637
Top Committers
Name Email Commits
Jamie Diprose j****e@g****m 110
aroelo a****s@c****u 76
tuanchien t****n@u****m 39
Bec Handcock 4****k@u****m 17
Richard Hosking r****d@h****m 15
Bec Handcock 4****k@u****m 10
Richard Hosking r****g@c****u 9
Cameron Neylon cn@c****t 6
Jamie Diprose 5****g@u****m 6
Keegan Smith 3****1@u****m 6
aroelo a****3@l****l 4
Alex Massen-Hane 1****e@u****m 3
kathrynnapier 4****r@u****m 1
Samuel Klein m****j@g****m 1

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 4
  • Total pull requests: 193
  • Average time to close issues: 3 days
  • Average time to close pull requests: 22 days
  • Total issue authors: 4
  • Total pull request authors: 11
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.85
  • Merged pull requests: 154
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 3
  • Pull requests: 29
  • Average time to close issues: 3 days
  • Average time to close pull requests: 13 days
  • Issue authors: 3
  • Pull request authors: 3
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.03
  • Merged pull requests: 19
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • aroelo (1)
  • denraf (1)
  • jdddog (1)
Pull Request Authors
  • jdddog (97)
  • keegansmith21 (32)
  • aroelo (27)
  • alexmassen-hane (21)
  • tuanchien (11)
  • kathrynnapier (8)
  • rhosking (4)
  • bechandcock (2)
  • bmkramer (2)
  • cameronneylon (1)
  • JulianTonti (1)
Top Labels
Issue Labels
bug (1)
Pull Request Labels

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 18 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: academic-observatory-workflows

Academic Observatory Workflows provides Apache Airflow Workflows for fetching, processing and analysing data about academic institutions.

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 18 Last month
Rankings
Dependent packages count: 10.1%
Stargazers count: 15.6%
Average: 20.2%
Dependent repos count: 21.6%
Downloads: 24.1%
Forks count: 29.8%
Maintainers (1)
Last synced: 6 months ago

Dependencies

docs/requirements.txt pypi
  • Sphinx >3
  • nltk ==3.
  • pbr ==5.5.
  • recommonmark ==0.7.
  • six ==1.16.
  • sphinx-autoapi ==1.8.
  • sphinx-rtd-theme ==0.5.
requirements.txt pypi
  • Deprecated >1,<2
  • backoff <2,>=1.11.0
  • beautifulsoup4 >=4.9.3,<5
  • boto3 >=1.15.0,<2
  • nltk ==3.
  • pandas >=1.3,<2
  • pyarrow >=6,<7
  • ratelimit ==2.2.
  • wos ==0.2.
  • xmltodict ==0.12.
.github/workflows/publish-pypi.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • pypa/gh-action-pypi-publish master composite
.github/workflows/unit-tests.yml actions
  • actions/checkout v2 composite
  • actions/setup-python v2 composite
  • codecov/codecov-action v1 composite
Dockerfile docker
  • quay.io/astronomer/astro-runtime 12.7.1 build
academic-observatory-workflows/pyproject.toml pypi
  • Deprecated >=1,<2
  • backoff >=2,<3
  • beautifulsoup4 >=4.9.3,<5
  • bigquery-schema-generator >=1.6.1,<2
  • biopython >=1.81,<2
  • boto3 >=1.15.0,<2
  • deepdiff >=8,<9
  • glom >=23.0.0,<24
  • limits >=4,<5
  • lxml >=5,<6
  • nltk >=3.9.1,<4
  • pandas >=1.3,<3
  • ratelimit >=2.2.0,<3
  • xmltodict >=0.12.0,<1
.github/workflows/build-and-push-container.yml actions
  • actions/checkout v4 composite
  • google-github-actions/auth v2 composite
.github/workflows/deploy.yml actions
  • astronomer/deploy-action v0.4 composite