https://github.com/alvarocavalcante/airflow-custom-deferrable-dataflow-operator

Start your Dataflow jobs directly from the Triggerer without going to the Worker!

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.5%) to scientific vocabulary

Keywords

airflow airflow-dags airflow-operators dags data-engineering dataflow gcp python python3
Last synced: 5 months ago

Repository

Start your Dataflow jobs directly from the Triggerer without going to the Worker!

Basic Info
  • Host: GitHub
  • Owner: AlvaroCavalcante
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 36.1 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 1
Topics
airflow airflow-dags airflow-operators dags data-engineering dataflow gcp python python3
Created 12 months ago · Last pushed 12 months ago
Metadata Files
Readme License

README.md

Airflow Custom Deferrable Dataflow Operator

Trigger Different: Cut Your Airflow Costs By Starting From the Triggerer!

Use this simple Airflow operator to start your Dataflow jobs directly from the Triggerer, without going through the Worker!

How It Works

The main idea of this approach is to start the task instance execution directly on the Triggerer component, bypassing the worker entirely.

This strategy is effective because, in this case, the only action the operator performs is making an HTTP request to start the external processing service and waiting for the job to complete.

For this reason, we can leverage the Triggerer's async design for this execution, significantly reducing resource consumption. With this architecture, the Airflow task execution process looks like the following:

(Diagram: Airflow task execution flow when starting from the Triggerer instead of the Worker)

To learn more about how the tool works, check out the Medium article.
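
As a rough, hypothetical sketch of the pattern (not this project's actual source), the core of such a setup is an Airflow `BaseTrigger` whose async `run` method executes on the Triggerer's event loop. The class name, module path, and the `_launch_job` / `_job_is_done` helpers below are placeholders:

```python
import asyncio

from airflow.triggers.base import BaseTrigger, TriggerEvent


class DataflowJobTrigger(BaseTrigger):
    """Illustrative trigger that launches a Dataflow job and awaits completion."""

    def __init__(self, project_id: str, region: str, body: dict):
        super().__init__()
        self.project_id = project_id
        self.region = region
        self.body = body

    def serialize(self):
        # Airflow calls this so the Triggerer process can recreate the trigger.
        return (
            "my_package.triggers.DataflowJobTrigger",  # hypothetical module path
            {"project_id": self.project_id, "region": self.region, "body": self.body},
        )

    async def _launch_job(self) -> str:
        # Placeholder: a real trigger would call the Dataflow REST API here
        # with an async HTTP client and return the new job's ID.
        return "dataflow-job-id"

    async def _job_is_done(self, job_id: str) -> bool:
        # Placeholder: a real trigger would poll the job's state here.
        return True

    async def run(self):
        # Runs on the Triggerer's asyncio event loop, so waiting for the job
        # ties up no worker slot, only a lightweight coroutine.
        job_id = await self._launch_job()
        while not await self._job_is_done(job_id):
            await asyncio.sleep(60)
        yield TriggerEvent({"status": "success", "job_id": job_id})
```

Since Airflow 2.10 (which this project requires, per its setup.py), an operator can also set `start_from_trigger=True` together with matching `start_trigger_args`, so the task is handed straight to the Triggerer without first occupying a worker slot.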

Installation

The installation process will depend on your cloud provider or how you have set up your environment.

In Google Cloud Composer, for example, the DAGs folder is not synchronized with the Airflow Triggerer, as stated in the documentation.

Consequently, just uploading your code to the DAGs folder will not work, and you'll likely face an error like this: `ImportError: Module "PACKAGE_NAME" does not define a "CLASS_NAME" attribute/class`

In this case, the missing code has to come from PyPI, meaning that you'll need to install the operator/trigger as a separate library.

To do so, you can use the following command:

```bash
pip install custom-deferrable-dataflow-operator
```

Usage

After installing the library, you can successfully import and use the operator in your Airflow DAGs, as shown below:

```python
from deferrable_dataflow_operator import DeferrableDataflowOperator

dataflow_triggerer_job = DeferrableDataflowOperator(
    trigger_kwargs={
        "project_id": GCP_PROJECT_ID,
        "region": GCP_REGION,
        "body": {
            "job_name": MY_JOB_NAME,
            "parameters": {"dataflow-parameters": MY_PARAMETERS},
            "environment": {**dataflow_env_vars},
            "container_spec_gcs_path": TEMPLATE_GCS_PATH,
        },
    },
    start_from_trigger=True,
    task_id=MY_TASK_ID,
)
```

In the `trigger_kwargs` parameter, it's important to specify your GCP project ID and region. The `body` parameter, on the other hand, should contain all the relevant information for your Dataflow job, as stated in the official documentation.

Contributing

This project is open to contributions! If you want to help improve the operator, please follow these steps:

  1. Open a new issue to discuss the feature or bug you want to address.
  2. Once approved, fork the repository and create a new branch.
  3. Implement the changes.
  4. Create a pull request with a detailed description of the changes.

Owner

  • Name: Alvaro Leandro Cavalcante Carneiro
  • Login: AlvaroCavalcante
  • Kind: user
  • Location: São Paulo, SP
  • Company: Bank Master

Master's degree student and Data Engineer at Bank Master.

GitHub Events

Total
  • Release event: 1
  • Push event: 1
  • Create event: 3
Last Year
  • Release event: 1
  • Push event: 1
  • Create event: 3

Packages

  • Total packages: 1
  • Total downloads:
    • pypi: 78 last month
  • Total dependent packages: 0
  • Total dependent repositories: 0
  • Total versions: 10
  • Total maintainers: 1
pypi.org: custom-deferrable-dataflow-operator

Start your Dataflow jobs directly from the Triggerer without going to the Worker.

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 78 last month
Rankings
  • Dependent packages count: 9.6%
  • Average: 32.0%
  • Dependent repos count: 54.3%
Maintainers (1)
Last synced: 6 months ago

Dependencies

setup.py pypi
  • apache-airflow >=2.10.0
  • google-cloud >=0.33.0