Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach - Published in JOSS (2026)

https://github.com/ascii-supply-networks/ascii-hydra

Science Score: 92.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
○ codemeta.json file
○ .zenodo.json file
✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
✓ Academic publication links: Links to arxiv.org
✓ Committers with academic emails: 5 of 20 committers (25.0%) from academic institutions
✓ Institutional organization owner: Organization ascii-supply-networks has institutional domain (ascii.ac.at)
✓ JOSS paper metadata: Published in Journal of Open Source Software

Keywords

aws dagster databricks emr spark

Last synced: 4 months ago

Repository

Basic Info

Host: GitHub
Owner: ascii-supply-networks
License: gpl-3.0
Language: Python
Default Branch: sync_oss_v1.2.22
Homepage: https://ascii-supply-networks.github.io/research-space/
Size: 1.03 MB

Statistics

Stars: 4
Watchers: 2
Forks: 2
Open Issues: 9
Releases: 1

Topics

aws dagster databricks emr spark

Created almost 2 years ago · Last pushed 9 months ago

Metadata Files

Readme License

ASCII hydra

showcase

The ASCII Hydra project demonstrates a cost-efficient alternative to being locked into a specific cloud platform such as Databricks. The examples provide an out-of-the-box way of creating assets using local PySpark, Databricks, or AWS EMR.

prerequisites

- pixi, installed via `curl -fsSL https://pixi.sh/install.sh | bash`
- credentials for:
  - AWS EMR
    - ASCII_AWS_ACCESS_KEY_ID: Your AWS Access Key ID for EMR.
    - ASCII_AWS_SECRET_ACCESS_KEY: Your AWS Secret Access Key for EMR.
  - Databricks
    - DATABRICKS_HOST: The Databricks host URL.
    - DATABRICKS_CLIENT_ID: Your Databricks Client ID.
    - DATABRICKS_CLIENT_SECRET: Your Databricks Client Secret.
- a couple of environment variables:
  - SPARK_PIPES_ENGINE: Specifies the engine used for Spark Pipes (valid options: databricks, emr, or pyspark).
  - SPARK_EXECUTION_MODE: Defines the mode of execution for data (valid options: small_dev_sample_local, small_dev_sample_s3, or full).
  - DAGSTER_HOME: Path to the Dagster home directory where Dagster-related configurations and metadata are stored.
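
The prerequisites above can be checked before starting Dagster. The following is a minimal sketch, not part of the project: `missing_settings` is a hypothetical helper, and the mapping of engines to credentials assumes that local pyspark needs no credentials, which the list above implies but does not state.

```python
# Credentials each engine needs, taken from the prerequisites list.
# Assumption: the local "pyspark" engine needs no cloud credentials.
REQUIRED_BY_ENGINE = {
    "pyspark": [],
    "emr": ["ASCII_AWS_ACCESS_KEY_ID", "ASCII_AWS_SECRET_ACCESS_KEY"],
    "databricks": [
        "DATABRICKS_HOST",
        "DATABRICKS_CLIENT_ID",
        "DATABRICKS_CLIENT_SECRET",
    ],
}
VALID_MODES = {"small_dev_sample_local", "small_dev_sample_s3", "full"}


def missing_settings(env):
    """Return a list of problems found in `env` (hypothetical helper)."""
    problems = []

    engine = env.get("SPARK_PIPES_ENGINE")
    if engine not in REQUIRED_BY_ENGINE:
        problems.append(
            f"SPARK_PIPES_ENGINE must be one of {sorted(REQUIRED_BY_ENGINE)}, got {engine!r}"
        )
    else:
        # Only the credentials of the selected engine are checked.
        for name in REQUIRED_BY_ENGINE[engine]:
            if not env.get(name):
                problems.append(f"{name} is not set")

    mode = env.get("SPARK_EXECUTION_MODE")
    if mode not in VALID_MODES:
        problems.append(
            f"SPARK_EXECUTION_MODE must be one of {sorted(VALID_MODES)}, got {mode!r}"
        )

    if not env.get("DAGSTER_HOME"):
        problems.append("DAGSTER_HOME is not set")

    return problems
```

Calling `missing_settings(os.environ)` (after `import os`) returns an empty list when the configuration is complete for the selected engine.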

Explanation of `SPARK_EXECUTION_MODE`

The SPARK_EXECUTION_MODE environment variable controls the scope and source of the data used during pipeline execution. Its value is converted to the ExecutionMode class and is intended to be used as a flag at the external script level:

`small_dev_sample_local`:
This mode should use a small, locally stored sample dataset, ideal for fast development and testing on your local machine.

`small_dev_sample_s3`:
This mode should use a small sample dataset stored on Amazon S3, allowing you to test the pipeline in a cloud environment with minimal data.

`full`:
This mode should process the full dataset stored on Amazon S3, intended for complete runs and production-level processing.
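
The three modes can be pictured as an enumeration that an external script reads from the environment. This is a minimal sketch under stated assumptions: the class below is an illustrative stand-in for the project's ExecutionMode, and the `from_env` method and the two properties are hypothetical names, not the project's API.

```python
import os
from enum import Enum


class ExecutionMode(Enum):
    """Illustrative stand-in for the project's ExecutionMode class."""

    SMALL_DEV_SAMPLE_LOCAL = "small_dev_sample_local"
    SMALL_DEV_SAMPLE_S3 = "small_dev_sample_s3"
    FULL = "full"

    @classmethod
    def from_env(cls, env=None):
        """Parse SPARK_EXECUTION_MODE from `env` (defaults to os.environ)."""
        env = os.environ if env is None else env
        raw = env.get("SPARK_EXECUTION_MODE")
        if raw is None:
            raise ValueError("SPARK_EXECUTION_MODE is not set")
        # Enum lookup by value raises ValueError for unknown modes.
        return cls(raw)

    @property
    def reads_from_s3(self):
        # Only the local sample mode uses locally stored data.
        return self is not ExecutionMode.SMALL_DEV_SAMPLE_LOCAL

    @property
    def is_sample(self):
        # Both small_dev_sample_* modes work on a reduced dataset.
        return self is not ExecutionMode.FULL
```

An external script could branch on `ExecutionMode.from_env().reads_from_s3` to choose between a local path and an S3 location; the actual paths are specific to each pipeline and are not shown here.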

Creation of environment

To create the environment execute the following commands:

```bash
pixi run start

# testing

# check formatting
pixi run -e ci fmt

# check typing
pixi run -e ci lint

# run tests
pixi run -e testing test
```

Alternatively, use the Makefile via:

- `make start`
- `make test`
- `make fmt`
- `make lint`

The package can be installed using pixi via:

`pixi install`

An isolated Python environment is created automatically.

Execute `make start` and then go to http://localhost:3000 for the Dagster UI.

explanation

See https://georgheiler.com/post/paas-as-implementation-detail/ or the paper Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach.

For a detailed step-by-step example, see docs/use_assets.md.

Owner

Name: ASCII - Supply Chain Intelligence Institute Austria
Login: ascii-supply-networks
Kind: organization
Email: info@ascii.ac.at
Location: Austria

Website: https://ascii.ac.at/
Repositories: 1
Profile: https://github.com/ascii-supply-networks

Evidence-based decision-making for business and politics

JOSS Publication

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach

Published

March 12, 2026

DOI

10.21105/joss.07695

Volume 11, Issue 119, Page 7695

Authors

Hernan Picatto

Supply Chain Intelligence Institute Austria, Austria

Georg Heiler

Supply Chain Intelligence Institute Austria, Austria; Complexity Science Hub Vienna, Austria

Peter Klimek

Supply Chain Intelligence Institute Austria, Austria; Complexity Science Hub Vienna, Austria; Institute of the Science of Complex Systems, Center for Medical Data Science CeDAS, Medical University of Vienna, Austria; Division of Insurance Medicine, Department of Clinical Neuroscience, Karolinska Institutet, Sweden

Editor

Rohit Goswami

GitHub Events

Total

Release event: 2
Delete event: 30
Pull request event: 11
Fork event: 2
Issues event: 20
Watch event: 4
Issue comment event: 12
Push event: 14
Pull request review event: 1
Create event: 30

Last Year

Release event: 2
Delete event: 17
Pull request event: 9
Fork event: 1
Issues event: 13
Watch event: 1
Issue comment event: 7
Push event: 10
Pull request review event: 1
Create event: 16

Committers

Last synced: 9 months ago

All Time

Total Commits: 431
Total Committers: 20
Avg Commits per committer: 21.55
Development Distribution Score (DDS): 0.661

Past Year

Commits: 141
Committers: 12
Avg Commits per committer: 11.75
Development Distribution Score (DDS): 0.447

Top Committers

Name	Email	Commits
Georg Heiler	g**r@g**m	146
geoHeil	1**l@u**m	83
Georg Heiler	g**r@a**t	77
HPicatto	h**o@g**m	31
Maximilian Heß	8**s@u**m	18
Maximilian Heß	1**n@u**m	17
CI Hotfix	c**x@a**t	12
joshuazelle	1**e@u**m	12
schmoigl	4**l@u**m	10
Hernan Picatto	h**o@a**t	8
Jaber Fooladi	3**i@u**m	3
Daniel S. Katz	d**z@i**g	2
Devetak	3**k@u**m	2
Hernan	h**o@a**m	2
PeterKlimek	4**k@u**m	2
seyda-kose	1**e@u**m	2
Elma Dervic	4**c@u**m	1
Georg Heiler	h**r@u**t	1
Peter Reschenhofer	p**r@g**m	1
Rosie Hayward	1**1@u**m	1

Committer Domains (Top 20 + Academic)

ascii.ac.at: 3
utf.ascii.ac.at: 1
aigot.com: 1
ieee.org: 1

Issues and Pull Requests

Last synced: 7 months ago

All Time

Total issues: 13
Total pull requests: 8
Average time to close issues: 3 months
Average time to close pull requests: about 13 hours
Total issue authors: 2
Total pull request authors: 3
Average comments per issue: 0.31
Average comments per pull request: 0.0
Merged pull requests: 6
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 13
Pull requests: 4
Average time to close issues: 3 months
Average time to close pull requests: about 9 hours
Issue authors: 2
Pull request authors: 2
Average comments per issue: 0.31
Average comments per pull request: 0.0
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 0
