Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach - Published in JOSS (2026)
Science Score: 92.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 1 DOI reference(s) in JOSS metadata
- ✓ Academic publication links: Links to arxiv.org
- ✓ Committers with academic emails: 5 of 20 committers (25.0%) from academic institutions
- ✓ Institutional organization owner: Organization ascii-supply-networks has institutional domain (ascii.ac.at)
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: ascii-supply-networks
- License: gpl-3.0
- Language: Python
- Default Branch: sync_oss_v1.2.22
- Homepage: https://ascii-supply-networks.github.io/research-space/
- Size: 1.03 MB
Statistics
- Stars: 4
- Watchers: 2
- Forks: 2
- Open Issues: 9
- Releases: 1
Metadata Files
README.md
ASCII hydra
showcase
The ASCII Hydra project demonstrates a cost-efficient alternative to being locked into a specific cloud platform such as Databricks. The examples provide an out-of-the-box way of creating assets using local PySpark, Databricks, or AWS EMR.
prerequisites
- pixi
  `curl -fsSL https://pixi.sh/install.sh | bash`
- credentials for:
  - AWS EMR
    - `ASCII_AWS_ACCESS_KEY_ID`: Your AWS Access Key ID for EMR.
    - `ASCII_AWS_SECRET_ACCESS_KEY`: Your AWS Secret Access Key for EMR.
  - Databricks
    - `DATABRICKS_HOST`: The Databricks host URL.
    - `DATABRICKS_CLIENT_ID`: Your Databricks Client ID.
    - `DATABRICKS_CLIENT_SECRET`: Your Databricks Client Secret.
- set up a couple of environment variables:
  - `SPARK_PIPES_ENGINE`: Specifies the engine used for Spark Pipes (valid options: `databricks`, `emr`, or `pyspark`).
  - `SPARK_EXECUTION_MODE`: Defines the mode of execution for data (valid options: `small_dev_sample_local`, `small_dev_sample_s3`, or `full`).
  - `DAGSTER_HOME`: Path to the Dagster home directory where Dagster-related configurations and metadata are stored.
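A quick pre-flight check can catch a missing or misspelled variable before Dagster starts. The sketch below is illustrative only: the variable names and valid values are the ones listed above, but the helper `check_environment`, the grouping of credentials by engine, and the assumption that the `pyspark` engine needs no credentials are not taken from the project.

```python
from typing import Mapping

# Variable names come from the prerequisites list; the mapping of credentials
# to engines is an assumption made for illustration.
REQUIRED_BY_ENGINE = {
    "pyspark": [],  # assumed: local runs need no cloud credentials
    "emr": ["ASCII_AWS_ACCESS_KEY_ID", "ASCII_AWS_SECRET_ACCESS_KEY"],
    "databricks": [
        "DATABRICKS_HOST",
        "DATABRICKS_CLIENT_ID",
        "DATABRICKS_CLIENT_SECRET",
    ],
}
VALID_MODES = ("small_dev_sample_local", "small_dev_sample_s3", "full")
ALWAYS_REQUIRED = ["SPARK_PIPES_ENGINE", "SPARK_EXECUTION_MODE", "DAGSTER_HOME"]


def check_environment(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing is missing."""
    problems = [f"{name} is not set" for name in ALWAYS_REQUIRED if not env.get(name)]

    engine = env.get("SPARK_PIPES_ENGINE", "")
    if engine and engine not in REQUIRED_BY_ENGINE:
        problems.append(f"SPARK_PIPES_ENGINE has unknown value {engine!r}")

    mode = env.get("SPARK_EXECUTION_MODE", "")
    if mode and mode not in VALID_MODES:
        problems.append(f"SPARK_EXECUTION_MODE has unknown value {mode!r}")

    # Only the credentials of the selected engine are checked.
    for name in REQUIRED_BY_ENGINE.get(engine, []):
        if not env.get(name):
            problems.append(f"{name} is not set (needed for engine {engine!r})")
    return problems
```

Calling `check_environment(os.environ)` (after `import os`) in a shell session or a small script would list anything still missing for the selected engine.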
Explanation of SPARK_EXECUTION_MODE
The SPARK_EXECUTION_MODE environment variable controls the scope and source of the data used during pipeline execution. Its value is converted into the ExecutionMode class and is intended to be used as a flag at the external script level:
`small_dev_sample_local`:
This mode should use a small, locally stored sample dataset, ideal for fast development and testing on your local machine.
`small_dev_sample_s3`:
This mode should use a small sample dataset stored on Amazon S3, allowing you to test the pipeline in a cloud environment with minimal data.
`full`:
This mode should process the full dataset stored on Amazon S3, intended for complete runs and production-level processing.
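To make the three modes concrete, here is a minimal sketch of how such a flag could be modelled as an enum and read inside an external script. It is a hypothetical stand-in: the project's actual `ExecutionMode` class may look different, and the helpers `from_env`, `reads_from_s3`, and `is_sample` are names invented for this example.

```python
import os
from enum import Enum


class ExecutionMode(str, Enum):
    """Hypothetical stand-in for the project's ExecutionMode class."""

    SMALL_DEV_SAMPLE_LOCAL = "small_dev_sample_local"
    SMALL_DEV_SAMPLE_S3 = "small_dev_sample_s3"
    FULL = "full"

    @classmethod
    def from_env(cls, env=os.environ) -> "ExecutionMode":
        """Parse SPARK_EXECUTION_MODE; raise ValueError if unset or unknown."""
        raw = env.get("SPARK_EXECUTION_MODE")
        if raw is None:
            raise ValueError("SPARK_EXECUTION_MODE is not set")
        return cls(raw)  # Enum lookup raises ValueError for unknown values

    @property
    def reads_from_s3(self) -> bool:
        # Both the S3 sample and the full run read from Amazon S3.
        return self is not ExecutionMode.SMALL_DEV_SAMPLE_LOCAL

    @property
    def is_sample(self) -> bool:
        # Only the full mode processes the complete dataset.
        return self is not ExecutionMode.FULL
```

An external script could then branch on `ExecutionMode.from_env().reads_from_s3` to choose its input location, keeping the decision in one place instead of comparing strings throughout the code.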
Creation of environment
To create the environment execute the following commands:
```bash
pixi run start

# testing
# check formatting
pixi run -e ci fmt
# check typing
pixi run -e ci lint
# run tests
pixi run -e testing test
```
Alternatively, use the Makefile via:
make start
make test
make fmt
make lint
The package can be installed using pixi via:
pixi install
An isolated Python environment is created automatically.
Execute `make start` and then go to http://localhost:3000 for the Dagster UI.
explanation
See https://georgheiler.com/post/paas-as-implementation-detail/ or Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
For a detailed step-by-step example, see docs/use_assets.md.
Owner
- Name: ASCII - Supply Chain Intelligence Institute Austria
- Login: ascii-supply-networks
- Kind: organization
- Email: info@ascii.ac.at
- Location: Austria
- Website: https://ascii.ac.at/
- Repositories: 1
- Profile: https://github.com/ascii-supply-networks
Evidence-based decision-making for business and politics
JOSS Publication
Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach
Authors
Supply Chain Intelligence Institute Austria, Austria, Complexity Science Hub Vienna, Austria
Supply Chain Intelligence Institute Austria, Austria, Complexity Science Hub Vienna, Austria, Institute of the Science of Complex Systems, Center for Medical Data Science CeDAS, Medical University of Vienna, Austria, Division of Insurance Medicine, Department of Clinical Neuroscience, Karolinska Institutet, Sweden
Tags
Orchestration, PaaS, Apache Spark, Big Data, Databricks, AWS EMR, Cost Efficiency, Data Engineering
GitHub Events
Total
- Release event: 2
- Delete event: 30
- Pull request event: 11
- Fork event: 2
- Issues event: 20
- Watch event: 4
- Issue comment event: 12
- Push event: 14
- Pull request review event: 1
- Create event: 30
Last Year
- Release event: 2
- Delete event: 17
- Pull request event: 9
- Fork event: 1
- Issues event: 13
- Watch event: 1
- Issue comment event: 7
- Push event: 10
- Pull request review event: 1
- Create event: 16
Committers
Last synced: 6 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Georg Heiler | g****r@g****m | 146 |
| geoHeil | 1****l@u****m | 83 |
| Georg Heiler | g****r@a****t | 77 |
| HPicatto | h****o@g****m | 31 |
| Maximilian Heß | 8****s@u****m | 18 |
| Maximilian Heß | 1****n@u****m | 17 |
| CI Hotfix | c****x@a****t | 12 |
| joshuazelle | 1****e@u****m | 12 |
| schmoigl | 4****l@u****m | 10 |
| Hernan Picatto | h****o@a****t | 8 |
| Jaber Fooladi | 3****i@u****m | 3 |
| Daniel S. Katz | d****z@i****g | 2 |
| Devetak | 3****k@u****m | 2 |
| Hernan | h****o@a****m | 2 |
| PeterKlimek | 4****k@u****m | 2 |
| seyda-kose | 1****e@u****m | 2 |
| Elma Dervic | 4****c@u****m | 1 |
| Georg Heiler | h****r@u****t | 1 |
| Peter Reschenhofer | p****r@g****m | 1 |
| Rosie Hayward | 1****1@u****m | 1 |
Issues and Pull Requests
Last synced: 4 months ago
All Time
- Total issues: 13
- Total pull requests: 8
- Average time to close issues: 3 months
- Average time to close pull requests: about 13 hours
- Total issue authors: 2
- Total pull request authors: 3
- Average comments per issue: 0.31
- Average comments per pull request: 0.0
- Merged pull requests: 6
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 13
- Pull requests: 4
- Average time to close issues: 3 months
- Average time to close pull requests: about 9 hours
- Issue authors: 2
- Pull request authors: 2
- Average comments per issue: 0.31
- Average comments per pull request: 0.0
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- abhishektiwari (7)
- Midnighter (6)
Pull Request Authors
- HPicatto (6)
- geoHeil (1)
- danielskatz (1)