Projects

Updated 4 months ago

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach • Rank 5.6 • Science 92%

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach - Published in JOSS (2026)

aws dagster databricks emr spark

Updated 10 months ago

https://github.com/aehrc/variantspark • Rank 11.2 • Science 36%

machine learning for genomic variants

association-studies aws bioinformatics databricks emr genome gwas notebook random-forest variant-spark variantspark vcf

Updated 10 months ago

https://github.com/cdcgov/cdh-lava-react • Rank 3.4 • Science 26%

CDC Data Hub Lifecycle, Analysis & Visualization Accelerator (LAVA) REACT Components based on machine-readable requirements.

agile-development azure data-analysis data-catalog data-governance data-quality data-science data-visualization databricks datavisualization devops excel-export metadata operations powerautomate powerbi pyspark security sql test-automation

Updated 10 months ago

pysparklyr • Science 26%

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect

databricks pyspark r spark spark-connect

Updated 10 months ago

https://github.com/johnsnowlabs/johnsnowlabs • Science 26%

Gateway into the John Snow Labs Ecosystem

bert databricks gpt machine-learning natural-language-processing nlp python seq2seq spark t5

Updated 10 months ago

https://github.com/kruskal-labs/toolfront • Science 26%

Data retrieval for AI agents

agent analytics artificial-intelligence bigquery data-analysis data-engineering data-science database databricks dataops information-extraction information-retrieval machine-learning mcp mlops mysql python snowflake sql sqlite

Updated 10 months ago

https://github.com/data-miner00/spark • Science 26%

A laboratory to carry out experiments with PySpark

databricks pyspark python

Updated 10 months ago

https://github.com/dadananjesha/azuredataengine • Science 13%

AzureDataEngine is a robust, scalable batch processing data architecture built on the Azure platform. It efficiently extracts, transforms, and loads massive datasets for machine learning applications, leveraging Azure Blob Storage, PostgreSQL, Databricks, and Key Vault to ensure reliability and maintainability.

azure batch-processing blob-storage databricks etl etl-framework key-vault postgresql-database spark vnet

