funsies
funsies: A minimalist, distributed and dynamic workflow engine - Published in JOSS (2021)
https://github.com/airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://github.com/growthbook/growthbook
Open Source Feature Flagging and A/B Testing Platform
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://github.com/bacalhau-project/bacalhau
Community-driven, simple, yet powerful framework for fast, cost-effective distributed Compute over Data.
https://github.com/apache/hamilton
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
workbench
Workbench: An easy to use Python API for creating and deploying AWS SageMaker Models
https://github.com/datafold/data-diff
Compare tables within or across databases
https://github.com/larribas/dagger
Define sophisticated data pipelines with Python and run them on different distributed systems (such as Argo Workflows).
https://github.com/buchananja/dpyp
A convenience tool for small-scale data pipelines in Python
https://github.com/ploomber/soorgeon
Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://github.com/danielvartan/open-science-pres
🔎🔓 Open Science Presentation for the Sustentarea Research and Extension Center
https://github.com/raptor-ml/raptor
Transform your pythonic research to an artifact that engineers can deploy easily.
https://github.com/darkstarstrix/datavolt
Reusable data engineering toolkit. My personal data infrastructure.
template-data-package
An opinionated template for Data Packages built with Seedcase packages.
Carnival
Carnival: JVM Property graph data unification toolkit - Published in JOSS (2025)
https://github.com/data-prompt-query/dpq
dpq is an open-source Python library that makes prompt-based data transformations and feature engineering easy
https://github.com/apecloud/myduckserver
Unified MySQL, Postgres & FlightSQL Server, Powered by DuckDB.
https://github.com/alvarocavalcante/airflow-parse-bench
Stop creating bad DAGs! Use this tool to measure and compare the parse time of your DAGs, identify bottlenecks, and optimize your Airflow environment for better performance.
https://github.com/arenas-guerrero-julian/pg2rml-star
RML-star Mapping Bootstrapper from Property Graphs
https://github.com/cured-plus/csvw-duckdb
Convert a CSVW document (CSV metadata) to a DuckDB query to load a CSV file into a database.
data2neo
Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.
https://github.com/ccao-data/data-architecture
Codebase for CCAO data infrastructure construction and management
adequat_project_ai-and-optimization
Evolutionary Cost-Tolerance Optimization for Complex Assembly Mechanisms Via Simulation and Surrogate Modeling Approaches: Application on Micro Gears (http://dx.doi.org/10.21203/rs.3.rs-2487746/v1)
https://github.com/ibridges-for-irods/ibridges
A wrapper around the python-irodsclient to allow for easy interaction with iRODS servers.
preparing-your-mainframe-data-for-machine-learning
Mainframe Data Wrangling: Preparing Your Mainframe Data for Machine Learning
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It can also run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
ids-drr-assam-risk-model
Intelligent Data Solution - Disaster Risk Reduction is a system to assist flood management in the state of Assam through data-driven methods. The repository contains code to extract relevant datasets and the modelling approach used to calculate Risk Scores for each revenue circle in Assam.
signalslite
A small library to efficiently store and process global equity data, especially for Numerai's Signals tournament (WIP)
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
https://github.com/simantalahkar/lammpskit
lammpskit is a Python toolkit for post-processing and analyzing molecular dynamics (MD) simulations with LAMMPS. Its modular data processing and analysis functions are broadly applicable to scientific computing, data engineering, and machine learning workflows involving time series or semi-structured data.
https://github.com/alvarocavalcante/airflow-custom-deferrable-dataflow-operator
Start your Dataflow jobs execution directly from the Triggerer without going to the Worker!
https://github.com/danielvartan/r-course
🚀 Introductory R Course Developed for the Sustentarea Research and Extension Center