funsies
funsies: A minimalist, distributed and dynamic workflow engine - Published in JOSS (2021)
https://github.com/airbytehq/airbyte
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://github.com/growthbook/growthbook
Open Source Feature Flagging and A/B Testing Platform
https://github.com/apache/airflow
Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://github.com/bacalhau-project/bacalhau
Community-driven, simple, yet powerful framework for fast, cost-effective distributed Compute over Data.
https://github.com/apache/hamilton
Apache Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
workbench
Workbench: An easy to use Python API for creating and deploying AWS SageMaker Models
https://github.com/datafold/data-diff
Compare tables within or across databases
https://github.com/larribas/dagger
Define sophisticated data pipelines with Python and run them on different distributed systems (such as Argo Workflows).
https://github.com/buchananja/dpyp
A convenience tool for small-scale data pipelines in Python
https://github.com/ploomber/soorgeon
Convert monolithic Jupyter notebooks 📙 into maintainable Ploomber pipelines. 📊
https://github.com/danielvartan/open-science-pres
🔎🔓 Open Science Presentation for the Sustentarea Research and Extension Center
https://github.com/raptor-ml/raptor
Transform your pythonic research to an artifact that engineers can deploy easily.
template-data-package
An opinionated template for Data Packages built with Seedcase packages.
https://github.com/ibridges-for-irods/ibridges
A wrapper around the python-irodsclient to allow for easy interaction with iRODS servers.
https://github.com/ccao-data/data-architecture
Codebase for CCAO data infrastructure construction and management
https://github.com/desbordante/desbordante-core
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
adequat_project_ai-and-optimization
Evolutionary Cost-Tolerance Optimization for Complex Assembly Mechanisms Via Simulation and Surrogate Modeling Approaches: Application on Micro Gears (http://dx.doi.org/10.21203/rs.3.rs-2487746/v1)
https://github.com/apecloud/myduckserver
Unified MySQL, Postgres & FlightSQL Server, Powered by DuckDB.
Carnival
Carnival: JVM Property graph data unification toolkit - Published in JOSS (2025)
https://github.com/data-prompt-query/dpq
dpq is an open-source Python library that makes prompt-based data transformations and feature engineering easy.
ids-drr-assam-risk-model
Intelligent Data Solution - Disaster Risk Reduction is a system to assist flood management in the state of Assam through data-driven methods. The repository contains code to extract relevant datasets and the modelling approach used to calculate Risk Scores for each revenue circle in Assam.
https://github.com/alvarocavalcante/airflow-parse-bench
Stop creating bad DAGs! Use this tool to measure and compare the parse time of your DAGs, identify bottlenecks, and optimize your Airflow environment for better performance.
https://github.com/arenas-guerrero-julian/pg2rml-star
RML-star Mapping Bootstrapper from Property Graphs
https://github.com/cured-plus/csvw-duckdb
Convert a CSVW document (CSV metadata) to a DuckDB query to load a CSV file into a database.
https://github.com/darkstarstrix/datavolt
Reusable data engineering toolkit. My personal data infrastructure.
preparing-your-mainframe-data-for-machine-learning
Mainframe Data Wrangling: Preparing Your Mainframe Data for Machine Learning
https://github.com/simantalahkar/lammpskit
lammpskit is a Python toolkit for post-processing and analyzing molecular dynamics (MD) simulations with LAMMPS. Its modular data processing and analysis functions are broadly applicable to scientific computing, data engineering, and machine learning workflows involving time series or semi-structured data.
signalslite
A small library to efficiently store and process global equity data, especially for Numerai's Signals tournament (WIP)
https://github.com/danielvartan/r-course
🚀 Introductory R Course Developed for the Sustentarea Research and Extension Center
data2neo
Data2Neo is a library that simplifies the conversion of data in relational format to a graph knowledge database.
https://github.com/alvarocavalcante/airflow-custom-deferrable-dataflow-operator
Start your Dataflow job execution directly from the Triggerer without going to the Worker!