OnlineStats.jl
OnlineStats.jl: A Julia package for statistics on data streams - Published in JOSS (2020)
Openseize
Openseize: A digital signal processing package for large EEG datasets in Python - Published in JOSS (2023)
stemflow
stemflow: A Python Package for Adaptive Spatio-Temporal Exploratory Model - Published in JOSS (2024)
pretzel
JavaScript full-stack framework for Big Data visualisation and analysis
reductstore
High Performance Storage and Streaming Solution for Data Acquisition Systems
ustore
Multi-Modal Database replacing MongoDB, Neo4J, and Elastic with 1 faster ACID solution, with NetworkX and Pandas interfaces, and bindings for C 99, C++ 17, Python 3, Java, GoLang 🗄️
org.opendc:opendc-compute-api
Collaborative Datacenter Simulation and Exploration for Everybody
https://github.com/uxlfoundation/scikit-learn-intelex
Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application
h2o
H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
graphscope
🔨 🍇 💻 🚀 GraphScope: A One-Stop Large-Scale Graph Computing System from Alibaba
Critical care data processing tools
Critical care data processing tools - Published in JOSS (2017)
https://github.com/pachyderm/pachyderm
Data-Centric Pipelines and Data Versioning
eland
Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://github.com/cbg-ethz/pybda
💻💻💻 A command-line tool for analyzing big biological data sets on distributed HPC clusters.
workbench
Workbench: An easy-to-use Python API for creating and deploying AWS SageMaker Models
https://github.com/alleninstitute/vis
TypeScript packages for building big-data visualization tools & components, with examples for a variety of common data types & formats
https://github.com/bsc-wdc/dislib
The distributed computing library for Python, implemented using the PyCOMPSs programming model for HPC.
https://github.com/csinva/data-viz-utils
Functions for easily making publication-quality figures with matplotlib.
https://github.com/big-data-lab-umbc/cybertraining
Multidisciplinary Research and Education on Big Data + High-Performance Computing + Atmospheric Sciences at UMBC
https://github.com/aveek-saha/cricket-score-predictor
A big data application to predict the outcome of a T20 cricket match.
ai-commercial-decisionmaking
AI-Driven Large Dataset Analysis & Commercial Decision-Making: Research on predictive analytics, machine learning strategies, and real-world business applications [Python, TensorFlow, PyTorch] 🤖📊
https://github.com/amalan-constat/needs4bigdata
R package implementing subsampling methods to find informative samples from big data
https://github.com/awslabs/amazon-s3-find-and-forget
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
AquaFetch
AquaFetch: A Unified Python Interface for Water Resource Dataset Acquisition and Harmonization - Published in JOSS (2025)
https://github.com/talariadb/talaria
TalariaDB is a distributed, highly available, and low latency time-series database for Presto
https://github.com/azzaare/compressedstacks.cpp
Compressed structure for Stack Algorithms
big-qa-architecture
BigQA Architecture: Big Data Architecture for Question Answering Systems
https://github.com/erictleung/sports-popularity-in-usa
🏀 Analysis of sports popularity in the USA
deepicedrain
Mapping and monitoring deep subglacial water activity in Antarctica using remote sensing and machine learning, with ICESat-2!
https://github.com/SETL-Framework/setl
A simple Spark-powered ETL framework that just works 🍺
https://github.com/data-integrations/wrangler
Wrangler Transform: A DMD system for transforming Big Data
RangeExtractor
A performant way to extract subsections of arrays, under a tiling scheme. Meant for arrays with slow I/O.
marex
Marine Extremes detection, identification, and tracking/merging for Exascale Climate data
remotePARTS
remotePARTS: Spatiotemporal autoregression analyses for large data sets - Published in JOSS (2025)
path_based_traffic_flow_prediction
Forecast future traffic flow on a road network.
dockerunifieduimainterface
A UIMA-based tool for scalable, uniform, distributed, platform-independent, and easily reusable application of Natural Language Processing (NLP) methods using Docker.