Projects | Open Source Science

Scientific Software

Updated 11 months ago

Visions — Peer-reviewed • Rank 22.2 • Science 93%

Visions: An Open-Source Library for Semantic Data - Published in JOSS (2020)

data-analysis data-science hacktoberfest numpy pandas python spark type-inference type-system

Sociology

Scientific Software · Peer-reviewed

Updated 5 months ago

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach • Rank 5.6 • Science 92%

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach - Published in JOSS (2026)

aws dagster databricks emr spark

Updated 11 months ago

https://github.com/moj-analytical-services/splink • Rank 24.9 • Science 49%

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

data-matching data-science deduplicate-data deduplication duckdb em-algorithm entity-resolution fuzzy-matching record-linkage spark uk-gov-data-science

Updated 11 months ago

chuanhuchatgpt • Rank 13.6 • Science 54%

GUI for the ChatGPT API and many LLMs. Supports agents, file-based QA, GPT fine-tuning, and web-search queries, all with a neat UI.

chatbot chatglm chatgpt-api claude dalle3 ernie gemini gemma inspurai llama midjourney minimax moss ollama qwen spark stablelm

Updated 11 months ago

pathling • Rank 15.2 • Science 49%

Tools that make it easier to use FHIR® and clinical terminology within data analytics, built on Apache Spark.

analytics fhir spark standards terminology

Updated 11 months ago

h2o • Rank 27.7 • Science 36%

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

automl big-data data-science deep-learning distributed ensemble-learning gbm gpu h2o h2o-automl hadoop java machine-learning naive-bayes opensource pca python r random-forest spark

Updated 11 months ago

popmon • Rank 17.7 • Science 44%

Monitor the stability of a Pandas or Spark dataframe ⚙︎

covariate-shift data-analysis data-distributions data-profiling data-science dataset-shifts drift-detection hacktoberfest ing-bank ipython jupyter mlops monitoring pandas population-monitoring python spark statistical-process-control statistical-tests statistics

Updated 11 months ago

com.linkedin.isolation-forest • Rank 12.3 • Science 44%

A distributed Spark/Scala implementation of the isolation forest algorithm for unsupervised outlier detection, featuring support for scalable training and ONNX export for easy cross-platform inference.

anomaly-detection isolation-forest linkedin machine-learning onnx outlier-detection scala spark unsupervised-learning

Updated 11 months ago

https://github.com/awslabs/deequ • Rank 16.5 • Science 36%

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

dataquality scala spark unit-testing

Updated 11 months ago

fugue • Rank 25.1 • Science 26%

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

dask data-practitioners distributed distributed-computing distributed-systems machine-learning pandas spark sql

Updated about 1 month ago

ch.cern.spark:spark-avro_2.12 • Rank 40.6 • Science 10%

Apache Spark - A unified analytics engine for large-scale data processing

big-data java jdbc python r scala spark sql

Updated 11 months ago

https://github.com/broadinstitute/gatk • Rank 14.5 • Science 36%

Official code repository for GATK versions 4 and up

bioinformatics dna gatk genome genomics ngs science sequencing spark

Updated 11 months ago

https://github.com/databricks/koalas • Rank 26.4 • Science 23%

Koalas: pandas API on Apache Spark

big-data data-science dataframe mlflow pandas pydata spark

Updated 11 months ago

lab-dotnet-spark • Rank 1.4 • Science 44%

Proof of Concept for Local Testable Spark Environment

aspire spark

Updated 11 months ago

workbench • Rank 15.3 • Science 26%

Workbench: An easy-to-use Python API for creating and deploying AWS SageMaker Models

aws big-data data-engineering machine-learning pandas python spark

Updated 11 months ago

spark-submit • Rank 8.5 • Science 26%

Python manager for spark-submit jobs

apache spark submit

Updated 11 months ago

https://github.com/commoncrawl/cc-pyspark • Rank 7.9 • Science 26%

Process Common Crawl data with Python and Spark

common-crawl commoncrawl pyspark spark sparksql warc-files wat-files wet

Updated 11 months ago

https://github.com/commoncrawl/cc-index-table • Rank 6.2 • Science 26%

Index Common Crawl archives in tabular format

apache-parquet aws-athena columnar-storage commoncrawl spark sql

Updated 11 months ago

https://github.com/aphp/spark-etl • Rank 5.9 • Science 26%

A better bridge between Apache Spark and PostgreSQL

etl postgresql spark

Updated 11 months ago

https://github.com/bytedance/cloudshuffleservice • Rank 5.6 • Science 13%

Cloud Shuffle Service (CSS) is a general-purpose remote shuffle solution for compute engines, including Spark/Flink/MapReduce.

flink hadoop-mapreduce spark

Updated 11 months ago

https://github.com/azavea/geotrellis-collections-api-research • Rank 2.6 • Science 13%

A research project to investigate using GeoTrellis as a REST service

akka-http geotrellis leaflet react react-leaflet redux scala spark victory

Updated 11 months ago

https://github.com/ccao-data/service-spark-iasworld • Rank 1.8 • Science 13%

Service for extracting tables from the CCAO system-of-record and uploading them to the Data Department's data warehouse

etl iasworld spark

Updated 11 months ago

https://github.com/aveek-saha/cricket-score-predictor • Rank 1.4 • Science 13%

A big data application to predict the outcome of a T20 cricket match.

big-data big-data-analytics clustering pyspark spark spark-mllib

Updated 11 months ago

https://github.com/dadananjesha/spark-streaming • Science 13%

Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming.

apache apache-spark-streaming data-processing hdfs kafka kpi real-time spark spark-streaming

Updated 11 months ago

https://github.com/dadananjesha/credit-card-fraud-detection • Science 13%

Credit Card Fraud Detection is a state-of-the-art real-time streaming analytics solution designed to detect fraudulent credit card transactions instantly.

case-study credit-card fraud-detection fraud-prevention fraudulent-transactions iiit-bangalore kafka pyspark spark upgrad

Updated 11 months ago

https://github.com/rumbledb/rumble • Science 36%

⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

avro azure csv data-science dataframes hdfs json jsoniq machine-learning nested parquet query query-engine s3 scale schemaless spark svm text yaml

Updated 11 months ago

https://github.com/bigbio/pgatk-io • Science 13%

High-performance I/O library for proteogenomics

fileformats mass-spectrometry parquet proteogenomics proteomics spark

Updated 11 months ago

spark-dynamic-executor-time-prediction • Science 57%

Neural Network Models for Predicting Execution Time with Dynamic Executor Allocation in Apache Spark.

apache-spark big-data-analytics deep-learning distributed-computing dynamic-allocation execution-time-prediction machine-learning neural-networks performance-modeling spark

Updated 11 months ago

https://github.com/dadananjesha/redshift-etl-project • Science 13%

The project covers the complete data pipeline—from importing data from an RDS source to HDFS using Sqoop, processing data with Spark, to executing analytical queries on an AWS Redshift cluster.

apache-spark aws data-engineering-etl-assignment data-ingestion data-pipeline etl-processes hdfs rds redshift spark sqoop

Updated 11 months ago

https://github.com/johnsnowlabs/johnsnowlabs • Science 26%

Gateway into the John Snow Labs Ecosystem

bert databricks gpt machine-learning natural-language-processing nlp python seq2seq spark t5

Updated 11 months ago

https://github.com/awslabs/data-on-eks • Science 26%

DoEKS is a tool to build, deploy and scale Data Platforms on Amazon EKS

aws-eks eks jupyterhub kubeflow kubernetes ml mlflow ray spark terraform

Updated 11 months ago

acid • Science 26%

Generate the AutoCorrelation Integral Drill (ACID) test set with scripts and workflows for robust validation of autocorrelation algorithms. 🌐📊

3d-engine archlinux arduino cross-platform csharp database full-text-search game-engine gles gpu indexing iot renderer rna rna-structure-prediction spark transactional vulkan

Updated 11 months ago

https://github.com/ai-team-uoa/geotriples • Science 10%

Publishing Big Geospatial data as Linked Open Geospatial Data

geospatial rdf semantic-web spark

Updated 11 months ago

https://github.com/arturbomtempo-learning/projects-for-interdisciplinary-work-2 • Science 13%

Projects created to better learn about the technologies used in the Interdisciplinary Work 2 course of the Computer Science program.

eclipse java maven postgresql spark

Updated 11 months ago

https://github.com/dadananjesha/azuredataengine • Science 13%

AzureDataEngine is a robust, scalable batch processing data architecture built on the Azure platform. It efficiently extracts, transforms, and loads massive datasets for machine learning applications, leveraging Azure Blob Storage, PostgreSQL, Databricks, and Key Vault to ensure reliability and maintainability.

azure batch-processing blob-storage databricks etl etl-framework key-vault postgresql-database spark vnet

Updated 11 months ago

https://github.com/SETL-Framework/setl • Science 13%

A simple Spark-powered ETL framework that just works 🍺

big-data data-analysis data-engineering data-science data-transformation dataset etl etl-pipeline framework machine-learning modularization pipeline scala setl spark

Updated 11 months ago

analysis-pipelines • Science 10%

Enables data scientists to compose analysis pipelines consisting of data manipulation, exploratory analysis and reporting, and modeling steps. Data scientists can use tools of their choice through an R interface and compose interoperable pipelines across R, Spark, and Python.

analysis-pipeline interoperable-pipelines python r spark

Updated 11 months ago

pysparklyr • Science 26%

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect

databricks pyspark r spark spark-connect

Updated 11 months ago

https://github.com/big-data-lab-team/accident-prediction-montreal • Science 10%

accidents ai big-data big-data-analytics geospatial-data geospatial-processing machine machine-learning montreal opendata pyspark spark

Updated 11 months ago

graphster • Science 54%

Spark-based library that helps construct and query knowledge graphs from unstructured and structured data

ai graphs natural-language-processing spark