Scientific Software
Updated 9 months ago

Visions — Peer-reviewed • Rank 22.2 • Science 93%

Visions: An Open-Source Library for Semantic Data - Published in JOSS (2020)

Sociology
Scientific Software · Peer-reviewed
Updated 3 months ago

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach • Rank 5.6 • Science 92%

Cost-Effective Big Data Orchestration Using Dagster: A Multi-Platform Approach - Published in JOSS (2026)

Updated 9 months ago

ch.cern.spark:spark-avro_2.12 • Rank 40.2 • Science 36%

Apache Spark - A unified analytics engine for large-scale data processing

Updated 9 months ago

chuanhuchatgpt • Rank 13.6 • Science 54%

GUI for ChatGPT API and many LLMs. Supports agents, file-based QA, GPT finetuning and query with web search. All with a neat UI.

Updated 9 months ago

pathling • Rank 15.2 • Science 49%

Tools that make it easier to use FHIR® and clinical terminology within data analytics, built on Apache Spark.

Updated 9 months ago

h2o • Rank 27.7 • Science 36%

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Updated 9 months ago

com.linkedin.isolation-forest • Rank 12.3 • Science 44%

A distributed Spark/Scala implementation of the isolation forest algorithm for unsupervised outlier detection, featuring support for scalable training and ONNX export for easy cross-platform inference.

Updated 9 months ago

https://github.com/awslabs/deequ • Rank 16.5 • Science 36%

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

Updated 9 months ago

fugue • Rank 25.1 • Science 26%

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Updated 9 months ago

lab-dotnet-spark • Rank 1.4 • Science 44%

Proof of Concept for Local Testable Spark Environment

Updated 9 months ago

workbench • Rank 15.3 • Science 26%

Workbench: An easy-to-use Python API for creating and deploying AWS SageMaker Models

Updated 9 months ago

spark-submit • Rank 8.5 • Science 26%

Python manager for spark-submit jobs

Updated 9 months ago

https://github.com/aphp/spark-etl • Rank 5.9 • Science 26%

A better bridge between Apache Spark and PostgreSQL

Updated 9 months ago

https://github.com/bytedance/cloudshuffleservice • Rank 5.6 • Science 13%

Cloud Shuffle Service (CSS) is a general-purpose remote shuffle solution for compute engines, including Spark, Flink, and MapReduce.

Updated 9 months ago

https://github.com/ccao-data/service-spark-iasworld • Rank 1.8 • Science 13%

Service for extracting tables from the CCAO system-of-record and uploading them to the Data Department's data warehouse

Updated 9 months ago

https://github.com/aveek-saha/cricket-score-predictor • Rank 1.4 • Science 13%

A big data application to predict the outcome of a T20 cricket match.

Updated 9 months ago

acid • Science 26%

Generate the AutoCorrelation Integral Drill (ACID) test set with scripts and workflows for robust validation of autocorrelation algorithms. 🌐📊

Updated 9 months ago

https://github.com/ai-team-uoa/geotriples • Science 10%

Publishing Big Geospatial data as Linked Open Geospatial Data

Updated 9 months ago

https://github.com/dadananjesha/credit-card-fraud-detection • Science 13%

Credit Card Fraud Detection is a state-of-the-art real-time streaming analytics solution designed to detect fraudulent credit card transactions instantly.

Updated 9 months ago

https://github.com/rumbledb/rumble • Science 36%

⛈️ RumbleDB 2.0.0 "Lemon Ironwood" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Updated 9 months ago

pysparklyr • Science 26%

Extension to {sparklyr} that allows you to interact with Spark & Databricks Connect

Updated 9 months ago

https://github.com/awslabs/data-on-eks • Science 26%

DoEKS is a tool to build, deploy and scale Data Platforms on Amazon EKS

Updated 9 months ago

graphster • Science 54%

Spark-based library that helps construct and query knowledge graphs from unstructured and structured data

Updated 9 months ago

analysis-pipelines • Science 10%

Enables data scientists to compose pipelines of analysis which consist of data manipulation, exploratory analysis & reporting, as well as modeling steps. Data scientists can use tools of their choice through an R interface, and compose interoperable pipelines between R, Spark, and Python.

Updated 9 months ago

https://github.com/dadananjesha/spark-streaming • Science 13%

Spark Streaming KPI Processing is a real-time data processing application built using Apache Spark Streaming

Updated 9 months ago

https://github.com/dadananjesha/redshift-etl-project • Science 13%

The project covers the complete data pipeline—from importing data from an RDS source to HDFS using Sqoop, processing data with Spark, to executing analytical queries on an AWS Redshift cluster.

Updated 9 months ago

https://github.com/dadananjesha/azuredataengine • Science 13%

AzureDataEngine is a robust, scalable batch processing data architecture built on the Azure platform. It efficiently extracts, transforms, and loads massive datasets for machine learning applications, leveraging Azure Blob Storage, PostgreSQL, Databricks, and Key Vault to ensure reliability and maintainability.

Updated 9 months ago

https://github.com/arturbomtempo-learning/projects-for-interdisciplinary-work-2 • Science 13%

Projects created to better learn about the technologies used in the Interdisciplinary Work 2 course of the Computer Science program.