gym-electric-motor (GEM)
gym-electric-motor (GEM): A Python toolbox for the simulation of electric drive systems - Published in JOSS (2021)
DeepBench
DeepBench: A simulation package for physical benchmarking data - Published in JOSS (2025)
ctbench - compile-time benchmarking and analysis
ctbench - compile-time benchmarking and analysis - Published in JOSS (2023)
yaib
🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109
mmaction2
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
asreview-insights
Tools such as plots and metrics to analyze (simulated) reviews for ASReview LAB
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
SciMLBenchmarks
Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, Jax), MATLAB, R
benchexec
BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
fluidx3d
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
compression_benchmark
Benchmarking FASTQ compression with 'mature' compression algorithms
https://github.com/cheind/py-motmetrics
📊 Benchmark multiple object trackers (MOT) in Python
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
rl4co
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
benchmarks-acoustic-propagation
Coupled model development for acoustic propagation through multilayer systems for particle-velocity sensors
pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory, and energy consumption
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
py-torchbenchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
benchmarl
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL lets you quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.
qcd
Quantum Circuit Designer: A gymnasium-based set of environments for benchmarking reinforcement learning for quantum circuit design.
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
jreferral
An open-source tool that recommends the most energy-efficient JVM configuration for Java software
are-we-fast-yet
Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays
opencl-benchmark
A small OpenCL benchmark program to measure peak GPU/CPU performance.
https://github.com/beuth-erdelt/benchmark-experiment-host-manager
This Python tool helps manage DBMS benchmarking experiments in a Kubernetes-based HPC cluster environment. It enables users to configure hardware/software setups so that tests can easily be repeated over varying configurations.
https://github.com/bark-simulator/bark
Open-Source Framework for Development, Simulation and Benchmarking of Behavior Planning Algorithms for Autonomous Driving
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/bio-phys/mdbenchmark
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
https://github.com/google-deepmind/physics-iq-benchmark
Benchmarking physical understanding in generative video models
leakdb
LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks. The dataset comprises a large number of artificially created but realistic leakage scenarios, on different water distribution networks, under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/crowdstrike/cloud-resource-estimator
Cloud deployment size calculation utilities
https://github.com/cdjellen/otbench
Effective Benchmarks for Optical Turbulence Modeling
sceneflow_from_blender
Get 3D motion vectors / scene flow directly from Blender
https://github.com/aim-uofa/geobench
A toolbox for benchmarking SOTA discriminative and generative geometry estimation models.
https://github.com/lquenti/blackheap
A black-box approach to I/O modelling. (Migrated to Codeberg)
https://github.com/citiususc/blinkg
BLINKG: Benchmark for LLM-Integrated Knowledge Graph Generation
https://github.com/avik-pal/deeplearningbenchmarks
Benchmarks across Deep Learning Frameworks in Julia and Python
https://github.com/ai-forever/ruscode
Official repository for RusCode benchmark dataset (NAACL 2025)
https://github.com/jurgisp/memory-maze
Evaluating long-term memory of reinforcement learning algorithms
https://github.com/crate/tsperf
TSPERF Time Series Database Benchmark Suite. Framework for evaluating and comparing the performance of time series databases, in the spirit of TimescaleDB's TSBS.
https://github.com/yegor256/plum
Programming language ultimate metrics (PLUM) collected automatically from GitHub, Google Scholar, Twitter, etc.
https://github.com/aliireza/ddio-bench
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
https://github.com/cvanaret/nonconvex_solver_comparison
This repo collects results of nonlinear optimization solvers on standard benchmark problems
https://github.com/bblodfon/paad-survival-bench
Benchmark survival ML models against a multimodal TCGA dataset
https://github.com/bblodfon/ml-course-2022
Benchmarking ML classification models on a spam dataset
https://github.com/hyperledger-caliper/caliper-benchmarks
Sample benchmark files for Hyperledger Caliper https://wiki.hyperledger.org/display/caliper
leaderboard
You can find the most recent KGQA benchmark numbers from publications here.
ai-for-drinking-water-chlorination-challenge-ijcai-25
1st AI for Drinking Water Chlorination Challenge @ IJCAI-2025
https://github.com/cosmaadrian/rocode
Official repository for "RoCode: A Dataset for Measuring Code Intelligence from Romanian Problem Definitions"
https://github.com/amazon-science/memerag
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
https://github.com/aida-ugent/nrl4lp
Instructions for replicating the experiments in the paper "Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?" (DSAA2020)
https://github.com/bethgelab/model-vs-human
Benchmark your model on out-of-distribution datasets with carefully collected human comparison data (NeurIPS 2021 Oral)
https://github.com/chakib-belgaid/jvm-comparaison
A benchmarking protocol for studying the behaviour of different JVMs
https://github.com/aksw/frankgraphbench
FranKGraphBench is a framework that allows KG-aware RSs to be benchmarked in a reproducible and easy-to-implement manner. It was first created during Google Summer of Code 2023 for data integration between DBpedia and some standard RS datasets in a reproducible framework.
https://github.com/compnet/signedbenchmark
Benchmark to study partitioning problems on signed graphs
https://github.com/bytedance/web-bench
Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development.
champkit
Benchmarking toolkit for patch-based histopathology image classification.
https://github.com/cedrickchee/dawnbench-analysis
DAWNBench analysis of CIFAR-10 time-to-accuracy.
fast_frechet
Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.
benchmark-privesc-linux
A comprehensive local Linux Privilege-Escalation Benchmark
https://github.com/boniolp/msad
[VLDB 2023] Model Selection for Anomaly Detection in Time Series
opfgym
A gymnasium-compatible framework to create reinforcement learning (RL) environments for solving the optimal power flow (OPF) problem. Contains five OPF benchmark environments for comparable research.
https://github.com/gagolews/clustering-data-v1
A framework for benchmarking clustering algorithms – Benchmark suite, version 1
https://github.com/cornell-zhang/heurigym
Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization