gym-electric-motor (GEM)
gym-electric-motor (GEM): A Python toolbox for the simulation of electric drive systems - Published in JOSS (2021)
DeepBench
DeepBench: A simulation package for physical benchmarking data - Published in JOSS (2025)
ctbench - compile-time benchmarking and analysis
ctbench - compile-time benchmarking and analysis - Published in JOSS (2023)
yaib
🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109
mmaction2
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
asreview-insights
Tools such as plots and metrics to analyze (simulated) reviews for ASReview LAB
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
SciMLBenchmarks
Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, Jax), MATLAB, R
benchexec
BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
fluidx3d
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
compression_benchmark
Benchmarking FASTQ compression with 'mature' compression algorithms
https://github.com/cheind/py-motmetrics
Benchmark multiple object trackers (MOT) in Python
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
rl4co
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
benchmarks-acoustic-propagation
Coupled model development for acoustic propagation through multilayer systems for particle-velocity sensors
pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
py-torchbenchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
benchmarl
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL lets users quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.
qcd
Quantum Circuit Designer: A gymnasium-based set of environments for benchmarking reinforcement learning for quantum circuit design.
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
jreferral
An open-source tool that recommends the most energy-efficient JVM configuration for Java software
are-we-fast-yet
Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays
opencl-benchmark
A small OpenCL benchmark program to measure peak GPU/CPU performance.
https://github.com/beuth-erdelt/benchmark-experiment-host-manager
This Python tool helps manage DBMS benchmarking experiments in a Kubernetes-based HPC cluster environment. It enables users to configure hardware and software setups so that tests can easily be repeated over varying configurations.
https://github.com/bark-simulator/bark
Open-Source Framework for Development, Simulation and Benchmarking of Behavior Planning Algorithms for Autonomous Driving
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/bio-phys/mdbenchmark
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
https://github.com/google-deepmind/physics-iq-benchmark
Benchmarking physical understanding in generative video models
leakdb
LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks. The dataset comprises a large number of artificially created but realistic leakage scenarios, on different water distribution networks, under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/crowdstrike/cloud-resource-estimator
Cloud deployment size calculation utilities
https://github.com/cdjellen/otbench
Effective Benchmarks for Optical Turbulence Modeling
sceneflow_from_blender
Get 3D motion vectors / scene flow directly from Blender
https://github.com/aim-uofa/geobench
A toolbox for benchmarking SOTA discriminative and generative geometry estimation models.
https://github.com/lquenti/blackheap
A blackbox approach to I/O modelling. (Migrated to Codeberg)
https://github.com/citiususc/blinkg
BLINKG: Benchmark for LLM-Integrated Knowledge Graph Generation
https://github.com/avik-pal/deeplearningbenchmarks
Benchmarks across Deep Learning Frameworks in Julia and Python
https://github.com/ai-forever/ruscode
Official repository for RusCode benchmark dataset (NAACL 2025)
https://github.com/jurgisp/memory-maze
Evaluating long-term memory of reinforcement learning algorithms
https://github.com/crate/tsperf
TSPERF Time Series Database Benchmark Suite. Framework for evaluating and comparing the performance of time series databases, in the spirit of TimescaleDB's TSBS.
https://github.com/yegor256/plum
Programming language ultimate metrics (PLUM) collected automatically from GitHub, Google Scholar, Twitter, etc.
https://github.com/aliireza/ddio-bench
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
https://github.com/cvanaret/nonconvex_solver_comparison
This repo collects results of nonlinear optimization solvers on standard benchmark problems
https://github.com/bblodfon/paad-survival-bench
Benchmark survival ML models against a multimodal TCGA dataset
https://github.com/bblodfon/ml-course-2022
Benchmarking ML classification models on a spam dataset
imdd-task
Short-reach Optical Communication: A Real-world Task for Neuromorphic Hardware
hyphi-gym
A Gymnasium benchmark suite for evaluating the robustness and multi-task performance of reinforcement learning algorithms in various discrete and continuous environments.
devformer
[ICML 2023] Official code for "DevFormer: A Symmetric Transformer for Context-Aware Device Placement"
champkit
Benchmarking toolkit for patch-based histopathology image classification.
flashrag
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
https://github.com/boniolp/msad
[VLDB 2023] Model Selection for Anomaly Detection in Time Series
agentdojo
A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
tsb-uad
An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection
maskedfacerepresentation
Masked face recognition focuses on identifying people using their facial features while they are wearing masks. We introduce benchmarks on face verification based on masked face images for the development of COVID-safe protocols in airports.
symbolic-governed-mistral-artifact
Tier-10 sealed governance artifact for Mistral-7B with exact-match benchmarks and symbolic verifier.
https://github.com/alan-turing-institute/tcpdbench
The Turing Change Point Detection Benchmark: An Extensive Benchmark Evaluation of Change Point Detection Algorithms on real-world data
ibp-sop-benchmarks-milp-cellphoneco
Benchmark instances modelling the supply chain of a fictitious company producing and selling cell phones and accessories of different types. The instances formulate typical mixed-integer linear optimization problems in the standard MPS format.
fast_frechet-python
Comparison of different (fast) discrete Fréchet distance implementations in Python.
ember
Code and data for the paper "Bridging the Gap between Reality and Ideality of Entity Matching: A Revisiting and Benchmark Re-Construction" (IJCAI 2022)
fast_frechet
Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.