gym-electric-motor (GEM)
gym-electric-motor (GEM): A Python toolbox for the simulation of electric drive systems - Published in JOSS (2021)
DeepBench
DeepBench: A simulation package for physical benchmarking data - Published in JOSS (2025)
ctbench - compile-time benchmarking and analysis
ctbench - compile-time benchmarking and analysis - Published in JOSS (2023)
yaib
🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109
mmaction2
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
asreview-insights
Tools such as plots and metrics to analyze (simulated) reviews for ASReview LAB
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
SciMLBenchmarks
Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, Jax), MATLAB, R
benchexec
BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
fluidx3d
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
compression_benchmark
Benchmarking FASTQ compression with 'mature' compression algorithms
https://github.com/cheind/py-motmetrics
📊 Benchmark multiple object trackers (MOT) in Python
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
rl4co
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
benchmarks-acoustic-propagation
Coupled model development for acoustic propagation through multilayer systems for particle-velocity sensors
pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory, and energy consumption
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
py-torchbenchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
benchmarl
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL allows users to quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.
qcd
Quantum Circuit Designer: A gymnasium-based set of environments for benchmarking reinforcement learning for quantum circuit design.
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
jreferral
An open-source tool that recommends the most energy-efficient JVM configuration for Java software
are-we-fast-yet
Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays
opencl-benchmark
A small OpenCL benchmark program to measure peak GPU/CPU performance.
https://github.com/beuth-erdelt/benchmark-experiment-host-manager
This Python tool helps manage DBMS benchmarking experiments in a Kubernetes-based HPC cluster environment. It enables users to configure hardware and software setups so that tests can easily be repeated over varying configurations.
https://github.com/bark-simulator/bark
Open-Source Framework for Development, Simulation and Benchmarking of Behavior Planning Algorithms for Autonomous Driving
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/bio-phys/mdbenchmark
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
https://github.com/google-deepmind/physics-iq-benchmark
Benchmarking physical understanding in generative video models
leakdb
LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks. The dataset consists of a large number of artificially created but realistic leakage scenarios, on different water distribution networks, under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/crowdstrike/cloud-resource-estimator
Cloud deployment size calculation utilities
https://github.com/cdjellen/otbench
Effective Benchmarks for Optical Turbulence Modeling
sceneflow_from_blender
Get 3D motion vectors / scene flow directly from Blender
https://github.com/aim-uofa/geobench
A toolbox for benchmarking SOTA discriminative and generative geometry estimation models.
https://github.com/lquenti/blackheap
A blackbox approach to I/O modelling. (Migrated to Codeberg)
https://github.com/citiususc/blinkg
BLINKG: Benchmark for LLM-Integrated Knowledge Graph Generation
https://github.com/avik-pal/deeplearningbenchmarks
Benchmarks across Deep Learning Frameworks in Julia and Python
https://github.com/ai-forever/ruscode
Official repository for RusCode benchmark dataset (NAACL 2025)
https://github.com/jurgisp/memory-maze
Evaluating long-term memory of reinforcement learning algorithms
https://github.com/crate/tsperf
TSPERF Time Series Database Benchmark Suite. Framework for evaluating and comparing the performance of time series databases, in the spirit of TimescaleDB's TSBS.
https://github.com/yegor256/plum
Programming language ultimate metrics (PLUM) collected automatically from GitHub, Google Scholar, Twitter, etc.
https://github.com/aliireza/ddio-bench
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
https://github.com/cvanaret/nonconvex_solver_comparison
This repo collects results of nonlinear optimization solvers on standard benchmark problems
https://github.com/bblodfon/paad-survival-bench
Benchmark survival ML models against a multimodal TCGA dataset
https://github.com/bblodfon/ml-course-2022
Benchmarking ML classification models on a spam dataset
champkit
Benchmarking toolkit for patch-based histopathology image classification.
agentdojo
A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents.
molscore
An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design
tsb-uad
An End-to-End Benchmark Suite for Univariate Time-Series Anomaly Detection
opfgym
A gymnasium-compatible framework to create reinforcement learning (RL) environments for solving the optimal power flow (OPF) problem. Contains five OPF benchmark environments for comparable research.
dpl
[NeurIPS 2023] Multi-fidelity hyperparameter optimization with deep power laws that achieves state-of-the-art results across diverse benchmarks.
maskedfacerepresentation
Masked face recognition focuses on identifying people using their facial features while they are wearing masks. We introduce benchmarks on face verification based on masked face images for the development of COVID-safe protocols in airports.
https://github.com/bytedance/portrait-mode-video
Video dataset dedicated to portrait-mode video recognition.
neteasecrowd-dataset
NetEaseCrowd dataset, a collection of data obtained from the You Ling crowdsourcing platform, Fuxi AI Lab, NetEase.
symbolic-governed-mistral-artifact
Tier-10 sealed governance artifact for Mistral-7B with exact-match benchmarks and symbolic verifier.
ibp-sop-benchmarks-milp-cellphoneco
Benchmark instances modelling the supply chain of a fictive company producing and selling cell phones and accessories of different types. The instances formulate typical mixed-integer linear optimization problems in the standard MPS format.
ember
Code and data for the paper "Bridging the Gap between Reality and Ideality of Entity Matching: A Revisiting and Benchmark Re-Construction" (IJCAI 2022)
https://github.com/grrvlr/tsmd
The TSMD project brings together Motif Discovery methods for Time Series, aiming to compare their performance through well-defined research questions and to simplify their practical use. It provides both guidelines for selecting the most suitable methods based on the data, and accessible implementations of the most relevant approaches.
https://github.com/hyperledger-caliper/caliper-benchmarks
Sample benchmark files for Hyperledger Caliper https://wiki.hyperledger.org/display/caliper
small-object-detection-benchmark
ICIP 2022 paper: SAHI benchmark on VisDrone and xView datasets using FCOS, VFNet and TOOD detectors
fast_frechet
Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.