gym-electric-motor (GEM)
gym-electric-motor (GEM): A Python toolbox for the simulation of electric drive systems - Published in JOSS (2021)
DeepBench
DeepBench: A simulation package for physical benchmarking data - Published in JOSS (2025)
ctbench - compile-time benchmarking and analysis
ctbench - compile-time benchmarking and analysis - Published in JOSS (2023)
yaib
🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109
mmaction2
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
asreview-insights
Tools such as plots and metrics to analyze (simulated) reviews for ASReview LAB
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
SciMLBenchmarks
Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, JAX), MATLAB, R
benchexec
BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
fluidx3d
The fastest and most memory-efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
compression_benchmark
Benchmarking FASTQ compression with 'mature' compression algorithms
https://github.com/cheind/py-motmetrics
📊 Benchmark multiple object trackers (MOT) in Python
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
rl4co
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
benchmarks-acoustic-propagation
Coupled model development for acoustic propagation through multilayer systems for particle-velocity sensors
pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory and energy consumption
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
py-torchbenchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
benchmarl
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL lets you quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.
qcd
Quantum Circuit Designer: A gymnasium-based set of environments for benchmarking reinforcement learning for quantum circuit design.
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
jreferral
An open-source tool that recommends the most energy-efficient JVM configuration for Java software
are-we-fast-yet
Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays
opencl-benchmark
A small OpenCL benchmark program to measure peak GPU/CPU performance.
https://github.com/beuth-erdelt/benchmark-experiment-host-manager
This Python tool helps manage DBMS benchmarking experiments in a Kubernetes-based HPC cluster environment. It enables users to configure hardware / software setups for easily repeating tests over varying configurations.
https://github.com/bark-simulator/bark
Open-Source Framework for Development, Simulation and Benchmarking of Behavior Planning Algorithms for Autonomous Driving
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/bio-phys/mdbenchmark
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
https://github.com/google-deepmind/physics-iq-benchmark
Benchmarking physical understanding in generative video models
leakdb
LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks. The dataset comprises a large number of artificially created but realistic leakage scenarios, on different water distribution networks, under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/crowdstrike/cloud-resource-estimator
Cloud deployment size calculation utilities
https://github.com/cdjellen/otbench
Effective Benchmarks for Optical Turbulence Modeling
sceneflow_from_blender
Get 3D motion vectors / scene flow directly from Blender
https://github.com/aim-uofa/geobench
A toolbox for benchmarking SOTA discriminative and generative geometry estimation models.
https://github.com/lquenti/blackheap
A black-box approach to I/O modelling. (Migrated to Codeberg)
https://github.com/citiususc/blinkg
BLINKG: Benchmark for LLM-Integrated Knowledge Graph Generation
https://github.com/avik-pal/deeplearningbenchmarks
Benchmarks across Deep Learning Frameworks in Julia and Python
https://github.com/ai-forever/ruscode
Official repository for RusCode benchmark dataset (NAACL 2025)
https://github.com/jurgisp/memory-maze
Evaluating long-term memory of reinforcement learning algorithms
https://github.com/crate/tsperf
TSPERF Time Series Database Benchmark Suite. Framework for evaluating and comparing the performance of time series databases, in the spirit of TimescaleDB's TSBS.
https://github.com/yegor256/plum
Programming language ultimate metrics (PLUM) collected automatically from GitHub, Google Scholar, Twitter, etc.
https://github.com/aliireza/ddio-bench
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
https://github.com/cvanaret/nonconvex_solver_comparison
This repo collects results of nonlinear optimization solvers on standard benchmark problems
https://github.com/bblodfon/paad-survival-bench
Benchmark survival ML models against a multimodal TCGA dataset
https://github.com/bblodfon/ml-course-2022
Benchmarking ML classification models on a spam dataset
imdd-task
Short-reach Optical Communication: A Real-world Task for Neuromorphic Hardware
molscore
An automated scoring function to facilitate and standardize the evaluation of goal-directed generative models for de novo molecular design
benchscofi
Package which contains implementations of published collaborative filtering-based algorithms for drug repurposing.
https://github.com/becksteinlab/parallel-analysis-in-the-mdanalysis-library
Benchmarking MDAnalysis with Dask (and MPI). Supplementary Information for SciPy 2017 paper.
variantbenchmarking
Pipeline to evaluate and validate the accuracy of variant calling methods in genomic research
evalplus
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024
small-object-detection-benchmark
ICIP 2022 paper: SAHI benchmark on VisDrone and xView datasets using FCOS, VFNet and TOOD detectors
https://github.com/compnet/signedbenchmark
Benchmark to study partitioning problems on signed graphs
https://github.com/birgitrijvers/copd-microbiome-shift-analysis
This is the repository for a bioinformatics project on finding the optimal pipeline for identifying a potential microbiome shift in COPD patients after ceasing azithromycin treatment.
benchmark-privesc-linux
A comprehensive local Linux Privilege-Escalation Benchmark
https://github.com/ccao-data/report-model-benchmark
Benchmark of timing for CCAO models on different hardware
foundation-model-benchmarking-tool
Foundation model benchmarking tool. Run any model on any AWS platform and benchmark for performance across instance type and serving stack options.
https://github.com/cosmaadrian/rocode
Official repository for "RoCode: A Dataset for Measuring Code Intelligence from Romanian Problem Definitions"
https://github.com/alan-turing-institute/tcpdbench
The Turing Change Point Detection Benchmark: An Extensive Benchmark Evaluation of Change Point Detection Algorithms on real-world data
leaderboard
You can find the most recent KGQA benchmark numbers from publications here.
https://github.com/holistic-ai/holisticai
This is an open-source tool to assess and improve the trustworthiness of AI systems.
https://github.com/andrejorsula/space_robotics_bench
Robot Learning Beyond Earth
https://github.com/hyperledger-caliper/caliper-benchmarks
Sample benchmark files for Hyperledger Caliper https://wiki.hyperledger.org/display/caliper
https://github.com/grrvlr/tsmd
The TSMD project brings together Motif Discovery methods for Time Series, aiming to compare their performance through well-defined research questions and to simplify their practical use. It provides both guidelines for selecting the most suitable methods based on the data, and accessible implementations of the most relevant approaches.