gym-electric-motor (GEM)
gym-electric-motor (GEM): A Python toolbox for the simulation of electric drive systems - Published in JOSS (2021)
DeepBench
DeepBench: A simulation package for physical benchmarking data - Published in JOSS (2025)
ctbench - compile-time benchmarking and analysis
ctbench - compile-time benchmarking and analysis - Published in JOSS (2023)
yaib
🧪Yet Another ICU Benchmark: a holistic framework for the standardization of clinical prediction model experiments. Provide custom datasets, cohorts, prediction tasks, endpoints, preprocessing, and models. Paper: https://arxiv.org/abs/2306.05109
mmaction2
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
asreview-insights
Tools such as plots and metrics to analyze (simulated) reviews for ASReview LAB
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.
proteinworkshop
Benchmarking framework for protein representation learning. Includes a large number of pre-training and downstream task datasets, models and training/task utilities. (ICLR 2024)
SciMLBenchmarks
Scientific machine learning (SciML) benchmarks, AI for science, and (differential) equation solvers. Covers Julia, Python (PyTorch, Jax), MATLAB, R
benchexec
BenchExec: A Framework for Reliable Benchmarking and Resource Measurement
tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
fluidx3d
The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs and CPUs via OpenCL. Free for non-commercial use.
compression_benchmark
Benchmarking FASTQ compression with 'mature' compression algorithms
https://github.com/cheind/py-motmetrics
📊 Benchmark multiple object trackers (MOT) in Python
lrebench
[EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study
rl4co
A PyTorch library for all things Reinforcement Learning (RL) for Combinatorial Optimization (CO)
benchmarks-acoustic-propagation
Coupled model development for acoustic propagation through multilayer systems for particle-velocity sensors
pytorch-benchmark
Easily benchmark PyTorch model FLOPs, latency, throughput, allocated GPU memory, and energy consumption
eval-suite
[ACL 2024] User-friendly evaluation framework: Eval Suite & Benchmarks: UHGEval, HaluEval, HalluQA, etc.
py-torchbenchmark
TorchBench is a collection of open source benchmarks used to evaluate PyTorch performance.
benchmarl
BenchMARL is a library for benchmarking Multi-Agent Reinforcement Learning (MARL). BenchMARL lets you quickly compare different MARL algorithms, tasks, and models while being systematically grounded in its two core tenets: reproducibility and standardization.
qcd
Quantum Circuit Designer: A gymnasium-based set of environments for benchmarking reinforcement learning for quantum circuit design.
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
aptv2
The official repo for the extension of [NeurIPS'22] "APT-36K: A Large-scale Benchmark for Animal Pose Estimation and Tracking": https://github.com/pandorgan/APT-36K
jreferral
An open-source tool that recommends the most energy-efficient JVM configuration for Java software
are-we-fast-yet
Are We Fast Yet? Comparing Language Implementations with Objects, Closures, and Arrays
opencl-benchmark
A small OpenCL benchmark program to measure peak GPU/CPU performance.
https://github.com/beuth-erdelt/benchmark-experiment-host-manager
This Python tool helps manage DBMS benchmarking experiments in a Kubernetes-based HPC cluster environment. It lets users configure hardware/software setups so that tests can be easily repeated across varying configurations.
https://github.com/bark-simulator/bark
Open-Source Framework for Development, Simulation and Benchmarking of Behavior Planning Algorithms for Autonomous Driving
tax-retrieval-benchmark
An implementation of the TaxRetrievalBenchmark task for the 🤗 Massive Text Embedding Benchmark (MTEB) framework.
https://github.com/bio-phys/mdbenchmark
Quickly generate, start and analyze benchmarks for molecular dynamics simulations.
https://github.com/google-deepmind/physics-iq-benchmark
Benchmarking physical understanding in generative video models
leakdb
LeakDB (Leakage Diagnosis Benchmark) is a realistic leakage dataset for water distribution networks. The dataset consists of a large number of artificially created but realistic leakage scenarios on different water distribution networks under varying conditions. A scoring algorithm in MATLAB code is provided to evaluate the results of different algorithms.
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/crowdstrike/cloud-resource-estimator
Cloud deployment size calculation utilities
https://github.com/cdjellen/otbench
Effective Benchmarks for Optical Turbulence Modeling
sceneflow_from_blender
Get 3D motion vectors / scene flow directly from Blender
https://github.com/aim-uofa/geobench
A toolbox for benchmarking SOTA discriminative and generative geometry estimation models.
https://github.com/lquenti/blackheap
A black-box approach to I/O modelling. (Migrated to Codeberg)
https://github.com/citiususc/blinkg
BLINKG: Benchmark for LLM-Integrated Knowledge Graph Generation
https://github.com/avik-pal/deeplearningbenchmarks
Benchmarks across Deep Learning Frameworks in Julia and Python
https://github.com/ai-forever/ruscode
Official repository for RusCode benchmark dataset (NAACL 2025)
https://github.com/jurgisp/memory-maze
Evaluating long-term memory of reinforcement learning algorithms
https://github.com/crate/tsperf
TSPERF Time Series Database Benchmark Suite. Framework for evaluating and comparing the performance of time series databases, in the spirit of TimescaleDB's TSBS.
https://github.com/yegor256/plum
Programming language ultimate metrics (PLUM) collected automatically from GitHub, Google Scholar, Twitter, etc.
https://github.com/aliireza/ddio-bench
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
https://github.com/cvanaret/nonconvex_solver_comparison
This repo collects results of nonlinear optimization solvers on standard benchmark problems
https://github.com/bblodfon/paad-survival-bench
Benchmark survival ML models against a multimodal TCGA dataset
https://github.com/bblodfon/ml-course-2022
Benchmarking ML classification models on a spam dataset
https://github.com/bytedance/web-bench
Web-Bench is a benchmark designed to evaluate the performance of LLMs in actual Web development.
https://github.com/cedrickchee/dawnbench-analysis
DAWNBench analysis of CIFAR-10 time-to-accuracy.
https://github.com/grrvlr/tsmd
The TSMD project brings together Motif Discovery methods for Time Series, aiming to compare their performance through well-defined research questions and to simplify their practical use. It provides both guidelines for selecting the most suitable methods based on the data, and accessible implementations of the most relevant approaches.
flashrag
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
fast_frechet
Comparison of different (fast) discrete Fréchet distance implementations in C++ and CUDA.
pano3d
Code and models for "Pano3D: A Holistic Benchmark and a Solid Baseline for 360 Depth Estimation", OmniCV Workshop @ CVPR21.
https://github.com/amdresearch/npueval
NPUEval is an LLM evaluation dataset written specifically to target AIE kernel code generation on RyzenAI hardware.
https://github.com/anibulus/benchmark.net
A console application to benchmark different ways to order. Just an experiment.
https://github.com/bethgelab/model-vs-human
Benchmark your model on out-of-distribution datasets with carefully collected human comparison data (NeurIPS 2021 Oral)
https://github.com/chakib-belgaid/jvm-comparaison
A benchmarking protocol for studying the behaviour of different JVMs
https://github.com/ccao-data/report-model-benchmark
Benchmark of timing for CCAO models on different hardware
benchscofi
Package which contains implementations of published collaborative filtering-based algorithms for drug repurposing.
hyphi-gym
A Gymnasium benchmark suite for evaluating the robustness and multi-task performance of reinforcement learning algorithms in various discrete and continuous environments.
disa-windows-server-2016
This repository is part of the paper "Automated Implementation of Windows-related Security-Configuration Guides", presented at the 35th IEEE/ACM International Conference on Automated Software Engineering.
web-framework-benchmark-thesis
Benchmark comparison of Leptos to Leading JavaScript Web Frameworks
stochastic-benchmark
Repository for Stochastic Optimization Solvers Benchmark code