PyCM
PyCM: Multiclass confusion matrix library in Python - Published in JOSS (2018)
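A minimal usage sketch, assuming the ConfusionMatrix API shown in the PyCM README; the label vectors are illustrative:

```python
# Build a multiclass confusion matrix from actual and predicted labels
from pycm import ConfusionMatrix

cm = ConfusionMatrix(actual_vector=[2, 0, 2, 2, 0, 1],
                     predict_vector=[0, 0, 2, 2, 0, 2])
print(cm.classes)      # discovered class labels, e.g. [0, 1, 2]
print(cm.Overall_ACC)  # overall accuracy
print(cm)              # matrix plus per-class and overall statistics
```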
Jury
Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)
eva3dm
eva3dm: An R package for model evaluation of 3D weather and air quality models - Published in JOSS (2025)
inundation-mapping
Flood inundation mapping and evaluation software configured to work with the U.S. National Water Model.
oumi
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM/VLM.
avalanche-lib
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold set plus on-demand multilingual synthetic packs), a generator CLI, and a CI-ready eval harness for ultra-fast LLM smoke testing and regression catching.
ipal_evaluate
Intrusion Detection Evaluation - A framework to evaluate (industrial) intrusion detection systems.
ER-Evaluation
ER-Evaluation: End-to-End Evaluation of Entity Resolution Systems - Published in JOSS (2023)
mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
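A minimal tracking sketch, assuming the core mlflow logging API; the experiment name and values are illustrative:

```python
# Log a parameter and a metric to a local MLflow tracking store
import mlflow

mlflow.set_experiment("eval-demo")          # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)
```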
eval-suite
[ACL 2024] A user-friendly evaluation framework (Eval Suite) and benchmarks: UHGEval, HaluEval, HalluQA, etc.
https://github.com/amenra/ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
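A minimal ranking-evaluation sketch, assuming the Qrels/Run/evaluate API from the ranx README; query and document IDs are illustrative:

```python
# Score a retrieval run against relevance judgments
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"doc_12": 1, "doc_25": 1}})                 # ground-truth relevance
run = Run({"q_1": {"doc_12": 0.9, "doc_23": 0.8, "doc_25": 0.7}})  # system scores
print(evaluate(qrels, run, ["ndcg@5", "map@5", "mrr"]))            # dict of metric values
```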
hydrotools
Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.
multivar_horner
multivar_horner: A Python package for computing Horner factorisations of multivariate polynomials - Published in JOSS (2020)
py-alpaca-eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
openqa-eval
ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models
yeast-in-microstructures-dataset
Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
tyc-dataset
Official and maintained implementation of the dataset paper "The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures" [ICCVW 2023].
evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
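A minimal sketch, assuming the evaluate.load / compute API from the library's quickstart; the toy predictions are illustrative:

```python
# Load a metric from the Hub and compute it on toy predictions
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
print(result)  # {'accuracy': 0.5}
```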
https://github.com/time-series-machine-learning/tsml-eval
Evaluation tools for time series machine learning algorithms.
@superagent-ai/poker-eval
A comprehensive tool for assessing the performance of AI agents in simulated poker environments.
gval
A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.
lrv-instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"
https://github.com/dru-mara/evalne-gui
EvalNE-GUI: The Graphical User Interface for EvalNE
promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://github.com/amenra/guardbench
A Python library for guardrail models evaluation.
https://github.com/amenra/a-multi-domain-benchmark-for-personalized-search-evaluation
A Multi-domain Benchmark for Personalized Search Evaluation
https://github.com/aida-ugent/nrl4lp
Instructions for replicating the experiments in the paper "Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?" (DSAA2020)
https://github.com/amazon-science/memerag
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
https://github.com/aanzel/polar-diagrams-dashboard
"A Multi-Technique Strategy for Improving Summary Polar Diagrams" by Aleksandar Anžel, Zewen Yang, and Georges Hattab
fidelity-measure-for-dts
Algorithm to measure the level of fidelity of a Digital Twin System by aligning behavioral traces.
https://github.com/bytedance/pxmeter
Structural Quality Assessment for Biomolecular Structure Prediction Models
evaluation-paper
Supporting material and website for the paper "Anomaly Detection in Time Series: A Comprehensive Evaluation"
atrium-page-classification
Classification of historical page images using ViT, for the ATRIUM project.
https://github.com/brainlesion/panoptica
panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps
https://github.com/rentruewang/bocoel
Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.
milu
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.
hinteval
HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions
fieldstack
Reusable R notebooks, scripts, and tools for applied data work and evaluation — built for use in the field across health, gender, climate, and education programs.
rag-evaluation-harnesses
An evaluation suite for Retrieval-Augmented Generation (RAG).
securityeval
Repository for "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques" published in MSR4P&S'22.
https://github.com/cyberagentailab/lctg-bench
LCTG Bench: LLM Controlled Text Generation Benchmark
llm-reliability
Code for the paper "Larger and more instructable language models become less reliable"
tno.sdg.tabular.eval.utility-metrics
TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics
stipa
MATLAB implementation of the Speech Transmission Index for Public Address (STIPA) method for evaluating speech transmission quality.
equitystack
A structured repository of Python scripts and Jupyter notebooks for development sector data workflows — including public health, gender equity, women's economic empowerment (WEE), education, and MEL (Monitoring, Evaluation, and Learning). Includes plug-and-play templates, sample data, test coverage, and Colab-ready execution.