Scientific Software
Updated 6 months ago

Jury — Peer-reviewed • Rank 14.8 • Science 93%

Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)

Scientific Software
Updated 6 months ago

eva3dm — Peer-reviewed • Rank 7.8 • Science 93%

eva3dm: A R-package for model evaluation of 3D weather and air quality models - Published in JOSS (2025)

Updated 6 months ago

inundation-mapping • Rank 9.2 • Science 77%

Flood inundation mapping and evaluation software configured to work with U.S. National Water Model.

Updated 6 months ago

oumi • Rank 20.3 • Science 64%

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

Updated 6 months ago

tiny_qa_benchmark_pp • Rank 2.1 • Science 77%

Tiny QA Benchmark++: a micro-benchmark suite (a 52-item gold set plus on-demand multilingual synthetic packs), a generator CLI, and a CI-ready eval harness for ultra-fast LLM smoke-testing and regression catching.

Updated 6 months ago

ipal_evaluate • Rank 1.6 • Science 77%

Intrusion Detection Evaluation - A framework to evaluate (Industrial) Intrusion Detection Systems.

Scientific Software
Updated 6 months ago

ER-Evaluation — Peer-reviewed • Rank 4.2 • Science 67%

ER-Evaluation: End-to-End Evaluation of Entity Resolution Systems - Published in JOSS (2023)

Updated 6 months ago

mlflow • Rank 35.0 • Science 36%

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
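
A minimal sketch of the tracking workflow MLflow provides, using its standard tracking API; the run name, parameter, and metric values below are illustrative only.

```python
import mlflow

# Start a tracked run; parameters and metrics are stored by the local
# tracking backend (./mlruns) unless a tracking server is configured.
with mlflow.start_run(run_name="demo-eval"):        # illustrative run name
    mlflow.log_param("model", "baseline-logreg")    # hypothetical parameter
    mlflow.log_metric("accuracy", 0.91)             # hypothetical metric values
    mlflow.log_metric("f1", 0.88)
```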

Updated 6 months ago

hydrotools • Rank 6.2 • Science 54%

Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.
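
A hedged sketch of retrieving USGS NWIS observations with the hydrotools nwis_client subpackage; the gauge number and date range are arbitrary, and exact class and parameter names are from memory of the project docs and may differ between versions.

```python
from hydrotools.nwis_client.iv import IVDataService

# Instantaneous-values client for the USGS NWIS REST service.
service = IVDataService()

# Retrieve observations for one gauge (Potomac River near Little Falls)
# over an arbitrary window; returns a pandas DataFrame of time series.
observations = service.get(
    sites="01646500",
    startDT="2023-01-01",
    endDT="2023-01-07",
)
print(observations.head())
```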

Scientific Software
Updated 6 months ago

multivar_horner — Peer-reviewed • Rank 10.2 • Science 49%

multivar_horner: A Python package for computing Horner factorisations of multivariate polynomials - Published in JOSS (2020)
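
A hand-worked illustration of the idea the package automates (this is plain Python, not multivar_horner's API): rewriting a multivariate polynomial as a Horner factorisation gives the same value with fewer multiplications.

```python
# p(x, y, z) = 5 + x^3*y + 2*x^2*z + 3*x*y*z, evaluated two ways.

def p_naive(x, y, z):
    # Monomial form: every term recomputes its powers (9 multiplications).
    return 5.0 + x * x * x * y + 2.0 * x * x * z + 3.0 * x * y * z

def p_horner(x, y, z):
    # One possible Horner factorisation: 5 + x*(x*(x*y + 2*z) + 3*y*z)
    # (6 multiplications).
    return 5.0 + x * (x * (x * y + 2.0 * z) + 3.0 * y * z)

x, y, z = -2.0, 3.0, 1.0
assert abs(p_naive(x, y, z) - p_horner(x, y, z)) < 1e-12
```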

Updated 6 months ago

py-alpaca-eval • Rank 12.6 • Science 46%

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Updated 6 months ago

openqa-eval • Rank 3.8 • Science 54%

ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models

Updated 6 months ago

yeast-in-microstructures-dataset • Rank 3.7 • Science 54%

Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].

Updated 6 months ago

tyc-dataset • Rank 2.3 • Science 54%

Official and maintained implementation of the dataset paper "The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures" [ICCVW 2023].

Updated 6 months ago

evalify • Rank 6.5 • Science 49%

Evaluate your biometric verification models literally in seconds.

Updated 6 months ago

evaluate • Rank 27.2 • Science 26%

A version of eval for R that returns more information about what happened

Updated 6 months ago

evaluate • Rank 29.9 • Science 23%

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
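
A minimal sketch of the library's standard load/compute pattern; the toy predictions and references below are illustrative.

```python
import evaluate

# Load a metric from the Hugging Face Hub and score toy predictions.
accuracy = evaluate.load("accuracy")
result = accuracy.compute(
    predictions=[0, 1, 1, 0],   # illustrative model outputs
    references=[0, 1, 0, 0],    # illustrative gold labels
)
print(result)  # e.g. {'accuracy': 0.75}
```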

Updated 6 months ago

@superagent-ai/poker-eval • Rank 4.0 • Science 44%

A comprehensive tool for assessing AI agent performance in simulated poker environments

Updated 6 months ago

verif • Rank 11.7 • Science 36%

Graphical tool for creating verification plots of weather forecasts

Updated 6 months ago

gval • Rank 2.8 • Science 44%

A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.
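
As a conceptual illustration of the candidate-vs-benchmark comparison such a framework performs (plain NumPy on toy rasters, not gval's actual API):

```python
import numpy as np

# Toy categorical rasters: 1 = wet, 0 = dry.
candidate = np.array([[1, 1, 0], [0, 1, 0]])
benchmark = np.array([[1, 0, 0], [0, 1, 1]])

# Agreement map: cell-by-cell comparison of the two maps.
agreement = candidate == benchmark

# Simple summary statistics derived from the comparison.
true_positive = np.sum((candidate == 1) & (benchmark == 1))
false_positive = np.sum((candidate == 1) & (benchmark == 0))
false_negative = np.sum((candidate == 0) & (benchmark == 1))
csi = true_positive / (true_positive + false_positive + false_negative)
print(agreement)
print(f"CSI = {csi:.2f}")
```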

Updated 5 months ago

https://github.com/brucewlee/h-test • Rank 0.7 • Science 33%

[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language

Updated 5 months ago

https://github.com/amazon-science/auto-rag-eval • Rank 5.1 • Science 23%

Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"

Updated 6 months ago

https://github.com/dru-mara/evalne-gui • Rank 3.6 • Science 23%

EvalNE-GUI: The Graphical User Interface for EvalNE

Updated 6 months ago

llm-jp-eval • Science 26%

Modified llm-jp-eval with API and HF scripts for LFMs.

Updated 6 months ago

promptfoo • Science 26%

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Updated 5 months ago

https://github.com/amenra/guardbench • Science 36%

A Python library for evaluating guardrail models.

Updated 5 months ago

https://github.com/aida-ugent/nrl4lp • Science 23%

Instructions for replicating the experiments in the paper "Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?" (DSAA2020)

Updated 5 months ago

https://github.com/amazon-science/memerag • Science 36%

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Updated 6 months ago

fidelity-measure-for-dts • Science 31%

Algorithm to measure the level of fidelity of a Digital Twin System by aligning behavioral traces.

Updated 5 months ago

https://github.com/bytedance/pxmeter • Science 49%

Structural Quality Assessment for Biomolecular Structure Prediction Models

Updated 6 months ago

evaluation-paper • Science 44%

Supporting material and website for the paper "Anomaly Detection in Time Series: A Comprehensive Evaluation"

Updated 6 months ago

atrium-page-classification • Science 44%

Classification of historical page images using ViT - for ATRIUM project

Updated 5 months ago

https://github.com/brainlesion/panoptica • Science 36%

panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps

Updated 6 months ago

PRONE • Science 26%

R Package for preprocessing, normalizing, and analyzing proteomics data

Updated 6 months ago

https://github.com/rentruewang/bocoel • Science 26%

Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.

Updated 6 months ago

dashboard-prototype • Science 65%

Prototype data dashboard for Imageomics Data

Updated 6 months ago

milu • Science 41%

MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.

Updated 6 months ago

fieldstack • Science 67%

Reusable R notebooks, scripts, and tools for applied data work and evaluation — built for use in the field across health, gender, climate, and education programs.

Updated 6 months ago

rag-evaluation-harnesses • Science 54%

An evaluation suite for Retrieval-Augmented Generation (RAG).

Updated 6 months ago

abecto • Science 75%

An ABox Evaluation and Comparison Tool for Ontologies.

Updated 6 months ago

securityeval • Science 57%

Repository for "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques" published in MSR4P&S'22.

Updated 5 months ago

https://github.com/cyberagentailab/lctg-bench • Science 13%

LCTG Bench: LLM Controlled Text Generation Benchmark

Updated 6 months ago

llm-reliability • Science 49%

Code for the paper "Larger and more instructable language models become less reliable"

Updated 6 months ago

tno.sdg.tabular.eval.utility-metrics • Science 44%

TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics

Updated 6 months ago

stipa • Science 44%

MATLAB implementation of the Speech Transmission Index for Public Address (STIPA) method for evaluating speech transmission quality.

Updated 6 months ago

equitystack • Science 49%

A structured repository of Python scripts and Jupyter notebooks for development sector data workflows — including public health, gender equity, women's economic empowerment (WEE), education, and MEL (Monitoring, Evaluation, and Learning). Includes plug-and-play templates, sample data, test coverage, and Colab-ready execution.