Scientific Software
Updated 10 months ago

Jury — Peer-reviewed • Rank 14.8 • Science 93%

Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)

Scientific Software
Updated 10 months ago

eva3dm — Peer-reviewed • Rank 7.8 • Science 93%

eva3dm: A R-package for model evaluation of 3D weather and air quality models - Published in JOSS (2025)

Updated 10 months ago

inundation-mapping • Rank 9.2 • Science 77%

Flood inundation mapping and evaluation software configured to work with the U.S. National Water Model.

Updated 10 months ago

oumi • Rank 20.3 • Science 64%

Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!

Updated 10 months ago

tiny_qa_benchmark_pp • Rank 2.1 • Science 77%

Tiny QA Benchmark++: a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing & regression-catching.

Updated 10 months ago

ipal_evaluate • Rank 1.6 • Science 77%

Intrusion Detection Evaluation - A framework to evaluate (Industrial) Intrusion Detection Systems.

Artificial Intelligence and Machine Learning (40%) • Engineering (40%) • Biology (40%)
Scientific Software
Updated 10 months ago

ER-Evaluation — Peer-reviewed • Rank 4.2 • Science 67%

ER-Evaluation: End-to-End Evaluation of Entity Resolution Systems - Published in JOSS (2023)

Updated 10 months ago

mlflow • Rank 35.0 • Science 36%

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

Updated 10 months ago

hydrotools • Rank 6.2 • Science 54%

Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.

Scientific Software
Updated 10 months ago

multivar_horner — Peer-reviewed • Rank 10.2 • Science 49%

multivar_horner: A Python package for computing Horner factorisations of multivariate polynomials - Published in JOSS (2020)

Updated 10 months ago

py-alpaca-eval • Rank 12.6 • Science 46%

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Updated 10 months ago

openqa-eval • Rank 3.8 • Science 54%

ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models

Updated 10 months ago

yeast-in-microstructures-dataset • Rank 3.7 • Science 54%

Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].

Updated 10 months ago

tyc-dataset • Rank 2.3 • Science 54%

Official and maintained implementation of the dataset paper "The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures" [ICCVW 2023].

Updated 10 months ago

evalify • Rank 6.5 • Science 49%

Evaluate your biometric verification models literally in seconds.

Updated 10 months ago

evaluate • Rank 27.2 • Science 26%

A version of eval for R that returns more information about what happened

Updated 10 months ago

evaluate • Rank 29.9 • Science 23%

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.

Updated 10 months ago

@superagent-ai/poker-eval • Rank 4.0 • Science 44%

A comprehensive tool for assessing AI agents' performance in simulated poker environments

Updated 10 months ago

verif • Rank 11.7 • Science 36%

Graphical tool for creating verification plots of weather forecasts

Updated 10 months ago

gval • Rank 2.8 • Science 44%

A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.

Updated 10 months ago

https://github.com/brucewlee/h-test • Rank 0.7 • Science 33%

[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language

Updated 10 months ago

https://github.com/amazon-science/auto-rag-eval • Rank 5.1 • Science 23%

Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"

Updated 10 months ago

https://github.com/dru-mara/evalne-gui • Rank 3.6 • Science 23%

EvalNE-GUI: The Graphical User Interface for EvalNE

Updated 10 months ago

stipa • Science 44%

MATLAB implementation of the Speech Transmission Index for Public Address (STIPA) method for evaluating speech transmission quality.

Updated 10 months ago

https://github.com/brainlesion/panoptica • Science 36%

panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps

Updated 10 months ago

promptfoo • Science 26%

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

Updated 10 months ago

https://github.com/rentruewang/bocoel • Science 26%

Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.

Updated 10 months ago

PRONE • Science 26%

R Package for preprocessing, normalizing, and analyzing proteomics data

Updated 10 months ago

https://github.com/amazon-science/memerag • Science 36%

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

Updated 10 months ago

https://github.com/aida-ugent/nrl4lp • Science 23%

Instructions for replicating the experiments in the paper "Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?" (DSAA2020)

Updated 10 months ago

llm-jp-eval • Science 26%

Modified llm-jp-eval with API and HF scripts for LFMs.

Updated 10 months ago

tno.sdg.tabular.eval.utility-metrics • Science 44%

TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics

Updated 10 months ago

atrium-page-classification • Science 44%

Classification of historical page images using ViT - for ATRIUM project

Updated 10 months ago

equitystack • Science 49%

A structured repository of Python scripts and Jupyter notebooks for development sector data workflows — including public health, gender equity, women's economic empowerment (WEE), education, and MEL (Monitoring, Evaluation, and Learning). Includes plug-and-play templates, sample data, test coverage, and Colab-ready execution.

Updated 10 months ago

https://github.com/bytedance/pxmeter • Science 49%

Structural Quality Assessment for Biomolecular Structure Prediction Models

Updated 10 months ago

milu • Science 41%

MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.

Updated 10 months ago

https://github.com/amenra/guardbench • Science 36%

A Python library for evaluating guardrail models.

Updated 10 months ago

rag-evaluation-harnesses • Science 54%

An evaluation suite for Retrieval-Augmented Generation (RAG).

Updated 10 months ago

dashboard-prototype • Science 65%

Prototype data dashboard for Imageomics Data

Updated 10 months ago

https://github.com/cyberagentailab/lctg-bench • Science 13%

LCTG Bench: LLM Controlled Text Generation Benchmark

Updated 10 months ago

abecto • Science 75%

An ABox Evaluation and Comparison Tool for Ontologies.

Updated 10 months ago

fieldstack • Science 67%

Reusable R notebooks, scripts, and tools for applied data work and evaluation — built for use in the field across health, gender, climate, and education programs.

Updated 10 months ago

evaluation-paper • Science 44%

Supporting material and website for the paper "Anomaly Detection in Time Series: A Comprehensive Evaluation"

Updated 10 months ago

securityeval • Science 57%

Repository for "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques" published in MSR4P&S'22.

Updated 10 months ago

fidelity-measure-for-dts • Science 31%

Algorithm to measure the level of fidelity of a Digital Twin System by aligning behavioral traces.

Updated 10 months ago

llm-reliability • Science 49%

Code for the paper "Larger and more instructable language models become less reliable"