PyCM
PyCM: Multiclass confusion matrix library in Python - Published in JOSS (2018)
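A minimal usage sketch, assuming the ConfusionMatrix API shown in the PyCM README; the label vectors are illustrative:

```python
# Build a multiclass confusion matrix from actual and predicted labels
from pycm import ConfusionMatrix

cm = ConfusionMatrix(actual_vector=[2, 0, 2, 2, 0, 1],
                     predict_vector=[0, 0, 2, 2, 0, 2])
print(cm.classes)      # discovered class labels, e.g. [0, 1, 2]
print(cm.Overall_ACC)  # overall accuracy
print(cm)              # matrix plus per-class and overall statistics
```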
Jury
Jury: A Comprehensive Evaluation Toolkit - Published in JOSS (2024)
eva3dm
eva3dm: An R package for model evaluation of 3D weather and air quality models - Published in JOSS (2025)
inundation-mapping
Flood inundation mapping and evaluation software configured to work with the U.S. National Water Model.
oumi
Easily fine-tune, evaluate, and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open-source LLM/VLM.
avalanche-lib
Avalanche: an End-to-End Library for Continual Learning based on PyTorch.
tiny_qa_benchmark_pp
Tiny QA Benchmark++: a micro-benchmark suite (52-item gold set plus on-demand multilingual synthetic packs), a generator CLI, and a CI-ready eval harness for ultra-fast LLM smoke testing and regression catching.
ipal_evaluate
Intrusion Detection Evaluation - A framework to evaluate (industrial) intrusion detection systems.
ER-Evaluation
ER-Evaluation: End-to-End Evaluation of Entity Resolution Systems - Published in JOSS (2023)
mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
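A minimal tracking sketch, assuming the core mlflow logging API; the experiment name and values are illustrative:

```python
# Log a parameter and a metric to a local MLflow tracking store
import mlflow

mlflow.set_experiment("eval-demo")          # illustrative experiment name
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("accuracy", 0.93)
```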
eval-suite
[ACL 2024] A user-friendly evaluation framework (Eval Suite) and benchmarks: UHGEval, HaluEval, HalluQA, etc.
https://github.com/amenra/ranx
⚡️A Blazing-Fast Python Library for Ranking Evaluation, Comparison, and Fusion 🐍
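A minimal ranking-evaluation sketch, assuming the Qrels/Run/evaluate API from the ranx README; query and document IDs are illustrative:

```python
# Score a retrieval run against relevance judgments
from ranx import Qrels, Run, evaluate

qrels = Qrels({"q_1": {"doc_12": 1, "doc_25": 1}})                 # ground-truth relevance
run = Run({"q_1": {"doc_12": 0.9, "doc_23": 0.8, "doc_25": 0.7}})  # system scores
print(evaluate(qrels, run, ["ndcg@5", "map@5", "mrr"]))            # dict of metric values
```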
hydrotools
Suite of tools for retrieving USGS NWIS observations and evaluating National Water Model (NWM) data.
multivar_horner
multivar_horner: A Python package for computing Horner factorisations of multivariate polynomials - Published in JOSS (2020)
py-alpaca-eval
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
openqa-eval
ACL 2023: Evaluating Open-Domain Question Answering in the Era of Large Language Models
yeast-in-microstructures-dataset
Official and maintained implementation of the dataset paper "An Instance Segmentation Dataset of Yeast Cells in Microstructures" [EMBC 2023].
xfinder
[ICLR 2025] xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
tyc-dataset
Official and maintained implementation of the dataset paper "The TYC Dataset for Understanding Instance-Level Semantics and Motions of Cells in Microstructures" [ICCVW 2023].
evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
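A minimal sketch, assuming the evaluate.load / compute API from the library's quickstart; the toy predictions are illustrative:

```python
# Load a metric from the Hub and compute it on toy predictions
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 0, 1], predictions=[1, 0, 0, 1])
print(result)  # {'accuracy': 0.5}
```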
https://github.com/time-series-machine-learning/tsml-eval
Evaluation tools for time series machine learning algorithms.
@superagent-ai/poker-eval
A comprehensive tool for assessing the performance of AI agents in simulated poker environments.
gval
A Python framework to evaluate geospatial datasets by comparing candidate and benchmark maps to compute agreement maps and statistics.
lrv-instruction
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
https://github.com/brucewlee/h-test
[ACL 2024] Language Models Don't Learn the Physical Manifestation of Language
https://github.com/amazon-science/auto-rag-eval
Code repo for the ICML 2024 paper "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation"
https://github.com/dru-mara/evalne-gui
EvalNE-GUI: The Graphical User Interface for EvalNE
promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://github.com/amenra/guardbench
A Python library for guardrail models evaluation.
https://github.com/amenra/a-multi-domain-benchmark-for-personalized-search-evaluation
A Multi-domain Benchmark for Personalized Search Evaluation
https://github.com/aida-ugent/nrl4lp
Instructions for replicating the experiments in the paper "Benchmarking Network Embedding Models for Link Prediction: Are We Making Progress?" (DSAA2020)
https://github.com/amazon-science/memerag
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
https://github.com/aanzel/polar-diagrams-dashboard
"A Multi-Technique Strategy for Improving Summary Polar Diagrams" by Aleksandar Anžel, Zewen Yang, and Georges Hattab
fidelity-measure-for-dts
Algorithm to measure the level of fidelity of a Digital Twin System by aligning behavioral traces.
https://github.com/bytedance/pxmeter
Structural Quality Assessment for Biomolecular Structure Prediction Models
evaluation-paper
Supporting material and website for the paper "Anomaly Detection in Time Series: A Comprehensive Evaluation"
atrium-page-classification
Classification of historical page images using ViT, for the ATRIUM project.
https://github.com/brainlesion/panoptica
panoptica -- instance-wise evaluation of 3D semantic and instance segmentation maps
https://github.com/rentruewang/bocoel
Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few lines of modular code.
milu
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.
hinteval
HintEval💡: A Comprehensive Framework for Hint Generation and Evaluation for Questions
fieldstack
Reusable R notebooks, scripts, and tools for applied data work and evaluation — built for use in the field across health, gender, climate, and education programs.
rag-evaluation-harnesses
An evaluation suite for Retrieval-Augmented Generation (RAG).
securityeval
Repository for "SecurityEval Dataset: Mining Vulnerability Examples to Evaluate Machine Learning-Based Code Generation Techniques" published in MSR4P&S'22.
https://github.com/cyberagentailab/lctg-bench
LCTG Bench: LLM Controlled Text Generation Benchmark
llm-reliability
Code for the paper "Larger and more instructable language models become less reliable"
tno.sdg.tabular.eval.utility-metrics
TNO PET Lab - Synthetic Data Generation (SDG) - Tabular - Evaluation - Utility Metrics
stipa
MATLAB implementation of the Speech Transmission Index for Public Address (STIPA) method for evaluating speech transmission quality.
equitystack
A structured repository of Python scripts and Jupyter notebooks for development sector data workflows — including public health, gender equity, women's economic empowerment (WEE), education, and MEL (Monitoring, Evaluation, and Learning). Includes plug-and-play templates, sample data, test coverage, and Colab-ready execution.