LangFair
LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases - Published in JOSS (2025)
mlflow
The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.
propertyeval
PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing
https://github.com/amazon-science/llm-code-preference
Training and Benchmarking LLMs for Code Preference.
milu
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.
promptfoo
Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.
https://github.com/amazon-science/idioms-incontext-mt
Idioms-in-context dataset.
https://github.com/alan-turing-institute/prompto
An open source library for asynchronous querying of LLM endpoints
https://github.com/cvs-health/uqlm
UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection
https://github.com/cedrickchee/vibe-jet
A browser-based 3D multiplayer flying game with arcade-style mechanics, created using Gemini 2.5 Pro through a technique called "vibe coding"
tgcsm-circuit
The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.
https://github.com/ai4bharat/anudesh-frontend
https://github.com/ai4bharat/anudesh
An open source platform to annotate data for large language models at scale