Projects

Scientific Software

Updated 11 months ago

LangFair — Peer-reviewed • Rank 14.9 • Science 95%

LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases - Published in JOSS (2025)

ai ai-safety artificial-intelligence bias bias-detection ethical-ai fairness fairness-ai fairness-ml fairness-testing large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics python responsible-ai

Engineering Mathematics (42%)

Scientific Software · Peer-reviewed

Updated 11 months ago

deepeval • Rank 28.2 • Science 54%

The LLM Evaluation Framework

evaluation-framework evaluation-metrics llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated 11 months ago

propertyeval • Rank 0.7 • Science 57%

PropertyEval: Synthesizing Thorough Test Cases for LLM Code Generation Benchmarks using Property-Based Testing

llm-evaluation property-based-testing python

Updated 11 months ago

nutcracker • Rank 1.9 • Science 54%

Large Model Evaluation Experiments

large-language-models llm llm-evaluation llmops

Updated 20 days ago

mlflow • Rank 35.5 • Science 10%

The open source developer platform to build AI/LLM applications and models with confidence. Enhance your AI applications with end-to-end tracking, observability, and evaluations, all in one integrated platform.

agentops agents ai ai-governance apache-spark evaluation langchain llm-evaluation llmops machine-learning ml mlflow mlops model-management observability open-source openai prompt-engineering

Engineering Earth and Environmental Sciences (40%)

Updated 10 months ago

https://github.com/amazon-science/llm-code-preference • Rank 5.1 • Science 36%

Training and Benchmarking LLMs for Code Preference.

code-generation llm-evaluation llm-training llms-benchmarking

Updated 10 months ago

promptfoo • Science 26%

Test your prompts, agents, and RAGs. AI Red teaming, pentesting, and vulnerability scanning for LLMs. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

ci ci-cd cicd evaluation evaluation-framework llm llm-eval llm-evaluation llm-evaluation-framework llmops pentesting prompt-engineering prompt-testing prompts rag red-teaming testing vulnerability-scanners

Updated 10 months ago

https://github.com/ai4bharat/anudesh-frontend • Science 26%

data-annotation llm llm-evaluation

Updated 10 months ago

https://github.com/ai4bharat/anudesh • Science 13%

An open source platform to annotate data for large language models at scale

data-annotation llm llm-evaluation

Updated 10 months ago

https://github.com/cvs-health/uqlm • Science 36%

UQLM (Uncertainty Quantification for Language Models) is a Python package for UQ-based LLM hallucination detection

ai-evaluation ai-safety confidence-estimation confidence-score hallucination hallucination-detection hallucination-evaluation hallucination-mitigation llm llm-evaluation llm-hallucination llm-safety uncertainty-estimation uncertainty-quantification

Updated 11 months ago

tgcsm-circuit • Science 44%

The original containment framework for recursion-stable cognition, collapse-resistant logic, and LLM self-reflection.

ai ai-safety circuit-framework collapse-resistant-ai godel halting-problem llm llm-evaluation rail-detection recursive-containment

Updated 10 months ago

milu • Science 41%

MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of LLMs across 11 Indic languages.

ai4bharat evaluation indic-languages llm-evaluation

Updated 10 months ago

https://github.com/alan-turing-institute/prompto • Science 26%

An open source library for asynchronous querying of LLM endpoints

deep-learning hut23 large-language-models llm-eval llm-evaluation llms machine-learning natural-language-processing nlp python transformer transformers

Updated 10 months ago

https://github.com/cedrickchee/vibe-jet • Science 26%

A browser-based 3D multiplayer flying game with arcade-style mechanics, created using Gemini 2.5 Pro through a technique called "vibe coding"

evaluation-framework flight-simulator game-development gemini-2-5-pro-exp llm-evaluation vibe-check vibe-coding

Updated 10 months ago

https://github.com/amazon-science/idioms-incontext-mt • Science 23%

Idioms in context dataset

idiomatic-expressions llm-evaluation machine-translation

