Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 5 of 168 committers (3.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
The LLM Evaluation Framework
Basic Info
- Host: GitHub
- Owner: confident-ai
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://deepeval.com
- Size: 94.5 MB
Statistics
- Stars: 10,479
- Watchers: 43
- Forks: 902
- Open Issues: 202
- Releases: 52
Topics
Metadata Files
README.md
The LLM Evaluation Framework
Documentation | Metrics and Features | Getting Started | Integrations | DeepEval Platform
Deutsch | Español | français | 日本語 | 한국어 | Português | Русский | 中文
DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large-language-model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.
Whether your LLM applications are RAG pipelines, chatbots, or AI agents, implemented via LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipelines and agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own DeepSeek R1 with confidence.
> [!IMPORTANT]
> Need a place for your DeepEval testing data to live 🏡❤️? Sign up to the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.
Want to talk LLM evaluation, need help picking metrics, or just want to say hi? Come join our Discord.
🔥 Metrics and Features
🥳 You can now share DeepEval's test results on the cloud directly on Confident AI's infrastructure
- Supports both end-to-end and component-level LLM evaluation.
- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine:
- G-Eval
- DAG (directed acyclic graph)
- RAG metrics:
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- Contextual Relevancy
- RAGAS
- Agentic metrics:
- Task Completion
- Tool Correctness
- Others:
- Hallucination
- Summarization
- Bias
- Toxicity
- Conversational metrics:
- Knowledge Retention
- Conversation Completeness
- Conversation Relevancy
- Role Adherence
- etc.
- Build your own custom metrics that are automatically integrated with DeepEval's ecosystem (see the sketch after this list).
- Generate synthetic datasets for evaluation.
- Integrates seamlessly with ANY CI/CD environment.
- Red team your LLM application for 40+ safety vulnerabilities in a few lines of code, including:
- Toxicity
- Bias
- SQL Injection
- etc., using 10+ advanced attack enhancement strategies such as prompt injection.
- Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- 100% integrated with Confident AI for the full evaluation lifecycle:
- Curate/annotate evaluation datasets on the cloud
- Benchmark your LLM app using the dataset, and compare with previous iterations to find which models/prompts work best
- Fine-tune metrics for custom results
- Debug evaluation results via LLM traces
- Monitor & evaluate LLM responses in production to improve datasets with real-world data
- Repeat until perfection
> [!NOTE]
> Confident AI is the DeepEval platform. Create an account here.
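As referenced in the feature list above, here is a minimal sketch of a custom metric. It assumes DeepEval's documented `BaseMetric` interface; the `KeywordCoverageMetric` class and its keyword-counting logic are hypothetical illustrations, not part of DeepEval.

```python
# A minimal sketch of a custom metric, assuming DeepEval's documented BaseMetric
# interface. KeywordCoverageMetric and its scoring logic are hypothetical.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class KeywordCoverageMetric(BaseMetric):
    def __init__(self, keywords: list[str], threshold: float = 0.5):
        self.keywords = keywords
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Score = fraction of required keywords present in the actual output.
        output = (test_case.actual_output or "").lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in output)
        self.score = hits / len(self.keywords) if self.keywords else 0.0
        self.success = self.score >= self.threshold
        self.reason = f"{hits}/{len(self.keywords)} required keywords found."
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # The async path simply reuses the synchronous implementation here.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Coverage"
```

Once defined, such a metric can be passed to `assert_test` or `evaluate` alongside any built-in metric.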
🔌 Integrations
- 🦄 LlamaIndex, to unit test RAG applications in CI/CD
- 🤗 Hugging Face, to enable real-time evaluations during LLM fine-tuning
🚀 QuickStart
Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help you test what you've built.
Installation
```bash
pip install -U deepeval
```
Environment variables (.env / .env.local)
DeepEval auto-loads .env.local then .env from the current working directory at import time.
Precedence: process env -> .env.local -> .env.
Opt out with DEEPEVAL_DISABLE_DOTENV=1.
```bash
cp .env.example .env.local
# then edit .env.local (ignored by git)
```
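For illustration, here is what a minimal `.env.local` might look like; the values are placeholders, and `OPENAI_API_KEY` is the only variable this quickstart actually requires:

```bash
# Example .env.local (placeholder values)
OPENAI_API_KEY="sk-..."
# Uncomment to stop DeepEval from auto-loading .env files at import time:
# DEEPEVAL_DISABLE_DOTENV=1
```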
Create an account (highly recommended)
Using the DeepEval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.
To login, run:
```bash
deepeval login
```
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).
Writing your first test case
Create a test file:
```bash
touch test_chatbot.py
```
Open test_chatbot.py and write your first test case to run an end-to-end evaluation using DeepEval, which treats your LLM app as a black box:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])
```
Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model, for more details visit this part of our docs):
```bash
export OPENAI_API_KEY="..."
```
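As a side note on the custom model option mentioned above, DeepEval's docs describe a `DeepEvalBaseLLM` base class you can subclass to plug in your own model. The sketch below is a hypothetical wrapper: the import path may vary by version, and the `generate_fn` callable is an assumption standing in for your own inference code.

```python
# A hedged sketch of a custom evaluation model, assuming DeepEval's documented
# DeepEvalBaseLLM interface. MyLocalModel and generate_fn are hypothetical.
from deepeval.models import DeepEvalBaseLLM

class MyLocalModel(DeepEvalBaseLLM):
    def __init__(self, generate_fn):
        # generate_fn: any callable that maps a prompt string to a completion string.
        self.generate_fn = generate_fn

    def load_model(self):
        return self.generate_fn

    def generate(self, prompt: str) -> str:
        return self.generate_fn(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "My Local Model"

# Usage (hypothetical): pass an instance via the metric's model parameter,
# e.g. GEval(..., model=MyLocalModel(my_generate_fn))
```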
And finally, run test_chatbot.py in the CLI:
```bash
deepeval test run test_chatbot.py
```
Congratulations! Your test case should have passed ✅ Let's break down what happened.
- The variable `input` mimics a user input, and `actual_output` is a placeholder for what your application is supposed to output based on this input.
- The variable `expected_output` represents the ideal answer for a given `input`, and `GEval` is a research-backed metric provided by `deepeval` for you to evaluate your LLM outputs on any custom criteria with human-like accuracy.
- In this example, the metric `criteria` is correctness of the `actual_output` based on the provided `expected_output`.
- All metric scores range from 0 to 1; the `threshold=0.5` threshold ultimately determines whether your test has passed or not.
Read our documentation for more information on additional options for running end-to-end evaluations, how to use additional metrics, how to create your own custom metrics, and tutorials on integrating with other tools like LangChain and LlamaIndex.
Evaluating Nested Components
If you wish to evaluate individual components within your LLM app, you need to run component-level evals - a powerful way to evaluate any component within an LLM system.
Simply trace "components" such as LLM calls, retrievers, tool calls, and agents within your LLM application using the @observe decorator to apply metrics at the component level. Tracing with deepeval is non-intrusive (learn more here) and helps you avoid rewriting your codebase just for evals:
```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import Golden
from deepeval.metrics import GEval
from deepeval import evaluate

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
)

@observe(metrics=[correctness])
def inner_component():
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    update_current_span(test_case=LLMTestCase(input="...", actual_output="..."))
    return

@observe
def llm_app(input: str):
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
```
You can learn everything about component-level evaluations here.
Evaluating Without Pytest Integration
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
```
Using Standalone Metrics
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# All metrics also offer an explanation
print(answer_relevancy_metric.reason)
```
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
Evaluating a Dataset / Test Cases in Bulk
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like today?")])

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
```
```bash
# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_
```
Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:
```python
from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
```
LLM Evaluation With Confident AI
The correct LLM evaluation lifecycle is only achievable with the DeepEval platform. It allows you to:
- Curate/annotate evaluation datasets on the cloud
- Benchmark your LLM app using the dataset, and compare with previous iterations to find which models/prompts work best
- Fine-tune metrics for custom results
- Debug evaluation results via LLM traces
- Monitor & evaluate LLM responses in production to improve datasets with real-world data
- Repeat until perfection
Everything on Confident AI, including how to use Confident AI, is available here.
To begin, login from the CLI:
```bash
deepeval login
```
Follow the instructions to log in, create your account, and paste your API key into the CLI.
Now, run your test file again:
```bash
deepeval test run test_chatbot.py
```
You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!

Configuration
Environment variables via .env files
Using .env.local or .env is optional. If they are missing, DeepEval uses your existing environment variables. When present, dotenv environment variables are auto-loaded at import time (unless you set DEEPEVAL_DISABLE_DOTENV=1).
Precedence: process env -> .env.local -> .env
```bash
cp .env.example .env.local
# then edit .env.local (ignored by git)
```
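For a one-off run that should ignore any local dotenv files, the opt-out flag mentioned above can be set inline with the test command used earlier:

```bash
# Skip .env/.env.local loading for this invocation only
DEEPEVAL_DISABLE_DOTENV=1 deepeval test run test_chatbot.py
```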
Contributing
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
Roadmap
Features:
- [x] Integration with Confident AI
- [x] Implement G-Eval
- [x] Implement RAG metrics
- [x] Implement Conversational metrics
- [x] Evaluation Dataset Creation
- [x] Red-Teaming
- [ ] DAG custom metrics
- [ ] Guardrails
Authors
Built by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.
License
DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.
Owner
- Name: Confident AI
- Login: confident-ai
- Kind: organization
- Website: www.confident-ai.com
- Repositories: 1
- Profile: https://github.com/confident-ai
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
- family-names: Ip
given-names: Jeffrey
- family-names: Vongthongsri
given-names: Kritin
title: deepeval
version: 3.4.7
date-released: "2025-08-25"
url: https://confident-ai.com
repository-code: https://github.com/confident-ai/deepeval
license: Apache-2.0
type: software
description: The Open-Source LLM Evaluation Framework
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jeffrey Ip | j****p@c****m | 2,071 |
| Jacky Wong | c****g@g****m | 858 |
| Kritin Vongthongsri | 7****v | 441 |
| Anindyadeep | p****p@g****m | 47 |
| Anindyadeep | a****a@p****n | 35 |
| Vasilije | 8****0 | 26 |
| Pratyush-exe | p****5@g****m | 23 |
| Aman Gokrani | a****i@g****m | 17 |
| Mayank Solanki | b****u@g****m | 16 |
| fetz236 | 5****6 | 14 |
| Peilun Li | p****l@z****m | 13 |
| Jack Luar | 3****s | 11 |
| Vytenis Šliogeris | v****s@n****m | 9 |
| Frederico Schuh | f****h@g****m | 9 |
| johnlemmon | j****n@s****m | 8 |
| Andrea Romano | 1****o | 7 |
| Jan F. | 5****4 | 7 |
| Jon Bennion | 1****b | 7 |
| Serghei Iakovlev | g****t@s****l | 7 |
| Christian Bernhard | 4****d | 6 |
| Simon Podhajsky | s****y@g****m | 5 |
| Rami Pellumbi | r****i@g****m | 5 |
| Vamshi Adimalla | v****a@V****l | 5 |
| Yleisnero | f****5@g****m | 4 |
| Umut Hope YILDIRIM | u****5@g****m | 4 |
| Philip Nuzhnyi | p****y@g****m | 4 |
| Paul Lewis | p****1@g****m | 4 |
| Nicolas Torres | n****i@g****m | 4 |
| Bjarni | b****1@g****m | 4 |
| Andrés Pérez Manríquez | a****m@g****m | 4 |
| and 138 more... | ||
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 403
- Total pull requests: 1,669
- Average time to close issues: 23 days
- Average time to close pull requests: 4 days
- Total issue authors: 290
- Total pull request authors: 222
- Average comments per issue: 1.03
- Average comments per pull request: 1.29
- Merged pull requests: 1,337
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 236
- Pull requests: 1,132
- Average time to close issues: 6 days
- Average time to close pull requests: 2 days
- Issue authors: 195
- Pull request authors: 166
- Average comments per issue: 1.08
- Average comments per pull request: 1.33
- Merged pull requests: 871
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ColabDog (24)
- penguine-ip (18)
- prescod (10)
- AndresPrez (5)
- piseabhijeet (5)
- CAW-nz (4)
- tjasmin111 (4)
- Sara-Hossny (4)
- luarss (4)
- behnamsattar (3)
- ymzayek (3)
- jmaczan (3)
- spike-spiegel-21 (3)
- jtquach1 (3)
- enrico-stauss (3)
Pull Request Authors
- penguine-ip (546)
- kritinv (421)
- spike-spiegel-21 (78)
- A-Vamshi (64)
- ColabDog (62)
- john-lemmon-lime (16)
- luarss (16)
- ChristianBernhard (12)
- ramipellumbi (12)
- kira-offgrid (10)
- sergeyklay (8)
- obadakhalili (6)
- sid-murali (6)
- realei (6)
- adityabharadwaj198 (5)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 985,847 last-month
- Total dependent packages: 7 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 653
- Total maintainers: 1
proxy.golang.org: github.com/confident-ai/deepeval
- Documentation: https://pkg.go.dev/github.com/confident-ai/deepeval#section-documentation
- License: apache-2.0
Rankings
pypi.org: deepeval
The LLM Evaluation Framework
- Homepage: https://github.com/confident-ai/deepeval
- Documentation: https://deepeval.com
- License: Apache-2.0
- Latest release: 3.4.4 (published 6 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- psf/black stable composite
- actions/checkout v2 composite
- actions/setup-python v4 composite
- @docusaurus/module-type-aliases 2.4.1 development
- @docusaurus/core 2.4.1
- @docusaurus/preset-classic 2.4.1
- @mdx-js/react ^1.6.22
- clsx ^1.2.1
- prism-react-renderer ^1.3.5
- react ^17.0.2
- react-dom ^17.0.2
- 1024 dependencies
- pytest *
- requests *
- sentence-transformers *
- tabulate *
- tqdm *
- transformers *
- pytest *
- requests *
- sentence-transformers *
- tabulate *
- tqdm *
- transformers *
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite