Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 5 of 168 committers (3.0%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (11.1%) to scientific vocabulary
Keywords
Keywords from Contributors
Repository
The LLM Evaluation Framework
Basic Info
- Host: GitHub
- Owner: confident-ai
- License: apache-2.0
- Language: Python
- Default Branch: main
- Homepage: https://deepeval.com
- Size: 94.5 MB
Statistics
- Stars: 10,479
- Watchers: 43
- Forks: 902
- Open Issues: 202
- Releases: 52
Topics
Metadata Files
README.md
The LLM Evaluation Framework
Documentation | Metrics and Features | Getting Started | Integrations | DeepEval Platform
Deutsch | Español | français | 日本語 | 한국어 | Português | Русский | 中文
DeepEval is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large-language-model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as G-Eval, hallucination, answer relevancy, and RAGAS, which use LLMs and various other NLP models that run locally on your machine for evaluation.
Whether your LLM applications are RAG pipelines, chatbots, or AI agents, implemented via LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipelines and agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own DeepSeek R1 with confidence.
> [!IMPORTANT]
> Need a place for your DeepEval testing data to live 🏡❤️? Sign up to the DeepEval platform to compare iterations of your LLM app, generate & share testing reports, and more.
Want to talk LLM evaluation, need help picking metrics, or just want to say hi? Come join our Discord.
🔥 Metrics and Features
🥳 You can now share DeepEval's test results on the cloud directly on Confident AI's infrastructure
- Supports both end-to-end and component-level LLM evaluation.
- Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by ANY LLM of your choice, statistical methods, or NLP models that run locally on your machine:
- G-Eval
- DAG (directed acyclic graph)
- RAG metrics:
- Answer Relevancy
- Faithfulness
- Contextual Recall
- Contextual Precision
- Contextual Relevancy
- RAGAS
- Agentic metrics:
- Task Completion
- Tool Correctness
- Others:
- Hallucination
- Summarization
- Bias
- Toxicity
- Conversational metrics:
- Knowledge Retention
- Conversation Completeness
- Conversation Relevancy
- Role Adherence
- etc.
- Build your own custom metrics that are automatically integrated with DeepEval's ecosystem (see the sketch after this list).
- Generate synthetic datasets for evaluation.
- Integrates seamlessly with ANY CI/CD environment.
- Red team your LLM application for 40+ safety vulnerabilities in a few lines of code, including:
- Toxicity
- Bias
- SQL Injection
- etc., using 10+ advanced attack enhancement strategies such as prompt injection.
- Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including:
- MMLU
- HellaSwag
- DROP
- BIG-Bench Hard
- TruthfulQA
- HumanEval
- GSM8K
- 100% integrated with Confident AI for the full evaluation lifecycle:
- Curate/annotate evaluation datasets on the cloud
- Benchmark your LLM app using the dataset, and compare with previous iterations to find which models/prompts work best
- Fine-tune metrics for custom results
- Debug evaluation results via LLM traces
- Monitor & evaluate LLM responses in production to improve datasets with real-world data
- Repeat until perfection
> [!NOTE]
> Confident AI is the DeepEval platform. Create an account here.
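As referenced in the feature list above, here is a minimal sketch of a custom metric. It assumes DeepEval's documented `BaseMetric` interface; the `KeywordCoverageMetric` class and its keyword-counting logic are hypothetical illustrations, not part of DeepEval.

```python
# A minimal sketch of a custom metric, assuming DeepEval's documented BaseMetric
# interface. KeywordCoverageMetric and its scoring logic are hypothetical.
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class KeywordCoverageMetric(BaseMetric):
    def __init__(self, keywords: list[str], threshold: float = 0.5):
        self.keywords = keywords
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Score = fraction of required keywords present in the actual output.
        output = (test_case.actual_output or "").lower()
        hits = sum(1 for kw in self.keywords if kw.lower() in output)
        self.score = hits / len(self.keywords) if self.keywords else 0.0
        self.success = self.score >= self.threshold
        self.reason = f"{hits}/{len(self.keywords)} required keywords found."
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # The async path simply reuses the synchronous implementation here.
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Keyword Coverage"
```

Once defined, such a metric can be passed to `assert_test` or `evaluate` alongside any built-in metric.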
🔌 Integrations
- 🦄 LlamaIndex, to unit test RAG applications in CI/CD
- 🤗 Hugging Face, to enable real-time evaluations during LLM fine-tuning
🚀 QuickStart
Let's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help you test what you've built.
Installation
```bash
pip install -U deepeval
```
Environment variables (.env / .env.local)
DeepEval auto-loads .env.local then .env from the current working directory at import time.
Precedence: process env -> .env.local -> .env.
Opt out with DEEPEVAL_DISABLE_DOTENV=1.
```bash
cp .env.example .env.local
# then edit .env.local (ignored by git)
```
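For illustration, here is what a minimal `.env.local` might look like; the values are placeholders, and `OPENAI_API_KEY` is the only variable this quickstart actually requires:

```bash
# Example .env.local (placeholder values)
OPENAI_API_KEY="sk-..."
# Uncomment to stop DeepEval from auto-loading .env files at import time:
# DEEPEVAL_DISABLE_DOTENV=1
```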
Create an account (highly recommended)
Using the DeepEval platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.
To login, run:
```bash
deepeval login
```
Follow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. All test cases will automatically be logged (find more information on data privacy here).
Writing your first test case
Create a test file:
```bash
touch test_chatbot.py
```
Open test_chatbot.py and write your first test case to run an end-to-end evaluation using DeepEval, which treats your LLM app as a black box:
```python
import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    assert_test(test_case, [correctness_metric])
```
Set your OPENAI_API_KEY as an environment variable (you can also evaluate using your own custom model, for more details visit this part of our docs):
```bash
export OPENAI_API_KEY="..."
```
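As a side note on the custom model option mentioned above, DeepEval's docs describe a `DeepEvalBaseLLM` base class you can subclass to plug in your own model. The sketch below is a hypothetical wrapper: the import path may vary by version, and the `generate_fn` callable is an assumption standing in for your own inference code.

```python
# A hedged sketch of a custom evaluation model, assuming DeepEval's documented
# DeepEvalBaseLLM interface. MyLocalModel and generate_fn are hypothetical.
from deepeval.models import DeepEvalBaseLLM

class MyLocalModel(DeepEvalBaseLLM):
    def __init__(self, generate_fn):
        # generate_fn: any callable that maps a prompt string to a completion string.
        self.generate_fn = generate_fn

    def load_model(self):
        return self.generate_fn

    def generate(self, prompt: str) -> str:
        return self.generate_fn(prompt)

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return "My Local Model"

# Usage (hypothetical): pass an instance via the metric's model parameter,
# e.g. GEval(..., model=MyLocalModel(my_generate_fn))
```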
And finally, run test_chatbot.py in the CLI:
```bash
deepeval test run test_chatbot.py
```
Congratulations! Your test case should have passed ✅ Let's break down what happened.
- The variable `input` mimics a user input, and `actual_output` is a placeholder for what your application is supposed to output based on this input.
- The variable `expected_output` represents the ideal answer for a given `input`, and `GEval` is a research-backed metric provided by `deepeval` for you to evaluate your LLM outputs on any custom criteria with human-like accuracy.
- In this example, the metric `criteria` is correctness of the `actual_output` based on the provided `expected_output`.
- All metric scores range from 0 to 1; the `threshold=0.5` threshold ultimately determines whether your test has passed or not.
Read our documentation for more information on additional options for running end-to-end evaluations, how to use additional metrics, how to create your own custom metrics, and tutorials on integrating with other tools like LangChain and LlamaIndex.
Evaluating Nested Components
If you wish to evaluate individual components within your LLM app, you need to run component-level evals - a powerful way to evaluate any component within an LLM system.
Simply trace "components" such as LLM calls, retrievers, tool calls, and agents within your LLM application using the @observe decorator to apply metrics at the component level. Tracing with deepeval is non-intrusive (learn more here) and helps you avoid rewriting your codebase just for evals:
```python
from deepeval.tracing import observe, update_current_span
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.dataset import Golden
from deepeval.metrics import GEval
from deepeval import evaluate

correctness = GEval(
    name="Correctness",
    criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT]
)

@observe(metrics=[correctness])
def inner_component():
    # Component can be anything from an LLM call, retrieval, agent, tool use, etc.
    update_current_span(test_case=LLMTestCase(input="...", actual_output="..."))
    return

@observe
def llm_app(input: str):
    inner_component()
    return

evaluate(observed_callback=llm_app, goldens=[Golden(input="Hi!")])
```
You can learn everything about component-level evaluations here.
Evaluating Without Pytest Integration
Alternatively, you can evaluate without Pytest, which is more suited for a notebook environment.
```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)
evaluate([test_case], [answer_relevancy_metric])
```
Using Standalone Metrics
DeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:
```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    # Replace this with the actual output from your LLM application
    actual_output="We offer a 30-day full refund at no extra costs.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
)

answer_relevancy_metric.measure(test_case)
print(answer_relevancy_metric.score)
# All metrics also offer an explanation
print(answer_relevancy_metric.reason)
```
Note that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.
Evaluating a Dataset / Test Cases in Bulk
In DeepEval, a dataset is simply a collection of test cases. Here is how you can evaluate these in bulk:
```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

dataset = EvaluationDataset(goldens=[Golden(input="What's the weather like today?")])

for golden in dataset.goldens:
    test_case = LLMTestCase(
        input=golden.input,
        actual_output=your_llm_app(golden.input)
    )
    dataset.add_test_case(test_case)

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [answer_relevancy_metric])
```
```bash
# Run this in the CLI, you can also add an optional -n flag to run tests in parallel
deepeval test run test_
```
Alternatively, although we recommend using deepeval test run, you can evaluate a dataset/test cases without using our Pytest integration:
```python
from deepeval import evaluate
...

evaluate(dataset, [answer_relevancy_metric])
# or
dataset.evaluate([answer_relevancy_metric])
```
LLM Evaluation With Confident AI
The correct LLM evaluation lifecycle is only achievable with the DeepEval platform. It allows you to:
- Curate/annotate evaluation datasets on the cloud
- Benchmark your LLM app using the dataset, and compare with previous iterations to find which models/prompts work best
- Fine-tune metrics for custom results
- Debug evaluation results via LLM traces
- Monitor & evaluate LLM responses in production to improve datasets with real-world data
- Repeat until perfection
Everything on Confident AI, including how to use Confident AI, is available here.
To begin, login from the CLI:
```bash
deepeval login
```
Follow the instructions to log in, create your account, and paste your API key into the CLI.
Now, run your test file again:
```bash
deepeval test run test_chatbot.py
```
You should see a link displayed in the CLI once the test has finished running. Paste it into your browser to view the results!

Configuration
Environment variables via .env files
Using .env.local or .env is optional. If they are missing, DeepEval uses your existing environment variables. When present, dotenv environment variables are auto-loaded at import time (unless you set DEEPEVAL_DISABLE_DOTENV=1).
Precedence: process env -> .env.local -> .env
```bash
cp .env.example .env.local
# then edit .env.local (ignored by git)
```
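For a one-off run that should ignore any local dotenv files, the opt-out flag mentioned above can be set inline with the test command used earlier:

```bash
# Skip .env/.env.local loading for this invocation only
DEEPEVAL_DISABLE_DOTENV=1 deepeval test run test_chatbot.py
```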
Contributing
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
Roadmap
Features:
- [x] Integration with Confident AI
- [x] Implement G-Eval
- [x] Implement RAG metrics
- [x] Implement Conversational metrics
- [x] Evaluation Dataset Creation
- [x] Red-Teaming
- [ ] DAG custom metrics
- [ ] Guardrails
Authors
Built by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.
License
DeepEval is licensed under Apache 2.0 - see the LICENSE.md file for details.
Owner
- Name: Confident AI
- Login: confident-ai
- Kind: organization
- Website: www.confident-ai.com
- Repositories: 1
- Profile: https://github.com/confident-ai
Citation (CITATION.cff)
cff-version: 1.2.0
message: If you use this software, please cite it as below.
authors:
- family-names: Ip
given-names: Jeffrey
- family-names: Vongthongsri
given-names: Kritin
title: deepeval
version: 3.4.7
date-released: "2025-08-25"
url: https://confident-ai.com
repository-code: https://github.com/confident-ai/deepeval
license: Apache-2.0
type: software
description: The Open-Source LLM Evaluation Framework
Committers
Last synced: 9 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Jeffrey Ip | j****p@c****m | 2,071 |
| Jacky Wong | c****g@g****m | 858 |
| Kritin Vongthongsri | 7****v | 441 |
| Anindyadeep | p****p@g****m | 47 |
| Anindyadeep | a****a@p****n | 35 |
| Vasilije | 8****0 | 26 |
| Pratyush-exe | p****5@g****m | 23 |
| Aman Gokrani | a****i@g****m | 17 |
| Mayank Solanki | b****u@g****m | 16 |
| fetz236 | 5****6 | 14 |
| Peilun Li | p****l@z****m | 13 |
| Jack Luar | 3****s | 11 |
| Vytenis Šliogeris | v****s@n****m | 9 |
| Frederico Schuh | f****h@g****m | 9 |
| johnlemmon | j****n@s****m | 8 |
| Andrea Romano | 1****o | 7 |
| Jan F. | 5****4 | 7 |
| Jon Bennion | 1****b | 7 |
| Serghei Iakovlev | g****t@s****l | 7 |
| Christian Bernhard | 4****d | 6 |
| Simon Podhajsky | s****y@g****m | 5 |
| Rami Pellumbi | r****i@g****m | 5 |
| Vamshi Adimalla | v****a@V****l | 5 |
| Yleisnero | f****5@g****m | 4 |
| Umut Hope YILDIRIM | u****5@g****m | 4 |
| Philip Nuzhnyi | p****y@g****m | 4 |
| Paul Lewis | p****1@g****m | 4 |
| Nicolas Torres | n****i@g****m | 4 |
| Bjarni | b****1@g****m | 4 |
| Andrés Pérez Manríquez | a****m@g****m | 4 |
| and 138 more... | ||
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 403
- Total pull requests: 1,669
- Average time to close issues: 23 days
- Average time to close pull requests: 4 days
- Total issue authors: 290
- Total pull request authors: 222
- Average comments per issue: 1.03
- Average comments per pull request: 1.29
- Merged pull requests: 1,337
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 236
- Pull requests: 1,132
- Average time to close issues: 6 days
- Average time to close pull requests: 2 days
- Issue authors: 195
- Pull request authors: 166
- Average comments per issue: 1.08
- Average comments per pull request: 1.33
- Merged pull requests: 871
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- ColabDog (24)
- penguine-ip (18)
- prescod (10)
- AndresPrez (5)
- piseabhijeet (5)
- CAW-nz (4)
- tjasmin111 (4)
- Sara-Hossny (4)
- luarss (4)
- behnamsattar (3)
- ymzayek (3)
- jmaczan (3)
- spike-spiegel-21 (3)
- jtquach1 (3)
- enrico-stauss (3)
Pull Request Authors
- penguine-ip (546)
- kritinv (421)
- spike-spiegel-21 (78)
- A-Vamshi (64)
- ColabDog (62)
- john-lemmon-lime (16)
- luarss (16)
- ChristianBernhard (12)
- ramipellumbi (12)
- kira-offgrid (10)
- sergeyklay (8)
- obadakhalili (6)
- sid-murali (6)
- realei (6)
- adityabharadwaj198 (5)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 2
- Total downloads: pypi 985,847 last-month
- Total dependent packages: 7 (may contain duplicates)
- Total dependent repositories: 1 (may contain duplicates)
- Total versions: 653
- Total maintainers: 1
proxy.golang.org: github.com/confident-ai/deepeval
- Documentation: https://pkg.go.dev/github.com/confident-ai/deepeval#section-documentation
- License: apache-2.0
Rankings
pypi.org: deepeval
The LLM Evaluation Framework
- Homepage: https://github.com/confident-ai/deepeval
- Documentation: https://deepeval.com
- License: Apache-2.0
- Latest release: 3.4.4 (published 6 months ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- psf/black stable composite
- actions/checkout v2 composite
- actions/setup-python v4 composite
- @docusaurus/module-type-aliases 2.4.1 development
- @docusaurus/core 2.4.1
- @docusaurus/preset-classic 2.4.1
- @mdx-js/react ^1.6.22
- clsx ^1.2.1
- prism-react-renderer ^1.3.5
- react ^17.0.2
- react-dom ^17.0.2
- 1024 dependencies
- pytest *
- requests *
- sentence-transformers *
- tabulate *
- tqdm *
- transformers *
- pytest *
- requests *
- sentence-transformers *
- tabulate *
- tqdm *
- transformers *
- actions/cache v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite